BTRFS RAID5 disk failed while balancing
If you clicked on the link to this topic: Thank you!

I have the following setup:

6x 500GB HDD drives
1x 32GB NVMe SSD (Intel Optane)

I used bcache to set up the SSD as the caching device; all six HDDs are backing devices. After all that was in place, I formatted the six HDDs with btrfs in RAID5. Everything has worked as expected for the last 7 months.

By now I have a spare set of 6x 2TB HDD drives and I want to replace the old 500GB disks one by one. So I started with the first one by deleting it from the btrfs. This worked fine, I had no issues there. After that I cleanly detached the empty disk from bcache, still everything was fine, so I removed it. Here are the command lines for this:

sudo btrfs device delete /dev/bcacheX /media/raid
cat /sys/block/bcacheX/bcache/state
cat /sys/block/bcacheX/bcache/dirty_data
sudo sh -c "echo 1 > /sys/block/bcacheX/bcache/detach"
cat /sys/block/bcacheX/bcache/state

After that I installed one of the 2TB drives, attached it to bcache and added it to the RAID. The next step was to balance the data over to the new drive. Please see the command lines:

sudo make-bcache -B /dev/sdY
sudo sh -c "echo '60a63f7c-2e68-4503-9f25-71b6b00e47b2' > /sys/block/bcacheY/bcache/attach"
sudo sh -c "echo writeback > /sys/block/bcacheY/bcache/cache_mode"
sudo btrfs device add /dev/bcacheY /media/raid
sudo btrfs fi ba start /media/raid/

The balance worked fine until ~164GB had been written to the new drive, which is about 50% of the data to be balanced. Suddenly write errors on the disk appeared. The RAID slowly became unusable (I was running 3 VMs off the RAID while balancing). I think it kept working for some time because the SSD was committing the writes. At some point the balancing stopped and I was only able to kill the VMs. I checked the I/Os on the disks and the SSD spit out a constant 1.2 GB/s read. I think bcache somehow delivered data to btrfs, which got rejected there and requested again, but this is just a guess.

Anyway, I ended up resetting the host, physically disconnecting the broken disk and putting a new one in its place. I also created a bcache backing device on it and issued the following command to replace the faulty disk:

sudo btrfs replace start -r 7 /dev/bcache5 /media/raid

The filesystem needs to be mounted read/write for this command to work. It is now doing its work, but very slowly, about 3.5 MB/s. Unfortunately the syslog reports a lot of these messages:

...
scrub_missing_raid56_worker: 62 callbacks suppressed
BTRFS error (device bcache0): failed to rebuild valid logical 4929143865344 for dev (null)
...
BTRFS error (device bcache0): failed to rebuild valid logical 4932249866240 for dev (null)
scrub_missing_raid56_worker: 1 callbacks suppressed
BTRFS error (device bcache0): failed to rebuild valid logical 4933254250496 for dev (null)

If I try to read a file from the filesystem, the command fails with a simple I/O error and the syslog shows entries similar to this:

BTRFS warning (device bcache0): csum failed root 5 ino 1143 off 7274496 csum 0xf554 expected csum 0x6340b527 mirror 2

So far, so good (or bad). It has taken about 6 hours for 4.3% of the replacement so far. No read or write errors have been reported for the replacement procedure ("btrfs replace status"). I will let it do its thing until finished. Before the first 2TB disk failed, 164 GB of data had been written according to "btrfs filesystem show". If I check the amount of data written to the new drive, the 4.3% represents about 82 GB (according to /proc/diskstats).
I don't know how to interpret this, but anyway. And now, finally, my questions:

If the replace command finishes successfully, what should I do next? A scrub? A balance? Another backup? ;-)
Do you see anything that I have done wrong in this procedure?
Do the warnings and errors reported by btrfs mean that the data is lost? :-(

Here is some additional info (**edited**):

$ sudo btrfs fi sh
Label: none  uuid: 9f765025-5354-47e4-afcc-a601b2a52703
        Total devices 7 FS bytes used 1.56TiB
        devid    0 size 1.82TiB used 164.03GiB path /dev/bcache5
        devid    1 size 465.76GiB used 360.03GiB path /dev/bcache4
        devid    3 size 465.76GiB used 360.00GiB path /dev/bcache3
        devid    4 size 465.76GiB used 359.03GiB path /dev/bcache1
        devid    5 size 465.76GiB used 360.00GiB path /dev/bcache0
        devid    6 size 465.76GiB used 360.03GiB path /dev/bcache2
        *** Some devices missing

$ sudo btrfs dev stats /media/raid/
[/dev/bcache5].write_io_errs    0
[/dev/bcache5].read_io_errs     0
[/dev/bcache5].flush_io_errs    0
[/dev/bcache5].corruption_errs  0
[/dev/bcache5].generation_errs  0
[/dev/bcache4].write_io_errs    0
[/dev/bcache4].read_io_errs     0
[/dev/bcache4].flush_io_errs    0
[/dev/bcache4].corruption_errs  0
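[Editor's note] To the first question: a common sequence after a completed replace is to scrub the filesystem and then re-check the per-device error counters. A minimal hedged sketch, assuming the RAID is still mounted at /media/raid (-B keeps the scrub in the foreground; the resize step only matters because the new disk is larger than the old one, and assumes the replace target inherits the replaced device's id, 7 here):

$ sudo btrfs scrub start -B /media/raid       # verify checksums across all devices and wait for completion
$ sudo btrfs scrub status /media/raid         # summary: bytes scrubbed, corrected and uncorrectable errors
$ sudo btrfs dev stats /media/raid            # per-device error counters; should stay at 0 from here on
$ sudo btrfs dev stats -z /media/raid         # optionally zero the counters once recorded
$ sudo btrfs fi resize 7:max /media/raid      # grow the replaced device to use the full 2TB capacity

A full balance is generally only needed to spread existing chunks across the enlarged device, and a fresh backup before any of this is never wrong.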
[PATCH V8] Add support for BTRFS raid5/6 to GRUB
Hi All,

the aim of this patch set is to provide support for a BTRFS raid5/6 filesystem in GRUB.

The first patch implements the basic support for raid5/6, i.e. it works when all the disks are present. The next 5 patches are preparatory ones. The 7th patch implements the raid5 recovery for btrfs (i.e. handling the disappearance of 1 disk). The 8th patch makes the code for handling the raid6 recovery more generic. The last one implements the raid6 recovery for btrfs (i.e. handling the disappearance of up to two disks).

I tested the code in grub-emu, and it works both with all the disks present and with some disks missing. I checked the crc32 calculated from grub and from linux and they matched. Finally I checked that the support for md raid6 still works properly, and it does (with all drives and with up to 2 drives missing).

Comments are welcome.

Changelog
v1: initial support for btrfs raid5/6. No recovery allowed.
v2: full support for btrfs raid5/6. Recovery allowed.
v3: some minor cleanup suggested by Daniel Kiper; reusing the original raid6 recovery code of grub.
v4: several spelling fixes; better description of the RAID layout in btrfs and of the variables which describe the stripe positioning; split patch #5 in two (#5 and #6).
v5: several spelling fixes; improved code comment in patch #1; small cleanup in the code.
v6: small cleanup; improved the wording in the RAID6 layout description; in the function raid6_recover_read_buffer() avoid an unnecessary memcpy in case of invalid data.
v7:
- patches 2,3,5,6,8 received a Reviewed-by from Daniel and were unchanged from the last time (only minor cleanup in the commit description requested by Daniel)
- patch 7 received some small updates rearranging a for(), and some brackets around if()
- patch 4 received an updated commit message which explains better why NULL is stored in data->devices_attached[]
- patch 9 received a blank line to better separate a code line from a preceding comment. A description of 'parities_pos' was added
- patch 1 received a major update of the variable descriptions in the comment. However I suspect that we need some further review to reach full agreement about this text. NB: the updates relate only to comments
v8:
- patches 2,5,6,8 received a Reviewed-by from Daniel and were unchanged from the last time (only minor cleanup in the commit description requested by Daniel)
- patch 1 received some adjustments to the variable descriptions due to the different terminology between BTRFS and other RAID implementations. Added a description for the "nparities" variable.
- patch 3 removed some unnecessary curly brackets (change requested by Daniel)
- patch 4 received an improved commit description about why and how the function find_device() is changed
- patch 7 received an update which transforms an "i = 0; while (...) { ...; i++; }" loop into a "for (i = 0; ...; i++)" one
- patch 9 received an update to the comment

BR
G.Baroncelli

--
gpg @keyserver.linux.it: Goffredo Baroncelli
Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5
Re: [RFC] Add support for BTRFS raid5/6 to GRUB
On 04/23/2018 01:50 PM, Daniel Kiper wrote:
> On Tue, Apr 17, 2018 at 09:57:40PM +0200, Goffredo Baroncelli wrote:
>> Hi All,
>>
>> Below you can find a patch to add support for accessing files from
>> grub in a RAID5/6 btrfs filesystem. This is a RFC because it is
>> missing the support for recovery (i.e. if some devices are missing).
>> In the next days (weeks ?) I will extend this patch to support also
>> this case.
>>
>> Comments are welcome.
>
> More or less LGTM. Just a nitpick below... I am happy to take full blown
> patch into GRUB if it is ready.

Thanks for the comments; however I have now implemented the recovery as well. It is under testing. Give me a few days and I will resubmit the patches.

>
>> BR
>> G.Baroncelli
>>
>> ---
>>
>> commit 8c80a1b7c913faf50f95c5c76b4666ed17685666
>> Author: Goffredo Baroncelli <kreij...@inwind.it>
>> Date:   Tue Apr 17 21:40:31 2018 +0200
>>
>>     Add initial support for btrfs raid5/6 chunk
>>
>> diff --git a/grub-core/fs/btrfs.c b/grub-core/fs/btrfs.c
>> index be195448d..4c5632acb 100644
>> --- a/grub-core/fs/btrfs.c
>> +++ b/grub-core/fs/btrfs.c
>> @@ -119,6 +119,8 @@ struct grub_btrfs_chunk_item
>>  #define GRUB_BTRFS_CHUNK_TYPE_RAID1        0x10
>>  #define GRUB_BTRFS_CHUNK_TYPE_DUPLICATED   0x20
>>  #define GRUB_BTRFS_CHUNK_TYPE_RAID10       0x40
>> +#define GRUB_BTRFS_CHUNK_TYPE_RAID5        0x80
>> +#define GRUB_BTRFS_CHUNK_TYPE_RAID6        0x100
>>    grub_uint8_t dummy2[0xc];
>>    grub_uint16_t nstripes;
>>    grub_uint16_t nsubstripes;
>> @@ -764,6 +766,39 @@ grub_btrfs_read_logical (struct grub_btrfs_data *data, grub_disk_addr_t addr,
>>        stripe_offset = low + chunk_stripe_length
>>          * high;
>>        csize = chunk_stripe_length - low;
>> +      break;
>> +    }
>> +    case GRUB_BTRFS_CHUNK_TYPE_RAID5:
>> +    case GRUB_BTRFS_CHUNK_TYPE_RAID6:
>> +      {
>> +        grub_uint64_t nparities;
>> +        grub_uint64_t parity_pos;
>> +        grub_uint64_t stripe_nr, high;
>> +        grub_uint64_t low;
>> +
>> +        redundancy = 1;	/* no redundancy for now */
>> +
>> +        if (grub_le_to_cpu64 (chunk->type) & GRUB_BTRFS_CHUNK_TYPE_RAID5)
>> +          {
>> +            grub_dprintf ("btrfs", "RAID5\n");
>> +            nparities = 1;
>> +          }
>> +        else
>> +          {
>> +            grub_dprintf ("btrfs", "RAID6\n");
>> +            nparities = 2;
>> +          }
>> +
>> +        stripe_nr = grub_divmod64 (off, chunk_stripe_length, &low);
>> +
>> +        high = grub_divmod64 (stripe_nr, nstripes - nparities, &stripen);
>> +        grub_divmod64 (high+nstripes-nparities, nstripes, &parity_pos);
>> +        grub_divmod64 (parity_pos+nparities+stripen, nstripes, &stripen);
>
> Missing spaces around "+" and "-".
>
> Daniel

--
gpg @keyserver.linux.it: Goffredo Baroncelli
Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5
Re: [RFC] Add support for BTRFS raid5/6 to GRUB
On Tue, Apr 17, 2018 at 09:57:40PM +0200, Goffredo Baroncelli wrote:
> Hi All,
>
> Below you can find a patch to add support for accessing files from
> grub in a RAID5/6 btrfs filesystem. This is a RFC because it is
> missing the support for recovery (i.e. if some devices are missing).
> In the next days (weeks ?) I will extend this patch to support also
> this case.
>
> Comments are welcome.

More or less LGTM. Just a nitpick below... I am happy to take full blown patch into GRUB if it is ready.

> BR
> G.Baroncelli
>
> ---
>
> commit 8c80a1b7c913faf50f95c5c76b4666ed17685666
> Author: Goffredo Baroncelli <kreij...@inwind.it>
> Date:   Tue Apr 17 21:40:31 2018 +0200
>
>     Add initial support for btrfs raid5/6 chunk
>
> diff --git a/grub-core/fs/btrfs.c b/grub-core/fs/btrfs.c
> index be195448d..4c5632acb 100644
> --- a/grub-core/fs/btrfs.c
> +++ b/grub-core/fs/btrfs.c
> @@ -119,6 +119,8 @@ struct grub_btrfs_chunk_item
>  #define GRUB_BTRFS_CHUNK_TYPE_RAID1        0x10
>  #define GRUB_BTRFS_CHUNK_TYPE_DUPLICATED   0x20
>  #define GRUB_BTRFS_CHUNK_TYPE_RAID10       0x40
> +#define GRUB_BTRFS_CHUNK_TYPE_RAID5        0x80
> +#define GRUB_BTRFS_CHUNK_TYPE_RAID6        0x100
>    grub_uint8_t dummy2[0xc];
>    grub_uint16_t nstripes;
>    grub_uint16_t nsubstripes;
> @@ -764,6 +766,39 @@ grub_btrfs_read_logical (struct grub_btrfs_data *data, grub_disk_addr_t addr,
>        stripe_offset = low + chunk_stripe_length
>          * high;
>        csize = chunk_stripe_length - low;
> +      break;
> +    }
> +    case GRUB_BTRFS_CHUNK_TYPE_RAID5:
> +    case GRUB_BTRFS_CHUNK_TYPE_RAID6:
> +      {
> +        grub_uint64_t nparities;
> +        grub_uint64_t parity_pos;
> +        grub_uint64_t stripe_nr, high;
> +        grub_uint64_t low;
> +
> +        redundancy = 1;	/* no redundancy for now */
> +
> +        if (grub_le_to_cpu64 (chunk->type) & GRUB_BTRFS_CHUNK_TYPE_RAID5)
> +          {
> +            grub_dprintf ("btrfs", "RAID5\n");
> +            nparities = 1;
> +          }
> +        else
> +          {
> +            grub_dprintf ("btrfs", "RAID6\n");
> +            nparities = 2;
> +          }
> +
> +        stripe_nr = grub_divmod64 (off, chunk_stripe_length, &low);
> +
> +        high = grub_divmod64 (stripe_nr, nstripes - nparities, &stripen);
> +        grub_divmod64 (high+nstripes-nparities, nstripes, &parity_pos);
> +        grub_divmod64 (parity_pos+nparities+stripen, nstripes, &stripen);

Missing spaces around "+" and "-".

Daniel
[RFC] Add support for BTRFS raid5/6 to GRUB
Hi All,

Below you can find a patch to add support for accessing files from grub in a RAID5/6 btrfs filesystem. This is a RFC because it is missing the support for recovery (i.e. if some devices are missing). In the next days (weeks ?) I will extend this patch to support also this case.

Comments are welcome.

BR
G.Baroncelli

---

commit 8c80a1b7c913faf50f95c5c76b4666ed17685666
Author: Goffredo Baroncelli <kreij...@inwind.it>
Date:   Tue Apr 17 21:40:31 2018 +0200

    Add initial support for btrfs raid5/6 chunk

diff --git a/grub-core/fs/btrfs.c b/grub-core/fs/btrfs.c
index be195448d..4c5632acb 100644
--- a/grub-core/fs/btrfs.c
+++ b/grub-core/fs/btrfs.c
@@ -119,6 +119,8 @@ struct grub_btrfs_chunk_item
 #define GRUB_BTRFS_CHUNK_TYPE_RAID1        0x10
 #define GRUB_BTRFS_CHUNK_TYPE_DUPLICATED   0x20
 #define GRUB_BTRFS_CHUNK_TYPE_RAID10       0x40
+#define GRUB_BTRFS_CHUNK_TYPE_RAID5        0x80
+#define GRUB_BTRFS_CHUNK_TYPE_RAID6        0x100
   grub_uint8_t dummy2[0xc];
   grub_uint16_t nstripes;
   grub_uint16_t nsubstripes;
@@ -764,6 +766,39 @@ grub_btrfs_read_logical (struct grub_btrfs_data *data, grub_disk_addr_t addr,
       stripe_offset = low + chunk_stripe_length
         * high;
       csize = chunk_stripe_length - low;
+      break;
+    }
+    case GRUB_BTRFS_CHUNK_TYPE_RAID5:
+    case GRUB_BTRFS_CHUNK_TYPE_RAID6:
+      {
+        grub_uint64_t nparities;
+        grub_uint64_t parity_pos;
+        grub_uint64_t stripe_nr, high;
+        grub_uint64_t low;
+
+        redundancy = 1;	/* no redundancy for now */
+
+        if (grub_le_to_cpu64 (chunk->type) & GRUB_BTRFS_CHUNK_TYPE_RAID5)
+          {
+            grub_dprintf ("btrfs", "RAID5\n");
+            nparities = 1;
+          }
+        else
+          {
+            grub_dprintf ("btrfs", "RAID6\n");
+            nparities = 2;
+          }
+
+        stripe_nr = grub_divmod64 (off, chunk_stripe_length, &low);
+
+        high = grub_divmod64 (stripe_nr, nstripes - nparities, &stripen);
+        grub_divmod64 (high+nstripes-nparities, nstripes, &parity_pos);
+        grub_divmod64 (parity_pos+nparities+stripen, nstripes, &stripen);
+
+        stripe_offset = low + chunk_stripe_length * high;
+        csize = chunk_stripe_length - low;
+        break;
+      }
     default:

--
gpg @keyserver.linux.it: Goffredo Baroncelli
Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5
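[Editor's note] For anyone wanting to exercise this code path, a throwaway multi-device RAID5 filesystem can be built on loop devices. A hedged sketch follows; the file names, sizes, loop node numbers and mount point are arbitrary, and the grub-emu invocation is left out since it depends on the build:

$ for i in 1 2 3 4; do truncate -s 1G disk$i.img; done            # four sparse 1 GiB backing files
$ for i in 1 2 3 4; do sudo losetup /dev/loop$i disk$i.img; done  # assumes these loop nodes are free
$ sudo mkfs.btrfs -d raid5 -m raid5 /dev/loop1 /dev/loop2 /dev/loop3 /dev/loop4
$ sudo mount /dev/loop1 /mnt/test                                 # copy test files in, then point grub-emu at the images

Detaching one of the loop devices before reading the images back is a simple way to test the degraded (recovery) path.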
Re: Btrfs Raid5 issue.
On 2017年08月23日 00:37, Robert LeBlanc wrote:
> Thanks for the explanations.
>
> Chris, I don't think 'degraded' did anything to help the mounting, I
> just passed it in to see if it would help (I'm not sure if btrfs is
> "smart" enough to ignore a drive if it would increase the chance of
> mounting the volume even if it is degraded, but one could hope). I
> believe the key was 'nologreplay'.
>
> Here is some info about the corrupted fs:
>
> # btrfs fi show /tmp/root/
> Label: 'kvm-btrfs'  uuid: fef29f0a-dc4c-4cc4-b524-914e6630803c
>         Total devices 3 FS bytes used 3.30TiB
>         devid    1 size 2.73TiB used 2.09TiB path /dev/bcache32
>         devid    2 size 2.73TiB used 2.09TiB path /dev/bcache0
>         devid    3 size 2.73TiB used 2.09TiB path /dev/bcache16
>
> # btrfs fi usage /tmp/root/
> WARNING: RAID56 detected, not implemented
> WARNING: RAID56 detected, not implemented
> WARNING: RAID56 detected, not implemented
> Overall:
>     Device size:          8.18TiB
>     Device allocated:       0.00B
>     Device unallocated:   8.18TiB
>     Device missing:         0.00B
>     Used:                   0.00B
>     Free (estimated):       0.00B  (min: 8.00EiB)
>     Data ratio:              0.00
>     Metadata ratio:          0.00
>     Global reserve:     512.00MiB  (used: 0.00B)
>
> Data,RAID5: Size:4.15TiB, Used:3.28TiB
>    /dev/bcache0    2.08TiB
>    /dev/bcache16   2.08TiB
>    /dev/bcache32   2.08TiB
>
> Metadata,RAID5: Size:22.00GiB, Used:20.69GiB
>    /dev/bcache0   11.00GiB
>    /dev/bcache16  11.00GiB
>    /dev/bcache32  11.00GiB
>
> System,RAID5: Size:64.00MiB, Used:400.00KiB
>    /dev/bcache0   32.00MiB
>    /dev/bcache16  32.00MiB
>    /dev/bcache32  32.00MiB
>
> Unallocated:
>    /dev/bcache0   655.00GiB
>    /dev/bcache16  655.00GiB
>    /dev/bcache32  656.49GiB
>
> So it looks like I set the metadata and system data to RAID5 and not
> RAID1. I guess that it could have been affected by the write hole
> causing the problem I was seeing. Since I get the same space usage
> with RAID1 and RAID5,

Well, RAID1 has larger space usage than 3-disk RAID5. Space efficiency will be 50% for RAID1 while 66% for 3-disk RAID5. So you may lose some available space.

> I think I'm just going to use RAID1. I don't need stripe performance
> or anything like that.

And RAID5/6 won't always improve performance. Especially when the IO blocksize is smaller than the full stripe size (in your case it's 128K). When doing sequential IO with a blocksize smaller than 128K, there will be an obvious performance drop due to the RMW cycle. This is not limited to Btrfs RAID56 but applies to all RAID56.

> It would be nice if btrfs supported hotplug and re-plug a little
> better so that it is more "production" quality, but I just have to be
> patient. I'm familiar with Gluster and contributed code to Ceph, so
> I'm familiar with those types of distributed systems. I really like
> them, but the complexity is quite overkill for my needs at home.
>
> As far as bcache performance: I have two Crucial MX200 250GB drives
> that were md raid1 containing /boot (ext2), swap and then bcache. I
> have 2 WD Reds and a Seagate Barracuda Desktop drive, all 3TB. With
> bcache in writeback, apt-get would be painfully slow. Running iostat,
> the SSDs would be doing a few hundred IOPs and the backing disks
> would be very busy and would be the limiting factor overall. Even
> though apt-get just downloaded the file (it should be on the SSDs
> because of writeback), it still involved the backend disks way too
> much. The amount of dirty data was always less than 10% so there
> should have been plenty of space to free up cache without having to
> flush. I experimented with changing the size of contiguous IO to
> force more to cache, increasing the dirty ratio, etc; nothing seemed
> to provide the performance I was hoping for.
> To be fair, having a pair of SSDs (md raid1) caching three spindles
> (btrfs raid5) may not be an ideal configuration. If I had three SSDs,
> one for each drive, then it may have performed better?? I also have
> ~980 snapshots spread over a year's time, so I don't know how much
> that impacts things. I did use a btrfs utility to help find duplicate
> files/chunks and dedupe them so that updated system binaries between
> upgraded LXC containers would use the same space on disk and be more
> efficient in bcache cache usage.

Well, RAID1 SSDs, offline dedupe, bcache, many snapshots - way more complex than I thought. So I'm uncertain where the bottleneck is.

> After restoring the root and LXC roots snapshots on the SSD (broke
> the md raid1 so I could restore to one of them), I ran apt-get and
> got upwards of 2,400 IOPs with it being sustained around 1,200 IOPs
> (btrfs single on md raid1 degraded). I know that btrfs has some
> performance challenges, but I don't think I was hitting those. It was
> most likely a very unusual set-up of bcache and btrfs raid that
> caused the problem. I have bcache on 10
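[Editor's note] Since the thread settles on RAID1 for metadata (and Qu recommends at least that), here is a hedged sketch of converting an existing filesystem's chunk profiles in place with balance filters, assuming the filesystem is mounted at /mnt (a placeholder) and is healthy enough to balance:

$ sudo btrfs balance start -mconvert=raid1 -sconvert=raid1 -f /mnt   # convert metadata and system chunks to RAID1; -f is required for sconvert
$ sudo btrfs fi usage /mnt                                           # verify the Metadata and System lines now say RAID1

Data chunks can be converted the same way with -dconvert=raid1 if striped RAID5 data is not wanted either.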
Re: Btrfs Raid5 issue.
Thanks for the explanations.

Chris, I don't think 'degraded' did anything to help the mounting, I just passed it in to see if it would help (I'm not sure if btrfs is "smart" enough to ignore a drive if it would increase the chance of mounting the volume even if it is degraded, but one could hope). I believe the key was 'nologreplay'.

Here is some info about the corrupted fs:

# btrfs fi show /tmp/root/
Label: 'kvm-btrfs'  uuid: fef29f0a-dc4c-4cc4-b524-914e6630803c
        Total devices 3 FS bytes used 3.30TiB
        devid    1 size 2.73TiB used 2.09TiB path /dev/bcache32
        devid    2 size 2.73TiB used 2.09TiB path /dev/bcache0
        devid    3 size 2.73TiB used 2.09TiB path /dev/bcache16

# btrfs fi usage /tmp/root/
WARNING: RAID56 detected, not implemented
WARNING: RAID56 detected, not implemented
WARNING: RAID56 detected, not implemented
Overall:
    Device size:          8.18TiB
    Device allocated:       0.00B
    Device unallocated:   8.18TiB
    Device missing:         0.00B
    Used:                   0.00B
    Free (estimated):       0.00B  (min: 8.00EiB)
    Data ratio:              0.00
    Metadata ratio:          0.00
    Global reserve:     512.00MiB  (used: 0.00B)

Data,RAID5: Size:4.15TiB, Used:3.28TiB
   /dev/bcache0    2.08TiB
   /dev/bcache16   2.08TiB
   /dev/bcache32   2.08TiB

Metadata,RAID5: Size:22.00GiB, Used:20.69GiB
   /dev/bcache0   11.00GiB
   /dev/bcache16  11.00GiB
   /dev/bcache32  11.00GiB

System,RAID5: Size:64.00MiB, Used:400.00KiB
   /dev/bcache0   32.00MiB
   /dev/bcache16  32.00MiB
   /dev/bcache32  32.00MiB

Unallocated:
   /dev/bcache0   655.00GiB
   /dev/bcache16  655.00GiB
   /dev/bcache32  656.49GiB

So it looks like I set the metadata and system data to RAID5 and not RAID1. I guess that it could have been affected by the write hole causing the problem I was seeing. Since I get the same space usage with RAID1 and RAID5, I think I'm just going to use RAID1. I don't need stripe performance or anything like that.

It would be nice if btrfs supported hotplug and re-plug a little better so that it is more "production" quality, but I just have to be patient. I'm familiar with Gluster and contributed code to Ceph, so I'm familiar with those types of distributed systems. I really like them, but the complexity is quite overkill for my needs at home.

As far as bcache performance: I have two Crucial MX200 250GB drives that were md raid1 containing /boot (ext2), swap and then bcache. I have 2 WD Reds and a Seagate Barracuda Desktop drive, all 3TB. With bcache in writeback, apt-get would be painfully slow. Running iostat, the SSDs would be doing a few hundred IOPs and the backing disks would be very busy and would be the limiting factor overall. Even though apt-get just downloaded the file (it should be on the SSDs because of writeback), it still involved the backend disks way too much. The amount of dirty data was always less than 10% so there should have been plenty of space to free up cache without having to flush. I experimented with changing the size of contiguous IO to force more to cache, increasing the dirty ratio, etc; nothing seemed to provide the performance I was hoping for.

To be fair, having a pair of SSDs (md raid1) caching three spindles (btrfs raid5) may not be an ideal configuration. If I had three SSDs, one for each drive, then it may have performed better?? I also have ~980 snapshots spread over a year's time, so I don't know how much that impacts things. I did use a btrfs utility to help find duplicate files/chunks and dedupe them so that updated system binaries between upgraded LXC containers would use the same space on disk and be more efficient in bcache cache usage.
After restoring the root and LXC roots snapshots on the SSD (broke the md raid1 so I could restore to one of them), I ran apt-get and got upwards of 2,400 IOPs with it being sustained around 1,200 IOPs (btrfs single on md raid1 degraded). I know that btrfs has some performance challenges, but I don't think I was hitting those. It was most likely a very unusual set-up of bcache and btrfs raid that caused the problem.

I have bcache on a 10 year old desktop box with a single nvme drive that performs a little better, but it is hard to be certain because of its age. It has bcache in write-around (since there is only a single nvme) and btrfs in raid1. I haven't watched that box as closely because it is responsive enough. It also only has 4 GB of RAM so it constantly has to swap (web pages are hogs these days), which is one of the reasons to retrofit that box with nvme rather than an MX200.

If you have any other questions, feel free to ask.

Thanks

Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1
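[Editor's note] The "size of contiguous IO" and "dirty ratio" experiments mentioned above map to bcache's sysfs knobs. A hedged sketch of the usual adjustments follows; bcache0 is a placeholder and whether these help is entirely workload-dependent:

$ echo 0 | sudo tee /sys/block/bcache0/bcache/sequential_cutoff    # 0 = cache everything, even large sequential streams
$ echo 40 | sudo tee /sys/block/bcache0/bcache/writeback_percent   # allow more dirty data before background writeback kicks in
$ cat /sys/block/bcache0/bcache/state                              # sanity check: should say "dirty" or "clean", not "no cache"

Neither setting is persistent across reboots, so they typically end up in a boot script or udev rule.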
Re: Btrfs Raid5 issue.
On 2017年08月22日 13:19, Robert LeBlanc wrote:
> Chris and Qu thanks for your help. I was able to restore the data off
> the volume. I only could not read one file that I tried to rsync (a
> MySQL bin log), but it wasn't critical as I had an off-site snapshot
> from that morning and ownCloud could resync the files that were
> changed anyway. This turned out much better than the md RAID failure
> that I had a year ago. Much faster recovery thanks to snapshots.
>
> Is there anything you would like from this damaged filesystem to help
> determine what went wrong and to help make btrfs better? If I don't
> hear back from you in a day, I'll destroy it so that I can add the
> disks into the new btrfs volumes to restore redundancy.

Feel free to destroy the old images. If nologreplay works, that's good enough. The problem seems to be the extent tree, but it's too hard to locate the real problem.

> Bcache wasn't providing the performance I was hoping for, so I'm
> putting the root and roots for my LXC containers on the SSDs (btrfs
> RAID1) and the bulk stuff on the three spindle drives (btrfs RAID1).

Well, I'm more interested in the bcache performance. I was considering using my Intel 600P NVMe to cache one 2.5" HGST 1TB HDD (7200rpm) in my btrfs KVM host (also my daily machine). Would you please share more details about the performance problem? (Maybe it's some btrfs performance problem, not bcache. Btrfs is not good at workloads like DBs or metadata-heavy operations.)

> For some reason, it seemed that the btrfs RAID5 setup required one of
> the drives, but I thought I had data with RAID5 and metadata with 2
> copies. Was I missing something else that prevented mounting with
> that specific drive? I don't want to get into a situation where one
> drive dies and I can't get to any data.

The direct cause is that btrfs fails to replay its log, and it's the corrupted extent tree causing the log replay to fail. Normally such a failure will definitely cause problems, so btrfs just stops the mount procedure.

In your case, if "nologreplay" is specified, btrfs skips the problem, and since you must specify RO for nologreplay, btrfs has nothing to do with the extent tree at all. So btrfs can be mounted.

Why the extent tree got corrupted is still unknown. If your metadata is also RAID5, then the write hole may be the cause. If your metadata profile is RAID1, then I don't know why this could happen.

So from this point of view, even if we fixed the btrfs scrub/race problems, it's still not good enough to survive a disk removal in the real world.

With a RAID1 setup, at least we don't need to care about the write hole, and csums will help us to determine which copy is correct, so I think it will be much better than RAID56. If you have spare time, you could try hot-plugging RAID1 devices to verify how it works. But please note that re-attaching a plugged device may require umounting the fs and re-scanning btrfs. And even if you're using 3 devices with RAID1, it's still 2 copies. So you can lose at most 1 device.

Thanks,
Qu

> Thank you again.
>
> Robert LeBlanc
> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1
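[Editor's note] A hedged sketch of the re-scan sequence Qu describes for a hot-replugged RAID1 member; device and mount point are placeholders:

$ sudo umount /mnt
$ sudo btrfs device scan              # re-register all btrfs devices with the kernel after the disk reappears
$ sudo mount /dev/sdb /mnt            # any member device can be named; btrfs assembles the rest
$ sudo btrfs scrub start -B /mnt      # repair the copies that went stale while the device was unplugged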
Re: Btrfs Raid5 issue.
On Mon, Aug 21, 2017 at 11:19 PM, Robert LeBlanc <rob...@leblancnet.us> wrote:
> Chris and Qu thanks for your help. I was able to restore the data off
> the volume. I only could not read one file that I tried to rsync (a
> MySQL bin log), but it wasn't critical as I had an off-site snapshot
> from that morning and ownCloud could resync the files that were
> changed anyway. This turned out much better than the md RAID failure
> that I had a year ago. Much faster recovery thanks to snapshots.
>
> Is there anything you would like from this damaged filesystem to help
> determine what went wrong and to help make btrfs better? If I don't
> hear back from you in a day, I'll destroy it so that I can add the
> disks into the new btrfs volumes to restore redundancy.
>
> Bcache wasn't providing the performance I was hoping for, so I'm
> putting the root and roots for my LXC containers on the SSDs (btrfs
> RAID1) and the bulk stuff on the three spindle drives (btrfs RAID1).
> For some reason, it seemed that the btrfs RAID5 setup required one of
> the drives, but I thought I had data with RAID5 and metadata with 2
> copies. Was I missing something else that prevented mounting with
> that specific drive? I don't want to get into a situation where one
> drive dies and I can't get to any data.

With all three connected, what do you get for 'btrfs fi show'?

The first email says the supers on all three drives are OK, but still it's confusing that degraded is working. It suggests it's not finding something on one of the drives that it needs to mount - usually that's the first superblock, or the system block group is partly corrupt, or a read error or something; and mounting degraded makes it possible to mount.

Anyway, at least all of the data is safe now. Pretty much all you can do to guard against data loss is backups. Any degraded state is precarious because it requires just one more thing to go wrong, and it's all bad news from there.

Gluster is pretty easy to set up, and you can use either the gluster native mount on linux or smb with everything else. Stick a big drive in a raspberry pi (or two) and even though it's only fast ethernet (haha, now slow 100Mbps ethernet) it will still replicate automatically as well as fail over. Plus one of those could be XFS if you wanted to hedge your bets. Or one of the less expensive Intel NUCs will also work if you want to stick with x86.

--
Chris Murphy
Re: Btrfs Raid5 issue.
Chris and Qu thanks for your help. I was able to restore the data off the volume. I only could not read one file that I tried to rsync (a MySQL bin log), but it wasn't critical as I had an off-site snapshot from that morning and ownCloud could resync the files that were changed anyway. This turned out much better than the md RAID failure that I had a year ago. Much faster recovery thanks to snapshots.

Is there anything you would like from this damaged filesystem to help determine what went wrong and to help make btrfs better? If I don't hear back from you in a day, I'll destroy it so that I can add the disks into the new btrfs volumes to restore redundancy.

Bcache wasn't providing the performance I was hoping for, so I'm putting the root and roots for my LXC containers on the SSDs (btrfs RAID1) and the bulk stuff on the three spindle drives (btrfs RAID1). For some reason, it seemed that the btrfs RAID5 setup required one of the drives, but I thought I had data with RAID5 and metadata with 2 copies. Was I missing something else that prevented mounting with that specific drive? I don't want to get into a situation where one drive dies and I can't get to any data.

Thank you again.

Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1
Re: Btrfs Raid5 issue.
On Mon, Aug 21, 2017 at 10:31 AM, Robert LeBlanc wrote:
> Qu,
>
> Sorry, I'm not on the list (I was for a few years about three years ago).
>
> I looked at the backup roots like you mentioned.
>
> # ./btrfs inspect dump-super -f /dev/bcache0
> superblock: bytenr=65536, device=/dev/bcache0
> ---------------------------------------------------------
> csum_type               0 (crc32c)
> csum_size               4
> csum                    0x45302c8f [match]
> bytenr                  65536
> flags                   0x1
>                         ( WRITTEN )
> magic                   _BHRfS_M [match]
> fsid                    fef29f0a-dc4c-4cc4-b524-914e6630803c
> label                   kvm-btrfs
> generation              1620386
> root                    5310022877184
> sys_array_size          161
> chunk_root_generation   1620164
> root_level              1
> chunk_root              4725030256640
> chunk_root_level        1
> log_root                2876047507456
> log_root_transid        0
> log_root_level          0
> total_bytes             8998588280832
> bytes_used              3625869234176
> sectorsize              4096
> nodesize                16384
> leafsize (deprecated)   16384
> stripesize              4096
> root_dir                6
> num_devices             3
> compat_flags            0x0
> compat_ro_flags         0x0
> incompat_flags          0x1e1
>                         ( MIXED_BACKREF |
>                           BIG_METADATA |
>                           EXTENDED_IREF |
>                           RAID56 |
>                           SKINNY_METADATA )
> cache_generation        1620386
> uuid_tree_generation    42
> dev_item.uuid           cb56a9b7-8d67-4ae8-8cb0-076b0b93f9c4
> dev_item.fsid           fef29f0a-dc4c-4cc4-b524-914e6630803c [match]
> dev_item.type           0
> dev_item.total_bytes    2998998654976
> dev_item.bytes_used     2295693574144
> dev_item.io_align       4096
> dev_item.io_width       4096
> dev_item.sector_size    4096
> dev_item.devid          2
> dev_item.dev_group      0
> dev_item.seek_speed     0
> dev_item.bandwidth      0
> dev_item.generation     0
> sys_chunk_array[2048]:
>         item 0 key (FIRST_CHUNK_TREE CHUNK_ITEM 4725030256640)
>                 length 67108864 owner 2 stripe_len 65536 type SYSTEM|RAID5
>                 io_align 65536 io_width 65536 sector_size 4096
>                 num_stripes 3 sub_stripes 1
>                         stripe 0 devid 1 offset 2185232384
>                         dev_uuid e273c794-b231-4d86-9a38-53a6d2fa8643
>                         stripe 1 devid 3 offset 1195075698688
>                         dev_uuid 120d6a05-b0bc-46c8-a87e-ca4fe5008d09
>                         stripe 2 devid 2 offset 41340108800
>                         dev_uuid cb56a9b7-8d67-4ae8-8cb0-076b0b93f9c4
> backup_roots[4]:
>         backup 0:
>                 backup_tree_root:    5309879451648  gen: 1620384  level: 1
>                 backup_chunk_root:   4725030256640  gen: 1620164  level: 1
>                 backup_extent_root:  5309910958080  gen: 1620385  level: 2
>                 backup_fs_root:      3658468147200  gen: 1618016  level: 1
>                 backup_dev_root:     5309904224256  gen: 1620384  level: 1
>                 backup_csum_root:    5309910532096  gen: 1620385  level: 3
>                 backup_total_bytes:  8998588280832
>                 backup_bytes_used:   3625871646720
>                 backup_num_devices:  3
>
>         backup 1:
>                 backup_tree_root:    5309780492288  gen: 1620385  level: 1
>                 backup_chunk_root:   4725030256640  gen: 1620164  level: 1
>                 backup_extent_root:  5309659037696  gen: 1620385  level: 2
>                 backup_fs_root:      0              gen: 0        level: 0
>                 backup_dev_root:     5309872275456  gen: 1620385  level: 1
>                 backup_csum_root:    5309674536960  gen: 1620385  level: 3
>                 backup_total_bytes:  8998588280832
>                 backup_bytes_used:   3625869234176
>                 backup_num_devices:  3

Well that's strange. A backup entry with a null fs root.

> I noticed on that page that there is a 'nologreplay' mount option so I
> tried it with degraded and it requires ro, but the volume mounted and
> I can "see" things on the volume.

Degraded suggests it's not finding one of the three devices.

> So with this nologreplay option, if I do a btrfs send of the subvolume
> that I'm interested in (I don't think it was being written to at the
> time of failure), would it copy (send) over the corruption as well.

Anything that results in EIO will get included in the send, and by default receive fails. You can use verbose messaging on the receive side, and use the -E option to permit the errors.
But file system specific problems aren't going to
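[Editor's note] A hedged sketch of the salvage-by-send flow Chris outlines above; the snapshot name, mount points and max-errors value are placeholders, and the source subvolume must carry the read-only flag for send to accept it:

$ sudo mount -o ro,nologreplay,degraded /dev/bcache0 /tmp/root
$ sudo btrfs send /tmp/root/@mythtv | sudo btrfs receive -v -E 100 /backup   # -E N: tolerate up to N errors instead of aborting on the first EIO

Files whose data returns EIO on the source end up missing or short on the receive side, so comparing against existing backups afterwards is a cheap way to list what was lost.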
Re: Btrfs Raid5 issue.
Qu,

Sorry, I'm not on the list (I was for a few years about three years ago).

I looked at the backup roots like you mentioned.

# ./btrfs inspect dump-super -f /dev/bcache0
superblock: bytenr=65536, device=/dev/bcache0
---------------------------------------------------------
csum_type               0 (crc32c)
csum_size               4
csum                    0x45302c8f [match]
bytenr                  65536
flags                   0x1
                        ( WRITTEN )
magic                   _BHRfS_M [match]
fsid                    fef29f0a-dc4c-4cc4-b524-914e6630803c
label                   kvm-btrfs
generation              1620386
root                    5310022877184
sys_array_size          161
chunk_root_generation   1620164
root_level              1
chunk_root              4725030256640
chunk_root_level        1
log_root                2876047507456
log_root_transid        0
log_root_level          0
total_bytes             8998588280832
bytes_used              3625869234176
sectorsize              4096
nodesize                16384
leafsize (deprecated)   16384
stripesize              4096
root_dir                6
num_devices             3
compat_flags            0x0
compat_ro_flags         0x0
incompat_flags          0x1e1
                        ( MIXED_BACKREF |
                          BIG_METADATA |
                          EXTENDED_IREF |
                          RAID56 |
                          SKINNY_METADATA )
cache_generation        1620386
uuid_tree_generation    42
dev_item.uuid           cb56a9b7-8d67-4ae8-8cb0-076b0b93f9c4
dev_item.fsid           fef29f0a-dc4c-4cc4-b524-914e6630803c [match]
dev_item.type           0
dev_item.total_bytes    2998998654976
dev_item.bytes_used     2295693574144
dev_item.io_align       4096
dev_item.io_width       4096
dev_item.sector_size    4096
dev_item.devid          2
dev_item.dev_group      0
dev_item.seek_speed     0
dev_item.bandwidth      0
dev_item.generation     0
sys_chunk_array[2048]:
        item 0 key (FIRST_CHUNK_TREE CHUNK_ITEM 4725030256640)
                length 67108864 owner 2 stripe_len 65536 type SYSTEM|RAID5
                io_align 65536 io_width 65536 sector_size 4096
                num_stripes 3 sub_stripes 1
                        stripe 0 devid 1 offset 2185232384
                        dev_uuid e273c794-b231-4d86-9a38-53a6d2fa8643
                        stripe 1 devid 3 offset 1195075698688
                        dev_uuid 120d6a05-b0bc-46c8-a87e-ca4fe5008d09
                        stripe 2 devid 2 offset 41340108800
                        dev_uuid cb56a9b7-8d67-4ae8-8cb0-076b0b93f9c4
backup_roots[4]:
        backup 0:
                backup_tree_root:    5309879451648  gen: 1620384  level: 1
                backup_chunk_root:   4725030256640  gen: 1620164  level: 1
                backup_extent_root:  5309910958080  gen: 1620385  level: 2
                backup_fs_root:      3658468147200  gen: 1618016  level: 1
                backup_dev_root:     5309904224256  gen: 1620384  level: 1
                backup_csum_root:    5309910532096  gen: 1620385  level: 3
                backup_total_bytes:  8998588280832
                backup_bytes_used:   3625871646720
                backup_num_devices:  3

        backup 1:
                backup_tree_root:    5309780492288  gen: 1620385  level: 1
                backup_chunk_root:   4725030256640  gen: 1620164  level: 1
                backup_extent_root:  5309659037696  gen: 1620385  level: 2
                backup_fs_root:      0              gen: 0        level: 0
                backup_dev_root:     5309872275456  gen: 1620385  level: 1
                backup_csum_root:    5309674536960  gen: 1620385  level: 3
                backup_total_bytes:  8998588280832
                backup_bytes_used:   3625869234176
                backup_num_devices:  3

        backup 2:
                backup_tree_root:    5310022877184  gen: 1620386  level: 1
                backup_chunk_root:   4725030256640  gen: 1620164  level: 1
                backup_extent_root:  2876048949248  gen: 1620387  level: 2
                backup_fs_root:      3658468147200  gen: 1618016  level: 1
                backup_dev_root:     5309872275456  gen: 1620385  level: 1
                backup_csum_root:    5310042259456  gen: 1620386  level: 3
                backup_total_bytes:  8998588280832
                backup_bytes_used:   3625869250560
                backup_num_devices:  3

        backup 3:
                backup_tree_root:    5309771448320  gen: 1620383  level: 1
                backup_chunk_root:   4725030256640  gen: 1620164  level: 1
                backup_extent_root:  5309779804160  gen: 1620384  level: 2
                backup_fs_root:      3658468147200  gen: 1618016  level: 1
                backup_dev_root:     5309848158208
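[Editor's note] Given backup roots like the above, the usual next step is to check whether the filesystem is readable from one of them; a hedged sketch using values from the dump (read-only throughout; usebackuproot needs a 4.6+ kernel, and older ones spell the option -o recovery):

# mount -o ro,usebackuproot /dev/bcache0 /mnt            # let the kernel walk the backup_tree_root entries itself
# btrfs check --tree-root 5309879451648 /dev/bcache0     # or point check at the backup 0 tree root explicitly (read-only)

If one of these gets further than the default root, copying data off that read-only mount is safer than attempting any repair.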
Re: Btrfs Raid5 issue.
I lost enough Btrfs m=d=s=RAID5 filesystems in past experiments (I didn't try using RAID5 for metadata and system chunks in the last few years) to faulty SATA cables + hotplug-enabled SATA controllers (where a disk could disappear and reappear "as the wind blew"). Since then, I have made a habit of always disabling hotplug for all SATA disks involved with Btrfs, even those with an m=d=s=single profile (and I never desired to build multi-device filesystems from USB-attached disks anyway, but this is a good reason for me to explicitly avoid that).

I am not sure if other RAID profiles are affected in a similar way or if it's just RAID56. (Well, I mean RAID0 is obviously toast and RAID1/10 will obviously get degraded, but I am not sure if it's possible to re-sync RAID1/10 with a simple balance [possibly even without remounting and doing a manual device delete/add?] or whether the filesystem has to be recreated from scratch [like RAID5].)

I think this hotplug problem is an entirely different issue from the RAID56-scrub race conditions (which are now considered fixed in linux 4.12) and nobody is currently working on it (if it's RAID56-only then I don't expect a fix anytime soon [think years]).
Re: Btrfs Raid5 issue.
On 2017年08月21日 12:33, Robert LeBlanc wrote:
> I've been running btrfs in a raid5 for about a year now with bcache
> in front of it. Yesterday, one of my drives was acting really slow,
> so I was going to move it to a different port. I guess I got too
> comfortable hot plugging drives in at work and didn't think twice
> about what could go wrong - hey, I set it up in RAID5 so it will be
> fine. Well, it wasn't...

Well, Btrfs RAID5 is not that safe. I would recommend using RAID1 for metadata at least. (And in your case, your metadata is damaged, so I really recommend a better profile for your metadata.)

> I was aware of the write hole issue, and thought it was committed to
> the 4.12 branch, so I was running 4.12.5 at the time. I have two SSDs
> that are in an md RAID1 that is the cache for the three backing
> devices in bcache (bcache{0..2} or bcache{0,16,32} depending on the
> kernel booted). I have all my critical data saved off on btrfs
> snapshots on a different host, but I don't transfer my MythTV subs
> that often, so I'd like to try to recover some of that if possible.
>
> What is really interesting is that I could not boot the first time
> (root on the btrfs volume), but I rebooted again and the fs was in
> read-only mode, but only one of the three disks was in read-only. I
> tried to reboot again and it never mounted again after that. I see
> some messages in dmesg like this:
>
> [  151.201637] BTRFS info (device bcache0): disk space caching is enabled
> [  151.201640] BTRFS info (device bcache0): has skinny extents
> [  151.215697] BTRFS info (device bcache0): bdev /dev/bcache16 errs: wr 309, rd 319, flush 39, corrupt 0, gen 0
> [  151.931764] BTRFS info (device bcache0): detected SSD devices, enabling SSD mode
> [  152.058915] BTRFS error (device bcache0): parent transid verify failed on 5309837426688 wanted 1620383 found 1619473
> [  152.059944] BTRFS error (device bcache0): parent transid verify failed on 5309837426688 wanted 1620383 found 1619473

Normally a transid error indicates a bigger problem, and it is normally hard to trace.

> [  152.060018] BTRFS: error (device bcache0) in __btrfs_free_extent:6989: errno=-5 IO failure
> [  152.060060] BTRFS: error (device bcache0) in btrfs_run_delayed_refs:3009: errno=-5 IO failure
> [  152.071613] BTRFS info (device bcache0): delayed_refs has NO entry
> [  152.074126] BTRFS: error (device bcache0) in btrfs_replay_log:2475: errno=-5 IO failure (Failed to recover log tree)
> [  152.074244] BTRFS error (device bcache0): cleaner transaction attach returned -30
> [  152.148993] BTRFS error (device bcache0): open_ctree failed
>
> So, I thought that the log was corrupted, I could live without the
> last 30 seconds or so, I tried `btrfs rescue zero-log /dev/bcache0`
> and I get a backtrace.

Yes, your idea about the log is correct. It's the log replay causing the problem. But the root cause seems to be a corrupted extent tree, which is not easy to fix.

> I ran `btrfs rescue chunk-recover /dev/bcache0` and it spent hours
> scanning the three disks and at the end tried to fix the logs (or
> tree, I can't remember exactly) and then I got another backtrace.
> Today, I compiled 4.13-rc6 to see if some of the latest fixes would
> help, no dice (the dmesg above is from 4.13-rc6). I compiled the
> latest master of btrfs-progs, no progress.
> Things I've tried:
>
> mount
> mount -o degraded
> mount -o degraded,ro
> mount -o degraded (with each drive disconnected in turn to see if it
> would start without one of the drives)
> btrfs rescue chunk-recover
> btrfs rescue super-recover (all drives report the superblocks are fine)
> btrfs rescue zero-log (always has a backtrace)

I think it's some other problem causing the backtrace. Normally extent tree corruption or a transid error.

> btrfs check
>
> I know that bcache complicates things, but I'm hoping for two things.
> 1. Try to get what I can off the volume. 2. Provide some information
> that can help make btrfs/bcache better for the future.
>
> Here is what `btrfs rescue zero-log` outputs:
>
> # ./btrfs rescue zero-log /dev/bcache0
> Clearing log on /dev/bcache0, previous log_root 2876047507456, level 0
> parent transid verify failed on 5309233872896 wanted 1620381 found 1619462
> parent transid verify failed on 5309233872896 wanted 1620381 found 1619462
> checksum verify failed on 5309233872896 found 6A103358 wanted 8EF38EEE
> checksum verify failed on 5309233872896 found 6A103358 wanted 8EF38EEE
> bytenr mismatch, want=5309233872896, have=65536
> parent transid verify failed on 5309233872896 wanted 1620381 found 1619462
> parent transid verify failed on 5309233872896 wanted 1620381 found 1619462
> checksum verify failed on 5309233872896 found 6A103358 wanted 8EF38EEE
> checksum verify failed on 5309233872896 found 6A103358 wanted 8EF38EEE
> bytenr mismatch, want=5309233872896, have=65536
> parent transid verify failed on 5309233872896 wanted 1620381 found 1619462
> parent transid verify failed on 5309233872896 wanted 1620381 found 1619462
> checksum verify failed on 5309233872896 found 6A103358 wanted 8EF38EEE
> checksum verify failed on 5309233872896
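[Editor's note] When check, zero-log and chunk-recover all backtrace like this, the usual read-only salvage path is btrfs restore, which walks the trees without mounting. A hedged sketch; the target directory is a placeholder:

# btrfs restore -D -v /dev/bcache0 /mnt/rescue    # -D: dry run, only list what would be recovered
# btrfs restore -v -i /dev/bcache0 /mnt/rescue    # -i: ignore errors and keep going past damaged extents

restore never writes to the source device, so it is safe to try before any repair attempt.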
Btrfs Raid5 issue.
I've been running btrfs in a raid5 for about a year now with bcache in front of it. Yesterday, one of my drives was acting really slow, so I was going to move it to a different port. I guess I got too comfortable hot plugging drives in at work and didn't think twice about what could go wrong - hey, I set it up in RAID5 so it will be fine. Well, it wasn't...

I was aware of the write hole issue, and thought it was committed to the 4.12 branch, so I was running 4.12.5 at the time. I have two SSDs that are in an md RAID1 that is the cache for the three backing devices in bcache (bcache{0..2} or bcache{0,16,32} depending on the kernel booted). I have all my critical data saved off on btrfs snapshots on a different host, but I don't transfer my MythTV subs that often, so I'd like to try to recover some of that if possible.

What is really interesting is that I could not boot the first time (root on the btrfs volume), but I rebooted again and the fs was in read-only mode, but only one of the three disks was in read-only. I tried to reboot again and it never mounted again after that. I see some messages in dmesg like this:

[  151.201637] BTRFS info (device bcache0): disk space caching is enabled
[  151.201640] BTRFS info (device bcache0): has skinny extents
[  151.215697] BTRFS info (device bcache0): bdev /dev/bcache16 errs: wr 309, rd 319, flush 39, corrupt 0, gen 0
[  151.931764] BTRFS info (device bcache0): detected SSD devices, enabling SSD mode
[  152.058915] BTRFS error (device bcache0): parent transid verify failed on 5309837426688 wanted 1620383 found 1619473
[  152.059944] BTRFS error (device bcache0): parent transid verify failed on 5309837426688 wanted 1620383 found 1619473
[  152.060018] BTRFS: error (device bcache0) in __btrfs_free_extent:6989: errno=-5 IO failure
[  152.060060] BTRFS: error (device bcache0) in btrfs_run_delayed_refs:3009: errno=-5 IO failure
[  152.071613] BTRFS info (device bcache0): delayed_refs has NO entry
[  152.074126] BTRFS: error (device bcache0) in btrfs_replay_log:2475: errno=-5 IO failure (Failed to recover log tree)
[  152.074244] BTRFS error (device bcache0): cleaner transaction attach returned -30
[  152.148993] BTRFS error (device bcache0): open_ctree failed

So, I thought that the log was corrupted, I could live without the last 30 seconds or so, I tried `btrfs rescue zero-log /dev/bcache0` and I get a backtrace. I ran `btrfs rescue chunk-recover /dev/bcache0` and it spent hours scanning the three disks and at the end tried to fix the logs (or tree, I can't remember exactly) and then I got another backtrace. Today, I compiled 4.13-rc6 to see if some of the latest fixes would help, no dice (the dmesg above is from 4.13-rc6). I compiled the latest master of btrfs-progs, no progress.

Things I've tried:

mount
mount -o degraded
mount -o degraded,ro
mount -o degraded (with each drive disconnected in turn to see if it would start without one of the drives)
btrfs rescue chunk-recover
btrfs rescue super-recover (all drives report the superblocks are fine)
btrfs rescue zero-log (always has a backtrace)
btrfs check

I know that bcache complicates things, but I'm hoping for two things. 1. Try to get what I can off the volume. 2. Provide some information that can help make btrfs/bcache better for the future.
Here is what `btrfs rescue zero-log` outputs:

# ./btrfs rescue zero-log /dev/bcache0
Clearing log on /dev/bcache0, previous log_root 2876047507456, level 0
parent transid verify failed on 5309233872896 wanted 1620381 found 1619462
parent transid verify failed on 5309233872896 wanted 1620381 found 1619462
checksum verify failed on 5309233872896 found 6A103358 wanted 8EF38EEE
checksum verify failed on 5309233872896 found 6A103358 wanted 8EF38EEE
bytenr mismatch, want=5309233872896, have=65536
parent transid verify failed on 5309233872896 wanted 1620381 found 1619462
parent transid verify failed on 5309233872896 wanted 1620381 found 1619462
checksum verify failed on 5309233872896 found 6A103358 wanted 8EF38EEE
checksum verify failed on 5309233872896 found 6A103358 wanted 8EF38EEE
bytenr mismatch, want=5309233872896, have=65536
parent transid verify failed on 5309233872896 wanted 1620381 found 1619462
parent transid verify failed on 5309233872896 wanted 1620381 found 1619462
checksum verify failed on 5309233872896 found 6A103358 wanted 8EF38EEE
checksum verify failed on 5309233872896 found 6A103358 wanted 8EF38EEE
bytenr mismatch, want=5309233872896, have=65536
parent transid verify failed on 5309233872896 wanted 1620381 found 1619462
parent transid verify failed on 5309233872896 wanted 1620381 found 1619462
checksum verify failed on 5309233872896 found 6A103358 wanted 8EF38EEE
checksum verify failed on 5309233872896 found 6A103358 wanted 8EF38EEE
bytenr mismatch, want=5309233872896, have=65536
parent transid verify failed on 5309233872896 wanted 1620381 found 1619462
parent transid verify failed on 5309233872896 wanted 1620381 found 1619462
checksum verify failed on
Re: Btrfs/RAID5 became unmountable after SATA cable fault
It seems like I accidentally managed to break my Btrfs/RAID5 filesystem, yet again, in a similar fashion. This time around, I ran into some random libata driver issue (?) instead of a faulty hardware part, but the end result is quite similar.

I issued the command (replacing X with valid letters for every hard drive in the system):

# echo 1 > /sys/block/sdX/device/queue_depth

and I ended up with read-only filesystems. I checked dmesg and saw write errors on every disk (not just those in RAID5). I tried to reboot immediately, without success.

My root filesystem with a single-disk Btrfs (which is an SSD, so it has the "single" profile for both data and metadata) was unmountable, thus the kernel was stuck in a panic-reboot cycle. I managed to fix this one by booting from a USB stick and trying various recovery methods (like mounting it with "-o clear_cache,nospace_cache,recovery" and running "btrfs rescue chunk-recover") until everything seemed to be fine (it can now be mounted read-write without error messages in the kernel log, it can be fully scrubbed without errors reported, it passes "btrfs check", files can actually be written and read, etc).

Once my system was up and running (well, sort of), I realized my /data is also un-mountable. I tried the same recovery methods on this RAID5 filesystem but nothing seemed to help. (There is an exception with the recovery attempts: the system drive was a small and fast SSD, so "chunk-recover" was a viable option to try, but this filesystem consists of huge slow HDDs - so I tried to run it as a last resort over-night, but I found an unresponsive machine in the morning with the process stuck relatively early in the run.)

I can always mount it read-only and access files on it, seemingly without errors (I compared some of the contents with backups and it looks good), but as soon as I mount it read-write, all hell breaks loose: it falls into read-only state in no time (with some files seemingly disappearing from the filesystem) and the kernel log starts getting spammed with various kinds of error messages (including missing csums, etc).
After mounting it like this:

# mount /dev/sdb /data -o rw,noatime,nospace_cache

and doing:

# btrfs scrub start /data

the result is:

scrub status for 7d4769d6-2473-4c94-b476-4facce24b425
        scrub started at Sat Jul 23 13:50:55 2016 and was aborted after 00:05:30
        total bytes scrubbed: 18.99GiB with 16 errors
        error details: read=16
        corrected errors: 0, uncorrectable errors: 16, unverified errors: 0

The relevant dmesg output is:

[ 1047.709830] BTRFS info (device sdc): disabling disk space caching
[ 1047.709846] BTRFS: has skinny extents
[ 1047.895818] BTRFS info (device sdc): bdev /dev/sdc errs: wr 4, rd 0, flush 0, corrupt 0, gen 0
[ 1047.895835] BTRFS info (device sdc): bdev /dev/sdb errs: wr 4, rd 0, flush 0, corrupt 0, gen 0
[ 1065.764352] BTRFS: checking UUID tree
[ 1386.423973] BTRFS error (device sdc): parent transid verify failed on 24431936729088 wanted 585936 found 586145
[ 1386.430922] BTRFS error (device sdc): parent transid verify failed on 24431936729088 wanted 585936 found 586145
[ 1411.738955] BTRFS error (device sdc): parent transid verify failed on 24432322764800 wanted 585779 found 586145
[ 1411.948040] BTRFS error (device sdc): parent transid verify failed on 24432322764800 wanted 585779 found 586145
[ 1412.040964] BTRFS error (device sdc): parent transid verify failed on 24432322764800 wanted 585779 found 586145
[ 1412.040980] BTRFS error (device sdc): parent transid verify failed on 24432322764800 wanted 585779 found 586145
[ 1412.041134] BTRFS error (device sdc): parent transid verify failed on 24432322764800 wanted 585779 found 586145
[ 1412.042628] BTRFS error (device sdc): parent transid verify failed on 24432322764800 wanted 585779 found 586145
[ 1412.042748] BTRFS error (device sdc): parent transid verify failed on 24432322764800 wanted 585779 found 586145
[ 1499.45] BTRFS error (device sdc): parent transid verify failed on 24432312270848 wanted 585779 found 586143
[ 1499.230264] BTRFS error (device sdc): parent transid verify failed on 24432312270848 wanted 585779 found 586143
[ 1525.865143] BTRFS error (device sdc): parent transid verify failed on 24432367730688 wanted 585779 found 586144
[ 1525.880537] BTRFS error (device sdc): parent transid verify failed on 24432367730688 wanted 585779 found 586144
[ 1552.434209] BTRFS error (device sdc): parent transid verify failed on 24432415821824 wanted 585781 found 586144
[ 1552.437325] BTRFS error (device sdc): parent transid verify failed on 24432415821824 wanted 585781 found 586144

btrfs check /dev/sdc results in:

Checking filesystem on /dev/sdc
UUID: 7d4769d6-2473-4c94-b476-4facce24b425
checking extents
parent transid verify failed on 24431859855360 wanted 585941 found 586144
parent transid verify failed on 24431859855360 wanted 585941 found 586144
checksum verify fa
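[Editor's note] Since the filesystem behaves only when mounted read-only, a hedged sketch of getting the data off before any further experiments; the target path is a placeholder, and nologreplay (which also forces ro) needs a 4.6+ kernel:

# mount -o ro,nospace_cache /dev/sdb /data           # or: -o ro,nologreplay on kernels that support it
# rsync -aHAX --info=progress2 /data/ /mnt/backup/   # archive copy; rsync reports per-file I/O errors and keeps going

Anything rsync flags with an I/O error is a candidate for restoring from the existing backups rather than from this filesystem.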
Re: Adventures in btrfs raid5 disk recovery
On Wed, Jul 6, 2016 at 1:15 PM, Austin S. Hemmelgarn wrote:
> On 2016-07-06 14:45, Chris Murphy wrote:
>> I think it's statistically 0 people changing this from default. It's
>> people with drives that have no SCT ERC support, used in raid1+, who
>> happen to stumble upon this very obscure work around to avoid link
>> resets in the face of media defects. Rare.
>
> Not as much as you think; once someone has this issue, they usually
> put preventative measures in place on any system where it applies.
> I'd be willing to bet that most sysadmins at big companies like
> RedHat or Oracle are setting this.

SCT ERC, yes. Changing the kernel's command timer? I think almost zero.

>> Well they have link resets and their file system presumably face
>> plants as a result of a pile of commands in the queue returning as
>> unsuccessful. So they have premature death of their system, rather
>> than it getting sluggish. This is a long standing indicator on
>> Windows to just reinstall the OS and restore data from backups -> the
>> user has an opportunity to freshen up user data backup, and the
>> reinstallation and restore from backup results in freshly written
>> sectors which is how bad sectors get fixed. The marginally bad
>> sectors get new writes and now read fast (or fast enough), and the
>> persistently bad sectors result in the drive firmware remapping to
>> reserve sectors.
>>
>> The main thing in my opinion is less extension of drive life, as it
>> is the user gets to use the system, albeit sluggish, to make a backup
>> of their data rather than possibly losing it.
>
> The extension of the drive's lifetime is a nice benefit, but not what
> my point was here. For people in this particular case, it will almost
> certainly only make things better (although at first it may make
> performance worse).

I'm not sure why it makes performance worse. The options are slower reads vs a file system that almost certainly face plants upon a link reset.

>> Basically it's:
>>
>> For SATA and USB drives:
>>
>> if data redundant, then enable short SCT ERC time if supported, if
>> not supported then extend SCSI command timer to 200;
>>
>> if data not redundant, then disable SCT ERC if supported, and extend
>> SCSI command timer to 200.
>>
>> For SCSI (SAS most likely these days), keep things the same as now.
>> But that's only because this is a rare enough configuration now I
>> don't know if we really know the problems there. It may be that their
>> error recovery in 7 seconds is massively better and more reliable
>> than consumer drives over 180 seconds.
>
> I don't see why you would think this is not common.

I was not clear. Single-device SAS is probably not common. Those drives are typically used in arrays where data is redundant. Using such a drive with short error recovery as a single boot drive? Probably not that common.

> Separately, USB gets _really_ complicated if you want to cover
> everything. USB drives may or may not present as non-rotational, may
> or may not show up as SATA or SCSI bridges (there are some of the
> more expensive flash drives that actually use SSD controllers plus
> USB-SAT chips internally), and if they do show up as such, may or may
> not support the required commands (most don't, but it's seemingly hit
> or miss which do).

Yup. Well, do what we can instead of just ignoring the problem? They can still be polled for features including SCT ERC, and if it's not supported or configurable then fall back to increasing the command timer. I'm not sure what else can be done anyway.
The main obstacle is squaring the device capability (low level) with storage stack redundancy 0 or 1 (high level). Something has to be aware of both to get all devices ideally configured. >> Yep, it's imperfect unless there's proper cross-communication >> between layers. There are some such things, like hardware raid geometry >> that optionally pokes through (when supported by hardware raid drivers) >> so that things like mkfs.xfs can automatically provide the right sunit and >> swidth for an optimized layout, which the device mapper already does >> automatically. So it could be done; it's just a matter of how big a >> problem this is to build, vs. just going with a new one-size-fits-all >> default command timer? > > The other problem though is that the existing things pass through > _read-only_ data, while this requires writable data to be passed through, > which leads to all kinds of potentially complicated issues. I'm aware. There are also plenty of bugs even if writes were to pass through. I've encountered more drives than not which accept only one SCT ERC change per power-on. A 2nd change causes the drive to go offline and vanish off the bus. So no doubt this whole area is fragile enough that not even the drive, controller, and enclosure vendors are aware of where all the bodies are buried. What I think is fairly well established is that at least on Windows their lower level stuff including kernel
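In concrete terms, the workaround being discussed comes down to two knobs (a minimal sketch, assuming a SATA drive at /dev/sdX and smartmontools installed; SCT ERC values are in units of 100 ms):
smartctl -l scterc /dev/sdX # query the drive's current SCT ERC read/write timers
sudo smartctl -l scterc,70,70 /dev/sdX # 7.0 s error recovery, for a drive backing redundant data
sudo sh -c "echo 180 > /sys/block/sdX/device/timeout" # extend the kernel's SCSI command timer (seconds)
Note that the sysfs timeout reverts on reboot and SCT ERC reverts on a drive power cycle, so both have to be reapplied at every boot.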
Re: Adventures in btrfs raid5 disk recovery
On 2016-07-06 14:45, Chris Murphy wrote: On Wed, Jul 6, 2016 at 11:18 AM, Austin S. Hemmelgarn wrote: On 2016-07-06 12:43, Chris Murphy wrote: So does it make sense to just set the default to 180? Or is there a smarter way to do this? I don't know. Just thinking about this: 1. People who are setting this somewhere will be functionally unaffected. I think it's statistically 0 people changing this from default. It's people with drives that have no SCT ERC support, used in raid1+, who happen to stumble upon this very obscure workaround to avoid link resets in the face of media defects. Rare. Not as much as you think; once someone has this issue, they usually put preventative measures in place on any system where it applies. I'd be willing to bet that most sysadmins at big companies like RedHat or Oracle are setting this. 2. People using single disks which have lots of errors may or may not see an apparent degradation of performance, but will likely have the life expectancy of their device extended. Well, they have link resets and their file system presumably face plants as a result of a pile of commands in the queue returning as unsuccessful. So they have premature death of their system, rather than it getting sluggish. This is a long-standing indicator on Windows to just reinstall the OS and restore data from backups -> the user has an opportunity to freshen up user data backup, and the reinstallation and restore from backup results in freshly written sectors, which is how bad sectors get fixed. The marginally bad sectors get new writes and now read fast (or fast enough), and the persistently bad sectors result in the drive firmware remapping to reserve sectors. The main thing in my opinion is less extension of drive life; as it is, the user gets to use the system, albeit sluggish, to make a backup of their data rather than possibly losing it. The extension of the drive's lifetime is a nice benefit, but not what my point was here. For people in this particular case, it will almost certainly only make things better (although at first it may make performance worse). 3. Individuals who are not setting this but should be will on average be no worse off than before other than seeing a bigger performance hit on a disk error. 4. People with single disks which are new will see no functional change until the disk has an error. I follow. In an ideal situation, what I'd want to see is: 1. If the device supports SCT ERC, set scsi_command_timer to a reasonable percentage over that (probably something like 25%, which would give roughly 10 seconds for the normal 7 second ERC timer). 2. If the device is actually a SCSI device, keep the 30 second timer (IIRC, this is reasonable for SCSI disks). 3. Otherwise, set the timer to 200 (we need a slight buffer over the expected disk timeout to account for things like latency outside of the disk). Well, if it's a non-redundant configuration, you'd want those long recoveries permitted, rather than enabling SCT ERC. The drive has the ability to relocate sector data on a marginal (slow) read that's still successful. But clearly many manufacturers tolerate slow reads that don't result in immediate reallocation or overwrite, or we wouldn't be in this situation in the first place. I think this auto reallocation is thwarted by enabling SCT ERC. It just flat out gives up and reports a read error. So it is still data loss in the non-redundant configuration and thus not an improvement.
I agree, but if it's only the kernel doing this, then we can't make judgements based on userspace usage. Also, the first situation, while not optimal, is still better than what happens now; at least there you will get an I/O error in a reasonable amount of time (as opposed to after a really long time, if ever). Basically it's: For SATA and USB drives: if data redundant, then enable short SCT ERC time if supported; if not supported, then extend the SCSI command timer to 200; if data not redundant, then disable SCT ERC if supported, and extend the SCSI command timer to 200. For SCSI (SAS most likely these days), keep things the same as now. But that's only because this is a rare enough configuration now that I don't know if we really know the problems there. It may be that their error recovery in 7 seconds is massively better and more reliable than consumer drives' over 180 seconds. I don't see why you would think this is not common. If you count just by systems, then it's absolutely outnumbered at least 100 to 1 by regular ATA disks. If you look at individual disks though, the reverse is true, because people who use SCSI drives tend to use _lots_ of disks (think big data centers, NAS and SAN systems and such). OTOH, both are probably vastly outnumbered by stuff that doesn't use either standard for storage... Separately, USB gets _really_ complicated if you want to cover everything. USB drives may or may not present as non-rotational, may or may not show
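The redundant vs. non-redundant policy sketched above could be a very small script (a hypothetical helper, not an existing tool; it assumes smartmontools, root, and that the caller already knows whether the device backs a redundant profile):
#!/bin/sh
# tune-timeouts.sh (hypothetical): $1 = device node, $2 = "redundant" or anything else
dev="$1"; mode="$2"; blk="$(basename "$dev")"
if smartctl -l scterc "$dev" | grep -qi "not supported"; then
    # No SCT ERC at all: the only option is to out-wait the drive's internal recovery.
    echo 200 > "/sys/block/$blk/device/timeout"
elif [ "$mode" = "redundant" ]; then
    # Fail fast (7 s) and let md/lvm/btrfs repair from another copy.
    smartctl -l scterc,70,70 "$dev"
else
    # Sole copy of the data: permit the drive's long deep-recovery attempts.
    smartctl -l scterc,0,0 "$dev"
    echo 200 > "/sys/block/$blk/device/timeout"
fi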
Re: Adventures in btrfs raid5 disk recovery
On Wed, Jul 6, 2016 at 11:18 AM, Austin S. Hemmelgarn wrote: > On 2016-07-06 12:43, Chris Murphy wrote: >> So does it make sense to just set the default to 180? Or is there a >> smarter way to do this? I don't know. > > Just thinking about this: > 1. People who are setting this somewhere will be functionally unaffected. I think it's statistically 0 people changing this from default. It's people with drives that have no SCT ERC support, used in raid1+, who happen to stumble upon this very obscure workaround to avoid link resets in the face of media defects. Rare. > 2. People using single disks which have lots of errors may or may not see an > apparent degradation of performance, but will likely have the life > expectancy of their device extended. Well, they have link resets and their file system presumably face plants as a result of a pile of commands in the queue returning as unsuccessful. So they have premature death of their system, rather than it getting sluggish. This is a long-standing indicator on Windows to just reinstall the OS and restore data from backups -> the user has an opportunity to freshen up user data backup, and the reinstallation and restore from backup results in freshly written sectors, which is how bad sectors get fixed. The marginally bad sectors get new writes and now read fast (or fast enough), and the persistently bad sectors result in the drive firmware remapping to reserve sectors. The main thing in my opinion is less extension of drive life; as it is, the user gets to use the system, albeit sluggish, to make a backup of their data rather than possibly losing it. > 3. Individuals who are not setting this but should be will on average be no > worse off than before other than seeing a bigger performance hit on a disk > error. > 4. People with single disks which are new will see no functional change > until the disk has an error. I follow. > > In an ideal situation, what I'd want to see is: > 1. If the device supports SCT ERC, set scsi_command_timer to a reasonable > percentage over that (probably something like 25%, which would give roughly > 10 seconds for the normal 7 second ERC timer). > 2. If the device is actually a SCSI device, keep the 30 second timer (IIRC, > this is reasonable for SCSI disks). > 3. Otherwise, set the timer to 200 (we need a slight buffer over the > expected disk timeout to account for things like latency outside of the > disk). Well, if it's a non-redundant configuration, you'd want those long recoveries permitted, rather than enabling SCT ERC. The drive has the ability to relocate sector data on a marginal (slow) read that's still successful. But clearly many manufacturers tolerate slow reads that don't result in immediate reallocation or overwrite, or we wouldn't be in this situation in the first place. I think this auto reallocation is thwarted by enabling SCT ERC. It just flat out gives up and reports a read error. So it is still data loss in the non-redundant configuration and thus not an improvement. Basically it's: For SATA and USB drives: if data redundant, then enable short SCT ERC time if supported; if not supported, then extend the SCSI command timer to 200; if data not redundant, then disable SCT ERC if supported, and extend the SCSI command timer to 200. For SCSI (SAS most likely these days), keep things the same as now. But that's only because this is a rare enough configuration now that I don't know if we really know the problems there.
It may be that their error recovery in 7 seconds is massively better and more reliable than consumer drives' over 180 seconds. > >> >> I suspect, but haven't tested, that ZFS On Linux would be equally affected, unless they're completely reimplementing their own block layer (?) So there are quite a few parties now negatively impacted by the current default behavior. >>> >>> >>> OTOH, I would not be surprised if the stance there is 'you get no support >>> if >>> you're not using enterprise drives', not because of the project itself, but >>> because it's ZFS. Part of their minimum recommended hardware >>> requirements >>> is ECC RAM, so it wouldn't surprise me if enterprise storage devices are >>> there too. >> >> >> http://open-zfs.org/wiki/Hardware >> "Consistent performance requires hard drives that support error >> recovery control. " >> >> "Drives that lack such functionality can be expected to have >> arbitrarily high limits. Several minutes is not impossible. Drives >> with this functionality typically default to 7 seconds. ZFS does not >> currently adjust this setting on drives. However, it is advisable to >> write a script to set the error recovery time to a low value, such as >> 0.1 seconds until ZFS is modified to control it. This must be done on >> every boot. " >> >> They do not explicitly require enterprise drives, but they clearly >> expect SCT ERC enabled to some sane value. >> >> At least for Btrfs and ZFS, the mkfs is in a position to know all >>
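The "on every boot" script the ZFS wiki asks for can be as small as a loop (a sketch; the device list is hypothetical, and 1 = 0.1 s in SCT ERC's 100 ms units):
for dev in /dev/sd[a-f]; do
    smartctl -l scterc,1,1 "$dev" # 0.1 s recovery, per the recommendation quoted above
done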
Re: Adventures in btrfs raid5 disk recovery
On 2016-07-06 12:43, Chris Murphy wrote: On Wed, Jul 6, 2016 at 5:51 AM, Austin S. Hemmelgarn wrote: On 2016-07-05 19:05, Chris Murphy wrote: Related: http://www.spinics.net/lists/raid/msg52880.html Looks like there is some traction to figuring out what to do about this, whether it's a udev rule or something that happens in the kernel itself. Pretty much the only hardware setups unaffected by this are those with enterprise or NAS drives. Every configuration of a consumer drive, single, linear/concat, and all software (mdadm, lvm, Btrfs) RAID levels are adversely affected by this. The thing I don't get about this is that while the per-device settings on a given system are policy, the default value is not, and should be expected to work correctly (but not necessarily optimally) on as many systems as possible, so any claim that this should be fixed in udev is bogus by the regular kernel rules. Sure. But changing it in the kernel leads to what other consequences? It fixes the problem under discussion, but what problem will it introduce? I think it's valid to explore this, at the least so affected parties can be informed. Also, the problem isn't instigated by Linux, rather by drive manufacturers introducing a whole new kind of error recovery, with an order of magnitude longer recovery time. Now probably most hardware in the field are such drives. Even SSDs like my Samsung 840 EVO that support SCT ERC have it disabled; therefore the top-end recovery time is undiscoverable in the device itself. Maybe it's buried in a spec. So does it make sense to just set the default to 180? Or is there a smarter way to do this? I don't know. Just thinking about this: 1. People who are setting this somewhere will be functionally unaffected. 2. People using single disks which have lots of errors may or may not see an apparent degradation of performance, but will likely have the life expectancy of their device extended. 3. Individuals who are not setting this but should be will on average be no worse off than before other than seeing a bigger performance hit on a disk error. 4. People with single disks which are new will see no functional change until the disk has an error. In an ideal situation, what I'd want to see is: 1. If the device supports SCT ERC, set scsi_command_timer to a reasonable percentage over that (probably something like 25%, which would give roughly 10 seconds for the normal 7 second ERC timer). 2. If the device is actually a SCSI device, keep the 30 second timer (IIRC, this is reasonable for SCSI disks). 3. Otherwise, set the timer to 200 (we need a slight buffer over the expected disk timeout to account for things like latency outside of the disk). I suspect, but haven't tested, that ZFS On Linux would be equally affected, unless they're completely reimplementing their own block layer (?) So there are quite a few parties now negatively impacted by the current default behavior. OTOH, I would not be surprised if the stance there is 'you get no support if you're not using enterprise drives', not because of the project itself, but because it's ZFS. Part of their minimum recommended hardware requirements is ECC RAM, so it wouldn't surprise me if enterprise storage devices are there too. http://open-zfs.org/wiki/Hardware "Consistent performance requires hard drives that support error recovery control. " "Drives that lack such functionality can be expected to have arbitrarily high limits. Several minutes is not impossible. Drives with this functionality typically default to 7 seconds. 
ZFS does not currently adjust this setting on drives. However, it is advisable to write a script to set the error recovery time to a low value, such as 0.1 seconds until ZFS is modified to control it. This must be done on every boot. " They do not explicitly require enterprise drives, but they clearly expect SCT ERC enabled to some sane value. At least for Btrfs and ZFS, the mkfs is in a position to know all parameters for properly setting SCT ERC and the SCSI command timer for every device. Maybe it could create the udev rule? Single and raid0 profiles need to permit long recoveries, whereas raid1, 5, and 6 need to set things for very short recoveries. Possibly mdadm and lvm tools do the same thing. I'm pretty certain they don't create rules, or even try to check the drive for SCT ERC support. The problem with doing this is that you can't be certain that your underlying device is actually a physical storage device or not, and thus you have to check more than just the SCT ERC commands, and many people (myself included) don't like tools doing things that modify the persistent functioning of their system that the tool itself is not intended to do (and messing with block layer settings falls into that category for a mkfs tool).
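A mkfs-generated rule of the kind floated here might look roughly like this (a sketch only; the file name and serial-number matches are hypothetical, and choosing robust match keys is exactly the hard part):
# /etc/udev/rules.d/61-btrfs-raid-timeouts.rules (hypothetical)
# Member of a redundant profile: short command timer, drive set to fail fast.
ACTION=="add|change", SUBSYSTEM=="block", ENV{ID_SERIAL}=="WDC_WD20EFRX-EXAMPLE1", ATTR{device/timeout}="10"
# Single/raid0 profile on a drive without SCT ERC: long command timer.
ACTION=="add|change", SUBSYSTEM=="block", ENV{ID_SERIAL}=="ST2000DM001-EXAMPLE2", ATTR{device/timeout}="200"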
Re: Adventures in btrfs raid5 disk recovery
On Wed, Jul 6, 2016 at 5:51 AM, Austin S. Hemmelgarn wrote: > On 2016-07-05 19:05, Chris Murphy wrote: >> >> Related: >> http://www.spinics.net/lists/raid/msg52880.html >> >> Looks like there is some traction to figuring out what to do about >> this, whether it's a udev rule or something that happens in the kernel >> itself. Pretty much the only hardware setups unaffected by this are >> those with enterprise or NAS drives. Every configuration of a consumer >> drive, single, linear/concat, and all software (mdadm, lvm, Btrfs) >> RAID levels are adversely affected by this. > > The thing I don't get about this is that while the per-device settings on a > given system are policy, the default value is not, and should be expected to > work correctly (but not necessarily optimally) on as many systems as > possible, so any claim that this should be fixed in udev is bogus by the > regular kernel rules. Sure. But changing it in the kernel leads to what other consequences? It fixes the problem under discussion, but what problem will it introduce? I think it's valid to explore this, at the least so affected parties can be informed. Also, the problem isn't instigated by Linux, rather by drive manufacturers introducing a whole new kind of error recovery, with an order of magnitude longer recovery time. Now probably most hardware in the field are such drives. Even SSDs like my Samsung 840 EVO that support SCT ERC have it disabled; therefore the top-end recovery time is undiscoverable in the device itself. Maybe it's buried in a spec. So does it make sense to just set the default to 180? Or is there a smarter way to do this? I don't know. >> I suspect, but haven't tested, that ZFS On Linux would be equally >> affected, unless they're completely reimplementing their own block >> layer (?) So there are quite a few parties now negatively impacted by >> the current default behavior. > > OTOH, I would not be surprised if the stance there is 'you get no support if > you're not using enterprise drives', not because of the project itself, but > because it's ZFS. Part of their minimum recommended hardware requirements > is ECC RAM, so it wouldn't surprise me if enterprise storage devices are > there too. http://open-zfs.org/wiki/Hardware "Consistent performance requires hard drives that support error recovery control. " "Drives that lack such functionality can be expected to have arbitrarily high limits. Several minutes is not impossible. Drives with this functionality typically default to 7 seconds. ZFS does not currently adjust this setting on drives. However, it is advisable to write a script to set the error recovery time to a low value, such as 0.1 seconds until ZFS is modified to control it. This must be done on every boot. " They do not explicitly require enterprise drives, but they clearly expect SCT ERC enabled to some sane value. At least for Btrfs and ZFS, the mkfs is in a position to know all parameters for properly setting SCT ERC and the SCSI command timer for every device. Maybe it could create the udev rule? Single and raid0 profiles need to permit long recoveries, whereas raid1, 5, and 6 need to set things for very short recoveries. Possibly mdadm and lvm tools do the same thing. -- Chris Murphy
Re: Adventures in btrfs raid5 disk recovery
On 2016-07-05 19:05, Chris Murphy wrote: Related: http://www.spinics.net/lists/raid/msg52880.html Looks like there is some traction to figuring out what to do about this, whether it's a udev rule or something that happens in the kernel itself. Pretty much the only hardware setups unaffected by this are those with enterprise or NAS drives. Every configuration of a consumer drive, single, linear/concat, and all software (mdadm, lvm, Btrfs) RAID levels are adversely affected by this. The thing I don't get about this is that while the per-device settings on a given system are policy, the default value is not, and should be expected to work correctly (but not necessarily optimally) on as many systems as possible, so any claim that this should be fixed in udev is bogus by the regular kernel rules. I suspect, but haven't tested, that ZFS On Linux would be equally affected, unless they're completely reimplementing their own block layer (?) So there are quite a few parties now negatively impacted by the current default behavior. OTOH, I would not be surprised if the stance there is 'you get no support if you're not using enterprise drives', not because of the project itself, but because it's ZFS. Part of their minimum recommended hardware requirements is ECC RAM, so it wouldn't surprise me if enterprise storage devices are there too.
Re: Adventures in btrfs raid5 disk recovery
Related: http://www.spinics.net/lists/raid/msg52880.html Looks like there is some traction to figuring out what to do about this, whether it's a udev rule or something that happens in the kernel itself. Pretty much the only hardware setups unaffected by this are those with enterprise or NAS drives. Every configuration of a consumer drive, single, linear/concat, and all software (mdadm, lvm, Btrfs) RAID levels are adversely affected by this. I suspect, but haven't tested, that ZFS On Linux would be equally affected, unless they're completely reimplementing their own block layer (?) So there are quite a few parties now negatively impacted by the current default behavior. Chris Murphy
Re: Adventures in btrfs raid5 disk recovery
On 29/06/16 04:01, Chris Murphy wrote: > Just wiping the slate clean to summarize: > > > 1. We have a consistent ~1 in 3, maybe 1 in 2, reproducible corruption > of *data extent* parity during a scrub with raid5. Goffredo and I have > both reproduced it. It's a big bug. It might still be useful if > someone else can reproduce it too. > > Goffredo, can you file a bug at bugzilla.kernel.org and reference your > bug thread? I don't know if the key developers know about this, it > might be worth pinging them on IRC once the bug is filed. > > Unknown if it affects balance, or raid 6. And if it affects raid 6, is > p or q corrupted, or both? Unknown how this manifests on the metadata > raid5 profile (only data raid5 was tested). Presumably if there is > metadata corruption that's fixed during a scrub, and its parity is > overwritten with corrupt parity, the next time there's a degraded > state, the file system would face plant somehow. And we've seen quite > a few degraded raid5's (and even 6's) face plant in inexplicable ways > and we just kinda go, shit. Which is what the fs is doing when it > encounters a pile of csum errors. It treats the csum errors as a > signal to disregard the fs rather than maybe only being suspicious of > the fs. Could it turn out that these file systems were recoverable, > just that Btrfs wasn't tolerating any csum error and wouldn't proceed > further? I believe this is the same case for RAID6 based on my experiences. I actually wondered if the system halts were the result of a TON of csum errors being reported - not the actual errors themselves. Just about every system hang where CPU usage went to 100% on all cores and the system just stopped came after a flood of csum errors. If it was only one or two (or I copied data off via a network connection where the read rate was slower), I found I had a MUCH lower chance of the system locking up. In fact, now that I think about it, when I was copying data to an external USB drive (maxed out at ~30MB/sec), I still got csum errors - but the system never hung. Every crash ended with the last line along the lines of "Stopped recurring error. Your system needs rebooting". I wonder whether, if this error reporting were altered, the system wouldn't go down. Of course I have no way of testing this. -- Steven Haigh Email: net...@crc.id.au Web: https://www.crc.id.au Phone: (03) 9001 6090 - 0412 935 897
Re: Adventures in btrfs raid5 disk recovery
Just wiping the slate clean to summarize: 1. We have a consistent ~1 in 3, maybe 1 in 2, reproducible corruption of *data extent* parity during a scrub with raid5. Goffredo and I have both reproduced it. It's a big bug. It might still be useful if someone else can reproduce it too. Goffredo, can you file a bug at bugzilla.kernel.org and reference your bug thread? I don't know if the key developers know about this; it might be worth pinging them on IRC once the bug is filed. Unknown if it affects balance, or raid 6. And if it affects raid 6, is p or q corrupted, or both? Unknown how this manifests on the metadata raid5 profile (only data raid5 was tested). Presumably if there is metadata corruption that's fixed during a scrub, and its parity is overwritten with corrupt parity, the next time there's a degraded state, the file system would face plant somehow. And we've seen quite a few degraded raid5's (and even 6's) face plant in inexplicable ways and we just kinda go, shit. Which is what the fs is doing when it encounters a pile of csum errors. It treats the csum errors as a signal to disregard the fs rather than maybe only being suspicious of the fs. Could it turn out that these file systems were recoverable, just that Btrfs wasn't tolerating any csum error and wouldn't proceed further? 2. The existing scrub code computes parity on-the-fly, compares it with what's on-disk, and overwrites if there's a mismatch. If there's a mismatch, there's no message anywhere. It's a feature request to get a message on parity mismatches. An additional feature request would be to get a parity_error counter along the lines of the other error counters we have for scrub stats and dev stats. 3. I think it's a more significant change to get parity checksums stored somewhere. Right now the csum tree holds item type EXTENT_CSUM, but parity is not an extent; it's also not data, it's a variant of data. So it seems to me we'd need a new item type PARITY_CSUM to get it into the existing csum tree. And I'm not sure what incompatibility that brings; presumably older kernels could mount such a volume ro safely, but shouldn't write to it, and btrfs check --repair should probably fail. Chris Murphy
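For context on the proposed parity_error counter, these are the existing per-device counters it would sit alongside (a sketch; exact output varies by btrfs-progs version):
sudo btrfs scrub start -Bd /media/raid # -B: run in the foreground, -d: per-device stats
sudo btrfs device stats /media/raid
# [/dev/bcache0].write_io_errs 0
# [/dev/bcache0].read_io_errs 0
# [/dev/bcache0].flush_io_errs 0
# [/dev/bcache0].corruption_errs 0
# [/dev/bcache0].generation_errs 0
A parity mismatch that scrub silently rewrites currently shows up in none of these.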
Re: Adventures in btrfs raid5 disk recovery
On 28/06/16 22:25, Austin S. Hemmelgarn wrote: > On 2016-06-28 08:14, Steven Haigh wrote: >> On 28/06/16 22:05, Austin S. Hemmelgarn wrote: >>> On 2016-06-27 17:57, Zygo Blaxell wrote: On Mon, Jun 27, 2016 at 10:17:04AM -0600, Chris Murphy wrote: > On Mon, Jun 27, 2016 at 5:21 AM, Austin S. Hemmelgarn wrote: >> On 2016-06-25 12:44, Chris Murphy wrote: >>> On Fri, Jun 24, 2016 at 12:19 PM, Austin S. Hemmelgarn >>> wrote: >>> >>> OK but hold on. During scrub, it should read data, compute checksums >>> *and* parity, and compare those to what's on-disk - EXTENT_CSUM in >>> the checksum tree, and the parity strip in the chunk tree. And if >>> parity is wrong, then it should be replaced. >> >> Except that's horribly inefficient. With limited exceptions >> involving >> highly situational co-processors, computing a checksum of a parity >> block is >> always going to be faster than computing parity for the stripe. By >> using >> that to check parity, we can safely speed up the common case of near >> zero >> errors during a scrub by a pretty significant factor. > > OK I'm in favor of that. Although somehow md gets away with this by > computing and checking parity for its scrubs, and still manages to > keep drives saturated in the process - at least HDDs, I'm not sure how > it fares on SSDs. A modest desktop CPU can compute raid6 parity at 6GB/sec, a less-modest one at more than 10GB/sec. Maybe a bottleneck is within reach of an array of SSDs vs. a slow CPU. >>> OK, great for people who are using modern desktop or server CPU's. Not >>> everyone has that luxury, and even on many such CPU's, it's _still_ >>> faster to compute CRC32c checksums. On top of that, we don't appear to >>> be using the in-kernel parity-raid libraries (or if we are, I haven't >>> been able to find where we are calling the functions for it), so we >>> don't necessarily get assembly optimized or co-processor accelerated >>> computation of the parity itself. The other thing that I didn't mention >>> above though, is that computing parity checksums will always take less >>> time than computing parity, because you have to process significantly >>> less data. On a 4 disk RAID5 array, you're processing roughly 2/3 as >>> much data to do the parity checksums instead of parity itself, which >>> means that the parity computation would need to be 200% faster than the >>> CRC32c computation to break even, and this margin gets bigger and bigger >>> as you add more disks. >>> >>> On small arrays, this obviously won't have much impact. Once you start >>> to scale past a few TB though, even a few hundred MB/s faster processing >>> means a significant decrease in processing time. Say you have a CPU >>> which gets about 12.0GB/s for RAID5 parity, and about 12.25GB/s for >>> CRC32c (~2% is a conservative ratio assuming you use the CRC32c >>> instruction and assembly optimized RAID5 parity computations on a modern >>> x86_64 processor (the ratio on both the mobile Core i5 in my laptop and >>> the Xeon E3 in my home server is closer to 5%)). 
Assuming those >>> numbers, and that we're already checking checksums on non-parity blocks, >>> processing 120TB of data in a 4 disk array (which gives 40TB of parity >>> data, so 160TB total) gives: >>> For computing the parity to scrub: >>> 120TB / 12.25GB = 9795.9 seconds for processing CRC32c csums of all the >>> regular data >>> 120TB / 12GB = 10000 seconds for processing parity of all stripes >>> = 19795.9 seconds total >>> ~ 5.5 hours total >>> >>> For computing csums of the parity: >>> 120TB / 12.25GB = 9795.9 seconds for processing CRC32c csums of all the >>> regular data >>> 40TB / 12.25GB = 3265.3 seconds for processing CRC32c csums of all the >>> parity data >>> = 13061.2 seconds total >>> ~ 3.6 hours total >>> >>> The checksum-based computation is approximately 34% faster than the >>> parity computation. Much of this of course is that you have to process >>> the regular data twice for the parity computation method (once for >>> csums, once for parity). You could probably do one pass computing both >>> values, but that would need to be done carefully; and, without >>> significant optimization, would likely not get you much benefit other >>> than cutting the number of loads in half. >> >> And it all means jack shit because you don't get the data to disk that >> quick. Who cares if it's 500% faster - if it still saturates the >> throughput of the actual drives, what difference does it make? > It has less impact on everything else running on the system at the time > because it uses less CPU time and potentially less memory. This is the > exact same reason that you want your RAID parity computation performance > as good as possible, the less time the CPU spends
Re: Adventures in btrfs raid5 disk recovery
On 2016-06-28 08:14, Steven Haigh wrote: On 28/06/16 22:05, Austin S. Hemmelgarn wrote: On 2016-06-27 17:57, Zygo Blaxell wrote: On Mon, Jun 27, 2016 at 10:17:04AM -0600, Chris Murphy wrote: On Mon, Jun 27, 2016 at 5:21 AM, Austin S. Hemmelgarn wrote: On 2016-06-25 12:44, Chris Murphy wrote: On Fri, Jun 24, 2016 at 12:19 PM, Austin S. Hemmelgarn wrote: OK but hold on. During scrub, it should read data, compute checksums *and* parity, and compare those to what's on-disk - EXTENT_CSUM in the checksum tree, and the parity strip in the chunk tree. And if parity is wrong, then it should be replaced. Except that's horribly inefficient. With limited exceptions involving highly situational co-processors, computing a checksum of a parity block is always going to be faster than computing parity for the stripe. By using that to check parity, we can safely speed up the common case of near zero errors during a scrub by a pretty significant factor. OK I'm in favor of that. Although somehow md gets away with this by computing and checking parity for its scrubs, and still manages to keep drives saturated in the process - at least HDDs, I'm not sure how it fares on SSDs. A modest desktop CPU can compute raid6 parity at 6GB/sec, a less-modest one at more than 10GB/sec. Maybe a bottleneck is within reach of an array of SSDs vs. a slow CPU. OK, great for people who are using modern desktop or server CPU's. Not everyone has that luxury, and even on many such CPU's, it's _still_ faster to compute CRC32c checksums. On top of that, we don't appear to be using the in-kernel parity-raid libraries (or if we are, I haven't been able to find where we are calling the functions for it), so we don't necessarily get assembly optimized or co-processor accelerated computation of the parity itself. The other thing that I didn't mention above though, is that computing parity checksums will always take less time than computing parity, because you have to process significantly less data. On a 4 disk RAID5 array, you're processing roughly 2/3 as much data to do the parity checksums instead of parity itself, which means that the parity computation would need to be 200% faster than the CRC32c computation to break even, and this margin gets bigger and bigger as you add more disks. On small arrays, this obviously won't have much impact. Once you start to scale past a few TB though, even a few hundred MB/s faster processing means a significant decrease in processing time. Say you have a CPU which gets about 12.0GB/s for RAID5 parity, and about 12.25GB/s for CRC32c (~2% is a conservative ratio assuming you use the CRC32c instruction and assembly optimized RAID5 parity computations on a modern x86_64 processor (the ratio on both the mobile Core i5 in my laptop and the Xeon E3 in my home server is closer to 5%)). 
Assuming those numbers, and that we're already checking checksums on non-parity blocks, processing 120TB of data in a 4 disk array (which gives 40TB of parity data, so 160TB total) gives: For computing the parity to scrub: 120TB / 12.25GB = 9795.9 seconds for processing CRC32c csums of all the regular data 120TB / 12GB = 10000 seconds for processing parity of all stripes = 19795.9 seconds total ~ 5.5 hours total For computing csums of the parity: 120TB / 12.25GB = 9795.9 seconds for processing CRC32c csums of all the regular data 40TB / 12.25GB = 3265.3 seconds for processing CRC32c csums of all the parity data = 13061.2 seconds total ~ 3.6 hours total The checksum-based computation is approximately 34% faster than the parity computation. Much of this of course is that you have to process the regular data twice for the parity computation method (once for csums, once for parity). You could probably do one pass computing both values, but that would need to be done carefully; and, without significant optimization, would likely not get you much benefit other than cutting the number of loads in half. And it all means jack shit because you don't get the data to disk that quick. Who cares if it's 500% faster - if it still saturates the throughput of the actual drives, what difference does it make? It has less impact on everything else running on the system at the time because it uses less CPU time and potentially less memory. This is the exact same reason that you want your RAID parity computation performance as good as possible; the less time the CPU spends on that, the more it can spend on other things. On top of that, there are high-end systems that do have SSD's that can get multiple GB/s of data transfer per second, and NVDIMM's are starting to become popular in the server market, and those give you data transfer speeds equivalent to regular memory bandwidth (which can be well over 20GB/s on decent hardware (I've got a relatively inexpensive system using DDR3-1866 RAM that has
Re: Adventures in btrfs raid5 disk recovery
On 28/06/16 22:05, Austin S. Hemmelgarn wrote: > On 2016-06-27 17:57, Zygo Blaxell wrote: >> On Mon, Jun 27, 2016 at 10:17:04AM -0600, Chris Murphy wrote: >>> On Mon, Jun 27, 2016 at 5:21 AM, Austin S. Hemmelgarn wrote: On 2016-06-25 12:44, Chris Murphy wrote: > On Fri, Jun 24, 2016 at 12:19 PM, Austin S. Hemmelgarn > wrote: > > OK but hold on. During scrub, it should read data, compute checksums > *and* parity, and compare those to what's on-disk - EXTENT_CSUM in > the checksum tree, and the parity strip in the chunk tree. And if > parity is wrong, then it should be replaced. Except that's horribly inefficient. With limited exceptions involving highly situational co-processors, computing a checksum of a parity block is always going to be faster than computing parity for the stripe. By using that to check parity, we can safely speed up the common case of near zero errors during a scrub by a pretty significant factor. >>> >>> OK I'm in favor of that. Although somehow md gets away with this by >>> computing and checking parity for its scrubs, and still manages to >>> keep drives saturated in the process - at least HDDs, I'm not sure how >>> it fares on SSDs. >> >> A modest desktop CPU can compute raid6 parity at 6GB/sec, a less-modest >> one at more than 10GB/sec. Maybe a bottleneck is within reach of an >> array of SSDs vs. a slow CPU. > OK, great for people who are using modern desktop or server CPU's. Not > everyone has that luxury, and even on many such CPU's, it's _still_ > faster to compute CRC32c checksums. On top of that, we don't appear to > be using the in-kernel parity-raid libraries (or if we are, I haven't > been able to find where we are calling the functions for it), so we > don't necessarily get assembly optimized or co-processor accelerated > computation of the parity itself. The other thing that I didn't mention > above though, is that computing parity checksums will always take less > time than computing parity, because you have to process significantly > less data. On a 4 disk RAID5 array, you're processing roughly 2/3 as > much data to do the parity checksums instead of parity itself, which > means that the parity computation would need to be 200% faster than the > CRC32c computation to break even, and this margin gets bigger and bigger > as you add more disks. > > On small arrays, this obviously won't have much impact. Once you start > to scale past a few TB though, even a few hundred MB/s faster processing > means a significant decrease in processing time. Say you have a CPU > which gets about 12.0GB/s for RAID5 parity, and about 12.25GB/s for > CRC32c (~2% is a conservative ratio assuming you use the CRC32c > instruction and assembly optimized RAID5 parity computations on a modern > x86_64 processor (the ratio on both the mobile Core i5 in my laptop and > the Xeon E3 in my home server is closer to 5%)). 
Assuming those > numbers, and that we're already checking checksums on non-parity blocks, > processing 120TB of data in a 4 disk array (which gives 40TB of parity > data, so 160TB total) gives: > For computing the parity to scrub: > 120TB / 12.25GB = 9795.9 seconds for processing CRC32c csums of all the > regular data > 120TB / 12GB = 10000 seconds for processing parity of all stripes > = 19795.9 seconds total > ~ 5.5 hours total > > For computing csums of the parity: > 120TB / 12.25GB = 9795.9 seconds for processing CRC32c csums of all the > regular data > 40TB / 12.25GB = 3265.3 seconds for processing CRC32c csums of all the > parity data > = 13061.2 seconds total > ~ 3.6 hours total > > The checksum-based computation is approximately 34% faster than the > parity computation. Much of this of course is that you have to process > the regular data twice for the parity computation method (once for > csums, once for parity). You could probably do one pass computing both > values, but that would need to be done carefully; and, without > significant optimization, would likely not get you much benefit other > than cutting the number of loads in half. And it all means jack shit because you don't get the data to disk that quick. Who cares if it's 500% faster - if it still saturates the throughput of the actual drives, what difference does it make? I'm all for actual solutions, but the nirvana fallacy seems to apply here... -- Steven Haigh Email: net...@crc.id.au Web: https://www.crc.id.au Phone: (03) 9001 6090 - 0412 935 897
Re: Adventures in btrfs raid5 disk recovery
On 2016-06-27 17:57, Zygo Blaxell wrote: On Mon, Jun 27, 2016 at 10:17:04AM -0600, Chris Murphy wrote: On Mon, Jun 27, 2016 at 5:21 AM, Austin S. Hemmelgarn wrote: On 2016-06-25 12:44, Chris Murphy wrote: On Fri, Jun 24, 2016 at 12:19 PM, Austin S. Hemmelgarn wrote: OK but hold on. During scrub, it should read data, compute checksums *and* parity, and compare those to what's on-disk - EXTENT_CSUM in the checksum tree, and the parity strip in the chunk tree. And if parity is wrong, then it should be replaced. Except that's horribly inefficient. With limited exceptions involving highly situational co-processors, computing a checksum of a parity block is always going to be faster than computing parity for the stripe. By using that to check parity, we can safely speed up the common case of near zero errors during a scrub by a pretty significant factor. OK I'm in favor of that. Although somehow md gets away with this by computing and checking parity for its scrubs, and still manages to keep drives saturated in the process - at least HDDs, I'm not sure how it fares on SSDs. A modest desktop CPU can compute raid6 parity at 6GB/sec, a less-modest one at more than 10GB/sec. Maybe a bottleneck is within reach of an array of SSDs vs. a slow CPU. OK, great for people who are using modern desktop or server CPU's. Not everyone has that luxury, and even on many such CPU's, it's _still_ faster to compute CRC32c checksums. On top of that, we don't appear to be using the in-kernel parity-raid libraries (or if we are, I haven't been able to find where we are calling the functions for it), so we don't necessarily get assembly optimized or co-processor accelerated computation of the parity itself. The other thing that I didn't mention above though, is that computing parity checksums will always take less time than computing parity, because you have to process significantly less data. On a 4 disk RAID5 array, you're processing roughly 2/3 as much data to do the parity checksums instead of parity itself, which means that the parity computation would need to be 200% faster than the CRC32c computation to break even, and this margin gets bigger and bigger as you add more disks. On small arrays, this obviously won't have much impact. Once you start to scale past a few TB though, even a few hundred MB/s faster processing means a significant decrease in processing time. Say you have a CPU which gets about 12.0GB/s for RAID5 parity, and about 12.25GB/s for CRC32c (~2% is a conservative ratio assuming you use the CRC32c instruction and assembly optimized RAID5 parity computations on a modern x86_64 processor (the ratio on both the mobile Core i5 in my laptop and the Xeon E3 in my home server is closer to 5%)). Assuming those numbers, and that we're already checking checksums on non-parity blocks, processing 120TB of data in a 4 disk array (which gives 40TB of parity data, so 160TB total) gives: For computing the parity to scrub: 120TB / 12.25GB = 9795.9 seconds for processing CRC32c csums of all the regular data 120TB / 12GB = 10000 seconds for processing parity of all stripes = 19795.9 seconds total ~ 5.5 hours total For computing csums of the parity: 120TB / 12.25GB = 9795.9 seconds for processing CRC32c csums of all the regular data 40TB / 12.25GB = 3265.3 seconds for processing CRC32c csums of all the parity data = 13061.2 seconds total ~ 3.6 hours total The checksum-based computation is approximately 34% faster than the parity computation. 
Much of this of course is that you have to process the regular data twice for the parity computation method (once for csums, once for parity). You could probably do one pass computing both values, but that would need to be done carefully; and, without significant optimization, would likely not get you much benefit other than cutting the number of loads in half.
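Restating the arithmetic above compactly (decimal units, matching the figures in the message):
awk 'BEGIN {
  data = 120000; parity = 40000   # GB of data and parity in the example
  crc = 12.25; xor = 12.0         # GB/s for CRC32c and RAID5 parity
  t1 = data/crc + data/xor        # scrub by recomputing parity
  t2 = data/crc + parity/crc      # scrub by checksumming the parity
  printf "recompute parity: %.1f s (%.1f h)\n", t1, t1/3600
  printf "checksum parity:  %.1f s (%.1f h)\n", t2, t2/3600
}'
This prints 19795.9 s (5.5 h) versus 13061.2 s (3.6 h), the totals used above.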
Re: Adventures in btrfs raid5 disk recovery
On 2016-06-27 23:17, Zygo Blaxell wrote: On Mon, Jun 27, 2016 at 08:39:21PM -0600, Chris Murphy wrote: On Mon, Jun 27, 2016 at 7:52 PM, Zygo Blaxell wrote: On Mon, Jun 27, 2016 at 04:30:23PM -0600, Chris Murphy wrote: Btrfs does have something of a workaround for when things get slow, and that's balance, read and rewrite everything. The write forces sector remapping by the drive firmware for bad sectors. It's a crude form of "resilvering" as ZFS calls it. In what manner is it crude? Balance relocates extents, looks up backrefs, and rewrites metadata, all of which are extra work above what is required by resilvering (and extra work that is proportional to the number of backrefs and the (currently extremely poor) performance of the backref walking code, so snapshots and large files multiply the workload). Resilvering should just read data, reconstruct it from a mirror if necessary, and write it back to the original location (or read one mirror and rewrite the other). That's more like what scrub does, except scrub rewrites only the blocks it couldn't read (or that failed csum). It's worth pointing out that balance was not designed for resilvering; it was designed for reshaping arrays, converting replication profiles, and compaction at the chunk level. Balance is not a resilvering tool; that just happens to be a useful side effect of running a balance (actually, so is the chunk level compaction).
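In command terms the distinction is (a sketch, reusing the /media/raid mount point from earlier in the thread):
sudo btrfs scrub start /media/raid # verify csums; rewrite only blocks that fail
sudo btrfs balance start /media/raid # relocate every chunk, rewriting all data and metadata
sudo btrfs balance start -dconvert=raid5 /media/raid # the intended use: reshaping/converting profiles (illustrative)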
Re: Adventures in btrfs raid5 disk recovery
On Mon, Jun 27, 2016 at 08:39:21PM -0600, Chris Murphy wrote: > On Mon, Jun 27, 2016 at 7:52 PM, Zygo Blaxell wrote: > > On Mon, Jun 27, 2016 at 04:30:23PM -0600, Chris Murphy wrote: > >> Btrfs does have something of a workaround for when things get slow, > >> and that's balance, read and rewrite everything. The write forces > >> sector remapping by the drive firmware for bad sectors. > > > > It's a crude form of "resilvering" as ZFS calls it. > > In what manner is it crude? Balance relocates extents, looks up backrefs, and rewrites metadata, all of which are extra work above what is required by resilvering (and extra work that is proportional to the number of backrefs and the (currently extremely poor) performance of the backref walking code, so snapshots and large files multiply the workload). Resilvering should just read data, reconstruct it from a mirror if necessary, and write it back to the original location (or read one mirror and rewrite the other). That's more like what scrub does, except scrub rewrites only the blocks it couldn't read (or that failed csum). > > Last time I checked all the RAID implementations on Linux (ok, so that's > > pretty much just md-raid) had some sort of repair capability. > > You can read man 4 md, and you can also look on linux-raid@, it's very > clearly necessary for the drive to report a read or write error > explicitly with LBA for md to do repairs. If there are link resets, > bad sectors accumulate and the obvious inevitably happens. I am looking at the md code. It looks at ->bi_error, and nothing else as far as I can tell. It doesn't even care if the error is EIO--any non-zero return value from the lower bio layer seems to trigger automatic recovery.
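md's repair capability referenced here is driven through sysfs (a sketch, assuming an array at /dev/md0):
echo check > /sys/block/md0/md/sync_action # read everything and count inconsistencies
cat /sys/block/md0/md/mismatch_cnt # sectors that did not agree across the array
echo repair > /sys/block/md0/md/sync_action # rewrite parity/mirrors to match the data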
Re: Adventures in btrfs raid5 disk recovery
On Mon, Jun 27, 2016 at 7:52 PM, Zygo Blaxell <ce3g8...@umail.furryterror.org> wrote: > On Mon, Jun 27, 2016 at 04:30:23PM -0600, Chris Murphy wrote: >> On Mon, Jun 27, 2016 at 3:57 PM, Zygo Blaxell >> <ce3g8...@umail.furryterror.org> wrote: >> > On Mon, Jun 27, 2016 at 10:17:04AM -0600, Chris Murphy wrote: >> > If anything, I want the timeout to be shorter so that upper layers with >> > redundancy can get an EIO and initiate repair promptly, and admins can >> > get notified to evict chronic offenders from their drive slots, without >> > having to pay extra for hard disk firmware with that feature. >> >> The drive totally thwarts this. It doesn't report back to the kernel >> what command is hung, as far as I'm aware. It just hangs and goes into >> a so-called "deep recovery"; there is no way to know what sector is >> causing the problem > > I'm proposing just treat the link reset _as_ an EIO, unless transparent > link resets are required for link speed negotiation or something. That's not one EIO, that's possibly 31 items in the command queue that get knocked over when the link is reset. I don't have the expertise to know whether it's sane to interpret many EIO all at once as an implicit indication of bad sectors. Offhand I think that's probably specious. > The drive wouldn't be thwarting anything, the host would just ignore it > (unless the drive doesn't respond to a link reset until after its internal > timeout, in which case nothing is saved by shortening the timeout). > >> until the drive reports a read error, which will >> include the affected sector LBA. > > It doesn't matter which sector. Chances are good that it was more than > one of the outstanding requested sectors anyway. Rewrite them all. *shrug* Even if valid, it only helps the raid 1+ cases. It does nothing to help raid0, linear/concat, or single device deployments. Those users also deserve to have access to their data, if the drive can recover it by giving it enough time to do so. > We know which sectors they are because somebody has an IO operation > waiting for a status on each of them (unless they're using AIO or some > other API where a request can be fired at a hard drive and the reply > discarded). Notify all of them that their IO failed and move on. Dunno, maybe. > >> Btrfs does have something of a workaround for when things get slow, >> and that's balance, read and rewrite everything. The write forces >> sector remapping by the drive firmware for bad sectors. > > It's a crude form of "resilvering" as ZFS calls it. In what manner is it crude? > If btrfs sees EIO from a lower block layer it will try to reconstruct the > missing data (but not repair it). If that happens during a scrub, > it will also attempt to rewrite the missing data over the original > offending sectors. This happens every few months in my server pool, > and seems to be working even on btrfs raid5. > > Last time I checked all the RAID implementations on Linux (ok, so that's > pretty much just md-raid) had some sort of repair capability. You can read man 4 md, and you can also look on linux-raid@; it's very clearly necessary for the drive to report a read or write error explicitly with LBA for md to do repairs. If there are link resets, bad sectors accumulate and the obvious inevitably happens. > >> For single drives and RAID 0, the only possible solution is to not do >> link resets for up to 3 minutes and hope the drive returns the single >> copy of data. > > So perhaps the timeout should be influenced by higher layers, e.g. 
> if a disk becomes part of a raid1, its timeout should be shortened by default, > while a timeout for a disk that is not used in a redundant layer should be longer. And there are a pile of reasons why link resets are necessary that have nothing to do with bad sectors. So if you end up with a drive or controller misbehaving, and the new behavior is to force a bunch of new (corrective) writes to the drive right after a reset, it could actually make its problems worse for all we know. I think it's highly speculative to assume hung block devices mean bad sectors and should be treated as bad sectors, and that doing so will cause no other side effects. That's a question for block device/SCSI experts to opine on, whether this is at all sane to do. I'm sure they're reasonably aware of this problem; if it were that simple they'd have done that already, but conversely 5 years of telling users to change the command timer or stop using the wrong kind of drives for RAID really isn't sufficiently good advice either. The reality is that manufacturers of drives have handed us drives that far and wide don't support SCT ERC or have it disabled by default, so yeah maybe
Re: Adventures in btrfs raid5 disk recovery
On Mon, Jun 27, 2016 at 04:30:23PM -0600, Chris Murphy wrote: > On Mon, Jun 27, 2016 at 3:57 PM, Zygo Blaxell > <ce3g8...@umail.furryterror.org> wrote: > > On Mon, Jun 27, 2016 at 10:17:04AM -0600, Chris Murphy wrote: > > If anything, I want the timeout to be shorter so that upper layers with > > redundancy can get an EIO and initiate repair promptly, and admins can > > get notified to evict chronic offenders from their drive slots, without > > having to pay extra for hard disk firmware with that feature. > > The drive totally thwarts this. It doesn't report back to the kernel > what command is hung, as far as I'm aware. It just hangs and goes into > a so-called "deep recovery"; there is no way to know what sector is > causing the problem I'm proposing just treat the link reset _as_ an EIO, unless transparent link resets are required for link speed negotiation or something. The drive wouldn't be thwarting anything, the host would just ignore it (unless the drive doesn't respond to a link reset until after its internal timeout, in which case nothing is saved by shortening the timeout). > until the drive reports a read error, which will > include the affected sector LBA. It doesn't matter which sector. Chances are good that it was more than one of the outstanding requested sectors anyway. Rewrite them all. We know which sectors they are because somebody has an IO operation waiting for a status on each of them (unless they're using AIO or some other API where a request can be fired at a hard drive and the reply discarded). Notify all of them that their IO failed and move on. > Btrfs does have something of a workaround for when things get slow, > and that's balance, read and rewrite everything. The write forces > sector remapping by the drive firmware for bad sectors. It's a crude form of "resilvering" as ZFS calls it. > The upper layers could time the IOs, and make their own decisions based > on the timing (e.g. btrfs or mdadm could proactively repair anything that > took more than 10 seconds to read). That might be a better approach, > since shortening the time to an EIO is only useful when you have a > redundancy layer in place to do something about them. > For RAID with redundancy, that's doable, although I have no idea what > work is needed, or even if it's possible, to track commands in this > manner, and fall back to some kind of repair mode as if it were a read > error. If btrfs sees EIO from a lower block layer it will try to reconstruct the missing data (but not repair it). If that happens during a scrub, it will also attempt to rewrite the missing data over the original offending sectors. This happens every few months in my server pool, and seems to be working even on btrfs raid5. Last time I checked all the RAID implementations on Linux (ok, so that's pretty much just md-raid) had some sort of repair capability. lvm uses (or can use) the md-raid implementation. ext4 and xfs on naked disk partitions will have problems, but that's because they were designed in the 1990s when we were young and naive and still believed hard disks would one day become reliable devices without buggy firmware. > For single drives and RAID 0, the only possible solution is to not do > link resets for up to 3 minutes and hope the drive returns the single > copy of data. So perhaps the timeout should be influenced by higher layers, e.g. if a disk becomes part of a raid1, its timeout should be shortened by default, while a timeout for a disk that is not used in a redundant layer should be longer. 
> Even in the case of Btrfs DUP, it's thwarted without a read error reported from the drive (or it returning bad data).

That case gets messy -- different timeouts for different parts of the disk. Probably not practical.
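As a rough sketch of how that per-role policy can be approximated from userspace today (device names are hypothetical and the values are illustrative, not recommendations):

# Shorten the command timer on members of a redundant btrfs/md array so
# the redundancy layer sees EIO quickly (hypothetical device names):
for dev in sda sdb sdc; do
    echo 7 > /sys/block/$dev/device/timeout
done

# Lengthen it on a non-redundant drive so deep recovery can finish and
# return the only copy of the data:
echo 180 > /sys/block/sdd/device/timeout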
Re: Adventures in btrfs raid5 disk recovery
On Mon, Jun 27, 2016 at 3:57 PM, Zygo Blaxell wrote:
> On Mon, Jun 27, 2016 at 10:17:04AM -0600, Chris Murphy wrote:
>
>> It just came up again in a thread over the weekend on linux-raid@. I'm going to ask while people are paying attention if a patch to change the 30 second timeout to something a lot higher has ever been floated, what the negatives might be, and where to get this fixed if it wouldn't be accepted in the kernel code directly.
>
> Defaults are defaults, they're not for everyone. 30 seconds is about two minutes too short for an SMR drive's worst-case write latency, or 28 seconds too long for an OLTP system, or just right for an end-user's personal machine with a low-energy desktop drive and a long spin-up time.

The question is where the correct place is to change the default so it broadly captures most use cases, because it's definitely incompatible with consumer SATA drives, whether in an enclosure or not. Maybe it's with the kernel teams at each distribution? Or maybe an upstream udev rule? In any case something needs to give here, because it's been years of bugging users about this misconfiguration and people constantly run into it, which means user education is not working.

> Once a drive starts taking 30+ seconds to do I/O, I consider the drive failed in the sense that it's too slow to meet latency requirements.

Well, that is then a mismatch between the use case and the drive purchasing decision. Consumer drives do this. It's how they're designed to work.

> When the problem is that it's already taking too long, the solution is not waiting even longer. To put things in perspective, consider that server hardware watchdog timeouts are typically 60 seconds by default (if not maximum).

If you want the data retrieved from that particular device, the only solution is waiting longer. The alternative is what you get: an I/O error (well, actually you get a link reset, which also means the entire command queue is purged on SATA drives).

> If anything, I want the timeout to be shorter so that upper layers with redundancy can get an EIO and initiate repair promptly, and admins can get notified to evict chronic offenders from their drive slots, without having to pay extra for hard disk firmware with that feature.

The drive totally thwarts this. It doesn't report back to the kernel which command is hung, as far as I'm aware. It just hangs and goes into a so-called "deep recovery"; there is no way to know what sector is causing the problem until the drive reports a read error, which will include the affected sector LBA.

Btrfs does have something of a workaround for when things get slow, and that's balance: read and rewrite everything. The write forces sector remapping by the drive firmware for bad sectors.

>> *Ideally* I think we'd want two timeouts. I'd like to see commands have a timer that results in merely a warning that could be used by e.g. btrfs scrub to know "hey, this sector range is 'slow', I'm going to write over those sectors". That's how bad sectors start out: they read slower, and eventually go beyond 30 seconds and now it's all link resets. If the problem could be fixed before then... that's the best scenario.
>
> What's the downside of a link reset? Can the driver not just return EIO for all the outstanding IOs in progress at reset, and let the upper layers deal with it? Or is the problem that the upper layers are all horribly broken by EIOs, or drive firmware horribly broken by link resets?
A link reset clears the entire command queue on SATA drives, and it wipes away any possibility of finding out which LBA, or even which range of LBAs, is the source of the stall. So it pretty much gets you nothing.

> The upper layers could time the IOs, and make their own decisions based on the timing (e.g. btrfs or mdadm could proactively repair anything that took more than 10 seconds to read). That might be a better approach, since shortening the time to an EIO is only useful when you have a redundancy layer in place to do something about them.

For RAID with redundancy, that's doable, although I have no idea what work is needed, or even if it's possible, to track commands in this manner and fall back to some kind of repair mode as if it were a read error.

For single drives and RAID 0, the only possible solution is to not do link resets for up to 3 minutes and hope the drive returns the single copy of data. Even in the case of Btrfs DUP, it's thwarted without a read error reported from the drive (or it returning bad data).

--
Chris Murphy
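Since an upstream udev rule is floated above as one delivery mechanism, here is a hedged sketch of what such a rule might look like (the file path and the 180-second value are assumptions, not an existing upstream rule):

# /etc/udev/rules.d/60-sata-timeout.rules  (hypothetical file)
# Raise the SCSI command timer on whole SATA disks so a consumer drive's
# multi-minute deep recovery completes before the kernel resets the link.
ACTION=="add", SUBSYSTEM=="block", KERNEL=="sd[a-z]", ATTR{device/timeout}="180"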
Re: Adventures in btrfs raid5 disk recovery
On Mon, Jun 27, 2016 at 10:17:04AM -0600, Chris Murphy wrote:
> On Mon, Jun 27, 2016 at 5:21 AM, Austin S. Hemmelgarn wrote:
>> On 2016-06-25 12:44, Chris Murphy wrote:
>>> On Fri, Jun 24, 2016 at 12:19 PM, Austin S. Hemmelgarn wrote:
>>>
>>> OK, but hold on. During scrub, it should read data, compute checksums *and* parity, and compare those to what's on-disk - EXTENT_CSUM in the checksum tree, and the parity strip in the chunk tree. And if parity is wrong, then it should be replaced.
>>
>> Except that's horribly inefficient. With limited exceptions involving highly situational co-processors, computing a checksum of a parity block is always going to be faster than computing parity for the stripe. By using that to check parity, we can safely speed up the common case of near-zero errors during a scrub by a pretty significant factor.
>
> OK, I'm in favor of that. Although somehow md gets away with this by computing and checking parity for its scrubs, and still manages to keep drives saturated in the process - at least HDDs; I'm not sure how it fares on SSDs.

A modest desktop CPU can compute raid6 parity at 6 GB/sec, a less-modest one at more than 10 GB/sec. Maybe a bottleneck is within reach of an array of SSDs vs. a slow CPU.

> It just came up again in a thread over the weekend on linux-raid@. I'm going to ask while people are paying attention if a patch to change the 30 second timeout to something a lot higher has ever been floated, what the negatives might be, and where to get this fixed if it wouldn't be accepted in the kernel code directly.

Defaults are defaults, they're not for everyone. 30 seconds is about two minutes too short for an SMR drive's worst-case write latency, or 28 seconds too long for an OLTP system, or just right for an end-user's personal machine with a low-energy desktop drive and a long spin-up time.

Once a drive starts taking 30+ seconds to do I/O, I consider the drive failed in the sense that it's too slow to meet latency requirements. When the problem is that it's already taking too long, the solution is not waiting even longer. To put things in perspective, consider that server hardware watchdog timeouts are typically 60 seconds by default (if not maximum).

If anything, I want the timeout to be shorter so that upper layers with redundancy can get an EIO and initiate repair promptly, and admins can get notified to evict chronic offenders from their drive slots, without having to pay extra for hard disk firmware with that feature.

> *Ideally* I think we'd want two timeouts. I'd like to see commands have a timer that results in merely a warning that could be used by e.g. btrfs scrub to know "hey, this sector range is 'slow', I'm going to write over those sectors". That's how bad sectors start out: they read slower, and eventually go beyond 30 seconds and now it's all link resets. If the problem could be fixed before then... that's the best scenario.

What's the downside of a link reset? Can the driver not just return EIO for all the outstanding IOs in progress at reset, and let the upper layers deal with it? Or is the problem that the upper layers are all horribly broken by EIOs, or drive firmware horribly broken by link resets?

The upper layers could time the IOs, and make their own decisions based on the timing (e.g. btrfs or mdadm could proactively repair anything that took more than 10 seconds to read).
That might be a better approach, since shortening the time to an EIO is only useful when you have a redundancy layer in place to do something about them.

> The 2nd timer would be: OK, the controller or drive just face-planted, reset.
>
> --
> Chris Murphy
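A sketch of what "timing the IOs" from userspace can look like with standard tools (thresholds are illustrative; device names are hypothetical):

# Watch per-device latency; an "await" (ms) that keeps climbing toward
# the 30 s command timer flags a drive for proactive rewrite/eviction
# long before link resets start:
iostat -dx /dev/sd[a-f] 5

# Crude one-shot probe of a single region's read latency, bypassing the
# page cache (offset chosen arbitrarily for illustration):
time dd if=/dev/sda of=/dev/null bs=4096 count=1 skip=123456 iflag=direct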
Re: Adventures in btrfs raid5 disk recovery
On Mon, Jun 27, 2016 at 6:17 PM, Chris Murphy <li...@colorremedies.com> wrote:
> On Mon, Jun 27, 2016 at 5:21 AM, Austin S. Hemmelgarn <ahferro...@gmail.com> wrote:
>> On 2016-06-25 12:44, Chris Murphy wrote:
>>>
>>> On Fri, Jun 24, 2016 at 12:19 PM, Austin S. Hemmelgarn <ahferro...@gmail.com> wrote:
>>>
>>>> Well, the obvious major advantage that comes to mind for me to checksumming parity is that it would let us scrub the parity data itself and verify it.
>>>
>>> OK, but hold on. During scrub, it should read data, compute checksums *and* parity, and compare those to what's on-disk - EXTENT_CSUM in the checksum tree, and the parity strip in the chunk tree. And if parity is wrong, then it should be replaced.
>>
>> Except that's horribly inefficient. With limited exceptions involving highly situational co-processors, computing a checksum of a parity block is always going to be faster than computing parity for the stripe. By using that to check parity, we can safely speed up the common case of near-zero errors during a scrub by a pretty significant factor.
>
> OK, I'm in favor of that. Although somehow md gets away with this by computing and checking parity for its scrubs, and still manages to keep drives saturated in the process - at least HDDs; I'm not sure how it fares on SSDs.

What I read in this thread clarifies the different flavors of errors I saw when trying btrfs raid5 while corrupting 1 device, or just unexpectedly removing a device and replacing it with a fresh one. Especially the lack of parity csums I was not aware of, and I think this is really wrong.

Consider a 4-disk btrfs raid10 and a 3-disk btrfs raid5. Both protect against the loss of 1 device or bad blocks on 1 device. In the current design (unoptimized for performance), raid10 reads from 2 disks and raid5 does as well (as far as I remember), per task/process. Which pair of strips for raid10 is pseudo-random AFAIK, so one could get low throughput if some device in the array is older/slower and that one is picked. From the device to the fs logical layer is just a simple function, namely a copy, so there is the option to keep data in place (zero-copy). The data is at least read by the csum check, and in case of failure, the btrfs code picks the alternative strip and corrects, etc.

For raid5, assuming it avoids the parity in principle, it is also a strip pair plus a csum check. In case of a csum failure, one needs the parity strip and a parity calculation. To me, it looks like the 'fear' of this calculation has made raid56 a sort of add-on, instead of a more integral part. Looking at the raid6 perf test at boot in dmesg, it is 30 GByte/s, even higher than memory bandwidth. So although a calculation is needed in case data0strip+paritystrip were used instead of data0strip+data1strip, I think that, looking at total cost, it can be cheaper than spending time on seeks, at least on HDDs. If the parity calculation is treated in a transparent way, same as a copy, then there is more flexibility in selecting disks (and strips), and it enables easier design and performance optimizations, I think.

>> The ideal situation that I'd like to see for scrub WRT parity is:
>> 1. Store checksums for the parity itself.
>> 2. During scrub, if the checksum is good, the parity is good, and we just saved the time of computing the whole parity block.
>> 3. If the checksum is not good, then compute the parity.
>> If the parity just computed matches what is there already, the checksum is bad and should be rewritten (and we should probably recompute the whole block of checksums it's in); otherwise, the parity was bad, so write out the new parity and update the checksum.

This 3rd point: if the parity matches but the csum is not good, then there is a btrfs design error or some hardware/CPU/memory problem. Compare with btrfs raid10: if the copies match but the csum is wrong, then there is something fatally wrong.

Just the first step, the csum check, would suffice: if it is wrong, you generate the assumed-corrupt strip newly from the 3 others. And for a 3-disk raid5, from the 2 others, whether that is a copy or a parity calculation.

>> 4. Have an option to skip the csum check on the parity and always compute it.
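For reference, the boot-time benchmark referred to above can be read back from the kernel log; a sketch (the exact message format varies by kernel version):

# The raid6 library benchmarks its gen/xor routines at boot and logs the
# winning algorithm and its throughput:
dmesg | grep -i raid6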
Re: Adventures in btrfs raid5 disk recovery
For what it's worth, I found btrfs-map-logical can produce the mapping for raid5 (didn't test raid6) by specifying the extent block length. If that's omitted, it only shows the device+mapping for the first strip. This example is a 3-disk raid5, with a 128KiB file all in a single extent.

[root@f24s ~]# btrfs-map-logical -l 14157742080 /dev/VG/a
mirror 1 logical 14157742080 physical 1109327872 device /dev/mapper/VG-a
mirror 2 logical 14157742080 physical 2183069696 device /dev/mapper/VG-c

[root@f24s ~]# btrfs-map-logical -l 14157742080 -b 131072 /dev/VG/a
mirror 1 logical 14157742080 physical 1109327872 device /dev/mapper/VG-a
mirror 1 logical 14157807616 physical 1075773440 device /dev/mapper/VG-b
mirror 2 logical 14157742080 physical 2183069696 device /dev/mapper/VG-c
mirror 2 logical 14157807616 physical 2183069696 device /dev/mapper/VG-c

It's also possible to use -c and -o to copy the extent to a file and more easily diff it with a control file, rather than using dd.

Chris Murphy
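A sketch of that copy-and-diff workflow, reusing the logical address and length from the example above (my reading of the options is that -c selects which copy/mirror to read and -o names the output file -- check btrfs-map-logical --help; the control file is assumed to hold known-good contents):

# Copy mirror 1 of the 128KiB extent to a file, then compare it against
# a known-good control copy instead of carving it out with dd:
btrfs-map-logical -l 14157742080 -b 131072 -c 1 -o /tmp/extent.bin /dev/VG/a
cmp /tmp/extent.bin /tmp/control.bin && echo "extent matches control"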
Re: Adventures in btrfs raid5 disk recovery
On Mon, Jun 27, 2016 at 5:21 AM, Austin S. Hemmelgarn wrote:
> On 2016-06-25 12:44, Chris Murphy wrote:
>>
>> On Fri, Jun 24, 2016 at 12:19 PM, Austin S. Hemmelgarn wrote:
>>
>>> Well, the obvious major advantage that comes to mind for me to checksumming parity is that it would let us scrub the parity data itself and verify it.
>>
>> OK, but hold on. During scrub, it should read data, compute checksums *and* parity, and compare those to what's on-disk - EXTENT_CSUM in the checksum tree, and the parity strip in the chunk tree. And if parity is wrong, then it should be replaced.
>
> Except that's horribly inefficient. With limited exceptions involving highly situational co-processors, computing a checksum of a parity block is always going to be faster than computing parity for the stripe. By using that to check parity, we can safely speed up the common case of near-zero errors during a scrub by a pretty significant factor.

OK, I'm in favor of that. Although somehow md gets away with this by computing and checking parity for its scrubs, and still manages to keep drives saturated in the process - at least HDDs; I'm not sure how it fares on SSDs.

> The ideal situation that I'd like to see for scrub WRT parity is:
> 1. Store checksums for the parity itself.
> 2. During scrub, if the checksum is good, the parity is good, and we just saved the time of computing the whole parity block.
> 3. If the checksum is not good, then compute the parity. If the parity just computed matches what is there already, the checksum is bad and should be rewritten (and we should probably recompute the whole block of checksums it's in); otherwise, the parity was bad, so write out the new parity and update the checksum.
> 4. Have an option to skip the csum check on the parity and always compute it.
>
>> Even 'check > md/sync_action' does this. So, no pun intended, Btrfs isn't even at parity with mdadm on data integrity if it doesn't check whether the parity matches the data.
>
> Except that MD and LVM don't have checksums to verify anything outside of the very high-level metadata. They have to compute the parity during a scrub because that's the _only_ way they have to check data integrity. Just because that's the only way for them to check it does not mean we have to follow their design, especially considering that we have other, faster ways to check it.

I'm not opposed to this optimization. But to retroactively better qualify my previous "major advantage": what I meant was in terms of solving a functional deficiency.

>> The much bigger problem we have right now, affecting Btrfs and LVM/mdadm md raid, is this silly bad default of non-enterprise drives having no configurable SCT ERC, with ensuing long recovery times, and the kernel SCSI command timer at 30 seconds - which actually also fucks over regular single-disk users, because it means they don't get the "benefit" of long recovery times, which is the whole g'd point of that feature. This itself causes so many problems where bad sectors just get worse and don't get fixed up because of all the link resets. So I still think it's a bullshit default kernel-side, because it pretty much affects the majority use case; it is only a non-problem with proprietary hardware raid, and software raid using enterprise (or NAS-specific) drives that already have short recovery times by default.
>
> On this, we can agree.

It just came up again in a thread over the weekend on linux-raid@.
I'm going to ask while people are paying attention if a patch to change the 30 second timeout to something a lot higher has ever been floated, what the negatives might be, and where to get this fixed if it wouldn't be accepted in the kernel code directly.

*Ideally* I think we'd want two timeouts. I'd like to see commands have a timer that results in merely a warning that could be used by e.g. btrfs scrub to know "hey, this sector range is 'slow', I'm going to write over those sectors". That's how bad sectors start out: they read slower, and eventually go beyond 30 seconds and now it's all link resets. If the problem could be fixed before then... that's the best scenario. The 2nd timer would be: OK, the controller or drive just face-planted, reset.

--
Chris Murphy
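For drives that do expose SCT ERC, the mismatch described in this sub-thread can be inspected and corrected from userspace; a sketch (device name hypothetical, and many consumer drives simply reject the set command):

# Query the drive's SCT Error Recovery Control settings:
smartctl -l scterc /dev/sdX

# Cap read/write error recovery at 7.0 seconds (units of 100 ms),
# comfortably under the kernel's default 30 s command timer:
smartctl -l scterc,70,70 /dev/sdX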
Re: Adventures in btrfs raid5 disk recovery
On 2016-06-25 12:44, Chris Murphy wrote:
> On Fri, Jun 24, 2016 at 12:19 PM, Austin S. Hemmelgarn wrote:
>> Well, the obvious major advantage that comes to mind for me to checksumming parity is that it would let us scrub the parity data itself and verify it.
>
> OK, but hold on. During scrub, it should read data, compute checksums *and* parity, and compare those to what's on-disk - EXTENT_CSUM in the checksum tree, and the parity strip in the chunk tree. And if parity is wrong, then it should be replaced.

Except that's horribly inefficient. With limited exceptions involving highly situational co-processors, computing a checksum of a parity block is always going to be faster than computing parity for the stripe. By using that to check parity, we can safely speed up the common case of near-zero errors during a scrub by a pretty significant factor.

The ideal situation that I'd like to see for scrub WRT parity is:
1. Store checksums for the parity itself.
2. During scrub, if the checksum is good, the parity is good, and we just saved the time of computing the whole parity block.
3. If the checksum is not good, then compute the parity. If the parity just computed matches what is there already, the checksum is bad and should be rewritten (and we should probably recompute the whole block of checksums it's in); otherwise, the parity was bad, so write out the new parity and update the checksum.
4. Have an option to skip the csum check on the parity and always compute it.

> Even 'check > md/sync_action' does this. So, no pun intended, Btrfs isn't even at parity with mdadm on data integrity if it doesn't check whether the parity matches the data.

Except that MD and LVM don't have checksums to verify anything outside of the very high-level metadata. They have to compute the parity during a scrub because that's the _only_ way they have to check data integrity. Just because that's the only way for them to check it does not mean we have to follow their design, especially considering that we have other, faster ways to check it.

>> I'd personally much rather know my parity is bad before I need to use it than after using it to reconstruct data and getting an error there, and I'd be willing to bet that most seasoned sysadmins working for companies using big storage arrays likely feel the same about it.
>
> That doesn't require parity csums though. It just requires computing parity during a scrub and comparing it to the parity on disk to make sure they're the same. If they aren't, assuming no other error for that full stripe read, then the parity block is replaced.

It does not require it, but it can make it significantly more efficient, and even a 1% increase in efficiency is a huge difference on a big array.

> So that's also something to check in the code, or poke a system with a stick and see what happens.

I could see it being practical to have an option to turn this off for performance reasons or similar, but again, I have a feeling that most people would rather be able to check whether a rebuild will eat data before trying to rebuild (depending on the situation in such a case, it will sometimes just make more sense to nuke the array and restore from a backup instead of spending time waiting for it to rebuild).
> The much bigger problem we have right now, affecting Btrfs and LVM/mdadm md raid, is this silly bad default of non-enterprise drives having no configurable SCT ERC, with ensuing long recovery times, and the kernel SCSI command timer at 30 seconds - which actually also fucks over regular single-disk users, because it means they don't get the "benefit" of long recovery times, which is the whole g'd point of that feature. This itself causes so many problems where bad sectors just get worse and don't get fixed up because of all the link resets. So I still think it's a bullshit default kernel-side, because it pretty much affects the majority use case; it is only a non-problem with proprietary hardware raid, and software raid using enterprise (or NAS-specific) drives that already have short recovery times by default.

On this, we can agree.
Re: Adventures in btrfs raid5 disk recovery
On Sun, Jun 26, 2016 at 1:54 AM, Andrei Borzenkov wrote:
> On 26.06.2016 00:52, Chris Murphy wrote:
>> Interestingly enough, so far I'm finding with full stripe writes, i.e. 3x raid5, exactly 128KiB data writes, devid 3 is always parity. This is raid4.
>
> That's not what the code suggests and what I see in practice - parity seems to be distributed across all disks; each new 128KiB file (extent) has parity on a new disk. At least as long as we can trust btrfs-map-logical to always show parity as "mirror 2".

tl;dr: Andrei is correct, there's no raid4 behavior here. It looks like mirror 2 is always parity; more on that below.

> Do you see consecutive full stripes in your tests? Or how do you determine which devid has parity for a given full stripe?

I do see consecutive full stripe writes, but it doesn't always happen. Not checking the consecutivity is where I became confused.

[root@f24s ~]# filefrag -v /mnt/5/ab*
Filesystem type is: 9123683e
File size of /mnt/5/ab128_2.txt is 131072 (32 blocks of 4096 bytes)
 ext: logical_offset: physical_offset:   length: expected: flags:
   0:    0..     31:  3456128..3456159:      32:           last,eof
/mnt/5/ab128_2.txt: 1 extent found
File size of /mnt/5/ab128_3.txt is 131072 (32 blocks of 4096 bytes)
 ext: logical_offset: physical_offset:   length: expected: flags:
   0:    0..     31:  3456224..3456255:      32:           last,eof
/mnt/5/ab128_3.txt: 1 extent found
File size of /mnt/5/ab128_4.txt is 131072 (32 blocks of 4096 bytes)
 ext: logical_offset: physical_offset:   length: expected: flags:
   0:    0..     31:  3456320..3456351:      32:           last,eof
/mnt/5/ab128_4.txt: 1 extent found
File size of /mnt/5/ab128_5.txt is 131072 (32 blocks of 4096 bytes)
 ext: logical_offset: physical_offset:   length: expected: flags:
   0:    0..     31:  3456352..3456383:      32:           last,eof
/mnt/5/ab128_5.txt: 1 extent found
File size of /mnt/5/ab128_6.txt is 131072 (32 blocks of 4096 bytes)
 ext: logical_offset: physical_offset:   length: expected: flags:
   0:    0..     31:  3456384..3456415:      32:           last,eof
/mnt/5/ab128_6.txt: 1 extent found
File size of /mnt/5/ab128_7.txt is 131072 (32 blocks of 4096 bytes)
 ext: logical_offset: physical_offset:   length: expected: flags:
   0:    0..     31:  3456416..3456447:      32:           last,eof
/mnt/5/ab128_7.txt: 1 extent found
File size of /mnt/5/ab128_8.txt is 131072 (32 blocks of 4096 bytes)
 ext: logical_offset: physical_offset:   length: expected: flags:
   0:    0..     31:  3456448..3456479:      32:           last,eof
/mnt/5/ab128_8.txt: 1 extent found
File size of /mnt/5/ab128_9.txt is 131072 (32 blocks of 4096 bytes)
 ext: logical_offset: physical_offset:   length: expected: flags:
   0:    0..     31:  3456480..3456511:      32:           last,eof
/mnt/5/ab128_9.txt: 1 extent found
File size of /mnt/5/ab128.txt is 131072 (32 blocks of 4096 bytes)
 ext: logical_offset: physical_offset:   length: expected: flags:
   0:    0..     31:  3456096..3456127:      32:           last,eof
/mnt/5/ab128.txt: 1 extent found

Starting with the bottom file, then from the top, so they're in 4096-byte-block order; the 2nd column is the difference from the previous value:

3456096
3456128      32
3456224      96
3456320      96
3456352      32
3456384      32
3456416      32
3456448      32
3456480      32

So the first two files are consecutive full stripe writes. The next two aren't. The next five are. They were all copied at the same time. I don't know why they aren't always consecutive writes.
[root@f24s ~]# btrfs-map-logical -l $[4096*3456096] /dev/VG/a
mirror 1 logical 14156169216 physical 1108541440 device /dev/mapper/VG-a
mirror 2 logical 14156169216 physical 2182283264 device /dev/mapper/VG-c

[root@f24s ~]# btrfs-map-logical -l $[4096*3456128] /dev/VG/a
mirror 1 logical 14156300288 physical 1075052544 device /dev/mapper/VG-b
mirror 2 logical 14156300288 physical 1108606976 device /dev/mapper/VG-a

[root@f24s ~]# btrfs-map-logical -l $[4096*3456224] /dev/VG/a
mirror 1 logical 14156693504 physical 1075249152 device /dev/mapper/VG-b
mirror 2 logical 14156693504 physical 1108803584 device /dev/mapper/VG-a

[root@f24s ~]# btrfs-map-logical -l $[4096*3456320] /dev/VG/a
mirror 1 logical 14157086720 physical 1075445760 device /dev/mapper/VG-b
mirror 2 logical 14157086720 physical 1109000192 device /dev/mapper/VG-a

[root@f24s ~]# btrfs-map-logical -l $[4096*3456352] /dev/VG/a
mirror 1 logical 14157217792 physical 2182807552 device /dev/mapper/VG-c
mirror 2 logical 14157217792 physical 1075511296 device /dev/mapper/VG-b

[root@f24s ~]# btrfs-map-logical -l $[4096*3456384] /dev/VG/a
mirror 1 logical 14157348864 physical 1109131264 device /dev/mapper/VG-a
mirror 2 logical 14157348864 physical 2182873088 device /dev/mapper/VG-c
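The six invocations above can be generated in one loop; a small sketch using the same offsets, and assuming (as above) that "mirror 2" is the parity strip:

# Map each full-stripe start and report which device holds its parity:
for blk in 3456096 3456128 3456224 3456320 3456352 3456384; do
    echo "=== 4KiB block $blk ==="
    btrfs-map-logical -l $((4096 * blk)) /dev/VG/a | grep 'mirror 2'
done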
Re: Adventures in btrfs raid5 disk recovery
Andrei Borzenkov posted on Sun, 26 Jun 2016 10:54:16 +0300 as excerpted:
> P.S. usage of "stripe" to mean "stripe element" actually adds to confusion when reading code :)

... and posts (including patches, which I guess are code as well, just not applied yet). I've been noticing that in the "stripe length" patches: the comments associated with the patches suggest it's actually "strip length" they're talking about, using the "N strips, one per device, make a stripe" definition.

--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master -- and if you use the program, he is your master."  Richard Stallman
Re: Adventures in btrfs raid5 disk recovery
On 26.06.2016 00:52, Chris Murphy wrote:
> Interestingly enough, so far I'm finding with full stripe writes, i.e. 3x raid5, exactly 128KiB data writes, devid 3 is always parity. This is raid4.

That's not what the code suggests and what I see in practice - parity seems to be distributed across all disks; each new 128KiB file (extent) has parity on a new disk. At least as long as we can trust btrfs-map-logical to always show parity as "mirror 2".

Do you see consecutive full stripes in your tests? Or how do you determine which devid has parity for a given full stripe? This information is not actually stored anywhere; it is computed based on block group geometry and logical stripe offset.

P.S. usage of "stripe" to mean "stripe element" actually adds to confusion when reading code :)
Re: Adventures in btrfs raid5 disk recovery
Interestingly enough, so far I'm finding that with full stripe writes, i.e. 3x raid5 and exactly 128KiB data writes, devid 3 is always parity. This is raid4. So... I wonder if some of these slow cases end up with a bunch of stripes that are effectively raid4-like and have a lot of parity overwrites, which is where raid4 suffers due to disk contention. Totally speculative, as the sample size is too small and distinctly non-random.

Chris Murphy
Re: Adventures in btrfs raid5 disk recovery
On Fri, Jun 24, 2016 at 12:19 PM, Austin S. Hemmelgarn wrote:
> Well, the obvious major advantage that comes to mind for me to checksumming parity is that it would let us scrub the parity data itself and verify it.

OK, but hold on. During scrub, it should read data, compute checksums *and* parity, and compare those to what's on-disk - EXTENT_CSUM in the checksum tree, and the parity strip in the chunk tree. And if parity is wrong, then it should be replaced.

Even 'check > md/sync_action' does this. So, no pun intended, Btrfs isn't even at parity with mdadm on data integrity if it doesn't check whether the parity matches the data.

> I'd personally much rather know my parity is bad before I need to use it than after using it to reconstruct data and getting an error there, and I'd be willing to bet that most seasoned sysadmins working for companies using big storage arrays likely feel the same about it.

That doesn't require parity csums though. It just requires computing parity during a scrub and comparing it to the parity on disk to make sure they're the same. If they aren't, assuming no other error for that full stripe read, then the parity block is replaced. So that's also something to check in the code, or poke a system with a stick and see what happens.

> I could see it being practical to have an option to turn this off for performance reasons or similar, but again, I have a feeling that most people would rather be able to check whether a rebuild will eat data before trying to rebuild (depending on the situation in such a case, it will sometimes just make more sense to nuke the array and restore from a backup instead of spending time waiting for it to rebuild).

The much bigger problem we have right now, affecting Btrfs and LVM/mdadm md raid, is this silly bad default of non-enterprise drives having no configurable SCT ERC, with ensuing long recovery times, and the kernel SCSI command timer at 30 seconds - which actually also fucks over regular single-disk users, because it means they don't get the "benefit" of long recovery times, which is the whole g'd point of that feature. This itself causes so many problems where bad sectors just get worse and don't get fixed up because of all the link resets. So I still think it's a bullshit default kernel-side, because it pretty much affects the majority use case; it is only a non-problem with proprietary hardware raid, and software raid using enterprise (or NAS-specific) drives that already have short recovery times by default.

This has been true for a very long time, maybe a decade. And it's such complete utter crap that this hasn't been dealt with properly by any party. No distribution has fixed this for their users. Upstream udev hasn't dealt with it. And kernel folks haven't dealt with it. It's a perverse joke on the user to do this out of the box.

--
Chris Murphy
Re: Adventures in btrfs raid5 disk recovery
On 2016-06-24 13:52, Chris Murphy wrote:
> On Fri, Jun 24, 2016 at 11:21 AM, Andrei Borzenkov wrote:
>> On 24.06.2016 20:06, Chris Murphy wrote:
>>> On Fri, Jun 24, 2016 at 3:52 AM, Andrei Borzenkov wrote:
>>>> On Fri, Jun 24, 2016 at 11:50 AM, Hugo Mills wrote:
>>>> [...] (meta)data and RAID56 parity is not data.
>>>>>
>>>>> Checksums are not parity, correct. However, every data block (including, I think, the parity) is checksummed and put into the csum tree. This allows the FS to determine where damage has occurred, rather than simply detecting that it has occurred (which would be the case if the parity doesn't match the data, or if the two copies of a RAID-1 array don't match).
>>>>
>>>> Yes, that is what I wrote below. But that means that RAID5 with one degraded disk won't be able to reconstruct data on this degraded disk because the reconstructed extent content won't match the checksum. Which kinda makes RAID5 pointless.
>>>
>>> I don't understand this. Whether the failed disk means a stripe is missing a data strip or a parity strip, if any other strip is damaged, of course the reconstruction isn't going to match the checksum. This does not make raid5 pointless.
>>
>> Yes, you are right. We have a double failure here. Still, in the current situation we apparently may end up with btrfs reconstructing the missing block using wrong information. As was mentioned elsewhere, btrfs does not verify the checksum of the reconstructed block, meaning data corruption.
>
> Well, that'd be bad, but also good in that it would explain a lot of problems people have when metadata is also raid5. In this whole thread the premise is that the metadata is raid1, so the fs doesn't totally face-plant; we just get a bunch of weird data corruptions. The metadata raid5 cases were sorta "WTF happened?" and not much was really said about them other than telling the user to scrape off what they can and start over.
>
> Anyway, while not good, I still think this is not super problematic: at least *do* check EXTENT_CSUM after reconstruction from parity, rather than assuming that reconstruction happened correctly. The data needed to pass/fail the rebuild is already on the disk. It just needs to be checked.
>
> Better would be to get parity csummed and put into the csum tree. But I don't know how much that helps. Think about always computing and writing csums for parity, which almost never get used, vs. keeping things the way they are now and just *checking our work* after reconstruction from parity. If there's some obvious major advantage to checksumming the parity, I'm all ears, but I'm not thinking of it at the moment.

Well, the obvious major advantage that comes to mind for me to checksumming parity is that it would let us scrub the parity data itself and verify it. I'd personally much rather know my parity is bad before I need to use it than after using it to reconstruct data and getting an error there, and I'd be willing to bet that most seasoned sysadmins working for companies using big storage arrays likely feel the same about it. I could see it being practical to have an option to turn this off for performance reasons or similar, but again, I have a feeling that most people would rather be able to check whether a rebuild will eat data before trying to rebuild (depending on the situation in such a case, it will sometimes just make more sense to nuke the array and restore from a backup instead of spending time waiting for it to rebuild).
Re: Adventures in btrfs raid5 disk recovery
On Fri, Jun 24, 2016 at 11:40:56AM -0600, Chris Murphy wrote:
> On Fri, Jun 24, 2016 at 4:16 AM, Hugo Mills <h...@carfax.org.uk> wrote:
>> On Fri, Jun 24, 2016 at 12:52:21PM +0300, Andrei Borzenkov wrote:
>> For data, say you have n-1 good devices, with n-1 blocks on them. Each block has a checksum in the metadata, so you can read that checksum, read the blocks, and verify that they're not damaged. From those n-1 known-good blocks (all data, or one parity and the rest data) you can reconstruct the remaining block. That reconstructed block won't be checked against the csum for the missing block -- it'll just be written and a new csum for it written with it.
>
> The last sentence is hugely problematic. Parity doesn't appear to be either CoW'd or checksummed. If it is used for reconstruction and the reconstructed data isn't compared to the data's EXTENT_CSUM entry, but that entry is rather recomputed and written, that's just like blindly trusting that the parity is correct and then authenticating it with a csum.

I think what happens is the data is recomputed, but the csum on the data is _not_ updated (the csum does not reside in the raid56 code). A read of the reconstructed data would get a csum failure. (Of course, every 4-billionth time this happens the csum is correct by random chance, so you wouldn't want to be reading parity blocks from a drive full of garbage, but that's a different matter.)

> It's not difficult to test. Corrupt one byte of parity. Yank a drive. Add a new one. Start a reconstruction with scrub or balance (or both, to see if they differ) and find out what happens. What should happen is the reconstruct should work for everything except that one file. If it's reconstructed silently, it should contain visible corruption and we all collectively raise our eyebrows.

I've done something like that test: write random data to 1000 random blocks on one disk, then run scrub. It reconstructs the data without problems (except for the minor wart that 'scrub status -d' counts the errors randomly against every device, while 'dev stats' counts all the errors on the disk that was corrupted). Disk-side data corruption is a thing I have to deal with a few times each year, so I tested the btrfs raid5 implementation for that case before I started using it.

As far as I can tell so far, everything in btrfs raid5 works properly if a disk fails _while the filesystem is not mounted_. The problem I see in the field is not *silent* corruption. It's a whole lot of very *noisy* corruption detected under circumstances where I'd expect to see no corruption at all (silent or otherwise).
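A sketch of that kind of corruption test (destructive: scratch filesystem on disposable devices only; the device, mount point, and block count are assumptions):

#!/bin/bash
VICTIM=/dev/sdX          # hypothetical member of a scratch btrfs raid5
MNT=/mnt/scratch
NBLK=$(( $(blockdev --getsz "$VICTIM") / 8 ))   # 4KiB blocks on device

# Overwrite 1000 random 4KiB blocks with garbage (this may also hit
# superblocks or metadata - hence scratch devices only):
for i in $(seq 1000); do
    blk=$(( ((RANDOM << 15) | RANDOM) % NBLK ))
    dd if=/dev/urandom of="$VICTIM" bs=4096 seek="$blk" count=1 \
       conv=notrunc status=none
done

btrfs scrub start -Bd "$MNT"   # -B: wait for completion, -d: per-device stats
btrfs device stats "$MNT"      # compare with scrub's per-device accounting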
Re: Adventures in btrfs raid5 disk recovery
On Fri, Jun 24, 2016 at 11:21 AM, Andrei Borzenkov wrote:
> On 24.06.2016 20:06, Chris Murphy wrote:
>> On Fri, Jun 24, 2016 at 3:52 AM, Andrei Borzenkov wrote:
>>> On Fri, Jun 24, 2016 at 11:50 AM, Hugo Mills wrote:
>>> [...] (meta)data and RAID56 parity is not data.
>>>>
>>>> Checksums are not parity, correct. However, every data block (including, I think, the parity) is checksummed and put into the csum tree. This allows the FS to determine where damage has occurred, rather than simply detecting that it has occurred (which would be the case if the parity doesn't match the data, or if the two copies of a RAID-1 array don't match).
>>>
>>> Yes, that is what I wrote below. But that means that RAID5 with one degraded disk won't be able to reconstruct data on this degraded disk because the reconstructed extent content won't match the checksum. Which kinda makes RAID5 pointless.
>>
>> I don't understand this. Whether the failed disk means a stripe is missing a data strip or a parity strip, if any other strip is damaged, of course the reconstruction isn't going to match the checksum. This does not make raid5 pointless.
>
> Yes, you are right. We have a double failure here. Still, in the current situation we apparently may end up with btrfs reconstructing the missing block using wrong information. As was mentioned elsewhere, btrfs does not verify the checksum of the reconstructed block, meaning data corruption.

Well, that'd be bad, but also good in that it would explain a lot of problems people have when metadata is also raid5. In this whole thread the premise is that the metadata is raid1, so the fs doesn't totally face-plant; we just get a bunch of weird data corruptions. The metadata raid5 cases were sorta "WTF happened?" and not much was really said about them other than telling the user to scrape off what they can and start over.

Anyway, while not good, I still think this is not super problematic: at least *do* check EXTENT_CSUM after reconstruction from parity, rather than assuming that reconstruction happened correctly. The data needed to pass/fail the rebuild is already on the disk. It just needs to be checked.

Better would be to get parity csummed and put into the csum tree. But I don't know how much that helps. Think about always computing and writing csums for parity, which almost never get used, vs. keeping things the way they are now and just *checking our work* after reconstruction from parity. If there's some obvious major advantage to checksumming the parity, I'm all ears, but I'm not thinking of it at the moment.

--
Chris Murphy
Re: Adventures in btrfs raid5 disk recovery
On Fri, Jun 24, 2016 at 4:16 AM, Hugo Mills wrote:
> On Fri, Jun 24, 2016 at 12:52:21PM +0300, Andrei Borzenkov wrote:
>> Yes, that is what I wrote below. But that means that RAID5 with one degraded disk won't be able to reconstruct data on this degraded disk because the reconstructed extent content won't match the checksum. Which kinda makes RAID5 pointless.
>
> Eh? How do you come to that conclusion?
>
> For data, say you have n-1 good devices, with n-1 blocks on them. Each block has a checksum in the metadata, so you can read that checksum, read the blocks, and verify that they're not damaged. From those n-1 known-good blocks (all data, or one parity and the rest data) you can reconstruct the remaining block. That reconstructed block won't be checked against the csum for the missing block -- it'll just be written and a new csum for it written with it.

The last sentence is hugely problematic. Parity doesn't appear to be either CoW'd or checksummed. If it is used for reconstruction and the reconstructed data isn't compared to the data's EXTENT_CSUM entry, but that entry is rather recomputed and written, that's just like blindly trusting that the parity is correct and then authenticating it with a csum.

It's not difficult to test. Corrupt one byte of parity. Yank a drive. Add a new one. Start a reconstruction with scrub or balance (or both, to see if they differ) and find out what happens. What should happen is that the reconstruct works for everything except that one file. If it's reconstructed silently, it should contain visible corruption and we all collectively raise our eyebrows.

--
Chris Murphy
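A sketch of the parity-corruption step, using btrfs-map-logical's "mirror 2" line (treated elsewhere in this thread as the parity strip) to locate the byte to clobber; the addresses and device names are placeholders taken from the earlier 3-disk example:

# 1. Find the parity strip's physical location for a known extent:
btrfs-map-logical -l 14157742080 -b 131072 /dev/VG/a
#    ...note the "mirror 2 ... physical NNN device ..." line.

# 2. Overwrite one byte at that offset (destructive; if the byte there
#    already happens to be 0xff, pick a different offset):
printf '\xff' | dd of=/dev/mapper/VG-c bs=1 seek=2183069696 count=1 conv=notrunc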
Re: Adventures in btrfs raid5 disk recovery
On Fri, Jun 24, 2016 at 4:16 AM, Andrei Borzenkov wrote:
> On Fri, Jun 24, 2016 at 8:20 AM, Chris Murphy wrote:
>
>> [root@f24s ~]# filefrag -v /mnt/5/*
>> Filesystem type is: 9123683e
>> File size of /mnt/5/a.txt is 16383 (4 blocks of 4096 bytes)
>>  ext: logical_offset: physical_offset:   length: expected: flags:
>>    0:    0..      3:  2931712..2931715:       4:           last,eof
>
> Hmm ... I wonder what is wrong here (openSUSE Tumbleweed):
>
> nohostname:~ # filefrag -v /mnt/1
> Filesystem type is: 9123683e
> File size of /mnt/1 is 3072 (1 block of 4096 bytes)
>  ext: logical_offset: physical_offset:   length: expected: flags:
>    0:    0..      0:   269376.. 269376:       1:           last,eof
> /mnt/1: 1 extent found
>
> But!
>
> nohostname:~ # filefrag -v /etc/passwd
> Filesystem type is: 9123683e
> File size of /etc/passwd is 1527 (1 block of 4096 bytes)
>  ext: logical_offset: physical_offset:   length: expected: flags:
>    0:    0..   4095:        0..   4095:    4096:           last,not_aligned,inline,eof
> /etc/passwd: 1 extent found
> nohostname:~ #
>
> Why does it work for one filesystem but not for the other?
>
> [...]
>
>> So at the old address, it shows the "a..." is still there. And at the added single block for this file, at new logical and physical addresses, is the modification substituting the first "a" for "g".
>>
>> In this case, no rmw, no partial stripe modification, and no data already on-disk is at risk.
>
> You misunderstand the nature of the problem. What is put at risk is data that is already on disk and "shares" parity with new data.
>
> As an example, here are the first 64K in several extents on a 4-disk RAID5 with, so far, a single data chunk:
>
>     item 6 key (FIRST_CHUNK_TREE CHUNK_ITEM 1103101952) itemoff 15491 itemsize 176
>         chunk length 3221225472 owner 2 stripe_len 65536
>         type DATA|RAID5 num_stripes 4
>         stripe 0 devid 4 offset 9437184
>             dev uuid: ed13e42e-1633-4230-891c-897e86d1c0be
>         stripe 1 devid 3 offset 9437184
>             dev uuid: 10032b95-3f48-4ea0-a9ee-90064c53da1f
>         stripe 2 devid 2 offset 1074790400
>             dev uuid: cd749bd9-3d72-43b4-89a8-45e4a92658cf
>         stripe 3 devid 1 offset 1094713344
>             dev uuid: 41538b9f-3869-4c32-b3e2-30aa2ea1534e
>
>     item 5 key (1 DEV_EXTENT 1094713344) itemoff 16027 itemsize 48
>         dev extent chunk_tree 3 chunk objectid 256 chunk offset 1103101952 length 1073741824
>     item 7 key (2 DEV_EXTENT 1074790400) itemoff 15931 itemsize 48
>         dev extent chunk_tree 3 chunk objectid 256 chunk offset 1103101952 length 1073741824
>     item 9 key (3 DEV_EXTENT 9437184) itemoff 15835 itemsize 48
>         dev extent chunk_tree 3 chunk objectid 256 chunk offset 1103101952 length 1073741824
>     item 11 key (4 DEV_EXTENT 9437184) itemoff 15739 itemsize 48
>         dev extent chunk_tree 3 chunk objectid 256 chunk offset 1103101952 length 1073741824
>
> where devid 1 = sdb1, 2 = sdc1, etc.
> Now let's write some data (I created several files) up to 64K in size:
>
> mirror 1 logical 1103364096 physical 1074855936 device /dev/sdc1
> mirror 2 logical 1103364096 physical 9502720 device /dev/sde1
> mirror 1 logical 1103368192 physical 1074860032 device /dev/sdc1
> mirror 2 logical 1103368192 physical 9506816 device /dev/sde1
> mirror 1 logical 1103372288 physical 1074864128 device /dev/sdc1
> mirror 2 logical 1103372288 physical 9510912 device /dev/sde1
> mirror 1 logical 1103376384 physical 1074868224 device /dev/sdc1
> mirror 2 logical 1103376384 physical 9515008 device /dev/sde1
> mirror 1 logical 1103380480 physical 1074872320 device /dev/sdc1
> mirror 2 logical 1103380480 physical 9519104 device /dev/sde1
>
> Note that btrfs allocates 64K on the same device before switching to the next one. What is a bit misleading here: sdc1 is data and sde1 is parity (you can see it in the checksum tree, where only items for sdc1 exist).
>
> Now let's write the next 64K and see what happens:
>
> nohostname:~ # btrfs-map-logical -l 1103429632 -b 65536 /dev/sdb1
> mirror 1 logical 1103429632 physical 1094778880 device /dev/sdb1
> mirror 2 logical 1103429632 physical 9502720 device /dev/sde1
>
> See? btrfs now allocates a new stripe on sdb1; this stripe is at the same offset as the previous one on sdc1 (64K) and so shares the same parity stripe on sde1.

Yep, I've seen this also. What's not clear is if there's any optimization where it's doing partial strip writes, i.e. only a
Re: Adventures in btrfs raid5 disk recovery
On 24.06.2016 20:06, Chris Murphy wrote:
> On Fri, Jun 24, 2016 at 3:52 AM, Andrei Borzenkov wrote:
>> On Fri, Jun 24, 2016 at 11:50 AM, Hugo Mills wrote:
>> [...] (meta)data and RAID56 parity is not data.
>>>
>>> Checksums are not parity, correct. However, every data block (including, I think, the parity) is checksummed and put into the csum tree. This allows the FS to determine where damage has occurred, rather than simply detecting that it has occurred (which would be the case if the parity doesn't match the data, or if the two copies of a RAID-1 array don't match).
>>
>> Yes, that is what I wrote below. But that means that RAID5 with one degraded disk won't be able to reconstruct data on this degraded disk because the reconstructed extent content won't match the checksum. Which kinda makes RAID5 pointless.
>
> I don't understand this. Whether the failed disk means a stripe is missing a data strip or a parity strip, if any other strip is damaged, of course the reconstruction isn't going to match the checksum. This does not make raid5 pointless.

Yes, you are right. We have a double failure here. Still, in the current situation we apparently may end up with btrfs reconstructing the missing block using wrong information. As was mentioned elsewhere, btrfs does not verify the checksum of the reconstructed block, meaning data corruption.
Re: Adventures in btrfs raid5 disk recovery
On Fri, Jun 24, 2016 at 3:52 AM, Andrei Borzenkov wrote:
> On Fri, Jun 24, 2016 at 11:50 AM, Hugo Mills wrote:
> [...] (meta)data and RAID56 parity is not data.
>>
>> Checksums are not parity, correct. However, every data block (including, I think, the parity) is checksummed and put into the csum tree. This allows the FS to determine where damage has occurred, rather than simply detecting that it has occurred (which would be the case if the parity doesn't match the data, or if the two copies of a RAID-1 array don't match).
>
> Yes, that is what I wrote below. But that means that RAID5 with one degraded disk won't be able to reconstruct data on this degraded disk because the reconstructed extent content won't match the checksum. Which kinda makes RAID5 pointless.

I don't understand this. Whether the failed disk means a stripe is missing a data strip or a parity strip, if any other strip is damaged, of course the reconstruction isn't going to match the checksum. This does not make raid5 pointless.

--
Chris Murphy
Re: Adventures in btrfs raid5 disk recovery
On Fri, Jun 24, 2016 at 2:50 AM, Hugo Mills wrote:
> Checksums are not parity, correct. However, every data block (including, I think, the parity) is checksummed and put into the csum tree.

I don't see how parity is checksummed. It definitely is not in the csum tree. Two file systems, one raid5, one single, each with a single identical file:

raid5

    item 0 key (EXTENT_CSUM EXTENT_CSUM 12009865216) itemoff 16155 itemsize 128
        extent csum item

single

    item 0 key (EXTENT_CSUM EXTENT_CSUM 2168717312) itemoff 16155 itemsize 128
        extent csum item

That's the only entry in the csum tree. The raid5 one is not 33.33% bigger to account for the extra parity being checksummed.

Now, if parity is used for reconstruction of data, that data *is* checksummed, so if it fails the checksum after reconstruction, the information is available to determine it was incorrectly reconstructed. The notes in btrfs/raid56.c recognize the possibility of parity corruption and how to handle it. But I think that corruption is inferred. Maybe the parity csums are in some other metadata item, but I don't see how it's in the csum tree.

--
Chris Murphy
Re: Adventures in btrfs raid5 disk recovery
On Fri, Jun 24, 2016 at 10:52:53AM -0600, Chris Murphy wrote:
> On Fri, Jun 24, 2016 at 2:50 AM, Hugo Mills wrote:
>
>> Checksums are not parity, correct. However, every data block (including, I think, the parity) is checksummed and put into the csum tree.
>
> I don't see how parity is checksummed. It definitely is not in the csum tree. Two file systems, one raid5, one single, each with a single identical file:

It isn't -- I was wrong up there, and corrected myself in a later message after investigation. (Although in this case, I regard reality as being at fault ;) )

Hugo.

> raid5
>
>     item 0 key (EXTENT_CSUM EXTENT_CSUM 12009865216) itemoff 16155 itemsize 128
>         extent csum item
>
> single
>
>     item 0 key (EXTENT_CSUM EXTENT_CSUM 2168717312) itemoff 16155 itemsize 128
>         extent csum item
>
> That's the only entry in the csum tree. The raid5 one is not 33.33% bigger to account for the extra parity being checksummed.
>
> Now, if parity is used for reconstruction of data, that data *is* checksummed, so if it fails the checksum after reconstruction, the information is available to determine it was incorrectly reconstructed. The notes in btrfs/raid56.c recognize the possibility of parity corruption and how to handle it. But I think that corruption is inferred. Maybe the parity csums are in some other metadata item, but I don't see how it's in the csum tree.

--
Hugo Mills             | Great oxymorons of the world, no. 2:
hugo@... carfax.org.uk | Common Sense
http://carfax.org.uk/  |
PGP: E2AB1DE4          |
Re: Adventures in btrfs raid5 disk recovery
On Fri, Jun 24, 2016 at 07:02:34AM +0300, Andrei Borzenkov wrote:
>>> I don't read code well enough, but I'd be surprised if Btrfs reconstructs from parity and doesn't then check the resulting reconstructed data against its EXTENT_CSUM.
>>
>> I wouldn't be surprised if both things happen in different code paths, given the number of different paths leading into the raid56 code and the number of distinct failure modes it seems to have.
>
> Well, the problem is that the parity block cannot be redirected on write as data blocks can, which makes it impossible to version-control it. The only solution I see is to always use full stripe writes, by either wasting time in a fixed-width stripe or using a variable width, so that every stripe always gets a new version of its parity. This makes it possible to keep parity checksums like data checksums.

The allocator could try harder to avoid partial stripe writes. We can write multiple small extents to the same stripe as long as we always do it all within one transaction, and then later treat the entire stripe as read-only until every extent is removed. It would be possible to do that by fudging extent lengths (effectively adding a bunch of prealloc-ish space if we have a partial write after all the delalloc stuff is done), but it could waste some blocks on every single transaction, or create a bunch of "free but unavailable" space that makes df/statvfs output even more wrong than it usually is.

The raid5 rmw code could try to relocate the other extents sharing a stripe, but I fear that with the current state of the backref walking code, that would make raid5 spectacularly slow if a filesystem is anywhere near full.

We could also write rmw parity block updates to a journal (like another log tree). That would enable us to at least fix up the parity blocks after a crash, and close the write hole. That's an on-disk format change, though.
Re: Adventures in btrfs raid5 disk recovery
On Thu, Jun 23, 2016 at 11:20:40PM -0600, Chris Murphy wrote:
> [root@f24s ~]# filefrag -v /mnt/5/*
> Filesystem type is: 9123683e
> File size of /mnt/5/a.txt is 16383 (4 blocks of 4096 bytes)
>  ext: logical_offset: physical_offset:   length: expected: flags:
>    0:    0..      0:  2931732..2931732:       1:
>    1:    1..      3:  2931713..2931715:       3:  2931733: last,eof
> /mnt/5/a.txt: 2 extents found
> File size of /mnt/5/b.txt is 16383 (4 blocks of 4096 bytes)
>  ext: logical_offset: physical_offset:   length: expected: flags:
>    0:    0..      3:  2931716..2931719:       4:           last,eof
> /mnt/5/b.txt: 1 extent found
> File size of /mnt/5/c.txt is 16383 (4 blocks of 4096 bytes)
>  ext: logical_offset: physical_offset:   length: expected: flags:
>    0:    0..      3:  2931720..2931723:       4:           last,eof
> /mnt/5/c.txt: 1 extent found
> File size of /mnt/5/d.txt is 16383 (4 blocks of 4096 bytes)
>  ext: logical_offset: physical_offset:   length: expected: flags:
>    0:    0..      3:  2931724..2931727:       4:           last,eof
> /mnt/5/d.txt: 1 extent found
> File size of /mnt/5/e.txt is 16383 (4 blocks of 4096 bytes)
>  ext: logical_offset: physical_offset:   length: expected: flags:
>    0:    0..      3:  2931728..2931731:       4:           last,eof
> /mnt/5/e.txt: 1 extent found
>
> So at the old address, it shows the "a..." is still there. And at the added single block for this file, at new logical and physical addresses, is the modification substituting the first "a" for "g".
>
> In this case, no rmw, no partial stripe modification, and no data already on-disk is at risk. Even the metadata leaf/node is CoW'd; it has a new logical and physical address as well, which contains information for all five files.

Well, of course not. You're not setting up the conditions for failure. The extent at 2931712..2931715 is 4 blocks long, so when you overwrite part of the extent, all 4 blocks remain occupied. You need extents that are shorter than the stripe width, and you need to write to the same stripe in two different btrfs transactions (i.e. you need to delete an extent and then have a new extent mapped in the old location).
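A sketch of setting up exactly those failure conditions, following the recipe in the last paragraph (mount point hypothetical; geometry assumed to be a 3-device raid5 with 64KiB strips, i.e. a 128KiB full data stripe; the allocator is not guaranteed to reuse the freed slot, which is why the last step checks):

#!/bin/bash
MNT=/mnt/scratch

# Transaction 1: two small extents, each shorter than the stripe width.
dd if=/dev/urandom of="$MNT/a" bs=64K count=1
dd if=/dev/urandom of="$MNT/b" bs=64K count=1
sync

# Delete one extent, then map a new extent in a later transaction - if
# it lands in the freed slot, that is an rmw sharing parity with b.
rm "$MNT/a"
sync
dd if=/dev/urandom of="$MNT/c" bs=64K count=1
sync

# Check whether c's physical address fell into b's 128KiB stripe:
filefrag -v "$MNT/b" "$MNT/c"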
Re: Adventures in btrfs raid5 disk recovery
On 2016-06-24 06:59, Hugo Mills wrote:
> On Fri, Jun 24, 2016 at 01:19:30PM +0300, Andrei Borzenkov wrote:
>> On Fri, Jun 24, 2016 at 1:16 PM, Hugo Mills wrote:
>>> On Fri, Jun 24, 2016 at 12:52:21PM +0300, Andrei Borzenkov wrote:
>>>> On Fri, Jun 24, 2016 at 11:50 AM, Hugo Mills wrote:
>>>>> On Fri, Jun 24, 2016 at 07:02:34AM +0300, Andrei Borzenkov wrote:
>>>>>> 24.06.2016 04:47, Zygo Blaxell wrote:
>>>>>>> On Thu, Jun 23, 2016 at 06:26:22PM -0600, Chris Murphy wrote:
>>>>>>>> On Thu, Jun 23, 2016 at 1:32 PM, Goffredo Baroncelli wrote:
>>>>>>>>> The raid5 write hole is avoided in BTRFS (and in ZFS) thanks to the
>>>>>>>>> checksum.
>>>>>>>>
>>>>>>>> Yeah I'm kinda confused on this point.
>>>>>>>>
>>>>>>>> https://btrfs.wiki.kernel.org/index.php/RAID56
>>>>>>>>
>>>>>>>> It says there is a write hole for Btrfs. But defines it in terms of
>>>>>>>> parity possibly being stale after a crash. I think the term comes not
>>>>>>>> from merely parity being wrong but parity being wrong *and* then being
>>>>>>>> used to wrongly reconstruct data because it's blindly trusted.
>>>>>>>
>>>>>>> I think the opposite is more likely, as the layers above raid56
>>>>>>> seem to check the data against sums before raid56 ever sees it.
>>>>>>> (If those layers seem inverted to you, I agree, but OTOH there are
>>>>>>> probably good reason to do it that way).
>>>>>>
>>>>>> Yes, that's how I read code as well. btrfs layer that does checksumming
>>>>>> is unaware of parity blocks at all; for all practical purposes they do
>>>>>> not exist. What happens is approximately
>>>>>>
>>>>>> 1. logical extent is allocated and checksum computed
>>>>>> 2. it is mapped to physical area(s) on disks, skipping over what would
>>>>>> be parity blocks
>>>>>> 3. when these areas are written out, RAID56 parity is computed and filled in
>>>>>>
>>>>>> IOW btrfs checksums are for (meta)data and RAID56 parity is not data.
>>>>>
>>>>> Checksums are not parity, correct. However, every data block
>>>>> (including, I think, the parity) is checksummed and put into the csum
>>>>> tree. This allows the FS to determine where damage has occurred,
>>>>> rather than simply detecting that it has occurred (which would be the
>>>>> case if the parity doesn't match the data, or if the two copies of a
>>>>> RAID-1 array don't match).
>>>>
>>>> Yes, that is what I wrote below. But that means that RAID5 with one
>>>> degraded disk won't be able to reconstruct data on this degraded disk
>>>> because reconstructed extent content won't match checksum. Which kinda
>>>> makes RAID5 pointless.
>>>
>>> Eh? How do you come to that conclusion?
>>>
>>> For data, say you have n-1 good devices, with n-1 blocks on them.
>>> Each block has a checksum in the metadata, so you can read that
>>> checksum, read the blocks, and verify that they're not damaged. From
>>> those n-1 known-good blocks (all data, or one parity and the rest
>>
>> We do not know whether parity is good or not because as far as I can
>> tell parity is not checksummed.
>
> I was about to write a devastating rebuttal of this... then I actually
> tested it, and holy crap you're right.
>
> I've just closed the terminal in question by accident, so I can't
> copy-and-paste, but the way I checked was:
>
> # mkfs.btrfs -mraid1 -draid5 /dev/loop{0,1,2}
> # mount /dev/loop0 foo
> # dd if=/dev/urandom of=foo/file bs=4k count=32
> # umount /dev/loop0
> # btrfs-debug-tree /dev/loop0
>
> then look at the csum tree:
>
> item 0 key (EXTENT_CSUM EXTENT_CSUM 351469568) itemoff 16155 itemsize 128
>         extent csum item
>
> There is a single csum item in it, of length 128. At 4 bytes per csum,
> that's 32 checksums, which covers the 32 4KiB blocks I wrote, leaving
> nothing for the parity.
>
> This is fundamentally broken, and I think we need to change the wiki to
> indicate that the parity RAID implementation is not recommended, because
> it doesn't actually do the job it's meant to in a reliable way.
>
> :(

So item 4 now then, together with:

1. Rebuilds seemingly randomly decide, based on the filesystem, whether or not to take an insanely long time (it always happens on some arrays and never on others; I have yet to see a report where it happens intermittently).
2. Failed disks seem to occasionally cause irreversible data corruption.
3. The classic erasure-code write hole, just slightly different because of COW.

TBH, as much as I hate to say this, it looks like the raid5/6 code needs to be redone from scratch. At an absolute minimum, we need to put a warning in mkfs for people using raid5/6 to tell them they shouldn't be using it outside of testing.
Re: Adventures in btrfs raid5 disk recovery
On 2016-06-24 01:20, Chris Murphy wrote:
> On Thu, Jun 23, 2016 at 8:07 PM, Zygo Blaxell wrote:
>>> With simple files changing one character with vi and gedit, I get
>>> completely different logical and physical numbers with each change,
>>> so it's clearly cowing the entire stripe (192KiB in my 3 dev raid5).
>>
>> You are COWing the entire file because vi and gedit do truncate
>> followed by full-file write.
>
> I'm seeing the file inode changes with either a vi or gedit
> modification, even when file size is exactly the same, just character
> substitute. So as far as VFS and Btrfs are concerned, it's an entirely
> different file, so it's like faux-CoW that would have happened on any
> file system, not an overwrite.

Yes, at least Vim (which is what most Linux systems use for vi) writes to a temporary file then does a replace by rename. The idea is that POSIX implies this should be atomic (except it's not actually required by POSIX, and even on some journaled and COW filesystems, it isn't actually atomic).
Re: Adventures in btrfs raid5 disk recovery
On Fri, Jun 24, 2016 at 01:19:30PM +0300, Andrei Borzenkov wrote:
> On Fri, Jun 24, 2016 at 1:16 PM, Hugo Mills wrote:
> > On Fri, Jun 24, 2016 at 12:52:21PM +0300, Andrei Borzenkov wrote:
> >> On Fri, Jun 24, 2016 at 11:50 AM, Hugo Mills wrote:
> >> > On Fri, Jun 24, 2016 at 07:02:34AM +0300, Andrei Borzenkov wrote:
> >> >> 24.06.2016 04:47, Zygo Blaxell wrote:
> >> >> > On Thu, Jun 23, 2016 at 06:26:22PM -0600, Chris Murphy wrote:
> >> >> >> On Thu, Jun 23, 2016 at 1:32 PM, Goffredo Baroncelli
> >> >> >> wrote:
> >> >> >>> The raid5 write hole is avoided in BTRFS (and in ZFS) thanks to the
> >> >> >>> checksum.
> >> >> >>
> >> >> >> Yeah I'm kinda confused on this point.
> >> >> >>
> >> >> >> https://btrfs.wiki.kernel.org/index.php/RAID56
> >> >> >>
> >> >> >> It says there is a write hole for Btrfs. But defines it in terms of
> >> >> >> parity possibly being stale after a crash. I think the term comes not
> >> >> >> from merely parity being wrong but parity being wrong *and* then being
> >> >> >> used to wrongly reconstruct data because it's blindly trusted.
> >> >> >
> >> >> > I think the opposite is more likely, as the layers above raid56
> >> >> > seem to check the data against sums before raid56 ever sees it.
> >> >> > (If those layers seem inverted to you, I agree, but OTOH there are
> >> >> > probably good reason to do it that way).
> >> >>
> >> >> Yes, that's how I read code as well. btrfs layer that does checksumming
> >> >> is unaware of parity blocks at all; for all practical purposes they do
> >> >> not exist. What happens is approximately
> >> >>
> >> >> 1. logical extent is allocated and checksum computed
> >> >> 2. it is mapped to physical area(s) on disks, skipping over what would
> >> >> be parity blocks
> >> >> 3. when these areas are written out, RAID56 parity is computed and filled in
> >> >>
> >> >> IOW btrfs checksums are for (meta)data and RAID56 parity is not data.
> >> >
> >> > Checksums are not parity, correct. However, every data block
> >> > (including, I think, the parity) is checksummed and put into the csum
> >> > tree. This allows the FS to determine where damage has occurred,
> >> > rather than simply detecting that it has occurred (which would be the
> >> > case if the parity doesn't match the data, or if the two copies of a
> >> > RAID-1 array don't match).
> >>
> >> Yes, that is what I wrote below. But that means that RAID5 with one
> >> degraded disk won't be able to reconstruct data on this degraded disk
> >> because reconstructed extent content won't match checksum. Which kinda
> >> makes RAID5 pointless.
> >
> > Eh? How do you come to that conclusion?
> >
> > For data, say you have n-1 good devices, with n-1 blocks on them.
> > Each block has a checksum in the metadata, so you can read that
> > checksum, read the blocks, and verify that they're not damaged. From
> > those n-1 known-good blocks (all data, or one parity and the rest
>
> We do not know whether parity is good or not because as far as I can
> tell parity is not checksummed.

I was about to write a devastating rebuttal of this... then I actually tested it, and holy crap you're right.

I've just closed the terminal in question by accident, so I can't copy-and-paste, but the way I checked was:

# mkfs.btrfs -mraid1 -draid5 /dev/loop{0,1,2}
# mount /dev/loop0 foo
# dd if=/dev/urandom of=foo/file bs=4k count=32
# umount /dev/loop0
# btrfs-debug-tree /dev/loop0

then look at the csum tree:

item 0 key (EXTENT_CSUM EXTENT_CSUM 351469568) itemoff 16155 itemsize 128
        extent csum item

There is a single csum item in it, of length 128. At 4 bytes per csum, that's 32 checksums, which covers the 32 4KiB blocks I wrote, leaving nothing for the parity.

This is fundamentally broken, and I think we need to change the wiki to indicate that the parity RAID implementation is not recommended, because it doesn't actually do the job it's meant to in a reliable way.

:(

Hugo.

> > data) you can reconstruct the remaining block. That reconstructed
> > block won't be checked against the csum for the missing block -- it'll
> > just be written and a new csum for it written with it.
>
> So we have silent corruption. I fail to understand how it is an improvement :)
>
> > Hugo.
>
> >> ...
> >>
> >> >> > It looks like uncorrectable failures might occur because parity is
> >> >> > correct, but the parity checksum is out of date, so the parity checksum
> >> >> > doesn't match even though data blindly reconstructed from the parity
> >> >> > *would* match the data.
> >> >>
> >> >> Yep, that is how I read it too. So if your data is checksummed, it
> >> >> should at least avoid silent corruption.
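The same check can be scripted instead of eyeballed (a sketch; it assumes the default 4-byte crc32c csums and 4 KiB blocks):

# total itemsize of all csum items, divided by 4 bytes per csum,
# gives the number of 4 KiB blocks the csum tree covers
btrfs-debug-tree /dev/loop0 | awk '/EXTENT_CSUM/ { for (i = 1; i <= NF; i++) if ($i == "itemsize") sum += $(i+1) } END { print sum / 4, "blocks checksummed" }'
# here: 128 / 4 = 32 csums, matching the 32 data blocks written and
# leaving the parity blocks of each stripe uncovered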
Re: Adventures in btrfs raid5 disk recovery
On Fri, Jun 24, 2016 at 1:16 PM, Hugo Mills wrote:
> On Fri, Jun 24, 2016 at 12:52:21PM +0300, Andrei Borzenkov wrote:
>> On Fri, Jun 24, 2016 at 11:50 AM, Hugo Mills wrote:
>> > On Fri, Jun 24, 2016 at 07:02:34AM +0300, Andrei Borzenkov wrote:
>> >> 24.06.2016 04:47, Zygo Blaxell wrote:
>> >> > On Thu, Jun 23, 2016 at 06:26:22PM -0600, Chris Murphy wrote:
>> >> >> On Thu, Jun 23, 2016 at 1:32 PM, Goffredo Baroncelli
>> >> >> wrote:
>> >> >>> The raid5 write hole is avoided in BTRFS (and in ZFS) thanks to the
>> >> >>> checksum.
>> >> >>
>> >> >> Yeah I'm kinda confused on this point.
>> >> >>
>> >> >> https://btrfs.wiki.kernel.org/index.php/RAID56
>> >> >>
>> >> >> It says there is a write hole for Btrfs. But defines it in terms of
>> >> >> parity possibly being stale after a crash. I think the term comes not
>> >> >> from merely parity being wrong but parity being wrong *and* then being
>> >> >> used to wrongly reconstruct data because it's blindly trusted.
>> >> >
>> >> > I think the opposite is more likely, as the layers above raid56
>> >> > seem to check the data against sums before raid56 ever sees it.
>> >> > (If those layers seem inverted to you, I agree, but OTOH there are
>> >> > probably good reason to do it that way).
>> >>
>> >> Yes, that's how I read code as well. btrfs layer that does checksumming
>> >> is unaware of parity blocks at all; for all practical purposes they do
>> >> not exist. What happens is approximately
>> >>
>> >> 1. logical extent is allocated and checksum computed
>> >> 2. it is mapped to physical area(s) on disks, skipping over what would
>> >> be parity blocks
>> >> 3. when these areas are written out, RAID56 parity is computed and filled in
>> >>
>> >> IOW btrfs checksums are for (meta)data and RAID56 parity is not data.
>> >
>> > Checksums are not parity, correct. However, every data block
>> > (including, I think, the parity) is checksummed and put into the csum
>> > tree. This allows the FS to determine where damage has occurred,
>> > rather than simply detecting that it has occurred (which would be the
>> > case if the parity doesn't match the data, or if the two copies of a
>> > RAID-1 array don't match).
>>
>> Yes, that is what I wrote below. But that means that RAID5 with one
>> degraded disk won't be able to reconstruct data on this degraded disk
>> because reconstructed extent content won't match checksum. Which kinda
>> makes RAID5 pointless.
>
> Eh? How do you come to that conclusion?
>
> For data, say you have n-1 good devices, with n-1 blocks on them.
> Each block has a checksum in the metadata, so you can read that
> checksum, read the blocks, and verify that they're not damaged. From
> those n-1 known-good blocks (all data, or one parity and the rest

We do not know whether parity is good or not because as far as I can tell parity is not checksummed.

> data) you can reconstruct the remaining block. That reconstructed
> block won't be checked against the csum for the missing block -- it'll
> just be written and a new csum for it written with it.

So we have silent corruption. I fail to understand how it is an improvement :)

> Hugo.

>> ...
>>
>> >> > It looks like uncorrectable failures might occur because parity is
>> >> > correct, but the parity checksum is out of date, so the parity checksum
>> >> > doesn't match even though data blindly reconstructed from the parity
>> >> > *would* match the data.
>> >>
>> >> Yep, that is how I read it too. So if your data is checksummed, it
>> >> should at least avoid silent corruption.
Re: Adventures in btrfs raid5 disk recovery
On Fri, Jun 24, 2016 at 12:52:21PM +0300, Andrei Borzenkov wrote:
> On Fri, Jun 24, 2016 at 11:50 AM, Hugo Mills wrote:
> > On Fri, Jun 24, 2016 at 07:02:34AM +0300, Andrei Borzenkov wrote:
> >> 24.06.2016 04:47, Zygo Blaxell wrote:
> >> > On Thu, Jun 23, 2016 at 06:26:22PM -0600, Chris Murphy wrote:
> >> >> On Thu, Jun 23, 2016 at 1:32 PM, Goffredo Baroncelli
> >> >> wrote:
> >> >>> The raid5 write hole is avoided in BTRFS (and in ZFS) thanks to the
> >> >>> checksum.
> >> >>
> >> >> Yeah I'm kinda confused on this point.
> >> >>
> >> >> https://btrfs.wiki.kernel.org/index.php/RAID56
> >> >>
> >> >> It says there is a write hole for Btrfs. But defines it in terms of
> >> >> parity possibly being stale after a crash. I think the term comes not
> >> >> from merely parity being wrong but parity being wrong *and* then being
> >> >> used to wrongly reconstruct data because it's blindly trusted.
> >> >
> >> > I think the opposite is more likely, as the layers above raid56
> >> > seem to check the data against sums before raid56 ever sees it.
> >> > (If those layers seem inverted to you, I agree, but OTOH there are
> >> > probably good reason to do it that way).
> >>
> >> Yes, that's how I read code as well. btrfs layer that does checksumming
> >> is unaware of parity blocks at all; for all practical purposes they do
> >> not exist. What happens is approximately
> >>
> >> 1. logical extent is allocated and checksum computed
> >> 2. it is mapped to physical area(s) on disks, skipping over what would
> >> be parity blocks
> >> 3. when these areas are written out, RAID56 parity is computed and filled in
> >>
> >> IOW btrfs checksums are for (meta)data and RAID56 parity is not data.
> >
> > Checksums are not parity, correct. However, every data block
> > (including, I think, the parity) is checksummed and put into the csum
> > tree. This allows the FS to determine where damage has occurred,
> > rather than simply detecting that it has occurred (which would be the
> > case if the parity doesn't match the data, or if the two copies of a
> > RAID-1 array don't match).
>
> Yes, that is what I wrote below. But that means that RAID5 with one
> degraded disk won't be able to reconstruct data on this degraded disk
> because reconstructed extent content won't match checksum. Which kinda
> makes RAID5 pointless.

Eh? How do you come to that conclusion?

For data, say you have n-1 good devices, with n-1 blocks on them. Each block has a checksum in the metadata, so you can read that checksum, read the blocks, and verify that they're not damaged. From those n-1 known-good blocks (all data, or one parity and the rest data) you can reconstruct the remaining block. That reconstructed block won't be checked against the csum for the missing block -- it'll just be written and a new csum for it written with it.

Hugo.

> ...
>
> >> > It looks like uncorrectable failures might occur because parity is
> >> > correct, but the parity checksum is out of date, so the parity checksum
> >> > doesn't match even though data blindly reconstructed from the parity
> >> > *would* match the data.
> >>
> >> Yep, that is how I read it too. So if your data is checksummed, it
> >> should at least avoid silent corruption.
Re: Adventures in btrfs raid5 disk recovery
On Fri, Jun 24, 2016 at 8:20 AM, Chris Murphy wrote:
> [root@f24s ~]# filefrag -v /mnt/5/*
> Filesystem type is: 9123683e
> File size of /mnt/5/a.txt is 16383 (4 blocks of 4096 bytes)
>  ext: logical_offset: physical_offset: length: expected: flags:
>    0:    0..       3: 2931712..2931715:      4:          last,eof

Hmm ... I wonder what is wrong here (openSUSE Tumbleweed):

nohostname:~ # filefrag -v /mnt/1
Filesystem type is: 9123683e
File size of /mnt/1 is 3072 (1 block of 4096 bytes)
 ext: logical_offset: physical_offset: length: expected: flags:
   0:    0..       0: 269376..269376:      1:          last,eof
/mnt/1: 1 extent found

But!

nohostname:~ # filefrag -v /etc/passwd
Filesystem type is: 9123683e
File size of /etc/passwd is 1527 (1 block of 4096 bytes)
 ext: logical_offset: physical_offset: length: expected: flags:
   0:    0..    4095: 0..4095:   4096:          last,not_aligned,inline,eof
/etc/passwd: 1 extent found
nohostname:~ #

Why does it work for one filesystem but not for the other?

...

> So at the old address, it shows the "a..." is still there. And at
> the added single block for this file at new logical and physical
> addresses, is the modification substituting the first "a" for "g".
>
> In this case, no rmw, no partial stripe modification, and no data
> already on-disk is at risk.

You misunderstand the nature of the problem. What is put at risk is data that is already on disk and "shares" parity with new data. As an example, here are the first 64K in several extents on a 4-disk RAID5 with, so far, a single data chunk:

item 6 key (FIRST_CHUNK_TREE CHUNK_ITEM 1103101952) itemoff 15491 itemsize 176
        chunk length 3221225472 owner 2 stripe_len 65536
        type DATA|RAID5 num_stripes 4
        stripe 0 devid 4 offset 9437184
        dev uuid: ed13e42e-1633-4230-891c-897e86d1c0be
        stripe 1 devid 3 offset 9437184
        dev uuid: 10032b95-3f48-4ea0-a9ee-90064c53da1f
        stripe 2 devid 2 offset 1074790400
        dev uuid: cd749bd9-3d72-43b4-89a8-45e4a92658cf
        stripe 3 devid 1 offset 1094713344
        dev uuid: 41538b9f-3869-4c32-b3e2-30aa2ea1534e
        dev extent chunk_tree 3 chunk objectid 256 chunk offset 1103101952 length 1073741824
item 5 key (1 DEV_EXTENT 1094713344) itemoff 16027 itemsize 48
        dev extent chunk_tree 3 chunk objectid 256 chunk offset 1103101952 length 1073741824
item 7 key (2 DEV_EXTENT 1074790400) itemoff 15931 itemsize 48
        dev extent chunk_tree 3 chunk objectid 256 chunk offset 1103101952 length 1073741824
item 9 key (3 DEV_EXTENT 9437184) itemoff 15835 itemsize 48
        dev extent chunk_tree 3 chunk objectid 256 chunk offset 1103101952 length 1073741824
item 11 key (4 DEV_EXTENT 9437184) itemoff 15739 itemsize 48
        dev extent chunk_tree 3 chunk objectid 256 chunk offset 1103101952 length 1073741824

where devid 1 = sdb1, 2 = sdc1 etc.

Now let's write some data (I created several files) up to 64K in size:

mirror 1 logical 1103364096 physical 1074855936 device /dev/sdc1
mirror 2 logical 1103364096 physical 9502720 device /dev/sde1
mirror 1 logical 1103368192 physical 1074860032 device /dev/sdc1
mirror 2 logical 1103368192 physical 9506816 device /dev/sde1
mirror 1 logical 1103372288 physical 1074864128 device /dev/sdc1
mirror 2 logical 1103372288 physical 9510912 device /dev/sde1
mirror 1 logical 1103376384 physical 1074868224 device /dev/sdc1
mirror 2 logical 1103376384 physical 9515008 device /dev/sde1
mirror 1 logical 1103380480 physical 1074872320 device /dev/sdc1
mirror 2 logical 1103380480 physical 9519104 device /dev/sde1

Note that btrfs allocates 64K on the same device before switching to the next one. What is a bit misleading here: sdc1 is data and sde1 is parity (you can see it in the checksum tree, where only items for sdc1 exist).

Now let's write the next 64K and see what happens:

nohostname:~ # btrfs-map-logical -l 1103429632 -b 65536 /dev/sdb1
mirror 1 logical 1103429632 physical 1094778880 device /dev/sdb1
mirror 2 logical 1103429632 physical 9502720 device /dev/sde1

See? btrfs now allocates a new stripe on sdb1; this stripe is at the same offset as the previous one on sdc1 (64K) and so shares the same parity stripe on sde1. If you compare the 64K on sde1 at offset 9502720 before and after, you will see that it has changed. IN PLACE. Without CoW. This is exactly what puts the existing data on sdc1 at risk: if sdb1 is updated but sde1 is not, an attempt to reconstruct the data on sdc1 will either fail (if we have checksums) or result in silent corruption.
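One way to watch that in-place parity rewrite directly (a sketch; the mount point is an assumption, and the 9502720-byte offset from above equals 4096 * 2320, which will differ on another filesystem):

sync
dd if=/dev/sde1 bs=4096 skip=2320 count=16 2>/dev/null | md5sum  # parity strip before
dd if=/dev/urandom of=/mnt/raid/next64k bs=64k count=1
sync
dd if=/dev/sde1 bs=4096 skip=2320 count=16 2>/dev/null | md5sum  # parity strip after
# a changed digest on sde1 while sdc1's data is untouched shows the shared
# parity strip being overwritten in place rather than CoW'd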
Re: Adventures in btrfs raid5 disk recovery
On Fri, Jun 24, 2016 at 11:50 AM, Hugo Mills wrote:
> On Fri, Jun 24, 2016 at 07:02:34AM +0300, Andrei Borzenkov wrote:
>> 24.06.2016 04:47, Zygo Blaxell wrote:
>> > On Thu, Jun 23, 2016 at 06:26:22PM -0600, Chris Murphy wrote:
>> >> On Thu, Jun 23, 2016 at 1:32 PM, Goffredo Baroncelli
>> >> wrote:
>> >>> The raid5 write hole is avoided in BTRFS (and in ZFS) thanks to the
>> >>> checksum.
>> >>
>> >> Yeah I'm kinda confused on this point.
>> >>
>> >> https://btrfs.wiki.kernel.org/index.php/RAID56
>> >>
>> >> It says there is a write hole for Btrfs. But defines it in terms of
>> >> parity possibly being stale after a crash. I think the term comes not
>> >> from merely parity being wrong but parity being wrong *and* then being
>> >> used to wrongly reconstruct data because it's blindly trusted.
>> >
>> > I think the opposite is more likely, as the layers above raid56
>> > seem to check the data against sums before raid56 ever sees it.
>> > (If those layers seem inverted to you, I agree, but OTOH there are
>> > probably good reason to do it that way).
>>
>> Yes, that's how I read code as well. btrfs layer that does checksumming
>> is unaware of parity blocks at all; for all practical purposes they do
>> not exist. What happens is approximately
>>
>> 1. logical extent is allocated and checksum computed
>> 2. it is mapped to physical area(s) on disks, skipping over what would
>> be parity blocks
>> 3. when these areas are written out, RAID56 parity is computed and filled in
>>
>> IOW btrfs checksums are for (meta)data and RAID56 parity is not data.
>
> Checksums are not parity, correct. However, every data block
> (including, I think, the parity) is checksummed and put into the csum
> tree. This allows the FS to determine where damage has occurred,
> rather than simply detecting that it has occurred (which would be the
> case if the parity doesn't match the data, or if the two copies of a
> RAID-1 array don't match).

Yes, that is what I wrote below. But that means that RAID5 with one degraded disk won't be able to reconstruct data on this degraded disk because reconstructed extent content won't match checksum. Which kinda makes RAID5 pointless.

...

>> > It looks like uncorrectable failures might occur because parity is
>> > correct, but the parity checksum is out of date, so the parity checksum
>> > doesn't match even though data blindly reconstructed from the parity
>> > *would* match the data.
>>
>> Yep, that is how I read it too. So if your data is checksummed, it
>> should at least avoid silent corruption.
Re: Adventures in btrfs raid5 disk recovery
On Fri, Jun 24, 2016 at 07:02:34AM +0300, Andrei Borzenkov wrote:
> 24.06.2016 04:47, Zygo Blaxell wrote:
> > On Thu, Jun 23, 2016 at 06:26:22PM -0600, Chris Murphy wrote:
> >> On Thu, Jun 23, 2016 at 1:32 PM, Goffredo Baroncelli
> >> wrote:
> >>> The raid5 write hole is avoided in BTRFS (and in ZFS) thanks to the
> >>> checksum.
> >>
> >> Yeah I'm kinda confused on this point.
> >>
> >> https://btrfs.wiki.kernel.org/index.php/RAID56
> >>
> >> It says there is a write hole for Btrfs. But defines it in terms of
> >> parity possibly being stale after a crash. I think the term comes not
> >> from merely parity being wrong but parity being wrong *and* then being
> >> used to wrongly reconstruct data because it's blindly trusted.
> >
> > I think the opposite is more likely, as the layers above raid56
> > seem to check the data against sums before raid56 ever sees it.
> > (If those layers seem inverted to you, I agree, but OTOH there are
> > probably good reason to do it that way).
>
> Yes, that's how I read code as well. btrfs layer that does checksumming
> is unaware of parity blocks at all; for all practical purposes they do
> not exist. What happens is approximately
>
> 1. logical extent is allocated and checksum computed
> 2. it is mapped to physical area(s) on disks, skipping over what would
> be parity blocks
> 3. when these areas are written out, RAID56 parity is computed and filled in
>
> IOW btrfs checksums are for (meta)data and RAID56 parity is not data.

Checksums are not parity, correct. However, every data block (including, I think, the parity) is checksummed and put into the csum tree. This allows the FS to determine where damage has occurred, rather than simply detecting that it has occurred (which would be the case if the parity doesn't match the data, or if the two copies of a RAID-1 array don't match).

(Note that csums for metadata are stored in the metadata block itself, not in the csum tree).

Hugo.

> > It looks like uncorrectable failures might occur because parity is
> > correct, but the parity checksum is out of date, so the parity checksum
> > doesn't match even though data blindly reconstructed from the parity
> > *would* match the data.
>
> Yep, that is how I read it too. So if your data is checksummed, it
> should at least avoid silent corruption.
>
> >> I don't read code well enough, but I'd be surprised if Btrfs
> >> reconstructs from parity and doesn't then check the resulting
> >> reconstructed data to its EXTENT_CSUM.
> >
> > I wouldn't be surprised if both things happen in different code paths,
> > given the number of different paths leading into the raid56 code and
> > the number of distinct failure modes it seems to have.
>
> Well, the problem is that parity block cannot be redirected on write as
> data blocks; which makes it impossible to version control it. The only
> solution I see is to always use full stripe writes by either wasting
> time in fixed width stripe or using variable width, so that every stripe
> always gets new version of parity. This makes it possible to keep parity
> checksums like data checksums.
Re: Adventures in btrfs raid5 disk recovery
On Thu, Jun 23, 2016 at 8:07 PM, Zygo Blaxell wrote:
>> With simple files changing one character with vi and gedit,
>> I get completely different logical and physical numbers with each
>> change, so it's clearly cowing the entire stripe (192KiB in my 3 dev
>> raid5).
>
> You are COWing the entire file because vi and gedit do truncate followed
> by full-file write.

I'm seeing the file inode changes with either a vi or gedit modification, even when the file size is exactly the same, just a character substitute. So as far as VFS and Btrfs are concerned, it's an entirely different file, so it's like the faux-CoW that would have happened on any file system, not an overwrite.

> Try again with 'dd conv=notrunc bs=4k count=1 seek=N of=...' or
> edit the file with a sector-level hex editor.

The inode is now the same, one of the 4096 byte blocks is dereferenced, a new 4096 byte block is referenced and written, the other 3 blocks remain untouched, and the other files in the stripe remain untouched. So it's pretty clearly cow'd in this case.

[root@f24s ~]# filefrag -v /mnt/5/*
Filesystem type is: 9123683e
File size of /mnt/5/a.txt is 16383 (4 blocks of 4096 bytes)
 ext: logical_offset: physical_offset: length: expected: flags:
   0:    0..       3: 2931712..2931715:      4:          last,eof
/mnt/5/a.txt: 1 extent found
File size of /mnt/5/b.txt is 16383 (4 blocks of 4096 bytes)
 ext: logical_offset: physical_offset: length: expected: flags:
   0:    0..       3: 2931716..2931719:      4:          last,eof
/mnt/5/b.txt: 1 extent found
File size of /mnt/5/c.txt is 16383 (4 blocks of 4096 bytes)
 ext: logical_offset: physical_offset: length: expected: flags:
   0:    0..       3: 2931720..2931723:      4:          last,eof
/mnt/5/c.txt: 1 extent found
File size of /mnt/5/d.txt is 16383 (4 blocks of 4096 bytes)
 ext: logical_offset: physical_offset: length: expected: flags:
   0:    0..       3: 2931724..2931727:      4:          last,eof
/mnt/5/d.txt: 1 extent found
File size of /mnt/5/e.txt is 16383 (4 blocks of 4096 bytes)
 ext: logical_offset: physical_offset: length: expected: flags:
   0:    0..       3: 2931728..2931731:      4:          last,eof
/mnt/5/e.txt: 1 extent found
[root@f24s ~]# ls -li /mnt/5/*
285 -rw-r--r--. 1 root root 16383 Jun 23 22:57 /mnt/5/a.txt
286 -rw-r--r--. 1 root root 16383 Jun 23 22:57 /mnt/5/b.txt
287 -rw-r--r--. 1 root root 16383 Jun 23 22:57 /mnt/5/c.txt
288 -rw-r--r--. 1 root root 16383 Jun 23 22:57 /mnt/5/d.txt
289 -rw-r--r--. 1 root root 16383 Jun 23 22:57 /mnt/5/e.txt
[root@f24s ~]# btrfs-map-logical -l $[4096*2931712] /dev/VG/a
mirror 1 logical 12008292352 physical 34603008 device /dev/mapper/VG-a
mirror 2 logical 12008292352 physical 1108344832 device /dev/mapper/VG-c
[root@f24s ~]# btrfs-map-logical -l $[4096*2931716] /dev/VG/a
mirror 1 logical 12008308736 physical 34619392 device /dev/mapper/VG-a
mirror 2 logical 12008308736 physical 1108361216 device /dev/mapper/VG-c
[root@f24s ~]# btrfs-map-logical -l $[4096*2931720] /dev/VG/a
mirror 1 logical 12008325120 physical 34635776 device /dev/mapper/VG-a
mirror 2 logical 12008325120 physical 1108377600 device /dev/mapper/VG-c
[root@f24s ~]# btrfs-map-logical -l $[4096*2931724] /dev/VG/a
mirror 1 logical 12008341504 physical 34652160 device /dev/mapper/VG-a
mirror 2 logical 12008341504 physical 1108393984 device /dev/mapper/VG-c
[root@f24s ~]# btrfs-map-logical -l $[4096*2931728] /dev/VG/a
mirror 1 logical 12008357888 physical 1048576 device /dev/mapper/VG-b
mirror 2 logical 12008357888 physical 1108344832 device /dev/mapper/VG-c
[root@f24s ~]# echo -n "g" | dd of=/mnt/5/a.txt conv=notrunc
0+1 records in
0+1 records out
1 byte copied, 0.000314582 s, 3.2 kB/s
[root@f24s ~]# ls -li /mnt/5/*
285 -rw-r--r--. 1 root root 16383 Jun 23 23:06 /mnt/5/a.txt
286 -rw-r--r--. 1 root root 16383 Jun 23 22:57 /mnt/5/b.txt
287 -rw-r--r--. 1 root root 16383 Jun 23 22:57 /mnt/5/c.txt
288 -rw-r--r--. 1 root root 16383 Jun 23 22:57 /mnt/5/d.txt
289 -rw-r--r--. 1 root root 16383 Jun 23 22:57 /mnt/5/e.txt
[root@f24s ~]# filefrag -v /mnt/5/*
Filesystem type is: 9123683e
File size of /mnt/5/a.txt is 16383 (4 blocks of 4096 bytes)
 ext: logical_offset: physical_offset: length: expected: flags:
   0:    0..       0: 2931732..2931732:      1:
   1:    1..       3: 2931713..2931715:      3: 2931733: last,eof
/mnt/5/a.txt: 2 extents found
File size of /mnt/5/b.txt is 16383 (4 blocks of 4096 bytes)
 ext: logical_offset: physical_offset: length: expected: flags:
   0:    0..       3: 2931716..2931719:      4:          last,eof
/mnt/5/b.txt: 1 extent found
File size of /mnt/5/c.txt is 16383 (4 blocks of 4096 bytes)
 ext: logical_offset: physical_offset: length: expected: flags:
   0:    0..       3: 2931720..2931723:      4:
Re: Adventures in btrfs raid5 disk recovery
24.06.2016 04:47, Zygo Blaxell wrote:
> On Thu, Jun 23, 2016 at 06:26:22PM -0600, Chris Murphy wrote:
>> On Thu, Jun 23, 2016 at 1:32 PM, Goffredo Baroncelli wrote:
>>> The raid5 write hole is avoided in BTRFS (and in ZFS) thanks to the
>>> checksum.
>>
>> Yeah I'm kinda confused on this point.
>>
>> https://btrfs.wiki.kernel.org/index.php/RAID56
>>
>> It says there is a write hole for Btrfs. But defines it in terms of
>> parity possibly being stale after a crash. I think the term comes not
>> from merely parity being wrong but parity being wrong *and* then being
>> used to wrongly reconstruct data because it's blindly trusted.
>
> I think the opposite is more likely, as the layers above raid56
> seem to check the data against sums before raid56 ever sees it.
> (If those layers seem inverted to you, I agree, but OTOH there are
> probably good reason to do it that way).

Yes, that's how I read the code as well. The btrfs layer that does checksumming is unaware of parity blocks at all; for all practical purposes they do not exist. What happens is approximately:

1. logical extent is allocated and checksum computed
2. it is mapped to physical area(s) on disks, skipping over what would be parity blocks
3. when these areas are written out, RAID56 parity is computed and filled in

IOW btrfs checksums are for (meta)data and RAID56 parity is not data.

> It looks like uncorrectable failures might occur because parity is
> correct, but the parity checksum is out of date, so the parity checksum
> doesn't match even though data blindly reconstructed from the parity
> *would* match the data.

Yep, that is how I read it too. So if your data is checksummed, it should at least avoid silent corruption.

>> I don't read code well enough, but I'd be surprised if Btrfs
>> reconstructs from parity and doesn't then check the resulting
>> reconstructed data to its EXTENT_CSUM.
>
> I wouldn't be surprised if both things happen in different code paths,
> given the number of different paths leading into the raid56 code and
> the number of distinct failure modes it seems to have.

Well, the problem is that the parity block cannot be redirected on write as data blocks can, which makes it impossible to version-control it. The only solution I see is to always use full stripe writes, by either wasting time in a fixed-width stripe or using variable width, so that every stripe always gets a new version of its parity. This makes it possible to keep parity checksums like data checksums.
Re: Adventures in btrfs raid5 disk recovery
On Thu, Jun 23, 2016 at 05:37:09PM -0600, Chris Murphy wrote:
> > I expect that parity is in this data block group, and therefore is
> > checksummed the same as any other data in that block group.
>
> This appears to be wrong. Comparing the same file, one file only, on
> two new Btrfs volumes, one volume single, one volume raid5, I get a
> single csum tree entry:
>
> raid5
> item 0 key (EXTENT_CSUM EXTENT_CSUM 12009865216) itemoff 16155 itemsize 128
>         extent csum item
>
> single
> item 0 key (EXTENT_CSUM EXTENT_CSUM 2168717312) itemoff 16155 itemsize 128
>         extent csum item
>
> They're both the same size. They both contain the same data. So it
> looks like parity is not separately checksummed.

I'm inclined to agree, because I didn't find any code that *writes* parity csums... but if there are no parity csums, what does this code do?

scrub.c:

static noinline_for_stack int scrub_raid56_parity(struct scrub_ctx *sctx,
[...]
                ret = btrfs_lookup_csums_range(csum_root,
                                               extent_logical,
                                               extent_logical + extent_len - 1,
                                               &sctx->csum_list, 1);
                if (ret)
                        goto out;

                ret = scrub_extent_for_parity(sparity, extent_logical,
                                              extent_len,
                                              extent_physical,
                                              extent_dev, flags,
                                              generation,
                                              extent_mirror_num);
Re: Adventures in btrfs raid5 disk recovery
On Thu, Jun 23, 2016 at 05:37:09PM -0600, Chris Murphy wrote:
> > So in your example of degraded writes, no matter what the on disk
> > format makes it discoverable there is a problem:
> >
> > A. The "updating" is still always COW so there is no overwriting.
>
> There is RMW code in btrfs/raid56.c but I don't know when that gets
> triggered.

RMW seems to be for cases where part of a stripe is modified but the entire stripe has not yet been read into memory. It reads the remaining blocks (reconstructing missing blocks if necessary), then calculates new parity blocks.

> With simple files changing one character with vi and gedit,
> I get completely different logical and physical numbers with each
> change, so it's clearly cowing the entire stripe (192KiB in my 3 dev
> raid5).

You are COWing the entire file because vi and gedit do truncate followed by full-file write. Try again with 'dd conv=notrunc bs=4k count=1 seek=N of=...' or edit the file with a sector-level hex editor.

> [root@f24s ~]# filefrag -v /mnt/5/64k-a-then64k-b.txt
> Filesystem type is: 9123683e
> File size of /mnt/5/64k-a-then64k-b.txt is 131072 (32 blocks of 4096 bytes)
>  ext: logical_offset: physical_offset: length: expected: flags:
>    0:    0..      31: 2931744..2931775:     32:          last,eof
> /mnt/5/64k-a-then64k-b.txt: 1 extent found
> [root@f24s ~]# btrfs-map-logical -l $[4096*2931744] /dev/VG/a
> mirror 1 logical 12008423424 physical 1114112 device /dev/mapper/VG-b
> mirror 2 logical 12008423424 physical 34668544 device /dev/mapper/VG-a
> [root@f24s ~]# vi /mnt/5/64k-a-then64k-b.txt
> [root@f24s ~]# filefrag -v /mnt/5/64k-a-then64k-b.txt
> Filesystem type is: 9123683e
> File size of /mnt/5/64k-a-then64k-b.txt is 131072 (32 blocks of 4096 bytes)
>  ext: logical_offset: physical_offset: length: expected: flags:
>    0:    0..      31: 2931776..2931807:     32:          last,eof
> /mnt/5/64k-a-then64k-b.txt: 1 extent found
> [root@f24s ~]# btrfs-map-logical -l $[4096*29317776] /dev/VG/a
> No extent found at range [120085610496,120085626880)
> [root@f24s ~]# btrfs-map-logical -l $[4096*2931776] /dev/VG/a
> mirror 1 logical 12008554496 physical 1108475904 device /dev/mapper/VG-c
> mirror 2 logical 12008554496 physical 1179648 device /dev/mapper/VG-b
> [root@f24s ~]#
>
> There is a neat bug/rfe I found for btrfs-map-logical [1], it doesn't
> report back the physical locations for all num_stripes on the volume.
> It only spits back two, and sometimes it's the two data strips,
> sometimes it's one data and one parity strip.
>
> [1] https://bugzilla.kernel.org/show_bug.cgi?id=120941
>
> --
> Chris Murphy
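Spelled out with concrete numbers (an illustrative sketch; the file and block offset are arbitrary):

# overwrite the third 4 KiB block of the file in place, without truncating
dd if=/dev/urandom of=/mnt/5/a.txt conv=notrunc bs=4k count=1 seek=2
sync
# only the overwritten block should get a new physical address;
# the untouched blocks of the extent keep theirs
filefrag -v /mnt/5/a.txt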
Re: Adventures in btrfs raid5 disk recovery
On Thu, Jun 23, 2016 at 06:26:22PM -0600, Chris Murphy wrote:
> On Thu, Jun 23, 2016 at 1:32 PM, Goffredo Baroncelli wrote:
> > The raid5 write hole is avoided in BTRFS (and in ZFS) thanks to the
> > checksum.
>
> Yeah I'm kinda confused on this point.
>
> https://btrfs.wiki.kernel.org/index.php/RAID56
>
> It says there is a write hole for Btrfs. But defines it in terms of
> parity possibly being stale after a crash. I think the term comes not
> from merely parity being wrong but parity being wrong *and* then being
> used to wrongly reconstruct data because it's blindly trusted.

I think the opposite is more likely, as the layers above raid56 seem to check the data against sums before raid56 ever sees it. (If those layers seem inverted to you, I agree, but OTOH there are probably good reasons to do it that way.)

It looks like uncorrectable failures might occur because parity is correct, but the parity checksum is out of date, so the parity checksum doesn't match even though data blindly reconstructed from the parity *would* match the data.

> I don't read code well enough, but I'd be surprised if Btrfs
> reconstructs from parity and doesn't then check the resulting
> reconstructed data to its EXTENT_CSUM.

I wouldn't be surprised if both things happen in different code paths, given the number of different paths leading into the raid56 code and the number of distinct failure modes it seems to have.
Re: Adventures in btrfs raid5 disk recovery
On Thu, Jun 23, 2016 at 09:32:50PM +0200, Goffredo Baroncelli wrote:
> The raid write hole happens when a stripe is not completely written
> on the platters: the parity and the related data mismatch. In this
> case a "simple" raid5 may return wrong data if the parity is used to
> compute the data. But this happens because a "simple" raid5 is unable
> to detect whether the returned data is right or not.
>
> The raid5 write hole is avoided in BTRFS (and in ZFS) thanks to the
> checksum.

Checksums do not help with the raid5 write hole. The way btrfs does checksums might even make it worse.

ZFS reduces the number of disks in a stripe when a disk failure is detected so that writes are always in non-degraded mode, and they presumably avoid sub-stripe-width data allocations or use journalling to avoid the write hole. btrfs seems to use neither tactic. At best, btrfs will avoid creating new block groups on disks that are missing at mount time, and it doesn't deal with sub-stripe-width allocations at all.

I'm working from two assumptions, as I haven't found all the relevant code yet:

1. btrfs writes parity stripes at fixed locations relative to the data in the same stripe. If this is true, then the parity blocks are _not_ CoW while the data blocks and their checksums _are_ CoW. I don't know if the parity block checksums are also CoW.

2. btrfs sometimes puts data from two different transactions in the same stripe at the same time--a fundamental violation of the CoW concept. I inferred this from the logical block addresses.

Unless I'm missing something in the code somewhere, parity blocks can have out-of-date checksums for short periods of time between flushes and commits. This would lose data by falsely reporting valid parity blocks as checksum failures. If any *single* failure occurs at the same time (such as a missing write or disk failure), a small amount of data will be lost.

> BTRFS is able to discard the wrong data: i.e. in case of a 3 disks
> raid5, the right data may be extracted from the data1+data2 or if the
> checksum doesn't match from data1+parity or if the checksum doesn't
> match from data2+parity.

Suppose we have a sequence like this (3-disk RAID5 array, one stripe containing 2 data and 1 parity block) starting with the stripe empty:

1. write data block 1 to disk 1 of stripe (parity is now invalid, no checksum yet)
2. write parity block to disk 3 in stripe (parity becomes valid again, no checksum yet)
3. commit metadata pointing to block 1 (parity and checksums now valid)
4. write data block 2 to disk 2 of stripe (parity and parity checksum now invalid)
5. write parity block to disk 3 in stripe (parity valid now, parity checksum still invalid)
6. commit metadata pointing to block 2 (parity and checksums now valid)

We can be interrupted at any point between step 1 and 4 with no data loss. Before step 3 the data and parity blocks are not part of the extent tree, so their contents are irrelevant. After step 3 (assuming each step is completed in order) data block 1 is part of the extent tree and can be reconstructed if any one disk fails. This is the part of btrfs raid5 that works.

If we are interrupted between steps 4 and 6 (e.g. power fails), a single disk failure or corruption will cause data loss in block 1. Note that block 1 is *not* written between steps 4 and 6, so we are retroactively damaging some previously written data that is not part of the current transaction.
If we are interrupted between steps 4 and 5, we can no longer reconstruct block 1 (block2 ^ parity) or block 2 (block1 ^ parity) because the parity block doesn't match the data blocks in the same stripe (i.e. block1 ^ block2 != parity).

If we are interrupted between steps 5 and 6, the parity block checksum committed at step 3 will fail. Data block 2 will not be accessible since the metadata was not written to point to it, but data block 1 will be intact, readable, and have a correct checksum as long as none of the disks fail. This can be repaired by a scrub (scrub will simply throw the parity block away and reconstruct it from block1 and block2). If disk 1 fails before the next scrub, data block 1 will be lost because btrfs will believe the parity block is incorrect even though it is not.

This risk happens on *every* write to a stripe that is not a full stripe write and contains existing committed data blocks. It will occur more often on full and heavily fragmented filesystems (filesystems which have these properties are more likely to write new data on stripes that already contain old data). In cases where an entire stripe is written at once, or a stripe is partially filled but no further writes ever modify the stripe, everything works as intended in btrfs.

> NOTE2: this works if only one write is corrupted. If more write
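To make the step-4-to-5 window concrete, here is a toy one-byte "stripe" in shell (the values are arbitrary; real strips are 64K, but the XOR algebra is the same):

b1=$(( 0xA5 )); b2=$(( 0x3C ))
parity=$(( b1 ^ b2 ))                  # 0x99: stripe consistent after a commit
b2new=$(( 0x0F ))                      # step 4: disk 2 rewritten in place
# crash before step 5, then disk 1 fails; reconstruct block 1 from disk 2 + parity:
printf 'recovered block1 = 0x%02X (expected 0xA5)\n' $(( b2new ^ parity ))
# prints 0x96 -- the previously committed data in block 1 is gone,
# which is exactly the write hole described above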
Re: Adventures in btrfs raid5 disk recovery
On Thu, Jun 23, 2016 at 1:32 PM, Goffredo Baroncelli wrote:
>
> The raid5 write hole is avoided in BTRFS (and in ZFS) thanks to the checksum.

Yeah I'm kinda confused on this point.

https://btrfs.wiki.kernel.org/index.php/RAID56

It says there is a write hole for Btrfs. But defines it in terms of parity possibly being stale after a crash. I think the term comes not from merely parity being wrong but parity being wrong *and* then being used to wrongly reconstruct data because it's blindly trusted.

I don't read code well enough, but I'd be surprised if Btrfs reconstructs from parity and doesn't then check the resulting reconstructed data to its EXTENT_CSUM.

--
Chris Murphy
Re: Adventures in btrfs raid5 disk recovery
On Wed, Jun 22, 2016 at 11:14 AM, Chris Murphy wrote:
>
> However, from btrfs-debug-tree from a 3 device raid5 volume:
>
> item 5 key (FIRST_CHUNK_TREE CHUNK_ITEM 1103101952) itemoff 15621 itemsize 144
>         chunk length 2147483648 owner 2 stripe_len 65536
>         type DATA|RAID5 num_stripes 3
>         stripe 0 devid 2 offset 9437184
>         dev uuid: 3c6f37eb-5cae-455a-82bc-a1b0877dea55
>         stripe 1 devid 1 offset 1094713344
>         dev uuid: 13104709-6f30-4982-979e-4f055c326fad
>         stripe 2 devid 3 offset 1083179008
>         dev uuid: d45fc482-a0c1-46b1-98c1-41cea5a11c80
>
> I expect that parity is in this data block group, and therefore is
> checksummed the same as any other data in that block group.

This appears to be wrong. Comparing the same file, one file only, on two new Btrfs volumes, one volume single, one volume raid5, I get a single csum tree entry:

raid5
item 0 key (EXTENT_CSUM EXTENT_CSUM 12009865216) itemoff 16155 itemsize 128
        extent csum item

single
item 0 key (EXTENT_CSUM EXTENT_CSUM 2168717312) itemoff 16155 itemsize 128
        extent csum item

They're both the same size. They both contain the same data. So it looks like parity is not separately checksummed.

If there's a missing 64KiB data strip (bad sector, or dead drive), the reconstruction of that strip from parity should match the available csums for those blocks. So in this way it's possible to infer whether the parity strip is bad. But it also means assuming that everything else about this full stripe (the remaining data strips and their csums) is correct.

> So in your example of degraded writes, no matter what the on disk
> format makes it discoverable there is a problem:
>
> A. The "updating" is still always COW so there is no overwriting.

There is RMW code in btrfs/raid56.c but I don't know when that gets triggered. With simple files, changing one character with vi and gedit, I get completely different logical and physical numbers with each change, so it's clearly cowing the entire stripe (192KiB in my 3 dev raid5).

[root@f24s ~]# filefrag -v /mnt/5/64k-a-then64k-b.txt
Filesystem type is: 9123683e
File size of /mnt/5/64k-a-then64k-b.txt is 131072 (32 blocks of 4096 bytes)
 ext: logical_offset: physical_offset: length: expected: flags:
   0:    0..      31: 2931744..2931775:     32:          last,eof
/mnt/5/64k-a-then64k-b.txt: 1 extent found
[root@f24s ~]# btrfs-map-logical -l $[4096*2931744] /dev/VG/a
mirror 1 logical 12008423424 physical 1114112 device /dev/mapper/VG-b
mirror 2 logical 12008423424 physical 34668544 device /dev/mapper/VG-a
[root@f24s ~]# vi /mnt/5/64k-a-then64k-b.txt
[root@f24s ~]# filefrag -v /mnt/5/64k-a-then64k-b.txt
Filesystem type is: 9123683e
File size of /mnt/5/64k-a-then64k-b.txt is 131072 (32 blocks of 4096 bytes)
 ext: logical_offset: physical_offset: length: expected: flags:
   0:    0..      31: 2931776..2931807:     32:          last,eof
/mnt/5/64k-a-then64k-b.txt: 1 extent found
[root@f24s ~]# btrfs-map-logical -l $[4096*29317776] /dev/VG/a
No extent found at range [120085610496,120085626880)
[root@f24s ~]# btrfs-map-logical -l $[4096*2931776] /dev/VG/a
mirror 1 logical 12008554496 physical 1108475904 device /dev/mapper/VG-c
mirror 2 logical 12008554496 physical 1179648 device /dev/mapper/VG-b
[root@f24s ~]#

There is a neat bug/rfe I found for btrfs-map-logical [1]: it doesn't report back the physical locations for all num_stripes on the volume. It only spits back two, and sometimes it's the two data strips, sometimes it's one data and one parity strip.

[1] https://bugzilla.kernel.org/show_bug.cgi?id=120941

--
Chris Murphy
Re: Adventures in btrfs raid5 disk recovery
On 2016-06-22 22:35, Zygo Blaxell wrote:
>> I do not know the exact nature of the Btrfs raid56 write hole. Maybe a
>> dev or someone who knows can explain it.
>
> If you have 3 raid5 devices, they might be laid out on disk like this
> (e.g. with a 16K stripe width):
>
> Address:  0..16K     16..32K    32..64K
> Disk 1:   [0..16K]   [32..64K]  [PARITY]
> Disk 2:   [16..32K]  [PARITY]   [80..96K]
> Disk 3:   [PARITY]   [64..80K]  [96..112K]
>
> btrfs logical address ranges are inside []. Disk physical address ranges
> are shown at the top of each column. (I've simplified the mapping here;
> pretend all the addresses are relative to the start of a block group).
>
> If we want to write a 32K extent at logical address 0, we'd write all
> three disks in one column (disk1 gets 0..16K, disk2 gets 16..32K, disk3
> gets parity for the other two disks). The parity will be temporarily
> invalid for the time between the first disk write and the last disk write.
> In non-degraded mode the parity isn't necessary, but in degraded mode
> the entire column cannot be reconstructed because of invalid parity.
>
> To see why this could be a problem, suppose btrfs writes a 4K extent at
> logical address 32K. This requires updating (at least) disk 1 (where the
> logical address 32K resides) and disk 2 (the parity for this column).
> This means any data that existed at logical addresses 36K..80K (or at
> least 32..36K and 64..68K) has its parity temporarily invalidated between
> the write to the first and last disks. If there were metadata pointing
> to other blocks in this column, the metadata temporarily points to
> damaged data during the write. If there is no data in other blocks in
> this column then it doesn't matter that the parity doesn't match--the
> content of the reconstructed unallocated blocks would be undefined
> even in the success cases.

[...]

Sorry, but I can't follow you. RAID5 protects you in case of a failure (or a missing write) of a *single* disk.

The raid write hole happens when a stripe is not completely written on the platters: the parity and the related data mismatch. In this case a "simple" raid5 may return wrong data if the parity is used to compute the data. But this happens because a "simple" raid5 is unable to detect whether the returned data is right or not.

The raid5 write hole is avoided in BTRFS (and in ZFS) thanks to the checksum. BTRFS is able to discard the wrong data: i.e. in case of a 3-disk raid5, the right data may be extracted from data1+data2, or if the checksum doesn't match, from data1+parity, or if the checksum doesn't match, from data2+parity.

NOTE1: the real difference between the BTRFS (and ZFS) raid and the "simple" raid5 is that the latter doesn't try another pair of disks.

NOTE2: this works if only one write is corrupted. If more writes (== more disks) are involved, you get a checksum mismatch. If more than one write is corrupted, raid5 is unable to protect you.

In case of "degraded mode", you don't have any redundancy. So if a stripe of a degraded filesystem is not fully written to the disk, it is like a block not fully written to the disk, and you get checksum mismatches. But this is not what is called the raid write hole.

On 2016-06-22 22:35, Zygo Blaxell wrote:
> If in the future btrfs allocates physical block 2412725692 to
> a different file, up to 3 other blocks in this file (most likely
> 2412725689..2412725691) could be lost if a crash or disk I/O error also
> occurs during the same transaction. btrfs does do this--in fact, the
> _very next block_ allocated by the filesystem is 2412725692:
>
> # head -c 4096 < /dev/urandom >> f; sync; filefrag -v f
> Filesystem type is: 9123683e
> File size of f is 45056 (11 blocks of 4096 bytes)
>  ext: logical_offset: physical_offset: length: expected: flags:
>    0:    0..       0: 2412725689..2412725689:      1:
>    1:    1..       1: 2412725690..2412725690:      1:
>    2:    2..       2: 2412725691..2412725691:      1:
>    3:    3..       3: 2412725701..2412725701:      1: 2412725692:
>    4:    4..       4: 2412725693..2412725693:      1: 2412725702:
>    5:    5..       5: 2412725694..2412725694:      1:
>    6:    6..       6: 2412725695..2412725695:      1:
>    7:    7..       7: 2412725698..2412725698:      1: 2412725696:
>    8:    8..       8: 2412725699..2412725699:      1:
>    9:    9..       9: 2412725700..2412725700:      1:
>   10:   10..      10: 2412725692..2412725692:      1: 2412725701: last,eof
> f: 5 extents found

You are assuming that if you touch a block, all the blocks of the same stripe spread over the disks are involved. I disagree. The only parts which are involved, are the part
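Goffredo's three-combination recovery can be exercised end to end with loop devices (an untested sketch; the paths, sizes, and corruption offset are assumptions, and a careful test would first locate a data extent with btrfs-map-logical):

truncate -s 1G /tmp/d1.img /tmp/d2.img /tmp/d3.img
l1=$(losetup -f --show /tmp/d1.img)
l2=$(losetup -f --show /tmp/d2.img)
l3=$(losetup -f --show /tmp/d3.img)
mkfs.btrfs -f -draid5 -mraid1 $l1 $l2 $l3
mount $l1 /mnt/test
dd if=/dev/urandom of=/mnt/test/file bs=64k count=64
umount /mnt/test
# corrupt a region on ONE device only, well past the 64 KiB superblock
dd if=/dev/zero of=$l2 bs=4096 seek=8192 count=64 conv=notrunc
mount $l1 /mnt/test
btrfs scrub start -B /mnt/test   # csum failures on one device should be
                                 # repaired from the surviving data+parity pair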
Re: Adventures in btrfs raid5 disk recovery
On Wed, Jun 22, 2016 at 11:14:30AM -0600, Chris Murphy wrote: > > Before deploying raid5, I tested these by intentionally corrupting > > one disk in an otherwise healthy raid5 array and watching the result. > > It's difficult to reproduce if no one understands how you > intentionally corrupted that disk. Literal reading, you corrupted the > entire disk, but that's impractical. The fs is expected to behave > differently depending on what's been corrupted and how much. The first round of testing I did (a year ago, when deciding whether btrfs raid5 was mature enough to start using) was: Create a 5-disk RAID5 Put some known data on it until it's full (i.e. random test patterns). At the time I didn't do any tests involving compressible data, which I now realize was a serious gap in my test coverage. Pick 1000 random blocks (excluding superblocks) on one of the disks and write random data to them Read and verify the data through the filesystem, do scrub, etc. Exercise all the btrfs features related to error reporting and recovery. I expected scrub and dev stat to report accurate corruption counts (except for the 1 in 4 billion case where a bad csum matches by random chance), and I expect all the data to be reconstructed since only one drive was corrupted (assuming there are no unplanned disk failures during the test, obviously) and the corruption occurred while the filesystem was offline so there was no possibility of RAID write hole. My results from that testing were that everything worked except for the mostly-harmless quirk where scrub counts errors on random disks instead of the disk where the errors occur. > I don't often use the -Bd options, so I haven't tested it thoroughly, > but what you're describing sounds like a bug in user space tools. I've > found it reflects the same information as btrfs dev stats, and dev > stats have been reliable in my testing. Don't the user space tools just read what the kernel tells them? I don't know how *not* to produce this behavior on btrfs raid5 or raid6. It should show up on any btrfs raid56 system. > > A different thing happens if there is a crash. In that case, scrub cannot > > repair the errors. Every btrfs raid5 filesystem I've deployed so far > > behaves this way when disks turn bad. I had assumed it was a software bug > > in the comparatively new raid5 support that would get fixed eventually. > > This is really annoyingly vague. You don't give a complete recipe for > reproducing this sequence. Here's what I'm understanding and what I'm > missing: > > 1. The intentional corruption, extent of which is undefined, is still present. No intentional corruption here (quote: "A different thing happens if there is a crash..."). Now we are talking about the baseline behavior when there is a crash on a btrfs raid5 array, especially crashes triggered by a disk-level failure (e.g. watchdog timeout because a disk or controller has hung) but also ordinary power failures or other crashes triggered by external causes. > 2. A drive is bad, but that doesn't tell us if it's totally dead, or > only intermittently spitting out spurious information. The most common drive-initiated reboot case is that one drive temporarily locks up and triggers the host to perform a watchdog reset. The reset is successful and the filesystem can be mounted again with all drives present; however, a small amount of raid5 data appears to be corrupted each time. 
The raid1 metadata passes all the integrity checks I can throw at it: btrfs check, scrub, balance, walk the filesystem with find -type f -exec cat ..., compare with the last backup, etc. Usually when I detect this case, I delete any corrupted data, delete the disk that triggers the lockups and have no further problems with that array. > 3. Is the volume remounted degraded or is the bad drive still being > used by Btrfs? Because Btrfs has no concept (patches pending) of drive > faulty state like md, let alone an automatic change to that faulty > state. It just keeps on trying to read or write to bad drives, even if > they're physically removed. In the baseline case the filesystem has all drives present after remount. It could be as simple as power-cycling the host while writes are active. > 4. You've initiated a scrub, and the corruption in 1 is not fixed. In this pattern, btrfs may find both correctable and uncorrectable corrupted data, usually on one of the drives. scrub fixes the correctable corruption, but fails on the uncorrectable. > OK so what am I missing? Nothing yet. The above is the "normal" btrfs raid5 crash experience with a non-degraded raid5 array. A few megabytes of corrupt extents can easily be restored from backups or deleted and everything's fine after that. In my *current* failure case, I'
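For reference, the integrity checks listed above translate roughly into the following commands (a sketch; /mnt/array and /dev/sdX are hypothetical names):

    # Verify checksums of every block and repair what's repairable.
    btrfs scrub start -Bd /mnt/array
    # Per-device error counters.
    btrfs dev stat /mnt/array
    # Force-read all file data so csum failures surface as EIO.
    find /mnt/array -type f -exec cat {} + > /dev/null
    # Offline metadata check (read-only by default; fs must be unmounted).
    umount /mnt/array
    btrfs check /dev/sdX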
Re: Adventures in btrfs raid5 disk recovery
On Mon, Jun 20, 2016 at 7:55 PM, Zygo Blaxell <ce3g8...@umail.furryterror.org> wrote: > On Mon, Jun 20, 2016 at 03:27:03PM -0600, Chris Murphy wrote: >> On Mon, Jun 20, 2016 at 2:40 PM, Zygo Blaxell >> <ce3g8...@umail.furryterror.org> wrote: >> > On Mon, Jun 20, 2016 at 01:30:11PM -0600, Chris Murphy wrote: >> >> >> For me the critical question is what does "some corrupted sectors" mean? >> > >> > On other raid5 arrays, I would observe a small amount of corruption every >> > time there was a system crash (some of which were triggered by disk >> > failures, some not). >> >> What test are you using to determine there is corruption, and how much >> data is corrupted? Is this on every disk? Non-deterministically fewer >> than all disks? Have you identified this as a torn write or >> misdirected write or is it just garbage at some sectors? And what's >> the size? Partial sector? Partial md chunk (or fs block?) > > In earlier cases, scrub, read(), and btrfs dev stat all reported the > incidents differently. Scrub would attribute errors randomly to disks > (error counts spread randomly across all the disks in the 'btrfs scrub > status -d' output). 'dev stat' would correctly increment counts on only > those disks which had individually had an event (e.g. media error or > SATA bus reset). > > Before deploying raid5, I tested these by intentionally corrupting > one disk in an otherwise healthy raid5 array and watching the result. It's difficult to reproduce if no one understands how you intentionally corrupted that disk. Literal reading, you corrupted the entire disk, but that's impractical. The fs is expected to behave differently depending on what's been corrupted and how much. > When scrub identified an inode and offset in the kernel log, the csum > failure log message matched the offsets producing EIO on read(), but > the statistics reported by scrub about which disk had been corrupted > were mostly wrong. In such cases a scrub could repair the data. I don't often use the -Bd options, so I haven't tested it thoroughly, but what you're describing sounds like a bug in user space tools. I've found it reflects the same information as btrfs dev stats, and dev stats have been reliable in my testing. > A different thing happens if there is a crash. In that case, scrub cannot > repair the errors. Every btrfs raid5 filesystem I've deployed so far > behaves this way when disks turn bad. I had assumed it was a software bug > in the comparatively new raid5 support that would get fixed eventually. This is really annoyingly vague. You don't give a complete recipe for reproducing this sequence. Here's what I'm understanding and what I'm missing: 1. The intentional corruption, extent of which is undefined, is still present. 2. A drive is bad, but that doesn't tell us if it's totally dead, or only intermittently spitting out spurious information. 3. Is the volume remounted degraded or is the bad drive still being used by Btrfs? Because Btrfs has no concept (patches pending) of drive faulty state like md, let alone an automatic change to that faulty state. It just keeps on trying to read or write to bad drives, even if they're physically removed. 4. You've initiated a scrub, and the corruption in 1 is not fixed. OK so what am I missing? Because it sounds to me like you have two copies of data that are gone. For raid 5 that's data loss, scrub can't fix things. Corruption is missing data. The bad drive is missing data. 
What values do you get for

    smartctl -l scterc /dev/sdX
    cat /sys/block/sdX/device/timeout

>> This is on Btrfs? This isn't supposed to be possible. Even a literal
>> overwrite of a file is not an overwrite on Btrfs unless the file is
>> nodatacow. Data extents get written, then the metadata is updated to
>> point to those new blocks. There should be flush or fua requests to
>> make sure the order is such that the fs points to either the old or
>> new file, in either case uncorrupted. That's why I'm curious about the
>> nature of this corruption. It sounds like your hardware is not exactly
>> honoring flush requests.
>
> That's true when all the writes are ordered within a single device, but
> possibly not when writes must be synchronized across multiple devices.

I think that's a big problem: the fs cannot be consistent if the super block points to any tree whose metadata or data isn't on stable media. But if you think it's happening you might benefit from integrity checking; maybe try just the metadata one for starters, which is the check_int mount option (it must be compiled in first for that mount option to work).

https://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/tree/fs/btrfs/check-integrity.c?id=refs/tags/v4.6.2
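The two values asked about interact: if the drive's internal error recovery (SCT ERC) runs longer than the kernel's SCSI command timeout, the kernel resets the link instead of receiving a clean read error. A common mitigation, sketched with hypothetical device names:

    smartctl -l scterc /dev/sdX             # show the current SCT ERC setting
    cat /sys/block/sdX/device/timeout       # kernel command timeout, in seconds
    # If the drive supports ERC, cap recovery at 7 seconds (value is in
    # deciseconds) so it reports errors before the kernel's default 30s timeout:
    smartctl -l scterc,70,70 /dev/sdX
    # If the drive does NOT support ERC, raise the kernel timeout instead:
    echo 180 > /sys/block/sdX/device/timeout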
Re: Adventures in btrfs raid5 disk recovery - update
TL;DR: Kernel 4.6.2 causes a world of pain. Use 4.5.7 instead. 'btrfs dev stat' doesn't seem to count "csum failed" (i.e. corruption) errors in compressed extents.

On Sun, Jun 19, 2016 at 11:44:27PM -0400, Zygo Blaxell wrote:
> Not so long ago, I had a disk fail in a btrfs filesystem with raid1
> metadata and raid5 data. I mounted the filesystem readonly, replaced
> the failing disk, and attempted to recover by adding the new disk and
> deleting the missing disk.
>
> I'm currently using kernel 4.6.2

That turned out to be a mistake. 4.6.2 has some severe problems. Over the past few days I've been upgrading other machines from 4.5.7 to 4.6.2. This morning I saw the aggregate data coming back from those machines, and it's all bad: stalls in snapshot delete, balance, and sync; some machines just lock up with no console messages; a lot of watchdog timeouts. None of the machines could get to an uptime over 26 hours and still be in a usable state. I switched to 4.5.7 and the crashes, balance/delete hangs, and some of the data corruption modes stopped.

> I'm getting EIO randomly all over the filesystem, including in files that were
> written entirely _after_ the disk failure.

There were actually four distinct corruption modes happening:

1. There are some number (16500 so far) of "normal" corrupt blocks: read repeatably returns EIO, they show up in scrub with sane log messages, and replacing the files that contain these blocks makes them go away. These blocks appear to be contained in extents that coincide with the date of the disk failure. Interestingly, no matter how many times I read these blocks, I get no increase in the 'btrfs dev stat' numbers even though I get kernel csum failure messages. That looks like a bug.

2. When attempting to replace corrupted files with rsync, I had used 'rsync --inplace'. This caused bad blocks to be overwritten within extents, but did not necessarily replace the _entire_ extent containing a bad block. This creates corrupt blocks that show up in scrub, balance, and device delete, but not when reading files. It also updates the timestamps, so a file with old corruption looks "new" to an insufficiently sophisticated analysis tool.

3. Files were corrupted while they were written and accessed via NFS. This created files with correct btrfs checksums, but garbage contents. This would show up as failures during 'git gc' or rsync checksum mismatches. During one of the many VM crashes, any writes in progress at the time of the crash were lost. This effectively rewound the filesystem several minutes each time, as btrfs reverts to the previous committed tree on the next mount. 4.6.2's hanging issues made this worse by delaying btrfs commits indefinitely. The NFS clients were completely unaware of this, so when the VM rebooted, files ended up with holes, or would just disappear while in use.

4. After a VM crash and the filesystem reverted to the previous committed tree, files with bad blocks that had been repaired through the NFS server or with rsync would be "unrepaired" (i.e. the filesystem would revert back to the original corrupted blocks after the mount).

Combinations of these could occur as well for extra confusion, and some corrupted blocks are contained in many files thanks to dedup. With kernel 4.5.7 there have been no lockups during commit and no VM crashes, so I haven't seen corruption modes 3 and 4 since the downgrade. Balance is now running normally to move the remaining data off the missing disk. ETA is 558 hours. See you in mid-July!
;)
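Given the --inplace pitfall described in mode 2 above, a file replacement that rewrites the whole extent chain looks roughly like this (a sketch; the paths are hypothetical):

    # Plain rsync without --inplace writes a temporary copy and renames it
    # over the target, so every extent of the file is freshly allocated:
    rsync -a backup:/srv/backup/file /data/file
    # Equivalent manual approach:
    cp /srv/backup/file /data/file.new && mv /data/file.new /data/file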
Re: Adventures in btrfs raid5 disk recovery
On Mon, Jun 20, 2016 at 09:55:59PM -0400, Zygo Blaxell wrote:
> In this current case, I'm getting things like this:
>
> [12008.243867] BTRFS info (device vdc): csum failed ino 4420604 extent
> 26805825306624 csum 4105596028 wanted 787343232 mirror 0
[...]
> The other weird thing here is that I can't find an example in the
> logs of an extent with an EIO that isn't compressed. I've been looking
> up a random sample of the extent numbers, matching them up to filefrag
> output, and finding e.g. the one compressed extent in the middle of an
> otherwise uncompressed git pack file. That's...odd. Maybe there's a
> problem with compressed extents in particular? I'll see if I can
> script something to check all the logs at once...

No need for a script: this message wording appears only in fs/btrfs/compression.c, so it can only ever be emitted by reading a compressed extent. Maybe there's a problem specific to raid5, degraded mode, and compressed extents?
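The claim that this wording is unique to the compressed-read path can be checked against a kernel tree; a sketch, assuming a checked-out source at ~/linux and the 4.6-era message strings:

    # Find every place the "csum failed" message is emitted.
    cd ~/linux
    grep -rn "csum failed" fs/btrfs/
    # The variant that prints "extent %llu" (rather than "off %llu")
    # should appear only in fs/btrfs/compression.c.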
Re: Adventures in btrfs raid5 disk recovery
On Mon, Jun 20, 2016 at 03:27:03PM -0600, Chris Murphy wrote: > On Mon, Jun 20, 2016 at 2:40 PM, Zygo Blaxell > <ce3g8...@umail.furryterror.org> wrote: > > On Mon, Jun 20, 2016 at 01:30:11PM -0600, Chris Murphy wrote: > > >> For me the critical question is what does "some corrupted sectors" mean? > > > > On other raid5 arrays, I would observe a small amount of corruption every > > time there was a system crash (some of which were triggered by disk > > failures, some not). > > What test are you using to determine there is corruption, and how much > data is corrupted? Is this on every disk? Non-deterministically fewer > than all disks? Have you identified this as a torn write or > misdirected write or is it just garbage at some sectors? And what's > the size? Partial sector? Partial md chunk (or fs block?) In earlier cases, scrub, read(), and btrfs dev stat all reported the incidents differently. Scrub would attribute errors randomly to disks (error counts spread randomly across all the disks in the 'btrfs scrub status -d' output). 'dev stat' would correctly increment counts on only those disks which had individually had an event (e.g. media error or SATA bus reset). Before deploying raid5, I tested these by intentionally corrupting one disk in an otherwise healthy raid5 array and watching the result. When scrub identified an inode and offset in the kernel log, the csum failure log message matched the offsets producing EIO on read(), but the statistics reported by scrub about which disk had been corrupted were mostly wrong. In such cases a scrub could repair the data. A different thing happens if there is a crash. In that case, scrub cannot repair the errors. Every btrfs raid5 filesystem I've deployed so far behaves this way when disks turn bad. I had assumed it was a software bug in the comparatively new raid5 support that would get fixed eventually. 
In this current case, I'm getting things like this:

[12008.243867] BTRFS info (device vdc): csum failed ino 4420604 extent 26805825306624 csum 4105596028 wanted 787343232 mirror 0
[12008.243876] BTRFS info (device vdc): csum failed ino 4420604 extent 26805825306624 csum 1689373462 wanted 787343232 mirror 0
[12008.243885] BTRFS info (device vdc): csum failed ino 4420604 extent 26805825306624 csum 3621611229 wanted 787343232 mirror 0
[12008.243893] BTRFS info (device vdc): csum failed ino 4420604 extent 26805825306624 csum 113993114 wanted 787343232 mirror 0
[12008.243902] BTRFS info (device vdc): csum failed ino 4420604 extent 26805825306624 csum 1464956834 wanted 787343232 mirror 0
[12008.243911] BTRFS info (device vdc): csum failed ino 4420604 extent 26805825306624 csum 2545274038 wanted 787343232 mirror 0
[12008.243942] BTRFS info (device vdc): csum failed ino 4420604 extent 26805825306624 csum 4090153227 wanted 787343232 mirror 0
[12008.243952] BTRFS info (device vdc): csum failed ino 4420604 extent 26805825306624 csum 4129844199 wanted 787343232 mirror 0
[12008.243961] BTRFS info (device vdc): csum failed ino 4420604 extent 26805825306624 csum 4129844199 wanted 787343232 mirror 0
[12008.243976] BTRFS info (device vdc): csum failed ino 4420604 extent 26805825306624 csum 172651968 wanted 787343232 mirror 0
[12008.246158] BTRFS info (device vdc): csum failed ino 4420604 extent 26805825306624 csum 4129844199 wanted 787343232 mirror 1
[12008.247557] BTRFS info (device vdc): csum failed ino 4420604 extent 26805825306624 csum 1374425809 wanted 787343232 mirror 1
[12008.403493] BTRFS info (device vdc): csum failed ino 4420604 extent 26805825306624 csum 1567917468 wanted 787343232 mirror 1
[12008.409809] BTRFS info (device vdc): csum failed ino 4420604 extent 26805825306624 csum 2881359629 wanted 787343232 mirror 0
[12008.411165] BTRFS info (device vdc): csum failed ino 4420604 extent 26805825306624 csum 3021442070 wanted 787343232 mirror 0
[12008.411180] BTRFS info (device vdc): csum failed ino 4420604 extent 26805825306624 csum 3984314874 wanted 787343232 mirror 0
[12008.411189] BTRFS info (device vdc): csum failed ino 4420604 extent 26805825306624 csum 599192427 wanted 787343232 mirror 0
[12008.411199] BTRFS info (device vdc): csum failed ino 4420604 extent 26805825306624 csum 2887010053 wanted 787343232 mirror 0
[12008.411208] BTRFS info (device vdc): csum failed ino 4420604 extent 26805825306624 csum 1314141634 wanted 787343232 mirror 0
[12008.411217] BTRFS info (device vdc): csum failed ino 4420604 extent 26805825306624 csum 3156167613 wanted 787343232 mirror 0
[12008.411227] BTRFS info (device vdc): csum failed ino 4420604 extent 26805825306624 csum 565550942 wanted 787343232 mirror 0
[12008.411236] BTRFS info (device vdc): csum failed ino 4420604 extent 26805825306624 csum 4068631390 wanted 787343232 mirror 0
[12008.411245] BTRFS info (device vdc): csum failed ino 4420604 extent 26805825306624 csum 531263990 wanted 787343232 mirror 0
[120
Re: Adventures in btrfs raid5 disk recovery
On Mon, Jun 20, 2016 at 2:40 PM, Zygo Blaxell wrote:
> On Mon, Jun 20, 2016 at 01:30:11PM -0600, Chris Murphy wrote:
>> For me the critical question is what does "some corrupted sectors" mean?
>
> On other raid5 arrays, I would observe a small amount of corruption every
> time there was a system crash (some of which were triggered by disk
> failures, some not).

What test are you using to determine there is corruption, and how much data is corrupted? Is this on every disk? Non-deterministically fewer than all disks? Have you identified this as a torn write or misdirected write or is it just garbage at some sectors? And what's the size? Partial sector? Partial md chunk (or fs block)?

> It looked like any writes in progress at the time
> of the failure would be damaged. In the past I would just mop up the
> corrupt files (they were always the last extents written, easy to find
> with find-new or scrub) and have no further problems.

This is on Btrfs? This isn't supposed to be possible. Even a literal overwrite of a file is not an overwrite on Btrfs unless the file is nodatacow. Data extents get written, then the metadata is updated to point to those new blocks. There should be flush or fua requests to make sure the order is such that the fs points to either the old or new file, in either case uncorrupted. That's why I'm curious about the nature of this corruption. It sounds like your hardware is not exactly honoring flush requests.

With md raid and any other file system, it's pure luck that such corrupted writes would only affect data extents and not the fs metadata. Corrupted fs metadata is not well tolerated by any file system, not least because most of them have no idea the metadata is corrupt. At least Btrfs can determine this and, if there's another copy, use that, or just stop and face plant before more damage happens. Maybe an exception now is XFS v5 metadata, which employs checksumming. But it still doesn't know if data extents are wrong (i.e. a torn or misdirected write). I've had perhaps a hundred power-offs during writes with Btrfs and SSD and I don't ever see corrupt files. It's definitely not normal to see this with Btrfs.

> In the earlier
> cases there were no new instances of corruption after the initial failure
> event and manual cleanup.
>
> Now that I dig a little deeper into this, I do see one fairly significant
> piece of data:
>
> root@host:~# btrfs dev stat /data | grep -v ' 0$'
> [/dev/vdc].corruption_errs  16774
> [/dev/vde].write_io_errs    121
> [/dev/vde].read_io_errs     4
> [devid:8].read_io_errs      16
>
> Prior to the failure of devid:8, vde had 121 write errors and 4 read
> errors (these counter values are months old and the errors were long
> since repaired by scrub). The 16774 corruption errors on vdc are all
> new since the devid:8 failure, though.

On md RAID 5 and 6, if the array gets parity mismatch counts above 0 doing a scrub (check > md/sync_action) there's a hardware problem. It's entirely possible you've found a bug, but it must be extremely obscure to basically not have hit everyone trying Btrfs raid56. I think you need to track down the source of this corruption and stop it however possible; whether that's changing hardware, or making sure the system isn't crashing.

--
Chris Murphy
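For comparison, the md parity-consistency check referred to above is driven through sysfs; a sketch, with md0 as a hypothetical array name:

    # Trigger an md consistency check, then read the mismatch count.
    echo check > /sys/block/md0/md/sync_action
    cat /sys/block/md0/md/mismatch_cnt   # >0 after a check suggests a hardware problem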
Re: Adventures in btrfs raid5 disk recovery
On Mon, Jun 20, 2016 at 01:30:11PM -0600, Chris Murphy wrote:
> On Mon, Jun 20, 2016 at 1:11 PM, Zygo Blaxell wrote:
> > On Mon, Jun 20, 2016 at 11:13:51PM +0500, Roman Mamedov wrote:
> >> On Sun, 19 Jun 2016 23:44:27 -0400
>
> Seems difficult at best due to this:
>
> >> The normal 'device delete' operation got about 25% of the way in,
> >> then got stuck on some corrupted sectors, aborting with EIO.
>
> In effect it's like a 2 disk failure for a raid5 (or it's
> intermittently a 2 disk failure but always at least a 1 disk failure).
> That's not something md raid recovers from. Even manual recovery in
> such a case is far from certain.
>
> Perhaps Roman's advice is also a question about the cause of this
> corruption? I'm wondering this myself. That's the real problem here as
> I see it. Losing a drive is ordinary. Additional corruptions happening
> afterward is not. And are those corrupt sectors hardware corruptions,
> or Btrfs corruptions at the time the data was written to disk, or
> Btrfs being confused as it's reading the data from disk?
>
> For me the critical question is what does "some corrupted sectors" mean?

On other raid5 arrays, I would observe a small amount of corruption every time there was a system crash (some of which were triggered by disk failures, some not). It looked like any writes in progress at the time of the failure would be damaged. In the past I would just mop up the corrupt files (they were always the last extents written, easy to find with find-new or scrub) and have no further problems. In the earlier cases there were no new instances of corruption after the initial failure event and manual cleanup.

Now that I dig a little deeper into this, I do see one fairly significant piece of data:

root@host:~# btrfs dev stat /data | grep -v ' 0$'
[/dev/vdc].corruption_errs  16774
[/dev/vde].write_io_errs    121
[/dev/vde].read_io_errs     4
[devid:8].read_io_errs      16

Prior to the failure of devid:8, vde had 121 write errors and 4 read errors (these counter values are months old and the errors were long since repaired by scrub). The 16774 corruption errors on vdc are all new since the devid:8 failure, though.
Re: Adventures in btrfs raid5 disk recovery
On Mon, Jun 20, 2016 at 1:11 PM, Zygo Blaxell <ce3g8...@umail.furryterror.org> wrote:
> On Mon, Jun 20, 2016 at 11:13:51PM +0500, Roman Mamedov wrote:
>> On Sun, 19 Jun 2016 23:44:27 -0400
>> Zygo Blaxell <ce3g8...@umail.furryterror.org> wrote:
>> From a practical standpoint, [aside from not using Btrfs RAID5], you'd be
>> better off shutting down the system, booting a rescue OS, copying the content
>> of the failing disk to the replacement one using 'ddrescue', then removing the
>> bad disk, and after boot-up your main system wouldn't notice anything has ever
>> happened, aside from a few recoverable CRC errors in the "holes" on the areas
>> which ddrescue failed to copy.
>
> I'm aware of ddrescue and myrescue, but in this case the disk has failed,
> past tense. At this point the remaining choices are to make btrfs native
> raid5 recovery work, or to restore from backups.

Seems difficult at best due to this:

>> The normal 'device delete' operation got about 25% of the way in, then got stuck on some corrupted sectors, aborting with EIO.

In effect it's like a 2 disk failure for a raid5 (or it's intermittently a 2 disk failure but always at least a 1 disk failure). That's not something md raid recovers from. Even manual recovery in such a case is far from certain.

Perhaps Roman's advice is also a question about the cause of this corruption? I'm wondering this myself. That's the real problem here as I see it. Losing a drive is ordinary. Additional corruptions happening afterward is not. And are those corrupt sectors hardware corruptions, or Btrfs corruptions at the time the data was written to disk, or Btrfs being confused as it's reading the data from disk?

For me the critical question is what does "some corrupted sectors" mean?

--
Chris Murphy
Re: Adventures in btrfs raid5 disk recovery
On Mon, Jun 20, 2016 at 11:13:51PM +0500, Roman Mamedov wrote:
> On Sun, 19 Jun 2016 23:44:27 -0400
> Zygo Blaxell <ce3g8...@umail.furryterror.org> wrote:
> From a practical standpoint, [aside from not using Btrfs RAID5], you'd be
> better off shutting down the system, booting a rescue OS, copying the content
> of the failing disk to the replacement one using 'ddrescue', then removing the
> bad disk, and after boot-up your main system wouldn't notice anything has ever
> happened, aside from a few recoverable CRC errors in the "holes" on the areas
> which ddrescue failed to copy.

I'm aware of ddrescue and myrescue, but in this case the disk has failed, past tense. At this point the remaining choices are to make btrfs native raid5 recovery work, or to restore from backups.

> But in general it's commendable that you're experimenting with doing things
> "the native way", as this provides feedback to the developers and could help
> make the RAID implementation better. I guess that's the whole point of the
> exercise and the report, and hope this ends up being useful for everyone.

The intent was both to provide a cautionary tale for anyone considering deploying a btrfs raid5 system today, and to possibly engage some developers to help solve the problems. The underlying causes seem to be somewhat removed from where the symptoms are appearing, and at the moment I don't understand this code well enough to know where to look for them. Any assistance would be greatly appreciated.

> --
> With respect,
> Roman
Re: Adventures in btrfs raid5 disk recovery
On Sun, 19 Jun 2016 23:44:27 -0400
Zygo Blaxell <ce3g8...@umail.furryterror.org> wrote:
> It's not going well so far. Pay attention, there are at least four
> separate problems in here and we're not even half done yet.
>
> I'm currently using kernel 4.6.2 with btrfs fixes forward-ported from
> 4.5.7, because 4.5.7 has a number of fixes that 4.6.2 doesn't. I have
> also pulled in some patches from the 4.7-rc series.
>
> This fixed a few problems I encountered early on, and I'm still making
> forward progress, but I've only replaced 50% of the failed disk so far,
> and this is week four of this particular project.

From a practical standpoint, [aside from not using Btrfs RAID5], you'd be better off shutting down the system, booting a rescue OS, copying the content of the failing disk to the replacement one using 'ddrescue', then removing the bad disk, and after boot-up your main system wouldn't notice anything has ever happened, aside from a few recoverable CRC errors in the "holes" on the areas which ddrescue failed to copy.

But in general it's commendable that you're experimenting with doing things "the native way", as this provides feedback to the developers and could help make the RAID implementation better. I guess that's the whole point of the exercise and the report, and I hope this ends up being useful for everyone.

--
With respect,
Roman
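For reference, the ddrescue approach Roman describes would look roughly like this (device names are hypothetical; run from a rescue OS with the filesystem unmounted):

    # Copy the failing disk onto its replacement, keeping a map file so
    # the run can be resumed and the bad areas retried.
    ddrescue -f /dev/sdX /dev/sdY /root/rescue.map
    ddrescue -f -r3 /dev/sdX /dev/sdY /root/rescue.map   # retry bad sectors 3 times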
Adventures in btrfs raid5 disk recovery
Not so long ago, I had a disk fail in a btrfs filesystem with raid1 metadata and raid5 data. I mounted the filesystem readonly, replaced the failing disk, and attempted to recover by adding the new disk and deleting the missing disk.

It's not going well so far. Pay attention, there are at least four separate problems in here and we're not even half done yet.

I'm currently using kernel 4.6.2 with btrfs fixes forward-ported from 4.5.7, because 4.5.7 has a number of fixes that 4.6.2 doesn't. I have also pulled in some patches from the 4.7-rc series. This fixed a few problems I encountered early on, and I'm still making forward progress, but I've only replaced 50% of the failed disk so far, and this is week four of this particular project.

What worked: 'mount -odegraded,...' successfully mounts the filesystem RW. 'btrfs device add' adds the new disk. Success! The first thing I did was balance the metadata onto non-missing disks. That went well. Now there are only data chunks to recover from the missing disk. Success!

The normal 'device delete' operation got about 25% of the way in, then got stuck on some corrupted sectors, aborting with EIO. That ends the success, but I've had similar problems with raid5 arrays before and been able to solve them. I've managed to remove about half of the data from the missing disk so far. 'balance start -ddevid=,drange=0..1000' (with increasing values for drange) is able to move data off the failed disk while avoiding the damaged regions (a loop over drange windows is sketched after the log excerpt below). It looks like this process could reduce the amount of data on "missing" devices to a manageable number, then I could identify the offending corrupted extents with 'btrfs scrub', remove the files containing them, and finish the device delete operation. Hope!

What doesn't work: The first problem is that the kernel keeps crashing. I put the filesystem and all its disks in a KVM so the crashes are less disruptive, and I can debug them (or at least collect panic logs). OK, now crashes are merely a performance problem. Why did I mention 'btrfs scrub' above? Because 'btrfs scrub' tells me where corrupted blocks are.
'device delete' fills my kernel logs with lines like this:

[26054.744158] BTRFS info (device vdc): relocating block group 27753592127488 flags 129
[26809.746993] BTRFS warning (device vdc): csum failed ino 404 off 6021976064 csum 778377694 expected csum 2827380172
[26809.747029] BTRFS warning (device vdc): csum failed ino 404 off 6021980160 csum 3776938678 expected csum 514150079
[26809.747077] BTRFS warning (device vdc): csum failed ino 404 off 6021984256 csum 470593400 expected csum 642831408
[26809.747093] BTRFS warning (device vdc): csum failed ino 404 off 6021988352 csum 796755777 expected csum 690854341
[26809.747108] BTRFS warning (device vdc): csum failed ino 404 off 6021992448 csum 4115095129 expected csum 249712906
[26809.747122] BTRFS warning (device vdc): csum failed ino 404 off 6021996544 csum 2337431338 expected csum 1869250975
[26809.747138] BTRFS warning (device vdc): csum failed ino 404 off 6022000640 csum 3543852608 expected csum 1929026437
[26809.747154] BTRFS warning (device vdc): csum failed ino 404 off 6022004736 csum 3417780495 expected csum 3698318115
[26809.747169] BTRFS warning (device vdc): csum failed ino 404 off 6022008832 csum 3423877520 expected csum 2981727596
[26809.747183] BTRFS warning (device vdc): csum failed ino 404 off 6022012928 csum 550838742 expected csum 1005563554
[26896.379773] BTRFS info (device vdc): relocating block group 27753592127488 flags 129
[27791.128098] __readpage_endio_check: 7 callbacks suppressed
[27791.236794] BTRFS warning (device vdc): csum failed ino 405 off 6021980160 csum 3776938678 expected csum 514150079
[27791.236799] BTRFS warning (device vdc): csum failed ino 405 off 6021971968 csum 3304844252 expected csum 4171523312
[27791.236821] BTRFS warning (device vdc): csum failed ino 405 off 6021984256 csum 470593400 expected csum 642831408
[27791.236825] BTRFS warning (device vdc): csum failed ino 405 off 6021988352 csum 796755777 expected csum 690854341
[27791.236842] BTRFS warning (device vdc): csum failed ino 405 off 6021992448 csum 4115095129 expected csum 249712906
[27791.236847] BTRFS warning (device vdc): csum failed ino 405 off 6021996544 csum 2337431338 expected csum 1869250975
[27791.236857] BTRFS warning (device vdc): csum failed ino 405 off 6022004736 csum 3417780495 expected csum 3698318115
[27791.236864] BTRFS warning (device vdc): csum failed ino 405 off 6022000640 csum 3543852608 expected csum 1929026437
[27791.236874] BTRFS warning (device vdc): csum failed ino 405 off 6022008832 csum 3423877520 expected csum 2981727596
[27791.236978] BTRFS warning (device vdc): csum failed ino 405 off 6021976064 csum 778377694 expected
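The drange-based balance described above lends itself to a loop; a hedged sketch, where the devid (8), the window size, the device capacity, and /mnt/array are all illustrative placeholders:

    # Walk the failed device in drange windows, logging windows that
    # fail with EIO instead of stopping the whole evacuation.
    step=$((10 * 1024 * 1024 * 1024))                       # 10GiB windows
    limit=$((8 * 1024 * 1024 * 1024 * 1024))                # device size
    start=0
    while [ $start -lt $limit ]; do
        end=$((start + step))
        btrfs balance start -ddevid=8,drange=$start..$end /mnt/array || \
            echo "EIO in range $start..$end, skipping" >> /root/bad-ranges.txt
        start=$end
    done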
Re: One disc of 3-disc btrfs-raid5 failed - files only partially readable
>> > Do you think there is still a chance to recover those files?
>>
>> You can use btrfs restore to get files off a damaged fs.
>
> This however does work - thank you!
> Now since I'm a bit short on disc space, can I remove the disc that
> previously disappeared (and thus doesn't have all the data) from the RAID,
> format it and run btrfs rescue on the degraded array, saving the rescued
> data to the now free disc?

In theory btrfs restore should be able to read files from (unmounted) /dev/sdb (devid 2) + /dev/sdc (devid 3). The kernel code should still be able to mount devid 2 + devid 3 in degraded mode, but btrfs restore needs an unmounted fs and I am not sure if the userspace tools can also decode degraded raid5 well enough. For a single device, so non-raid profiles, it might be different.

If you unplug /dev/sda (devid 1) you can dry-run btrfs restore -v -D and see if it would work. If not, maybe first save the files that have csum errors with restore (all 3 discs connected) to other storage, then delete those files from the normally mounted 3-disc raid5 array, and then do a normal copy from the degraded,ro mounted 2 discs to the newly formatted /dev/sda. Hopefully there's enough space in total.
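The dry run suggested above would look something like this (a sketch; device names and the target path are hypothetical):

    # With /dev/sda unplugged, list what restore would recover from the
    # remaining devices without writing anything (-D = dry run).
    btrfs restore -v -D /dev/sdb /tmp/ignored
    # If the listing looks sane, do the real run to a separate disk:
    btrfs restore -v /dev/sdb /mnt/rescue-target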
Re: One disc of 3-disc btrfs-raid5 failed - files only partially readable
Henk Slager gmail.com> writes:
> You could use the 1-time mount option clear_cache, then mount normally and
> the cache will be rebuilt automatically (but also corrected if you don't
> clear it)

This didn't help, gave me

[  316.111596] BTRFS info (device sda): force clearing of disk cache
[  316.111605] BTRFS info (device sda): disk space caching is enabled
[  316.111608] BTRFS: has skinny extents
[  316.227354] BTRFS info (device sda): bdev /dev/sda errs: wr 180547340, rd 592949011, flush 4967, corrupt 582096433, gen 26993

and still

[  498.552298] BTRFS warning (device sda): csum failed ino 171545 off 2269560832 csum 2566472073 expected csum 874509527
[  498.552325] BTRFS warning (device sda): csum failed ino 171545 off 2269564928 csum 2566472073 expected csum 2434927850

> > Do you think there is still a chance to recover those files?
>
> You can use btrfs restore to get files off a damaged fs.

This however does work - thank you!
Now since I'm a bit short on disc space, can I remove the disc that previously disappeared (and thus doesn't have all the data) from the RAID, format it and run btrfs rescue on the degraded array, saving the rescued data to the now free disc?
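For completeness, the one-time cache rebuild being attempted above looks like this (a sketch; the mount point is hypothetical):

    # Mount once with clear_cache to discard the free-space cache, then
    # remount normally; the cache is rebuilt in the background.
    mount -o clear_cache /dev/sda /mnt/data
    umount /mnt/data
    mount /dev/sda /mnt/data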
Re: One disc of 3-disc btrfs-raid5 failed - files only partially readable
On Sun, Feb 7, 2016 at 6:28 PM, Benjamin Valentin <benpi...@googlemail.com> wrote:
> Hi,
>
> I created a btrfs volume with 3x8TB drives (ST8000AS0002-1NA) in raid5
> configuration.
> I copied some TB of data onto it without errors (from eSATA drives, so
> rather fast - I mention that because of [1]), then set it up as a
> fileserver where it had data read and written to it over a gigabit
> ethernet connection for several days.
> This however didn't go so well because after one day, one of the drives
> dropped off the SATA bus.
>
> I don't know if that was related to [1] (I was running Linux 4.4-rc6 to
> avoid that) and by now all evidence has been eaten by logrotate :\
>
> But I was not concerned for I had set up raid5 to provide redundancy
> against one disc failure - unfortunately it did not.
>
> When trying to read a file I'd get an I/O error after some hundred MB
> (this is random across multiple files, but consistent for the same
> file) on both files written before and after the disc failure.
>
> (There was still data being written to the volume at this point.)
>
> After a reboot a couple days later the drive showed up again and SMART
> reported no errors, but the I/O errors remained.
>
> I then ran btrfs scrub (this took about 10 days) and afterwards I was
> again able to completely read all files written *before* the disc
> failure.
>
> However, many files written *after* the event (while only 2 drives were
> online) are still only readable up to a point:
>
> $ dd if=Dr.Strangelove.mkv of=/dev/null
> dd: error reading ‘Dr.Strangelove.mkv’: Input/output error
> 5331736+0 records in
> 5331736+0 records out
> 2729848832 bytes (2,7 GB) copied, 11,1318 s, 245 MB/s
>
> $ ls -sh
> 4,4G Dr.Strangelove.mkv
>
> [  197.321552] BTRFS warning (device sda): csum failed ino 171545 off
> 2269564928 csum 2566472073 expected csum 2434927850
> [  197.321574] BTRFS warning (device sda): csum failed ino 171545 off
> 2269569024 csum 566472073 expected csum 212160686
> [  197.321592] BTRFS warning (device sda): csum failed ino 171545 off
> 2269573120 csum 2566472073 expected csum 2202342500
>
> I tried btrfs check --repair but to no avail, got some
>
> [ 4549.762299] BTRFS warning (device sda): failed to load free space cache
> for block group 1614937063424, rebuilding it now
> [ 4549.790389] BTRFS error (device sda): csum mismatch on free space cache
>
> and this result
>
> checking extents
> Fixed 0 roots.
> checking free space cache
> checking fs roots
> checking csums
> checking root refs
> enabling repair mode
> Checking filesystem on /dev/sda
> UUID: ed263a9a-f65c-4bb6-8ee7-0df42b7fbfb8
> cache and super generation don't match, space cache will be invalidated
> found 11674258875712 bytes used err is 0
> total csum bytes: 11387937220
> total tree bytes: 13011156992
> total fs tree bytes: 338083840
> total extent tree bytes: 99123200
> btree space waste bytes: 1079766991
> file data blocks allocated: 14669115838464
>  referenced 14668840665088
>
> when I mount the volume with -o nospace_cache I instead get
>
> [ 6985.165421] BTRFS warning (device sda): csum failed ino 171545 off
> 2269560832 csum 2566472073 expected csum 874509527
> [ 6985.165469] BTRFS warning (device sda): csum failed ino 171545 off
> 2269564928 csum 566472073 expected csum 2434927850
> [ 6985.165490] BTRFS warning (device sda): csum failed ino 171545 off
> 2269569024 csum 2566472073 expected csum 212160686
>
> when trying to read the file.
You could use the 1-time mount option clear_cache, then mount normally and the cache will be rebuilt automatically (but also corrected if you don't clear it).

> Do you think there is still a chance to recover those files?

You can use btrfs restore to get files off a damaged fs.

> Also am I mistaken to believe that btrfs-raid5 would continue to
> function when one disc fails?

The problem you encountered is quite typical unfortunately; the answer is yes, if you stop writing to the fs. But that's not acceptable of course. A key problem of btrfs raid (also in recent kernels like 4.4) is that when a (redundant) device goes offline (like pulling a SATA cable or an HDD firmware crash) btrfs/kernel does not notice or does not act correctly upon it under various circumstances. So same as in your case, the writing to the disappeared device seems to continue. For just the data, this might then still be recoverable, but for the rest of the structures, it might corrupt the fs heavily. What should happen is that the btrfs+kernel+fs state switches to degraded mode and warns about the device failure so that the user can take action. Or completely automatically starts using a spare disk that is standby but connected. But this spare disk method is currently just patched in
One disc of 3-disc btrfs-raid5 failed - files only partially readable
Hi,

I created a btrfs volume with 3x8TB drives (ST8000AS0002-1NA) in raid5 configuration. I copied some TB of data onto it without errors (from eSATA drives, so rather fast - I mention that because of [1]), then set it up as a fileserver where it had data read and written to it over a gigabit ethernet connection for several days. This however didn't go so well because after one day, one of the drives dropped off the SATA bus.

I don't know if that was related to [1] (I was running Linux 4.4-rc6 to avoid that) and by now all evidence has been eaten by logrotate :\

But I was not concerned for I had set up raid5 to provide redundancy against one disc failure - unfortunately it did not.

When trying to read a file I'd get an I/O error after some hundred MB (this is random across multiple files, but consistent for the same file) on both files written before and after the disc failure. (There was still data being written to the volume at this point.)

After a reboot a couple days later the drive showed up again and SMART reported no errors, but the I/O errors remained. I then ran btrfs scrub (this took about 10 days) and afterwards I was again able to completely read all files written *before* the disc failure. However, many files written *after* the event (while only 2 drives were online) are still only readable up to a point:

$ dd if=Dr.Strangelove.mkv of=/dev/null
dd: error reading ‘Dr.Strangelove.mkv’: Input/output error
5331736+0 records in
5331736+0 records out
2729848832 bytes (2,7 GB) copied, 11,1318 s, 245 MB/s

$ ls -sh
4,4G Dr.Strangelove.mkv

[  197.321552] BTRFS warning (device sda): csum failed ino 171545 off 2269564928 csum 2566472073 expected csum 2434927850
[  197.321574] BTRFS warning (device sda): csum failed ino 171545 off 2269569024 csum 566472073 expected csum 212160686
[  197.321592] BTRFS warning (device sda): csum failed ino 171545 off 2269573120 csum 2566472073 expected csum 2202342500

I tried btrfs check --repair but to no avail, got some

[ 4549.762299] BTRFS warning (device sda): failed to load free space cache for block group 1614937063424, rebuilding it now
[ 4549.790389] BTRFS error (device sda): csum mismatch on free space cache

and this result

checking extents
Fixed 0 roots.
checking free space cache
checking fs roots
checking csums
checking root refs
enabling repair mode
Checking filesystem on /dev/sda
UUID: ed263a9a-f65c-4bb6-8ee7-0df42b7fbfb8
cache and super generation don't match, space cache will be invalidated
found 11674258875712 bytes used err is 0
total csum bytes: 11387937220
total tree bytes: 13011156992
total fs tree bytes: 338083840
total extent tree bytes: 99123200
btree space waste bytes: 1079766991
file data blocks allocated: 14669115838464
 referenced 14668840665088

when I mount the volume with -o nospace_cache I instead get

[ 6985.165421] BTRFS warning (device sda): csum failed ino 171545 off 2269560832 csum 2566472073 expected csum 874509527
[ 6985.165469] BTRFS warning (device sda): csum failed ino 171545 off 2269564928 csum 566472073 expected csum 2434927850
[ 6985.165490] BTRFS warning (device sda): csum failed ino 171545 off 2269569024 csum 2566472073 expected csum 212160686

when trying to read the file.

Do you think there is still a chance to recover those files? Also am I mistaken to believe that btrfs-raid5 would continue to function when one disc fails?
If you need any more info I'm happy to provide that - here is some information about the system:

Linux nashorn 4.4.0-2-generic #16-Ubuntu SMP Thu Jan 28 15:44:21 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
btrfs-progs v4.4

Label: 'data'  uuid: ed263a9a-f65c-4bb6-8ee7-0df42b7fbfb8
        Total devices 3 FS bytes used 10.62TiB
        devid    1 size 7.28TiB used 5.33TiB path /dev/sda
        devid    2 size 7.28TiB used 5.33TiB path /dev/sdb
        devid    3 size 7.28TiB used 5.33TiB path /dev/sdc

Data, RAID5: total=10.64TiB, used=10.61TiB
System, RAID1: total=40.00MiB, used=928.00KiB
Metadata, RAID1: total=13.00GiB, used=12.12GiB
GlobalReserve, single: total=512.00MiB, used=0.00B

Thank you!

[1] https://bugzilla.kernel.org/show_bug.cgi?id=93581
[2] full dmesg: http://paste.ubuntu.com/14965237/
Re: Btrfs/RAID5 became unmountable after SATA cable fault
On 6 November 2015 at 10:03, Janos Toth F. wrote:
>
> Although I updated the firmware of the drives. (I found an IMPORTANT
> update when I went there to download SeaTools, although there was no
> change log to tell me why this was important). This might have changed the
> error handling behavior of the drive...?

I've had Seagate drives not reporting errors until I updated the firmware. They tended to timeout instead. Got a shitload of SMART errors after I updated, but they still didn't handle errors very well (became unresponsive).
Re: Btrfs/RAID5 became unmountable after SATA cable fault
I created a fresh RAID-5 mode Btrfs on the same 3 disks (including the faulty one which is still producing numerous random read errors) and Btrfs now seems to work exactly as I would anticipate. I copied some data and verified the checksum. The data is readable and correct regardless of the constant warning messages in the kernel log about the read errors on the single faulty HDD (the bad behavior is confirmed by the SMART logs and I tested it in a different PC as well...). I also ran several scrubs and now it always finishes with X corrected and 0 uncorrected errors. (The errors are supposedly corrected but the faulty HDD keeps randomly corrupting the data...) The last time, I saw uncorrected errors during the scrub and not all the data was readable. Rather strange...

I ran 24 hours of Gimps/Prime95 Blend stress test without errors on the problematic machine. Although I updated the firmware of the drives. (I found an IMPORTANT update when I went there to download SeaTools, although there was no change log to tell me why this was important). This might have changed the error handling behavior of the drive...?
Re: Btrfs/RAID5 became unmountable after SATA cable fault
On 2015-11-04 23:06, Duncan wrote:

(Tho I should mention, while not on zfs, I've actually had my own problems with ECC RAM too. In my case, the RAM was certified to run at speeds faster than it was actually reliable at, such that actually stored data, what the ECC protects, was fine, the data was actually getting damaged in transit to/from the RAM. On a lightly loaded system, such as one running many memory tests or under normal desktop usage conditions, the RAM was generally fine, no problems. But on a heavily loaded system, such as when doing parallel builds (I run gentoo, which builds from sources in order to get the higher level of option flexibility that comes only when you can toggle build-time options), I'd often have memory faults and my builds would fail. The most common failure, BTW, was on tarball decompression, bunzip2 or the like, since the tarballs contained checksums that were verified on data decompression, and often they'd fail to verify.

Once I updated the BIOS to one that would let me set the memory speed instead of using the speed the modules themselves reported, and I declocked the memory just one notch (this was DDR1, IIRC I declocked from the PC3200 it was rated, to PC3000 speeds), not only was the memory then 100% reliable, but I could and did actually reduce the number of wait-states for various operations, and it was STILL 100% reliable. It simply couldn't handle the raw speeds it was certified to run, is all, tho it did handle it well enough, enough of the time, to make the problem far more difficult to diagnose and confirm than it would have been had the problem appeared at low load as well.

As it happens, I was running reiserfs at the time, and it handled both that hardware issue, and a number of others I've had, far better than I'd have expected of /any/ filesystem, when the memory feeding it is simply not reliable. Reiserfs metadata, in particular, seems incredibly resilient in the face of hardware issues, and I lost far less data than I might have expected, tho without checksums and with bad memory, I imagine I had occasional undetected bitflip corruption in files here or there, but generally nothing I detected. I still use reiserfs on my spinning rust today, but it's not well suited to SSD, which is where I run btrfs.

But the point for this discussion is that just because it's ECC RAM doesn't mean you can't have memory related errors, just that if you do, they're likely to be different errors, "transit errors", that will tend to be undetected by many memory checkers, at least the ones that don't tend to run full out memory bandwidth if they're simply checking that what was stored in a cell can be read back, unchanged.)

I've actually seen similar issues with both ECC and non-ECC memory myself. Any time I'm getting RAM for a system that I can afford to over-spec, I get the next higher speed and under-clock it (which in turn means I can lower the timing parameters and usually get a faster system than if I was running it at the rated speed).
FWIW, I also make a point of doing multiple memtest86+ runs (at a minimum, one running single core, and one with forced SMP) when I get new RAM, and even have a run-level configured on my Gentoo based home server system where it boots Xen and fires up twice as many VM's running memtest86+ as I have CPU cores, which is usually enough to fully saturate memory bandwidth and check for the type of issues you mentioned having above (although the BOINC client I run usually does a good job of triggering those kind of issues fast, distributed computing apps tend to be memory bound and use a lot of memory bandwidth).
Re: Btrfs/RAID5 became unmountable after SATA cable fault
Duncan wrote:

Austin S Hemmelgarn posted on Wed, 04 Nov 2015 13:45:37 -0500 as excerpted:

On 2015-11-04 13:01, Janos Toth F. wrote: But the worst part is that there are some ISO files which were seemingly copied without errors but their external checksums (the one which I can calculate with md5sum and compare to the one supplied by the publisher of the ISO file) don't match! Well... this, I cannot understand. How could these files become corrupt from a single disk failure? And more importantly: how could these files be copied without errors? Why didn't Btrfs give a read error when the checksums didn't add up?

If you can prove that there was a checksum mismatch and BTRFS returned invalid data instead of a read error or going to the other disk, then that is a very serious bug that needs to be fixed. You need to keep in mind also however that it's completely possible that the data was bad before you wrote it to the filesystem, and if that's the case, there's nothing any filesystem can do to fix it for you.

As Austin suggests, if btrfs is returning data, and you haven't turned off checksumming with nodatasum or nocow, then it's almost certainly returning the data it was given to write out in the first place. Whether that data it was given to write out was correct, however, is an /entirely/ different matter.

If ISOs are failing their external checksums, then something is going on. Had you verified the external checksums when you first got the files? That is, are you sure the files were correct as downloaded and/or ripped? Where were the ISOs stored between original procurement/validation and writing to btrfs? Is it possible you still have some/all of them on that media? Do they still external-checksum-verify there?

Basically, assuming btrfs checksums are validating, there's three other likely possibilities for where the corruption could have come from before writing to btrfs. Either the files were bad as downloaded or otherwise procured -- which is why I asked whether you verified them upon receipt -- or you have memory that's going bad, or your temporary storage is going bad, before the files ever got written to btrfs. The memory going bad is a particularly worrying possibility, considering...

Now I am really considering to move from Linux to Windows and from Btrfs RAID-5 to Storage Spaces RAID-1 + ReFS (the only limitation is that ReFS is only "self-healing" on RAID-1, not RAID-5, so I need a new motherboard with more native SATA connectors and an extra HDD). That one seemed to actually do what it promises (abort any read operations upon checksum errors [which always happens seamlessly on every read] but look at the redundant data first and seamlessly "self-heal" if possible). The only thing which made Btrfs look like a better alternative was the RAID-5 support. But I recently experienced two cases of 1 drive failing of 3 and it always turned out as a smaller or bigger disaster (completely lost data or inconsistent data).

Have you considered looking into ZFS? I hate to suggest it as an alternative to BTRFS, but it's a much more mature and well tested technology than ReFS, and has many of the same features as BTRFS (and even has the option for triple parity instead of the double you get with RAID6). If you do consider ZFS, make a point to look at FreeBSD in addition to the Linux version, the BSD one was a much better written port of the original Solaris drivers, and has better performance in many cases (and as much as I hate to admit it, BSD is way more reliable than Linux in most use cases).
You should also seriously consider whether the convenience of having a filesystem that fixes internal errors itself with no user intervention is worth the risk of it corrupting your data. Returning correct data whenever possible is one thing, being 'self-healing' is completely different. When you start talking about things that automatically fix internal errors without user intervention is when most seasoned system administrators start to get really nervous. Self correcting systems have just as much chance to make things worse as they do to make things better, and most of them depend on the underlying hardware working correctly to actually provide any guarantee of reliability.

I too would point you at ZFS, but there's one VERY BIG caveat, and one related smaller one! The people who have a lot of ZFS experience say it's generally quite reliable, but gobs of **RELIABLE** memory are *absolutely* *critical*! The self-healing works well, *PROVIDED* memory isn't producing errors. Absolutely reliable memory is in fact *so* critical, that running ZFS on non-ECC memory is severely discouraged as a very real risk to your data. Which is why the above hints that your memory may be bad are so worrying. Don't even *THINK* about ZFS, particularly its self-healing features, if you're not absolutely sure your memory is 100% reliable, because apparently, based on the comments I've seen, if it's not, you
Re: Btrfs/RAID5 became unmountable after SATA cable fault
Austin S Hemmelgarn posted on Wed, 04 Nov 2015 13:45:37 -0500 as excerpted:

> On 2015-11-04 13:01, Janos Toth F. wrote:
>> But the worst part is that there are some ISO files which were
>> seemingly copied without errors but their external checksums (the one
>> which I can calculate with md5sum and compare to the one supplied by
>> the publisher of the ISO file) don't match!
>> Well... this, I cannot understand.
>> How could these files become corrupt from a single disk failure? And
>> more importantly: how could these files be copied without errors? Why
>> didn't Btrfs give a read error when the checksums didn't add up?
> If you can prove that there was a checksum mismatch and BTRFS returned
> invalid data instead of a read error or going to the other disk, then
> that is a very serious bug that needs to be fixed. You need to keep in
> mind also however that it's completely possible that the data was bad
> before you wrote it to the filesystem, and if that's the case, there's
> nothing any filesystem can do to fix it for you.

As Austin suggests, if btrfs is returning data, and you haven't turned off checksumming with nodatasum or nocow, then it's almost certainly returning the data it was given to write out in the first place. Whether that data it was given to write out was correct, however, is an /entirely/ different matter.

If ISOs are failing their external checksums, then something is going on. Had you verified the external checksums when you first got the files? That is, are you sure the files were correct as downloaded and/or ripped? Where were the ISOs stored between original procurement/validation and writing to btrfs? Is it possible you still have some/all of them on that media? Do they still external-checksum-verify there?

Basically, assuming btrfs checksums are validating, there's three other likely possibilities for where the corruption could have come from before writing to btrfs. Either the files were bad as downloaded or otherwise procured -- which is why I asked whether you verified them upon receipt -- or you have memory that's going bad, or your temporary storage is going bad, before the files ever got written to btrfs. The memory going bad is a particularly worrying possibility, considering...

>> Now I am really considering to move from Linux to Windows and from
>> Btrfs RAID-5 to Storage Spaces RAID-1 + ReFS (the only limitation is
>> that ReFS is only "self-healing" on RAID-1, not RAID-5, so I need a new
>> motherboard with more native SATA connectors and an extra HDD). That
>> one seemed to actually do what it promises (abort any read operations
>> upon checksum errors [which always happens seamlessly on every read]
>> but look at the redundant data first and seamlessly "self-heal" if
>> possible). The only thing which made Btrfs look like a better
>> alternative was the RAID-5 support. But I recently experienced two
>> cases of 1 drive failing of 3 and it always turned out as a smaller or
>> bigger disaster (completely lost data or inconsistent data).
> Have you considered looking into ZFS? I hate to suggest it as an
> alternative to BTRFS, but it's a much more mature and well tested
> technology than ReFS, and has many of the same features as BTRFS (and
> even has the option for triple parity instead of the double you get with
> RAID6).
> If you do consider ZFS, make a point to look at FreeBSD in
> addition to the Linux version; the BSD one was a much better written
> port of the original Solaris drivers, and has better performance in many
> cases (and as much as I hate to admit it, BSD is way more reliable than
> Linux in most use cases).
>
> You should also seriously consider whether the convenience of having a
> filesystem that fixes internal errors itself with no user intervention
> is worth the risk of it corrupting your data. Returning correct data
> whenever possible is one thing; being 'self-healing' is completely
> different. When you start talking about things that automatically fix
> internal errors without user intervention is when most seasoned system
> administrators start to get really nervous. Self-correcting systems
> have just as much chance to make things worse as they do to make things
> better, and most of them depend on the underlying hardware working
> correctly to actually provide any guarantee of reliability.

I too would point you at ZFS, but there's one VERY BIG caveat, and one related smaller one! The people who have a lot of ZFS experience say it's generally quite reliable, but gobs of **RELIABLE** memory are *absolutely* *critical*! The self-healing works well, *PROVIDED* memory isn't producing errors. Absolutely reliable memory is in fact *so* critical that running ZFS on non-ECC memory is severely discouraged as a very real risk to your data. Which is why the above hints that your memory may be bad are so worrying. Don't even *THINK* about ZFS, particularly its self-healing features, if you're not absolutely sure your memory is 100% reliable, because apparently, based on the comments I've seen, if it's not, you
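For reference, re-checking an ISO against its published checksum is a one-liner; the file names here are placeholders for whatever the publisher actually ships:

md5sum -c image.iso.md5        # compares the file against the sum listed in the .md5 file
sha256sum image.iso            # or print the sum and compare it to the published one by eye

A quick (if not conclusive) memory check can also be run from userspace, e.g. "memtester 1024 3" to test 1 GiB for three passes; a full memtest86+ boot is more thorough.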
Re: Btrfs/RAID5 became unmountable after SATA cable fault
Well. Now I am really confused about Btrfs RAID-5!

So, I replaced all SATA cables (with ones explicitly marketed as being aimed at SATA3 speeds) and all the 3x2TB WD Red 2.0 drives with 3x4TB Seagate Constellation ES.3 drives, and started from scratch. I secure-erased every drive, created an empty filesystem, and ran a "long" SMART self-test on all drives before I started using the storage space (the tests finished without errors and all drives looked fine: 0 bad sectors, 0 read or SATA CRC errors... all looked perfectly fine at the time...).

It didn't take long before I realized that one of the new drives had started failing. I started a scrub and it reported both corrected and uncorrectable errors. I looked at the SMART data: two drives look perfectly fine, and one drive seems to be really sick. The latter has some "reallocated" and several hundred "pending" sectors, among other error indications in the log. I guess it's not the drive surface but the HDD controller (or maybe a head) which is really dying.

I figured the uncorrectable errors are write errors, which is not surprising given the perceived "health" of the drive according to its SMART attributes and error logs. That's understandable.

Still, I tried to copy data off the filesystem and it failed in various ways. There was one file which couldn't be copied at all. Good question why; I guess the filesystem needs to be repaired to get the checksums and parities sorted out first. That's also understandable (though unexpected -- I thought RAID-5 Btrfs is sort of "self-healing" in these situations: it should theoretically still be able to reconstruct and present the correct data seamlessly, based on checksums and parities, and only place errors in the kernel log...).

But the worst part is that there are some ISO files which were seemingly copied without errors but whose external checksums (the ones I can calculate with md5sum and compare to the ones supplied by the publisher of the ISO file) don't match! Well... this, I cannot understand. How could these files become corrupt from a single disk failure? And more importantly: how could these files be copied without errors? Why didn't Btrfs give a read error when the checksums didn't add up?

Isn't Btrfs supposed to constantly check the integrity of the file data during any normal read operation and give an error instead of spitting out corrupt data as if it were perfectly legit? I thought that's how it is supposed to work. What's the point of full data checksumming if only an explicitly requested scrub operation might look for errors? The logical thing, I thought, is for checksum verification to happen during every single read operation, with passing that check mandatory in order to get any data out of the filesystem (excluding, perhaps, Direct-I/O mode, but I never use that on Btrfs -- if that's even actually supported, I don't know).

Now I am really considering moving from Linux to Windows and from Btrfs RAID-5 to Storage Spaces RAID-1 + ReFS (the only limitation is that ReFS is only "self-healing" on RAID-1, not RAID-5, so I would need a new motherboard with more native SATA connectors and an extra HDD). That one seemed to actually do what it promises (abort any read operation upon a checksum error [the check happens seamlessly on every read], but look at the redundant data first and seamlessly "self-heal" if possible). The only thing which made Btrfs look like a better alternative was the RAID-5 support.
But I recently experienced two cases of one drive failing out of three, and it always turned out to be a smaller or bigger disaster (completely lost data or inconsistent data).

Does anybody have ideas about what might have gone wrong in this second scenario?
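As a sketch, the checks described in this post boil down to something like the following (device names and mount points are placeholders):

sudo smartctl -t long /dev/sdX      # start the extended SMART self-test
sudo smartctl -a /dev/sdX           # review Reallocated_Sector_Ct, Current_Pending_Sector, UDMA_CRC_Error_Count
sudo btrfs scrub start /data        # re-verify all data and metadata checksums
sudo btrfs scrub status /data       # reports corrected vs. uncorrectable error counts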
Re: Btrfs/RAID5 became unmountable after SATA cable fault
On 2015-11-04 13:01, Janos Toth F. wrote:
> But the worst part is that there are some ISO files which were
> seemingly copied without errors but whose external checksums (the ones
> I can calculate with md5sum and compare to the ones supplied by the
> publisher of the ISO file) don't match!
> Well... this, I cannot understand.
> How could these files become corrupt from a single disk failure? And
> more importantly: how could these files be copied without errors? Why
> didn't Btrfs give a read error when the checksums didn't add up?

If you can prove that there was a checksum mismatch and BTRFS returned invalid data instead of a read error or going to the other disk, then that is a very serious bug that needs to be fixed. You need to keep in mind also, however, that it's completely possible that the data was bad before you wrote it to the filesystem, and if that's the case, there's nothing any filesystem can do to fix it for you.

> Isn't Btrfs supposed to constantly check the integrity of the file data
> during any normal read operation and give an error instead of spitting
> out corrupt data as if it were perfectly legit? I thought that's how it
> is supposed to work.

Assuming that all of your hardware is working exactly like it's supposed to, yes, it should work that way. If, however, you have something that corrupts the data in RAM before or while BTRFS is computing the checksum prior to writing the data, then it's fully possible for bad data to get written to disk and still have a perfectly correct checksum. Bad RAM may also explain your issues mentioned above with not being able to copy stuff off of the filesystem. Also, if you're using NOCOW files (or just the mount option), those very specifically do not store checksums for their blocks, because there is no way to do that without significant risk of data corruption.

> What's the point of full data checksumming if only an explicitly
> requested scrub operation might look for errors? [...]
> Now I am really considering moving from Linux to Windows and from
> Btrfs RAID-5 to Storage Spaces RAID-1 + ReFS (the only limitation is
> that ReFS is only "self-healing" on RAID-1, not RAID-5, so I would need
> a new motherboard with more native SATA connectors and an extra HDD).
> That one seemed to actually do what it promises (abort any read
> operation upon a checksum error [the check happens seamlessly on every
> read], but look at the redundant data first and seamlessly "self-heal"
> if possible). The only thing which made Btrfs look like a better
> alternative was the RAID-5 support. But I recently experienced two
> cases of one drive failing out of three, and it always turned out to be
> a smaller or bigger disaster (completely lost data or inconsistent
> data).

Have you considered looking into ZFS? I hate to suggest it as an alternative to BTRFS, but it's a much more mature and well tested technology than ReFS, and has many of the same features as BTRFS (and even has the option for triple parity instead of the double you get with RAID6).
If you do consider ZFS, make a point to look at FreeBSD in addition to the Linux version; the BSD one was a much better written port of the original Solaris drivers, and has better performance in many cases (and as much as I hate to admit it, BSD is way more reliable than Linux in most use cases).

You should also seriously consider whether the convenience of having a filesystem that fixes internal errors itself with no user intervention is worth the risk of it corrupting your data. Returning correct data whenever possible is one thing; being 'self-healing' is completely different. When you start talking about things that automatically fix internal errors without user intervention is when most seasoned system administrators start to get really nervous. Self-correcting systems have just as much chance to make things worse as they do to make things better, and most of them depend on the underlying hardware working correctly to actually provide any guarantee of reliability. I cannot count the number of stories I've heard of 'self-healing' hardware RAID controllers destroying data.
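Whether a given file is NOCOW (and therefore carries no checksums) can be checked from userspace; the paths here are placeholders:

lsattr /data/vm.img            # a 'C' in the attribute column means NOCOW: no checksums for this file
grep /data /proc/mounts        # or check for nodatacow/nodatasum among the mount options
chattr +C /data/newfile        # note: +C only takes effect on a file that is still empty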
Re: Btrfs/RAID5 became unmountable after SATA cable fault
I went through all the recovery options I could find (starting from read-only and ending at "extraordinarily dangerous"). Nothing seemed to work.

A Windows-based proprietary recovery tool (ReclaiMe) could scratch the surface, but only that: it showed me the whole original folder structure after a few minutes of scanning, and the "preview" of some plaintext files was promising, but most of the bigger files seemed to be broken.

I used this as bulk storage for backups and all the things I didn't care to keep in more than one copy, but that includes my "scratchpad", so I cared enough to use RAID5 mode and to try restoring some things.

Any last ideas before I "ata secure erase" and sell/repurpose the disks?
Re: Btrfs/RAID5 became unmountable after SATA cable fault
If it is for mostly archival storage, I would suggest you take a look at snapraid.

On Wed, Oct 21, 2015 at 9:09 AM, Janos Toth F. wrote:
> I went through all the recovery options I could find (starting from
> read-only and ending at "extraordinarily dangerous"). Nothing seemed
> to work.
>
> A Windows-based proprietary recovery tool (ReclaiMe) could scratch the
> surface, but only that: it showed me the whole original folder
> structure after a few minutes of scanning, and the "preview" of some
> plaintext files was promising, but most of the bigger files seemed to
> be broken.
>
> I used this as bulk storage for backups and all the things I didn't
> care to keep in more than one copy, but that includes my "scratchpad",
> so I cared enough to use RAID5 mode and to try restoring some things.
>
> Any last ideas before I "ata secure erase" and sell/repurpose the disks?
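For context, snapraid is driven by a plain-text config file mapping data disks to parity; a minimal sketch, with all paths as placeholders:

parity /mnt/parity1/snapraid.parity
content /var/snapraid/snapraid.content
content /mnt/disk1/snapraid.content
data d1 /mnt/disk1
data d2 /mnt/disk2

Redundancy is then maintained by running "snapraid sync" after changes and "snapraid scrub" periodically, which suits archival data that rarely changes and is a poor fit for live VM images.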
Re: Btrfs/RAID5 became unmountable after SATA cable fault
Maybe hold off on erasing the drives a little, in case someone wants to collect some extra data for diagnosing how/why the filesystem got into this unrecoverable state. A single device having issues should not cause the whole filesystem to become unrecoverable.

On Wed, Oct 21, 2015 at 9:09 AM, Janos Toth F. wrote:
> I went through all the recovery options I could find (starting from
> read-only and ending at "extraordinarily dangerous"). Nothing seemed
> to work.
>
> A Windows-based proprietary recovery tool (ReclaiMe) could scratch the
> surface, but only that: it showed me the whole original folder
> structure after a few minutes of scanning, and the "preview" of some
> plaintext files was promising, but most of the bigger files seemed to
> be broken.
>
> I used this as bulk storage for backups and all the things I didn't
> care to keep in more than one copy, but that includes my "scratchpad",
> so I cared enough to use RAID5 mode and to try restoring some things.
>
> Any last ideas before I "ata secure erase" and sell/repurpose the disks?
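One way to capture such data without shipping whole disks is a metadata-only image, assuming the btrfs-image tool from btrfs-progs can still read the filesystem; the device and output paths are placeholders:

sudo btrfs-image -c9 -t4 -s /dev/sdd /tmp/btrfs-metadata.img
# -c9 compresses the dump, -t4 uses four threads, -s sanitizes file names;
# file contents are not included, only the filesystem trees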
Re: Btrfs/RAID5 became unmountable after SATA cable fault
I am afraid the filesystem right now is really damaged, regardless of its state upon the unexpected cable failure, because after the read-only restore/recovery methods all failed I tried some dangerous options (including zero-log, followed by init-csum-tree and even chunk-recovery). All of them just spat out several kinds of errors, which suggests they probably didn't even write anything to the disks before deciding they had failed -- but if they did write something, they only caused more harm than good.

Actually, I almost got rid of this data myself, intentionally, when my new set of drives arrived. I was considering whether I should simply start from scratch (possibly reviewing and saving the "scratchpad" portion of the data, but nothing really irreplaceable and/or valuable), but I thought it was a good idea to test the "device replace" function in real life. Even though the replace operation seemed to be successful, I am beginning to wonder if it really was.

On Wed, Oct 21, 2015 at 7:42 PM, ronnie sahlberg wrote:
> Maybe hold off on erasing the drives a little, in case someone wants to
> collect some extra data for diagnosing how/why the filesystem got into
> this unrecoverable state.
>
> A single device having issues should not cause the whole filesystem to
> become unrecoverable.
>
> On Wed, Oct 21, 2015 at 9:09 AM, Janos Toth F. wrote:
>> I went through all the recovery options I could find (starting from
>> read-only and ending at "extraordinarily dangerous"). Nothing seemed
>> to work.
>>
>> A Windows-based proprietary recovery tool (ReclaiMe) could scratch the
>> surface, but only that: it showed me the whole original folder
>> structure after a few minutes of scanning, and the "preview" of some
>> plaintext files was promising, but most of the bigger files seemed to
>> be broken.
>>
>> I used this as bulk storage for backups and all the things I didn't
>> care to keep in more than one copy, but that includes my "scratchpad",
>> so I cared enough to use RAID5 mode and to try restoring some things.
>>
>> Any last ideas before I "ata secure erase" and sell/repurpose the disks?
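For reference, the operations named above correspond roughly to these btrfs-progs invocations against an unmounted filesystem (the device name is a placeholder); all three modify metadata and can make a damaged filesystem worse, which is why they are documented as last resorts:

sudo btrfs rescue zero-log /dev/sdX          # discard the write-ahead log tree
sudo btrfs check --init-csum-tree /dev/sdX   # throw away and rebuild the checksum tree
sudo btrfs rescue chunk-recover /dev/sdX     # rebuild the chunk tree by scanning the devices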
Re: Btrfs/RAID5 became unmountable after SATA cable fault
https://btrfs.wiki.kernel.org/index.php/Restore

This should still be possible even with a degraded/unmounted raid5. It is a bit tedious to figure out how to use it, but if you've got some things you want off the volume, it's not so difficult that it should stop you from trying.

Chris Murphy
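A minimal sketch of using it, with the device, destination, and files all as placeholders (the destination must be a separate, healthy filesystem with enough free space):

sudo btrfs restore -D /dev/sdb /tmp/anywhere     # dry run: only lists what would be recovered
sudo btrfs restore -v -i /dev/sdb /mnt/rescue    # -v verbose, -i ignore errors and keep going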
Re: Btrfs/RAID5 became unmountable after SATA cable fault
I tried several things, including the degraded mount option. One example:

# mount /dev/sdb /data -o ro,degraded,nodatasum,notreelog
mount: wrong fs type, bad option, bad superblock on /dev/sdb,
       missing codepage or helper program, or other error
       In some cases useful info is found in syslog - try
       dmesg | tail or so.

# cat /proc/kmsg
<6>[ 262.616929] BTRFS info (device sdd): allowing degraded mounts
<6>[ 262.616943] BTRFS info (device sdd): setting nodatasum
<6>[ 262.616949] BTRFS info (device sdd): disk space caching is enabled
<6>[ 262.616953] BTRFS: has skinny extents
<6>[ 262.652671] BTRFS: bdev (null) errs: wr 858, rd 8057, flush 280, corrupt 0, gen 0
<3>[ 262.697162] BTRFS (device sdd): parent transid verify failed on 38719488 wanted 101765 found 101223
<3>[ 262.697633] BTRFS (device sdd): parent transid verify failed on 38719488 wanted 101765 found 101223
<3>[ 262.697660] BTRFS: Failed to read block groups: -5
<3>[ 262.709885] BTRFS: open_ctree failed
<6>[ 267.197365] BTRFS info (device sdd): allowing degraded mounts
<6>[ 267.197385] BTRFS info (device sdd): setting nodatasum
<6>[ 267.197397] BTRFS info (device sdd): disabling tree log
<6>[ 267.197406] BTRFS info (device sdd): disk space caching is enabled
<6>[ 267.197412] BTRFS: has skinny extents
<6>[ 267.232809] BTRFS: bdev (null) errs: wr 858, rd 8057, flush 280, corrupt 0, gen 0
<3>[ 267.246167] BTRFS (device sdd): parent transid verify failed on 38719488 wanted 101765 found 101223
<3>[ 267.246706] BTRFS (device sdd): parent transid verify failed on 38719488 wanted 101765 found 101223
<3>[ 267.246727] BTRFS: Failed to read block groups: -5
<3>[ 267.261392] BTRFS: open_ctree failed

On Wed, Oct 21, 2015 at 6:09 PM, Janos Toth F. wrote:
> I went through all the recovery options I could find (starting from
> read-only and ending at "extraordinarily dangerous"). Nothing seemed
> to work.
>
> A Windows-based proprietary recovery tool (ReclaiMe) could scratch the
> surface, but only that: it showed me the whole original folder
> structure after a few minutes of scanning, and the "preview" of some
> plaintext files was promising, but most of the bigger files seemed to
> be broken.
>
> I used this as bulk storage for backups and all the things I didn't
> care to keep in more than one copy, but that includes my "scratchpad",
> so I cared enough to use RAID5 mode and to try restoring some things.
>
> Any last ideas before I "ata secure erase" and sell/repurpose the disks?
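Given the "parent transid verify failed ... wanted 101765 found 101223" errors, one thing worth comparing (purely as a diagnostic sketch; device names are placeholders) is the superblock generation on each member device, and whether an older tree root is still intact:

sudo btrfs inspect-internal dump-super /dev/sdb | grep generation   # compare across all member devices
sudo btrfs-find-root /dev/sdb       # lists older tree roots that "btrfs restore -t <bytenr>" can target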