Thanks for the explanations. Chris, I don't think 'degraded' did anything to help the mounting; I just passed it in to see if it would help (I'm not sure whether btrfs is "smart" enough to ignore a drive when that would increase the chance of mounting the volume, even degraded, but one could hope). I believe the key was 'nologreplay'.
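For reference, the recovery mount looked roughly like this (a sketch; 'nologreplay' is only accepted together with a read-only mount, and any of the three member devices can be named):

# mount -o ro,degraded,nologreplay /dev/bcache0 /tmp/root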
Here is some info about the corrupted fs:

# btrfs fi show /tmp/root/
Label: 'kvm-btrfs'  uuid: fef29f0a-dc4c-4cc4-b524-914e6630803c
        Total devices 3 FS bytes used 3.30TiB
        devid    1 size 2.73TiB used 2.09TiB path /dev/bcache32
        devid    2 size 2.73TiB used 2.09TiB path /dev/bcache0
        devid    3 size 2.73TiB used 2.09TiB path /dev/bcache16

# btrfs fi usage /tmp/root/
WARNING: RAID56 detected, not implemented
WARNING: RAID56 detected, not implemented
WARNING: RAID56 detected, not implemented
Overall:
    Device size:                   8.18TiB
    Device allocated:                0.00B
    Device unallocated:            8.18TiB
    Device missing:                  0.00B
    Used:                            0.00B
    Free (estimated):                0.00B      (min: 8.00EiB)
    Data ratio:                       0.00
    Metadata ratio:                   0.00
    Global reserve:              512.00MiB      (used: 0.00B)

Data,RAID5: Size:4.15TiB, Used:3.28TiB
   /dev/bcache0    2.08TiB
   /dev/bcache16   2.08TiB
   /dev/bcache32   2.08TiB

Metadata,RAID5: Size:22.00GiB, Used:20.69GiB
   /dev/bcache0   11.00GiB
   /dev/bcache16  11.00GiB
   /dev/bcache32  11.00GiB

System,RAID5: Size:64.00MiB, Used:400.00KiB
   /dev/bcache0   32.00MiB
   /dev/bcache16  32.00MiB
   /dev/bcache32  32.00MiB

Unallocated:
   /dev/bcache0  655.00GiB
   /dev/bcache16 655.00GiB
   /dev/bcache32 656.49GiB

So it looks like I set the metadata and system chunks to RAID5 rather than RAID1. I suspect they were affected by the write hole, which would explain the problem I was seeing. Since I get about the same space usage with RAID1 as with RAID5, I'm just going to use RAID1; I don't need stripe performance or anything like that (a conversion sketch is below). It would be nice if btrfs supported hotplug and re-plug a little better so that it is more "production" quality, but I just have to be patient. I'm familiar with Gluster and have contributed code to Ceph, so I know those types of distributed systems well; I really like them, but their complexity is overkill for my needs at home.

As for bcache performance: I have two Crucial MX200 250GB drives in md RAID1 containing /boot (ext2), swap, and then bcache. The backing drives are two WD Reds and a Seagate Barracuda desktop drive, all 3TB. With bcache in writeback mode, apt-get would be painfully slow. Watching iostat, the SSDs would do only a few hundred IOPS while the backing disks were very busy and were the overall limiting factor. Even though apt-get had just downloaded the files (which should have landed on the SSDs because of writeback), it still involved the backing disks far too much. The amount of dirty data was always under 10%, so there should have been plenty of cache space to free without having to flush. I experimented with changing the sequential IO cutoff to force more into the cache, increasing the dirty ratio, and so on, but nothing delivered the performance I was hoping for (the sysfs knobs involved are sketched below). To be fair, a pair of SSDs (md RAID1) caching three spindles (btrfs RAID5) may not be an ideal configuration; with three SSDs, one per drive, it might have performed better. I also have ~980 snapshots spread over a year's time, so I don't know how much that impacts things. I did use a btrfs utility to find duplicate files/chunks and dedupe them, so that updated system binaries in upgraded LXC containers would share the same space on disk and use the bcache cache more efficiently.

After restoring the root and LXC root snapshots onto one of the SSDs (I broke the md RAID1 so I could restore to it), I ran apt-get and saw peaks of 2,400 IOPS, sustained around 1,200 IOPS (btrfs single on the degraded md RAID1). I know btrfs has some performance challenges, but I don't think I was hitting those; it was most likely the rather unusual combination of bcache and btrfs RAID that caused the problem.
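The conversion I have in mind should just be a single balance with convert filters; a sketch, assuming the filesystem mounts read-write at /tmp/root (drop -dconvert to leave the data as RAID5; -f is required because balance refuses to explicitly operate on system chunks without it):

# btrfs balance start -f -dconvert=raid1 -mconvert=raid1 -sconvert=raid1 /tmp/root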
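The bcache knobs I mentioned all live in sysfs; a sketch of the kind of thing I tried (device name and values are illustrative, not a recommendation):

# echo 0 > /sys/block/bcache0/bcache/sequential_cutoff   # 0 disables the sequential bypass, so every IO is cached
# echo 40 > /sys/block/bcache0/bcache/writeback_percent  # raise the dirty-data target from the default of 10
# cat /sys/block/bcache0/bcache/cache_mode               # shows the current mode (writethrough/writeback/writearound/none)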
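duperemove is one utility of that kind (it finds duplicate extents and dedupes them via the kernel's dedupe ioctl, so extents stay shared as reflinks); a sketch with a hypothetical container path:

# duperemove -dhr /var/lib/lxc   # -d actually dedupes, -h human-readable sizes, -r recurse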
I have bcache on a 10-year-old desktop box with a single NVMe drive, and it performs a little better, but it's hard to be certain because of the machine's age. It has bcache in writearound mode (since there is only the one NVMe drive) and btrfs in RAID1. I haven't watched that box as closely because it is responsive enough. It also has only 4 GB of RAM, so it constantly has to swap (web pages are memory hogs these days), which was one of the reasons to retrofit that box with NVMe rather than an MX200.

If you have any other questions, feel free to ask. Thanks

----------------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1