On Thu, Oct 25, 2018 at 3:47 AM, Dmitry Katsubo <dm...@mail.ru> wrote:
>
> BTRFS error (device sdf): bdev /dev/sdh errs: wr 0, rd 1867, flush 0,
> corrupt 0, gen 0
> BTRFS error (device sdf): bdev /dev/sdg errs: wr 0, rd 1867, flush 0,
> corrupt 0, gen 0
>
> Attempts lasted for 29 minutes.

Yep, and it floods the log. It's extra fun if the journal is on the device with errors. The more errors, the more writes and reads to the problem drive, the more errors, the more writes, the more errors... snowball.

But that's the state of error handling on Btrfs, which is still more sophisticated than other file systems. It's not more sophisticated than the kernel's md driver, which does have some sort of read error rate limit and then it'll kick the drive out of the array (faulty state) and stop complaining about it. And I think it considers a drive faulty on a single write failure.

> Thanks for this information. I have a situation similar to yours, with
> only important difference that my drives are put into the USB dock with
> independent power and cooling like this one:
>
> https://www.ebay.com/itm/Mediasonic-ProBox-4-Bay-3-5-Hard-Drive-Enclosure-USB-3-0-eSATA-Sata-3-6-0Gbps/273161164246
>
> so I don't think I need to worry about amps. This dock is connected
> directly to USB port on the motherboard.

It is entirely plausible that this still needs a hub, but it really depends on the exact errors you're getting. And those need to go to the linux-usb list; I don't know enough about it. And it might require a bit of luck to get a reply because it's a very busy list.

My main recommendation is to be very concise. They will want to know the hardware setup (topology), lsusb -v, lspci, and a complete dmesg. It'll seem reasonable to snip just the usb error messages, but that almost always drives developers crazy, because important hints for problems can show up in kernel messages during boot, so they will inevitably want the whole dmesg. The ideal scenario is to do a clean boot, reproduce the problem, and then capture the dmesg; that way it's a concise dmesg that isn't two weeks old with a bunch of device connects and disconnects or whatever.
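
A minimal sketch of gathering those three outputs in one go, right after reproducing the problem on a clean boot. Only the commands themselves (lsusb -v, lspci, dmesg) come from the advice above; the function name, the output file names and the `runner` parameter are made up for illustration, and Python is just a convenient wrapper here.

```python
import subprocess
from pathlib import Path

# The commands are the ones named above; the file names are illustrative.
REPORT_COMMANDS = {
    "lsusb-v.txt": ["lsusb", "-v"],
    "lspci.txt": ["lspci"],
    "dmesg.txt": ["dmesg"],  # may need root on systems that restrict dmesg
}


def collect_reports(outdir, commands=REPORT_COMMANDS, runner=subprocess.run):
    """Run each command and save its complete, unsnipped output under outdir."""
    out = Path(outdir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for filename, argv in commands.items():
        try:
            result = runner(argv, capture_output=True, text=True, check=False)
            text = result.stdout + result.stderr
        except OSError as exc:  # for example, the tool is not installed
            text = f"could not run {' '.join(argv)}: {exc}\n"
        path = out / filename
        path.write_text(text)
        written.append(path)
    return written
```

Calling `collect_reports("usb-report")` would leave three files to attach whole. Nothing is filtered on purpose, since the boot-time messages are exactly what tends to get snipped.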
There almost certainly will be usb kernel parameters for debugging; ideally you search the linux-usb list archives to find out what they are (I'm not sure) so that you already have that set for your clean boot. There might be usb quirks for your hardware setup that apply. Or they might suggest that it still needs a USB hub to clean things up between controller and bridge chipset.

> However indeed there could be bugs both on dock side and in south bridge.
> More over I could imagine that USB reset happens due to another USB device,
> like a wave stated in one place turning into tsunami for the whole
> USB subsystem.

If there is a hub, one of its jobs is to prevent that from happening. And if the drive enclosure and problem device are on separate ports, they are effectively going through a built-in hub in the usb host device. But yeah, you want to tell linux-usb exactly what devices (and chipsets, which lsusb -v will show) you're using because they may already know about such problems.

>
>> There are pending patches for something similar that you can find in
>> the archives. I think the reason they haven't been merged yet is there
>> haven't been enough comments and feedback (?). I think Anand Jain is
>> the author of those patches so you might dig around in the archives.
>> In a way you have an ideal setup for testing them out. Just make sure
>> you have backups...
>
>
> Thanks for reference. Should I look for this patch here:
>
> https://patchwork.kernel.org/project/linux-btrfs/list/?submitter=34632&order=-date

Maybe, it's a lot of patches to go through. I'm using https://lore.kernel.org/linux-btrfs which has a search field. This is the recent email I was thinking of that might point you in the right direction:

https://lore.kernel.org/linux-btrfs/2287c62d-6dbb-3b30-1134-d754e4294...@oracle.com/

A complicating factor is that the block layer does do some retries.
I'm also not familiar enough with the way md does retries and sets drives as faulty, or whether that is really what Btrfs should replicate. Some of these conversations require cooperation with other kernel developers, I suspect, like libata, SCSI, USB, SD, in order to make sure no one is being stepped on with some big surprise.

>
> I didn't observe any errors while doing "btrfs check" on this volume after
> several such resets, because that volume is mostly used for reading and
> chance that USB reset happens during the write is very low.

If it mounts and the most recent changes are readable without errors, the file system is probably fine. Btrfs is pretty good at detecting and correcting for hardware-related problems, in that it is fussier than other file systems because it can detect such problems in both metadata and data and should be able to avoid them in the first place due to always-on COW (as long as you haven't disabled it).

But there is some evidence that old Btrfs bugs could induce corruption in metadata that doesn't turn into a problem until a long time later. The scrubs only check whether metadata and its checksum match up (which detects corruption introduced elsewhere in the storage stack), so a scrub most often can't find bugs that cause corruption.

Your best bet for sidestepping such problems is backups, and using the most recent kernel you can. If you encounter some problem that might be a bug, inevitably you'll need to test with a newer kernel version anyway to see if it's still a bug. Each merge cycle involves thousands of lines of changes just for Btrfs, and there's more to the storage stack in the kernel than just Btrfs.

In your use case with mostly reads, and probably you also don't care about write performance, you could consider mounting with notreelog. This will drop the use of the treelog, which is used to improve performance on operations that use fsync. With this option, transactions calling fsync() fall back to sync(), so it's safer but slower.

--
Chris Murphy
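
For checking whether a volume really ended up mounted with notreelog, a rough sketch that reads text in the /proc/mounts layout. It assumes the option shows up in the options column for that mount; the function names, device and mount point are hypothetical.

```python
def btrfs_mount_options(mounts_text):
    """Map each Btrfs mount point to its set of mount options.

    mounts_text uses the /proc/mounts layout:
    device mountpoint fstype options dump pass
    """
    result = {}
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 4 and fields[2] == "btrfs":
            result[fields[1]] = set(fields[3].split(","))
    return result


def uses_notreelog(mounts_text, mount_point):
    """True if mount_point is a Btrfs mount with notreelog in its options."""
    options = btrfs_mount_options(mounts_text).get(mount_point, set())
    return "notreelog" in options


# Hypothetical sample line; on a real system pass the contents of /proc/mounts.
SAMPLE = "/dev/sdf /mnt/data btrfs rw,relatime,notreelog,subvolid=5,subvol=/ 0 0\n"
print(uses_notreelog(SAMPLE, "/mnt/data"))
```

On a live system the input would be something like `open("/proc/mounts").read()`, with the real mount point in place of /mnt/data.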