> On 28 Apr 2017, at 02:45, Qu Wenruo <quwen...@cn.fujitsu.com> wrote:
>
> At 04/26/2017 01:50 AM, Christophe de Dinechin wrote:
>> Hi,
>>
>> I've been trying to run btrfs as my primary work filesystem for about
>> 3-4 months now on Fedora 25 systems. I ran into filesystem corruption a
>> few times. At least one instance I attributed to a damaged disk, but the
>> last one is on a brand new 3T disk that reports no SMART errors. Worse
>> yet, in at least three cases, the filesystem corruption caused btrfsck
>> to crash.
>>
>> The last filesystem corruption is documented here:
>> https://bugzilla.redhat.com/show_bug.cgi?id=1444821. The dmesg log is in
>> there.
>
> According to the bugzilla, the btrfs-progs seems to be too old by btrfs
> standards.
>
> What about using the latest btrfs-progs v4.10.2?

I tried 4.10.1-1: https://bugzilla.redhat.com/show_bug.cgi?id=1435567#c4.
I am currently debugging with a build from the master branch as of Tuesday
(commit bd0ab27afbf14370f9f0da1f5f5ecbb0adc654c1), which is 4.10.2. There
was no change in behavior; runs are split about evenly between the list
crash and the abort.

I added instrumentation and tried a fix, which brings me a tiny bit
further, until I hit this message from delete_duplicate_records:

    Ok we have overlapping extents that aren't completely covered by
    each other, this is going to require more careful thought

The extents are [52428800-16384] and [52432896-16384].
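To spell out why neither record covers the other: the first one ends at
52445184, which is past the second one's start (52432896) but short of its
end (52449280), so the two overlap by 12288 bytes without either containing
the other. A throwaway sketch of that containment test (fully_covers is my
own hypothetical helper, not something from cmds-check.c):

    /* Extent records are shown as [start-size]; two of them can overlap
     * with neither one containing the other, which is the case
     * delete_duplicate_records gives up on. */
    #include <stdint.h>
    #include <stdio.h>

    static int fully_covers(uint64_t a_start, uint64_t a_size,
                            uint64_t b_start, uint64_t b_size)
    {
            return a_start <= b_start &&
                   a_start + a_size >= b_start + b_size;
    }

    int main(void)
    {
            uint64_t a = 52428800, b = 52432896, size = 16384;

            printf("overlap: %d\n", b < a + size);      /* prints 1 */
            printf("a covers b: %d\n",
                   fully_covers(a, size, b, size));     /* prints 0 */
            printf("b covers a: %d\n",
                   fully_covers(b, size, a, size));     /* prints 0 */
            return 0;
    }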
> Furthermore for v4.10.2, btrfs check provides a new mode called lowmem.
> You could try "btrfs check --mode=lowmem" to see if such problem can be
> avoided.

I will try that, but what makes you think this is a memory-related
condition? The machine has 16G of RAM; isn't that enough for an fsck?

> For the kernel bug, it seems to be related to wrongly inserted delayed
> ref, but I can totally be wrong.

For now, I'm focusing on the "repair" part as much as I can, because I
assume the kernel bug is there anyway, so someone else is bound to hit
this problem. The two defensive sketches below show the direction I'm
taking.
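On the abort side, the direction I'm experimenting with is simply to
refuse to mark a zero-sized record dirty. This is a sketch, not the actual
patch: extent_record_view is a stand-in for the fields of struct
extent_record that matter here, and the real call site in cmds-check.c
looks different.

    #include <stdint.h>
    #include <stdio.h>

    struct extent_record_view {
            uint64_t start;
            uint64_t max_size;
    };

    int mark_extent_rec_dirty(struct extent_record_view *rec)
    {
            if (rec->max_size == 0) {
                    /* A zero max_size would make the range end at
                     * start - 1, which as far as I can tell is what
                     * trips the abort; report and skip instead. */
                    fprintf(stderr, "rec %llu: max_size == 0, skipped\n",
                            (unsigned long long)rec->start);
                    return -1;
            }
            /* The real code would now call set_extent_dirty() on
             * [rec->start, rec->start + rec->max_size - 1]. */
            return 0;
    }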
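On the SIGSEGV side, the obvious band-aid is to refuse to list_del() an
entry that is not linked, trading the crash for a leak while I track down
how the backref gets into that state. Again just a sketch, not a fix:
struct list_head and list_del() come from kernel-lib/list.h, and
safe_backref_del is a hypothetical wrapper of mine.

    #include <stdio.h>
    #include "kernel-lib/list.h"

    void safe_backref_del(struct list_head *entry)
    {
            if (!entry->next || !entry->prev) {
                    /* list_del() would dereference a NULL pointer here;
                     * leak the entry so the rest of the check can run. */
                    fprintf(stderr, "backref entry %p not linked, skipping\n",
                            (void *)entry);
                    return;
            }
            list_del(entry);
    }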
Thanks,
Christophe

> Thanks,
> Qu
>
>> The btrfsck crash is here:
>> https://bugzilla.redhat.com/show_bug.cgi?id=1435567. I have two crash
>> modes: either an abort or a SIGSEGV. I checked that both still happen on
>> master as of today.
>>
>> The cause of the abort is that we call set_extent_dirty from
>> check_extent_refs with rec->max_size == 0. I've instrumented the code to
>> try to see where we set this to 0 (see
>> https://github.com/c3d/btrfs-progs/tree/rhbz1435567), and indeed, we do
>> sometimes see max_size set to 0 in a few locations. My instrumentation
>> shows this:
>>
>> 78655 [1.792241:0x451fe0] MAX_SIZE_ZERO: Add extent rec 0x139eb80 max_size 16384 tmpl 0x7fffffffd120
>> 78657 [1.792242:0x451cb8] MAX_SIZE_ZERO: Set max size 0 for rec 0x139ec50 from tmpl 0x7fffffffcf80
>> 78660 [1.792244:0x451fe0] MAX_SIZE_ZERO: Add extent rec 0x139ed50 max_size 16384 tmpl 0x7fffffffd120
>>
>> I don't really know what to make of it.
>>
>> The cause of the SIGSEGV is that we try to free a list entry that has
>> its next set to NULL:
>>
>> #0 list_del (entry=0x555555db0420) at /usr/src/debug/btrfs-progs-v4.10.1/kernel-lib/list.h:125
>> #1 free_all_extent_backrefs (rec=0x555555db0350) at cmds-check.c:5386
>> #2 maybe_free_extent_rec (extent_cache=0x7fffffffd990, rec=0x555555db0350) at cmds-check.c:5417
>> #3 0x00005555555b308f in check_block (flags=<optimized out>, buf=0x55557b87cdf0, extent_cache=0x7fffffffd990, root=0x55555587d570) at cmds-check.c:5851
>> #4 run_next_block (root=root@entry=0x55555587d570, bits=bits@entry=0x5555558841
>>
>> I don't know if the two problems are related, but they seem to be pretty
>> consistent on this specific disk, so I think we have a good opportunity
>> to improve btrfsck and make it more robust against this specific form of
>> corruption. But I don't want to haphazardly modify code I don't really
>> understand, so if anybody could suggest what the right strategy should
>> be when we have max_size == 0, or how to avoid it in the first place,
>> that would help.
>>
>> I don't know if this is relevant at all, but all the machines that
>> failed this way were used to run VMs with KVM/QEMU. Disk activity tends
>> to be somewhat intense on occasion, since the VMs running there are part
>> of a personal Jenkins ring that automatically builds various projects.
>> Nominally, there are between three and five guests running (Windows XP,
>> Windows 10, macOS, Fedora 25, Ubuntu 16.04).
>>
>> Thanks,
>> Christophe de Dinechin
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html