Re: btrfs kernel oops on mount
On 09/12/2016 07:37 AM, Austin S. Hemmelgarn wrote:
>> On 2016-09-09 15:23, moparisthebest wrote:
>> Didn't ubuntu on kernel 4.4 die in the same can_overcommit function?
>> (https://www.moparisthebest.com/btrfsoops.jpg) what kind of hardware
>> issues would cause a repeatable kernel crash like that? Like am I
>> looking at memory issues or the SAS controller or what?
> It doesn't look like it died in can_overcommit, as that's not anywhere
> on the stack trace. The second item on the stack though
> (btrfs_async_reclaim_metadata_space) at least partly reinforces the
> suspicion that something is messed up in the filesystem's metadata (which
> could explain the allocations in GlobalReserve, which is a subset of the
> Metadata chunks). It looks like each crash was in a different place,
> but at least the first two could easily be different parts of the kernel
> choking on the same thing. As far as the crash in can_overcommit, that
> combined with the apparent corrupted metadata makes me think there may
> be a hardware problem. The first thing I'd check in that respect is the
> cabling to the drives themselves, followed by system RAM, the PSU, and
> the storage controller. I generally check in that order because
> it's trivial to check the cabling, and not all that difficult to check
> the RAM and PSU (and RAM is more likely to go bad than the PSU), and
> properly checking a storage controller is extremely difficult unless you
> have another known working one you can swap it for (and even then, it's
> only practical to check if you know the state on disk won't cause the
> kernel to choke).

The first RIP: line (https://www.moparisthebest.com/btrfsoops.jpg) ends in
'can_overcommit+0x1e/0xf0 [btrfs]'. Apologies for that being a literal
picture of a CRT instead of a searchable text file; it doesn't exactly make
things easy... :(

Still, I'm relieved that more points to bad metadata than to bad hardware.
-- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: btrfs kernel oops on mount
On 09/09/2016 03:28 PM, Chris Murphy wrote:
> Sounds like with enospc devs want to see a couple things beyond what I
> asked for:
>
> enospc_debug
> grep -IR . /sys/fs/btrfs/UUID/allocation/
>
> That's kinda hard to do right now if it's not mounting though...

I managed to get more output from arch/4.7.2 using netconsole. I did end up
with duplicate lines somehow, which uniq fixed, but some of the crash is
mixed together on the same line. I didn't mess with that for fear of taking
out something important:

https://www.moparisthebest.com/btrfs/archnetconsole.txt

I was also able to mount it ro, so I ran the grep you asked for:

https://www.moparisthebest.com/btrfs/enospcdebug.txt

I tried mounting with mount -o rw,skip_balance and it still locked up, so
for now it's read-only... Let me know what else I can provide or try.

I haven't been able to boot with a 4.7 kernel on my ubuntu install, so I
figure 4.8 will be the same. I guess I'll need to permanently install
something like arch to a flash drive and try 4.8 from there.

Thanks!
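A side note on the uniq step mentioned above: uniq only collapses repeats that are adjacent, so identical lines that legitimately occur far apart in a crash log survive. A minimal Python sketch of the same behaviour follows; the function name is made up for illustration and is not part of any tool used in this thread.

```python
from itertools import groupby


def drop_adjacent_duplicates(lines):
    """Collapse runs of identical consecutive lines, like `uniq` does.

    Lines that repeat later, after something else in between, are kept.
    """
    # groupby() groups consecutive equal items; keeping one key per
    # group removes only the adjacent repeats.
    return [line for line, _run in groupby(lines)]
```

This leaves interleaved or partially merged lines alone, which matches the choice above not to touch them.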
Re: btrfs kernel oops on mount
On 09/09/2016 01:51 PM, Chris Murphy wrote:
> Also btrfs check output is useful to include also (without --repair
> for starters).

btrfs check --readonly output is here:

https://www.moparisthebest.com/btrfs/btrfscheck.txt

*Most* of it anyway. I messed up with tmux, and it took 20 hours to run, so
I don't really want to run it again unless you need me to.

Now that check is over I'll try the other things suggested.
Re: btrfs kernel oops on mount
On 09/09/2016 02:47 PM, Austin S. Hemmelgarn wrote:
> On 2016-09-09 12:12, moparisthebest wrote:
>> Hi,
>>
>> I'm hoping to get some help with mounting my btrfs array which quit
>> working yesterday. My array was in the middle of a balance, about 50%
>> remaining, when it hit an error and remounted itself read-only [1].
>> btrfs fi show output [2], btrfs df output [3].
>>
>> I unmounted the array, and when I tried to mount it again, it locked up
>> the whole system so even alt+sysrq would not work. I rebooted, tried to
>> mount again, same lockup. This was all kernel 4.5.7.
>>
>> I rebooted to kernel 4.4.0, tried to mount, crashed again, this time a
>> message appeared on the screen and I took a picture [4].
>>
>> I rebooted into an arch live system with kernel 4.7.2, tried to mount
>> again, got some dmesg output before it crashed [5] and took a picture
>> when it crashed [6], says in part 'BUG: unable to handle kernel NULL
>> pointer dereference at 01f0'.
>>
>> Is there anything I can do to get this in a working state again or
>> perhaps even recover some data?
>>
>> Thanks much for any help
>>
>> [1]: https://www.moparisthebest.com/btrfs/initial_crash.txt
>> [2]: https://www.moparisthebest.com/btrfs/btrfsfishow.txt
>> [3]: https://www.moparisthebest.com/btrfs/btrfsdf.txt
>> [4]: https://www.moparisthebest.com/btrfsoops.jpg
>> [5]: https://www.moparisthebest.com/btrfs/dmsgprecrash.txt
>> [6]: https://www.moparisthebest.com/btrfsnulldereference.jpg
>
> The output from btrfs fi show and fi df both indicate that the
> filesystem is essentially completely full. You've gotten to the point
> where you're using the global metadata reserve, and I think things are
> getting stuck trying (and failing) to reclaim the space that's used
> there. The fact that the kernel is crashing in response to this is
> concerning, but it isn't surprising as this is not something that's
> really all that tested, and is very much not a normal operational
> scenario. I'm guessing that the error you hit that forced the
> filesystem read-only is something that requires recovery, which in turn
> requires copy-on-write updates of some of the metadata, which you have
> essentially zero room for, and that's what's causing the kernel to choke
> when trying to mount the filesystem.
>
> Given that the FS is pretty much wedged, I think your best bet for
> fixing this is probably going to be to use btrfs restore to get the data
> onto a new (larger) set of disks. If you do take this approach, a
> metadata dump might be useful, if somebody could find enough room to
> extract it.
>
> Alternatively, because of the small amount of free space on the largest
> device in the array, you _might_ be able to fix things if you can get it
> mounted read-write by running a balance converting both data and
> metadata to single profiles, adding a few more disks (or replacing some
> with bigger ones), and then converting back to raid1 profiles. This is
> exponentially more risky than just restoring to a new filesystem, and
> will almost certainly take longer.
>
> A couple of other things to comment about on this:
> 1. 'can_overcommit' (the function that the Arch kernel choked on) is
> from the memory management subsystem. The fact that that's throwing a
> null pointer says to me either your hardware has issues, or the Arch
> kernel itself has problems (which would probably mean the kernel image
> is corrupted).
> 2. You may want to look for more symmetrically sized disks if you're
> going to be using raid1 mode. The space that's free on the last listed
> disk in the filesystem is unusable in raid1 mode because there are no
> other disks with usable space.
> 3. In general, it's a good idea to keep an eye on space usage on your
> filesystems. If it's getting to be more than about 95% full, you should
> be looking at getting some more storage space. This is especially true
> for BTRFS, as a 100% full BTRFS filesystem functionally becomes
> permanently read-only because there's nowhere for the copy-on-write
> updates to write to.

If I read btrfs fi show right, it's got a minimum of ~600gb free on each one
of the 8 drives. Shouldn't that be more than enough for most things? (I
guess unless I have single files over 600gb that need COW'd, which I don't.)

Didn't ubuntu on kernel 4.4 die in the same can_overcommit function?
(https://www.moparisthebest.com/btrfsoops.jpg) What kind of hardware issues
would cause a repeatable kernel crash like that? Like am I looking at memory
issues or the SAS controller or what?

Thanks!
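To make point 2 in the quoted reply concrete: raid1 keeps two copies of every chunk on two different devices, so unallocated space on one device is only usable if another device has matching room. The Python sketch below uses the usual back-of-the-envelope estimate, min(total / 2, total - largest). It ignores chunk granularity and metadata overhead, the function name is hypothetical, and the numbers in the usage note are made up rather than taken from the btrfs fi show output linked in this thread.

```python
def raid1_usable(unallocated):
    """Estimate how much new data fits in raid1 (two copies).

    `unallocated` is a list of per-device unallocated space, all in the
    same unit. This is a rough estimate, not what the kernel computes.
    """
    if len(unallocated) < 2:
        return 0  # raid1 needs at least two devices with room
    total = sum(unallocated)
    largest = max(unallocated)
    # Each copy must land on a different device, so the largest device
    # can only be paired with space found on all the others combined.
    return min(total / 2, total - largest)
```

For example, raid1_usable([0, 0, 0, 500]) gives 0 (the lone free device has nothing to pair with), while eight devices with 600 each give 2400.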
Re: btrfs kernel oops on mount
On 09/09/2016 01:51 PM, Chris Murphy wrote:
> On Fri, Sep 9, 2016 at 10:12 AM, moparisthebest wrote:
>> Hi,
>>
>> I'm hoping to get some help with mounting my btrfs array which quit
>> working yesterday. My array was in the middle of a balance, about 50%
>> remaining, when it hit an error and remounted itself read-only [1].
>> btrfs fi show output [2], btrfs df output [3].
>>
>> I unmounted the array, and when I tried to mount it again, it locked up
>> the whole system so even alt+sysrq would not work. I rebooted, tried to
>> mount again, same lockup. This was all kernel 4.5.7.
>>
>> I rebooted to kernel 4.4.0, tried to mount, crashed again, this time a
>> message appeared on the screen and I took a picture [4].
>>
>> I rebooted into an arch live system with kernel 4.7.2, tried to mount
>> again, got some dmesg output before it crashed [5] and took a picture
>> when it crashed [6], says in part 'BUG: unable to handle kernel NULL
>> pointer dereference at 01f0'.
>>
>> Is there anything I can do to get this in a working state again or
>> perhaps even recover some data?
>>
>> Thanks much for any help
>>
>> [1]: https://www.moparisthebest.com/btrfs/initial_crash.txt
>> [2]: https://www.moparisthebest.com/btrfs/btrfsfishow.txt
>> [3]: https://www.moparisthebest.com/btrfs/btrfsdf.txt
>> [4]: https://www.moparisthebest.com/btrfsoops.jpg
>> [5]: https://www.moparisthebest.com/btrfs/dmsgprecrash.txt
>> [6]: https://www.moparisthebest.com/btrfsnulldereference.jpg
>
> Good report. Try on the 4.7.2 kernel system, two consoles, have one
> ready with 'echo w > /proc/sysrq-trigger' as root (sudo doesn't work)
> but don't issue it, mount in the other console and then switch back
> and issue the sysrq. It'll take a while, minutes maybe even to switch
> consoles, and then also for the command itself to issue, and then
> minutes before the result actually gets committed to systemd journal
> or /var/log/messages. If it's a systemd system, and if you have to
> force reboot to regain control, you can get the sysrq with 'journalctl
> -b-1 -k > outputfile.txt'
>
> Also btrfs check output is useful to include also (without --repair
> for starters).
>
> The thing that concerns me is this occasional problem that comes up
> sometimes with lzo compressed volumes. Duncan knows more about that
> one so he may chime in. I would definitely only do default mounts for
> the above, don't include the compression option. You could also try -o
> ro,recovery and see where that gets you.

This is indeed an lzo compressed system; it's always been mounted with that
option anyhow.

btrfs check has been running for ~6 hours so far. I'll follow up with the
output when it finishes.

Hmm, the problem with the 4.7.2/systemd system is that it's a live usb
system, so the log/journal wouldn't be saved anywhere except tmpfs. I'll see
what I can rig up, unless someone has any amazing ideas? I'm still brand new
to systemd...

Thanks!
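For anyone who would rather script the sysrq step than type it, writing the single character 'w' to /proc/sysrq-trigger is all that 'echo w > /proc/sysrq-trigger' does. Below is a minimal Python sketch; it assumes it runs as root on a kernel with sysrq enabled, and the function name is hypothetical. Nothing in it triggers the mount itself.

```python
from pathlib import Path


def request_blocked_task_dump(trigger="/proc/sysrq-trigger"):
    """Ask the kernel to log blocked (uninterruptible) tasks.

    Equivalent to `echo w > /proc/sysrq-trigger`; must run as root.
    The `trigger` parameter exists only so the path can be overridden.
    """
    # 'w' is the sysrq command key for dumping blocked tasks.
    Path(trigger).write_text("w")
```

The dump still lands in the kernel log, so it has to be read back with dmesg, journalctl -k, or netconsole exactly as described above.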
btrfs kernel oops on mount
Hi,

I'm hoping to get some help with mounting my btrfs array which quit
working yesterday. My array was in the middle of a balance, about 50%
remaining, when it hit an error and remounted itself read-only [1].
btrfs fi show output [2], btrfs df output [3].

I unmounted the array, and when I tried to mount it again, it locked up
the whole system so even alt+sysrq would not work. I rebooted, tried to
mount again, same lockup. This was all kernel 4.5.7.

I rebooted to kernel 4.4.0, tried to mount, crashed again, this time a
message appeared on the screen and I took a picture [4].

I rebooted into an arch live system with kernel 4.7.2, tried to mount
again, got some dmesg output before it crashed [5] and took a picture
when it crashed [6], says in part 'BUG: unable to handle kernel NULL
pointer dereference at 01f0'.

Is there anything I can do to get this in a working state again or
perhaps even recover some data?

Thanks much for any help

[1]: https://www.moparisthebest.com/btrfs/initial_crash.txt
[2]: https://www.moparisthebest.com/btrfs/btrfsfishow.txt
[3]: https://www.moparisthebest.com/btrfs/btrfsdf.txt
[4]: https://www.moparisthebest.com/btrfsoops.jpg
[5]: https://www.moparisthebest.com/btrfs/dmsgprecrash.txt
[6]: https://www.moparisthebest.com/btrfsnulldereference.jpg
abort device removal?
Hello all,

I had a 6-device array. Last night I added a 4tb device to it and, overnight,
ran the command to remove an existing 4tb device that still worked fine.
Unfortunately, one of the OTHER devices completely failed while this was
happening. It *looks* like btrfs did the right thing and stopped the move,
except the device being removed is still marked as 0 size in btrfs fi show.
The delete command is still running, though iotop shows it's not actually
reading or writing anything, and the lack of further moving messages in
dmesg/kern.log seems to indicate that too.

So what I think I *need* to do is re-add the drive it's currently trying to
remove, so I can delete the now non-functioning other drive without losing
any data. My fear is that a reboot or unmount/remount will fail to mount the
currently-being-removed drive as well, causing me to lose everything.

Here is some relevant info from the system:

# uname -a
Linux mytorrentflux1 3.13.0-40-generic #69-Ubuntu SMP Thu Nov 13 17:53:56 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux

# btrfs --version
Btrfs v3.17.3

# btrfs fi show
Label: 'completed' uuid: 0d14bb0f-46cc-408e-9245-f06d50ec2da8
	Total devices 7 FS bytes used 7.60TiB
	devid 1 size 3.64TiB used 3.28TiB path /dev/mapper/fourtb1
	devid 2 size 3.64TiB used 3.29TiB path /dev/mapper/fourtb2
	devid 3 size 2.73TiB used 2.37TiB path /dev/mapper/threetb1
	devid 5 size 1.82TiB used 1.82TiB path /dev/mapper/twotb1
	devid 6 size 0.00B used 1.99TiB path /dev/mapper/fourtb3
	devid 7 size 2.73TiB used 2.22TiB path /dev/mapper/threetb2
	devid 8 size 3.64TiB used 240.29GiB path /dev/mapper/fourtb4

Btrfs v3.17.3

# btrfs fi df /mnt/completed/
Data, RAID10: total=6.26TiB, used=6.26TiB
Data, RAID1: total=1.33TiB, used=1.33TiB
System, RAID10: total=96.00MiB, used=852.00KiB
Metadata, RAID10: total=10.77GiB, used=9.90GiB
Metadata, RAID1: total=5.00GiB, used=4.37GiB

fourtb4 is the new drive I just added. fourtb3 is the functioning drive I
attempted to remove before threetb1 completely failed (smartctl can't even
read anything from it, well, from the underlying device).

dmesg/kern.log is too large to attach; here are some important-looking
excerpts (3 lines often repeated):

Dec 5 09:59:35 mytorrentflux1 kernel: [1549876.646751] btrfs: bdev /dev/mapper/threetb1 errs: wr 17599, rd 973, flush 0, corrupt 0, gen 0
Dec 5 09:59:35 mytorrentflux1 kernel: [1549877.022291] lost page write due to I/O error on /dev/mapper/threetb1
Dec 5 10:07:08 mytorrentflux1 kernel: [1550329.743294] btrfs_dev_stat_print_on_error: 264 callbacks suppressed

I appreciate any help or guidance I can get on this issue so I don't lose
data, hopefully.

Thanks much!
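Since the log is too large to attach, one way to summarise it is to pull the per-device error counters out of the "errs:" lines. The Python sketch below is written against the exact line format quoted above; the regular expression and the function name are illustrative assumptions, and other kernel versions may word the message differently.

```python
import re

# Matches lines such as:
#   btrfs: bdev /dev/mapper/threetb1 errs: wr 17599, rd 973, flush 0, corrupt 0, gen 0
DEV_STAT_RE = re.compile(
    r"bdev (?P<dev>\S+) errs: wr (?P<wr>\d+), rd (?P<rd>\d+), "
    r"flush (?P<flush>\d+), corrupt (?P<corrupt>\d+), gen (?P<gen>\d+)"
)


def parse_dev_stats(line):
    """Return (device, counters) for a btrfs 'errs:' log line, else None."""
    match = DEV_STAT_RE.search(line)
    if match is None:
        return None
    # Convert every counter to int; the device path stays a string.
    counters = {
        name: int(value)
        for name, value in match.groupdict().items()
        if name != "dev"
    }
    return match.group("dev"), counters
```

Keeping only the last match per device from kern.log would give the most recent counters for each drive.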