On 09/09/2016 02:47 PM, Austin S. Hemmelgarn wrote:
> On 2016-09-09 12:12, moparisthebest wrote:
>> Hi,
>>
>> I'm hoping to get some help with mounting my btrfs array which quit
>> working yesterday.  My array was in the middle of a balance, about 50%
>> remaining, when it hit an error and remounted itself read-only [1].
>> btrfs fi show output [2], btrfs df output [3].
>>
>> I unmounted the array, and when I tried to mount it again, it locked up
>> the whole system so even alt+sysrq would not work.  I rebooted, tried to
>> mount again, same lockup.  This was all kernel 4.5.7.
>>
>> I rebooted to kernel 4.4.0, tried to mount, crashed again, this time a
>> message appeared on the screen and I took a picture [4].
>>
>> I rebooted into an arch live system with kernel 4.7.2, tried to mount
>> again, got some dmesg output before it crashed [5] and took a picture
>> when it crashed [6], says in part 'BUG: unable to handle kernel NULL
>> pointer dereference at 00000000000001f0'.
>>
>> Is there anything I can do to get this in a working state again or
>> perhaps even recover some data?
>>
>> Thanks much for any help
>>
>> [1]: https://www.moparisthebest.com/btrfs/initial_crash.txt
>> [2]: https://www.moparisthebest.com/btrfs/btrfsfishow.txt
>> [3]: https://www.moparisthebest.com/btrfs/btrfsdf.txt
>> [4]: https://www.moparisthebest.com/btrfsoops.jpg
>> [5]: https://www.moparisthebest.com/btrfs/dmsgprecrash.txt
>> [6]: https://www.moparisthebest.com/btrfsnulldereference.jpg
> 
> The output from btrfs fi show and fi df both indicate that the
> filesystem is essentially completely full.  You've gotten to the point
> where you're using the global metadata reserve, and I think things are
> getting stuck trying (and failing) to reclaim the space that's used
> there.  The fact that the kernel is crashing in response to this is
> concerning, but it isn't surprising as this is not something that's
> really all that tested, and is very much not a normal operational
> scenario.  I'm guessing that the error you hit that forced the
> filesystem read-only is something that requires recovery, which in turn
> requires copy-on-write updates of some of the metadata, which you have
> essentially zero room for, and that's what's causing the kernel to choke
> when trying to mount the filesystem.
> 
> Given that the FS is pretty much wedged, I think your best bet for
> fixing this is probably going to be to use btrfs restore to get the data
> onto a new (larger) set of disks.  If you do take this approach, a
> metadata dump might be useful, if somebody could find enough room to
> extract it.
> 
> Alternatively, because of the small amount of free space on the largest
> device in the array, you _might_ be able to fix things if you can get it
> mounted read-write by running a balance converting both data and
> metadata to single profiles, adding a few more disks (or replacing some
> with bigger ones), and then converting back to raid1 profiles.  This is
> considerably riskier than just restoring to a new filesystem, and
> will almost certainly take longer.
> 
> A couple of other things to comment about on this:
> 1. 'can_overcommit' (the function that the Arch kernel choked on) is
> from the memory management subsystem.  The fact that that's throwing a
> null pointer says to me either your hardware has issues, or the Arch
> kernel itself has problems (which would probably mean the kernel image
> is corrupted).
> 2. You may want to look for more symmetrically sized disks if you're
> going to be using raid1 mode.  The space that's free on the last listed
> disk in the filesystem is unusable in raid1 mode because there are no
> other disks with usable space.
> 3. In general, it's a good idea to keep an eye on space usage on your
> filesystems.  If it's getting to be more than about 95% full, you should
> be looking at getting some more storage space.  This is especially true
> for BTRFS, as a 100% full BTRFS filesystem functionally becomes
> permanently read-only because there's nowhere for the copy-on-write
> updates to write to.
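
To illustrate point 2 above, here is a rough sketch of why free space on a
single device does not help in raid1 mode.  It assumes the usual two-copy
raid1 profile, where each chunk needs room on two different devices, and it
ignores chunk granularity, so treat the result as an upper bound.  The
function name and the figures are made up for illustration and are not taken
from the btrfs fi show output in this thread.

```python
def raid1_allocatable(unallocated):
    """Upper bound on new raid1 data given per-device unallocated space.

    Assumes two copies per chunk, each on a different device.
    Units are whatever the caller passes in (GiB in the examples below).
    """
    total = sum(unallocated)
    largest = max(unallocated, default=0)
    # Every chunk needs space on two different devices, so space on the
    # largest device only counts if it can be paired with space elsewhere.
    return min(total // 2, total - largest)


# Hypothetical figures in GiB.
print(raid1_allocatable([0, 0, 0, 600]))  # only one device has room -> 0
print(raid1_allocatable([600] * 8))       # every device has room -> 2400
```

With room on only one device the result is 0, which is the situation point 2
describes; with room spread evenly across devices, half of the total is
usable.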

If I read btrfs fi show right, it has at least ~600 GB free on each of
the 8 drives.  Shouldn't that be more than enough for most things?  (Unless
I have single files over 600 GB that need to be COW'd, which I don't.)

Didn't Ubuntu on kernel 4.4 die in the same can_overcommit function?
(https://www.moparisthebest.com/btrfsoops.jpg)  What kind of hardware
issues would cause a repeatable kernel crash like that?  Am I looking at
memory issues, the SAS controller, or something else?

Thanks!