Re: btrfs kernel oops on mount

2016-09-12 Thread moparisthebest
On 09/12/2016 07:37 AM, Austin S. Hemmelgarn wrote:
> On 2016-09-09 15:23, moparisthebest wrote:
>> Didn't ubuntu on kernel 4.4 die in the same can_overcommit function?
>> (https://www.moparisthebest.com/btrfsoops.jpg) what kind of hardware
>> issues would cause a repeatable kernel crash like that?  Like am I
>> looking at memory issues or the SAS controller or what?
> It doesn't look like it died in can_overcommit, as that's not anywhere
> on the stack trace.  The second item on the stack though
> (btrfs_async_reclaim_metadata_space) at least partly reinforces the
> suspicion that something is messed up in the filesystem's metadata (which
> could explain the allocations in GlobalReserve, which is a subset of the
> Metadata chunks).  It looks like each crash was in a different place,
> but at least the first two could easily be different parts of the kernel
> choking on the same thing.  As far as the crash in can_overcommit, that
> combined with the apparent corrupted metadata makes me think there may
> be a hardware problem.  The first thing I'd check in that respect is the
> cabling to the drives themselves, followed by system RAM, the PSU, and
> the storage controller.  I generally check in that order because
> it's trivial to check the cabling, and not all that difficult to check
> the RAM and PSU (and RAM is more likely to go bad than the PSU), and
> properly checking a storage controller is extremely difficult unless you
> have another known working one you can swap it for (and even then, it's
> only practical to check if you know the state on disk won't cause the
> kernel to choke).

The first RIP: line (https://www.moparisthebest.com/btrfsoops.jpg) ends
in 'can_overcommit+0x1e/0xf0 [btrfs]'.  Apologies for that being a
literal picture of a CRT instead of a searchable text file; it doesn't
exactly make things easy... :(

Still I'm relieved that more points to bad metadata than to bad hardware.
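
For reference, here is a minimal Python sketch of how the symbol at the
end of that RIP: line breaks down.  It assumes the usual
'function+0xOFFSET/0xSIZE [module]' layout; parse_rip_symbol is a
hypothetical helper name, not part of any kernel tool.  The trailing
[btrfs] names the module the function lives in.

```python
import re

# Hypothetical helper: splits a kernel symbol of the form
# 'name+0xOFFSET/0xSIZE [module]' as seen on an oops RIP: line.
_SYMBOL_RE = re.compile(
    r"(?P<func>[A-Za-z_][\w.]*)\+0x(?P<offset>[0-9a-fA-F]+)/0x(?P<size>[0-9a-fA-F]+)"
    r"(?:\s+\[(?P<module>[^\]]+)\])?"
)


def parse_rip_symbol(text):
    """Return the function, offset, size and module found in text, or None."""
    m = _SYMBOL_RE.search(text)
    if m is None:
        return None
    return {
        "function": m.group("func"),
        "offset": int(m.group("offset"), 16),  # bytes into the function
        "size": int(m.group("size"), 16),      # total size of the function
        "module": m.group("module"),           # None if no [module] suffix is present
    }


info = parse_rip_symbol("can_overcommit+0x1e/0xf0 [btrfs]")
print(info)
```
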
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: btrfs kernel oops on mount

2016-09-10 Thread moparisthebest
On 09/09/2016 03:28 PM, Chris Murphy wrote:
> Sounds like with enospc devs want to see a couple things beyond what I
> asked for:
> 
> enospc_debug
> grep -IR . /sys/fs/btrfs/UUID/allocation/
> 
> That's kinda hard to do right now if it's not mounting though...

I managed to get more output from arch/4.7.2 using netconsole.  I did end
up with duplicate lines somehow, which uniq fixed, but some of the crash
is mixed together on the same line.  I didn't mess with that for fear of
taking out something important:

https://www.moparisthebest.com/btrfs/archnetconsole.txt
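
The duplicate-line cleanup can be sketched in Python roughly as below.
This is only an illustration of what uniq does by default (dropping a
line when it is identical to the line directly before it);
drop_adjacent_duplicates and the sample lines are hypothetical and are
not taken from the actual netconsole capture.

```python
def drop_adjacent_duplicates(lines):
    """Yield lines, skipping any line identical to the one just before it."""
    previous = object()  # sentinel that never equals a real line
    for line in lines:
        if line != previous:
            yield line
        previous = line


# Made-up sample input, only to show the behaviour.
sample = ["line A", "line A", "line B", "line A"]
for line in drop_adjacent_duplicates(sample):
    print(line)
```

Like uniq, this keeps a repeated line if something else appeared in
between, so lines that are interleaved or mixed together are left alone.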

I was also able to mount it ro so I ran the grep you asked for:

https://www.moparisthebest.com/btrfs/enospcdebug.txt
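
For anyone who wants the same dump without grep, a rough Python
equivalent is sketched below.  It assumes every file under the
directory is a small text attribute; dump_tree is a hypothetical helper
name, and UUID is the same placeholder used in the command quoted above.

```python
import os


def dump_tree(root):
    """Return 'path:line' strings for each non-empty line of each readable file under root."""
    out = []
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # walk subdirectories in a stable order
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            try:
                with open(path, "r", errors="replace") as fh:
                    for line in fh:
                        line = line.rstrip("\n")
                        if line:
                            out.append(f"{path}:{line}")
            except OSError:
                continue  # skip attributes that cannot be read
    return out


# Replace UUID with the filesystem's UUID; prints nothing if the path does not exist.
for entry in dump_tree("/sys/fs/btrfs/UUID/allocation"):
    print(entry)
```
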

I tried mounting with mount -o rw,skip_balance and it still locked up,
so for now it's read-only...

Let me know what else I can provide or try.  I haven't been able to boot
with a 4.7 kernel on my ubuntu install so I figure 4.8 will be the same,
I guess I'll need to permanently install something like arch to a flash
drive and try 4.8 from there.

Thanks!



Re: btrfs kernel oops on mount

2016-09-10 Thread moparisthebest
On 09/09/2016 01:51 PM, Chris Murphy wrote:
> btrfs check output is also useful to include (without --repair
> for starters).

btrfs check --readonly output is here:

https://www.moparisthebest.com/btrfs/btrfscheck.txt

*Most* of it anyway, I messed up with tmux and it took 20 hours to run
so I don't really want to run it again unless you need me to.  Now that
check is over I'll try the other things suggested.


Re: btrfs kernel oops on mount

2016-09-09 Thread moparisthebest
On 09/09/2016 02:47 PM, Austin S. Hemmelgarn wrote:
> On 2016-09-09 12:12, moparisthebest wrote:
>> Hi,
>>
>> I'm hoping to get some help with mounting my btrfs array which quit
>> working yesterday.  My array was in the middle of a balance, about 50%
>> remaining, when it hit an error and remounted itself read-only [1].
>> btrfs fi show output [2], btrfs fi df output [3].
>>
>> I unmounted the array, and when I tried to mount it again, it locked up
>> the whole system so even alt+sysrq would not work.  I rebooted, tried to
>> mount again, same lockup.  This was all kernel 4.5.7.
>>
>> I rebooted to kernel 4.4.0, tried to mount, crashed again, this time a
>> message appeared on the screen and I took a picture [4].
>>
>> I rebooted into an arch live system with kernel 4.7.2, tried to mount
>> again, got some dmesg output before it crashed [5] and took a picture
>> when it crashed [6], says in part 'BUG: unable to handle kernel NULL
>> pointer dereference at 01f0'.
>>
>> Is there anything I can do to get this in a working state again or
>> perhaps even recover some data?
>>
>> Thanks much for any help
>>
>> [1]: https://www.moparisthebest.com/btrfs/initial_crash.txt
>> [2]: https://www.moparisthebest.com/btrfs/btrfsfishow.txt
>> [3]: https://www.moparisthebest.com/btrfs/btrfsdf.txt
>> [4]: https://www.moparisthebest.com/btrfsoops.jpg
>> [5]: https://www.moparisthebest.com/btrfs/dmsgprecrash.txt
>> [6]: https://www.moparisthebest.com/btrfsnulldereference.jpg
> 
> The output from btrfs fi show and fi df both indicate that the
> filesystem is essentially completely full.  You've gotten to the point
> where you're using the global metadata reserve, and I think things are
> getting stuck trying (and failing) to reclaim the space that's used
> there.  The fact that the kernel is crashing in response to this is
> concerning, but it isn't surprising as this is not something that's
> really all that tested, and is very much not a normal operational
> scenario.  I'm guessing that the error you hit that forced the
> filesystem read-only is something that requires recovery, which in turn
> requires copy-on-write updates of some of the metadata, which you have
> essentially zero room for, and that's what's causing the kernel to choke
> when trying to mount the filesystem.
> 
> Given that the FS is pretty much wedged, I think your best bet for
> fixing this is probably going to be to use btrfs restore to get the data
> onto a new (larger) set of disks.  If you do take this approach, a
> metadata dump might be useful, if somebody could find enough room to
> extract it.
> 
> Alternatively, because of the small amount of free space on the largest
> device in the array, you _might_ be able to fix things if you can get it
> mounted read-write by running a balance converting both data and
> metadata to single profiles, adding a few more disks (or replacing some
> with bigger ones), and then converting back to raid1 profiles.  This is
> exponentially more risky than just restoring to a new filesystem, and
> will almost certainly take longer.
> 
> A couple of other things to comment about on this:
> 1. 'can_overcommit' (the function that the Arch kernel choked on) is
> from the memory management subsystem.  The fact that that's throwing a
> null pointer says to me either your hardware has issues, or the Arch
> kernel itself has problems (which would probably mean the kernel image
> is corrupted).
> 2. You may want to look for more symmetrically sized disks if you're
> going to be using raid1 mode.  The space that's free on the last listed
> disk in the filesystem is unusable in raid1 mode because there are no
> other disks with usable space.
> 3. In general, it's a good idea to keep an eye on space usage on your
> filesystems.  If it's getting to be more than about 95% full, you should
> be looking at getting some more storage space.  This is especially true
> for BTRFS, as a 100% full BTRFS filesystem functionally becomes
> permanently read-only because there's nowhere for the copy-on-write
> updates to write to.

If I read btrfs fi show right, it's got minimum ~600gb free on each one
of the 8 drives, shouldn't that be more than enough for most things?  (I
guess unless I have single files over 600gb that need COW'd, I don't though)
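
As a rough way to reason about the free-space question, here is a
minimal Python sketch.  It assumes raid1 keeps two copies of every
chunk on two different devices, so free space only counts if it can be
paired with free space on another device.  The name raid1_usable and
the sample numbers are hypothetical; they are not taken from the linked
btrfs fi show output.

```python
def raid1_usable(unallocated):
    """Estimate how much new data fits in raid1 given per-device unallocated space.

    Every chunk needs room on two different devices, so the estimate is
    capped both by half the total and by what the remaining devices can
    pair with the largest one.
    """
    if len(unallocated) < 2:
        return 0
    total = sum(unallocated)
    return min(total / 2, total - max(unallocated))


# Hypothetical figures in GB, not the real array:
print(raid1_usable([600] * 8))     # eight drives with 600 unallocated each
print(raid1_usable([1000, 0, 0]))  # unallocated space on only one drive
```

The second case comes out as zero: space that exists on only one drive
has nothing to mirror against.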

Didn't ubuntu on kernel 4.4 die in the same can_overcommit function?
(https://www.moparisthebest.com/btrfsoops.jpg) what kind of hardware
issues would cause a repeatable kernel crash like that?  Like am I
looking at memory issues or the SAS controller or what?

Thanks!


Re: btrfs kernel oops on mount

2016-09-09 Thread moparisthebest
On 09/09/2016 01:51 PM, Chris Murphy wrote:
> On Fri, Sep 9, 2016 at 10:12 AM, moparisthebest wrote:
>> Hi,
>>
>> I'm hoping to get some help with mounting my btrfs array which quit
>> working yesterday.  My array was in the middle of a balance, about 50%
>> remaining, when it hit an error and remounted itself read-only [1].
>> btrfs fi show output [2], btrfs fi df output [3].
>>
>> I unmounted the array, and when I tried to mount it again, it locked up
>> the whole system so even alt+sysrq would not work.  I rebooted, tried to
>> mount again, same lockup.  This was all kernel 4.5.7.
>>
>> I rebooted to kernel 4.4.0, tried to mount, crashed again, this time a
>> message appeared on the screen and I took a picture [4].
>>
>> I rebooted into an arch live system with kernel 4.7.2, tried to mount
>> again, got some dmesg output before it crashed [5] and took a picture
>> when it crashed [6], says in part 'BUG: unable to handle kernel NULL
>> pointer dereference at 01f0'.
>>
>> Is there anything I can do to get this in a working state again or
>> perhaps even recover some data?
>>
>> Thanks much for any help
>>
>> [1]: https://www.moparisthebest.com/btrfs/initial_crash.txt
>> [2]: https://www.moparisthebest.com/btrfs/btrfsfishow.txt
>> [3]: https://www.moparisthebest.com/btrfs/btrfsdf.txt
>> [4]: https://www.moparisthebest.com/btrfsoops.jpg
>> [5]: https://www.moparisthebest.com/btrfs/dmsgprecrash.txt
>> [6]: https://www.moparisthebest.com/btrfsnulldereference.jpg
> 
> Good report. Try on the 4.7.2 kernel system, two consoles, have one
> ready with 'echo w > /proc/sysrq-trigger' as root (sudo doesn't work)
> but don't issue it, mount in the other console and then switch back
> and issue the sysrq. It'll take a while, minutes maybe even to switch
> consoles, and then also for the command itself to issue, and then
> minutes before the result actually gets committed to systemd journal
> or /var/log/messages. If it's a systemd system, and if you have to
> force reboot to regain control, you can get the sysrq with 'journalctl
> -b-1 -k > outputfile.txt'
> 
> btrfs check output is also useful to include (without --repair
> for starters).
> 
> The thing that concerns me is this problem that occasionally comes up
> with lzo compressed volumes. Duncan knows more about that
> one so he may chime in. I would definitely only do default mounts for
> the above, don't include the compression option. You could also try -o
> ro,recovery and see where that gets you.
> 
> 

This is indeed an lzo compressed system, it's always been mounted with
that option anyhow.

btrfs check has been running for ~6 hours so far, I'll follow up with
output on that when it finishes.

Hmm, the problem with the 4.7.2/systemd system is it's a live usb system
so the log/journal wouldn't be saved anywhere except tmpfs, I'll see
what I can rig up unless someone has any amazing ideas?  I'm still brand
new to systemd...

Thanks!


btrfs kernel oops on mount

2016-09-09 Thread moparisthebest
Hi,

I'm hoping to get some help with mounting my btrfs array which quit
working yesterday.  My array was in the middle of a balance, about 50%
remaining, when it hit an error and remounted itself read-only [1].
btrfs fi show output [2], btrfs fi df output [3].

I unmounted the array, and when I tried to mount it again, it locked up
the whole system so even alt+sysrq would not work.  I rebooted, tried to
mount again, same lockup.  This was all kernel 4.5.7.

I rebooted to kernel 4.4.0, tried to mount, crashed again, this time a
message appeared on the screen and I took a picture [4].

I rebooted into an arch live system with kernel 4.7.2, tried to mount
again, got some dmesg output before it crashed [5] and took a picture
when it crashed [6], says in part 'BUG: unable to handle kernel NULL
pointer dereference at 01f0'.

Is there anything I can do to get this in a working state again or
perhaps even recover some data?

Thanks much for any help

[1]: https://www.moparisthebest.com/btrfs/initial_crash.txt
[2]: https://www.moparisthebest.com/btrfs/btrfsfishow.txt
[3]: https://www.moparisthebest.com/btrfs/btrfsdf.txt
[4]: https://www.moparisthebest.com/btrfsoops.jpg
[5]: https://www.moparisthebest.com/btrfs/dmsgprecrash.txt
[6]: https://www.moparisthebest.com/btrfsnulldereference.jpg


abort device removal?

2014-12-05 Thread moparisthebest
Hello all,

I had a 6-device array.  Last night I added a 4tb device to it and,
overnight, ran the command to remove a previous 4tb device that still
worked fine.  Unfortunately, one of the OTHER devices completely failed
while this was happening, and it *looks* like btrfs did the right thing
and stopped the move, except the device being removed is still marked as
size 0 in btrfs fi show.  The delete command is still running, though
iotop shows it's not actually reading or writing anything, and the lack
of further moving messages in dmesg/kern.log seems to indicate that too.

So what I think I *need* to do is re-add the drive it's currently trying
to remove so I can delete the now non-functioning other drive without
losing any data.  My fear is a reboot or unmount/remount will fail to
mount the currently-being-removed drive as well causing me to lose
everything.

Here is some relevant info from the system:
# uname -a
Linux mytorrentflux1 3.13.0-40-generic #69-Ubuntu SMP Thu Nov 13
17:53:56 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
# btrfs --version
Btrfs v3.17.3
# btrfs fi show
Label: 'completed'  uuid: 0d14bb0f-46cc-408e-9245-f06d50ec2da8
Total devices 7 FS bytes used 7.60TiB
devid    1 size 3.64TiB used 3.28TiB path /dev/mapper/fourtb1
devid    2 size 3.64TiB used 3.29TiB path /dev/mapper/fourtb2
devid    3 size 2.73TiB used 2.37TiB path /dev/mapper/threetb1
devid    5 size 1.82TiB used 1.82TiB path /dev/mapper/twotb1
devid    6 size 0.00B used 1.99TiB path /dev/mapper/fourtb3
devid    7 size 2.73TiB used 2.22TiB path /dev/mapper/threetb2
devid    8 size 3.64TiB used 240.29GiB path /dev/mapper/fourtb4

Btrfs v3.17.3
# btrfs fi df /mnt/completed/
Data, RAID10: total=6.26TiB, used=6.26TiB
Data, RAID1: total=1.33TiB, used=1.33TiB
System, RAID10: total=96.00MiB, used=852.00KiB
Metadata, RAID10: total=10.77GiB, used=9.90GiB
Metadata, RAID1: total=5.00GiB, used=4.37GiB

fourtb4 is the new drive I just added, fourtb3 is the functioning drive
I attempted to remove before threetb1 completely failed (smartctl can't
even read anything from it, well, from the underlying device)
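
To make the odd entry easier to spot, here is a small Python sketch
that parses the devid lines from the btrfs fi show output above and
flags any device whose reported size is smaller than its used space.
The helper names (parse_devices, shrunk_devices) are hypothetical, and
treating "size below used" as "removal in progress" is an assumption
based on what this output shows, not documented behaviour.

```python
import re

_UNITS = {"B": 1, "KiB": 1024, "MiB": 1024**2, "GiB": 1024**3, "TiB": 1024**4}

_DEVID_RE = re.compile(
    r"devid\s*(?P<id>\d+)\s+size\s+(?P<size>[\d.]+)(?P<size_unit>[KMGT]iB|B)"
    r"\s+used\s+(?P<used>[\d.]+)(?P<used_unit>[KMGT]iB|B)\s+path\s+(?P<path>\S+)"
)


def parse_devices(text):
    """Return one dict per devid line with sizes converted to bytes."""
    devices = []
    for m in _DEVID_RE.finditer(text):
        devices.append({
            "devid": int(m.group("id")),
            "size": float(m.group("size")) * _UNITS[m.group("size_unit")],
            "used": float(m.group("used")) * _UNITS[m.group("used_unit")],
            "path": m.group("path"),
        })
    return devices


def shrunk_devices(devices):
    """Paths of devices reporting less total size than used space (assumed mid-removal)."""
    return [d["path"] for d in devices if d["size"] < d["used"]]


FI_SHOW = """\
devid    1 size 3.64TiB used 3.28TiB path /dev/mapper/fourtb1
devid    2 size 3.64TiB used 3.29TiB path /dev/mapper/fourtb2
devid    3 size 2.73TiB used 2.37TiB path /dev/mapper/threetb1
devid    5 size 1.82TiB used 1.82TiB path /dev/mapper/twotb1
devid    6 size 0.00B used 1.99TiB path /dev/mapper/fourtb3
devid    7 size 2.73TiB used 2.22TiB path /dev/mapper/threetb2
devid    8 size 3.64TiB used 240.29GiB path /dev/mapper/fourtb4
"""

devices = parse_devices(FI_SHOW)
print(shrunk_devices(devices))
```
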

dmesg/kern.log is too large to attach, here are some important-looking
excerpts (3 lines often repeated):
Dec  5 09:59:35 mytorrentflux1 kernel: [1549876.646751] btrfs: bdev
/dev/mapper/threetb1 errs: wr 17599, rd 973, flush 0, corrupt 0, gen 0
Dec  5 09:59:35 mytorrentflux1 kernel: [1549877.022291] lost page write
due to I/O error on /dev/mapper/threetb1
Dec  5 10:07:08 mytorrentflux1 kernel: [1550329.743294]
btrfs_dev_stat_print_on_error: 264 callbacks suppressed
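
Since those lines repeat a lot, a short Python sketch like the one
below can pull the per-device error counters out of a log excerpt.  It
assumes the 'errs: wr N, rd N, flush N, corrupt N, gen N' layout shown
above; parse_dev_errs is a hypothetical helper name.

```python
import re

_ERRS_RE = re.compile(
    r"bdev\s+(?P<dev>\S+)\s+errs:\s+wr\s+(?P<wr>\d+),\s+rd\s+(?P<rd>\d+),"
    r"\s+flush\s+(?P<flush>\d+),\s+corrupt\s+(?P<corrupt>\d+),\s+gen\s+(?P<gen>\d+)"
)


def parse_dev_errs(text):
    """Return the device path and its error counters, or None if no match."""
    m = _ERRS_RE.search(text)
    if m is None:
        return None
    stats = {"device": m.group("dev")}
    for key in ("wr", "rd", "flush", "corrupt", "gen"):
        stats[key] = int(m.group(key))
    return stats


# The first excerpt above, including the line wrap added by the mail client.
line = (
    "Dec  5 09:59:35 mytorrentflux1 kernel: [1549876.646751] btrfs: bdev\n"
    "/dev/mapper/threetb1 errs: wr 17599, rd 973, flush 0, corrupt 0, gen 0"
)
stats = parse_dev_errs(line)
print(stats)
```
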

I appreciate any help or guidance I can get on this issue so I don't
lose data, hopefully.

Thanks much!