On 9/30/16 2:54 PM, Rich Freeman wrote:
> On Thu, Sep 22, 2016 at 1:41 PM, Jeff Mahoney <je...@suse.com> wrote:
>> On 9/22/16 8:18 AM, Rich Freeman wrote:
>>> I have been getting panics consistently after doing a btrfs replace
>>> operation on a raid1 and rebooting.  I linked a photo of the panic; I
>>> haven't been able to get a text capture of it.
>>>
>>> https://ibin.co/2vx0HhDeViu3.jpg
>>>
>>> I'm getting this error on the latest 4.4, 4.1, and even on an old
>>> 3.18.26 kernel I had lying around.
>>>
>>> I tried the remove root_log_ctx from ctx list before btrfs_sync_log
>>> returns patch on 4.1 and that did not solve my problem either.
>>>
>>> I'm able to boot into single-user mode and if I don't start any
>>> processes the system seems fairly stable.  I am also able to start a
>>> btrfs balance and run that for several hours without issue.  If I
>>> start launching services the system will tend to panic, though how
>>> many processes I can launch will vary.  I don't think that it is a
>>> particular file being accessed that is triggering the issue since the
>>> point where it fails varies.  I suspect it may be load-related.
>>>
>>> Mounting with compress=no doesn't seem to help either.  Granted, I see
>>> lzo_decompress in the backtrace and that is probably a read operation.
>>>
>>> Any suggestions?  Google hasn't been helpful on this one...
>>
>> Can you boot with panic_on_oops=1, reproduce it, and capture that Oops?
>> The trace in your photo is a secondary Oops (tainted D), which means
>> that something else went wrong before that and now the system is
>> tripping over it.  Secondary Oopses don't really help the debugging
>> process because the system was already in a broken, undefined, state.
>>
> 
> Ok, the system has been up for a week without issue, but just paniced
> and rebooted right towards the end of a balance (it literally had
> about 30 of 2500 chunks left).
> 
> After it came up (and waiting for it to fully mount as there were a
> bunch of free space warnings/etc) I managed to capture an initial oops
> when it happened again:
> 
> https://ibin.co/2wt0n2IaCOA3.jpg
> 
> This is on a system without swap, though my understanding is that the
> paging system is used for other things.

It's not paging in the way Microsoft uses the term.  In this context, it
just means that the kernel tried to resolve a virtual address and
failed.  When the address is < PAGE_SIZE, it prints the NULL pointer
dereference message instead.  It's literally the same code otherwise.
Short version: it's the same thing as a segfault in userspace code.

This looks like a use-after-free on one of the pages used for
compression.  Can you post the output of objdump -Dr
/lib/modules/$(uname -r)/kernel/fs/btrfs/btrfs.ko somewhere?

-Jeff

-- 
Jeff Mahoney
SUSE Labs

Attachment: signature.asc
Description: OpenPGP digital signature

Reply via email to