Re: unable to handle kernel paging request - btrfs

2016-09-30 Thread Jeff Mahoney
On 9/30/16 5:07 PM, Rich Freeman wrote:
> On Fri, Sep 30, 2016 at 4:55 PM, Jeff Mahoney  wrote:
>> This looks like a use-after-free on one of the pages used for
>> compression.  Can you post the output of objdump -Dr
>> /lib/modules/$(uname -r)/kernel/fs/btrfs/btrfs.ko somewhere?
>>
> 
> Sure:
> https://drive.google.com/open?id=0BwUDImviY_gcR3JfT0Z1cUlRVEk
> 
> I was impressed by just how large it was.
> 
> I take it you're going to try to use the offsets in the oops to figure
> out where it went wrong?  I really need to get kernel core dumping
> working on this box...

Yep.  What I think is happening is that we have workspace getting freed
while it's in use.  The faulting address is in vmalloc space and it's
also the first argument to memcpy, which makes it the destination.  In
lzo_decompress_biovec, that means it's the workspace->cbuf.  Beyond that
I'll have to dig a bit more.

It's the same fault that your first photo showed as a secondary Oops,
but that's not always the case.


-- 
Jeff Mahoney
SUSE Labs



signature.asc
Description: OpenPGP digital signature


Re: multi-device btrfs with single data mode and disk failure

2016-09-30 Thread Alexandre Poux
Hello again,

Just a quick question.

I did a full scrub and got no error at all

And a full check that gave me this :

#> btrfs check --check-data-csum -p /dev/sde6

Checking filesystem on /dev/sde6
UUID: 62db560b-a040-4c64-b613-6e7db033dc4d
checking extents [o]
checking free space cache [o]
checking fs roots [.]
checking csums
checking root refs
checking quota groups
Counts for qgroup id: 0/5 are different
our:referenced 7239132803072 referenced compressed 7239132803072
disk:referenced 7238982733824 referenced compressed 7238982733824
diff:referenced 150069248 referenced compressed 150069248
our:exclusive 7239132803072 exclusive compressed 7239132803072
disk:exclusive 7238982733824 exclusive compressed 7238982733824
diff:exclusive 150069248 exclusive compressed 150069248
found 7323422314496 bytes used err is 0
total csum bytes: 7020314688
total tree bytes: 11797741568
total fs tree bytes: 2904932352
total extent tree bytes: 656654336
btree space waste bytes: 1560529439
file data blocks allocated: 297363385454592
 referenced 6628544720896

I'm guessing that's not important, but I found nothing about this
so I don't really know what's about.

Can just confirm that everything seems OK ?

Do you think of an another test I should do before starting to use my
array again ?

Le 29/09/2016 à 14:55, Alexandre Poux a écrit :
> Hi,
>
> I finally did it : patched the kernel and removed the device.
> As expected he did not scream since there was nothing at all on the device.
> Now I'm checking that everything is fine:
> scrub (in read only)
> check (in read only)
> but I think that everything will be OK
> If not, I will rebuild the array from scratch (I did managed to save my
> data)
>
> Thank you both for your guidance.
> I think that a warning should be put in the wiki in order for other user
> to not do the same mistake I did :
> never ever use the single mode
>
> I will try to do it soon
>
> Again thank you
>
> Le 20/09/2016 à 23:15, Chris Murphy a écrit :
>> On Tue, Sep 20, 2016 at 2:18 PM, Alexandre Poux  wrote:
>>> Le 20/09/2016 à 21:46, Chris Murphy a écrit :
 On Tue, Sep 20, 2016 at 1:31 PM, Alexandre Poux  wrote:
> Le 20/09/2016 à 21:11, Chris Murphy a écrit :
>> And no backup? Umm, I'd resolve that sooner than anything else.
> Yeah you are absolutely right, this was a temporary solution which came
> to be not that temporary.
> And I regret it already...
 Well on the bright side, if this were LVM or mdadm linear/concat
 array, the whole thing would be toast because any other file system
 would have lost too much fs metadata on the missing device.

>>  It
>> should be true that it'll tolerate a read only mount indefinitely, but
>> read write? Not sure. This sort of edge case isn't well tested at all
>> seeing as it required changing the kernel to reduce safe guards. So
>> all bets are off the whole thing could become unmountable, not even
>> read only, and then it's a scraping job.
> I'm not that crazy, I tried the patch inside a virtual machine on
> virtual drives...
> And since it's only virtual, it may not work on the real partition...
 Are you sure the virtual setup lacked a CHUNK_ITEM on the missing
 device? That might be what pinned it in that case.
>>> In fact in my virtual setup there was more chunk missing (1 metadata 1
>>> System and 1 Data).
>>> I will try to do a setup closer to my real one.
>> Probably the reason why that missing device has no used chunks is
>> because it's so small. Btrfs allocates block groups to devices with
>> the most unallocated space first. Only once the unallocated space is
>> even (approximately) on all devices would it allocate a block group to
>> the small device.
>>
>>
>

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: unable to handle kernel paging request - btrfs

2016-09-30 Thread Rich Freeman
On Fri, Sep 30, 2016 at 4:55 PM, Jeff Mahoney  wrote:
> This looks like a use-after-free on one of the pages used for
> compression.  Can you post the output of objdump -Dr
> /lib/modules/$(uname -r)/kernel/fs/btrfs/btrfs.ko somewhere?
>

Sure:
https://drive.google.com/open?id=0BwUDImviY_gcR3JfT0Z1cUlRVEk

I was impressed by just how large it was.

I take it you're going to try to use the offsets in the oops to figure
out where it went wrong?  I really need to get kernel core dumping
working on this box...

--
Rich
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: unable to handle kernel paging request - btrfs

2016-09-30 Thread Jeff Mahoney
On 9/30/16 2:54 PM, Rich Freeman wrote:
> On Thu, Sep 22, 2016 at 1:41 PM, Jeff Mahoney  wrote:
>> On 9/22/16 8:18 AM, Rich Freeman wrote:
>>> I have been getting panics consistently after doing a btrfs replace
>>> operation on a raid1 and rebooting.  I linked a photo of the panic; I
>>> haven't been able to get a text capture of it.
>>>
>>> https://ibin.co/2vx0HhDeViu3.jpg
>>>
>>> I'm getting this error on the latest 4.4, 4.1, and even on an old
>>> 3.18.26 kernel I had lying around.
>>>
>>> I tried the remove root_log_ctx from ctx list before btrfs_sync_log
>>> returns patch on 4.1 and that did not solve my problem either.
>>>
>>> I'm able to boot into single-user mode and if I don't start any
>>> processes the system seems fairly stable.  I am also able to start a
>>> btrfs balance and run that for several hours without issue.  If I
>>> start launching services the system will tend to panic, though how
>>> many processes I can launch will vary.  I don't think that it is a
>>> particular file being accessed that is triggering the issue since the
>>> point where it fails varies.  I suspect it may be load-related.
>>>
>>> Mounting with compress=no doesn't seem to help either.  Granted, I see
>>> lzo_decompress in the backtrace and that is probably a read operation.
>>>
>>> Any suggestions?  Google hasn't been helpful on this one...
>>
>> Can you boot with panic_on_oops=1, reproduce it, and capture that Oops?
>> The trace in your photo is a secondary Oops (tainted D), which means
>> that something else went wrong before that and now the system is
>> tripping over it.  Secondary Oopses don't really help the debugging
>> process because the system was already in a broken, undefined, state.
>>
> 
> Ok, the system has been up for a week without issue, but just paniced
> and rebooted right towards the end of a balance (it literally had
> about 30 of 2500 chunks left).
> 
> After it came up (and waiting for it to fully mount as there were a
> bunch of free space warnings/etc) I managed to capture an initial oops
> when it happened again:
> 
> https://ibin.co/2wt0n2IaCOA3.jpg
> 
> This is on a system without swap, though my understanding is that the
> paging system is used for other things.

It's not paging in the way Microsoft uses the term.  In this context, it
just means that the kernel tried to resolve a virtual address and
failed.  When the address is < PAGE_SIZE, it prints the NULL pointer
dereference message instead.  It's literally the same code otherwise.
Short version: it's the same thing as a segfault in userspace code.

This looks like a use-after-free on one of the pages used for
compression.  Can you post the output of objdump -Dr
/lib/modules/$(uname -r)/kernel/fs/btrfs/btrfs.ko somewhere?

-Jeff

-- 
Jeff Mahoney
SUSE Labs



signature.asc
Description: OpenPGP digital signature


Re: unable to handle kernel paging request - btrfs

2016-09-30 Thread Rich Freeman
On Thu, Sep 22, 2016 at 1:41 PM, Jeff Mahoney  wrote:
> On 9/22/16 8:18 AM, Rich Freeman wrote:
>> I have been getting panics consistently after doing a btrfs replace
>> operation on a raid1 and rebooting.  I linked a photo of the panic; I
>> haven't been able to get a text capture of it.
>>
>> https://ibin.co/2vx0HhDeViu3.jpg
>>
>> I'm getting this error on the latest 4.4, 4.1, and even on an old
>> 3.18.26 kernel I had lying around.
>>
>> I tried the remove root_log_ctx from ctx list before btrfs_sync_log
>> returns patch on 4.1 and that did not solve my problem either.
>>
>> I'm able to boot into single-user mode and if I don't start any
>> processes the system seems fairly stable.  I am also able to start a
>> btrfs balance and run that for several hours without issue.  If I
>> start launching services the system will tend to panic, though how
>> many processes I can launch will vary.  I don't think that it is a
>> particular file being accessed that is triggering the issue since the
>> point where it fails varies.  I suspect it may be load-related.
>>
>> Mounting with compress=no doesn't seem to help either.  Granted, I see
>> lzo_decompress in the backtrace and that is probably a read operation.
>>
>> Any suggestions?  Google hasn't been helpful on this one...
>
> Can you boot with panic_on_oops=1, reproduce it, and capture that Oops?
> The trace in your photo is a secondary Oops (tainted D), which means
> that something else went wrong before that and now the system is
> tripping over it.  Secondary Oopses don't really help the debugging
> process because the system was already in a broken, undefined, state.
>

Ok, the system has been up for a week without issue, but just paniced
and rebooted right towards the end of a balance (it literally had
about 30 of 2500 chunks left).

After it came up (and waiting for it to fully mount as there were a
bunch of free space warnings/etc) I managed to capture an initial oops
when it happened again:

https://ibin.co/2wt0n2IaCOA3.jpg

This is on a system without swap, though my understanding is that the
paging system is used for other things.

Note that I've updated my kernel since my last post.  When it paniced
during the balance it was running 4.4.21, and on the oops I actually
captured it was on 4.4.23 (I was actually just waiting for the balance
to finish before rebooting with a new kernel).

--
Rich
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/2] btrfs: Add trace point for qgroup reserved space

2016-09-30 Thread Goldwyn Rodrigues


On 09/29/2016 08:15 PM, Qu Wenruo wrote:
> Introduce the following trace points:
> qgroup_update_reserve
> qgroup_meta_reserve
> 
> And modify the timing of btrfs_qgroup_free_delayed_ref() and
> btrfs_qgroup_release_data() events, to work with qgroup_update_reserve()
> event.
> 
> Signed-off-by: Qu Wenruo 

Reviewed-by: Goldwyn Rodrigues 


> ---
>  fs/btrfs/qgroup.c| 21 ++---
>  fs/btrfs/qgroup.h|  2 +-
>  include/trace/events/btrfs.h | 43 +++
>  3 files changed, 62 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
> index 8532587..593a519 100644
> --- a/fs/btrfs/qgroup.c
> +++ b/fs/btrfs/qgroup.c
> @@ -1062,6 +1062,8 @@ static int __qgroup_excl_accounting(struct 
> btrfs_fs_info *fs_info,
>   qgroup->excl += sign * num_bytes;
>   qgroup->excl_cmpr += sign * num_bytes;
>   if (sign > 0) {
> + trace_qgroup_update_reserve(fs_info, qgroup->qgroupid,
> + qgroup->reserved, (s64)-num_bytes);
>   if (WARN_ON(qgroup->reserved < num_bytes))
>   qgroup->reserved = 0;
>   else
> @@ -1087,6 +1089,9 @@ static int __qgroup_excl_accounting(struct 
> btrfs_fs_info *fs_info,
>   WARN_ON(sign < 0 && qgroup->excl < num_bytes);
>   qgroup->excl += sign * num_bytes;
>   if (sign > 0) {
> + trace_qgroup_update_reserve(fs_info, qgroup->qgroupid,
> + qgroup->reserved,
> + (s64)-num_bytes);
>   if (WARN_ON(qgroup->reserved < num_bytes))
>   qgroup->reserved = 0;
>   else
> @@ -2164,6 +2169,8 @@ static int qgroup_reserve(struct btrfs_root *root, u64 
> num_bytes)
>  
>   qg = u64_to_ptr(unode->aux);
>  
> + trace_qgroup_update_reserve(fs_info, qg->qgroupid,
> + qg->reserved, num_bytes);
>   qg->reserved += num_bytes;
>   }
>  
> @@ -2209,6 +2216,8 @@ void btrfs_qgroup_free_refroot(struct btrfs_fs_info 
> *fs_info,
>  
>   qg = u64_to_ptr(unode->aux);
>  
> + trace_qgroup_update_reserve(fs_info, qg->qgroupid,
> + qg->reserved, (s64)-num_bytes);
>   if (WARN_ON(qgroup->reserved < num_bytes))
>   qgroup->reserved = 0;
>   else
> @@ -2637,12 +2646,12 @@ static int __btrfs_qgroup_release_data(struct inode 
> *inode, u64 start, u64 len,
>   if (ret < 0)
>   goto out;
>  
> - if (free) {
> - qgroup_free(BTRFS_I(inode)->root, changeset.bytes_changed);
> + if (free)
>   trace_op = QGROUP_FREE;
> - }
>   trace_btrfs_qgroup_release_data(inode, start, len,
>   changeset.bytes_changed, trace_op);
> + if (free)
> + qgroup_free(BTRFS_I(inode)->root, changeset.bytes_changed);
>  out:
>   ulist_free(changeset.range_changed);
>   return ret;
> @@ -2692,6 +2701,8 @@ int btrfs_qgroup_reserve_meta(struct btrfs_root *root, 
> int num_bytes)
>   return 0;
>  
>   BUG_ON(num_bytes != round_down(num_bytes, root->nodesize));
> + trace_qgroup_meta_reserve(root->fs_info, root->objectid,
> +   (s64)num_bytes);
>   ret = qgroup_reserve(root, num_bytes);
>   if (ret < 0)
>   return ret;
> @@ -2709,6 +2720,8 @@ void btrfs_qgroup_free_meta_all(struct btrfs_root *root)
>   reserved = atomic_xchg(>qgroup_meta_rsv, 0);
>   if (reserved == 0)
>   return;
> + trace_qgroup_meta_reserve(root->fs_info, root->objectid,
> +   (s64)-reserved);
>   qgroup_free(root, reserved);
>  }
>  
> @@ -2720,6 +2733,8 @@ void btrfs_qgroup_free_meta(struct btrfs_root *root, 
> int num_bytes)
>   BUG_ON(num_bytes != round_down(num_bytes, root->nodesize));
>   WARN_ON(atomic_read(>qgroup_meta_rsv) < num_bytes);
>   atomic_sub(num_bytes, >qgroup_meta_rsv);
> + trace_qgroup_meta_reserve(root->fs_info, root->objectid,
> +   (s64)-num_bytes);
>   qgroup_free(root, num_bytes);
>  }
>  
> diff --git a/fs/btrfs/qgroup.h b/fs/btrfs/qgroup.h
> index 1bc64c8..6b6756c 100644
> --- a/fs/btrfs/qgroup.h
> +++ b/fs/btrfs/qgroup.h
> @@ -114,8 +114,8 @@ void btrfs_qgroup_free_refroot(struct btrfs_fs_info 
> *fs_info,
>  static inline void btrfs_qgroup_free_delayed_ref(struct btrfs_fs_info 
> *fs_info,
>u64 ref_root, u64 num_bytes)
>  {
> - btrfs_qgroup_free_refroot(fs_info, ref_root, num_bytes);
>   trace_btrfs_qgroup_free_delayed_ref(fs_info, ref_root, num_bytes);
> + 

Re: [PATCH 1/2] btrfs: Add WARN_ON for qgroup reserved underflow

2016-09-30 Thread Goldwyn Rodrigues


On 09/29/2016 08:15 PM, Qu Wenruo wrote:
> While the reason why qgroup reserved space may underflow is still under
> investigation, such WARN_ON will help us to expose the bug more easily,
> and for end-user we can detect and avoid underflow.
> 
> Signed-off-by: Qu Wenruo 

Looks good.

Reviewed-by: Goldwyn Rodrigues 

> ---
>  fs/btrfs/qgroup.c | 21 -
>  1 file changed, 16 insertions(+), 5 deletions(-)
> 
> diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
> index 8db2e29..8532587 100644
> --- a/fs/btrfs/qgroup.c
> +++ b/fs/btrfs/qgroup.c
> @@ -1061,8 +1061,12 @@ static int __qgroup_excl_accounting(struct 
> btrfs_fs_info *fs_info,
>   WARN_ON(sign < 0 && qgroup->excl < num_bytes);
>   qgroup->excl += sign * num_bytes;
>   qgroup->excl_cmpr += sign * num_bytes;
> - if (sign > 0)
> - qgroup->reserved -= num_bytes;
> + if (sign > 0) {
> + if (WARN_ON(qgroup->reserved < num_bytes))
> + qgroup->reserved = 0;
> + else
> + qgroup->reserved -= num_bytes;
> + }
>  
>   qgroup_dirty(fs_info, qgroup);
>  
> @@ -1082,8 +1086,12 @@ static int __qgroup_excl_accounting(struct 
> btrfs_fs_info *fs_info,
>   qgroup->rfer_cmpr += sign * num_bytes;
>   WARN_ON(sign < 0 && qgroup->excl < num_bytes);
>   qgroup->excl += sign * num_bytes;
> - if (sign > 0)
> - qgroup->reserved -= num_bytes;
> + if (sign > 0) {
> + if (WARN_ON(qgroup->reserved < num_bytes))
> + qgroup->reserved = 0;
> + else
> + qgroup->reserved -= num_bytes;
> + }
>   qgroup->excl_cmpr += sign * num_bytes;
>   qgroup_dirty(fs_info, qgroup);
>  
> @@ -2201,7 +2209,10 @@ void btrfs_qgroup_free_refroot(struct btrfs_fs_info 
> *fs_info,
>  
>   qg = u64_to_ptr(unode->aux);
>  
> - qg->reserved -= num_bytes;
> + if (WARN_ON(qgroup->reserved < num_bytes))
> + qgroup->reserved = 0;
> + else
> + qgroup->reserved -= num_bytes;
>  
>   list_for_each_entry(glist, >groups, next_group) {
>   ret = ulist_add(fs_info->qgroup_ulist,
> 

-- 
Goldwyn
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] qgroup: Prevent qgroup->reserved from going subzero

2016-09-30 Thread Goldwyn Rodrigues
From: Goldwyn Rodrigues 

While free'ing qgroup->reserved resources, we much check if
the page has not been invalidated by a truncate operation
by checking if the page is still dirty before reducing the
qgroup resources. Resources in such a case are free'd when
the entire extent is released by delayed_ref.

This fixes a double accounting while releasing resources
in case of truncating a file, reproduced by the following testcase.

SCRATCH_DEV=/dev/vdb
SCRATCH_MNT=/mnt
mkfs.btrfs -f $SCRATCH_DEV
mount -t btrfs $SCRATCH_DEV $SCRATCH_MNT
cd $SCRATCH_MNT
btrfs quota enable $SCRATCH_MNT
btrfs subvolume create a
btrfs qgroup limit 500m a $SCRATCH_MNT
sync
for c in {1..15}; do
dd if=/dev/zero  bs=1M count=40 of=$SCRATCH_MNT/a/file;
done

sleep 10
sync
sleep 5

touch $SCRATCH_MNT/a/newfile

echo "Removing file"
rm $SCRATCH_MNT/a/file

Fixes: b9d0b38928 ("btrfs: Add handler for invalidate page")
Signed-off-by: Goldwyn Rodrigues 
Reviewed-by: Qu Wenruo 
---
 fs/btrfs/inode.c | 9 +++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index e6811c4..bc1a004 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -8915,9 +8915,14 @@ again:
 *So even we call qgroup_free_data(), it won't decrease reserved
 *space.
 * 2) Not written to disk
-*This means the reserved space should be freed here.
+*This means the reserved space should be freed here. However,
+*if a truncate invalidates the page (by clearing PageDirty)
+*and the page is accounted for while allocating extent
+*in btrfs_check_data_free_space() we let delayed_ref to
+*free the entire extent.
 */
-   btrfs_qgroup_free_data(inode, page_start, PAGE_SIZE);
+   if (PageDirty(page))
+   btrfs_qgroup_free_data(inode, page_start, PAGE_SIZE);
if (!inode_evicting) {
clear_extent_bit(tree, page_start, page_end,
 EXTENT_LOCKED | EXTENT_DIRTY |
-- 
2.10.0

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] btrfs-progs: qgroup: Fix regression leads to corrupted qgroup status

2016-09-30 Thread David Sterba
On Fri, Sep 30, 2016 at 08:44:58AM +0800, Qu Wenruo wrote:
> 
> 
> At 09/30/2016 01:19 AM, David Sterba wrote:
> > On Wed, Sep 07, 2016 at 10:54:19AM +0800, Qu Wenruo wrote:
> >> Commit 93dabf211d74daf6e3de642bdd887a90a00f7b49
> >> Author: Mark Fasheh 
> >> Date:   Fri Jun 17 13:37:48 2016 -0700
> >>
> >> btrfs-progs: check: verify qgroups above level 0
> >>
> >> This commit introduced a new regression which corrupts
> >> read_qgroup_status, since it iterate leaf with manually specified slot,
> >> not correct path->slot[0].
> >>
> >> This leads to wrong slot[0] and read_qgroup_status() will read out wrong
> >> flags, leading to regression.
> >>
> >> Fix read_qgroup_status() by using eb and slot instread of wrong path
> >> strucutre.
> >>
> >> Reported-by: Tsutomu Itoh 
> >> Cc: Mark Fasheh 
> >> Signed-off-by: Qu Wenruo 
> >
> > I'm adding this patch to devel.  Do you have a test for the regression 
> > please?
> >
> >
> Xfstests btrfs/114 can produce it.

Good.

> If you mean to add btrfs-progs test case, then I can try to create a 
> minimal image to reproduce it.

Yes please.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html