Re: unable to handle kernel paging request - btrfs
On 9/30/16 5:07 PM, Rich Freeman wrote: > On Fri, Sep 30, 2016 at 4:55 PM, Jeff Mahoneywrote: >> This looks like a use-after-free on one of the pages used for >> compression. Can you post the output of objdump -Dr >> /lib/modules/$(uname -r)/kernel/fs/btrfs/btrfs.ko somewhere? >> > > Sure: > https://drive.google.com/open?id=0BwUDImviY_gcR3JfT0Z1cUlRVEk > > I was impressed by just how large it was. > > I take it you're going to try to use the offsets in the oops to figure > out where it went wrong? I really need to get kernel core dumping > working on this box... Yep. What I think is happening is that we have workspace getting freed while it's in use. The faulting address is in vmalloc space and it's also the first argument to memcpy, which makes it the destination. In lzo_decompress_biovec, that means it's the workspace->cbuf. Beyond that I'll have to dig a bit more. It's the same fault that your first photo showed as a secondary Oops, but that's not always the case. -- Jeff Mahoney SUSE Labs signature.asc Description: OpenPGP digital signature
Re: multi-device btrfs with single data mode and disk failure
Hello again, Just a quick question. I did a full scrub and got no error at all And a full check that gave me this : #> btrfs check --check-data-csum -p /dev/sde6 Checking filesystem on /dev/sde6 UUID: 62db560b-a040-4c64-b613-6e7db033dc4d checking extents [o] checking free space cache [o] checking fs roots [.] checking csums checking root refs checking quota groups Counts for qgroup id: 0/5 are different our:referenced 7239132803072 referenced compressed 7239132803072 disk:referenced 7238982733824 referenced compressed 7238982733824 diff:referenced 150069248 referenced compressed 150069248 our:exclusive 7239132803072 exclusive compressed 7239132803072 disk:exclusive 7238982733824 exclusive compressed 7238982733824 diff:exclusive 150069248 exclusive compressed 150069248 found 7323422314496 bytes used err is 0 total csum bytes: 7020314688 total tree bytes: 11797741568 total fs tree bytes: 2904932352 total extent tree bytes: 656654336 btree space waste bytes: 1560529439 file data blocks allocated: 297363385454592 referenced 6628544720896 I'm guessing that's not important, but I found nothing about this so I don't really know what's about. Can just confirm that everything seems OK ? Do you think of an another test I should do before starting to use my array again ? Le 29/09/2016 à 14:55, Alexandre Poux a écrit : > Hi, > > I finally did it : patched the kernel and removed the device. > As expected he did not scream since there was nothing at all on the device. > Now I'm checking that everything is fine: > scrub (in read only) > check (in read only) > but I think that everything will be OK > If not, I will rebuild the array from scratch (I did managed to save my > data) > > Thank you both for your guidance. > I think that a warning should be put in the wiki in order for other user > to not do the same mistake I did : > never ever use the single mode > > I will try to do it soon > > Again thank you > > Le 20/09/2016 à 23:15, Chris Murphy a écrit : >> On Tue, Sep 20, 2016 at 2:18 PM, Alexandre Pouxwrote: >>> Le 20/09/2016 à 21:46, Chris Murphy a écrit : On Tue, Sep 20, 2016 at 1:31 PM, Alexandre Poux wrote: > Le 20/09/2016 à 21:11, Chris Murphy a écrit : >> And no backup? Umm, I'd resolve that sooner than anything else. > Yeah you are absolutely right, this was a temporary solution which came > to be not that temporary. > And I regret it already... Well on the bright side, if this were LVM or mdadm linear/concat array, the whole thing would be toast because any other file system would have lost too much fs metadata on the missing device. >> It >> should be true that it'll tolerate a read only mount indefinitely, but >> read write? Not sure. This sort of edge case isn't well tested at all >> seeing as it required changing the kernel to reduce safe guards. So >> all bets are off the whole thing could become unmountable, not even >> read only, and then it's a scraping job. > I'm not that crazy, I tried the patch inside a virtual machine on > virtual drives... > And since it's only virtual, it may not work on the real partition... Are you sure the virtual setup lacked a CHUNK_ITEM on the missing device? That might be what pinned it in that case. >>> In fact in my virtual setup there was more chunk missing (1 metadata 1 >>> System and 1 Data). >>> I will try to do a setup closer to my real one. >> Probably the reason why that missing device has no used chunks is >> because it's so small. Btrfs allocates block groups to devices with >> the most unallocated space first. Only once the unallocated space is >> even (approximately) on all devices would it allocate a block group to >> the small device. >> >> > -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: unable to handle kernel paging request - btrfs
On Fri, Sep 30, 2016 at 4:55 PM, Jeff Mahoneywrote: > This looks like a use-after-free on one of the pages used for > compression. Can you post the output of objdump -Dr > /lib/modules/$(uname -r)/kernel/fs/btrfs/btrfs.ko somewhere? > Sure: https://drive.google.com/open?id=0BwUDImviY_gcR3JfT0Z1cUlRVEk I was impressed by just how large it was. I take it you're going to try to use the offsets in the oops to figure out where it went wrong? I really need to get kernel core dumping working on this box... -- Rich -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: unable to handle kernel paging request - btrfs
On 9/30/16 2:54 PM, Rich Freeman wrote: > On Thu, Sep 22, 2016 at 1:41 PM, Jeff Mahoneywrote: >> On 9/22/16 8:18 AM, Rich Freeman wrote: >>> I have been getting panics consistently after doing a btrfs replace >>> operation on a raid1 and rebooting. I linked a photo of the panic; I >>> haven't been able to get a text capture of it. >>> >>> https://ibin.co/2vx0HhDeViu3.jpg >>> >>> I'm getting this error on the latest 4.4, 4.1, and even on an old >>> 3.18.26 kernel I had lying around. >>> >>> I tried the remove root_log_ctx from ctx list before btrfs_sync_log >>> returns patch on 4.1 and that did not solve my problem either. >>> >>> I'm able to boot into single-user mode and if I don't start any >>> processes the system seems fairly stable. I am also able to start a >>> btrfs balance and run that for several hours without issue. If I >>> start launching services the system will tend to panic, though how >>> many processes I can launch will vary. I don't think that it is a >>> particular file being accessed that is triggering the issue since the >>> point where it fails varies. I suspect it may be load-related. >>> >>> Mounting with compress=no doesn't seem to help either. Granted, I see >>> lzo_decompress in the backtrace and that is probably a read operation. >>> >>> Any suggestions? Google hasn't been helpful on this one... >> >> Can you boot with panic_on_oops=1, reproduce it, and capture that Oops? >> The trace in your photo is a secondary Oops (tainted D), which means >> that something else went wrong before that and now the system is >> tripping over it. Secondary Oopses don't really help the debugging >> process because the system was already in a broken, undefined, state. >> > > Ok, the system has been up for a week without issue, but just paniced > and rebooted right towards the end of a balance (it literally had > about 30 of 2500 chunks left). > > After it came up (and waiting for it to fully mount as there were a > bunch of free space warnings/etc) I managed to capture an initial oops > when it happened again: > > https://ibin.co/2wt0n2IaCOA3.jpg > > This is on a system without swap, though my understanding is that the > paging system is used for other things. It's not paging in the way Microsoft uses the term. In this context, it just means that the kernel tried to resolve a virtual address and failed. When the address is < PAGE_SIZE, it prints the NULL pointer dereference message instead. It's literally the same code otherwise. Short version: it's the same thing as a segfault in userspace code. This looks like a use-after-free on one of the pages used for compression. Can you post the output of objdump -Dr /lib/modules/$(uname -r)/kernel/fs/btrfs/btrfs.ko somewhere? -Jeff -- Jeff Mahoney SUSE Labs signature.asc Description: OpenPGP digital signature
Re: unable to handle kernel paging request - btrfs
On Thu, Sep 22, 2016 at 1:41 PM, Jeff Mahoneywrote: > On 9/22/16 8:18 AM, Rich Freeman wrote: >> I have been getting panics consistently after doing a btrfs replace >> operation on a raid1 and rebooting. I linked a photo of the panic; I >> haven't been able to get a text capture of it. >> >> https://ibin.co/2vx0HhDeViu3.jpg >> >> I'm getting this error on the latest 4.4, 4.1, and even on an old >> 3.18.26 kernel I had lying around. >> >> I tried the remove root_log_ctx from ctx list before btrfs_sync_log >> returns patch on 4.1 and that did not solve my problem either. >> >> I'm able to boot into single-user mode and if I don't start any >> processes the system seems fairly stable. I am also able to start a >> btrfs balance and run that for several hours without issue. If I >> start launching services the system will tend to panic, though how >> many processes I can launch will vary. I don't think that it is a >> particular file being accessed that is triggering the issue since the >> point where it fails varies. I suspect it may be load-related. >> >> Mounting with compress=no doesn't seem to help either. Granted, I see >> lzo_decompress in the backtrace and that is probably a read operation. >> >> Any suggestions? Google hasn't been helpful on this one... > > Can you boot with panic_on_oops=1, reproduce it, and capture that Oops? > The trace in your photo is a secondary Oops (tainted D), which means > that something else went wrong before that and now the system is > tripping over it. Secondary Oopses don't really help the debugging > process because the system was already in a broken, undefined, state. > Ok, the system has been up for a week without issue, but just paniced and rebooted right towards the end of a balance (it literally had about 30 of 2500 chunks left). After it came up (and waiting for it to fully mount as there were a bunch of free space warnings/etc) I managed to capture an initial oops when it happened again: https://ibin.co/2wt0n2IaCOA3.jpg This is on a system without swap, though my understanding is that the paging system is used for other things. Note that I've updated my kernel since my last post. When it paniced during the balance it was running 4.4.21, and on the oops I actually captured it was on 4.4.23 (I was actually just waiting for the balance to finish before rebooting with a new kernel). -- Rich -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/2] btrfs: Add trace point for qgroup reserved space
On 09/29/2016 08:15 PM, Qu Wenruo wrote: > Introduce the following trace points: > qgroup_update_reserve > qgroup_meta_reserve > > And modify the timing of btrfs_qgroup_free_delayed_ref() and > btrfs_qgroup_release_data() events, to work with qgroup_update_reserve() > event. > > Signed-off-by: Qu WenruoReviewed-by: Goldwyn Rodrigues > --- > fs/btrfs/qgroup.c| 21 ++--- > fs/btrfs/qgroup.h| 2 +- > include/trace/events/btrfs.h | 43 +++ > 3 files changed, 62 insertions(+), 4 deletions(-) > > diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c > index 8532587..593a519 100644 > --- a/fs/btrfs/qgroup.c > +++ b/fs/btrfs/qgroup.c > @@ -1062,6 +1062,8 @@ static int __qgroup_excl_accounting(struct > btrfs_fs_info *fs_info, > qgroup->excl += sign * num_bytes; > qgroup->excl_cmpr += sign * num_bytes; > if (sign > 0) { > + trace_qgroup_update_reserve(fs_info, qgroup->qgroupid, > + qgroup->reserved, (s64)-num_bytes); > if (WARN_ON(qgroup->reserved < num_bytes)) > qgroup->reserved = 0; > else > @@ -1087,6 +1089,9 @@ static int __qgroup_excl_accounting(struct > btrfs_fs_info *fs_info, > WARN_ON(sign < 0 && qgroup->excl < num_bytes); > qgroup->excl += sign * num_bytes; > if (sign > 0) { > + trace_qgroup_update_reserve(fs_info, qgroup->qgroupid, > + qgroup->reserved, > + (s64)-num_bytes); > if (WARN_ON(qgroup->reserved < num_bytes)) > qgroup->reserved = 0; > else > @@ -2164,6 +2169,8 @@ static int qgroup_reserve(struct btrfs_root *root, u64 > num_bytes) > > qg = u64_to_ptr(unode->aux); > > + trace_qgroup_update_reserve(fs_info, qg->qgroupid, > + qg->reserved, num_bytes); > qg->reserved += num_bytes; > } > > @@ -2209,6 +2216,8 @@ void btrfs_qgroup_free_refroot(struct btrfs_fs_info > *fs_info, > > qg = u64_to_ptr(unode->aux); > > + trace_qgroup_update_reserve(fs_info, qg->qgroupid, > + qg->reserved, (s64)-num_bytes); > if (WARN_ON(qgroup->reserved < num_bytes)) > qgroup->reserved = 0; > else > @@ -2637,12 +2646,12 @@ static int __btrfs_qgroup_release_data(struct inode > *inode, u64 start, u64 len, > if (ret < 0) > goto out; > > - if (free) { > - qgroup_free(BTRFS_I(inode)->root, changeset.bytes_changed); > + if (free) > trace_op = QGROUP_FREE; > - } > trace_btrfs_qgroup_release_data(inode, start, len, > changeset.bytes_changed, trace_op); > + if (free) > + qgroup_free(BTRFS_I(inode)->root, changeset.bytes_changed); > out: > ulist_free(changeset.range_changed); > return ret; > @@ -2692,6 +2701,8 @@ int btrfs_qgroup_reserve_meta(struct btrfs_root *root, > int num_bytes) > return 0; > > BUG_ON(num_bytes != round_down(num_bytes, root->nodesize)); > + trace_qgroup_meta_reserve(root->fs_info, root->objectid, > + (s64)num_bytes); > ret = qgroup_reserve(root, num_bytes); > if (ret < 0) > return ret; > @@ -2709,6 +2720,8 @@ void btrfs_qgroup_free_meta_all(struct btrfs_root *root) > reserved = atomic_xchg(>qgroup_meta_rsv, 0); > if (reserved == 0) > return; > + trace_qgroup_meta_reserve(root->fs_info, root->objectid, > + (s64)-reserved); > qgroup_free(root, reserved); > } > > @@ -2720,6 +2733,8 @@ void btrfs_qgroup_free_meta(struct btrfs_root *root, > int num_bytes) > BUG_ON(num_bytes != round_down(num_bytes, root->nodesize)); > WARN_ON(atomic_read(>qgroup_meta_rsv) < num_bytes); > atomic_sub(num_bytes, >qgroup_meta_rsv); > + trace_qgroup_meta_reserve(root->fs_info, root->objectid, > + (s64)-num_bytes); > qgroup_free(root, num_bytes); > } > > diff --git a/fs/btrfs/qgroup.h b/fs/btrfs/qgroup.h > index 1bc64c8..6b6756c 100644 > --- a/fs/btrfs/qgroup.h > +++ b/fs/btrfs/qgroup.h > @@ -114,8 +114,8 @@ void btrfs_qgroup_free_refroot(struct btrfs_fs_info > *fs_info, > static inline void btrfs_qgroup_free_delayed_ref(struct btrfs_fs_info > *fs_info, >u64 ref_root, u64 num_bytes) > { > - btrfs_qgroup_free_refroot(fs_info, ref_root, num_bytes); > trace_btrfs_qgroup_free_delayed_ref(fs_info, ref_root, num_bytes); > +
Re: [PATCH 1/2] btrfs: Add WARN_ON for qgroup reserved underflow
On 09/29/2016 08:15 PM, Qu Wenruo wrote: > While the reason why qgroup reserved space may underflow is still under > investigation, such WARN_ON will help us to expose the bug more easily, > and for end-user we can detect and avoid underflow. > > Signed-off-by: Qu WenruoLooks good. Reviewed-by: Goldwyn Rodrigues > --- > fs/btrfs/qgroup.c | 21 - > 1 file changed, 16 insertions(+), 5 deletions(-) > > diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c > index 8db2e29..8532587 100644 > --- a/fs/btrfs/qgroup.c > +++ b/fs/btrfs/qgroup.c > @@ -1061,8 +1061,12 @@ static int __qgroup_excl_accounting(struct > btrfs_fs_info *fs_info, > WARN_ON(sign < 0 && qgroup->excl < num_bytes); > qgroup->excl += sign * num_bytes; > qgroup->excl_cmpr += sign * num_bytes; > - if (sign > 0) > - qgroup->reserved -= num_bytes; > + if (sign > 0) { > + if (WARN_ON(qgroup->reserved < num_bytes)) > + qgroup->reserved = 0; > + else > + qgroup->reserved -= num_bytes; > + } > > qgroup_dirty(fs_info, qgroup); > > @@ -1082,8 +1086,12 @@ static int __qgroup_excl_accounting(struct > btrfs_fs_info *fs_info, > qgroup->rfer_cmpr += sign * num_bytes; > WARN_ON(sign < 0 && qgroup->excl < num_bytes); > qgroup->excl += sign * num_bytes; > - if (sign > 0) > - qgroup->reserved -= num_bytes; > + if (sign > 0) { > + if (WARN_ON(qgroup->reserved < num_bytes)) > + qgroup->reserved = 0; > + else > + qgroup->reserved -= num_bytes; > + } > qgroup->excl_cmpr += sign * num_bytes; > qgroup_dirty(fs_info, qgroup); > > @@ -2201,7 +2209,10 @@ void btrfs_qgroup_free_refroot(struct btrfs_fs_info > *fs_info, > > qg = u64_to_ptr(unode->aux); > > - qg->reserved -= num_bytes; > + if (WARN_ON(qgroup->reserved < num_bytes)) > + qgroup->reserved = 0; > + else > + qgroup->reserved -= num_bytes; > > list_for_each_entry(glist, >groups, next_group) { > ret = ulist_add(fs_info->qgroup_ulist, > -- Goldwyn -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] qgroup: Prevent qgroup->reserved from going subzero
From: Goldwyn RodriguesWhile free'ing qgroup->reserved resources, we much check if the page has not been invalidated by a truncate operation by checking if the page is still dirty before reducing the qgroup resources. Resources in such a case are free'd when the entire extent is released by delayed_ref. This fixes a double accounting while releasing resources in case of truncating a file, reproduced by the following testcase. SCRATCH_DEV=/dev/vdb SCRATCH_MNT=/mnt mkfs.btrfs -f $SCRATCH_DEV mount -t btrfs $SCRATCH_DEV $SCRATCH_MNT cd $SCRATCH_MNT btrfs quota enable $SCRATCH_MNT btrfs subvolume create a btrfs qgroup limit 500m a $SCRATCH_MNT sync for c in {1..15}; do dd if=/dev/zero bs=1M count=40 of=$SCRATCH_MNT/a/file; done sleep 10 sync sleep 5 touch $SCRATCH_MNT/a/newfile echo "Removing file" rm $SCRATCH_MNT/a/file Fixes: b9d0b38928 ("btrfs: Add handler for invalidate page") Signed-off-by: Goldwyn Rodrigues Reviewed-by: Qu Wenruo --- fs/btrfs/inode.c | 9 +++-- 1 file changed, 7 insertions(+), 2 deletions(-) diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index e6811c4..bc1a004 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -8915,9 +8915,14 @@ again: *So even we call qgroup_free_data(), it won't decrease reserved *space. * 2) Not written to disk -*This means the reserved space should be freed here. +*This means the reserved space should be freed here. However, +*if a truncate invalidates the page (by clearing PageDirty) +*and the page is accounted for while allocating extent +*in btrfs_check_data_free_space() we let delayed_ref to +*free the entire extent. */ - btrfs_qgroup_free_data(inode, page_start, PAGE_SIZE); + if (PageDirty(page)) + btrfs_qgroup_free_data(inode, page_start, PAGE_SIZE); if (!inode_evicting) { clear_extent_bit(tree, page_start, page_end, EXTENT_LOCKED | EXTENT_DIRTY | -- 2.10.0 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] btrfs-progs: qgroup: Fix regression leads to corrupted qgroup status
On Fri, Sep 30, 2016 at 08:44:58AM +0800, Qu Wenruo wrote: > > > At 09/30/2016 01:19 AM, David Sterba wrote: > > On Wed, Sep 07, 2016 at 10:54:19AM +0800, Qu Wenruo wrote: > >> Commit 93dabf211d74daf6e3de642bdd887a90a00f7b49 > >> Author: Mark Fasheh> >> Date: Fri Jun 17 13:37:48 2016 -0700 > >> > >> btrfs-progs: check: verify qgroups above level 0 > >> > >> This commit introduced a new regression which corrupts > >> read_qgroup_status, since it iterate leaf with manually specified slot, > >> not correct path->slot[0]. > >> > >> This leads to wrong slot[0] and read_qgroup_status() will read out wrong > >> flags, leading to regression. > >> > >> Fix read_qgroup_status() by using eb and slot instread of wrong path > >> strucutre. > >> > >> Reported-by: Tsutomu Itoh > >> Cc: Mark Fasheh > >> Signed-off-by: Qu Wenruo > > > > I'm adding this patch to devel. Do you have a test for the regression > > please? > > > > > Xfstests btrfs/114 can produce it. Good. > If you mean to add btrfs-progs test case, then I can try to create a > minimal image to reproduce it. Yes please. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html