Re: [PATCH v2 0/6] btrfs: qgroup: Delay subtree scan to reduce overhead

2018-12-07 Thread Qu Wenruo


On 2018/12/8 8:47 AM, David Sterba wrote:
> On Fri, Dec 07, 2018 at 06:51:21AM +0800, Qu Wenruo wrote:
>>
>>
>> On 2018/12/7 3:35 AM, David Sterba wrote:
>>> On Mon, Nov 12, 2018 at 10:33:33PM +0100, David Sterba wrote:
>>>> On Thu, Nov 08, 2018 at 01:49:12PM +0800, Qu Wenruo wrote:
>>>>> This patchset can be fetched from github:
>>>>> https://github.com/adam900710/linux/tree/qgroup_delayed_subtree_rebased
>>>>>
>>>>> Which is based on v4.20-rc1.
>>>>
>>>> Thanks, I'll add it to for-next soon.
>>>
>>> The branch was there for some time but not for at least a week (my
>>> mistake, I did not notice in time). I've rebased it on top of recent
>>> misc-next, but without the delayed refs patchset from Josef.
>>>
>>> At the moment I'm considering it for merge to 4.21; there's still some
>>> time to pull it out in case it turns out to be too problematic. I'm
>>> mostly worried about the unknown interactions with the enospc updates or
>>
>> For that part, I don't think it would cause any obvious problem for the
>> enospc updates, as the user-noticeable effect is the delay of reloc tree
>> deletion.
>>
>> Apart from that, it's mostly transparent to extent allocation.
>>
>>> generally because of lack of qgroup and reloc code reviews.
>>
>> That's the biggest problem.
>>
>> However, most of the current qgroup + balance optimization is done inside
>> the qgroup code (to skip certain qgroup records), so if we're going to hit
>> a problem, this patchset has the highest chance of hitting it.
>>
>> Later patches will mostly just keep tweaking the qgroup code without
>> affecting any other parts.
>>
>> So I'm fine if you decide to pull it out for now.
> 
> I've adapted a stress test that unpacks a large tarball, snapshots
> every 20 seconds, deletes a random snapshot every 50 seconds, and deletes
> files from the original subvolume, now enhanced with qgroups just for the
> new snapshots inheriting from the toplevel subvolume. Lockup.
> 
> It gets stuck in a snapshot call with the following stacktrace:
> 
> [<0>] btrfs_tree_read_lock+0xf3/0x150 [btrfs]
> [<0>] btrfs_qgroup_trace_subtree+0x280/0x7b0 [btrfs]

This looks like something is wrong in the original subtree tracing.

Thanks for the report, I'll investigate it.
Qu

> [<0>] do_walk_down+0x681/0xb20 [btrfs]
> [<0>] walk_down_tree+0xf5/0x1c0 [btrfs]
> [<0>] btrfs_drop_snapshot+0x43b/0xb60 [btrfs]
> [<0>] btrfs_clean_one_deleted_snapshot+0xc1/0x120 [btrfs]
> [<0>] cleaner_kthread+0xf8/0x170 [btrfs]
> [<0>] kthread+0x121/0x140
> [<0>] ret_from_fork+0x27/0x50
> 
> and that's like the 10th snapshot and ~3rd deletion. This is qgroup show:
> 
> qgroupid      rfer       excl       parent
> --------      ----       ----       ------
> 0/5      865.27MiB    1.66MiB       ---
> 0/257        0.00B      0.00B       ---
> 0/259        0.00B      0.00B       ---
> 0/260    806.58MiB  637.25MiB       ---
> 0/262        0.00B      0.00B       ---
> 0/263        0.00B      0.00B       ---
> 0/264        0.00B      0.00B       ---
> 0/265        0.00B      0.00B       ---
> 0/266        0.00B      0.00B       ---
> 0/267        0.00B      0.00B       ---
> 0/268        0.00B      0.00B       ---
> 0/269        0.00B      0.00B       ---
> 0/270    989.04MiB    1.22MiB       ---
> 0/271        0.00B      0.00B       ---
> 0/272    922.25MiB  416.00KiB       ---
> 0/273    931.02MiB    1.50MiB       ---
> 0/274    910.94MiB    1.52MiB       ---
> 1/1        1.64GiB    1.64GiB       0/5,0/257,0/259,0/260,0/262,0/263,0/264,0/265,0/266,0/267,0/268,0/269,0/270,0/271,0/272,0/273,0/274
> 
> No IO or CPU activity at this point; the stacktrace and the show output
> remain the same.
> 
> So, considering this, I'm not going to add the patchset to 4.21 but will
> keep it in for-next for testing, any fixups or updates will be applied.
> 





Re: [PATCH v2 0/6] btrfs: qgroup: Delay subtree scan to reduce overhead

2018-12-07 Thread David Sterba
On Fri, Dec 07, 2018 at 06:51:21AM +0800, Qu Wenruo wrote:
> 
> 
> On 2018/12/7 3:35 AM, David Sterba wrote:
> > On Mon, Nov 12, 2018 at 10:33:33PM +0100, David Sterba wrote:
> >> On Thu, Nov 08, 2018 at 01:49:12PM +0800, Qu Wenruo wrote:
> >>> This patchset can be fetched from github:
> >>> https://github.com/adam900710/linux/tree/qgroup_delayed_subtree_rebased
> >>>
> >>> Which is based on v4.20-rc1.
> >>
> >> Thanks, I'll add it to for-next soon.
> > 
> > The branch was there for some time but not for at least a week (my
> > mistake, I did not notice in time). I've rebased it on top of recent
> > misc-next, but without the delayed refs patchset from Josef.
> > 
> > At the moment I'm considering it for merge to 4.21; there's still some
> > time to pull it out in case it turns out to be too problematic. I'm
> > mostly worried about the unknown interactions with the enospc updates or
> 
> For that part, I don't think it would cause any obvious problem for the
> enospc updates, as the user-noticeable effect is the delay of reloc tree
> deletion.
> 
> Apart from that, it's mostly transparent to extent allocation.
> 
> > generally because of lack of qgroup and reloc code reviews.
> 
> That's the biggest problem.
> 
> However, most of the current qgroup + balance optimization is done inside
> the qgroup code (to skip certain qgroup records), so if we're going to hit
> a problem, this patchset has the highest chance of hitting it.
> 
> Later patches will mostly just keep tweaking the qgroup code without
> affecting any other parts.
> 
> So I'm fine if you decide to pull it out for now.

I've adapted a stress test that unpacks a large tarball, snapshots
every 20 seconds, deletes a random snapshot every 50 seconds, and deletes
files from the original subvolume, now enhanced with qgroups just for the
new snapshots inheriting from the toplevel subvolume. Lockup.

It gets stuck in a snapshot call with the following stacktrace:

[<0>] btrfs_tree_read_lock+0xf3/0x150 [btrfs]
[<0>] btrfs_qgroup_trace_subtree+0x280/0x7b0 [btrfs]
[<0>] do_walk_down+0x681/0xb20 [btrfs]
[<0>] walk_down_tree+0xf5/0x1c0 [btrfs]
[<0>] btrfs_drop_snapshot+0x43b/0xb60 [btrfs]
[<0>] btrfs_clean_one_deleted_snapshot+0xc1/0x120 [btrfs]
[<0>] cleaner_kthread+0xf8/0x170 [btrfs]
[<0>] kthread+0x121/0x140
[<0>] ret_from_fork+0x27/0x50

and that's like the 10th snapshot and ~3rd deletion. This is qgroup show:

qgroupid      rfer       excl       parent
--------      ----       ----       ------
0/5      865.27MiB    1.66MiB       ---
0/257        0.00B      0.00B       ---
0/259        0.00B      0.00B       ---
0/260    806.58MiB  637.25MiB       ---
0/262        0.00B      0.00B       ---
0/263        0.00B      0.00B       ---
0/264        0.00B      0.00B       ---
0/265        0.00B      0.00B       ---
0/266        0.00B      0.00B       ---
0/267        0.00B      0.00B       ---
0/268        0.00B      0.00B       ---
0/269        0.00B      0.00B       ---
0/270    989.04MiB    1.22MiB       ---
0/271        0.00B      0.00B       ---
0/272    922.25MiB  416.00KiB       ---
0/273    931.02MiB    1.50MiB       ---
0/274    910.94MiB    1.52MiB       ---
1/1        1.64GiB    1.64GiB       0/5,0/257,0/259,0/260,0/262,0/263,0/264,0/265,0/266,0/267,0/268,0/269,0/270,0/271,0/272,0/273,0/274

No IO or CPU activity at this point; the stacktrace and the show output
remain the same.

So, considering this, I'm not going to add the patchset to 4.21 but will
keep it in for-next for testing, any fixups or updates will be applied.


Re: [PATCH] libbtrfsutil: fix unprivileged tests if kernel lacks support

2018-12-07 Thread David Sterba
On Thu, Dec 06, 2018 at 04:29:32PM -0800, Omar Sandoval wrote:
> From: Omar Sandoval 
> 
> I apparently didn't test this on a pre-4.18 kernel.
> test_subvolume_info_unprivileged() checks for an ENOTTY, but this
> doesn't seem to work correctly with subTest().
> test_subvolume_iterator_unprivileged() doesn't have a check at all. Add
> an explicit check to both before doing the actual test.
> 
> Signed-off-by: Omar Sandoval 

Applied, thanks.


Re: [PATCH 2/8] btrfs: extent-tree: Open-code process_func in __btrfs_mod_ref

2018-12-07 Thread Nikolay Borisov



On 6.12.18 at 8:58, Qu Wenruo wrote:
> process_func is a function pointer that is never used as a hook anywhere
> else.
> 
> Open code it to make the later delayed ref refactor easier, so we can
> refactor btrfs_inc_extent_ref() and btrfs_free_extent() in separate
> patches.
> 
> Signed-off-by: Qu Wenruo 

Reviewed-by: Nikolay Borisov 

> ---
>  fs/btrfs/extent-tree.c | 33 ++---
>  1 file changed, 18 insertions(+), 15 deletions(-)
> 
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index ea2c3d5220f0..ea68d288d761 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -3220,10 +3220,6 @@ static int __btrfs_mod_ref(struct btrfs_trans_handle *trans,
>   int i;
>   int level;
>   int ret = 0;
> - int (*process_func)(struct btrfs_trans_handle *,
> - struct btrfs_root *,
> - u64, u64, u64, u64, u64, u64, bool);
> -
>  
>   if (btrfs_is_testing(fs_info))
>   return 0;
> @@ -3235,11 +3231,6 @@ static int __btrfs_mod_ref(struct btrfs_trans_handle *trans,
>   if (!test_bit(BTRFS_ROOT_REF_COWS, &root->state) && level == 0)
>   return 0;
>  
> - if (inc)
> - process_func = btrfs_inc_extent_ref;
> - else
> - process_func = btrfs_free_extent;
> -
>   if (full_backref)
>   parent = buf->start;
>   else
> @@ -3261,17 +3252,29 @@ static int __btrfs_mod_ref(struct btrfs_trans_handle *trans,
>  
>   num_bytes = btrfs_file_extent_disk_num_bytes(buf, fi);
>   key.offset -= btrfs_file_extent_offset(buf, fi);
> - ret = process_func(trans, root, bytenr, num_bytes,
> -parent, ref_root, key.objectid,
> -key.offset, for_reloc);
> + if (inc)
> + ret = btrfs_inc_extent_ref(trans, root, bytenr,
> + num_bytes, parent, ref_root,
> + key.objectid, key.offset,
> + for_reloc);
> + else
> + ret = btrfs_free_extent(trans, root, bytenr,
> + num_bytes, parent, ref_root,
> + key.objectid, key.offset,
> + for_reloc);
>   if (ret)
>   goto fail;
>   } else {
>   bytenr = btrfs_node_blockptr(buf, i);
>   num_bytes = fs_info->nodesize;
> - ret = process_func(trans, root, bytenr, num_bytes,
> -parent, ref_root, level - 1, 0,
> -for_reloc);
> + if (inc)
> + ret = btrfs_inc_extent_ref(trans, root, bytenr,
> + num_bytes, parent, ref_root,
> + level - 1, 0, for_reloc);
> + else
> + ret = btrfs_free_extent(trans, root, bytenr,
> + num_bytes, parent, ref_root,
> + level - 1, 0, for_reloc);
>   if (ret)
>   goto fail;
>   }
> 


Re: System unable to mount partition after a power loss

2018-12-07 Thread Doni Crosby
I ran that command, but I cannot get the email to send properly to the
mailing list because the attached output is over 4.6M.


On 12/7/2018 11:49 AM, Doni Crosby wrote:

The output of the command is attached. These are the errors that showed
up on the system:
parent transid verify failed on 3563224842240 wanted 5184691 found 5184689
parent transid verify failed on 3563224842240 wanted 5184691 found 5184689
parent transid verify failed on 3563222974464 wanted 5184691 found 5184688
parent transid verify failed on 3563222974464 wanted 5184691 found 5184688
parent transid verify failed on 3563223121920 wanted 5184691 found 5184688
parent transid verify failed on 3563223121920 wanted 5184691 found 5184688
parent transid verify failed on 3563229970432 wanted 5184691 found 5184689
parent transid verify failed on 3563229970432 wanted 5184691 found 5184689
parent transid verify failed on 3563229970432 wanted 5184691 found 5184689
parent transid verify failed on 3563229970432 wanted 5184691 found 5184689
Ignoring transid failure
parent transid verify failed on 3563231428608 wanted 5184691 found 5183327
parent transid verify failed on 3563231428608 wanted 5184691 found 5183327
parent transid verify failed on 3563231428608 wanted 5184691 found 5183327
parent transid verify failed on 3563231428608 wanted 5184691 found 5183327
Ignoring transid failure
parent transid verify failed on 3563231444992 wanted 5184691 found 5183325
parent transid verify failed on 3563231444992 wanted 5184691 found 5183325
parent transid verify failed on 3563231444992 wanted 5184691 found 5183325
parent transid verify failed on 3563231444992 wanted 5184691 found 5183325
Ignoring transid failure
parent transid verify failed on 3563231412224 wanted 5184691 found 5183325
parent transid verify failed on 3563231412224 wanted 5184691 found 5183325
parent transid verify failed on 3563231412224 wanted 5184691 found 5183325
parent transid verify failed on 3563231412224 wanted 5184691 found 5183325
Ignoring transid failure
parent transid verify failed on 3563231461376 wanted 5184691 found 5183325
parent transid verify failed on 3563231461376 wanted 5184691 found 5183325
parent transid verify failed on 3563231461376 wanted 5184691 found 5183325
parent transid verify failed on 3563231461376 wanted 5184691 found 5183325
Ignoring transid failure
WARNING: eb corrupted: parent bytenr 31801344 slot 132 level 1 child
bytenr 3563231461376 level has 1 expect 0, skipping the slot
parent transid verify failed on 3563231494144 wanted 5184691 found 5183325
parent transid verify failed on 3563231494144 wanted 5184691 found 5183325
parent transid verify failed on 3563231494144 wanted 5184691 found 5183325
parent transid verify failed on 3563231494144 wanted 5184691 found 5183325
Ignoring transid failure
parent transid verify failed on 3563231526912 wanted 5184691 found 5183325
parent transid verify failed on 3563231526912 wanted 5184691 found 5183325
parent transid verify failed on 3563231526912 wanted 5184691 found 5183325
parent transid verify failed on 3563231526912 wanted 5184691 found 5183325
Ignoring transid failure
parent transid verify failed on 3563229626368 wanted 5184691 found 5184689
parent transid verify failed on 3563229626368 wanted 5184691 found 5184689
parent transid verify failed on 3563229937664 wanted 5184691 found 5184689
parent transid verify failed on 3563229937664 wanted 5184691 found 5184689
parent transid verify failed on 3563226857472 wanted 5184691 found 5184689
parent transid verify failed on 3563226857472 wanted 5184691 found 5184689
parent transid verify failed on 3563230674944 wanted 5184691 found 5183325
parent transid verify failed on 3563230674944 wanted 5184691 found 5183325
parent transid verify failed on 3563230674944 wanted 5184691 found 5183325
parent transid verify failed on 3563230674944 wanted 5184691 found 5183325
Ignoring transid failure
On Fri, Dec 7, 2018 at 2:22 AM Qu Wenruo  wrote:




On 2018/12/7 1:24 PM, Doni Crosby wrote:

All,

I'm coming to you to see if there is a way to fix or at least recover
most of the data I have from a btrfs filesystem. The system went down
after both a breaker and the battery backup failed. I cannot currently
mount the system, with the following error from dmesg:

Note: vda1 is just the entire disk being passed from the VM host to the
VM; it's not an actual virtual block device.

[ 499.704398] BTRFS info (device vda1): disk space caching is enabled
[  499.704401] BTRFS info (device vda1): has skinny extents
[  499.739522] BTRFS error (device vda1): parent transid verify failed
on 3563231428608 wanted 5184691 found 5183327


Transid mismatch normally means the fs is screwed up more or less.

And according to your mount failure, it looks like the fs got screwed up badly.

What's the kernel version used in the VM?
I don't really think the VM is always using the latest kernel.


[  499.740257] BTRFS error (device vda1): parent transid verify failed
on 3563231428608 wanted 5184691 found 5183327

Re: System unable to mount partition after a power loss

2018-12-07 Thread Doni Crosby
I just looked at the VM; it does not have a cache mode set. That's the
default in Proxmox to improve performance.
On Fri, Dec 7, 2018 at 7:25 AM Austin S. Hemmelgarn
 wrote:
>
> On 2018-12-07 01:43, Doni Crosby wrote:
> >> This is qemu-kvm? What's the cache mode being used? It's possible the
> >> usual write guarantees are thwarted by VM caching.
> > Yes it is a proxmox host running the system so it is a qemu vm, I'm
> > unsure on the caching situation.
> On the note of QEMU and the cache mode, the only cache mode I've seen to
> actually cause issues for BTRFS volumes _inside_ a VM is 'cache=unsafe',
> but that causes problems for most filesystems, so it's probably not the
> issue here.
>
> OTOH, I've seen issues with most of the cache modes other than
> 'cache=writeback' and 'cache=writethrough' when dealing with BTRFS as
> the back-end storage on the host system, and most of the time such
> issues will manifest as both problems with the volume inside the VM
> _and_ the volume the disk images are being stored on.


[PATCH v2] Btrfs: use generic_remap_file_range_prep() for cloning and deduplication

2018-12-07 Thread fdmanana
From: Filipe Manana 

Since cloning and deduplication are no longer Btrfs specific operations, we
now have generic code to handle parameter validation, compare file ranges
used for deduplication, clear capabilities when cloning, etc. This change
makes Btrfs use it, eliminating a lot of code in Btrfs and also fixing a
few bugs, such as:

1) When cloning, the destination file's capabilities were not dropped
   (the fstest generic/513 tests this);

2) We were not checking if the destination file is immutable;

3) Not checking if either the source or destination files are swap
   files (swap file support is coming soon for Btrfs);

4) System limits were not checked (resource limits and O_LARGEFILE).

Note that the generic helper generic_remap_file_range_prep() does start
and wait for writeback by calling filemap_write_and_wait_range(). However,
that is not enough for Btrfs for two reasons:

1) With compression, we need to start writeback twice in order to get the
   pages marked for writeback and ordered extents created;

2) filemap_write_and_wait_range() (and all its other variants) only waits
   for the IO to complete, but we need to wait for the ordered extents to
   finish, so that when we do the actual reflinking operations the file
   extent items are in the fs tree. This is also important due to the fact
   that the generic helper, for the deduplication case, compares the
   contents of the pages in the requested range, which might require
   reading extents from disk in the very unlikely case that pages get
   invalidated after writeback finishes (so the file extent items must be
   up to date in the fs tree).

Since these reasons are specific to Btrfs, we have to do it in the Btrfs
code before calling generic_remap_file_range_prep(). This also results in
a simpler way of dealing with existing delalloc in the source/target
ranges, especially for the deduplication case, where we used to lock all
the pages first and then, if we found any delalloc or ordered extent in
the range, unlock the pages, trigger writeback, wait for ordered extents
to complete, then lock all the pages again and check if deduplication can
be done. So now we get a simpler approach: lock the inodes, then trigger
writeback and then wait for ordered extents to complete.

So make btrfs use generic_remap_file_range_prep() (XFS and OCFS2 use it)
to eliminate duplicated code, fix a few bugs and benefit from future bug
fixes done there - for example the recent clone and dedupe bugs involving
reflinking a partial EOF block got a counterpart fix in the generic helper,
since it affected all filesystems supporting these operations, so we no
longer need special checks in Btrfs for them.

Signed-off-by: Filipe Manana 
---

V2: Removed the check that verifies if either of the inodes is a directory,
as that is done by generic_remap_file_range_prep(). Oddly, in btrfs it was
being done only for cloning but not for dedupe.

 fs/btrfs/ioctl.c | 612 ---
 1 file changed, 129 insertions(+), 483 deletions(-)

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 802a628e9f7d..321fb9bc149d 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -3191,92 +3191,6 @@ static long btrfs_ioctl_dev_info(struct btrfs_fs_info *fs_info,
return ret;
 }
 
-static struct page *extent_same_get_page(struct inode *inode, pgoff_t index)
-{
-   struct page *page;
-
-   page = grab_cache_page(inode->i_mapping, index);
-   if (!page)
-   return ERR_PTR(-ENOMEM);
-
-   if (!PageUptodate(page)) {
-   int ret;
-
-   ret = btrfs_readpage(NULL, page);
-   if (ret)
-   return ERR_PTR(ret);
-   lock_page(page);
-   if (!PageUptodate(page)) {
-   unlock_page(page);
-   put_page(page);
-   return ERR_PTR(-EIO);
-   }
-   if (page->mapping != inode->i_mapping) {
-   unlock_page(page);
-   put_page(page);
-   return ERR_PTR(-EAGAIN);
-   }
-   }
-
-   return page;
-}
-
-static int gather_extent_pages(struct inode *inode, struct page **pages,
-  int num_pages, u64 off)
-{
-   int i;
-   pgoff_t index = off >> PAGE_SHIFT;
-
-   for (i = 0; i < num_pages; i++) {
-again:
-   pages[i] = extent_same_get_page(inode, index + i);
-   if (IS_ERR(pages[i])) {
-   int err = PTR_ERR(pages[i]);
-
-   if (err == -EAGAIN)
-   goto again;
-   pages[i] = NULL;
-   return err;
-   }
-   }
-   return 0;
-}
-
-static int lock_extent_range(struct inode *inode, u64 off, u64 len,
-bool retry_range_locking)
-{
-   /*
-* Do any pending 

Re: [PATCH 05/10] btrfs: introduce delayed_refs_rsv

2018-12-07 Thread Nikolay Borisov



On 3.12.18 at 17:20, Josef Bacik wrote:
> From: Josef Bacik 
> 
> Traditionally we've had voodoo in btrfs to account for the space that
> delayed refs may take up by having a global_block_rsv.  This works most
> of the time, except when it doesn't.  We've had issues reported and seen
> in production where sometimes the global reserve is exhausted during
> transaction commit before we can run all of our delayed refs, resulting
> in an aborted transaction.  Because of this voodoo we have equally
> dubious flushing semantics around throttling delayed refs which we often
> get wrong.
> 
> So instead give them their own block_rsv.  This way we can always know
> exactly how much outstanding space we need for delayed refs.  This
> allows us to make sure we are constantly filling that reservation up
> with space, and allows us to put more precise pressure on the enospc
> system.  Instead of doing math to see if it's a good time to throttle,
> the normal enospc code will be invoked if we have a lot of delayed refs
> pending, and they will be run via the normal flushing mechanism.
> 
> For now the delayed_refs_rsv will hold the reservations for the delayed
> refs, the block group updates, and deleting csums.  We could have a
> separate rsv for the block group updates, but the csum deletion stuff is
> still handled via the delayed_refs so that will stay there.


I see one difference in the way that the space is managed. Essentially,
for the delayed refs rsv you only ever increase ->size, and ->reserved
only when you have to refill. This is opposite to the way other metadata
space is managed, i.e. by using use_block_rsv(), which subtracts from
->reserved every time a block has to be CoW'ed. Why this difference?


> 
> Signed-off-by: Josef Bacik 
> ---
>  fs/btrfs/ctree.h   |  14 +++-
>  fs/btrfs/delayed-ref.c |  43 --
>  fs/btrfs/disk-io.c |   4 +
>  fs/btrfs/extent-tree.c | 212 +
>  fs/btrfs/transaction.c |  37 -
>  5 files changed, 284 insertions(+), 26 deletions(-)
> 



> +/**
> + * btrfs_migrate_to_delayed_refs_rsv - transfer bytes to our delayed refs rsv.
> + * @fs_info - the fs info for our fs.
> + * @src - the source block rsv to transfer from.
> + * @num_bytes - the number of bytes to transfer.
> + *
> + * This transfers up to the num_bytes amount from the src rsv to the
> + * delayed_refs_rsv.  Any extra bytes are returned to the space info.
> + */
> +void btrfs_migrate_to_delayed_refs_rsv(struct btrfs_fs_info *fs_info,
> +struct btrfs_block_rsv *src,
> +u64 num_bytes)

This function is currently used only during transaction start; it seems
to be rather specific to the delayed refs, so I'd suggest making it
private to transaction.c.

> +{
> + struct btrfs_block_rsv *delayed_refs_rsv = &fs_info->delayed_refs_rsv;
> + u64 to_free = 0;
> +
> + spin_lock(&src->lock);
> + src->reserved -= num_bytes;
> + src->size -= num_bytes;
> + spin_unlock(&src->lock);
> +
> + spin_lock(&delayed_refs_rsv->lock);
> + if (delayed_refs_rsv->size > delayed_refs_rsv->reserved) {
> + u64 delta = delayed_refs_rsv->size -
> + delayed_refs_rsv->reserved;
> + if (num_bytes > delta) {
> + to_free = num_bytes - delta;
> + num_bytes = delta;
> + }
> + } else {
> + to_free = num_bytes;
> + num_bytes = 0;
> + }
> +
> + if (num_bytes)
> + delayed_refs_rsv->reserved += num_bytes;
> + if (delayed_refs_rsv->reserved >= delayed_refs_rsv->size)
> + delayed_refs_rsv->full = 1;
> + spin_unlock(&delayed_refs_rsv->lock);
> +
> + if (num_bytes)
> + trace_btrfs_space_reservation(fs_info, "delayed_refs_rsv",
> +   0, num_bytes, 1);
> + if (to_free)
> + space_info_add_old_bytes(fs_info, delayed_refs_rsv->space_info,
> +  to_free);
> +}
> +
> +/**
> + * btrfs_delayed_refs_rsv_refill - refill based on our delayed refs usage.
> + * @fs_info - the fs_info for our fs.
> + * @flush - control how we can flush for this reservation.
> + *
> + * This will refill the delayed block_rsv up to 1 item's size worth of space and
> + * will return -ENOSPC if we can't make the reservation.
> + */
> +int btrfs_delayed_refs_rsv_refill(struct btrfs_fs_info *fs_info,
> +   enum btrfs_reserve_flush_enum flush)
> +{
> + struct btrfs_block_rsv *block_rsv = &fs_info->delayed_refs_rsv;
> + u64 limit = btrfs_calc_trans_metadata_size(fs_info, 1);
> + u64 num_bytes = 0;
> + int ret = -ENOSPC;
> +
> + spin_lock(&block_rsv->lock);
> + if (block_rsv->reserved < block_rsv->size) {
> + num_bytes = block_rsv->size - block_rsv->reserved;
> + num_bytes = min(num_bytes, limit);
> + }
> + 



[PATCH] Btrfs: use generic_remap_file_range_prep() for cloning and deduplication

2018-12-07 Thread fdmanana
From: Filipe Manana 

Since cloning and deduplication are no longer Btrfs specific operations, we
now have generic code to handle parameter validation, compare file ranges
used for deduplication, clear capabilities when cloning, etc. This change
makes Btrfs use it, eliminating a lot of code in Btrfs and also fixing a
few bugs, such as:

1) When cloning, the destination file's capabilities were not dropped
   (the fstest generic/513 tests this);

2) We were not checking if the destination file is immutable;

3) Not checking if either the source or destination files are swap
   files (swap file support is coming soon for Btrfs);

4) System limits were not checked (resource limits and O_LARGEFILE).

Note that the generic helper generic_remap_file_range_prep() does start
and wait for writeback by calling filemap_write_and_wait_range(). However,
that is not enough for Btrfs for two reasons:

1) With compression, we need to start writeback twice in order to get the
   pages marked for writeback and ordered extents created;

2) filemap_write_and_wait_range() (and all its other variants) only waits
   for the IO to complete, but we need to wait for the ordered extents to
   finish, so that when we do the actual reflinking operations the file
   extent items are in the fs tree. This is also important due to the fact
   that the generic helper, for the deduplication case, compares the
   contents of the pages in the requested range, which might require
   reading extents from disk in the very unlikely case that pages get
   invalidated after writeback finishes (so the file extent items must be
   up to date in the fs tree).

Since these reasons are specific to Btrfs, we have to do it in the Btrfs
code before calling generic_remap_file_range_prep(). This also results in
a simpler way of dealing with existing delalloc in the source/target
ranges, especially for the deduplication case, where we used to lock all
the pages first and then, if we found any delalloc or ordered extent in
the range, unlock the pages, trigger writeback, wait for ordered extents
to complete, then lock all the pages again and check if deduplication can
be done. So now we get a simpler approach: lock the inodes, then trigger
writeback and then wait for ordered extents to complete.

So make btrfs use generic_remap_file_range_prep() (XFS and OCFS2 use it)
to eliminate duplicated code, fix a few bugs and benefit from future bug
fixes done there - for example the recent clone and dedupe bugs involving
reflinking a partial EOF block got a counterpart fix in the generic helper,
since it affected all filesystems supporting these operations, so we no
longer need special checks in Btrfs for them.

Signed-off-by: Filipe Manana 
---
 fs/btrfs/ioctl.c | 615 ---
 1 file changed, 132 insertions(+), 483 deletions(-)

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 802a628e9f7d..261e116dddb2 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -3191,92 +3191,6 @@ static long btrfs_ioctl_dev_info(struct btrfs_fs_info *fs_info,
return ret;
 }
 
-static struct page *extent_same_get_page(struct inode *inode, pgoff_t index)
-{
-   struct page *page;
-
-   page = grab_cache_page(inode->i_mapping, index);
-   if (!page)
-   return ERR_PTR(-ENOMEM);
-
-   if (!PageUptodate(page)) {
-   int ret;
-
-   ret = btrfs_readpage(NULL, page);
-   if (ret)
-   return ERR_PTR(ret);
-   lock_page(page);
-   if (!PageUptodate(page)) {
-   unlock_page(page);
-   put_page(page);
-   return ERR_PTR(-EIO);
-   }
-   if (page->mapping != inode->i_mapping) {
-   unlock_page(page);
-   put_page(page);
-   return ERR_PTR(-EAGAIN);
-   }
-   }
-
-   return page;
-}
-
-static int gather_extent_pages(struct inode *inode, struct page **pages,
-  int num_pages, u64 off)
-{
-   int i;
-   pgoff_t index = off >> PAGE_SHIFT;
-
-   for (i = 0; i < num_pages; i++) {
-again:
-   pages[i] = extent_same_get_page(inode, index + i);
-   if (IS_ERR(pages[i])) {
-   int err = PTR_ERR(pages[i]);
-
-   if (err == -EAGAIN)
-   goto again;
-   pages[i] = NULL;
-   return err;
-   }
-   }
-   return 0;
-}
-
-static int lock_extent_range(struct inode *inode, u64 off, u64 len,
-bool retry_range_locking)
-{
-   /*
-* Do any pending delalloc/csum calculations on inode, one way or
-* another, and lock file content.
-* The locking order is:
-*
-*   1) pages
-*   2) range in the inode's io tree
-

[PATCH] Btrfs: scrub, move setup of nofs contexts higher in the stack

2018-12-07 Thread fdmanana
From: Filipe Manana 

Since scrub workers only do memory allocation with GFP_KERNEL when they
need to perform repair, we can move the recent setup of the nofs context
up to scrub_handle_errored_block() instead of setting it up down the call
chain at insert_full_stripe_lock() and scrub_add_page_to_wr_bio(),
removing some duplicated code and comments. So the only paths for which a
scrub worker can do memory allocations using GFP_KERNEL are the following
(a sketch of the resulting pattern follows the call chains):

 scrub_bio_end_io_worker()
   scrub_block_complete()
     scrub_handle_errored_block()
       lock_full_stripe()
         insert_full_stripe_lock()
           -> kmalloc with GFP_KERNEL

 scrub_bio_end_io_worker()
   scrub_block_complete()
     scrub_handle_errored_block()
       scrub_write_page_to_dev_replace()
         scrub_add_page_to_wr_bio()
           -> kzalloc with GFP_KERNEL
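
A minimal sketch of that resulting pattern, where do_repair_work() is a
hypothetical stand-in for the body of scrub_handle_errored_block():

#include <linux/sched/mm.h>

static int repair_in_nofs_context(void)
{
	unsigned int nofs_flag;
	int ret;

	/* Any GFP_KERNEL allocation below implicitly behaves as GFP_NOFS. */
	nofs_flag = memalloc_nofs_save();
	ret = do_repair_work();
	memalloc_nofs_restore(nofs_flag);
	return ret;
}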

Signed-off-by: Filipe Manana 
---

Applies on top of:

  Btrfs: fix deadlock with memory reclaim during scrub

 fs/btrfs/scrub.c | 34 ++
 1 file changed, 14 insertions(+), 20 deletions(-)

diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index bbd1b36f4918..f996f4064596 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -322,7 +322,6 @@ static struct full_stripe_lock *insert_full_stripe_lock(
struct rb_node *parent = NULL;
struct full_stripe_lock *entry;
struct full_stripe_lock *ret;
-   unsigned int nofs_flag;
 
lockdep_assert_held(&locks_root->lock);
 
@@ -342,15 +341,8 @@ static struct full_stripe_lock *insert_full_stripe_lock(
 
/*
 * Insert new lock.
-*
-* We must use GFP_NOFS because the scrub task might be waiting for a
-* worker task executing this function and in turn a transaction commit
-* might be waiting the scrub task to pause (which needs to wait for all
-* the worker tasks to complete before pausing).
 */
-   nofs_flag = memalloc_nofs_save();
ret = kmalloc(sizeof(*ret), GFP_KERNEL);
-   memalloc_nofs_restore(nofs_flag);
if (!ret)
return ERR_PTR(-ENOMEM);
ret->logical = fstripe_logical;
@@ -842,6 +834,7 @@ static int scrub_handle_errored_block(struct scrub_block *sblock_to_check)
int page_num;
int success;
bool full_stripe_locked;
+   unsigned int nofs_flag;
static DEFINE_RATELIMIT_STATE(_rs, DEFAULT_RATELIMIT_INTERVAL,
  DEFAULT_RATELIMIT_BURST);
 
@@ -867,6 +860,16 @@ static int scrub_handle_errored_block(struct scrub_block *sblock_to_check)
dev = sblock_to_check->pagev[0]->dev;
 
/*
+* We must use GFP_NOFS because the scrub task might be waiting for a
+* worker task executing this function and in turn a transaction commit
+* might be waiting the scrub task to pause (which needs to wait for all
+* the worker tasks to complete before pausing).
+* We do allocations in the workers through insert_full_stripe_lock()
+* and scrub_add_page_to_wr_bio(), which happens down the call chain of
+* this function.
+*/
+   nofs_flag = memalloc_nofs_save();
+   /*
 * For RAID5/6, race can happen for a different device scrub thread.
 * For data corruption, Parity and Data threads will both try
 * to recovery the data.
@@ -875,6 +878,7 @@ static int scrub_handle_errored_block(struct scrub_block *sblock_to_check)
 */
ret = lock_full_stripe(fs_info, logical, &full_stripe_locked);
if (ret < 0) {
+   memalloc_nofs_restore(nofs_flag);
spin_lock(&sctx->stat_lock);
if (ret == -ENOMEM)
sctx->stat.malloc_errors++;
@@ -914,7 +918,7 @@ static int scrub_handle_errored_block(struct scrub_block *sblock_to_check)
 */
 
sblocks_for_recheck = kcalloc(BTRFS_MAX_MIRRORS,
- sizeof(*sblocks_for_recheck), GFP_NOFS);
+ sizeof(*sblocks_for_recheck), GFP_KERNEL);
if (!sblocks_for_recheck) {
spin_lock(&sctx->stat_lock);
sctx->stat.malloc_errors++;
@@ -1212,6 +1216,7 @@ static int scrub_handle_errored_block(struct scrub_block *sblock_to_check)
}
 
ret = unlock_full_stripe(fs_info, logical, full_stripe_locked);
+   memalloc_nofs_restore(nofs_flag);
if (ret < 0)
return ret;
return 0;
@@ -1630,19 +1635,8 @@ static int scrub_add_page_to_wr_bio(struct scrub_ctx *sctx,
mutex_lock(&sctx->wr_lock);
 again:
if (!sctx->wr_curr_bio) {
-   unsigned int nofs_flag;
-
-   /*
-* We must use GFP_NOFS because the scrub task might be waiting
-* for a worker task executing this function and in turn a
-* transaction commit might be waiting the scrub task to pause
-* (which needs to wait for all the worker tasks 

Re: [PATCH 04/10] btrfs: only track ref_heads in delayed_ref_updates

2018-12-07 Thread Nikolay Borisov



On 3.12.18 at 17:20, Josef Bacik wrote:
> From: Josef Bacik 
> 
> We use this number to figure out how many delayed refs to run, but
> __btrfs_run_delayed_refs really only checks every time we need a new
> delayed ref head, so we always run at least one ref head completely no
> matter how many items are on it.  Fix the accounting to only be
> adjusted when we add/remove a ref head.

David,

I think it also warrants a forward-looking sentence stating that the number
is also going to be used to calculate the required number of bytes in
the delayed refs rsv. Something along the lines of:

In addition to using this number to limit the number of delayed refs
run, a future patch is also going to use it to calculate the amount of
space required for delayed refs space reservation.

> 
> Reviewed-by: Nikolay Borisov 
> Signed-off-by: Josef Bacik 
> ---
>  fs/btrfs/delayed-ref.c | 3 ---
>  1 file changed, 3 deletions(-)
> 
> diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c
> index b3e4c9fcb664..48725fa757a3 100644
> --- a/fs/btrfs/delayed-ref.c
> +++ b/fs/btrfs/delayed-ref.c
> @@ -251,8 +251,6 @@ static inline void drop_delayed_ref(struct btrfs_trans_handle *trans,
>   ref->in_tree = 0;
>   btrfs_put_delayed_ref(ref);
>   atomic_dec(&delayed_refs->num_entries);
> - if (trans->delayed_ref_updates)
> - trans->delayed_ref_updates--;
>  }
>  
>  static bool merge_ref(struct btrfs_trans_handle *trans,
> @@ -467,7 +465,6 @@ static int insert_delayed_ref(struct btrfs_trans_handle *trans,
>   if (ref->action == BTRFS_ADD_DELAYED_REF)
>   list_add_tail(&ref->add_list, &href->ref_add_list);
>   atomic_inc(&root->num_entries);
> - trans->delayed_ref_updates++;
>   spin_unlock(&href->lock);
>   return ret;
>  }
> 


Re: System unable to mount partition after a power loss

2018-12-07 Thread Austin S. Hemmelgarn

On 2018-12-07 01:43, Doni Crosby wrote:
>> This is qemu-kvm? What's the cache mode being used? It's possible the
>> usual write guarantees are thwarted by VM caching.
> Yes it is a proxmox host running the system so it is a qemu vm, I'm
> unsure on the caching situation.
On the note of QEMU and the cache mode, the only cache mode I've seen to 
actually cause issues for BTRFS volumes _inside_ a VM is 'cache=unsafe', 
but that causes problems for most filesystems, so it's probably not the 
issue here.


OTOH, I've seen issues with most of the cache modes other than 
'cache=writeback' and 'cache=writethrough' when dealing with BTRFS as 
the back-end storage on the host system, and most of the time such 
issues will manifest as both problems with the volume inside the VM 
_and_ the volume the disk images are being stored on.


Re: What if TRIM issued a wipe on devices that don't TRIM?

2018-12-07 Thread Austin S. Hemmelgarn

On 2018-12-06 23:09, Andrei Borzenkov wrote:
> On 06.12.2018 16:04, Austin S. Hemmelgarn wrote:


>> * On SCSI devices, a discard operation translates to a SCSI UNMAP
>> command.  As pointed out by Ronnie Sahlberg in his reply, this command
>> is purely advisory, may not result in any actual state change on the
>> target device, and is not guaranteed to wipe the data.  To actually wipe
>> things, you have to explicitly write bogus data to the given regions
>> (using either regular writes, or a WRITESAME command with the desired
>> pattern), and _then_ call UNMAP on them.


> The WRITE SAME command has an UNMAP bit and, depending on the device and
> kernel version, the kernel may actually issue either UNMAP or WRITE SAME
> with the UNMAP bit set when doing a discard.

Good to know.  I've not looked at the SCSI code much, and actually 
didn't know about the UNMAP bit for the WRITE SAME command, so I just 
assumed that the kernel only used the UNMAP command.
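
For illustration, the wipe-then-discard sequence can be driven from
userspace with the Linux block ioctls. A minimal sketch (BLKZEROOUT and
BLKDISCARD are the real ioctls; whether BLKZEROOUT is serviced via WRITE
SAME depends on the device, and error handling here is deliberately
minimal):

#include <fcntl.h>
#include <linux/fs.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	/* {offset, length}: wipe and then discard the first 1 MiB. */
	uint64_t range[2] = { 0, 1024 * 1024 };
	int fd;

	if (argc != 2) {
		fprintf(stderr, "usage: %s <block-device>\n", argv[0]);
		return 1;
	}
	fd = open(argv[1], O_WRONLY);
	if (fd < 0) {
		perror("open");
		return 1;
	}
	/* Explicitly write zeroes first... */
	if (ioctl(fd, BLKZEROOUT, range))
		perror("BLKZEROOUT");
	/* ...then issue the advisory discard (UNMAP on SCSI). */
	if (ioctl(fd, BLKDISCARD, range))
		perror("BLKDISCARD");
	close(fd);
	return 0;
}

Note that the range passed to BLKZEROOUT must be sector-aligned.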


Re: [PATCH 08/10] btrfs: rework btrfs_check_space_for_delayed_refs

2018-12-07 Thread Nikolay Borisov



On 7.12.18 at 9:09, Nikolay Borisov wrote:
> 
> 
>> On 6.12.18 at 19:54, David Sterba wrote:
>> On Thu, Dec 06, 2018 at 06:52:21PM +0200, Nikolay Borisov wrote:
>>>
>>>
>>> On 3.12.18 at 17:20, Josef Bacik wrote:
>>>> Now with the delayed_refs_rsv we can now know exactly how much pending
>>>> delayed refs space we need.  This means we can drastically simplify
>>>
>>> IMO it will be helpful if there is a sentence here referring back to
>>> btrfs_update_delayed_refs_rsv to put your first sentence into context.
>>> But I guess this is something David can also do.
>>
>> I'll update the changelog, but I'm not sure what exactly you want to see
>> there, please post the replacement text. Thanks.
> 
> With the introduction of the delayed_refs_rsv infrastructure, namely
> btrfs_update_delayed_refs_rsv, we now know exactly how much pending
> delayed refs space is required.

To put things into context as to why I deem this change beneficial:
migrating a reservation from the transaction to the delayed refs rsv
modifies both ->size and ->reserved, so they will be equal. Calling
btrfs_update_delayed_refs_rsv, on the other hand, increases ->size and
doesn't really decrement ->reserved. Also, we never do
btrfs_block_rsv_migrate/use_block_rsv on the delayed refs block rsv, so
managing the ->reserved value for the delayed refs rsv is different than
for the rest of the block rsvs.


> 
>>
>>>> btrfs_check_space_for_delayed_refs by simply checking how much space we
>>>> have reserved for the global rsv (which acts as a spill over buffer) and
>>>> the delayed refs rsv.  If our total size is beyond that amount then we
>>>> know it's time to commit the transaction and stop any more delayed refs
>>>> from being generated.
>>>>
>>>> Signed-off-by: Josef Bacik 
>>
> 

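
For reference, the check described in the quoted changelog boils down to
something like this sketch (hedged: the field and helper names follow the
patchset, but the actual helper may differ in details):

static bool btrfs_check_space_for_delayed_refs(struct btrfs_fs_info *fs_info)
{
	struct btrfs_block_rsv *global_rsv = &fs_info->global_block_rsv;
	struct btrfs_block_rsv *delayed_refs_rsv = &fs_info->delayed_refs_rsv;
	bool ret = false;
	u64 reserved;

	/* Sum what is actually reserved in the spill-over buffer... */
	spin_lock(&global_rsv->lock);
	reserved = global_rsv->reserved;
	spin_unlock(&global_rsv->lock);

	/* ...plus the delayed refs rsv, and compare against what is needed. */
	spin_lock(&delayed_refs_rsv->lock);
	reserved += delayed_refs_rsv->reserved;
	if (delayed_refs_rsv->size >= reserved)
		ret = true;
	spin_unlock(&delayed_refs_rsv->lock);

	return ret;
}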

Re: HELP unmountable partition after btrfs balance to RAID0

2018-12-07 Thread Duncan
Thomas Mohr posted on Thu, 06 Dec 2018 12:31:15 +0100 as excerpted:

> We wanted to convert a file system to a RAID0 with two partitions.
> Unfortunately we had to reboot the server during the balance operation
> before it could complete.
> 
> Now following happens:
> 
> A mount attempt of the array fails with the following error code:
> 
> btrfs recover yields roughly 1.6 out of 4 TB.

[Just another btrfs user and list regular, not a dev.  A dev may reply to 
your specific case, but meanwhile, for next time...]

That shouldn't be a problem.  Because with raid0 a failure of any of the 
components will take down the entire raid, making it less reliable than a 
single device, raid0 (in general, not just btrfs) is considered only 
useful for data of low enough value that its loss is no big deal, either 
because it's truly of little value (internet cache being a good example), 
or because backups are kept available and updated for whenever the raid0 
array fails.  Because with raid0, it's always a question of when it'll 
fail, not if.

So loss of a filesystem being converted to raid0 isn't a problem, because 
the data on it, by virtue of being in the process of conversion to raid0, 
is defined as of throw-away value in any case.  If it's of higher value 
than that, it's not going to be raid0 (or in the process of conversion to 
it) in the first place.

Of course that's simply an extension of the more general first sysadmin's 
rule of backups, that the true value of data is defined not by arbitrary 
claims, but by the number of backups of that data it's worth having.  
Because "things happen", whether it's fat-fingering, bad hardware, buggy 
software, or simply someone tripping over the power cable or running into 
the power pole outside at the wrong time.

So no backup is simply defining the data as worth less than the time/
trouble/resources necessary to make that backup.

Note that you ALWAYS save what was of most value to you, either the time/
trouble/resources to do the backup, if your actions defined that to be of 
more value than the data, or the data, if you had that backup, thereby 
defining the value of the data to be worth backing up.

Similarly, failure of the only backup isn't a problem because by virtue 
of there being only that one backup, the data is defined as not worth 
having more than one, and likewise, having an outdated backup isn't a 
problem, because that's simply the special case of defining the data in 
the delta between the backup time and the present as not (yet) worth the 
time/hassle/resources to make/refresh that backup.

(And FWIW, the second sysadmin's rule of backups is that it's not a 
backup until you've successfully tested it recoverable in the same sort 
of conditions you're likely to need to recover it in.  Because so many 
people have /thought/ they had backups, that turned out not to be, 
because they never tested that they could actually recover the data from 
them.  For instance, if the backup tools you'll need to recover the 
backup are on the backup itself, how do you get to them?  Can you create 
a filesystem for the new copy of the data and recover it from the backup 
with just the tools and documentation available from your emergency boot 
media?  Untested backup == no backup, or at best, backup still in 
process!)

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman