Re: [PATCH v2 0/6] btrfs: qgroup: Delay subtree scan to reduce overhead

2018-12-08 Thread David Sterba
On Sat, Dec 08, 2018 at 08:50:32AM +0800, Qu Wenruo wrote:
> > I've adapted a stress test that unpacks a large tarball, snapshots
> > every 20 seconds, deletes a random snapshot every 50 seconds, deletes
> > files from the original subvolume, now enhanced with qgroups just for the
> > new snapshots inheriting the toplevel subvolume. Lockup.
> > 
> > It gets stuck in a snapshot call with the following stacktrace:
> > 
> > [<0>] btrfs_tree_read_lock+0xf3/0x150 [btrfs]
> > [<0>] btrfs_qgroup_trace_subtree+0x280/0x7b0 [btrfs]
> 
> This looks like something is wrong in the original subtree tracing.

Yes, I ran the test on current master and it locked up too, so it's not
due to your patchset.

> Thanks for the report, I'll investigate it.

Thanks.


Re: [PATCH v2 0/6] btrfs: qgroup: Delay subtree scan to reduce overhead

2018-12-07 Thread Qu Wenruo


On 2018/12/8 8:47 AM, David Sterba wrote:
> On Fri, Dec 07, 2018 at 06:51:21AM +0800, Qu Wenruo wrote:
>>
>>
>> On 2018/12/7 3:35 AM, David Sterba wrote:
>>> On Mon, Nov 12, 2018 at 10:33:33PM +0100, David Sterba wrote:
>>>> On Thu, Nov 08, 2018 at 01:49:12PM +0800, Qu Wenruo wrote:
>>>>> This patchset can be fetched from github:
>>>>> https://github.com/adam900710/linux/tree/qgroup_delayed_subtree_rebased
>>>>>
>>>>> Which is based on v4.20-rc1.
>>>>
>>>> Thanks, I'll add it to for-next soon.
>>>
>>> The branch was there for some time but not for at least a week (my
>>> mistake, I did not notice in time). I've rebased it on top of recent
>>> misc-next, but without the delayed refs patchset from Josef.
>>>
>>> At the moment I'm considering it for merge to 4.21; there's still some
>>> time to pull it out in case it turns out to be too problematic. I'm
>>> mostly worried about the unknown interactions with the enospc updates or
>>
>> For that part, I don't think it would cause any obvious problems for the
>> enospc updates.
>>
>> The user-noticeable effect is the delayed deletion of reloc trees.
>>
>> Apart from that, it's mostly transparent to extent allocation.
>>
>>> generally because of lack of qgroup and reloc code reviews.
>>
>> That's the biggest problem.
>>
>> However, most of the current qgroup + balance optimization is done inside
>> qgroup code (to skip certain qgroup records), so if we're going to hit a
>> problem, this patchset has the highest chance of hitting it.
>>
>> Later patches will mostly just keep tweaking qgroup code without affecting
>> any other parts.
>>
>> So I'm fine if you decide to pull it out for now.
> 
> I've adapted a stress test that unpacks a large tarball, snapshots
> every 20 seconds, deletes a random snapshot every 50 seconds, deletes
> files from the original subvolume, now enhanced with qgroups just for the
> new snapshots inheriting the toplevel subvolume. Lockup.
> 
> It gets stuck in a snapshot call with the following stacktrace:
> 
> [<0>] btrfs_tree_read_lock+0xf3/0x150 [btrfs]
> [<0>] btrfs_qgroup_trace_subtree+0x280/0x7b0 [btrfs]

This looks like something is wrong in the original subtree tracing.

Thanks for the report, I'll investigate it.
Qu

> [<0>] do_walk_down+0x681/0xb20 [btrfs]
> [<0>] walk_down_tree+0xf5/0x1c0 [btrfs]
> [<0>] btrfs_drop_snapshot+0x43b/0xb60 [btrfs]
> [<0>] btrfs_clean_one_deleted_snapshot+0xc1/0x120 [btrfs]
> [<0>] cleaner_kthread+0xf8/0x170 [btrfs]
> [<0>] kthread+0x121/0x140
> [<0>] ret_from_fork+0x27/0x50
> 
> and that's around the 10th snapshot and ~3rd deletion. This is the qgroup show output:
> 
> qgroupid       rfer       excl parent
> --------       ----       ---- ------
> 0/5       865.27MiB    1.66MiB ---
> 0/257         0.00B      0.00B ---
> 0/259         0.00B      0.00B ---
> 0/260     806.58MiB  637.25MiB ---
> 0/262         0.00B      0.00B ---
> 0/263         0.00B      0.00B ---
> 0/264         0.00B      0.00B ---
> 0/265         0.00B      0.00B ---
> 0/266         0.00B      0.00B ---
> 0/267         0.00B      0.00B ---
> 0/268         0.00B      0.00B ---
> 0/269         0.00B      0.00B ---
> 0/270     989.04MiB    1.22MiB ---
> 0/271         0.00B      0.00B ---
> 0/272     922.25MiB  416.00KiB ---
> 0/273     931.02MiB    1.50MiB ---
> 0/274     910.94MiB    1.52MiB ---
> 1/1         1.64GiB    1.64GiB 0/5,0/257,0/259,0/260,0/262,0/263,0/264,0/265,0/266,0/267,0/268,0/269,0/270,0/271,0/272,0/273,0/274
> 
> No IO or CPU activity at this point; the stacktrace and show output
> remain the same.
> 
> So, considering this, I'm not going to add the patchset to 4.21 but will
> keep it in for-next for testing, any fixups or updates will be applied.
> 





Re: [PATCH v2 0/6] btrfs: qgroup: Delay subtree scan to reduce overhead

2018-12-07 Thread David Sterba
On Fri, Dec 07, 2018 at 06:51:21AM +0800, Qu Wenruo wrote:
> 
> 
> On 2018/12/7 3:35 AM, David Sterba wrote:
> > On Mon, Nov 12, 2018 at 10:33:33PM +0100, David Sterba wrote:
> >> On Thu, Nov 08, 2018 at 01:49:12PM +0800, Qu Wenruo wrote:
> >>> This patchset can be fetched from github:
> >>> https://github.com/adam900710/linux/tree/qgroup_delayed_subtree_rebased
> >>>
> >>> Which is based on v4.20-rc1.
> >>
> >> Thanks, I'll add it to for-next soon.
> > 
> > The branch was there for some time but not for at least a week (my
> > mistake, I did not notice in time). I've rebased it on top of recent
> > misc-next, but without the delayed refs patchset from Josef.
> > 
> > At the moment I'm considering it for merge to 4.21; there's still some
> > time to pull it out in case it turns out to be too problematic. I'm
> > mostly worried about the unknown interactions with the enospc updates or
> 
> For that part, I don't think it would cause any obvious problems for the
> enospc updates.
> 
> The user-noticeable effect is the delayed deletion of reloc trees.
> 
> Apart from that, it's mostly transparent to extent allocation.
> 
> > generally because of lack of qgroup and reloc code reviews.
> 
> That's the biggest problem.
> 
> However, most of the current qgroup + balance optimization is done inside
> qgroup code (to skip certain qgroup records), so if we're going to hit a
> problem, this patchset has the highest chance of hitting it.
> 
> Later patches will mostly just keep tweaking qgroup code without affecting
> any other parts.
> 
> So I'm fine if you decide to pull it out for now.

I've adapted a stress test that unpacks a large tarball, snapshots
every 20 seconds, deletes a random snapshot every 50 seconds, deletes
files from the original subvolume, now enhanced with qgroups just for the
new snapshots inheriting the toplevel subvolume. Lockup.

It gets stuck in a snapshot call with the following stacktrace:

[<0>] btrfs_tree_read_lock+0xf3/0x150 [btrfs]
[<0>] btrfs_qgroup_trace_subtree+0x280/0x7b0 [btrfs]
[<0>] do_walk_down+0x681/0xb20 [btrfs]
[<0>] walk_down_tree+0xf5/0x1c0 [btrfs]
[<0>] btrfs_drop_snapshot+0x43b/0xb60 [btrfs]
[<0>] btrfs_clean_one_deleted_snapshot+0xc1/0x120 [btrfs]
[<0>] cleaner_kthread+0xf8/0x170 [btrfs]
[<0>] kthread+0x121/0x140
[<0>] ret_from_fork+0x27/0x50

and that's around the 10th snapshot and ~3rd deletion. This is the qgroup show output:

qgroupid       rfer       excl parent
--------       ----       ---- ------
0/5       865.27MiB    1.66MiB ---
0/257         0.00B      0.00B ---
0/259         0.00B      0.00B ---
0/260     806.58MiB  637.25MiB ---
0/262         0.00B      0.00B ---
0/263         0.00B      0.00B ---
0/264         0.00B      0.00B ---
0/265         0.00B      0.00B ---
0/266         0.00B      0.00B ---
0/267         0.00B      0.00B ---
0/268         0.00B      0.00B ---
0/269         0.00B      0.00B ---
0/270     989.04MiB    1.22MiB ---
0/271         0.00B      0.00B ---
0/272     922.25MiB  416.00KiB ---
0/273     931.02MiB    1.50MiB ---
0/274     910.94MiB    1.52MiB ---
1/1         1.64GiB    1.64GiB 0/5,0/257,0/259,0/260,0/262,0/263,0/264,0/265,0/266,0/267,0/268,0/269,0/270,0/271,0/272,0/273,0/274

No IO or CPU activity at this point; the stacktrace and show output
remain the same.

So, considering this, I'm not going to add the patchset to 4.21 but will
keep it in for-next for testing, any fixups or updates will be applied.


Re: [PATCH 2/8] btrfs: extent-tree: Open-code process_func in __btrfs_mod_ref

2018-12-07 Thread Nikolay Borisov



On 6.12.18 at 8:58, Qu Wenruo wrote:
> The process_func is a function hook that is never used anywhere else.
> 
> Open code it to make later delayed ref refactor easier, so we can
> refactor btrfs_inc_extent_ref() and btrfs_free_extent() in different
> patches.
> 
> Signed-off-by: Qu Wenruo 

Reviewed-by: Nikolay Borisov 

> ---
>  fs/btrfs/extent-tree.c | 33 ++---
>  1 file changed, 18 insertions(+), 15 deletions(-)
> 
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index ea2c3d5220f0..ea68d288d761 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -3220,10 +3220,6 @@ static int __btrfs_mod_ref(struct btrfs_trans_handle 
> *trans,
>   int i;
>   int level;
>   int ret = 0;
> - int (*process_func)(struct btrfs_trans_handle *,
> - struct btrfs_root *,
> - u64, u64, u64, u64, u64, u64, bool);
> -
>  
>   if (btrfs_is_testing(fs_info))
>   return 0;
> @@ -3235,11 +3231,6 @@ static int __btrfs_mod_ref(struct btrfs_trans_handle 
> *trans,
>   if (!test_bit(BTRFS_ROOT_REF_COWS, &root->state) && level == 0)
>   return 0;
>  
> - if (inc)
> - process_func = btrfs_inc_extent_ref;
> - else
> - process_func = btrfs_free_extent;
> -
>   if (full_backref)
>   parent = buf->start;
>   else
> @@ -3261,17 +3252,29 @@ static int __btrfs_mod_ref(struct btrfs_trans_handle 
> *trans,
>  
>   num_bytes = btrfs_file_extent_disk_num_bytes(buf, fi);
>   key.offset -= btrfs_file_extent_offset(buf, fi);
> - ret = process_func(trans, root, bytenr, num_bytes,
> -parent, ref_root, key.objectid,
> -key.offset, for_reloc);
> + if (inc)
> + ret = btrfs_inc_extent_ref(trans, root, bytenr,
> + num_bytes, parent, ref_root,
> + key.objectid, key.offset,
> + for_reloc);
> + else
> + ret = btrfs_free_extent(trans, root, bytenr,
> + num_bytes, parent, ref_root,
> + key.objectid, key.offset,
> + for_reloc);
>   if (ret)
>   goto fail;
>   } else {
>   bytenr = btrfs_node_blockptr(buf, i);
>   num_bytes = fs_info->nodesize;
> - ret = process_func(trans, root, bytenr, num_bytes,
> -parent, ref_root, level - 1, 0,
> -for_reloc);
> + if (inc)
> + ret = btrfs_inc_extent_ref(trans, root, bytenr,
> + num_bytes, parent, ref_root,
> + level - 1, 0, for_reloc);
> + else
> + ret = btrfs_free_extent(trans, root, bytenr,
> + num_bytes, parent, ref_root,
> + level - 1, 0, for_reloc);
>   if (ret)
>   goto fail;
>   }
> 


[PATCH v2] Btrfs: use generic_remap_file_range_prep() for cloning and deduplication

2018-12-07 Thread fdmanana
From: Filipe Manana 

Since cloning and deduplication are no longer Btrfs specific operations, we
now have generic code to handle parameter validation, compare file ranges
used for deduplication, clear capabilities when cloning, etc. This change
makes Btrfs use it, eliminating a lot of code in Btrfs and also fixing a
few bugs, such as:

1) When cloning, the destination file's capabilities were not dropped
   (the fstest generic/513 tests this);

2) We were not checking if the destination file is immutable;

3) Not checking if either the source or destination files are swap
   files (swap file support is coming soon for Btrfs);

4) System limits were not checked (resource limits and O_LARGEFILE).

Note that the generic helper generic_remap_file_range_prep() does start
and waits for writeback by calling filemap_write_and_wait_range(), however
that is not enough for Btrfs for two reasons:

1) With compression, we need to start writeback twice in order to get the
   pages marked for writeback and ordered extents created;

2) filemap_write_and_wait_range() (and all its other variants) only waits
   for the IO to complete, but we need to wait for the ordered extents to
   finish, so that when we do the actual reflinking operations the file
   extent items are in the fs tree. This is also important due to the fact
   that the generic helper, for the deduplication case, compares the
   contents of the pages in the requested range, which might require
   reading extents from disk in the very unlikely case that pages get
   invalidated after writeback finishes (so the file extent items must be
   up to date in the fs tree).

Since these reasons are specific to Btrfs we have to do it in the Btrfs
code before calling generic_remap_file_range_prep(). This also results in
a simpler way of dealing with existing delalloc in the source/target
ranges, especially for the deduplication case, where we used to lock all the
pages first and then, if we found any delalloc for the range or an ordered
extent, we would unlock the pages, trigger writeback and wait for ordered
extents to complete, then lock all the pages again and check if
deduplication can be done. So now we get a simpler approach: lock the
inodes, then trigger writeback and then wait for ordered extents to
complete.
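
For illustration, the flow just described could look roughly like the sketch
below. This is only a sketch assuming 4.20-era signatures, not the actual
patch: the wrapper name is made up, the callees (lock_two_nondirectories(),
filemap_fdatawrite_range(), btrfs_wait_ordered_range() and
generic_remap_file_range_prep()) are existing kernel APIs, and error handling
is simplified.

/* Sketch only, not the actual patch: order of operations for the
 * Btrfs-specific preparation before the generic remap prep. */
static int btrfs_remap_prep_sketch(struct file *file_in, loff_t pos_in,
                                   struct file *file_out, loff_t pos_out,
                                   loff_t *len, unsigned int remap_flags)
{
        struct inode *inode_in = file_inode(file_in);
        struct inode *inode_out = file_inode(file_out);
        int ret;

        /* 1) Lock the inodes (a fixed order avoids ABBA deadlocks). */
        lock_two_nondirectories(inode_in, inode_out);

        /*
         * 2) Start writeback twice: with compression the first pass only
         * kicks off async compression, the second pass gets the pages
         * marked for writeback and the ordered extents created.  The same
         * would be done for the destination range.
         */
        ret = filemap_fdatawrite_range(inode_in->i_mapping, pos_in,
                                       pos_in + *len - 1);
        if (!ret)
                ret = filemap_fdatawrite_range(inode_in->i_mapping, pos_in,
                                               pos_in + *len - 1);
        if (ret)
                goto out_unlock;

        /*
         * 3) Wait for the ordered extents, not just the page IO, so the
         * file extent items are in the fs tree before reflinking.
         */
        ret = btrfs_wait_ordered_range(inode_in, pos_in, *len);
        if (ret)
                goto out_unlock;

        /* 4) Only now do the generic validation and preparation. */
        ret = generic_remap_file_range_prep(file_in, pos_in, file_out,
                                            pos_out, len, remap_flags);
        if (ret < 0)
                goto out_unlock;
        return 0;       /* inodes stay locked for the actual remap */

out_unlock:
        unlock_two_nondirectories(inode_in, inode_out);
        return ret;
}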

So make btrfs use generic_remap_file_range_prep() (XFS and OCFS2 use it)
to eliminate duplicated code, fix a few bugs and benefit from future bug
fixes done there - for example the recent clone and dedupe bugs involving
reflinking a partial EOF block got a counterpart fix in the generic helper,
since it affected all filesystems supporting these operations, so we no
longer need special checks in Btrfs for them.

Signed-off-by: Filipe Manana 
---

V2: Removed the check that verifies if either of the inodes is a directory,
as it is done by generic_remap_file_range_prep(). Oddly, in btrfs it was
being done only for cloning but not for dedupe.

 fs/btrfs/ioctl.c | 612 ---
 1 file changed, 129 insertions(+), 483 deletions(-)

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 802a628e9f7d..321fb9bc149d 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -3191,92 +3191,6 @@ static long btrfs_ioctl_dev_info(struct btrfs_fs_info 
*fs_info,
return ret;
 }
 
-static struct page *extent_same_get_page(struct inode *inode, pgoff_t index)
-{
-   struct page *page;
-
-   page = grab_cache_page(inode->i_mapping, index);
-   if (!page)
-   return ERR_PTR(-ENOMEM);
-
-   if (!PageUptodate(page)) {
-   int ret;
-
-   ret = btrfs_readpage(NULL, page);
-   if (ret)
-   return ERR_PTR(ret);
-   lock_page(page);
-   if (!PageUptodate(page)) {
-   unlock_page(page);
-   put_page(page);
-   return ERR_PTR(-EIO);
-   }
-   if (page->mapping != inode->i_mapping) {
-   unlock_page(page);
-   put_page(page);
-   return ERR_PTR(-EAGAIN);
-   }
-   }
-
-   return page;
-}
-
-static int gather_extent_pages(struct inode *inode, struct page **pages,
-  int num_pages, u64 off)
-{
-   int i;
-   pgoff_t index = off >> PAGE_SHIFT;
-
-   for (i = 0; i < num_pages; i++) {
-again:
-   pages[i] = extent_same_get_page(inode, index + i);
-   if (IS_ERR(pages[i])) {
-   int err = PTR_ERR(pages[i]);
-
-   if (err == -EAGAIN)
-   goto again;
-   pages[i] = NULL;
-   return err;
-   }
-   }
-   return 0;
-}
-
-static int lock_extent_range(struct inode *inode, u64 off, u64 len,
-bool retry_range_locking)
-{
-   /*
-* Do an

Re: [PATCH 05/10] btrfs: introduce delayed_refs_rsv

2018-12-07 Thread Nikolay Borisov



On 3.12.18 at 17:20, Josef Bacik wrote:
> From: Josef Bacik 
> 
> Traditionally we've had voodoo in btrfs to account for the space that
> delayed refs may take up by having a global_block_rsv.  This works most
> of the time, except when it doesn't.  We've had issues reported and seen
> in production where sometimes the global reserve is exhausted during
> transaction commit before we can run all of our delayed refs, resulting
> in an aborted transaction.  Because of this voodoo we have equally
> dubious flushing semantics around throttling delayed refs which we often
> get wrong.
> 
> So instead give them their own block_rsv.  This way we can always know
> exactly how much outstanding space we need for delayed refs.  This
> allows us to make sure we are constantly filling that reservation up
> with space, and allows us to put more precise pressure on the enospc
> system.  Instead of doing math to see if its a good time to throttle,
> the normal enospc code will be invoked if we have a lot of delayed refs
> pending, and they will be run via the normal flushing mechanism.
> 
> For now the delayed_refs_rsv will hold the reservations for the delayed
> refs, the block group updates, and deleting csums.  We could have a
> separate rsv for the block group updates, but the csum deletion stuff is
> still handled via the delayed_refs so that will stay there.


I see one difference in the way that the space is managed. Essentially,
for the delayed refs rsv you only ever increase the size, and
->reserved only when you have to refill. This is opposite to the way
other metadata space is managed, i.e. by using use_block_rsv, which
subtracts ->reserved every time a block has to be CoW'ed. Why this
difference?
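
(For readers following the thread, here is a minimal model of the two
reservation styles being compared. It is purely illustrative, not btrfs code:
the field names follow the quoted patch, the helper names are hypothetical.)

#include <linux/types.h>

struct rsv_model {
        u64 size;       /* how much we want to keep reserved */
        u64 reserved;   /* how much we actually hold */
};

/* Ordinary metadata rsv: each CoW consumes from ->reserved
 * (the use_block_rsv() style). */
static void model_use_block(struct rsv_model *rsv, u64 blocksize)
{
        rsv->reserved -= blocksize;
}

/* Delayed refs rsv: queueing a ref head only grows ->size
 * (the btrfs_update_delayed_refs_rsv() style) ... */
static void model_track_ref_head(struct rsv_model *rsv, u64 bytes)
{
        rsv->size += bytes;
}

/* ... and ->reserved catches up only on an explicit refill
 * (the btrfs_delayed_refs_rsv_refill() style). */
static void model_refill(struct rsv_model *rsv, u64 bytes)
{
        if (rsv->reserved >= rsv->size)
                return; /* already full */
        if (rsv->reserved + bytes > rsv->size)
                bytes = rsv->size - rsv->reserved;
        rsv->reserved += bytes;
}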


> 
> Signed-off-by: Josef Bacik 
> ---
>  fs/btrfs/ctree.h   |  14 +++-
>  fs/btrfs/delayed-ref.c |  43 --
>  fs/btrfs/disk-io.c |   4 +
>  fs/btrfs/extent-tree.c | 212 
> +
>  fs/btrfs/transaction.c |  37 -
>  5 files changed, 284 insertions(+), 26 deletions(-)
> 



> +/**
> + * btrfs_migrate_to_delayed_refs_rsv - transfer bytes to our delayed refs 
> rsv.
> + * @fs_info - the fs info for our fs.
> + * @src - the source block rsv to transfer from.
> + * @num_bytes - the number of bytes to transfer.
> + *
> + * This transfers up to the num_bytes amount from the src rsv to the
> + * delayed_refs_rsv.  Any extra bytes are returned to the space info.
> + */
> +void btrfs_migrate_to_delayed_refs_rsv(struct btrfs_fs_info *fs_info,
> +struct btrfs_block_rsv *src,
> +u64 num_bytes)

This function is currently used only during transaction start; it seems
to be rather specific to the delayed refs, so I'd suggest making it
private to transaction.c.

> +{
> + struct btrfs_block_rsv *delayed_refs_rsv = &fs_info->delayed_refs_rsv;
> + u64 to_free = 0;
> +
> + spin_lock(&src->lock);
> + src->reserved -= num_bytes;
> + src->size -= num_bytes;
> + spin_unlock(&src->lock);
> +
> + spin_lock(&delayed_refs_rsv->lock);
> + if (delayed_refs_rsv->size > delayed_refs_rsv->reserved) {
> + u64 delta = delayed_refs_rsv->size -
> + delayed_refs_rsv->reserved;
> + if (num_bytes > delta) {
> + to_free = num_bytes - delta;
> + num_bytes = delta;
> + }
> + } else {
> + to_free = num_bytes;
> + num_bytes = 0;
> + }
> +
> + if (num_bytes)
> + delayed_refs_rsv->reserved += num_bytes;
> + if (delayed_refs_rsv->reserved >= delayed_refs_rsv->size)
> + delayed_refs_rsv->full = 1;
> + spin_unlock(&delayed_refs_rsv->lock);
> +
> + if (num_bytes)
> + trace_btrfs_space_reservation(fs_info, "delayed_refs_rsv",
> +   0, num_bytes, 1);
> + if (to_free)
> + space_info_add_old_bytes(fs_info, delayed_refs_rsv->space_info,
> +  to_free);
> +}
> +
> +/**
> + * btrfs_delayed_refs_rsv_refill - refill based on our delayed refs usage.
> + * @fs_info - the fs_info for our fs.
> + * @flush - control how we can flush for this reservation.
> + *
> + * This will refill the delayed block_rsv up to 1 items size worth of space 
> and
> + * will return -ENOSPC if we can't make the reservation.
> + */
> +int btrfs_delayed_refs_rsv_refill(struct btrfs_fs_info *fs_info,
> +   enum btrfs_reserve_flush_enum flush)
> +{
> + struct btrfs_block_rsv *block_rsv = &fs_info->delayed_refs

[PATCH] Btrfs: use generic_remap_file_range_prep() for cloning and deduplication

2018-12-07 Thread fdmanana
From: Filipe Manana 

Since cloning and deduplication are no longer Btrfs specific operations, we
now have generic code to handle parameter validation, compare file ranges
used for deduplication, clear capabilities when cloning, etc. This change
makes Btrfs use it, eliminating a lot of code in Btrfs and also fixing a
few bugs, such as:

1) When cloning, the destination file's capabilities were not dropped
   (the fstest generic/513 tests this);

2) We were not checking if the destination file is immutable;

3) Not checking if either the source or destination files are swap
   files (swap file support is coming soon for Btrfs);

4) System limits were not checked (resource limits and O_LARGEFILE).

Note that the generic helper generic_remap_file_range_prep() does start
and waits for writeback by calling filemap_write_and_wait_range(), however
that is not enough for Btrfs for two reasons:

1) With compression, we need to start writeback twice in order to get the
   pages marked for writeback and ordered extents created;

2) filemap_write_and_wait_range() (and all its other variants) only waits
   for the IO to complete, but we need to wait for the ordered extents to
   finish, so that when we do the actual reflinking operations the file
   extent items are in the fs tree. This is also important due to the fact
   that the generic helper, for the deduplication case, compares the
   contents of the pages in the requested range, which might require
   reading extents from disk in the very unlikely case that pages get
   invalidated after writeback finishes (so the file extent items must be
   up to date in the fs tree).

Since these reasons are specific to Btrfs we have to do it in the Btrfs
code before calling generic_remap_file_range_prep(). This also results in
a simpler way of dealing with existing delalloc in the source/target
ranges, especially for the deduplication case, where we used to lock all the
pages first and then, if we found any delalloc for the range or an ordered
extent, we would unlock the pages, trigger writeback and wait for ordered
extents to complete, then lock all the pages again and check if
deduplication can be done. So now we get a simpler approach: lock the
inodes, then trigger writeback and then wait for ordered extents to
complete.

So make btrfs use generic_remap_file_range_prep() (XFS and OCFS2 use it)
to eliminate duplicated code, fix a few bugs and benefit from future bug
fixes done there - for example the recent clone and dedupe bugs involving
reflinking a partial EOF block got a counterpart fix in the generic helper,
since it affected all filesystems supporting these operations, so we no
longer need special checks in Btrfs for them.

Signed-off-by: Filipe Manana 
---
 fs/btrfs/ioctl.c | 615 ---
 1 file changed, 132 insertions(+), 483 deletions(-)

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 802a628e9f7d..261e116dddb2 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -3191,92 +3191,6 @@ static long btrfs_ioctl_dev_info(struct btrfs_fs_info 
*fs_info,
return ret;
 }
 
-static struct page *extent_same_get_page(struct inode *inode, pgoff_t index)
-{
-   struct page *page;
-
-   page = grab_cache_page(inode->i_mapping, index);
-   if (!page)
-   return ERR_PTR(-ENOMEM);
-
-   if (!PageUptodate(page)) {
-   int ret;
-
-   ret = btrfs_readpage(NULL, page);
-   if (ret)
-   return ERR_PTR(ret);
-   lock_page(page);
-   if (!PageUptodate(page)) {
-   unlock_page(page);
-   put_page(page);
-   return ERR_PTR(-EIO);
-   }
-   if (page->mapping != inode->i_mapping) {
-   unlock_page(page);
-   put_page(page);
-   return ERR_PTR(-EAGAIN);
-   }
-   }
-
-   return page;
-}
-
-static int gather_extent_pages(struct inode *inode, struct page **pages,
-  int num_pages, u64 off)
-{
-   int i;
-   pgoff_t index = off >> PAGE_SHIFT;
-
-   for (i = 0; i < num_pages; i++) {
-again:
-   pages[i] = extent_same_get_page(inode, index + i);
-   if (IS_ERR(pages[i])) {
-   int err = PTR_ERR(pages[i]);
-
-   if (err == -EAGAIN)
-   goto again;
-   pages[i] = NULL;
-   return err;
-   }
-   }
-   return 0;
-}
-
-static int lock_extent_range(struct inode *inode, u64 off, u64 len,
-bool retry_range_locking)
-{
-   /*
-* Do any pending delalloc/csum calculations on inode, one way or
-* another, and lock file content.
-* The locking order is:
-*
-*   1) pages
-*   2) range in t

[PATCH] Btrfs: scrub, move setup of nofs contexts higher in the stack

2018-12-07 Thread fdmanana
From: Filipe Manana 

Since scrub workers only do memory allocation with GFP_KERNEL when they
need to perform repair, we can move the recent setup of the nofs context
up to scrub_handle_errored_block() instead of setting it up down the call
chain at insert_full_stripe_lock() and scrub_add_page_to_wr_bio(),
removing some duplicated code and comments. So the only paths for which a
scrub worker can do memory allocations using GFP_KERNEL are the following:

 scrub_bio_end_io_worker()
   scrub_block_complete()
     scrub_handle_errored_block()
       lock_full_stripe()
         insert_full_stripe_lock()
           -> kmalloc with GFP_KERNEL

 scrub_bio_end_io_worker()
   scrub_block_complete()
     scrub_handle_errored_block()
       scrub_write_page_to_dev_replace()
         scrub_add_page_to_wr_bio()
           -> kzalloc with GFP_KERNEL
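
For context, the scoping pattern being moved up the call chain is the
standard memalloc_nofs_save()/memalloc_nofs_restore() pair from
<linux/sched/mm.h>; a minimal sketch follows (the function name below is
made up, the API is the real one):

#include <linux/sched/mm.h>
#include <linux/slab.h>

/* Sketch only: every allocation in this scope behaves as GFP_NOFS, so
 * GFP_KERNEL allocations in callees cannot recurse into the filesystem
 * and deadlock against a transaction commit waiting for scrub to pause. */
static int handle_errored_block_sketch(void)
{
        unsigned int nofs_flag;
        void *mem;
        int ret = 0;

        nofs_flag = memalloc_nofs_save();

        /* Looks like GFP_KERNEL, acts like GFP_NOFS inside the scope. */
        mem = kmalloc(128, GFP_KERNEL);
        if (!mem)
                ret = -ENOMEM;
        kfree(mem);

        memalloc_nofs_restore(nofs_flag);
        return ret;
}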

Signed-off-by: Filipe Manana 
---

Applies on top of:

  Btrfs: fix deadlock with memory reclaim during scrub

 fs/btrfs/scrub.c | 34 ++
 1 file changed, 14 insertions(+), 20 deletions(-)

diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index bbd1b36f4918..f996f4064596 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -322,7 +322,6 @@ static struct full_stripe_lock *insert_full_stripe_lock(
struct rb_node *parent = NULL;
struct full_stripe_lock *entry;
struct full_stripe_lock *ret;
-   unsigned int nofs_flag;
 
lockdep_assert_held(&locks_root->lock);
 
@@ -342,15 +341,8 @@ static struct full_stripe_lock *insert_full_stripe_lock(
 
/*
 * Insert new lock.
-*
-* We must use GFP_NOFS because the scrub task might be waiting for a
-* worker task executing this function and in turn a transaction commit
-* might be waiting the scrub task to pause (which needs to wait for all
-* the worker tasks to complete before pausing).
 */
-   nofs_flag = memalloc_nofs_save();
ret = kmalloc(sizeof(*ret), GFP_KERNEL);
-   memalloc_nofs_restore(nofs_flag);
if (!ret)
return ERR_PTR(-ENOMEM);
ret->logical = fstripe_logical;
@@ -842,6 +834,7 @@ static int scrub_handle_errored_block(struct scrub_block 
*sblock_to_check)
int page_num;
int success;
bool full_stripe_locked;
+   unsigned int nofs_flag;
static DEFINE_RATELIMIT_STATE(_rs, DEFAULT_RATELIMIT_INTERVAL,
  DEFAULT_RATELIMIT_BURST);
 
@@ -867,6 +860,16 @@ static int scrub_handle_errored_block(struct scrub_block 
*sblock_to_check)
dev = sblock_to_check->pagev[0]->dev;
 
/*
+* We must use GFP_NOFS because the scrub task might be waiting for a
+* worker task executing this function and in turn a transaction commit
+* might be waiting the scrub task to pause (which needs to wait for all
+* the worker tasks to complete before pausing).
+* We do allocations in the workers through insert_full_stripe_lock()
+* and scrub_add_page_to_wr_bio(), which happens down the call chain of
+* this function.
+*/
+   nofs_flag = memalloc_nofs_save();
+   /*
 * For RAID5/6, race can happen for a different device scrub thread.
 * For data corruption, Parity and Data threads will both try
 * to recovery the data.
@@ -875,6 +878,7 @@ static int scrub_handle_errored_block(struct scrub_block 
*sblock_to_check)
 */
ret = lock_full_stripe(fs_info, logical, &full_stripe_locked);
if (ret < 0) {
+   memalloc_nofs_restore(nofs_flag);
spin_lock(&sctx->stat_lock);
if (ret == -ENOMEM)
sctx->stat.malloc_errors++;
@@ -914,7 +918,7 @@ static int scrub_handle_errored_block(struct scrub_block 
*sblock_to_check)
 */
 
sblocks_for_recheck = kcalloc(BTRFS_MAX_MIRRORS,
- sizeof(*sblocks_for_recheck), GFP_NOFS);
+ sizeof(*sblocks_for_recheck), GFP_KERNEL);
if (!sblocks_for_recheck) {
spin_lock(&sctx->stat_lock);
sctx->stat.malloc_errors++;
@@ -1212,6 +1216,7 @@ static int scrub_handle_errored_block(struct scrub_block 
*sblock_to_check)
}
 
ret = unlock_full_stripe(fs_info, logical, full_stripe_locked);
+   memalloc_nofs_restore(nofs_flag);
if (ret < 0)
return ret;
return 0;
@@ -1630,19 +1635,8 @@ static int scrub_add_page_to_wr_bio(struct scrub_ctx 
*sctx,
mutex_lock(&sctx->wr_lock);
 again:
if (!sctx->wr_curr_bio) {
-   unsigned int nofs_flag;
-
-   /*
-* We must use GFP_NOFS because the scrub task might be waiting
-* for a worker task executing this function and in turn a
-* transaction commit might be waiting the scrub task to pause
-* (wh

Re: [PATCH 04/10] btrfs: only track ref_heads in delayed_ref_updates

2018-12-07 Thread Nikolay Borisov



On 3.12.18 at 17:20, Josef Bacik wrote:
> From: Josef Bacik 
> 
> We use this number to figure out how many delayed refs to run, but
> __btrfs_run_delayed_refs really only checks every time we need a new
> delayed ref head, so we always run at least one ref head completely no
> matter how many items are on it.  Fix the accounting to only be
> adjusted when we add/remove a ref head.

David,

I think it also warrants a forward-looking sentence stating that the number
is also going to be used to calculate the required number of bytes in
the delayed refs rsv. Something along the lines of:

In addition to using this number to limit the number of delayed refs
run, a future patch is also going to use it to calculate the amount of
space required for delayed refs space reservation.

> 
> Reviewed-by: Nikolay Borisov 
> Signed-off-by: Josef Bacik 
> ---
>  fs/btrfs/delayed-ref.c | 3 ---
>  1 file changed, 3 deletions(-)
> 
> diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c
> index b3e4c9fcb664..48725fa757a3 100644
> --- a/fs/btrfs/delayed-ref.c
> +++ b/fs/btrfs/delayed-ref.c
> @@ -251,8 +251,6 @@ static inline void drop_delayed_ref(struct 
> btrfs_trans_handle *trans,
>   ref->in_tree = 0;
>   btrfs_put_delayed_ref(ref);
>   atomic_dec(&delayed_refs->num_entries);
> - if (trans->delayed_ref_updates)
> - trans->delayed_ref_updates--;
>  }
>  
>  static bool merge_ref(struct btrfs_trans_handle *trans,
> @@ -467,7 +465,6 @@ static int insert_delayed_ref(struct btrfs_trans_handle 
> *trans,
>   if (ref->action == BTRFS_ADD_DELAYED_REF)
>   list_add_tail(&ref->add_list, &href->ref_add_list);
>   atomic_inc(&root->num_entries);
> - trans->delayed_ref_updates++;
>   spin_unlock(&href->lock);
>   return ret;
>  }
> 


Re: [PATCH 08/10] btrfs: rework btrfs_check_space_for_delayed_refs

2018-12-07 Thread Nikolay Borisov



On 7.12.18 at 9:09, Nikolay Borisov wrote:
> 
> 
>> On 6.12.18 at 19:54, David Sterba wrote:
>> On Thu, Dec 06, 2018 at 06:52:21PM +0200, Nikolay Borisov wrote:
>>>
>>>
>>> On 3.12.18 at 17:20, Josef Bacik wrote:
 Now with the delayed_refs_rsv we can now know exactly how much pending
 delayed refs space we need.  This means we can drastically simplify
>>>
>>> IMO it will be helpful if there is a sentence here referring back to
>>> btrfs_update_delayed_refs_rsv to put your first sentence into context.
>>> But I guess this is something David can also do.
>>
>> I'll update the changelog, but I'm not sure what exactly you want to see
>> there, please post the replacement text. Thanks.
> 
> With the introduction of the delayed_refs_rsv infrastructure, namely
> btrfs_update_delayed_refs_rsv, we now know exactly how much pending
> delayed refs space is required.

To put things into context as to why I deem this change beneficial:
basically, doing the migration of a reservation from the transaction to the
delayed refs rsv modifies both size and reserved, and they will be equal.
Calling btrfs_update_delayed_refs_rsv actually increases ->size and doesn't
really decrement ->reserved. Also, we never do
btrfs_block_rsv_migrate/use_block_rsv on the delayed refs block rsv, so
managing the ->reserved value for the delayed refs rsv is different from
the rest of the block rsvs.


> 
>>
 btrfs_check_space_for_delayed_refs by simply checking how much space we
 have reserved for the global rsv (which acts as a spill over buffer) and
 the delayed refs rsv.  If our total size is beyond that amount then we
 know it's time to commit the transaction and stop any more delayed refs
 from being generated.

 Signed-off-by: Josef Bacik 
>>
> 


Re: HELP unmountable partition after btrfs balance to RAID0

2018-12-07 Thread Duncan
Thomas Mohr posted on Thu, 06 Dec 2018 12:31:15 +0100 as excerpted:

> We wanted to convert a file system to a RAID0 with two partitions.
> Unfortunately we had to reboot the server during the balance operation
> before it could complete.
> 
> Now following happens:
> 
> A mount attempt of the array fails with the following error code:
> 
> btrfs recover yields roughly 1.6 out of 4 TB.

[Just another btrfs user and list regular, not a dev.  A dev may reply to 
your specific case, but meanwhile, for next time...]

That shouldn't be a problem.  Because with raid0 a failure of any of the 
components will take down the entire raid, making it less reliable than a 
single device, raid0 (in general, not just btrfs) is considered only 
useful for data of low enough value that its loss is no big deal, either 
because it's truly of little value (internet cache being a good example), 
or because backups are kept available and updated for whenever the raid0 
array fails.  Because with raid0, it's always a question of when it'll 
fail, not if.

So loss of a filesystem being converted to raid0 isn't a problem, because 
the data on it, by virtue of being in the process of conversion to raid0, 
is defined as of throw-away value in any case.  If it's of higher value 
than that, it's not going to be raid0 (or in the process of conversion to 
it) in the first place.

Of course that's simply an extension of the more general first sysadmin's 
rule of backups, that the true value of data is defined not by arbitrary 
claims, but by the number of backups of that data it's worth having.  
Because "things happen", whether it's fat-fingering, bad hardware, buggy 
software, or simply someone tripping over the power cable or running into 
the power pole outside at the wrong time.

So no backup is simply defining the data as worth less than the time/
trouble/resources necessary to make that backup.

Note that you ALWAYS save what was of most value to you, either the time/
trouble/resources to do the backup, if your actions defined that to be of 
more value than the data, or the data, if you had that backup, thereby 
defining the value of the data to be worth backing up.

Similarly, failure of the only backup isn't a problem because by virtue 
of there being only that one backup, the data is defined as not worth 
having more than one, and likewise, having an outdated backup isn't a 
problem, because that's simply the special case of defining the data in 
the delta between the backup time and the present as not (yet) worth the 
time/hassle/resources to make/refresh that backup.

(And FWIW, the second sysadmin's rule of backups is that it's not a 
backup until you've successfully tested it recoverable in the same sort 
of conditions you're likely to need to recover it in.  Because so many 
people have /thought/ they had backups, that turned out not to be, 
because they never tested that they could actually recover the data from 
them.  For instance, if the backup tools you'll need to recover the 
backup are on the backup itself, how do you get to them?  Can you create 
a filesystem for the new copy of the data and recover it from the backup 
with just the tools and documentation available from your emergency boot 
media?  Untested backup == no backup, or at best, backup still in 
process!)

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: [PATCH 08/10] btrfs: rework btrfs_check_space_for_delayed_refs

2018-12-06 Thread Nikolay Borisov



On 6.12.18 at 19:54, David Sterba wrote:
> On Thu, Dec 06, 2018 at 06:52:21PM +0200, Nikolay Borisov wrote:
>>
>>
>> On 3.12.18 at 17:20, Josef Bacik wrote:
>>> Now with the delayed_refs_rsv we can now know exactly how much pending
>>> delayed refs space we need.  This means we can drastically simplify
>>
>> IMO it will be helpful if there is a sentence here referring back to
>> btrfs_update_delayed_refs_rsv to put your first sentence into context.
>> But I guess this is something David can also do.
> 
> I'll update the changelog, but I'm not sure what exactly you want to see
> there, please post the replacement text. Thanks.

With the introduction of the delayed_refs_rsv infrastructure, namely
btrfs_update_delayed_refs_rsv, we now know exactly how much pending
delayed refs space is required.

> 
>>> btrfs_check_space_for_delayed_refs by simply checking how much space we
>>> have reserved for the global rsv (which acts as a spill over buffer) and
>>> the delayed refs rsv.  If our total size is beyond that amount then we
>>> know it's time to commit the transaction and stop any more delayed refs
>>> from being generated.
>>>
>>> Signed-off-by: Josef Bacik 
> 


Re: [PATCH RESEND 0/8] btrfs-progs: sub: Relax the privileges of "subvolume list/show"

2018-12-06 Thread Omar Sandoval
On Tue, Nov 27, 2018 at 02:24:41PM +0900, Misono Tomohiro wrote:
> Hello,
> 
> This is basically the resend of 
>   "[PATCH v2 00/20] btrfs-progs: Rework of "subvolume list/show" and relax the
>   root privileges of them" [1]
> which I submitted in June. The aim of this series is to allow non-privileged
> users to use basic subvolume functionality (create/list/snapshot/delete;
> this allows "list").
> 
> They were once in the devel branch with some whitespace/comment
> modifications by David. I rebased them to the current devel branch.
> I rebased them to current devel branch.
> 
> github: https://github.com/t-msn/btrfs-progs/tree/rework-sub-list
> 
> Basic logic/code is the same as before. Some differences are:
>  - Use latest libbtrfsutil from Omar [2] (thus drop first part of patches).
>    As a result, "sub list" cannot accept an ordinary directory to be
>    specified (which was allowed in the previous version)
>  - Drop patches which add new options to "sub list"
>  - Use 'nobody' as non-privileged test user just like libbtrfsutil test
>  - Update comments
> 
> Importantly, in order to make the output consistent for both root and
> non-privileged users, this changes the behavior of "subvolume list":
>  - (default) Only list subvolumes under the specified path.
>    Path needs to be a subvolume.
>  - (-a) filter is dropped, i.e. its output is the same as the
>    default behavior of "sub list" in progs <= 4.19
> 
> Therefore, existing scripts may need to be updated to add the -a option
> (I believe nobody uses the current -a option).
> If anyone thinks this is not good, please let me know.

I think there are a few options in the case that the path isn't a
subvolume:

1. List all subvolumes in the filesystem with randomly mangled paths,
   which is what we currently do.
2. Error out, which is what this version of the series does.
3. List all subvolumes under the containing subvolume, which is what the
   previous version does.
4. List all subvolumes under the containing subvolume that are
   underneath the given path.

Option 1 won't work well for unprivileged users. Option 2 (this series)
is definitely going to break people's workflows/scripts. Option 3 is
unintuitive. In my opinion, option 4 is the nicest, but it may also
break scripts that expect all subvolumes to be printed.

There's also an option 5, which is to keep the behavior the same for
root (like what my previous patch [1] did) and implement option 4 for
unprivileged users.

I think 4 and 5 are the two main choices: do we want to preserve
backwards compatibility as carefully as possible (at the cost of
consistency), or do we want to risk it and improve the interface?

1: 
https://github.com/osandov/btrfs-progs/commit/fb61c21aeb998b12c1d02532639083d7f40c41e0


Re: BTRFS RAID filesystem unmountable

2018-12-06 Thread Qu Wenruo


On 2018/12/7 7:15 AM, Michael Wade wrote:
> Hi Qu,
> 
> Me again! Having formatted the drives and rebuilt the RAID array I
> seem to be having the same problem as before (no power cut this
> time [I bought a UPS]).

But strangely, your super block shows it has a log tree, which means
either you hit a kernel panic/transaction abort, or an unexpected power
loss.

> The brtfs volume is broken on my ReadyNAS.
> 
> I have attached the results of some of the commands you asked me to
> run last time, and I am hoping you might be able to help me out.

This time the problem is more serious: some chunk tree blocks are not
even inside the system chunk range, so no wonder it fails to mount.

To confirm it, you could run "btrfs ins dump-tree -b 17725903077376"
and paste the output.

But I don't have any clue. My guess is some kernel problem related to
new chunk allocation, or the chunk root node itself is already seriously
corrupted.

Considering how old your kernel is (4.4), it's not recommended to use
btrfs on such an old kernel, unless it's well backported with tons of
btrfs fixes.

Thanks,
Qu

> 
> Kind regards
> Michael
> On Sat, 19 May 2018 at 12:43, Michael Wade  wrote:
>>
>> I have let the find root command run for 14+ days; it's produced a
>> pretty huge log file (1.6 GB) but still hasn't completed. I think I will
>> start the process of reformatting my drives and starting over.
>>
>> Thanks for your help anyway.
>>
>> Kind regards
>> Michael
>>
>> On 5 May 2018 at 01:43, Qu Wenruo  wrote:
>>>
>>>
>>> On 2018-05-05 00:18, Michael Wade wrote:
>>>> Hi Qu,
>>>>
>>>> The tool is still running and the log file is now ~300mb. I guess it
>>>> shouldn't normally take this long.. Is there anything else worth
>>>> trying?
>>>
>>> I'm afraid not much.
>>>
>>> Although there is a possibility to modify btrfs-find-root to do a much
>>> faster but limited search.
>>>
>>> But from the result, it looks like underlying device corruption, and
>>> there's not much we can do right now.
>>>
>>> Thanks,
>>> Qu
>>>
>>>>
>>>> Kind regards
>>>> Michael
>>>>
>>>> On 2 May 2018 at 06:29, Michael Wade  wrote:
>>>>> Thanks Qu,
>>>>>
>>>>> I actually aborted the run with the old btrfs tools once I saw its
>>>>> output. The new btrfs tools is still running and has produced a log
>>>>> file of ~85mb filled with that content so far.
>>>>>
>>>>> Kind regards
>>>>> Michael
>>>>>
>>>>> On 2 May 2018 at 02:31, Qu Wenruo  wrote:
>>>>>>
>>>>>>
>>>>>> On 2018-05-01 23:50, Michael Wade wrote:
>>>>>>> Hi Qu,
>>>>>>>
>>>>>>> Oh dear that is not good news!
>>>>>>>
>>>>>>> I have been running the find root command since yesterday but it only
>>>>>>> seems to be only be outputting the following message:
>>>>>>>
>>>>>>> ERROR: tree block bytenr 0 is not aligned to sectorsize 4096
>>>>>>
>>>>>> It's mostly fine, as find-root will go through all tree blocks and try
>>>>>> to read them as tree blocks.
>>>>>> Although btrfs-find-root suppresses csum error output, such basic
>>>>>> tree validation checks are not suppressed, thus you get such messages.
>>>>>>
>>>>>>> ERROR: tree block bytenr 0 is not aligned to sectorsize 4096
>>>>>>> ERROR: tree block bytenr 0 is not aligned to sectorsize 4096
>>>>>>> ERROR: tree block bytenr 0 is not aligned to sectorsize 4096
>>>>>>> ERROR: tree block bytenr 0 is not aligned to sectorsize 4096
>>>>>>> ERROR: tree block bytenr 0 is not aligned to sectorsize 4096
>>>>>>> ERROR: tree block bytenr 0 is not aligned to sectorsize 4096
>>>>>>> ERROR: tree block bytenr 0 is not aligned to sectorsize 4096
>>>>>>> ERROR: tree block bytenr 0 is not aligned to sectorsize 4096
>>>>>>> ERROR: tree block bytenr 0 is not aligned to sectorsize 4096
>>>>>>>
>>>>>>> I tried with the latest btrfs tools compiled from source and the ones
>>>>>>> I have installed with the same result. Is there a CLI utility I could
>>>>>>> use to determine if the log cont

Re: [PATCH v2 0/6] btrfs: qgroup: Delay subtree scan to reduce overhead

2018-12-06 Thread Qu Wenruo


On 2018/12/7 3:35 AM, David Sterba wrote:
> On Mon, Nov 12, 2018 at 10:33:33PM +0100, David Sterba wrote:
>> On Thu, Nov 08, 2018 at 01:49:12PM +0800, Qu Wenruo wrote:
>>> This patchset can be fetched from github:
>>> https://github.com/adam900710/linux/tree/qgroup_delayed_subtree_rebased
>>>
>>> Which is based on v4.20-rc1.
>>
>> Thanks, I'll add it to for-next soon.
> 
> The branch was there for some time but not for at least a week (my
> mistake, I did not notice in time). I've rebased it on top of recent
> misc-next, but without the delayed refs patchset from Josef.
> 
> At the moment I'm considering it for merge to 4.21; there's still some
> time to pull it out in case it turns out to be too problematic. I'm
> mostly worried about the unknown interactions with the enospc updates or

For that part, I don't think it would cause any obvious problems for the
enospc updates.

The user-noticeable effect is the delayed deletion of reloc trees.

Apart from that, it's mostly transparent to extent allocation.

> generally because of lack of qgroup and reloc code reviews.

That's the biggest problem.

However, most of the current qgroup + balance optimization is done inside
qgroup code (to skip certain qgroup records), so if we're going to hit a
problem, this patchset has the highest chance of hitting it.

Later patches will mostly just keep tweaking qgroup code without affecting
any other parts.

So I'm fine if you decide to pull it out for now.

Thanks,
Qu

> 
> I'm going to do some testing of the rebased branch before I add it to
> for-next. The branch is ext/qu/qgroup-delay-scan in my devel repos,
> please check if everything is still ok there. Thanks.
> 





Re: [PATCH v2 0/6] btrfs: qgroup: Delay subtree scan to reduce overhead

2018-12-06 Thread David Sterba
On Mon, Nov 12, 2018 at 10:33:33PM +0100, David Sterba wrote:
> On Thu, Nov 08, 2018 at 01:49:12PM +0800, Qu Wenruo wrote:
> > This patchset can be fetched from github:
> > https://github.com/adam900710/linux/tree/qgroup_delayed_subtree_rebased
> > 
> > Which is based on v4.20-rc1.
> 
> Thanks, I'll add it to for-next soon.

The branch was there for some time but not for at least a week (my
mistake, I did not notice in time). I've rebased it on top of recent
misc-next, but without the delayed refs patchset from Josef.

At the moment I'm considering it for merge to 4.21; there's still some
time to pull it out in case it turns out to be too problematic. I'm
mostly worried about the unknown interactions with the enospc updates or
generally because of lack of qgroup and reloc code reviews.

I'm going to do some testing of the rebased branch before I add it to
for-next. The branch is ext/qu/qgroup-delay-scan in my devel repos,
please check if everything is still ok there. Thanks.


Re: [PATCH][v2] btrfs: run delayed items before dropping the snapshot

2018-12-06 Thread David Sterba
On Wed, Dec 05, 2018 at 12:12:21PM -0500, Josef Bacik wrote:
> From: Josef Bacik 
> 
> With my delayed refs patches in place we started seeing a large amount
> of aborts in __btrfs_free_extent
> 
> BTRFS error (device sdb1): unable to find ref byte nr 91947008 parent 0 root 35964  owner 1 offset 0
> Call Trace:
>  ? btrfs_merge_delayed_refs+0xaf/0x340
>  __btrfs_run_delayed_refs+0x6ea/0xfc0
>  ? btrfs_set_path_blocking+0x31/0x60
>  btrfs_run_delayed_refs+0xeb/0x180
>  btrfs_commit_transaction+0x179/0x7f0
>  ? btrfs_check_space_for_delayed_refs+0x30/0x50
>  ? should_end_transaction.isra.19+0xe/0x40
>  btrfs_drop_snapshot+0x41c/0x7c0
>  btrfs_clean_one_deleted_snapshot+0xb5/0xd0
>  cleaner_kthread+0xf6/0x120
>  kthread+0xf8/0x130
>  ? btree_invalidatepage+0x90/0x90
>  ? kthread_bind+0x10/0x10
>  ret_from_fork+0x35/0x40
> 
> This was because btrfs_drop_snapshot depends on the root not being modified
> while it's dropping the snapshot.  It will unlock the root node (and really
> every node) as it walks down the tree, only to re-lock it when it needs to do
> something.  This is a problem because if we modify the tree we could cow
> a block in our path, which frees our reference to that block.  Then once
> we get back to that shared block we'll free our reference to it again,
> and get ENOENT when trying to lookup our extent reference to that block
> in __btrfs_free_extent.
> 
> This is ultimately happening because we have delayed items left to be
> processed for our deleted snapshot _after_ all of the inodes are closed
> for the snapshot.  We only run the delayed inode item if we're deleting
> the inode, and even then we
> do not run the delayed insertions or delayed removals.  These can be run
> at any point after our final inode does its last iput, which is what
> triggers the snapshot deletion.  We can end up with the snapshot deletion
> happening and then have the delayed items run on that file system,
> resulting in the above problem.
> 
> This problem has existed forever; however, my patches made it much easier
> to hit as I wake up the cleaner much more often to deal with delayed
> iputs, which made us more likely to start the snapshot dropping work
> before the transaction commits, which is when the delayed items would
> generally be run.  Before, generally speaking, we would run the delayed
> items, commit the transaction, and wake up the cleaner thread to start
> deleting snapshots, which means we were less likely to hit this problem.
> You could still hit it if you had multiple snapshots to be deleted and
> ended up with lots of delayed items, but it was definitely harder.
> 
> Fix for now by simply running all the delayed items before starting to
> drop the snapshot.  We could make this smarter in the future by making
> the delayed items per-root, and then simply drop any delayed items for
> roots that we are going to delete.  But for now just a quick and easy
> solution is the safest.
> 
> Cc: sta...@vger.kernel.org
> Signed-off-by: Josef Bacik 
> ---
> v1->v2:
> - check for errors from btrfs_run_delayed_items.
> - Dave, I can reroll the series, but the second version of patch 1 is the
>   same; let me know what you want.
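
(Schematically, the ordering the fix described above establishes looks like
the sketch below. This is an illustration assuming the 4.20 function
signatures, not the literal patch.)

/* Sketch, not the literal patch: flush pending delayed items before
 * walking the snapshot, so nothing later modifies (CoWs) blocks in the
 * path that the drop walk holds unlocked references into. */
static int drop_snapshot_sketch(struct btrfs_root *root)
{
        struct btrfs_trans_handle *trans;
        int ret;

        trans = btrfs_start_transaction(root, 0);
        if (IS_ERR(trans))
                return PTR_ERR(trans);

        /* Flush delayed insertions/removals before touching the tree. */
        ret = btrfs_run_delayed_items(trans);
        if (ret) {
                btrfs_end_transaction(trans);
                return ret;
        }

        /* ... walk_down_tree()/walk_up_tree() would follow here ... */
        return btrfs_end_transaction(trans);
}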

As this is a small update it's fine to send just that patch. You may
also use --in-reply-to so it threads to the original series. Resending
series makes most sense (to me) when there's a discussion and many
changes, so a fresh series makes it clear what's the current status.

Patch replaced in for-next topic branch, thanks.


Re: [PATCH 1/2] btrfs: catch cow on deleting snapshots

2018-12-06 Thread David Sterba
On Fri, Nov 30, 2018 at 12:19:18PM -0500, Josef Bacik wrote:
> On Fri, Nov 30, 2018 at 05:14:54PM +, Filipe Manana wrote:
> > On Fri, Nov 30, 2018 at 4:53 PM Josef Bacik  wrote:
> > >
> > > From: Josef Bacik 
> > >
> > > When debugging some weird extent reference bug I suspected that we were
> > > changing a snapshot while we were deleting it, which could explain my
> > > bug.  This was indeed what was happening, and this patch helped me
> > > verify my theory.  It is never correct to modify the snapshot once it's
> > > being deleted, so mark the root when we are deleting it and make sure we
> > > complain about it when it happens.
> > >
> > > Signed-off-by: Josef Bacik 
> > > ---
> > >  fs/btrfs/ctree.c   | 3 +++
> > >  fs/btrfs/ctree.h   | 1 +
> > >  fs/btrfs/extent-tree.c | 9 +
> > >  3 files changed, 13 insertions(+)
> > >
> > > diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
> > > index 5912a97b07a6..5f82f86085e8 100644
> > > --- a/fs/btrfs/ctree.c
> > > +++ b/fs/btrfs/ctree.c
> > > @@ -1440,6 +1440,9 @@ noinline int btrfs_cow_block(struct 
> > > btrfs_trans_handle *trans,
> > > u64 search_start;
> > > int ret;
> > >
> > > +   if (test_bit(BTRFS_ROOT_DELETING, &root->state))
> > > +   WARN(1, KERN_CRIT "cow'ing blocks on a fs root thats 
> > > being dropped\n");
> > 
> > Please use btrfs_warn(), it makes sure we use a consistent message
> > style, identifies the fs, etc.
> > Also, "thats" should be "that is" or "that's".
> > 
> 
> Ah yeah, I was following the other convention in there but we should probably
> convert all of those to btrfs_warn.  I'll fix the grammar thing as well,
> just a leftover from the much less code-of-conduct-friendly message I
> originally had there.  Thanks,

Committed with the following fixup:

-   WARN(1, KERN_CRIT "cow'ing blocks on a fs root thats being 
dropped\n");
+   btrfs_error(fs_info,
+   "COW'ing blocks on a fs root that's being dropped");



Re: [PATCH 08/10] btrfs: rework btrfs_check_space_for_delayed_refs

2018-12-06 Thread David Sterba
On Thu, Dec 06, 2018 at 06:52:21PM +0200, Nikolay Borisov wrote:
> 
> 
> On 3.12.18 at 17:20, Josef Bacik wrote:
> > Now with the delayed_refs_rsv we can now know exactly how much pending
> > delayed refs space we need.  This means we can drastically simplify
> 
> IMO it will be helpful if there is a sentence here referring back to
> btrfs_update_delayed_refs_rsv to put your first sentence into context.
> But I guess this is something David can also do.

I'll update the changelog, but I'm not sure what exactly you want to see
there, please post the replacement text. Thanks.

> > btrfs_check_space_for_delayed_refs by simply checking how much space we
> > have reserved for the global rsv (which acts as a spill over buffer) and
> > the delayed refs rsv.  If our total size is beyond that amount then we
> > know it's time to commit the transaction and stop any more delayed refs
> > from being generated.
> > 
> > Signed-off-by: Josef Bacik 


Re: [PATCH 08/10] btrfs: rework btrfs_check_space_for_delayed_refs

2018-12-06 Thread Nikolay Borisov



On 3.12.18 at 17:20, Josef Bacik wrote:
> Now with the delayed_refs_rsv we can now know exactly how much pending
> delayed refs space we need.  This means we can drastically simplify

IMO it will be helpful if there is a sentence here referring back to
btrfs_update_delayed_refs_rsv to put your first sentence into context.
But I guess this is something David can also do.

> btrfs_check_space_for_delayed_refs by simply checking how much space we
> have reserved for the global rsv (which acts as a spill over buffer) and
> the delayed refs rsv.  If our total size is beyond that amount then we
> know it's time to commit the transaction and stop any more delayed refs
> from being generated.
> 
> Signed-off-by: Josef Bacik 

Reviewed-by: Nikolay Borisov 

> ---
>  fs/btrfs/ctree.h   |  2 +-
>  fs/btrfs/extent-tree.c | 48 ++--
>  fs/btrfs/inode.c   |  4 ++--
>  fs/btrfs/transaction.c |  2 +-
>  4 files changed, 22 insertions(+), 34 deletions(-)
> 
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index 2eba398c722b..30da075c042e 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -2631,7 +2631,7 @@ static inline u64 btrfs_calc_trunc_metadata_size(struct 
> btrfs_fs_info *fs_info,
>  }
>  
>  int btrfs_should_throttle_delayed_refs(struct btrfs_trans_handle *trans);
> -int btrfs_check_space_for_delayed_refs(struct btrfs_trans_handle *trans);
> +bool btrfs_check_space_for_delayed_refs(struct btrfs_fs_info *fs_info);
>  void btrfs_dec_block_group_reservations(struct btrfs_fs_info *fs_info,
>const u64 start);
>  void btrfs_wait_block_group_reservations(struct btrfs_block_group_cache *bg);
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index 5a2d0b061f57..07ef1b8087f7 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -2839,40 +2839,28 @@ u64 btrfs_csum_bytes_to_leaves(struct btrfs_fs_info 
> *fs_info, u64 csum_bytes)
>   return num_csums;
>  }
>  
> -int btrfs_check_space_for_delayed_refs(struct btrfs_trans_handle *trans)
> +bool btrfs_check_space_for_delayed_refs(struct btrfs_fs_info *fs_info)
>  {
> - struct btrfs_fs_info *fs_info = trans->fs_info;
> - struct btrfs_block_rsv *global_rsv;
> - u64 num_heads = trans->transaction->delayed_refs.num_heads_ready;
> - u64 csum_bytes = trans->transaction->delayed_refs.pending_csums;
> - unsigned int num_dirty_bgs = trans->transaction->num_dirty_bgs;
> - u64 num_bytes, num_dirty_bgs_bytes;
> - int ret = 0;
> + struct btrfs_block_rsv *delayed_refs_rsv = &fs_info->delayed_refs_rsv;
> + struct btrfs_block_rsv *global_rsv = &fs_info->global_block_rsv;
> + bool ret = false;
> + u64 reserved;
>  
> - num_bytes = btrfs_calc_trans_metadata_size(fs_info, 1);
> - num_heads = heads_to_leaves(fs_info, num_heads);
> - if (num_heads > 1)
> - num_bytes += (num_heads - 1) * fs_info->nodesize;
> - num_bytes <<= 1;
> - num_bytes += btrfs_csum_bytes_to_leaves(fs_info, csum_bytes) *
> - fs_info->nodesize;
> - num_dirty_bgs_bytes = btrfs_calc_trans_metadata_size(fs_info,
> -  num_dirty_bgs);
> - global_rsv = &fs_info->global_block_rsv;
> + spin_lock(&global_rsv->lock);
> + reserved = global_rsv->reserved;
> + spin_unlock(&global_rsv->lock);
>  
>   /*
> -  * If we can't allocate any more chunks lets make sure we have _lots_ of
> -  * wiggle room since running delayed refs can create more delayed refs.
> +  * Since the global reserve is just kind of magic we don't really want
> +  * to rely on it to save our bacon, so if our size is more than the
> +  * delayed_refs_rsv and the global rsv then it's time to think about
> +  * bailing.
>*/
> - if (global_rsv->space_info->full) {
> - num_dirty_bgs_bytes <<= 1;
> - num_bytes <<= 1;
> - }
> -
> - spin_lock(&global_rsv->lock);
> - if (global_rsv->reserved <= num_bytes + num_dirty_bgs_bytes)
> - ret = 1;
> - spin_unlock(&global_rsv->lock);
> + spin_lock(&delayed_refs_rsv->lock);
> + reserved += delayed_refs_rsv->reserved;
> + if (delayed_refs_rsv->size >= reserved)
> + ret = true;
> + spin_unlock(&delayed_refs_rsv->lock);
>   return ret;
>  }
>  
> @@ -2891,7 +2879,7 @@ int btrfs_should_throttle_delayed_refs(struct 
> btrfs_trans_handle *trans)
>   if (val >= NSEC_PER_SEC / 2)
>   return 2;
>  
> - return 

Re: [PATCH 09/10] btrfs: don't run delayed refs in the end transaction logic

2018-12-06 Thread Nikolay Borisov



On 3.12.18 г. 17:20 ч., Josef Bacik wrote:
> Over the years we have built up a lot of infrastructure to keep delayed
> refs in check, mostly by running them at btrfs_end_transaction() time.
> We have a lot of different math to figure out how many refs to run and
> whether to run them inline or async, etc.  This existed because we had no
> feedback mechanism to force the flushing of delayed refs when they
> became a problem.  However with the enospc flushing infrastructure in
> place for flushing delayed refs when they put too much pressure on the
> enospc system we have this problem solved.  Rip out all of this code as
> it is no longer needed.
> 
> Signed-off-by: Josef Bacik 
> ---
>  fs/btrfs/transaction.c | 38 --
>  1 file changed, 38 deletions(-)
> 
> diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
> index 2d8401bf8df9..01f39401619a 100644
> --- a/fs/btrfs/transaction.c
> +++ b/fs/btrfs/transaction.c
> @@ -798,22 +798,12 @@ static int should_end_transaction(struct 
> btrfs_trans_handle *trans)
>  int btrfs_should_end_transaction(struct btrfs_trans_handle *trans)
>  {
>   struct btrfs_transaction *cur_trans = trans->transaction;
> - int updates;
> - int err;
>  
>   smp_mb();
>   if (cur_trans->state >= TRANS_STATE_BLOCKED ||
>   cur_trans->delayed_refs.flushing)
>   return 1;
>  
> - updates = trans->delayed_ref_updates;
> - trans->delayed_ref_updates = 0;
> - if (updates) {
> - err = btrfs_run_delayed_refs(trans, updates * 2);
> - if (err) /* Error code will also eval true */
> - return err;
> - }
> -
>   return should_end_transaction(trans);
>  }
>  
> @@ -843,11 +833,8 @@ static int __btrfs_end_transaction(struct 
> btrfs_trans_handle *trans,
>  {
>   struct btrfs_fs_info *info = trans->fs_info;
>   struct btrfs_transaction *cur_trans = trans->transaction;
> - u64 transid = trans->transid;
> - unsigned long cur = trans->delayed_ref_updates;
>   int lock = (trans->type != TRANS_JOIN_NOLOCK);
>   int err = 0;
> - int must_run_delayed_refs = 0;
>  
>   if (refcount_read(&trans->use_count) > 1) {
>   refcount_dec(&trans->use_count);
> @@ -858,27 +845,6 @@ static int __btrfs_end_transaction(struct 
> btrfs_trans_handle *trans,
>   btrfs_trans_release_metadata(trans);
>   trans->block_rsv = NULL;
>  
> - if (!list_empty(&trans->new_bgs))
> - btrfs_create_pending_block_groups(trans);

Is this being deleted because in delayed_refs_rsv you also account for
new block groups?

> -
> - trans->delayed_ref_updates = 0;
> - if (!trans->sync) {
> - must_run_delayed_refs =
> - btrfs_should_throttle_delayed_refs(trans);
> - cur = max_t(unsigned long, cur, 32);
> -
> - /*
> -  * don't make the caller wait if they are from a NOLOCK
> -  * or ATTACH transaction, it will deadlock with commit
> -  */
> - if (must_run_delayed_refs == 1 &&
> - (trans->type & (__TRANS_JOIN_NOLOCK | __TRANS_ATTACH)))
> - must_run_delayed_refs = 2;
> - }
> -
> - btrfs_trans_release_metadata(trans);
> - trans->block_rsv = NULL;

Why remove those 2 lines as well?

> -
>   if (!list_empty(&trans->new_bgs))
>   btrfs_create_pending_block_groups(trans);
>  
> @@ -923,10 +889,6 @@ static int __btrfs_end_transaction(struct 
> btrfs_trans_handle *trans,
>   }
>  
>   kmem_cache_free(btrfs_trans_handle_cachep, trans);
> - if (must_run_delayed_refs) {
> - btrfs_async_run_delayed_refs(info, cur, transid,
> -  must_run_delayed_refs == 1);
> - }
>   return err;
>  }
>  
> 


Re: [PATCH 0/3] btrfs: use offset_in_page and PAGE_ALIGNED

2018-12-06 Thread David Sterba
On Wed, Dec 05, 2018 at 03:23:02PM +0100, Johannes Thumshirn wrote:
> Use the offset_in_page() and PAGE_ALIGNED() macros instead of open-coding them
> throughout btrfs.
> 
> This series also includes a patch for 'make coccicheck' which is marked as an
> RFC and I've CCed Julia in the hope of getting input from her.
> 
> Johannes Thumshirn (3):
>   btrfs: use offset_in_page instead of open-coding it
>   btrfs: use PAGE_ALIGNED instead of open-coding it

Added to misc-next, thanks.
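
For illustration, the substitution the series makes (offset_in_page()
and PAGE_ALIGNED() are the existing helpers from linux/mm.h; the
open-coded forms are what the call sites carried before):

    #include <linux/mm.h>

    /* before: open-coded */
    static unsigned long off_before(unsigned long start)
    {
            return start & (PAGE_SIZE - 1);
    }

    /* after: same value, but the intent is named */
    static unsigned long off_after(unsigned long start)
    {
            return offset_in_page(start);
    }

    /* likewise, !(len & (PAGE_SIZE - 1)) becomes PAGE_ALIGNED(len) */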


Re: [PATCH 06/10] btrfs: update may_commit_transaction to use the delayed refs rsv

2018-12-06 Thread Nikolay Borisov



On 3.12.18 г. 17:20 ч., Josef Bacik wrote:
> Any space used in the delayed_refs_rsv will be freed up by a transaction
> commit, so instead of just counting the pinned space we also need to
> account for any space in the delayed_refs_rsv when deciding if it will
> make a different to commit the transaction to satisfy our space
> reservation.  If we have enough bytes to satisfy our reservation ticket
> then we are good to go, otherwise subtract out what space we would gain
> back by committing the transaction and compare that against the pinned
> space to make our decision.
> 
> Signed-off-by: Josef Bacik 

Reviewed-by: Nikolay Borisov 

However, look below for one suggestion: 

> ---
>  fs/btrfs/extent-tree.c | 24 +++-
>  1 file changed, 15 insertions(+), 9 deletions(-)
> 
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index aa0a638d0263..63ff9d832867 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -4843,8 +4843,10 @@ static int may_commit_transaction(struct btrfs_fs_info 
> *fs_info,
>  {
>   struct reserve_ticket *ticket = NULL;
>   struct btrfs_block_rsv *delayed_rsv = &fs_info->delayed_block_rsv;
> + struct btrfs_block_rsv *delayed_refs_rsv = &fs_info->delayed_refs_rsv;
>   struct btrfs_trans_handle *trans;
> - u64 bytes;
> + u64 bytes_needed;
> + u64 reclaim_bytes = 0;
>  
>   trans = (struct btrfs_trans_handle *)current->journal_info;
>   if (trans)
> @@ -4857,15 +4859,15 @@ static int may_commit_transaction(struct 
> btrfs_fs_info *fs_info,
>   else if (!list_empty(&space_info->tickets))
>   ticket = list_first_entry(&space_info->tickets,
> struct reserve_ticket, list);
> - bytes = (ticket) ? ticket->bytes : 0;
> + bytes_needed = (ticket) ? ticket->bytes : 0;
>   spin_unlock(&space_info->lock);
>  
> - if (!bytes)
> + if (!bytes_needed)
>   return 0;
>  
>   /* See if there is enough pinned space to make this reservation */
>   if (__percpu_counter_compare(&space_info->total_bytes_pinned,
> -bytes,
> +bytes_needed,
>  BTRFS_TOTAL_BYTES_PINNED_BATCH) >= 0)
>   goto commit;
>  
> @@ -4877,14 +4879,18 @@ static int may_commit_transaction(struct 
> btrfs_fs_info *fs_info,
>   return -ENOSPC;

If we remove this :
 if (space_info != delayed_rsv->space_info)  
return -ENOSPC; 

Check, can't we move the reclaim_bytes calc code above the 
__percpu_counter_compare 
and eventually be left with just a single invocation to percpu_compare. 
The diff should looke something along the lines of: 

@@ -4828,19 +4827,6 @@ static int may_commit_transaction(struct btrfs_fs_info 
*fs_info,
if (!bytes)
return 0;
 
-   /* See if there is enough pinned space to make this reservation */
-   if (__percpu_counter_compare(&space_info->total_bytes_pinned,
-  bytes,
-  BTRFS_TOTAL_BYTES_PINNED_BATCH) >= 0)
-   goto commit;
-
-   /*
-* See if there is some space in the delayed insertion reservation for
-* this reservation.
-*/
-   if (space_info != delayed_rsv->space_info)
-   return -ENOSPC;
-
spin_lock(&delayed_rsv->lock);
if (delayed_rsv->size > bytes)
bytes = 0;
@@ -4850,9 +4836,8 @@ static int may_commit_transaction(struct btrfs_fs_info 
*fs_info,
 
if (__percpu_counter_compare(&space_info->total_bytes_pinned,
   bytes,
-  BTRFS_TOTAL_BYTES_PINNED_BATCH) < 0) {
+  BTRFS_TOTAL_BYTES_PINNED_BATCH) < 0)
return -ENOSPC;
-   }
 
 commit:
trans = btrfs_join_transaction(fs_info->extent_root);


>  
>   spin_lock(&delayed_rsv->lock);
> - if (delayed_rsv->size > bytes)
> - bytes = 0;
> - else
> - bytes -= delayed_rsv->size;
>   spin_unlock(&delayed_rsv->lock);
>  
> + spin_lock(&delayed_refs_rsv->lock);
> + reclaim_bytes += delayed_refs_rsv->reserved;
> + spin_unlock(&delayed_refs_rsv->lock);
> + if (reclaim_bytes >= bytes_needed)
> + goto commit;
> + bytes_needed -= reclaim_bytes;
> +
> + if (__percpu_counter_compare(&space_info->total_bytes_pinned,
> -bytes,
> +bytes_needed,
>  BTRFS_TOTAL_BYTES_PINNED_BATCH) < 0) {
>   return -ENOSPC;
>   }
> 


Re: [PATCH 02/10] btrfs: add cleanup_ref_head_accounting helper

2018-12-06 Thread Nikolay Borisov



On 3.12.18 г. 17:20 ч., Josef Bacik wrote:
> From: Josef Bacik 
> 
> We were missing some quota cleanups in check_ref_cleanup, so break the
> ref head accounting cleanup into a helper and call that from both
> check_ref_cleanup and cleanup_ref_head.  This will hopefully ensure that
> we don't screw up accounting in the future for other things that we add.
> 
> Reviewed-by: Omar Sandoval 
> Reviewed-by: Liu Bo 
> Signed-off-by: Josef Bacik 

Doesn't this also need a stable tag? Furthermore, doesn't the missing
code dealing with total_bytes_pinned in check_ref_cleanup mean that
every time the last reference for a block was freed we were leaking
bytes in total_bytes_pinned? Shouldn't this have led to
total_bytes_pinned eventually dominating the usage in a space_info?

Codewise lgtm:

Reviewed-by: Nikolay Borisov 

> ---
>  fs/btrfs/extent-tree.c | 67 
> +-
>  1 file changed, 39 insertions(+), 28 deletions(-)
> 
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index c36b3a42f2bb..e3ed3507018d 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -2443,6 +2443,41 @@ static int cleanup_extent_op(struct btrfs_trans_handle 
> *trans,
>   return ret ? ret : 1;
>  }
>  
> +static void cleanup_ref_head_accounting(struct btrfs_trans_handle *trans,
> + struct btrfs_delayed_ref_head *head)
> +{
> + struct btrfs_fs_info *fs_info = trans->fs_info;
> + struct btrfs_delayed_ref_root *delayed_refs =
> + &trans->transaction->delayed_refs;
> +
> + if (head->total_ref_mod < 0) {
> + struct btrfs_space_info *space_info;
> + u64 flags;
> +
> + if (head->is_data)
> + flags = BTRFS_BLOCK_GROUP_DATA;
> + else if (head->is_system)
> + flags = BTRFS_BLOCK_GROUP_SYSTEM;
> + else
> + flags = BTRFS_BLOCK_GROUP_METADATA;
> + space_info = __find_space_info(fs_info, flags);
> + ASSERT(space_info);
> + percpu_counter_add_batch(&space_info->total_bytes_pinned,
> +-head->num_bytes,
> +BTRFS_TOTAL_BYTES_PINNED_BATCH);
> +
> + if (head->is_data) {
> + spin_lock(&delayed_refs->lock);
> + delayed_refs->pending_csums -= head->num_bytes;
> + spin_unlock(&delayed_refs->lock);
> + }
> + }
> +
> + /* Also free its reserved qgroup space */
> + btrfs_qgroup_free_delayed_ref(fs_info, head->qgroup_ref_root,
> +   head->qgroup_reserved);
> +}
> +
>  static int cleanup_ref_head(struct btrfs_trans_handle *trans,
>   struct btrfs_delayed_ref_head *head)
>  {
> @@ -2478,31 +2513,6 @@ static int cleanup_ref_head(struct btrfs_trans_handle 
> *trans,
>   spin_unlock(&head->lock);
>   spin_unlock(&delayed_refs->lock);
>  
> - trace_run_delayed_ref_head(fs_info, head, 0);
> -
> - if (head->total_ref_mod < 0) {
> - struct btrfs_space_info *space_info;
> - u64 flags;
> -
> - if (head->is_data)
> - flags = BTRFS_BLOCK_GROUP_DATA;
> - else if (head->is_system)
> - flags = BTRFS_BLOCK_GROUP_SYSTEM;
> - else
> - flags = BTRFS_BLOCK_GROUP_METADATA;
> - space_info = __find_space_info(fs_info, flags);
> - ASSERT(space_info);
> - percpu_counter_add_batch(&space_info->total_bytes_pinned,
> --head->num_bytes,
> -BTRFS_TOTAL_BYTES_PINNED_BATCH);
> -
> - if (head->is_data) {
> - spin_lock(&delayed_refs->lock);
> - delayed_refs->pending_csums -= head->num_bytes;
> - spin_unlock(&delayed_refs->lock);
> - }
> - }
> -
>   if (head->must_insert_reserved) {
>   btrfs_pin_extent(fs_info, head->bytenr,
>head->num_bytes, 1);
> @@ -2512,9 +2522,9 @@ static int cleanup_ref_head(struct btrfs_trans_handle 
> *trans,
>   }
>   }
>  
> - /* Also free its reserved qgroup space */
> - btrfs_qgroup_free_delayed_ref(fs_info, head->qgroup_ref_root,
> -   head->qgroup_reserved);
> + cleanup_ref_head_accounting(trans, head);
> +
> + trace_run_delayed_ref_head(fs_info, head, 0);
>   btrfs_delayed_ref_unlock(head);
>   btrfs_put_delayed_ref_head(head);
>   return 0;
> @@ -6991,6 +7001,7 @@ static noinline int check_ref_cleanup(struct 
> btrfs_trans_handle *trans,
>   if (head->must_insert_reserved)
>   ret = 1;
>  
> + cleanup_ref_head_accounting(trans, head);
>   mutex_unlock(&head->mutex);
>   btrfs_put_delayed_ref_head(head);
>   return ret;
> 


Re: [PATCH 01/10] btrfs: add btrfs_delete_ref_head helper

2018-12-06 Thread Nikolay Borisov



On 3.12.18 г. 17:20 ч., Josef Bacik wrote:
> From: Josef Bacik 
> 
> We do this dance in cleanup_ref_head and check_ref_cleanup, unify it
> into a helper and clean up the calling functions.
> 
> Signed-off-by: Josef Bacik 
> Reviewed-by: Omar Sandoval 

Reviewed-by: Nikolay Borisov 

> ---
>  fs/btrfs/delayed-ref.c | 14 ++
>  fs/btrfs/delayed-ref.h |  3 ++-
>  fs/btrfs/extent-tree.c | 22 +++---
>  3 files changed, 19 insertions(+), 20 deletions(-)
> 
> diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c
> index 9301b3ad9217..b3e4c9fcb664 100644
> --- a/fs/btrfs/delayed-ref.c
> +++ b/fs/btrfs/delayed-ref.c
> @@ -400,6 +400,20 @@ struct btrfs_delayed_ref_head *btrfs_select_ref_head(
>   return head;
>  }
>  
> +void btrfs_delete_ref_head(struct btrfs_delayed_ref_root *delayed_refs,
> +struct btrfs_delayed_ref_head *head)
> +{
> + lockdep_assert_held(&delayed_refs->lock);
> + lockdep_assert_held(&head->lock);
> +
> + rb_erase_cached(&head->href_node, &delayed_refs->href_root);
> + RB_CLEAR_NODE(&head->href_node);
> + atomic_dec(&delayed_refs->num_entries);
> + delayed_refs->num_heads--;
> + if (head->processing == 0)
> + delayed_refs->num_heads_ready--;
> +}
> +
>  /*
>   * Helper to insert the ref_node to the tail or merge with tail.
>   *
> diff --git a/fs/btrfs/delayed-ref.h b/fs/btrfs/delayed-ref.h
> index 8e20c5cb5404..d2af974f68a1 100644
> --- a/fs/btrfs/delayed-ref.h
> +++ b/fs/btrfs/delayed-ref.h
> @@ -261,7 +261,8 @@ static inline void btrfs_delayed_ref_unlock(struct 
> btrfs_delayed_ref_head *head)
>  {
>   mutex_unlock(&head->mutex);
>  }
> -
> +void btrfs_delete_ref_head(struct btrfs_delayed_ref_root *delayed_refs,
> +    struct btrfs_delayed_ref_head *head);
>  
>  struct btrfs_delayed_ref_head *btrfs_select_ref_head(
>   struct btrfs_delayed_ref_root *delayed_refs);
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index d242a1174e50..c36b3a42f2bb 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -2474,12 +2474,9 @@ static int cleanup_ref_head(struct btrfs_trans_handle 
> *trans,
>   spin_unlock(&delayed_refs->lock);
>   return 1;
>   }
> - delayed_refs->num_heads--;
> - rb_erase_cached(&head->href_node, &delayed_refs->href_root);
> - RB_CLEAR_NODE(&head->href_node);
> + btrfs_delete_ref_head(delayed_refs, head);
>   spin_unlock(&head->lock);
>   spin_unlock(&delayed_refs->lock);
> - atomic_dec(&delayed_refs->num_entries);
>  
>   trace_run_delayed_ref_head(fs_info, head, 0);
>  
> @@ -6984,22 +6981,9 @@ static noinline int check_ref_cleanup(struct 
> btrfs_trans_handle *trans,
>   if (!mutex_trylock(&head->mutex))
>   goto out;
>  
> - /*
> -  * at this point we have a head with no other entries.  Go
> -  * ahead and process it.
> -  */
> - rb_erase_cached(&head->href_node, &delayed_refs->href_root);
> - RB_CLEAR_NODE(&head->href_node);
> - atomic_dec(&delayed_refs->num_entries);
> -
> - /*
> -  * we don't take a ref on the node because we're removing it from the
> -  * tree, so we just steal the ref the tree was holding.
> -  */
> - delayed_refs->num_heads--;
> - if (head->processing == 0)
> - delayed_refs->num_heads_ready--;
> + btrfs_delete_ref_head(delayed_refs, head);
>   head->processing = 0;
> +
>   spin_unlock(&head->lock);
>   spin_unlock(&delayed_refs->lock);
>  
> 


Re: [PATCH 00/10] btrfs: Support for DAX devices

2018-12-06 Thread Goldwyn Rodrigues
On 11:07 06/12, Johannes Thumshirn wrote:
> On 05/12/2018 13:28, Goldwyn Rodrigues wrote:
> > This is a support for DAX in btrfs. I understand there have been
> > previous attempts at it. However, I wanted to make sure copy-on-write
> > (COW) works on dax as well.
> > 
> > Before I present this to the FS folks I wanted to run this through the
> > btrfs list. Even though I wish I could, I cannot get it correct the first time
> > around :/.. Here are some questions for which I need suggestions:
> 
> Hi Goldwyn,
> 
> I've thrown your patches (from your git tree) onto one of my pmem test
> machines with this pmem config:

Thanks. I will check on this. Ordered extents have been a pain to deal
with for me (though mainly because of my incorrect usage).

> 
> mayhem:~/:[0]# ndctl list
> [
>   {
> "dev":"namespace1.0",
> "mode":"fsdax",
> "map":"dev",
> "size":792721358848,
> "uuid":"3fd4ab18-5145-4675-85a0-e05e6f9bcee4",
> "raw_uuid":"49264743-2351-41c5-9db9-38534813df61",
> "sector_size":512,
> "blockdev":"pmem1",
> "numa_node":1
>   },
>   {
> "dev":"namespace0.0",
> "mode":"fsdax",
> "map":"dev",
> "size":792721358848,
> "uuid":"dd0aec3c-7721-4621-8898-e50684a371b5",
> "raw_uuid":"84ff5463-f76e-4ddf-a248-85122541e909",
> "sector_size":4096,
> "blockdev":"pmem0",
> "numa_node":0
>   }
> ]
> 
> Unfortunately I hit a btrfs_panic() with btrfs/002.
> export TEST_DEV=/dev/pmem0
> export SCRATCH_DEV=/dev/pmem1
> export MOUNT_OPTIONS="-o dax"
> ./check
> [...]
> [  178.173113] run fstests btrfs/002 at 2018-12-06 10:55:43
> [  178.357044] BTRFS info (device pmem0): disk space caching is enabled
> [  178.357047] BTRFS info (device pmem0): has skinny extents
> [  178.360042] BTRFS info (device pmem0): enabling ssd optimizations
> [  178.475918] BTRFS: device fsid ee888255-7f4a-4bf7-af65-e8a6a354aca8
> devid 1 transid 3 /dev/pmem1
> [  178.505717] BTRFS info (device pmem1): disk space caching is enabled
> [  178.513593] BTRFS info (device pmem1): has skinny extents
> [  178.520384] BTRFS info (device pmem1): flagging fs with big metadata
> feature
> [  178.530997] BTRFS info (device pmem1): enabling ssd optimizations
> [  178.538331] BTRFS info (device pmem1): creating UUID tree
> [  178.587200] BTRFS critical (device pmem1): panic in
> ordered_data_tree_panic:57: Inconsistency in ordered tree at offset 0
> (errno=-17 Object already exists)
> [  178.603129] [ cut here ]
> [  178.608667] kernel BUG at fs/btrfs/ordered-data.c:57!
> [  178.614333] invalid opcode:  [#1] SMP PTI
> [  178.619295] CPU: 87 PID: 8225 Comm: dd Kdump: loaded Tainted: G
>   E 4.20.0-rc5-default-btrfs-dax #920
> [  178.630090] Hardware name: Intel Corporation PURLEY/PURLEY, BIOS
> SE5C620.86B.0D.01.0010.072020182008 07/20/2018
> [  178.640626] RIP: 0010:__btrfs_add_ordered_extent+0x325/0x410 [btrfs]
> [  178.647404] Code: 28 4d 89 f1 49 c7 c0 90 9c 57 c0 b9 ef ff ff ff ba
> 39 00 00 00 48 c7 c6 10 fe 56 c0 48 8b b8 d8 03 00 00 31 c0 e8 e2 99 06
> 00 <0f> 0b 65 8b 05 d2 e4 b0 3f 89 c0 48 0f a3 05 78 5e cf c2 0f 92 c0
> [  178.667019] RSP: 0018:a3e3674c7ba8 EFLAGS: 00010096
> [  178.672684] RAX: 008f RBX: 9770c2ac5748 RCX:
> 
> [  178.680254] RDX: 97711f9dee80 RSI: 97711f9d6868 RDI:
> 97711f9d6868
> [  178.687831] RBP: 97711d523000 R08:  R09:
> 065a
> [  178.695411] R10: 03ff R11: 0001 R12:
> 97710d66da70
> [  178.702993] R13: 9770c2ac5600 R14:  R15:
> 97710d66d9c0
> [  178.710573] FS:  7fe11ef90700() GS:97711f9c()
> knlGS:
> [  178.719122] CS:  0010 DS:  ES:  CR0: 80050033
> [  178.725380] CR2: 0156a000 CR3: 00eb30dfc006 CR4:
> 007606e0
> [  178.732999] DR0:  DR1:  DR2:
> 
> [  178.740574] DR3:  DR6: fffe0ff0 DR7:
> 0400
> [  178.748147] PKRU: 5554
> [  178.751297] Call Trace:
> [  178.754230]  btrfs_add_ordered_extent_dio+0x1d/0x30 [btrfs]
> [  178.760269]  btrfs_create_dio_extent+0x79/0xe0 [btrfs]
> [  178.765930]  btrfs_get_extent_map_write+0x1a9/0x2b0 [btrfs]
> [  178.771959]  btrfs_file_dax_write+0x1f8/0x4f0 [btrfs]
> [  178.777508]  ? current_t

Re: [PATCH 07/10] dax: export functions for use with btrfs

2018-12-06 Thread Goldwyn Rodrigues
On  6:52 05/12, Christoph Hellwig wrote:
> If you want to export these at all they have to be EXPORT_SYMBOL_GPL.
> 

Understood.

> But I'd really like to avoid seeing another duplicate DAX I/O path.
> Please try to adopt the existing iomap-based infrastructure for your
> needs first.

This is not worthwhile with btrfs. With non-page-aligned I/O on btrfs, we
need to copy the first/last page of the extents for CoW. So, we
would end up using the exported functions anyways. Believe me, I have
spent some time getting btrfs iomap compatible before giving up. The
problems are btrfs needs to carry a lot of information across
iomap_begin and iomap_end. While the added private variable helps in
this, it also needs hooks in bio_submit() functions for crc calculations
during direct writes.

-- 
Goldwyn


HELP unmountable partition after btrfs balance to RAID0

2018-12-06 Thread Thomas Mohr

Dear developers of BTRFS,

we have a problem. We wanted to convert a file system to a RAID0 with 
two partitions. Unfortunately we had to reboot the server during the 
balance operation before it could complete.


Now the following happens:

A mount attempt of the array fails with the following errors:

btrfs recover yields roughly 1.6 out of 4 TB.

To recover the rest we have tried:

mount:

[18192.357444] BTRFS info (device sdb1): disk space caching is enabled
[18192.357447] BTRFS info (device sdb1): has skinny extents
[18192.370664] BTRFS error (device sdb1): parent transid verify failed 
on 30523392 wanted 7432 found 7445
[18192.370810] BTRFS error (device sdb1): parent transid verify failed 
on 30523392 wanted 7432 found 7445

[18192.394745] BTRFS error (device sdb1): open_ctree failed

mounting with options ro, degraded, cache_clear etc yields the same errors.


btrfs rescue zero-log: the operation works; however, the error persists
and the array remains unmountable.


parent transid verify failed on 59768832 wanted 7422 found 7187
parent transid verify failed on 59768832 wanted 7422 found 7187
parent transid verify failed on 59768832 wanted 7422 found 7187
parent transid verify failed on 59768832 wanted 7422 found 7187
Ignoring transid failure
parent transid verify failed on 30408704 wanted 7430 found 7443
parent transid verify failed on 30408704 wanted 7430 found 7443
parent transid verify failed on 30408704 wanted 7430 found 7443
parent transid verify failed on 30408704 wanted 7430 found 7443
Ignoring transid failure
Clearing log on /dev/sdb1, previous log_root 0, level 0

btrfs rescue chunk-recover fails with the following error message:

btrfs check results in:

Opening filesystem to check...
parent transid verify failed on 59768832 wanted 7422 found 7187
parent transid verify failed on 59768832 wanted 7422 found 7187
parent transid verify failed on 59768832 wanted 7422 found 7187
parent transid verify failed on 59768832 wanted 7422 found 7187
Ignoring transid failure
parent transid verify failed on 30408704 wanted 7430 found 7443
parent transid verify failed on 30408704 wanted 7430 found 7443
parent transid verify failed on 30408704 wanted 7430 found 7443
parent transid verify failed on 30408704 wanted 7430 found 7443
Ignoring transid failure
Checking filesystem on /dev/sdb1
UUID: 6c9ed4e1-d63f-46f0-b1e9-608b8fa43bb8
[1/7] checking root items
parent transid verify failed on 30523392 wanted 7432 found 7443
parent transid verify failed on 30523392 wanted 7432 found 7443
parent transid verify failed on 30523392 wanted 7432 found 7443
parent transid verify failed on 30523392 wanted 7432 found 7443
Ignoring transid failure
leaf parent key incorrect 30523392
ERROR: failed to repair root items: Operation not permitted


Any ideas what is going on or how to recover the file system? I would
greatly appreciate your help!!!


best,

Thomas


uname -a:

Linux server2 4.19.5-1-default #1 SMP PREEMPT Tue Nov 27 19:56:09 UTC 
2018 (6210279) x86_64 x86_64 x86_64 GNU/Linux


btrfs-progs version 4.19


--
ScienceConsult - DI Thomas Mohr KG
DI Thomas Mohr
Enzianweg 10a
2353 Guntramsdorf
Austria
+43 2236 56793
+43 660 461 1966
http://www.mohrkeg.co.at



Re: [PATCH 00/10] btrfs: Support for DAX devices

2018-12-06 Thread Johannes Thumshirn
On 05/12/2018 13:28, Goldwyn Rodrigues wrote:
> This is a support for DAX in btrfs. I understand there have been
> previous attempts at it. However, I wanted to make sure copy-on-write
> (COW) works on dax as well.
> 
> Before I present this to the FS folks I wanted to run this through the
> btrfs list. Even though I wish I could, I cannot get it correct the first time
> around :/.. Here are some questions for which I need suggestions:

Hi Goldwyn,

I've thrown your patches (from your git tree) onto one of my pmem test
machines with this pmem config:

mayhem:~/:[0]# ndctl list
[
  {
"dev":"namespace1.0",
"mode":"fsdax",
"map":"dev",
"size":792721358848,
"uuid":"3fd4ab18-5145-4675-85a0-e05e6f9bcee4",
"raw_uuid":"49264743-2351-41c5-9db9-38534813df61",
"sector_size":512,
"blockdev":"pmem1",
"numa_node":1
  },
  {
"dev":"namespace0.0",
"mode":"fsdax",
"map":"dev",
"size":792721358848,
"uuid":"dd0aec3c-7721-4621-8898-e50684a371b5",
"raw_uuid":"84ff5463-f76e-4ddf-a248-85122541e909",
"sector_size":4096,
"blockdev":"pmem0",
"numa_node":0
  }
]

Unfortunately I hit a btrfs_panic() with btrfs/002.
export TEST_DEV=/dev/pmem0
export SCRATCH_DEV=/dev/pmem1
export MOUNT_OPTIONS="-o dax"
./check
[...]
[  178.173113] run fstests btrfs/002 at 2018-12-06 10:55:43
[  178.357044] BTRFS info (device pmem0): disk space caching is enabled
[  178.357047] BTRFS info (device pmem0): has skinny extents
[  178.360042] BTRFS info (device pmem0): enabling ssd optimizations
[  178.475918] BTRFS: device fsid ee888255-7f4a-4bf7-af65-e8a6a354aca8
devid 1 transid 3 /dev/pmem1
[  178.505717] BTRFS info (device pmem1): disk space caching is enabled
[  178.513593] BTRFS info (device pmem1): has skinny extents
[  178.520384] BTRFS info (device pmem1): flagging fs with big metadata
feature
[  178.530997] BTRFS info (device pmem1): enabling ssd optimizations
[  178.538331] BTRFS info (device pmem1): creating UUID tree
[  178.587200] BTRFS critical (device pmem1): panic in
ordered_data_tree_panic:57: Inconsistency in ordered tree at offset 0
(errno=-17 Object already exists)
[  178.603129] [ cut here ]
[  178.608667] kernel BUG at fs/btrfs/ordered-data.c:57!
[  178.614333] invalid opcode:  [#1] SMP PTI
[  178.619295] CPU: 87 PID: 8225 Comm: dd Kdump: loaded Tainted: G
  E 4.20.0-rc5-default-btrfs-dax #920
[  178.630090] Hardware name: Intel Corporation PURLEY/PURLEY, BIOS
SE5C620.86B.0D.01.0010.072020182008 07/20/2018
[  178.640626] RIP: 0010:__btrfs_add_ordered_extent+0x325/0x410 [btrfs]
[  178.647404] Code: 28 4d 89 f1 49 c7 c0 90 9c 57 c0 b9 ef ff ff ff ba
39 00 00 00 48 c7 c6 10 fe 56 c0 48 8b b8 d8 03 00 00 31 c0 e8 e2 99 06
00 <0f> 0b 65 8b 05 d2 e4 b0 3f 89 c0 48 0f a3 05 78 5e cf c2 0f 92 c0
[  178.667019] RSP: 0018:a3e3674c7ba8 EFLAGS: 00010096
[  178.672684] RAX: 008f RBX: 9770c2ac5748 RCX:

[  178.680254] RDX: 97711f9dee80 RSI: 97711f9d6868 RDI:
97711f9d6868
[  178.687831] RBP: 97711d523000 R08:  R09:
065a
[  178.695411] R10: 03ff R11: 0001 R12:
97710d66da70
[  178.702993] R13: 9770c2ac5600 R14:  R15:
97710d66d9c0
[  178.710573] FS:  7fe11ef90700() GS:97711f9c()
knlGS:
[  178.719122] CS:  0010 DS:  ES:  CR0: 80050033
[  178.725380] CR2: 0156a000 CR3: 00eb30dfc006 CR4:
007606e0
[  178.732999] DR0:  DR1:  DR2:

[  178.740574] DR3: 0000 DR6: fffe0ff0 DR7:
0400
[  178.748147] PKRU: 5554
[  178.751297] Call Trace:
[  178.754230]  btrfs_add_ordered_extent_dio+0x1d/0x30 [btrfs]
[  178.760269]  btrfs_create_dio_extent+0x79/0xe0 [btrfs]
[  178.765930]  btrfs_get_extent_map_write+0x1a9/0x2b0 [btrfs]
[  178.771959]  btrfs_file_dax_write+0x1f8/0x4f0 [btrfs]
[  178.777508]  ? current_time+0x3f/0x70
[  178.781672]  btrfs_file_write_iter+0x384/0x580 [btrfs]
[  178.787265]  ? pipe_read+0x243/0x2a0
[  178.791298]  __vfs_write+0xee/0x170
[  178.795241]  vfs_write+0xad/0x1a0
[  178.799008]  ? vfs_read+0x111/0x130
[  178.802949]  ksys_write+0x42/0x90
[  178.806712]  do_syscall_64+0x5b/0x180
[  178.810829]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  178.816334] RIP: 0033:0x7fe11eabb3d0
[  178.820364] Code: 73 01 c3 48 8b 0d b8 ea 2b 00 f7 d8 64 89 01 48 83
c8 ff c3 66 0f 1f 44 00 00 83 3d b9 43 2c 00 00 75 10 b8 01 00 00 00 0f
05 <48> 3d 01 f0 ff ff 73 31 c3 48 83 ec 08 e8 2e 90 01 00 48 89 04 24
[  178.840052] RSP: 002b:7ffec969d978 EF

Re: [PATCH 00/10] btrfs: Support for DAX devices

2018-12-05 Thread Robert White

On 12/5/18 9:37 PM, Jeff Mahoney wrote:
> The high level idea that Jan Kara and I came up with in our conversation
> at Labs conf is pretty expensive.  We'd need to set a flag that pauses
> new page faults, set the WP bit on affected ranges, do the snapshot,
> commit, clear the flag, and wake up the waiting threads.  Neither of us
> had any concrete idea of how well that would perform and it still
> depends on finding a good way to resolve all open mmap ranges on a
> subvolume.  Perhaps using the address_space->private_list anchored on
> each root would work.


This is a potentially wild idea, so "grain of salt" and all that. I may 
misuse the exact wording.


So the essential problem of DAX is basically the opposite of 
data-deduplication. Instead of merging two duplicate data regions, you 
want to mark regions as at-risk while keeping the original content 
intact if there are snapshots in conflict.


So suppose you _require_ data checksums and data mode of "dup" or mirror 
or one of the other fault tolerant layouts.


By definition any block that gets written with content that it didn't 
have before will now have a bad checksum.


If the inode is flagged for direct IO that's an indication that the 
block has been updated.


At this point you really just need to do the opposite of deduplication, 
as in find/recover the original contents and assign/leave assigned those 
to the old/other snapshots, then compute the new checksum on the 
"original block" and assign it to the active subvolume.


So when a region is mapped for direct IO, and its refcount is greater 
than one, and you get to a sync or close event, you "recover" the old 
contents into a new location and assign those to "all the other users". 
Now that original storage region has only one user, so on sync or close 
you fix its checksums on the cheap.


Instead of the new data being a small rock sitting over a large rug to 
make a lump, the new data is like a rock being slid under the rug to 
make a lump.


So the first write to an extent creates a burdensome copy to retain the 
old contents, but second and subsequent writes to the same extent only 
have the cost of an _eventual_ checksum of the original block list.


Maybe if the data isn't already duplicated then the write mapping or the 
DAX open or the setting of the S_DUP flag could force the file into an 
extent block that _is_ duplicated.


The mental leap required is that the new blocks don't need to belong to 
the new state being created. The new blocks can be associated with the 
snapshots since data copy is idempotent.


The side note is that it only ever matters if the usage count is greater 
than one, so at worst taking a snapshot, which is already a _little_ 
racy anyway, would/could trigger a semi-lightweight copy of any S_DAX files:


If S_DAX :
    If checksum invalid :
        copy data as-is and checksum, store in snapshot
    else :
        look for duplicate checksum
        if duplicate found :
            assign that extent to the snapshot
        else :
            If file opened for writing and has any mmaps for write :
                copy extent and assign to new snapshot.
            else :
                increment usage count and assign current block to snapshot
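
Rendered as C for concreteness (purely illustrative: every name below is
hypothetical and none of it corresponds to a real btrfs API):

    enum snap_action {
            COPY_AND_CHECKSUM,  /* copy data as-is, checksum, store in snapshot */
            SHARE_DUPLICATE,    /* an identical extent already exists */
            COPY_EXTENT,        /* live writers: give the snapshot its own copy */
            BUMP_REFCOUNT,      /* safe to share the current block */
    };

    struct extent_state {               /* hypothetical state flags */
            int csum_valid;
            int has_duplicate;
            int open_write_with_mmap;
    };

    static enum snap_action snapshot_dax_extent(const struct extent_state *st)
    {
            if (!st->csum_valid)
                    return COPY_AND_CHECKSUM;
            if (st->has_duplicate)
                    return SHARE_DUPLICATE;
            if (st->open_write_with_mmap)
                    return COPY_EXTENT;
            return BUMP_REFCOUNT;
    }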

Anyway, I only know enough of the internals to be dangerous.

Since the real goal of mmap is speed during actual update, this idea is 
basically about amortizing the copy costs into the task of maintaining 
the snapshots instead of leaving them in the immediate hands of the 
time-critical updater.


The flush, unmmap, or close by the user, or a system-wide sync event, 
are also good points to expense the bookeeping time.


[PATCH 4/8] btrfs: delayed-ref: Use btrfs_ref to refactor btrfs_add_delayed_data_ref()

2018-12-05 Thread Qu Wenruo
Just like btrfs_add_delayed_tree_ref(), use btrfs_ref to refactor
btrfs_add_delayed_data_ref().

Signed-off-by: Qu Wenruo 
---
 fs/btrfs/delayed-ref.c | 19 +--
 fs/btrfs/delayed-ref.h |  8 +++-
 fs/btrfs/extent-tree.c | 24 +++-
 3 files changed, 27 insertions(+), 24 deletions(-)

diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c
index c42b8ade7b07..09caf1e6fc22 100644
--- a/fs/btrfs/delayed-ref.c
+++ b/fs/btrfs/delayed-ref.c
@@ -800,21 +800,27 @@ int btrfs_add_delayed_tree_ref(struct btrfs_trans_handle 
*trans,
  * add a delayed data ref. it's similar to btrfs_add_delayed_tree_ref.
  */
 int btrfs_add_delayed_data_ref(struct btrfs_trans_handle *trans,
-  struct btrfs_root *root,
-  u64 bytenr, u64 num_bytes,
-  u64 parent, u64 ref_root,
-  u64 owner, u64 offset, u64 reserved, int action,
-  int *old_ref_mod, int *new_ref_mod)
+  struct btrfs_ref *generic_ref,
+  u64 reserved, int *old_ref_mod,
+  int *new_ref_mod)
 {
struct btrfs_fs_info *fs_info = trans->fs_info;
struct btrfs_delayed_data_ref *ref;
struct btrfs_delayed_ref_head *head_ref;
struct btrfs_delayed_ref_root *delayed_refs;
struct btrfs_qgroup_extent_record *record = NULL;
+   int action = generic_ref->action;
int qrecord_inserted;
int ret;
+   u64 bytenr = generic_ref->bytenr;
+   u64 num_bytes = generic_ref->len;
+   u64 parent = generic_ref->parent;
+   u64 ref_root = generic_ref->data_ref.ref_root;
+   u64 owner = generic_ref->data_ref.ino;
+   u64 offset = generic_ref->data_ref.offset;
u8 ref_type;
 
+   ASSERT(generic_ref && generic_ref->type == BTRFS_REF_DATA && action);
ref = kmem_cache_alloc(btrfs_delayed_data_ref_cachep, GFP_NOFS);
if (!ref)
return -ENOMEM;
@@ -838,7 +844,8 @@ int btrfs_add_delayed_data_ref(struct btrfs_trans_handle 
*trans,
}
 
if (test_bit(BTRFS_FS_QUOTA_ENABLED, &fs_info->flags) &&
-   is_fstree(ref_root) && is_fstree(root->root_key.objectid)) {
+   is_fstree(ref_root) && is_fstree(generic_ref->real_root) &&
+   !generic_ref->skip_qgroup) {
record = kzalloc(sizeof(*record), GFP_NOFS);
if (!record) {
    kmem_cache_free(btrfs_delayed_data_ref_cachep, ref);
diff --git a/fs/btrfs/delayed-ref.h b/fs/btrfs/delayed-ref.h
index dbe029c4e01b..a8fde33b43fd 100644
--- a/fs/btrfs/delayed-ref.h
+++ b/fs/btrfs/delayed-ref.h
@@ -337,11 +337,9 @@ int btrfs_add_delayed_tree_ref(struct btrfs_trans_handle 
*trans,
   struct btrfs_delayed_extent_op *extent_op,
   int *old_ref_mod, int *new_ref_mod);
 int btrfs_add_delayed_data_ref(struct btrfs_trans_handle *trans,
-  struct btrfs_root *root,
-  u64 bytenr, u64 num_bytes,
-  u64 parent, u64 ref_root,
-  u64 owner, u64 offset, u64 reserved, int action,
-  int *old_ref_mod, int *new_ref_mod);
+  struct btrfs_ref *generic_ref,
+  u64 reserved, int *old_ref_mod,
+  int *new_ref_mod);
 int btrfs_add_delayed_extent_op(struct btrfs_fs_info *fs_info,
struct btrfs_trans_handle *trans,
    u64 bytenr, u64 num_bytes,
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index ecfa0234863b..fa5dd3dfe2e7 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -2049,10 +2049,8 @@ int btrfs_inc_extent_ref(struct btrfs_trans_handle 
*trans,
ret = btrfs_add_delayed_tree_ref(trans, &generic_ref,
NULL, &old_ref_mod, &new_ref_mod);
} else {
-   ret = btrfs_add_delayed_data_ref(trans, root, bytenr,
-num_bytes, parent,
-root_objectid, owner, offset,
-0, BTRFS_ADD_DELAYED_REF,
+   btrfs_init_data_ref(&generic_ref, root_objectid, owner, offset);
+   ret = btrfs_add_delayed_data_ref(trans, &generic_ref, 0,
 &old_ref_mod, &new_ref_mod);
}
 
@@ -7114,10 +7112,8 @@ int btrfs_free_extent(struct btrfs_trans_handle *trans,
ret = btrfs_add_delayed_tree_ref(trans, &generic_ref, NULL,
 &old_ref_mod, &new_ref_mod);
} else {
-   ret = btrfs_add_delayed_data_ref(

[PATCH 8/8] btrfs: extent-tree: Use btrfs_ref to refactor btrfs_free_extent()

2018-12-05 Thread Qu Wenruo
Similar to btrfs_inc_extent_ref(), just use btrfs_ref to replace the
long parameter list and the confusing @owner parameter.

Signed-off-by: Qu Wenruo 
---
 fs/btrfs/ctree.h   |  5 +---
 fs/btrfs/extent-tree.c | 53 ++
 fs/btrfs/file.c| 23 ++
 fs/btrfs/inode.c   | 13 +++
 fs/btrfs/relocation.c  | 26 +
 5 files changed, 62 insertions(+), 58 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index db3df5ce6087..9ed55a29993d 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -2671,10 +2671,7 @@ int btrfs_set_disk_extent_flags(struct 
btrfs_trans_handle *trans,
struct btrfs_fs_info *fs_info,
u64 bytenr, u64 num_bytes, u64 flags,
int level, int is_data);
-int btrfs_free_extent(struct btrfs_trans_handle *trans,
- struct btrfs_root *root,
- u64 bytenr, u64 num_bytes, u64 parent, u64 root_objectid,
- u64 owner, u64 offset, bool for_reloc);
+int btrfs_free_extent(struct btrfs_trans_handle *trans, struct btrfs_ref *ref);
 
 int btrfs_free_reserved_extent(struct btrfs_fs_info *fs_info,
   u64 start, u64 len, int delalloc);
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index ff60091aef6b..8a6a73006dc4 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -3255,10 +3255,7 @@ static int __btrfs_mod_ref(struct btrfs_trans_handle 
*trans,
if (inc)
ret = btrfs_inc_extent_ref(trans, &generic_ref);
else
-   ret = btrfs_free_extent(trans, root, bytenr,
-   num_bytes, parent, ref_root,
-   key.objectid, key.offset,
-   for_reloc);
+   ret = btrfs_free_extent(trans, &generic_ref);
if (ret)
goto fail;
} else {
@@ -3272,9 +3269,7 @@ static int __btrfs_mod_ref(struct btrfs_trans_handle 
*trans,
if (inc)
ret = btrfs_inc_extent_ref(trans, &generic_ref);
else
-   ret = btrfs_free_extent(trans, root, bytenr,
-   num_bytes, parent, ref_root,
-   level - 1, 0, for_reloc);
ret = btrfs_free_extent(trans, &generic_ref);
if (ret)
goto fail;
}
@@ -7073,47 +7068,43 @@ void btrfs_free_tree_block(struct btrfs_trans_handle 
*trans,
 }
 
 /* Can return -ENOMEM */
-int btrfs_free_extent(struct btrfs_trans_handle *trans,
- struct btrfs_root *root,
- u64 bytenr, u64 num_bytes, u64 parent, u64 root_objectid,
- u64 owner, u64 offset, bool for_reloc)
+int btrfs_free_extent(struct btrfs_trans_handle *trans, struct btrfs_ref *ref)
 {
-   struct btrfs_fs_info *fs_info = root->fs_info;
-   struct btrfs_ref generic_ref = { 0 };
+   struct btrfs_fs_info *fs_info = trans->fs_info;
int old_ref_mod, new_ref_mod;
int ret;
 
if (btrfs_is_testing(fs_info))
return 0;
 
-   btrfs_init_generic_ref(&generic_ref, BTRFS_DROP_DELAYED_REF, bytenr,
-  num_bytes, root->root_key.objectid, parent);
-   generic_ref.skip_qgroup = for_reloc;
/*
 * tree log blocks never actually go into the extent allocation
 * tree, just update pinning info and exit early.
 */
-   if (root_objectid == BTRFS_TREE_LOG_OBJECTID) {
-   WARN_ON(owner >= BTRFS_FIRST_FREE_OBJECTID);
+   if ((ref->type == BTRFS_REF_METADATA &&
+ref->tree_ref.root == BTRFS_TREE_LOG_OBJECTID) ||
+   (ref->type == BTRFS_REF_DATA &&
+ref->data_ref.ref_root == BTRFS_TREE_LOG_OBJECTID)) {
/* unlocks the pinned mutex */
-   btrfs_pin_extent(fs_info, bytenr, num_bytes, 1);
+   btrfs_pin_extent(fs_info, ref->bytenr, ref->len, 1);
old_ref_mod = new_ref_mod = 0;
ret = 0;
-   } else if (owner < BTRFS_FIRST_FREE_OBJECTID) {
-   btrfs_init_tree_ref(&generic_ref, (int)owner, root_objectid);
-   ret = btrfs_add_delayed_tree_ref(trans, &generic_ref, NULL,
+   } else if (ref->type == BTRFS_REF_METADATA) {
+   ret = btrfs_add_delayed_tree_ref(trans, ref, NULL,
 &old_ref_mod, &new_ref_mod);
} else {
-   btrfs_init_data_ref(&generic_ref, root_objectid, owner, offset);
-   ret = btrfs_add_

[PATCH 2/8] btrfs: extent-tree: Open-code process_func in __btrfs_mod_ref

2018-12-05 Thread Qu Wenruo
process_func is never used as a function hook anywhere else.

Open code it to make later delayed ref refactor easier, so we can
refactor btrfs_inc_extent_ref() and btrfs_free_extent() in different
patches.

Signed-off-by: Qu Wenruo 
---
 fs/btrfs/extent-tree.c | 33 ++---
 1 file changed, 18 insertions(+), 15 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index ea2c3d5220f0..ea68d288d761 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -3220,10 +3220,6 @@ static int __btrfs_mod_ref(struct btrfs_trans_handle 
*trans,
int i;
int level;
int ret = 0;
-   int (*process_func)(struct btrfs_trans_handle *,
-   struct btrfs_root *,
-   u64, u64, u64, u64, u64, u64, bool);
-
 
if (btrfs_is_testing(fs_info))
return 0;
@@ -3235,11 +3231,6 @@ static int __btrfs_mod_ref(struct btrfs_trans_handle 
*trans,
if (!test_bit(BTRFS_ROOT_REF_COWS, &root->state) && level == 0)
return 0;
 
-   if (inc)
-   process_func = btrfs_inc_extent_ref;
-   else
-   process_func = btrfs_free_extent;
-
if (full_backref)
parent = buf->start;
else
@@ -3261,17 +3252,29 @@ static int __btrfs_mod_ref(struct btrfs_trans_handle 
*trans,
 
num_bytes = btrfs_file_extent_disk_num_bytes(buf, fi);
key.offset -= btrfs_file_extent_offset(buf, fi);
-   ret = process_func(trans, root, bytenr, num_bytes,
-  parent, ref_root, key.objectid,
-  key.offset, for_reloc);
+   if (inc)
+   ret = btrfs_inc_extent_ref(trans, root, bytenr,
+   num_bytes, parent, ref_root,
+   key.objectid, key.offset,
+   for_reloc);
+   else
+   ret = btrfs_free_extent(trans, root, bytenr,
+   num_bytes, parent, ref_root,
+   key.objectid, key.offset,
+   for_reloc);
if (ret)
goto fail;
} else {
bytenr = btrfs_node_blockptr(buf, i);
num_bytes = fs_info->nodesize;
-   ret = process_func(trans, root, bytenr, num_bytes,
-  parent, ref_root, level - 1, 0,
-  for_reloc);
+   if (inc)
+   ret = btrfs_inc_extent_ref(trans, root, bytenr,
+   num_bytes, parent, ref_root,
+   level - 1, 0, for_reloc);
+   else
+   ret = btrfs_free_extent(trans, root, bytenr,
+   num_bytes, parent, ref_root,
+   level - 1, 0, for_reloc);
if (ret)
goto fail;
}
-- 
2.19.2



[PATCH 7/8] btrfs: extent-tree: Use btrfs_ref to refactor btrfs_inc_extent_ref()

2018-12-05 Thread Qu Wenruo
Now we don't need to play the dirty game of reusing @owner for tree block
level.

Signed-off-by: Qu Wenruo 
---
 fs/btrfs/ctree.h   |  6 ++---
 fs/btrfs/extent-tree.c | 58 ++
 fs/btrfs/file.c| 20 ++-
 fs/btrfs/inode.c   | 10 +---
 fs/btrfs/ioctl.c   | 17 -
 fs/btrfs/relocation.c  | 44 
 fs/btrfs/tree-log.c| 12 ++---
 7 files changed, 100 insertions(+), 67 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 6f4b1e605736..db3df5ce6087 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -40,6 +40,7 @@ extern struct kmem_cache *btrfs_bit_radix_cachep;
 extern struct kmem_cache *btrfs_path_cachep;
 extern struct kmem_cache *btrfs_free_space_cachep;
 struct btrfs_ordered_sum;
+struct btrfs_ref;
 
 #define BTRFS_MAGIC 0x4D5F53665248425FULL /* ascii _BHRfS_M, no null */
 
@@ -2682,10 +2683,7 @@ int btrfs_free_and_pin_reserved_extent(struct 
btrfs_fs_info *fs_info,
 void btrfs_prepare_extent_commit(struct btrfs_fs_info *fs_info);
 int btrfs_finish_extent_commit(struct btrfs_trans_handle *trans);
 int btrfs_inc_extent_ref(struct btrfs_trans_handle *trans,
-struct btrfs_root *root,
-u64 bytenr, u64 num_bytes, u64 parent,
-u64 root_objectid, u64 owner, u64 offset,
-bool for_reloc);
+struct btrfs_ref *generic_ref);
 
 int btrfs_start_dirty_block_groups(struct btrfs_trans_handle *trans);
 int btrfs_write_dirty_block_groups(struct btrfs_trans_handle *trans,
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 70c05ca30d9a..ff60091aef6b 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -2026,36 +2026,28 @@ int btrfs_discard_extent(struct btrfs_fs_info *fs_info, 
u64 bytenr,
 
 /* Can return -ENOMEM */
 int btrfs_inc_extent_ref(struct btrfs_trans_handle *trans,
-struct btrfs_root *root,
-u64 bytenr, u64 num_bytes, u64 parent,
-u64 root_objectid, u64 owner, u64 offset,
-bool for_reloc)
+struct btrfs_ref *generic_ref)
 {
-   struct btrfs_fs_info *fs_info = root->fs_info;
-   struct btrfs_ref generic_ref = { 0 };
+   struct btrfs_fs_info *fs_info = trans->fs_info;
int old_ref_mod, new_ref_mod;
int ret;
 
-   BUG_ON(owner < BTRFS_FIRST_FREE_OBJECTID &&
-  root_objectid == BTRFS_TREE_LOG_OBJECTID);
+   BUG_ON(generic_ref->type == BTRFS_REF_NOT_SET ||
+  !generic_ref->action);
+   BUG_ON(generic_ref->type == BTRFS_REF_METADATA &&
+  generic_ref->tree_ref.root == BTRFS_TREE_LOG_OBJECTID);
 
-   btrfs_init_generic_ref(&generic_ref, BTRFS_ADD_DELAYED_REF, bytenr,
-  num_bytes, root->root_key.objectid, parent);
-   generic_ref.skip_qgroup = for_reloc;
-   if (owner < BTRFS_FIRST_FREE_OBJECTID) {
-   btrfs_init_tree_ref(&generic_ref, (int)owner, root_objectid);
-   ret = btrfs_add_delayed_tree_ref(trans, &generic_ref,
+   if (generic_ref->type == BTRFS_REF_METADATA)
+   ret = btrfs_add_delayed_tree_ref(trans, generic_ref,
 NULL, &old_ref_mod, &new_ref_mod);
-   } else {
-   btrfs_init_data_ref(&generic_ref, root_objectid, owner, offset);
-   ret = btrfs_add_delayed_data_ref(trans, &generic_ref, 0,
+   else
+   ret = btrfs_add_delayed_data_ref(trans, generic_ref, 0,
 &old_ref_mod, &new_ref_mod);
-   }
 
-   btrfs_ref_tree_mod(fs_info, &generic_ref);
+   btrfs_ref_tree_mod(fs_info, generic_ref);
 
if (ret == 0 && old_ref_mod < 0 && new_ref_mod >= 0)
-   add_pinned_bytes(fs_info, &generic_ref);
+   add_pinned_bytes(fs_info, generic_ref);
 
return ret;
 }
@@ -3212,8 +3204,10 @@ static int __btrfs_mod_ref(struct btrfs_trans_handle 
*trans,
u32 nritems;
struct btrfs_key key;
struct btrfs_file_extent_item *fi;
+   struct btrfs_ref generic_ref = { 0 };
bool for_reloc = btrfs_header_flag(buf, BTRFS_HEADER_FLAG_RELOC);
int i;
+   int action;
int level;
int ret = 0;
 
@@ -3231,6 +3225,10 @@ static int __btrfs_mod_ref(struct btrfs_trans_handle 
*trans,
parent = buf->start;
else
parent = 0;
+   if (inc)
+   action = BTRFS_ADD_DELAYED_REF;
+   else
+   action = BTRFS_DROP_DELAYED_REF;
 
for (i = 0; i < nritems; i++) {
if (level == 0) {
@@ -3248,11 +3246,14 @@ static int __btrfs_mod_ref(struct btrfs_trans_handle 
*trans,
 
num_bytes = btrfs_file_extent_disk_num_bytes(buf, fi);
key.offset -= btrfs_file_e

[PATCH 5/8] btrfs: ref-verify: Use btrfs_ref to refactor btrfs_ref_tree_mod()

2018-12-05 Thread Qu Wenruo
It's a perfect match for btrfs_ref_tree_mod() to use btrfs_ref, as
btrfs_ref describes a metadata/data reference update comprehensively.

Now we have one less function using the confusing owner/level trick.

Signed-off-by: Qu Wenruo 
---
 fs/btrfs/extent-tree.c | 27 +++--
 fs/btrfs/ref-verify.c  | 53 --
 fs/btrfs/ref-verify.h  | 10 
 3 files changed, 42 insertions(+), 48 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index fa5dd3dfe2e7..1d812bc2c7fc 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -2038,9 +2038,6 @@ int btrfs_inc_extent_ref(struct btrfs_trans_handle *trans,
BUG_ON(owner < BTRFS_FIRST_FREE_OBJECTID &&
   root_objectid == BTRFS_TREE_LOG_OBJECTID);
 
-   btrfs_ref_tree_mod(root, bytenr, num_bytes, parent, root_objectid,
-  owner, offset, BTRFS_ADD_DELAYED_REF);
-
btrfs_init_generic_ref(&generic_ref, BTRFS_ADD_DELAYED_REF, bytenr,
   num_bytes, root->root_key.objectid, parent);
generic_ref.skip_qgroup = for_reloc;
@@ -2054,6 +2051,8 @@ int btrfs_inc_extent_ref(struct btrfs_trans_handle *trans,
 &old_ref_mod, &new_ref_mod);
}
 
+   btrfs_ref_tree_mod(fs_info, &generic_ref);
+
if (ret == 0 && old_ref_mod < 0 && new_ref_mod >= 0) {
bool metadata = owner < BTRFS_FIRST_FREE_OBJECTID;
 
@@ -7025,10 +7024,7 @@ void btrfs_free_tree_block(struct btrfs_trans_handle 
*trans,
if (root->root_key.objectid != BTRFS_TREE_LOG_OBJECTID) {
int old_ref_mod, new_ref_mod;
 
-   btrfs_ref_tree_mod(root, buf->start, buf->len, parent,
-  root->root_key.objectid,
-  btrfs_header_level(buf), 0,
-  BTRFS_DROP_DELAYED_REF);
+   btrfs_ref_tree_mod(fs_info, &generic_ref);
ret = btrfs_add_delayed_tree_ref(trans, &generic_ref, NULL,
 &old_ref_mod, &new_ref_mod);
BUG_ON(ret); /* -ENOMEM */
@@ -7089,11 +7085,6 @@ int btrfs_free_extent(struct btrfs_trans_handle *trans,
if (btrfs_is_testing(fs_info))
return 0;
 
-   if (root_objectid != BTRFS_TREE_LOG_OBJECTID)
-   btrfs_ref_tree_mod(root, bytenr, num_bytes, parent,
-  root_objectid, owner, offset,
-  BTRFS_DROP_DELAYED_REF);
-
btrfs_init_generic_ref(&generic_ref, BTRFS_DROP_DELAYED_REF, bytenr,
   num_bytes, root->root_key.objectid, parent);
generic_ref.skip_qgroup = for_reloc;
@@ -7117,6 +7108,9 @@ int btrfs_free_extent(struct btrfs_trans_handle *trans,
 &old_ref_mod, &new_ref_mod);
}
 
+   if (root_objectid != BTRFS_TREE_LOG_OBJECTID)
+   btrfs_ref_tree_mod(fs_info, &generic_ref);
+
if (ret == 0 && old_ref_mod >= 0 && new_ref_mod < 0) {
bool metadata = owner < BTRFS_FIRST_FREE_OBJECTID;
 
@@ -8083,14 +8077,11 @@ int btrfs_alloc_reserved_file_extent(struct 
btrfs_trans_handle *trans,
 
BUG_ON(root->root_key.objectid == BTRFS_TREE_LOG_OBJECTID);
 
-   btrfs_ref_tree_mod(root, ins->objectid, ins->offset, 0,
-  root->root_key.objectid, owner, offset,
-  BTRFS_ADD_DELAYED_EXTENT);
-
btrfs_init_generic_ref(&generic_ref, BTRFS_ADD_DELAYED_EXTENT,
   ins->objectid, ins->offset,
   root->root_key.objectid, 0);
btrfs_init_data_ref(&generic_ref, root->root_key.objectid, owner, 
offset);
+   btrfs_ref_tree_mod(root->fs_info, &generic_ref);
ret = btrfs_add_delayed_data_ref(trans, &generic_ref,
 ram_bytes, NULL, NULL);
return ret;
@@ -8338,13 +8329,11 @@ struct extent_buffer *btrfs_alloc_tree_block(struct 
btrfs_trans_handle *trans,
extent_op->is_data = false;
extent_op->level = level;
 
-   btrfs_ref_tree_mod(root, ins.objectid, ins.offset, parent,
-  root_objectid, level, 0,
-  BTRFS_ADD_DELAYED_EXTENT);
btrfs_init_generic_ref(&generic_ref, BTRFS_ADD_DELAYED_EXTENT,
   ins.objectid, ins.offset,
   root->root_key.objectid, parent);
btrfs_init_tree_ref(&generic_ref, level, root_objectid);
+   btrfs_ref_tree_mod(fs_info, &generic_ref);
ret = btrfs_add_delayed_tree_ref(trans, &generic_ref,
     extent_op, NULL, NULL);
if (ret)
diff --git a/fs/btrfs/ref-verify.c b/fs/btrfs/ref-verify.c
index d69

[PATCH 3/8] btrfs: delayed-ref: Use btrfs_ref to refactor btrfs_add_delayed_tree_ref()

2018-12-05 Thread Qu Wenruo
btrfs_add_delayed_tree_ref() has a longer and longer parameter list, and
some callers like btrfs_inc_extent_ref() use @owner as the level for
delayed tree refs.

Instead of making the parameter list longer and longer, use btrfs_ref to
refactor it, so each parameter assignment is self-explanatory, without
the dirty level/owner trick, and provides the basis for later refactoring.

Signed-off-by: Qu Wenruo 
---
 fs/btrfs/delayed-ref.c | 24 ++---
 fs/btrfs/delayed-ref.h |  4 +---
 fs/btrfs/extent-tree.c | 48 --
 3 files changed, 44 insertions(+), 32 deletions(-)

diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c
index 11dd46be4017..c42b8ade7b07 100644
--- a/fs/btrfs/delayed-ref.c
+++ b/fs/btrfs/delayed-ref.c
@@ -710,9 +710,7 @@ static void init_delayed_ref_common(struct btrfs_fs_info 
*fs_info,
  * transaction commits.
  */
 int btrfs_add_delayed_tree_ref(struct btrfs_trans_handle *trans,
-  u64 bytenr, u64 num_bytes, u64 parent,
-  u64 ref_root,  int level, bool for_reloc,
-  int action,
+  struct btrfs_ref *generic_ref,
   struct btrfs_delayed_extent_op *extent_op,
   int *old_ref_mod, int *new_ref_mod)
 {
@@ -722,10 +720,17 @@ int btrfs_add_delayed_tree_ref(struct btrfs_trans_handle 
*trans,
struct btrfs_delayed_ref_root *delayed_refs;
struct btrfs_qgroup_extent_record *record = NULL;
int qrecord_inserted;
-   bool is_system = (ref_root == BTRFS_CHUNK_TREE_OBJECTID);
+   bool is_system = (generic_ref->real_root == BTRFS_CHUNK_TREE_OBJECTID);
+   int action = generic_ref->action;
+   int level = generic_ref->tree_ref.level;
int ret;
+   u64 bytenr = generic_ref->bytenr;
+   u64 num_bytes = generic_ref->len;
+   u64 parent = generic_ref->parent;
u8 ref_type;
 
+   ASSERT(generic_ref && generic_ref->type == BTRFS_REF_METADATA &&
+   generic_ref->action);
BUG_ON(extent_op && extent_op->is_data);
ref = kmem_cache_alloc(btrfs_delayed_tree_ref_cachep, GFP_NOFS);
if (!ref)
@@ -738,7 +743,9 @@ int btrfs_add_delayed_tree_ref(struct btrfs_trans_handle 
*trans,
}
 
if (test_bit(BTRFS_FS_QUOTA_ENABLED, &fs_info->flags) &&
-   is_fstree(ref_root) && !for_reloc) {
+   is_fstree(generic_ref->real_root) &&
+   is_fstree(generic_ref->tree_ref.root) &&
+   !generic_ref->skip_qgroup) {
record = kzalloc(sizeof(*record), GFP_NOFS);
if (!record) {
kmem_cache_free(btrfs_delayed_tree_ref_cachep, ref);
@@ -753,13 +760,14 @@ int btrfs_add_delayed_tree_ref(struct btrfs_trans_handle 
*trans,
ref_type = BTRFS_TREE_BLOCK_REF_KEY;
 
init_delayed_ref_common(fs_info, &ref->node, bytenr, num_bytes,
-   ref_root, action, ref_type);
-   ref->root = ref_root;
+   generic_ref->tree_ref.root, action, ref_type);
+   ref->root = generic_ref->tree_ref.root;
ref->parent = parent;
ref->level = level;
 
init_delayed_ref_head(head_ref, record, bytenr, num_bytes,
- ref_root, 0, action, false, is_system);
+ generic_ref->tree_ref.root, 0, action, false,
+ is_system);
head_ref->extent_op = extent_op;
 
delayed_refs = &trans->transaction->delayed_refs;
diff --git a/fs/btrfs/delayed-ref.h b/fs/btrfs/delayed-ref.h
index e36d6b05d85e..dbe029c4e01b 100644
--- a/fs/btrfs/delayed-ref.h
+++ b/fs/btrfs/delayed-ref.h
@@ -333,9 +333,7 @@ static inline void btrfs_put_delayed_ref_head(struct 
btrfs_delayed_ref_head *hea
 }
 
 int btrfs_add_delayed_tree_ref(struct btrfs_trans_handle *trans,
-  u64 bytenr, u64 num_bytes, u64 parent,
-  u64 ref_root, int level, bool for_reloc,
-  int action,
+  struct btrfs_ref *generic_ref,
   struct btrfs_delayed_extent_op *extent_op,
   int *old_ref_mod, int *new_ref_mod);
 int btrfs_add_delayed_data_ref(struct btrfs_trans_handle *trans,
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index ea68d288d761..ecfa0234863b 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -2031,6 +2031,7 @@ int btrfs_inc_extent_ref(struct btrfs_trans_handle *trans,
 bool for_reloc)
 {
struct btrfs_fs_info *fs_info = root->fs_info;
+   struct btrfs_ref generic_ref = { 0 };
int old_ref_mod, new_ref_mod;
int ret;
 
@@ -2040,13 +2041,13 @@ in

[PATCH 1/8] btrfs: delayed-ref: Introduce better documented delayed ref structures

2018-12-05 Thread Qu Wenruo
Current delayed ref interface has several problems:
- Longer and longer parameter lists
  bytenr
  num_bytes
  parent
  ref_root
  owner
  offset
  for_reloc << Only qgroup code cares.

- Different interpretation for the same parameter
  Above, @owner for a data ref is the ino owning this extent,
  while for a tree ref it's the level. They are even in different size ranges.
  For level we only need 0~8, while for ino it's
  BTRFS_FIRST_FREE_OBJECTID~BTRFS_LAST_FREE_OBJECTID.

  And @offset doesn't even make sense for tree refs.

  Such parameter reuse may look clever as a hidden union, but it
  destroys code readability.

To solve both problems, we introduce a new structure, btrfs_ref to solve
them:

- Structure instead of long parameter list
  This makes later expansion easier, and better documented.

- Use btrfs_ref::type to distinguish data and tree ref

- Use proper union to store data/tree ref specific structures.

- Use separate functions to fill data/tree ref data, with a common generic
  function to fill common bytenr/num_bytes members.

All parameters find their place in btrfs_ref, and an extra member,
real_root, inspired by the ref-verify code, is newly introduced for later
qgroup code, to record which tree triggered this extent modification.

This patch doesn't touch any code, but provides the basis for incoming
refactors.

Signed-off-by: Qu Wenruo 
---
 fs/btrfs/delayed-ref.h | 109 +
 1 file changed, 109 insertions(+)
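
For a concrete picture of the intended call pattern, here is a compiling
toy that mirrors the helpers in the diff below (struct and setters are
trimmed from the patch itself; the action value and the bytenr/len/root
ids are made-up placeholders):

    #include <stdio.h>
    #include <stdint.h>
    typedef uint64_t u64;

    enum btrfs_ref_type { BTRFS_REF_NOT_SET = 0, BTRFS_REF_DATA, BTRFS_REF_METADATA };

    struct btrfs_ref {
            enum btrfs_ref_type type;
            int action;
            u64 real_root, bytenr, len, parent;
            union {
                    struct { u64 ref_root, ino, offset; } data_ref;
                    struct { int level; u64 root; } tree_ref;
            };
    };

    /* same logic as the inline helpers added by this patch */
    static void btrfs_init_generic_ref(struct btrfs_ref *r, int action,
                    u64 bytenr, u64 len, u64 real_root, u64 parent)
    {
            r->action = action;
            r->bytenr = bytenr;
            r->len = len;
            r->real_root = real_root;
            r->parent = parent;
    }

    static void btrfs_init_tree_ref(struct btrfs_ref *r, int level, u64 root)
    {
            if (!r->real_root)
                    r->real_root = root;    /* fallback, as in the patch */
            r->tree_ref.level = level;
            r->tree_ref.root = root;
            r->type = BTRFS_REF_METADATA;
    }

    int main(void)
    {
            struct btrfs_ref ref = { 0 };

            /* common part first, then the type-specific part */
            btrfs_init_generic_ref(&ref, 1 /* stand-in for BTRFS_ADD_DELAYED_REF */,
                                   1048576, 16384, 5, 0);
            btrfs_init_tree_ref(&ref, 1, 5);
            printf("type=%d level=%d\n", ref.type, ref.tree_ref.level);
            return 0;
    }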

diff --git a/fs/btrfs/delayed-ref.h b/fs/btrfs/delayed-ref.h
index d8fa12d3f2cc..e36d6b05d85e 100644
--- a/fs/btrfs/delayed-ref.h
+++ b/fs/btrfs/delayed-ref.h
@@ -176,6 +176,81 @@ struct btrfs_delayed_ref_root {
u64 qgroup_to_skip;
 };
 
+enum btrfs_ref_type {
+   BTRFS_REF_NOT_SET = 0,
+   BTRFS_REF_DATA,
+   BTRFS_REF_METADATA,
+   BTRFS_REF_LAST,
+};
+
+struct btrfs_data_ref {
+   /*
+* For EXTENT_DATA_REF
+*
+* @ref_root:   current owner of the extent.
+*  may differ from btrfs_ref::real_root.
+* @ino:inode number of the owner.
+* @offset: *CALCULATED* offset. Not EXTENT_DATA key offset.
+*
+*/
+   u64 ref_root;
+   u64 ino;
+   u64 offset;
+};
+
+struct btrfs_tree_ref {
+   /* Common for all sub types and skinny combination */
+   int level;
+
+   /*
+* For TREE_BLOCK_REF (skinny metadata, either inline or keyed)
+*
+* root here may differ from btrfs_ref::real_root.
+*/
+   u64 root;
+
+   /* For non-skinny metadata, no special member needed */
+};
+
+struct btrfs_ref {
+   enum btrfs_ref_type type;
+   int action;
+
+   /*
+* Use full backref(SHARED_BLOCK_REF or SHARED_DATA_REF) for this
+* extent and its children.
+* Set for reloc trees.
+*/
+   unsigned int use_fullback:1;
+
+   /*
+* Whether this extent should go through qgroup record.
+* Normally false, but for certain case like delayed subtree scan,
+* this can hugely reduce qgroup overhead.
+*/
+   unsigned int skip_qgroup:1;
+
+   /*
+* Who owns this reference modification, optional.
+*
+* One example:
+* When creating reloc tree for source fs, it will increase tree block
+* ref for children tree blocks.
+* In that case, btrfs_ref::real_root = reloc tree,
+* while btrfs_ref::tree_ref::root = fs tree.
+*/
+   u64 real_root;
+   u64 bytenr;
+   u64 len;
+
+   /* Common @parent for SHARED_DATA_REF/SHARED_BLOCK_REF */
+   u64 parent;
+   union {
+   struct btrfs_data_ref data_ref;
+   struct btrfs_tree_ref tree_ref;
+   };
+};
+
 extern struct kmem_cache *btrfs_delayed_ref_head_cachep;
 extern struct kmem_cache *btrfs_delayed_tree_ref_cachep;
 extern struct kmem_cache *btrfs_delayed_data_ref_cachep;
@@ -184,6 +259,40 @@ extern struct kmem_cache *btrfs_delayed_extent_op_cachep;
 int __init btrfs_delayed_ref_init(void);
 void __cold btrfs_delayed_ref_exit(void);
 
+static inline void btrfs_init_generic_ref(struct btrfs_ref *generic_ref,
+   int action, u64 bytenr, u64 len, u64 real_root,
+   u64 parent)
+{
+   generic_ref->action = action;
+   generic_ref->bytenr = bytenr;
+   generic_ref->len = len;
+   generic_ref->real_root = real_root;
+   generic_ref->parent = parent;
+}
+
+static inline void btrfs_init_tree_ref(struct btrfs_ref *generic_ref,
+   int level, u64 root)
+{
+   /* If @real_root not set, use @root as fallback */
+   if (!generic_ref->real_root)
+   generic_ref->real_root = root;
+   generic_ref->tree_ref.level = level;
+   generic_ref->tree_ref.root = root;
+   generic_ref->type = BTRFS_REF_METADATA;
+}
+
+static inline void 

[PATCH 6/8] btrfs: extent-tree: Use btrfs_ref to refactor add_pinned_bytes()

2018-12-05 Thread Qu Wenruo
Since add_pinned_bytes() only needs to know if the extent is metadata
and if it's a chunk tree extent, btrfs_ref is a perfect match for it, as
we don't need various owner/level trick to determine extent type.

Signed-off-by: Qu Wenruo 
---
 fs/btrfs/extent-tree.c | 26 ++
 1 file changed, 10 insertions(+), 16 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 1d812bc2c7fc..70c05ca30d9a 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -738,14 +738,15 @@ static struct btrfs_space_info *__find_space_info(struct btrfs_fs_info *info,
return NULL;
 }
 
-static void add_pinned_bytes(struct btrfs_fs_info *fs_info, s64 num_bytes,
-bool metadata, u64 root_objectid)
+static void add_pinned_bytes(struct btrfs_fs_info *fs_info,
+struct btrfs_ref *ref)
 {
struct btrfs_space_info *space_info;
+   s64 num_bytes = -ref->len;
u64 flags;
 
-   if (metadata) {
-   if (root_objectid == BTRFS_CHUNK_TREE_OBJECTID)
+   if (ref->type == BTRFS_REF_METADATA) {
+   if (ref->tree_ref.root == BTRFS_CHUNK_TREE_OBJECTID)
flags = BTRFS_BLOCK_GROUP_SYSTEM;
else
flags = BTRFS_BLOCK_GROUP_METADATA;
@@ -2053,11 +2054,8 @@ int btrfs_inc_extent_ref(struct btrfs_trans_handle *trans,
 
	btrfs_ref_tree_mod(fs_info, &generic_ref);
 
-   if (ret == 0 && old_ref_mod < 0 && new_ref_mod >= 0) {
-   bool metadata = owner < BTRFS_FIRST_FREE_OBJECTID;
-
-   add_pinned_bytes(fs_info, -num_bytes, metadata, root_objectid);
-   }
+   if (ret == 0 && old_ref_mod < 0 && new_ref_mod >= 0)
+   add_pinned_bytes(fs_info, &generic_ref);
 
return ret;
 }
@@ -7059,8 +7057,7 @@ void btrfs_free_tree_block(struct btrfs_trans_handle *trans,
}
 out:
if (pin)
-   add_pinned_bytes(fs_info, buf->len, true,
-root->root_key.objectid);
+   add_pinned_bytes(fs_info, &generic_ref);
 
if (last_ref) {
/*
@@ -7111,11 +7108,8 @@ int btrfs_free_extent(struct btrfs_trans_handle *trans,
if (root_objectid != BTRFS_TREE_LOG_OBJECTID)
	btrfs_ref_tree_mod(fs_info, &generic_ref);
 
-   if (ret == 0 && old_ref_mod >= 0 && new_ref_mod < 0) {
-   bool metadata = owner < BTRFS_FIRST_FREE_OBJECTID;
-
-   add_pinned_bytes(fs_info, num_bytes, metadata, root_objectid);
-   }
+   if (ret == 0 && old_ref_mod >= 0 && new_ref_mod < 0)
+   add_pinned_bytes(fs_info, &generic_ref);
 
return ret;
 }
-- 
2.19.2



[PATCH 0/8] btrfs: Refactor delayed ref parameter list

2018-12-05 Thread Qu Wenruo
Current delayed ref interface has several problems:
- Longer and longer parameter lists
  bytenr
  num_bytes
  parent
   << So far so good
  ref_root
  owner
  offset
   << I don't feel well now
  for_reloc
   << This parameter only makes sense for qgroup code, but we need
      to pass the parameter a long way.

  This makes later expansion of the parameter list more and more tricky.

- Different interpretation for the same parameter
  Above @owner for a data ref is the ino that owns this extent,
  while for a tree ref, it's the level. They are even in different size ranges.

  For level we only need 0~8, while for ino it's
  BTRFS_FIRST_FREE_OBJECTID~BTRFS_LAST_FREE_OBJECTID, so it's still
  possible to distinguish them, but it's never a straightforward thing
  to grasp.

  And @offset doesn't even make sense for tree ref.

  Such parameter reuse may look clever as a hidden union, but it
  destroys code readability.

This patchset will change the way how we pass parameters for delayed
ref.
Instead of calling delayed ref interface like:
  ret = btrfs_inc_extent_ref(trans, root, bytenr, num_bytes, parent,
 ref_root, owner, offset);
Or
  ret = btrfs_inc_extent_ref(trans, root, bytenr, nodesize, parent,
 level, ref_root, 0);

We now call like:
  btrfs_init_generic_ref(&ref, bytenr, num_bytes,
			 root->root_key.objectid, parent);
  btrfs_init_data_ref(&ref, ref_root, owner, offset);
  ret = btrfs_inc_extent_ref(trans, &ref);
Or
  btrfs_init_generic_ref(&ref, bytenr, num_bytes,
			 root->root_key.objectid, parent);
  btrfs_init_tree_ref(&ref, level, ref_root);
  ret = btrfs_inc_extent_ref(trans, &ref);

To determine if a ref is tree or data, instead of calling like:
  if (owner < BTRFS_FIRST_FREE_OBJECTID) {
  } else {
  }
We do it straight-forward:
  if (ref->type == BTRFS_REF_METADATA) {
  } else {
  }

And for new, minor members, we don't need to add a new
parameter to btrfs_add_delayed_tree|data_ref() or
btrfs_inc_extent_ref(); just assign them after the generic/data/tree init:
  btrfs_init_generic_ref(&ref, bytenr, num_bytes,
			 root->root_key.objectid, parent);
  btrfs_init_data_ref(&ref, ref_root, owner, offset);
  ref.skip_qgroup = true; /* @skip_qgroup defaults to false, so new
			     code doesn't need to care */
  ret = btrfs_inc_extent_ref(trans, &ref);

This should improve the code readability and make later code easier to
write.
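
Note that btrfs_init_generic_ref() as introduced in patch 1/8 also takes
the delayed-ref action, which the simplified examples above omit.  A
complete call site under that signature would look roughly like this
(a sketch only; BTRFS_ADD_DELAYED_REF is the existing action value from
delayed-ref.h):

	struct btrfs_ref ref = { 0 };

	btrfs_init_generic_ref(&ref, BTRFS_ADD_DELAYED_REF, bytenr,
			       num_bytes, root->root_key.objectid, parent);
	btrfs_init_data_ref(&ref, ref_root, owner, offset);
	ret = btrfs_inc_extent_ref(trans, &ref);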


Qu Wenruo (8):
  btrfs: delayed-ref: Introduce better documented delayed ref structures
  btrfs: extent-tree: Open-code process_func in __btrfs_mod_ref
  btrfs: delayed-ref: Use btrfs_ref to refactor btrfs_add_delayed_tree_ref()
  btrfs: delayed-ref: Use btrfs_ref to refactor btrfs_add_delayed_data_ref()
  btrfs: ref-verify: Use btrfs_ref to refactor btrfs_ref_tree_mod()
  btrfs: extent-tree: Use btrfs_ref to refactor add_pinned_bytes()
  btrfs: extent-tree: Use btrfs_ref to refactor btrfs_inc_extent_ref()
  btrfs: extent-tree: Use btrfs_ref to refactor btrfs_free_extent()

 fs/btrfs/ctree.h   |  11 +--
 fs/btrfs/delayed-ref.c |  43 ++---
 fs/btrfs/delayed-ref.h | 121 +++--
 fs/btrfs/extent-tree.c | 195 +++--
 fs/btrfs/file.c|  43 +
 fs/btrfs/inode.c   |  23 +++--
 fs/btrfs/ioctl.c   |  17 ++--
 fs/btrfs/ref-verify.c  |  53 ++++++-
 fs/btrfs/ref-verify.h  |  10 +--
 fs/btrfs/relocation.c  |  70 +++++--
 fs/btrfs/tree-log.c|  12 ++-
 11 files changed, 375 insertions(+), 223 deletions(-)

-- 
2.19.2



Re: [PATCH 00/10] btrfs: Support for DAX devices

2018-12-05 Thread Jeff Mahoney

On 12/5/18 7:28 AM, Goldwyn Rodrigues wrote:

This is a support for DAX in btrfs. I understand there have been
previous attempts at it. However, I wanted to make sure copy-on-write
(COW) works on dax as well.

Before I present this to the FS folks I wanted to run this through the
btrfs. Even though I wish, I cannot get it correct the first time
around :/.. Here are some questions for which I need suggestions:

Questions:
1. I have been unable to do checksumming for DAX devices. While
checksumming can be done for reads and writes, it is a problem when mmap
is involved because btrfs kernel module does not get back control after
an mmap() writes. Any ideas are appreciated, or we would have to set
nodatasum when dax is enabled.


Yep.  It has to be nodatasum, at least within the confines of datasum 
today.  DAX mmap writes are essentially in the same situation as with 
direct i/o when another thread modifies the buffer being submitted. 
Except rather than it being a race, it happens every time.  An 
alternative here could be to add the ability to mark a crc as unreliable 
and then go back and update them once the last DAX mmap reference is 
dropped on a range.  There's no reason to make this a requirement of the 
initial implementation, though.



2. Currently, a user can continue writing on "old" extents of an mmaped file
after a snapshot has been created. How can we enforce writes to be directed
to new extents after snapshots have been created? Do we keep a list of
all mmap()s, and re-mmap them after a snapshot?


It's the second question that's the hard part.  As Adam describes later, 
setting each pfn read-only will ensure page faults cause the remapping.


The high level idea that Jan Kara and I came up with in our conversation 
at Labs conf is pretty expensive.  We'd need to set a flag that pauses 
new page faults, set the WP bit on affected ranges, do the snapshot, 
commit, clear the flag, and wake up the waiting threads.  Neither of us 
had any concrete idea of how well that would perform and it still 
depends on finding a good way to resolve all open mmap ranges on a 
subvolume.  Perhaps using the address_space->private_list anchored on 
each root would work.


-Jeff


Tested by creating a pmem device in RAM with "memmap=2G!4G" kernel
command line parameter.


[PATCH 01/10] btrfs: create a mount option for dax
[PATCH 02/10] btrfs: basic dax read
[PATCH 03/10] btrfs: dax: read zeros from holes
[PATCH 04/10] Rename __endio_write_update_ordered() to
[PATCH 05/10] btrfs: Carve out btrfs_get_extent_map_write() out of
[PATCH 06/10] btrfs: dax write support
[PATCH 07/10] dax: export functions for use with btrfs
[PATCH 08/10] btrfs: dax add read mmap path
[PATCH 09/10] btrfs: dax support for cow_page/mmap_private and shared
[PATCH 10/10] btrfs: dax mmap write

  fs/btrfs/Makefile   |1
  fs/btrfs/ctree.h    |   17 ++
  fs/btrfs/dax.c  |  303 ++++++--
  fs/btrfs/file.c |   29 
  fs/btrfs/inode.c    |   54 +
  fs/btrfs/ioctl.c|5
  fs/btrfs/super.c|   15 ++
  fs/dax.c|   35 --
  include/linux/dax.h |   16 ++
  9 files changed, 430 insertions(+), 45 deletions(-)




--
Jeff Mahoney
SUSE Labs



Re: [PATCH 00/10] btrfs: Support for DAX devices

2018-12-05 Thread Jeff Mahoney

On 12/5/18 8:03 AM, Qu Wenruo wrote:



On 2018/12/5 下午8:28, Goldwyn Rodrigues wrote:

This is a support for DAX in btrfs. I understand there have been
previous attempts at it. However, I wanted to make sure copy-on-write
(COW) works on dax as well.

Before I present this to the FS folks I wanted to run this through the
btrfs. Even though I wish, I cannot get it correct the first time
around :/.. Here are some questions for which I need suggestions:

Questions:
1. I have been unable to do checksumming for DAX devices. While
checksumming can be done for reads and writes, it is a problem when mmap
is involved because btrfs kernel module does not get back control after
an mmap() writes. Any ideas are appreciated, or we would have to set
nodatasum when dax is enabled.


I'm not familiar with DAX, so it's completely possible I'm talking like
an idiot.


The general idea is:

1) there is no page cache involved. read() and write() are like direct 
i/o writes in concept.  The user buffer is written directly (via what is 
essentially a specialized memcpy) to the NVDIMM.
2) for mmap, once the mapping is established and mapped, the file system 
is not involved.  The application writes directly to the memory as it 
would a normal mmap, except it's persistent.  All that's required to 
ensure persistence is a CPU cache flush.  The only way the file system 
is involved again is if some operation has occurred to reset the WP bit.
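
As a concrete illustration of (2), this is roughly all a pmem-aware
application has to do on the mmap side (a sketch using libpmem from
PMDK, which is an assumption here; open-coded CLWB+SFENCE would achieve
the same):

	#include <string.h>
	#include <libpmem.h>

	/* dst points into a DAX mmap of the file */
	static void persist_write(char *dst, const char *src, size_t len)
	{
		memcpy(dst, src, len);
		pmem_persist(dst, len);	/* CPU cache flush only, no syscall */
	}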



If btrfs_page_mkwrite() can't provide enough control, then I have a
crazy idea.


It can't, because it is only invoked on the page fault path and we want 
to try to limit those as much as possible.



Forcing page fault for every mmap() read/write (completely disable page
cache like DIO).
So that we could get some control since we're informed to read the page
and do some hacks there.
There's no way to force a page fault for every mmap read/write.  Even if 
there was, we wouldn't want that.  No user would turn that on when they 
can just make similar guarantees in their app (which are typically apps 
that do this already) and not pay any performance penalty.   The idea 
with DAX mmap is that the file system manages the namespace, space 
allocation, and permissions.  Otherwise we stay out of the way.


-Jeff
--
Jeff Mahoney
SUSE Labs



Re: [PATCH][v2] btrfs: run delayed items before dropping the snapshot

2018-12-05 Thread Filipe Manana
On Wed, Dec 5, 2018 at 5:14 PM Josef Bacik  wrote:
>
> From: Josef Bacik 
>
> With my delayed refs patches in place we started seeing a large amount
> of aborts in __btrfs_free_extent
>
> BTRFS error (device sdb1): unable to find ref byte nr 91947008 parent 0 root 
> 35964  owner 1 offset 0
> Call Trace:
>  ? btrfs_merge_delayed_refs+0xaf/0x340
>  __btrfs_run_delayed_refs+0x6ea/0xfc0
>  ? btrfs_set_path_blocking+0x31/0x60
>  btrfs_run_delayed_refs+0xeb/0x180
>  btrfs_commit_transaction+0x179/0x7f0
>  ? btrfs_check_space_for_delayed_refs+0x30/0x50
>  ? should_end_transaction.isra.19+0xe/0x40
>  btrfs_drop_snapshot+0x41c/0x7c0
>  btrfs_clean_one_deleted_snapshot+0xb5/0xd0
>  cleaner_kthread+0xf6/0x120
>  kthread+0xf8/0x130
>  ? btree_invalidatepage+0x90/0x90
>  ? kthread_bind+0x10/0x10
>  ret_from_fork+0x35/0x40
>
> This was because btrfs_drop_snapshot depends on the root not being modified
> while it's dropping the snapshot.  It will unlock the root node (and really
> every node) as it walks down the tree, only to re-lock it when it needs to do
> something.  This is a problem because if we modify the tree we could cow a 
> block
> in our path, which free's our reference to that block.  Then once we get back 
> to
> that shared block we'll free our reference to it again, and get ENOENT when
> trying to lookup our extent reference to that block in __btrfs_free_extent.
>
> This is ultimately happening because we have delayed items left to be 
> processed
> for our deleted snapshot _after_ all of the inodes are closed for the 
> snapshot.
> We only run the delayed inode item if we're deleting the inode, and even then 
> we
> do not run the delayed insertions or delayed removals.  These can be run at 
> any
> point after our final inode does its last iput, which is what triggers the
> snapshot deletion.  We can end up with the snapshot deletion happening and 
> then
> have the delayed items run on that file system, resulting in the above 
> problem.
>
> This problem has existed forever, however my patches made it much easier to 
> hit
> as I wake up the cleaner much more often to deal with delayed iputs, which 
> made
> us more likely to start the snapshot dropping work before the transaction
> commits, which is when the delayed items would generally be run.  Before,
> generally speaking, we would run the delayed items, commit the transaction, 
> and
> wakeup the cleaner thread to start deleting snapshots, which means we were 
> less
> likely to hit this problem.  You could still hit it if you had multiple
> snapshots to be deleted and ended up with lots of delayed items, but it was
> definitely harder.
>
> Fix for now by simply running all the delayed items before starting to drop 
> the
> snapshot.  We could make this smarter in the future by making the delayed 
> items
> per-root, and then simply drop any delayed items for roots that we are going 
> to
> delete.  But for now just a quick and easy solution is the safest.
>
> Cc: sta...@vger.kernel.org
> Signed-off-by: Josef Bacik 

Reviewed-by: Filipe Manana 

Looks good now. Thanks.

> ---
> v1->v2:
> - check for errors from btrfs_run_delayed_items.
> - Dave I can reroll the series, but the second version of patch 1 is the same,
>   let me know what you want.
>
>  fs/btrfs/extent-tree.c | 4 
>  1 file changed, 4 insertions(+)
>
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index dcb699dd57f3..473084eb7a2d 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -9330,6 +9330,10 @@ int btrfs_drop_snapshot(struct btrfs_root *root,
> goto out_free;
> }
>
> +   err = btrfs_run_delayed_items(trans);
> +   if (err)
> +   goto out_end_trans;
> +
> if (block_rsv)
> trans->block_rsv = block_rsv;
>
> --
> 2.14.3
>


-- 
Filipe David Manana,

“Whether you think you can, or you think you can't — you're right.”


Re: btrfs progs always assume devid 1?

2018-12-05 Thread Austin S. Hemmelgarn

On 2018-12-05 14:50, Roman Mamedov wrote:

Hello,

To migrate my FS to a different physical disk, I have added a new empty device
to the FS, then ran the remove operation on the original one.

Now my FS has only devid 2:

Label: 'p1'  uuid: d886c190-b383-45ba-9272-9f00c6a10c50
Total devices 1 FS bytes used 36.63GiB
devid    2 size 50.00GiB used 45.06GiB path /dev/mapper/vg-p1

And all the operations of btrfs-progs now fail to work in their default
invocation, such as:

# btrfs fi resize max .
Resize '.' of 'max'
ERROR: unable to resize '.': No such device

[768813.414821] BTRFS info (device dm-5): resizer unable to find device 1

Of course this works:

# btrfs fi resize 2:max .
Resize '.' of '2:max'

But this is inconvenient and seems to be a rather simple oversight. If what I
got is normal (the device staying as ID 2 after such operation), then count
that as a suggestion that btrfs-progs should use the first existing devid,
rather than always looking for hard-coded devid 1.



I've been meaning to try and write up a patch to special-case this for a 
while now, but have not gotten around to it yet.


FWIW, this is one of multiple reasons that it's highly recommended to 
use `btrfs replace` instead of adding a new device and deleting the old 
one when replacing a device.  Other benefits include:


* It doesn't have to run in the foreground (and doesn't by default).
* It usually takes less time.
* Replace operations can be queried while running to get a nice 
indication of the completion percentage.


The only disadvantage is that the new device has to be at least as large 
as the old one (though you can get around this to a limited degree by 
shrinking the old device), and it needs the old and new device to be 
plugged in at the same time (add/delete doesn't, if you flip the order 
of the add and delete commands).
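
For the scenario Roman hit, the replace flow would have been something
like (hypothetical device name, filesystem mounted at /mnt):

# btrfs replace start 1 /dev/sdc /mnt
# btrfs replace status /mnt

Replace keeps the devid of the device it replaces, so the filesystem
would still have had devid 1 and the default invocations would have
kept working.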


btrfs progs always assume devid 1?

2018-12-05 Thread Roman Mamedov
Hello,

To migrate my FS to a different physical disk, I have added a new empty device
to the FS, then ran the remove operation on the original one.

Now my FS has only devid 2:

Label: 'p1'  uuid: d886c190-b383-45ba-9272-9f00c6a10c50
Total devices 1 FS bytes used 36.63GiB
devid    2 size 50.00GiB used 45.06GiB path /dev/mapper/vg-p1

And all the operations of btrfs-progs now fail to work in their default
invocation, such as:

# btrfs fi resize max .
Resize '.' of 'max'
ERROR: unable to resize '.': No such device

[768813.414821] BTRFS info (device dm-5): resizer unable to find device 1

Of course this works:

# btrfs fi resize 2:max .
Resize '.' of '2:max'

But this is inconvenient and seems to be a rather simple oversight. If what I
got is normal (the device staying as ID 2 after such operation), then count
that as a suggestion that btrfs-progs should use the first existing devid,
rather than always looking for hard-coded devid 1.

-- 
With respect,
Roman


[PATCH][v2] btrfs: run delayed items before dropping the snapshot

2018-12-05 Thread Josef Bacik
From: Josef Bacik 

With my delayed refs patches in place we started seeing a large amount
of aborts in __btrfs_free_extent

BTRFS error (device sdb1): unable to find ref byte nr 91947008 parent 0 root 
35964  owner 1 offset 0
Call Trace:
 ? btrfs_merge_delayed_refs+0xaf/0x340
 __btrfs_run_delayed_refs+0x6ea/0xfc0
 ? btrfs_set_path_blocking+0x31/0x60
 btrfs_run_delayed_refs+0xeb/0x180
 btrfs_commit_transaction+0x179/0x7f0
 ? btrfs_check_space_for_delayed_refs+0x30/0x50
 ? should_end_transaction.isra.19+0xe/0x40
 btrfs_drop_snapshot+0x41c/0x7c0
 btrfs_clean_one_deleted_snapshot+0xb5/0xd0
 cleaner_kthread+0xf6/0x120
 kthread+0xf8/0x130
 ? btree_invalidatepage+0x90/0x90
 ? kthread_bind+0x10/0x10
 ret_from_fork+0x35/0x40

This was because btrfs_drop_snapshot depends on the root not being modified
while it's dropping the snapshot.  It will unlock the root node (and really
every node) as it walks down the tree, only to re-lock it when it needs to do
something.  This is a problem because if we modify the tree we could cow a block
in our path, which free's our reference to that block.  Then once we get back to
that shared block we'll free our reference to it again, and get ENOENT when
trying to lookup our extent reference to that block in __btrfs_free_extent.

This is ultimately happening because we have delayed items left to be processed
for our deleted snapshot _after_ all of the inodes are closed for the snapshot.
We only run the delayed inode item if we're deleting the inode, and even then we
do not run the delayed insertions or delayed removals.  These can be run at any
point after our final inode does its last iput, which is what triggers the
snapshot deletion.  We can end up with the snapshot deletion happening and then
have the delayed items run on that file system, resulting in the above problem.

This problem has existed forever, however my patches made it much easier to hit
as I wake up the cleaner much more often to deal with delayed iputs, which made
us more likely to start the snapshot dropping work before the transaction
commits, which is when the delayed items would generally be run.  Before,
generally speaking, we would run the delayed items, commit the transaction, and
wakeup the cleaner thread to start deleting snapshots, which means we were less
likely to hit this problem.  You could still hit it if you had multiple
snapshots to be deleted and ended up with lots of delayed items, but it was
definitely harder.

Fix for now by simply running all the delayed items before starting to drop the
snapshot.  We could make this smarter in the future by making the delayed items
per-root, and then simply drop any delayed items for roots that we are going to
delete.  But for now just a quick and easy solution is the safest.

Cc: sta...@vger.kernel.org
Signed-off-by: Josef Bacik 
---
v1->v2:
- check for errors from btrfs_run_delayed_items.
- Dave I can reroll the series, but the second version of patch 1 is the same,
  let me know what you want.

 fs/btrfs/extent-tree.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index dcb699dd57f3..473084eb7a2d 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -9330,6 +9330,10 @@ int btrfs_drop_snapshot(struct btrfs_root *root,
goto out_free;
}
 
+   err = btrfs_run_delayed_items(trans);
+   if (err)
+   goto out_end_trans;
+
if (block_rsv)
trans->block_rsv = block_rsv;
 
-- 
2.14.3



Re: [RFC PATCH] btrfs: Remove __extent_readpages

2018-12-05 Thread Josef Bacik
On Mon, Dec 03, 2018 at 12:25:32PM +0200, Nikolay Borisov wrote:
> When extent_readpages is called from the generic readahead code it first
> builds a batch of 16 pages (which might or might not be consecutive,
> depending on whether add_to_page_cache_lru failed) and submits them to
> __extent_readpages. The latter ensures that the range of pages (in the
> batch of 16) that is passed to __do_contiguous_readpages is consecutive.
> 
> If add_to_page_cache_lru doesn't fail then __extent_readpages will call
> __do_contiguous_readpages only once with the whole batch of 16.
> Alternatively, if add_to_page_cache_lru fails once on the 8th page (as an 
> example)
> then the contiguous page read code will be called twice.
> 
> All of this can be simplified by exploiting the fact that all pages passed
> to extent_readpages are consecutive, thus when the batch is built in
> that function it is already consecutive (barring add_to_page_cache_lru
> failures) so it is ready to be submitted directly to __do_contiguous_readpages.
> Also simplify the name of the function to contiguous_readpages. 
> 
> Signed-off-by: Nikolay Borisov 
> ---
> 
> So this patch looks like a very nice cleanup, however when doing performance 
> measurements with fio I was shocked to see that it actually is detrimental to 
> performance. Here are the results: 
> 
> The command line used for fio: 
> fio --name=/media/scratch/seqread --rw=read --direct=0 --ioengine=sync --bs=4k
>  --numjobs=1 --size=1G --runtime=600  --group_reporting --loop 10
> 
> This was tested on a vm with 4g of ram so the size of the test is smaller 
> than 
> the memory, so pages should have been nicely readahead. 
> 

What this patch changes is that now we aren't reading all of the pages we are passed
by the readahead, so now we fall back to per-page reading when we go to read
those pages because the readahead window has already moved past them.  This
isn't great behavior, what we have is nice in that it tries to group the entire
range together as much as possible.  What your patch changes is that as soon as
we stop having a contiguous range we just bail and submit what we have.  Testing
it in isolation is likely to show very little change, but having recently
touched the fault in code where we definitely do not count major faults in some
cases I'd suspect that we're not reflecting this higher fault rate in the
performance counters properly.  We should preserve the existing behavior, what
hurts a little bit on a lightly loaded box is going to hurt a whole lot more on
a heavily loaded box.  Thanks,

Josef


Re: [PATCHv3] btrfs: Fix error handling in btrfs_cleanup_ordered_extents

2018-12-05 Thread David Sterba
On Wed, Nov 21, 2018 at 05:10:52PM +0200, Nikolay Borisov wrote:
> Running btrfs/124 in a loop hung up on me sporadically with the
> following call trace:
>   btrfs   D0  5760   5324 0x
>   Call Trace:
>? __schedule+0x243/0x800
>schedule+0x33/0x90
>btrfs_start_ordered_extent+0x10c/0x1b0 [btrfs]
>? wait_woken+0xa0/0xa0
>btrfs_wait_ordered_range+0xbb/0x100 [btrfs]
>btrfs_relocate_block_group+0x1ff/0x230 [btrfs]
>btrfs_relocate_chunk+0x49/0x100 [btrfs]
>    btrfs_balance+0xbeb/0x1740 [btrfs]
>    btrfs_ioctl_balance+0x2ee/0x380 [btrfs]
>btrfs_ioctl+0x1691/0x3110 [btrfs]
>? lockdep_hardirqs_on+0xed/0x180
>? __handle_mm_fault+0x8e7/0xfb0
>? _raw_spin_unlock+0x24/0x30
>? __handle_mm_fault+0x8e7/0xfb0
>? do_vfs_ioctl+0xa5/0x6e0
>    ? btrfs_ioctl_get_supported_features+0x30/0x30 [btrfs]
>do_vfs_ioctl+0xa5/0x6e0
>? entry_SYSCALL_64_after_hwframe+0x3e/0xbe
>ksys_ioctl+0x3a/0x70
>__x64_sys_ioctl+0x16/0x20
>do_syscall_64+0x60/0x1b0
>entry_SYSCALL_64_after_hwframe+0x49/0xbe
> 
> This happens because during page writeback it's valid for
> writepage_delalloc to instantiate a delalloc range which doesn't
> belong to the page currently being written back.
> 
> The reason this case is valid is due to find_lock_delalloc_range
> returning any available range after the passed delalloc_start and
> ignoring whether the page under writeback is within that range.
> In turn ordered extents (OE) are always created for the returned range
> from find_lock_delalloc_range. If, however, a failure occurs while OE
> are being created then the clean up code in btrfs_cleanup_ordered_extents
> will be called.
> 
> Unfortunately the code in btrfs_cleanup_ordered_extents doesn't consider
> the case of such 'foreign' range being processed and instead it always
> assumes that the range OE are created for belongs to the page. This
> leads to the first page of such foregin range to not be cleaned up since
> it's deliberately missed skipped by the current cleaning up code.
> 
> Fix this by correctly checking whether the current page belongs to the
> range being instantiated and if so adjsut the range parameters
> passed for cleaning up. If it doesn't, then just clean the whole OE
> range directly.
> 
> Signed-off-by: Nikolay Borisov 
> Reviewed-by: Josef Bacik 

Added to misc-next, thanks.


Re: [PATCH 01/10] btrfs: create a mount option for dax

2018-12-05 Thread Adam Borowski
On Wed, Dec 05, 2018 at 02:43:03PM +0200, Nikolay Borisov wrote:
> One question below though.
> 
> > +++ b/fs/btrfs/super.c
> > @@ -739,6 +741,17 @@ int btrfs_parse_options(struct btrfs_fs_info *info, 
> > char *options,
> > case Opt_user_subvol_rm_allowed:
> > btrfs_set_opt(info->mount_opt, USER_SUBVOL_RM_ALLOWED);
> > break;
> > +#ifdef CONFIG_FS_DAX
> > +   case Opt_dax:
> > +   if (btrfs_super_num_devices(info->super_copy) > 1) {
> > +   btrfs_info(info,
> > +      "dax not supported for multi-device 
> > btrfs partition\n");
> 
> What prevents supporting dax for multiple devices so long as all devices
> are dax?

As I mentioned in a separate mail, most profiles are either redundant
(RAID0), require hardware support (RAID1, DUP) or are impossible (RAID5,
RAID6).

But, "single" profile multi-device would be useful and actually provide
something other dax-supporting filesystems don't have: combining multiple
devices into one logical piece.

On the other hand, DUP profiles need to be banned.  In particular, the
filesystem you mount might have existing DUP block groups.


Meow!
-- 
⢀⣴⠾⠻⢶⣦⠀ 
⣾⠁⢠⠒⠀⣿⡁ Ivan was a worldly man: born in St. Petersburg, raised in
⢿⡄⠘⠷⠚⠋⠀ Petrograd, lived most of his life in Leningrad, then returned
⠈⠳⣄ to the city of his birth to die.


Re: [PATCH 07/10] dax: export functions for use with btrfs

2018-12-05 Thread Christoph Hellwig
If you want to export these at all they have to be EXPORT_SYMBOL_GPL.

But I'd really like to avoid seeing another duplicate DAX I/O path.
Please try to adopt the existing iomap-based infrastructure for your
needs first.
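
For reference, the iomap-based read path in ext4/XFS boils down to a
call like the following (a sketch; btrfs_iomap_ops is hypothetical and
would have to translate btrfs extents into struct iomap):

	static ssize_t btrfs_dax_read_iter(struct kiocb *iocb, struct iov_iter *to)
	{
		return dax_iomap_rw(iocb, to, &btrfs_iomap_ops);
	}

The extent walking and memcpy that em_dax_rw() open-codes would then
live behind the ->iomap_begin()/->iomap_end() callbacks instead.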


Re: [PATCH 1/3] btrfs: use offset_in_page instead of open-coding it

2018-12-05 Thread Nikolay Borisov



On 5.12.18 г. 16:23 ч., Johannes Thumshirn wrote:
> Constructs like 'var & (PAGE_SIZE - 1)' or 'var & ~PAGE_MASK' can denote an
> offset into a page.
> 
> So replace them by the offset_in_page() macro instead of open-coding it if
> they're not used as an alignment check.
> 
> Signed-off-by: Johannes Thumshirn 

Reviewed-by: Nikolay Borisov 

> ---
>  fs/btrfs/check-integrity.c | 12 +--
>  fs/btrfs/compression.c |  2 +-
>  fs/btrfs/extent_io.c   | 53 
> +++++-
>  fs/btrfs/file.c|  4 ++--
>  fs/btrfs/inode.c   |  7 +++---
>  fs/btrfs/send.c|  2 +-
>  fs/btrfs/volumes.c |  2 +-
>  7 files changed, 38 insertions(+), 44 deletions(-)
> 
> diff --git a/fs/btrfs/check-integrity.c b/fs/btrfs/check-integrity.c
> index 781cae168d2a..d319c3020c09 100644
> --- a/fs/btrfs/check-integrity.c
> +++ b/fs/btrfs/check-integrity.c
> @@ -1202,24 +1202,24 @@ static void btrfsic_read_from_block_data(
>   void *dstv, u32 offset, size_t len)
>  {
>   size_t cur;
> - size_t offset_in_page;
> + size_t pgoff;
>   char *kaddr;
>   char *dst = (char *)dstv;
> - size_t start_offset = block_ctx->start & ((u64)PAGE_SIZE - 1);
> + size_t start_offset = offset_in_page(block_ctx->start);
>   unsigned long i = (start_offset + offset) >> PAGE_SHIFT;
>  
>   WARN_ON(offset + len > block_ctx->len);
> - offset_in_page = (start_offset + offset) & (PAGE_SIZE - 1);
> + pgoff = offset_in_page(start_offset + offset);
>  
>   while (len > 0) {
> - cur = min(len, ((size_t)PAGE_SIZE - offset_in_page));
> + cur = min(len, ((size_t)PAGE_SIZE - pgoff));
>   BUG_ON(i >= DIV_ROUND_UP(block_ctx->len, PAGE_SIZE));
>   kaddr = block_ctx->datav[i];
> - memcpy(dst, kaddr + offset_in_page, cur);
> + memcpy(dst, kaddr + pgoff, cur);
>  
>   dst += cur;
>   len -= cur;
> - offset_in_page = 0;
> +     pgoff = 0;
>   i++;
>   }
>  }
> diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
> index dba59ae914b8..2ab5591449f2 100644
> --- a/fs/btrfs/compression.c
> +++ b/fs/btrfs/compression.c
> @@ -477,7 +477,7 @@ static noinline int add_ra_bio_pages(struct inode *inode,
>  
>   if (page->index == end_index) {
>   char *userpage;
> - size_t zero_offset = isize & (PAGE_SIZE - 1);
> + size_t zero_offset = offset_in_page(isize);
>  
>   if (zero_offset) {
>   int zeros;
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index b2769e92b556..e365c5272e6b 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -2585,7 +2585,7 @@ static void end_bio_extent_readpage(struct bio *bio)
>   unsigned off;
>  
>   /* Zero out the end if this page straddles i_size */
> - off = i_size & (PAGE_SIZE-1);
> + off = offset_in_page(i_size);
>   if (page->index == end_index && off)
>   zero_user_segment(page, off, PAGE_SIZE);
>   SetPageUptodate(page);
> @@ -2888,7 +2888,7 @@ static int __do_readpage(struct extent_io_tree *tree,
>  
>   if (page->index == last_byte >> PAGE_SHIFT) {
>   char *userpage;
> - size_t zero_offset = last_byte & (PAGE_SIZE - 1);
> + size_t zero_offset = offset_in_page(last_byte);
>  
>   if (zero_offset) {
>   iosize = PAGE_SIZE - zero_offset;
> @@ -3432,7 +3432,7 @@ static int __extent_writepage(struct page *page, struct 
> writeback_control *wbc,
>  
>   ClearPageError(page);
>  
> - pg_offset = i_size & (PAGE_SIZE - 1);
> + pg_offset = offset_in_page(i_size);
>   if (page->index > end_index ||
>  (page->index == end_index && !pg_offset)) {
>   page->mapping->a_ops->invalidatepage(page, 0, PAGE_SIZE);
> @@ -5307,7 +5307,7 @@ void read_extent_buffer(const struct extent_buffer *eb, 
> void *dstv,
>   struct page *page;
>   char *kaddr;
>   char *dst = (char *)dstv;
> - size_t start_offset = eb->start & ((u64)PAGE_SIZE - 1);
> + size_t start_offset = offset_in_page(eb->start);
>   unsigned long i = (start_offset + start) >> PAGE_SHIFT;
>  
>   if (start + len > eb->len) {
> @@ -5317,7 +5317,7 @@ void read_extent

Re: [PATCH 2/3] btrfs: use PAGE_ALIGNED instead of open-coding it

2018-12-05 Thread Nikolay Borisov



On 5.12.18 г. 16:23 ч., Johannes Thumshirn wrote:
> When using a 'var & (PAGE_SIZE - 1)' construct one is checking for a page
> alignment and thus should use the PAGE_ALIGNED() macro instead of
> open-coding it.
> 
> Convert all open-coded occurrences of PAGE_ALIGNED().
> 
> Signed-off-by: Johannes Thumshirn 

Reviewed-by: Nikolay Borisov 

> ---
>  fs/btrfs/check-integrity.c | 8 
>  fs/btrfs/compression.c | 2 +-
>  fs/btrfs/inode.c   | 2 +-
>  3 files changed, 6 insertions(+), 6 deletions(-)
> 
> diff --git a/fs/btrfs/check-integrity.c b/fs/btrfs/check-integrity.c
> index d319c3020c09..84e9729badaa 100644
> --- a/fs/btrfs/check-integrity.c
> +++ b/fs/btrfs/check-integrity.c
> @@ -1601,7 +1601,7 @@ static int btrfsic_read_block(struct btrfsic_state 
> *state,
>   BUG_ON(block_ctx->datav);
>   BUG_ON(block_ctx->pagev);
>   BUG_ON(block_ctx->mem_to_free);
> - if (block_ctx->dev_bytenr & ((u64)PAGE_SIZE - 1)) {
> + if (!PAGE_ALIGNED(block_ctx->dev_bytenr)) {
>   pr_info("btrfsic: read_block() with unaligned bytenr %llu\n",
>  block_ctx->dev_bytenr);
>   return -1;
> @@ -1778,7 +1778,7 @@ static void btrfsic_process_written_block(struct 
> btrfsic_dev_state *dev_state,
>   return;
>   }
>   is_metadata = 1;
> - BUG_ON(BTRFS_SUPER_INFO_SIZE & (PAGE_SIZE - 1));
> + BUG_ON(!PAGE_ALIGNED(BTRFS_SUPER_INFO_SIZE));
>   processed_len = BTRFS_SUPER_INFO_SIZE;
>   if (state->print_mask &
>   BTRFSIC_PRINT_MASK_TREE_BEFORE_SB_WRITE) {
> @@ -2892,12 +2892,12 @@ int btrfsic_mount(struct btrfs_fs_info *fs_info,
>   struct list_head *dev_head = &fs_devices->devices;
>   struct btrfs_device *device;
>  
> - if (fs_info->nodesize & ((u64)PAGE_SIZE - 1)) {
> + if (!PAGE_ALIGNED(fs_info->nodesize)) {
>   pr_info("btrfsic: cannot handle nodesize %d not being a 
> multiple of PAGE_SIZE %ld!\n",
>  fs_info->nodesize, PAGE_SIZE);
>   return -1;
>   }
> - if (fs_info->sectorsize & ((u64)PAGE_SIZE - 1)) {
> + if (!PAGE_ALIGNED(fs_info->sectorsize)) {
>   pr_info("btrfsic: cannot handle sectorsize %d not being a 
> multiple of PAGE_SIZE %ld!\n",
>  fs_info->sectorsize, PAGE_SIZE);
>   return -1;
> diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
> index 2ab5591449f2..d5381f39a9e8 100644
> --- a/fs/btrfs/compression.c
> +++ b/fs/btrfs/compression.c
> @@ -301,7 +301,7 @@ blk_status_t btrfs_submit_compressed_write(struct inode 
> *inode, u64 start,
>   blk_status_t ret;
>   int skip_sum = BTRFS_I(inode)->flags & BTRFS_INODE_NODATASUM;
>  
> - WARN_ON(start & ((u64)PAGE_SIZE - 1));
> + WARN_ON(!PAGE_ALIGNED(start));
>   cb = kmalloc(compressed_bio_size(fs_info, compressed_len), GFP_NOFS);
>   if (!cb)
>   return BLK_STS_RESOURCE;
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index bc0564c384de..5c52e91b01e8 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -2027,7 +2027,7 @@ int btrfs_set_extent_delalloc(struct inode *inode, u64 
> start, u64 end,
> unsigned int extra_bits,
> struct extent_state **cached_state, int dedupe)
>  {
> - WARN_ON((end & (PAGE_SIZE - 1)) == 0);
> + WARN_ON(PAGE_ALIGNED(end));
>   return set_extent_delalloc(&BTRFS_I(inode)->io_tree, start, end,
>  extra_bits, cached_state);
>  }
> 


[PATCH 1/3] btrfs: use offset_in_page instead of open-coding it

2018-12-05 Thread Johannes Thumshirn
Constructs like 'var & (PAGE_SIZE - 1)' or 'var & ~PAGE_MASK' can denote an
offset into a page.

So replace them by the offset_in_page() macro instead of open-coding it if
they're not used as an alignment check.

Signed-off-by: Johannes Thumshirn 
---
 fs/btrfs/check-integrity.c | 12 +------
 fs/btrfs/compression.c |  2 +-
 fs/btrfs/extent_io.c   | 53 +-----
 fs/btrfs/file.c|  4 ++--
 fs/btrfs/inode.c   |  7 +++---
 fs/btrfs/send.c|  2 +-
 fs/btrfs/volumes.c |  2 +-
 7 files changed, 38 insertions(+), 44 deletions(-)

diff --git a/fs/btrfs/check-integrity.c b/fs/btrfs/check-integrity.c
index 781cae168d2a..d319c3020c09 100644
--- a/fs/btrfs/check-integrity.c
+++ b/fs/btrfs/check-integrity.c
@@ -1202,24 +1202,24 @@ static void btrfsic_read_from_block_data(
void *dstv, u32 offset, size_t len)
 {
size_t cur;
-   size_t offset_in_page;
+   size_t pgoff;
char *kaddr;
char *dst = (char *)dstv;
-   size_t start_offset = block_ctx->start & ((u64)PAGE_SIZE - 1);
+   size_t start_offset = offset_in_page(block_ctx->start);
unsigned long i = (start_offset + offset) >> PAGE_SHIFT;
 
WARN_ON(offset + len > block_ctx->len);
-   offset_in_page = (start_offset + offset) & (PAGE_SIZE - 1);
+   pgoff = offset_in_page(start_offset + offset);
 
while (len > 0) {
-   cur = min(len, ((size_t)PAGE_SIZE - offset_in_page));
+   cur = min(len, ((size_t)PAGE_SIZE - pgoff));
BUG_ON(i >= DIV_ROUND_UP(block_ctx->len, PAGE_SIZE));
kaddr = block_ctx->datav[i];
-   memcpy(dst, kaddr + offset_in_page, cur);
+   memcpy(dst, kaddr + pgoff, cur);
 
dst += cur;
len -= cur;
-   offset_in_page = 0;
+   pgoff = 0;
    i++;
}
 }
diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
index dba59ae914b8..2ab5591449f2 100644
--- a/fs/btrfs/compression.c
+++ b/fs/btrfs/compression.c
@@ -477,7 +477,7 @@ static noinline int add_ra_bio_pages(struct inode *inode,
 
if (page->index == end_index) {
char *userpage;
-   size_t zero_offset = isize & (PAGE_SIZE - 1);
+   size_t zero_offset = offset_in_page(isize);
 
if (zero_offset) {
        int zeros;
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index b2769e92b556..e365c5272e6b 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2585,7 +2585,7 @@ static void end_bio_extent_readpage(struct bio *bio)
unsigned off;
 
/* Zero out the end if this page straddles i_size */
-   off = i_size & (PAGE_SIZE-1);
+   off = offset_in_page(i_size);
if (page->index == end_index && off)
zero_user_segment(page, off, PAGE_SIZE);
SetPageUptodate(page);
@@ -2888,7 +2888,7 @@ static int __do_readpage(struct extent_io_tree *tree,
 
if (page->index == last_byte >> PAGE_SHIFT) {
char *userpage;
-   size_t zero_offset = last_byte & (PAGE_SIZE - 1);
+   size_t zero_offset = offset_in_page(last_byte);
 
if (zero_offset) {
iosize = PAGE_SIZE - zero_offset;
@@ -3432,7 +3432,7 @@ static int __extent_writepage(struct page *page, struct writeback_control *wbc,
 
ClearPageError(page);
 
-   pg_offset = i_size & (PAGE_SIZE - 1);
+   pg_offset = offset_in_page(i_size);
if (page->index > end_index ||
   (page->index == end_index && !pg_offset)) {
page->mapping->a_ops->invalidatepage(page, 0, PAGE_SIZE);
@@ -5307,7 +5307,7 @@ void read_extent_buffer(const struct extent_buffer *eb, void *dstv,
struct page *page;
char *kaddr;
char *dst = (char *)dstv;
-   size_t start_offset = eb->start & ((u64)PAGE_SIZE - 1);
+   size_t start_offset = offset_in_page(eb->start);
unsigned long i = (start_offset + start) >> PAGE_SHIFT;
 
if (start + len > eb->len) {
@@ -5317,7 +5317,7 @@ void read_extent_buffer(const struct extent_buffer *eb, void *dstv,
return;
}
 
-   offset = (start_offset + start) & (PAGE_SIZE - 1);
+   offset = offset_in_page(start_offset + start);
 
while (len > 0) {
page = eb->pages[i];
@@ -5342,14 +5342,14 @@ int read_extent_buffer_to_user(const struct extent_buffer *eb,
struct page *page;
char *kaddr;
char __user *dst = (char __user *)dstv;
-   size_t start_offset = eb->start & ((u64)PAGE_SIZE - 1);

[PATCH 0/3] btrfs: use offset_in_page and PAGE_ALIGNED

2018-12-05 Thread Johannes Thumshirn
Use the offset_in_page() and PAGE_ALIGNED() macros instead of open-coding them
throughout btrfs.

This series also includes a patch for 'make coccicheck' which is marked as an
RFC and I've CCed Julia hoping to get input from her.
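
For reference, the two helpers are defined in include/linux/mm.h as:

	#define offset_in_page(p)	((unsigned long)(p) & ~PAGE_MASK)
	#define PAGE_ALIGNED(addr)	IS_ALIGNED((unsigned long)(addr), PAGE_SIZE)

That is, the first extracts the bits below PAGE_SHIFT while the second
tests that they are all zero, which is exactly the offset vs.
alignment-check distinction the first two patches draw.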

Johannes Thumshirn (3):
  btrfs: use offset_in_page instead of open-coding it
  btrfs: use PAGE_ALIGNED instead of open-coding it
  coccinelle: api: add offset_in_page.cocci

 fs/btrfs/check-integrity.c  | 20 +++
 fs/btrfs/compression.c  |  4 +-
 fs/btrfs/extent_io.c| 53 +--
 fs/btrfs/file.c |  4 +-
 fs/btrfs/inode.c|  9 ++--
 fs/btrfs/send.c |  2 +-
 fs/btrfs/volumes.c  |  2 +-
 scripts/coccinelle/api/offset_in_page.cocci | 81 +
 8 files changed, 125 insertions(+), 50 deletions(-)
 create mode 100644 scripts/coccinelle/api/offset_in_page.cocci

-- 
2.16.4



[PATCH 2/3] btrfs: use PAGE_ALIGNED instead of open-coding it

2018-12-05 Thread Johannes Thumshirn
When using a 'var & (PAGE_SIZE - 1)' construct one is checking for a page
alignment and thus should use the PAGE_ALIGNED() macro instead of
open-coding it.

Convert all open-coded occurrences of PAGE_ALIGNED().

Signed-off-by: Johannes Thumshirn 
---
 fs/btrfs/check-integrity.c | 8 
 fs/btrfs/compression.c | 2 +-
 fs/btrfs/inode.c   | 2 +-
 3 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/fs/btrfs/check-integrity.c b/fs/btrfs/check-integrity.c
index d319c3020c09..84e9729badaa 100644
--- a/fs/btrfs/check-integrity.c
+++ b/fs/btrfs/check-integrity.c
@@ -1601,7 +1601,7 @@ static int btrfsic_read_block(struct btrfsic_state *state,
BUG_ON(block_ctx->datav);
BUG_ON(block_ctx->pagev);
BUG_ON(block_ctx->mem_to_free);
-   if (block_ctx->dev_bytenr & ((u64)PAGE_SIZE - 1)) {
+   if (!PAGE_ALIGNED(block_ctx->dev_bytenr)) {
pr_info("btrfsic: read_block() with unaligned bytenr %llu\n",
   block_ctx->dev_bytenr);
return -1;
@@ -1778,7 +1778,7 @@ static void btrfsic_process_written_block(struct btrfsic_dev_state *dev_state,
return;
}
is_metadata = 1;
-   BUG_ON(BTRFS_SUPER_INFO_SIZE & (PAGE_SIZE - 1));
+   BUG_ON(!PAGE_ALIGNED(BTRFS_SUPER_INFO_SIZE));
processed_len = BTRFS_SUPER_INFO_SIZE;
if (state->print_mask &
BTRFSIC_PRINT_MASK_TREE_BEFORE_SB_WRITE) {
@@ -2892,12 +2892,12 @@ int btrfsic_mount(struct btrfs_fs_info *fs_info,
	struct list_head *dev_head = &fs_devices->devices;
struct btrfs_device *device;
 
-   if (fs_info->nodesize & ((u64)PAGE_SIZE - 1)) {
+   if (!PAGE_ALIGNED(fs_info->nodesize)) {
pr_info("btrfsic: cannot handle nodesize %d not being a 
multiple of PAGE_SIZE %ld!\n",
   fs_info->nodesize, PAGE_SIZE);
return -1;
}
-   if (fs_info->sectorsize & ((u64)PAGE_SIZE - 1)) {
+   if (!PAGE_ALIGNED(fs_info->sectorsize)) {
pr_info("btrfsic: cannot handle sectorsize %d not being a 
multiple of PAGE_SIZE %ld!\n",
   fs_info->sectorsize, PAGE_SIZE);
return -1;
diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
index 2ab5591449f2..d5381f39a9e8 100644
--- a/fs/btrfs/compression.c
+++ b/fs/btrfs/compression.c
@@ -301,7 +301,7 @@ blk_status_t btrfs_submit_compressed_write(struct inode *inode, u64 start,
blk_status_t ret;
int skip_sum = BTRFS_I(inode)->flags & BTRFS_INODE_NODATASUM;
 
-   WARN_ON(start & ((u64)PAGE_SIZE - 1));
+   WARN_ON(!PAGE_ALIGNED(start));
cb = kmalloc(compressed_bio_size(fs_info, compressed_len), GFP_NOFS);
if (!cb)
return BLK_STS_RESOURCE;
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index bc0564c384de..5c52e91b01e8 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -2027,7 +2027,7 @@ int btrfs_set_extent_delalloc(struct inode *inode, u64 start, u64 end,
  unsigned int extra_bits,
  struct extent_state **cached_state, int dedupe)
 {
-   WARN_ON((end & (PAGE_SIZE - 1)) == 0);
+   WARN_ON(PAGE_ALIGNED(end));
	return set_extent_delalloc(&BTRFS_I(inode)->io_tree, start, end,
   extra_bits, cached_state);
 }
-- 
2.16.4



Btrfs progs release 4.19.1

2018-12-05 Thread David Sterba
Hi,

btrfs-progs version 4.19.1 has been released.  There are build fixes, a minor
update to libbtrfsutil and documentation updates.

Changes since 4.19.1-rc1: fix typos in CHANGES

Changes:

  * build fixes
* big-endian builds fail due to bswap helper clashes
* 'swap' macro is too generic, renamed to prevent build failures
  * libbtrfs
* minor version update to 1.1.0
* fix default search to top=0 as documented
* rename 'async' to avoid future python binding problems
* add support for unprivileged subvolume listing ioctls
* added tests, API docs
  * other
* lot of typos fixed
* warning cleanups
* doc formatting updates
* CI tests against zstd 1.3.7

Tarballs: https://www.kernel.org/pub/linux/kernel/people/kdave/btrfs-progs/
Git: git://git.kernel.org/pub/scm/linux/kernel/git/kdave/btrfs-progs.git

Shortlog:

David Sterba (7):
  btrfs-progs: kerncompat: rename swap to __swap
  btrfs-progs: README: add link to INSTALL
  btrfs-progs: docs: fix rendering of exponents in manual pages
  btrfs-progs: link to libbtrfsutil/README from the main README
  btrfs-progs: tests: pull zstd version 1.3.7 to the travis CI
  btrfs-progs: update CHANGES for v4.19.1
  Btrfs progs v4.19.1

Josh Soref (11):
  btrfs-progs: docs: fix typos in Documentation
  btrfs-progs: docs: fix typos in READMEs, INSTALL and CHANGES
  btrfs-progs: fix typos in Makefile
  btrfs-progs: tests: fix typos in test comments
  btrfs-progs: tests: fsck/025, fix typo in helper name
  btrfs-progs: fix typos in comments
  btrfs-progs: fix typos in user-visible strings
  btrfs-progs: check: fix typo in device_extent_record::chunk_objectid
  btrfs-progs: utils: fix typo in a variable
  btrfs-progs: mkfs: fix typo in "multipler" variables
  btrfs-progs: fix typo in btrfs-list function export

Omar Sandoval (10):
  libbtrfsutil: use top=0 as default for SubvolumeIterator()
  libbtrfsutil: change async parameters to async_ in Python bindings
  libbtrfsutil: document qgroup_inherit parameter in Python bindings
  libbtrfsutil: use SubvolumeIterator as context manager in tests
  libbtrfsutil: add test helpers for dropping privileges
  libbtrfsutil: allow tests to create multiple Btrfs instances
  libbtrfsutil: relax the privileges of subvolume_info()
  libbtrfsutil: relax the privileges of subvolume iterator
  libbtrfsutil: bump version to 1.1.0
  libbtrfsutil: document API in README

Rosen Penev (3):
  btrfs-progs: kernel-lib: bitops: Fix big endian compilation
  btrfs-progs: task-utils: Fix comparison between pointer and integer
  btrfs-progs: treewide: Fix missing declarations



Re: [PATCH v2 07/13] btrfs-progs: Fix Wmaybe-uninitialized warning

2018-12-05 Thread Qu Wenruo


On 2018/12/5 下午9:40, David Sterba wrote:
> On Wed, Dec 05, 2018 at 02:40:12PM +0800, Qu Wenruo wrote:
>> GCC 8.2.1 will report the following warning with "make W=1":
>>
>>   ctree.c: In function 'btrfs_next_sibling_tree_block':
>>   ctree.c:2990:21: warning: 'slot' may be used uninitialized in this 
>> function [-Wmaybe-uninitialized]
>> path->slots[level] = slot;
>> ~~~^~
>>
>> The culprit is the following code:
>>
>>  int slot;   << Not initialized
>>  int level = path->lowest_level + 1;
>>  BUG_ON(path->lowest_level + 1 >= BTRFS_MAX_LEVEL);
>>  while(level < BTRFS_MAX_LEVEL) {
>>  slot = path->slots[level] + 1;
>>  ^^ but we initialize @slot here.
>>  ...
>>  }
>>  path->slots[level] = slot;
>>
>> It's possible that compiler doesn't get enough hint for BUG_ON() on
>> lowest_level + 1 >= BTRFS_MAX_LEVEL case.
>>
>> Fix it by using a do {} while() loop rather than a while() {} loop, to
>> ensure we will run the loop for at least once.
> 
> I was hoping that we can actually add the hint to BUG_ON so the code
> does not continue if the condition is true.
> 
I checked that method, but I'm not that confident about things like:

__attribute__ ((noreturn))
static inline void __bugon_trace(void);

static inline void bugon_trace(int val)
{
	if (!val)
		return;
	__bugon_trace();
}

This is as simple as just one extra function call, but the original
problem is just one more function call before hitting abort().

So I just gave up tweaking things I'm not comfortable enough with.

The current do {} while() loop is the most direct solution; if GCC one
day still gives such a warning then I'll have some harsh words for it.

Thanks,
Qu





Re: [PATCH 07/10] dax: export functions for use with btrfs

2018-12-05 Thread Johannes Thumshirn
On 05/12/2018 13:28, Goldwyn Rodrigues wrote:

[...]

> -static void *grab_mapping_entry(struct xa_state *xas,
> +void *grab_mapping_entry(struct xa_state *xas,
>   struct address_space *mapping, unsigned long size_flag)
>  {
>   unsigned long index = xas->xa_index;
> @@ -531,6 +532,7 @@ static void *grab_mapping_entry(struct xa_state *xas,
>   xas_unlock_irq(xas);
>   return xa_mk_internal(VM_FAULT_FALLBACK);
>  }
> +EXPORT_SYMBOL(grab_mapping_entry);

dax_grab_mapping_entry() please.



-- 
Johannes ThumshirnSUSE Labs Filesystems
jthumsh...@suse.de+49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850


Re: [PATCH 00/10] btrfs: Support for DAX devices

2018-12-05 Thread Adam Borowski
On Wed, Dec 05, 2018 at 06:28:25AM -0600, Goldwyn Rodrigues wrote:
> This is a support for DAX in btrfs.

Yay!

> I understand there have been previous attempts at it.  However, I wanted
> to make sure copy-on-write (COW) works on dax as well.

btrfs' usual use of CoW and DAX are thoroughly in conflict.

The very point of DAX is to have writes not go through the kernel: you
mmap the file, then do all writes right to the pmem, flushing when needed
(without hitting the kernel) and having the processor+memory persist what
you wrote.

CoW via page faults are fine -- pmem is closer to memory than disk, and this
means the kernel will ask the filesystem for an extent to place the new page
in, copy the contents and let the process play with it.  But real btrfs CoW
would mean we'd need to page fault on ᴇᴠᴇʀʏ ꜱɪɴɢʟᴇ ᴡʀɪᴛᴇ.

Delaying CoW until the next commit doesn't help -- you'd need to store the
dirty page in DRAM then write it, which goes against the whole concept of
DAX.

Only way I see would be to CoW once then pretend the page is nodatacow until
the next commit, when we checksum it, add to the metadata trees, and mark
for CoWing on the next write.  Lots of complexity, and you still need to
copy the whole thing every commit (so no gain).

Ie, we're in nodatacow land.  CoW for metadata is fine.

> Before I present this to the FS folks I wanted to run this through the
> btrfs. Even though I wish, I cannot get it correct the first time
> around :/.. Here are some questions for which I need suggestions:
> 
> Questions:
> 1. I have been unable to do checksumming for DAX devices. While
> checksumming can be done for reads and writes, it is a problem when mmap
> is involved because btrfs kernel module does not get back control after
> an mmap() writes. Any ideas are appreciated, or we would have to set
> nodatasum when dax is enabled.

Per the above, it sounds like nodatacow (ie, "cow once") would be needed.

> 2. Currently, a user can continue writing on "old" extents of an mmaped file
> after a snapshot has been created. How can we enforce writes to be directed
> to new extents after snapshots have been created? Do we keep a list of
> all mmap()s, and re-mmap them after a snapshot?

Same as for any other memory that's shared: when a new instance of sharing
is added (a snapshot/reflink in our case), you deny writes, causing a page
fault on the next attempt.  "pmem" is named "ᴘersistent ᴍᴇᴍory" for a
reason...

> Tested by creating a pmem device in RAM with "memmap=2G!4G" kernel
> command line parameter.

Might be more useful to use a bigger piece of the "disk" than 2G, it's not
in the danger area though.

Also note that it's utterly pointless to use any RAID modes; multi-dev
single is fine, DUP counts as RAID here.
* RAID0 is already done better in hardware (interleave)
* RAID1 would require hardware support, replication isn't easy
* RAID5/6 

What would make sense is disabling dax for any files that are not marked as
nodatacow.  This way, unrelated files can still use checksums or
compression, while only files meant as a pmempool or otherwise by a
pmem-aware program would have dax writes (you can still give read-only pages
that CoW to DRAM).  This way we can have write dax for only a subset of
files, and full set of btrfs features for the rest.  Write dax is dangerous
for programs that have no specific support: the vast majority of
database-like programs rely on page-level atomicity while pmem gives you
cacheline/word atomicity only; torn writes mean data loss.
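
(Marking a file nodatacow is already possible from userspace today, e.g.:

# touch /mnt/pmempool
# chattr +C /mnt/pmempool

with the caveat that +C must be set while the file is still empty, so a
dax policy keyed off that attribute would need no new user-visible
knobs.)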


Meow!
-- 
⢀⣴⠾⠻⢶⣦⠀ 
⣾⠁⢠⠒⠀⣿⡁ Ivan was a worldly man: born in St. Petersburg, raised in
⢿⡄⠘⠷⠚⠋⠀ Petrograd, lived most of his life in Leningrad, then returned
⠈⠳⣄ to the city of his birth to die.


Re: [PATCH 06/10] btrfs: dax write support

2018-12-05 Thread Johannes Thumshirn
start,
> + end - start, true);
> + iocb->ki_pos += done;
> + if (iocb->ki_pos > i_size_read(inode))
> + i_size_write(inode, iocb->ki_pos);
> + }
> +
> + btrfs_delalloc_release_extents(BTRFS_I(inode), count, false);
> +out2:

out?

> + if (count - done > 0)
> + btrfs_delalloc_release_space(inode, data_reserved, pos,
> + count - done, true);
> + extent_changeset_free(data_reserved);
> +return done ? done : ret;
> +
> +}
> diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
> index ef6ed93f44d1..29a3b12e6660 100644
> --- a/fs/btrfs/file.c
> +++ b/fs/btrfs/file.c
> @@ -1964,7 +1964,9 @@ static ssize_t btrfs_file_write_iter(struct kiocb *iocb,
>   if (sync)
>   atomic_inc(_I(inode)->sync_writers);
>  
> - if (iocb->ki_flags & IOCB_DIRECT) {
> + if (IS_DAX(inode)) {
> + num_written = btrfs_file_dax_write(iocb, from);
> + } else if (iocb->ki_flags & IOCB_DIRECT) {
>   num_written = __btrfs_direct_write(iocb, from);
>   } else {
>   num_written = btrfs_buffered_write(iocb, from);
> 


-- 
Johannes ThumshirnSUSE Labs Filesystems
jthumsh...@suse.de+49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850


Re: [PATCH v2 07/13] btrfs-progs: Fix Wmaybe-uninitialized warning

2018-12-05 Thread David Sterba
On Wed, Dec 05, 2018 at 02:40:12PM +0800, Qu Wenruo wrote:
> GCC 8.2.1 will report the following warning with "make W=1":
> 
>   ctree.c: In function 'btrfs_next_sibling_tree_block':
>   ctree.c:2990:21: warning: 'slot' may be used uninitialized in this function 
> [-Wmaybe-uninitialized]
> path->slots[level] = slot;
> ~~~^~
> 
> The culprit is the following code:
> 
>   int slot;   << Not initialized
>   int level = path->lowest_level + 1;
>   BUG_ON(path->lowest_level + 1 >= BTRFS_MAX_LEVEL);
>   while(level < BTRFS_MAX_LEVEL) {
>   slot = path->slots[level] + 1;
>   ^^ but we initialize @slot here.
>   ...
>   }
>   path->slots[level] = slot;
> 
> It's possible that compiler doesn't get enough hint for BUG_ON() on
> lowest_level + 1 >= BTRFS_MAX_LEVEL case.
> 
> Fix it by using a do {} while() loop rather than a while() {} loop, to
> ensure we will run the loop for at least once.

I was hoping that we can actually add the hint to BUG_ON so the code
does not continue if the condition is true.
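
Something along these lines should give the optimizer that hint (a
sketch against plain libc, not against the actual kerncompat.h macros):

	#include <stdio.h>
	#include <stdlib.h>

	static inline void __attribute__((noreturn)) bugon_abort(void)
	{
		abort();
	}

	#define BUG_ON(c)						\
		do {							\
			if (c) {					\
				fprintf(stderr, "BUG at %s:%d\n",	\
					__FILE__, __LINE__);		\
				bugon_abort();				\
			}						\
		} while (0)

With the abort path marked noreturn, GCC can prove that @level starts
below BTRFS_MAX_LEVEL after the BUG_ON(), hence that the while () loop
body runs at least once and @slot is always initialized.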


Re: [PATCH 03/10] btrfs: dax: read zeros from holes

2018-12-05 Thread Nikolay Borisov



On 5.12.18 г. 14:28 ч., Goldwyn Rodrigues wrote:
> From: Goldwyn Rodrigues 
> 
> Signed-off-by: Goldwyn Rodrigues 
> ---
>  fs/btrfs/dax.c | 7 ++-
>  1 file changed, 6 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/btrfs/dax.c b/fs/btrfs/dax.c
> index d614bf73bf8e..5a297674adec 100644
> --- a/fs/btrfs/dax.c
> +++ b/fs/btrfs/dax.c
> @@ -54,7 +54,12 @@ ssize_t btrfs_file_dax_read(struct kiocb *iocb, struct 
> iov_iter *to)

nit: I think it's better if you rename the iterator variable to "iter".

>  
>  BUG_ON(em->flags & EXTENT_FLAG_FS_MAPPING);
>  
> -ret = em_dax_rw(inode, em, pos, len, to);
> + if (em->block_start == EXTENT_MAP_HOLE) {
> + u64 zero_len = min(em->len - (em->start - pos), len);

Shouldn't this be em->len - (pos - em->start) since this gives the
remaining length of the extent? Isn't pos guaranteed to be >= em->start ?

> + ret = iov_iter_zero(zero_len, to);
> + } else {
> + ret = em_dax_rw(inode, em, pos, len, to);
> + }
>  if (ret < 0)
>  goto out;
>  pos += ret;
> 


Re: [PATCH 02/10] btrfs: basic dax read

2018-12-05 Thread Johannes Thumshirn
On 05/12/2018 13:28, Goldwyn Rodrigues wrote:
> From: Goldwyn Rodrigues 
> 
> Signed-off-by: Goldwyn Rodrigues 
Can you explain why we can't use the dax_iomap_rw() interface like XFS or
EXT4?

[...]

> +static ssize_t em_dax_rw(struct inode *inode, struct extent_map *em, u64 pos,
> + u64 len, struct iov_iter *iter)
> +{
> +struct dax_device *dax_dev = fs_dax_get_by_bdev(em->bdev);
> +ssize_t map_len;
> +pgoff_t blk_pg;
> +void *kaddr;
> +sector_t blk_start;
> +unsigned offset = pos & (PAGE_SIZE - 1);

Nit: unsigned offset = offset_in_page(pos);




-- 
Johannes Thumshirn                                  SUSE Labs Filesystems
jthumsh...@suse.de                                  +49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850


Re: [PATCH 02/10] btrfs: basic dax read

2018-12-05 Thread Nikolay Borisov



On 5.12.18 г. 14:28 ч., Goldwyn Rodrigues wrote:
> From: Goldwyn Rodrigues 
> 
> Signed-off-by: Goldwyn Rodrigues 
> ---
>  fs/btrfs/Makefile |  1 +
>  fs/btrfs/ctree.h  |  5 
>  fs/btrfs/dax.c| 68 
> +++
>  fs/btrfs/file.c   | 13 ++-
>  4 files changed, 86 insertions(+), 1 deletion(-)
>  create mode 100644 fs/btrfs/dax.c
> 
> diff --git a/fs/btrfs/Makefile b/fs/btrfs/Makefile
> index ca693dd554e9..1fa77b875ae9 100644
> --- a/fs/btrfs/Makefile
> +++ b/fs/btrfs/Makefile
> @@ -12,6 +12,7 @@ btrfs-y += super.o ctree.o extent-tree.o print-tree.o 
> root-tree.o dir-item.o \
>  reada.o backref.o ulist.o qgroup.o send.o dev-replace.o raid56.o \
>  uuid-tree.o props.o free-space-tree.o tree-checker.o
>  
> +btrfs-$(CONFIG_FS_DAX) += dax.o
>  btrfs-$(CONFIG_BTRFS_FS_POSIX_ACL) += acl.o
>  btrfs-$(CONFIG_BTRFS_FS_CHECK_INTEGRITY) += check-integrity.o
>  btrfs-$(CONFIG_BTRFS_FS_REF_VERIFY) += ref-verify.o
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index 5cc470fa6a40..038d64ecebe5 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -3685,6 +3685,11 @@ int btrfs_reada_wait(void *handle);
>  void btrfs_reada_detach(void *handle);
>  int btree_readahead_hook(struct extent_buffer *eb, int err);
>  
> +#ifdef CONFIG_FS_DAX
> +/* dax.c */
> +ssize_t btrfs_file_dax_read(struct kiocb *iocb, struct iov_iter *to);
> +#endif /* CONFIG_FS_DAX */
> +
>  static inline int is_fstree(u64 rootid)
>  {
>   if (rootid == BTRFS_FS_TREE_OBJECTID ||
> diff --git a/fs/btrfs/dax.c b/fs/btrfs/dax.c
> new file mode 100644
> index ..d614bf73bf8e
> --- /dev/null
> +++ b/fs/btrfs/dax.c
> @@ -0,0 +1,68 @@
> +#include 
> +#include 
> +#include "ctree.h"
> +#include "btrfs_inode.h"
> +
> +static ssize_t em_dax_rw(struct inode *inode, struct extent_map *em, u64 pos,
> + u64 len, struct iov_iter *iter)
> +{
> +struct dax_device *dax_dev = fs_dax_get_by_bdev(em->bdev);
> +ssize_t map_len;
> +pgoff_t blk_pg;
> +void *kaddr;
> +sector_t blk_start;
> +unsigned offset = pos & (PAGE_SIZE - 1);

offset = offset_in_page(pos)

> +
> +len = min(len + offset, em->len - (pos - em->start));
> +len = ALIGN(len, PAGE_SIZE);

len = PAGE_ALIGN(len);

> +blk_start = (get_start_sect(em->bdev) << 9) + (em->block_start + (pos - em->start));
> +blk_pg = blk_start - offset;
> +map_len = dax_direct_access(dax_dev, PHYS_PFN(blk_pg), PHYS_PFN(len), &kaddr, NULL);
> +map_len = PFN_PHYS(map_len);
> +kaddr += offset;
> +map_len -= offset;
> +if (map_len > len)
> +map_len = len;

map_len = min(map_len, len);

> +if (iov_iter_rw(iter) == WRITE)
> +return dax_copy_from_iter(dax_dev, blk_pg, kaddr, map_len, iter);
> +else
> +return dax_copy_to_iter(dax_dev, blk_pg, kaddr, map_len, iter);

Have you looked at the implementation of dax_iomap_actor where they have
pretty similar code. In case of either of those returning 0 they set ret
to EFAULT, should the same be done in btrfs_file_dax_read? IMO it will
be good of you can follow dax_iomap_actor's logic as much as possible
since this code has been used for quite some time and is deemed robust.

> +}
> +
> +ssize_t btrfs_file_dax_read(struct kiocb *iocb, struct iov_iter *to)
> +{
> +size_t ret = 0, done = 0, count = iov_iter_count(to);
> +struct extent_map *em;
> +u64 pos = iocb->ki_pos;
> +u64 end = pos + count;
> +struct inode *inode = file_inode(iocb->ki_filp);
> +
> +if (!count)
> +return 0;
> +
> +end = i_size_read(inode) < end ? i_size_read(inode) : end;

end = min(i_size_read(inode), end)

> +
> +while (pos < end) {
> +u64 len = end - pos;
> +
> +em = btrfs_get_extent(BTRFS_I(inode), NULL, 0, pos, len, 0);
> +if (IS_ERR(em)) {
> +if (!ret)
> +ret = PTR_ERR(em);
> +goto out;
> +}
> +
> +BUG_ON(em->flags & EXTENT_FLAG_FS_MAPPING);

I think this can never trigger, because EXTENT_FLAG_FS_MAPPING is set
for extents that map chunks, and those are housed in the chunk tree at
fs_info->mapping_tree. Since the write callback is only ever called for
file inodes I'd say this BUG_ON can be eliminated. Did you manage to
trigger it during development?


> +
> +ret = em_dax_rw(ino

Re: [PATCH 00/10] btrfs: Support for DAX devices

2018-12-05 Thread Qu Wenruo


On 2018/12/5 下午8:28, Goldwyn Rodrigues wrote:
> This is a support for DAX in btrfs. I understand there have been
> previous attempts at it. However, I wanted to make sure copy-on-write
> (COW) works on dax as well.
> 
> Before I present this to the FS folks I wanted to run this through the
> btrfs list. Much as I wish I could, I cannot get it correct the first time
> around :/. Here are some questions for which I need suggestions:
> 
> Questions:
> 1. I have been unable to do checksumming for DAX devices. While
> checksumming can be done for reads and writes, it is a problem when mmap
> is involved because btrfs kernel module does not get back control after
> an mmap() writes. Any ideas are appreciated, or we would have to set
> nodatasum when dax is enabled.

I'm not familiar with DAX, so it's completely possible I'm talking like
an idiot.

If btrfs_page_mkwrite() can't provide enough control, then I have a
crazy idea.

Force a page fault for every mmap() read/write (completely disabling the
page cache, like DIO).
That way we would get some control, since we're notified when the page is
read, and could do some hacks there.

Thanks,
Qu
> 
> 2. Currently, a user can continue writing on "old" extents of an mmaped file
> after a snapshot has been created. How can we enforce writes to be directed
> to new extents after snapshots have been created? Do we keep a list of
> all mmap()s, and re-mmap them after a snapshot?
> 
> Tested by creating a pmem device in RAM with "memmap=2G!4G" kernel
> command line parameter.
> 
> 
> [PATCH 01/10] btrfs: create a mount option for dax
> [PATCH 02/10] btrfs: basic dax read
> [PATCH 03/10] btrfs: dax: read zeros from holes
> [PATCH 04/10] Rename __endio_write_update_ordered() to
> [PATCH 05/10] btrfs: Carve out btrfs_get_extent_map_write() out of
> [PATCH 06/10] btrfs: dax write support
> [PATCH 07/10] dax: export functions for use with btrfs
> [PATCH 08/10] btrfs: dax add read mmap path
> [PATCH 09/10] btrfs: dax support for cow_page/mmap_private and shared
> [PATCH 10/10] btrfs: dax mmap write
> 
>  fs/btrfs/Makefile   |1 
>  fs/btrfs/ctree.h|   17 ++
>  fs/btrfs/dax.c  |  303 ++++++--
>  fs/btrfs/file.c |   29 
>  fs/btrfs/inode.c|   54 +
>  fs/btrfs/ioctl.c|5 
>  fs/btrfs/super.c|   15 ++
>  fs/dax.c|   35 --
>  include/linux/dax.h |   16 ++
>  9 files changed, 430 insertions(+), 45 deletions(-)
> 
> 





Re: [PATCH 01/10] btrfs: create a mount option for dax

2018-12-05 Thread Nikolay Borisov



On 5.12.18 г. 14:28 ч., Goldwyn Rodrigues wrote:
> From: Goldwyn Rodrigues 
> 
> Also, set the inode->i_flags to S_DAX
> 
> Signed-off-by: Goldwyn Rodrigues 

Reviewed-by: Nikolay Borisov 

One question below though .

> ---
>  fs/btrfs/ctree.h |  1 +
>  fs/btrfs/ioctl.c |  5 -
>  fs/btrfs/super.c | 15 +++
>  3 files changed, 20 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index 68f322f600a0..5cc470fa6a40 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -1353,6 +1353,7 @@ static inline u32 BTRFS_MAX_XATTR_SIZE(const struct 
> btrfs_fs_info *info)
>  #define BTRFS_MOUNT_FREE_SPACE_TREE  (1 << 26)
>  #define BTRFS_MOUNT_NOLOGREPLAY  (1 << 27)
>  #define BTRFS_MOUNT_REF_VERIFY   (1 << 28)
> +#define BTRFS_MOUNT_DAX  (1 << 29)
>  
>  #define BTRFS_DEFAULT_COMMIT_INTERVAL    (30)
>  #define BTRFS_DEFAULT_MAX_INLINE (2048)
> diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
> index 802a628e9f7d..e9146c157816 100644
> --- a/fs/btrfs/ioctl.c
> +++ b/fs/btrfs/ioctl.c
> @@ -149,8 +149,11 @@ void btrfs_sync_inode_flags_to_i_flags(struct inode 
> *inode)
>   if (binode->flags & BTRFS_INODE_DIRSYNC)
>   new_fl |= S_DIRSYNC;
>  
> + if ((btrfs_test_opt(btrfs_sb(inode->i_sb), DAX)) && 
> S_ISREG(inode->i_mode))
> + new_fl |= S_DAX;
> +
>   set_mask_bits(&inode->i_flags,
> -   S_SYNC | S_APPEND | S_IMMUTABLE | S_NOATIME | S_DIRSYNC,
> +   S_SYNC | S_APPEND | S_IMMUTABLE | S_NOATIME | S_DIRSYNC | 
> S_DAX,
> new_fl);
>  }
>  
> diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
> index 645fc81e2a94..035263b61cf5 100644
> --- a/fs/btrfs/super.c
> +++ b/fs/btrfs/super.c
> @@ -326,6 +326,7 @@ enum {
>   Opt_treelog, Opt_notreelog,
>   Opt_usebackuproot,
>   Opt_user_subvol_rm_allowed,
> + Opt_dax,
>  
>   /* Deprecated options */
>   Opt_alloc_start,
> @@ -393,6 +394,7 @@ static const match_table_t tokens = {
>   {Opt_notreelog, "notreelog"},
>   {Opt_usebackuproot, "usebackuproot"},
>   {Opt_user_subvol_rm_allowed, "user_subvol_rm_allowed"},
> + {Opt_dax, "dax"},
>  
>   /* Deprecated options */
>   {Opt_alloc_start, "alloc_start=%s"},
> @@ -739,6 +741,17 @@ int btrfs_parse_options(struct btrfs_fs_info *info, char 
> *options,
>   case Opt_user_subvol_rm_allowed:
>   btrfs_set_opt(info->mount_opt, USER_SUBVOL_RM_ALLOWED);
>   break;
> +#ifdef CONFIG_FS_DAX
> + case Opt_dax:
> + if (btrfs_super_num_devices(info->super_copy) > 1) {
> + btrfs_info(info,
> +"dax not supported for multi-device 
> btrfs partition\n");

What prevents supporting dax for multiple devices so long as all devices
are dax?



> 


Re: [PATCH 01/10] btrfs: create a mount option for dax

2018-12-05 Thread Johannes Thumshirn
On 05/12/2018 13:28, Goldwyn Rodrigues wrote:
> From: Goldwyn Rodrigues 
> 
> Also, set the inode->i_flags to S_DAX
> 
> Signed-off-by: Goldwyn Rodrigues 
> ---
>  fs/btrfs/ctree.h |  1 +
>  fs/btrfs/ioctl.c |  5 -
>  fs/btrfs/super.c | 15 +++
>  3 files changed, 20 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index 68f322f600a0..5cc470fa6a40 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -1353,6 +1353,7 @@ static inline u32 BTRFS_MAX_XATTR_SIZE(const struct 
> btrfs_fs_info *info)
>  #define BTRFS_MOUNT_FREE_SPACE_TREE  (1 << 26)
>  #define BTRFS_MOUNT_NOLOGREPLAY  (1 << 27)
>  #define BTRFS_MOUNT_REF_VERIFY   (1 << 28)
> +#define BTRFS_MOUNT_DAX  (1 << 29)

Just as a heads up, this will collide with the patch called '[RFC PATCH
02/17] btrfs: add mount definition BTRFS_MOUNT_PRIORITY_USAGE' from Su Yue.

[...]

> +#ifdef CONFIG_FS_DAX
> + case Opt_dax:
> + if (btrfs_super_num_devices(info->super_copy) > 1) {
> + btrfs_info(info,
> +"dax not supported for multi-device 
> btrfs partition\n");
> + ret = -EOPNOTSUPP;
> + goto out;
> + }
> + btrfs_set_opt(info->mount_opt, DAX);
> + break;
> +#endif

Can you please explain why we can't enable DAX on a multi device FS in
the changelog? It's (for me at least) not obvious.

Thanks,
Johannes
-- 
Johannes Thumshirn                                  SUSE Labs Filesystems
jthumsh...@suse.de                                  +49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850


[PATCH 10/10] btrfs: dax mmap write

2018-12-05 Thread Goldwyn Rodrigues
From: Goldwyn Rodrigues 

Create a page-sized extent, copy the contents of the original
extent into the new one, and present it to user space as the page
to write.

Signed-off-by: Goldwyn Rodrigues 
---
 fs/btrfs/dax.c | 39 +++
 1 file changed, 39 insertions(+)

diff --git a/fs/btrfs/dax.c b/fs/btrfs/dax.c
index 6d68d39cc5da..4634917877f3 100644
--- a/fs/btrfs/dax.c
+++ b/fs/btrfs/dax.c
@@ -231,6 +231,45 @@ vm_fault_t btrfs_dax_fault(struct vm_fault *vmf)
sector >>= 9;
ret = copy_user_dax(em->bdev, dax_dev, sector, PAGE_SIZE, 
vmf->cow_page, vaddr);
goto out;
+   } else if (vmf->flags & FAULT_FLAG_WRITE) {
+   pfn_t pfn;
+   struct extent_map *orig = em;
+   void *daddr;
+   sector_t dstart;
+   size_t maplen;
+   struct extent_changeset *data_reserved = NULL;
+   struct extent_state *cached_state = NULL;
+
+   ret = btrfs_delalloc_reserve_space(inode, &data_reserved, pos, 
PAGE_SIZE);
+   if (ret < 0)
+   return ret;
+   refcount_inc(&em->refs);
+   lock_extent_bits(&BTRFS_I(inode)->io_tree, pos, pos + 
PAGE_SIZE, &cached_state);
+   /* Create an extent of page size */
+   ret = btrfs_get_extent_map_write(&em, NULL, inode, pos,
+   PAGE_SIZE);
+   if (ret < 0) {
+   free_extent_map(orig);
+   btrfs_delalloc_release_space(inode, data_reserved, pos,
+   PAGE_SIZE, true);
+   goto out;
+   }
+
+   dax_dev = fs_dax_get_by_bdev(em->bdev);
+   /* Calculate start address of destination extent */
+   dstart = (get_start_sect(em->bdev) << 9) + em->block_start;
+   maplen = dax_direct_access(dax_dev, PHYS_PFN(dstart),
+   1, &daddr, &pfn);
+
+   /* Copy the original contents into new destination */
+   copy_extent_page(orig, daddr, pos);
+   btrfs_update_ordered_extent(inode, pos, PAGE_SIZE, true);
+   dax_insert_entry(&xas, mapping, vmf, entry, pfn, 0, false);
+   ret = vmf_insert_mixed(vmf->vma, vaddr, pfn);
+   free_extent_map(orig);
+   unlock_extent_cached(&BTRFS_I(inode)->io_tree, pos, pos + 
PAGE_SIZE, &cached_state);
+   extent_changeset_free(data_reserved);
+   btrfs_delalloc_release_extents(BTRFS_I(inode), PAGE_SIZE, 
false);
} else {
sector_t sector;
if (em->block_start == EXTENT_MAP_HOLE) {
-- 
2.16.4



[PATCH 06/10] btrfs: dax write support

2018-12-05 Thread Goldwyn Rodrigues
From: Goldwyn Rodrigues 

This is a combination of direct and buffered I/O. The similarity
with direct I/O is that it needs to allocate space before
writing. The similarity with buffered I/O is that when the data is
not page-aligned, it needs to copy parts of the previous extents. In
order to accomplish that, keep references to the first and last
extent (if required) and then perform the allocations. If "pos"
or "end" is not aligned, copy the data from the first and last extent
respectively.

Signed-off-by: Goldwyn Rodrigues 
---
 fs/btrfs/ctree.h |   1 +
 fs/btrfs/dax.c   | 121 +++
 fs/btrfs/file.c  |   4 +-
 3 files changed, 125 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index a0d296b0d826..d91ff283a966 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -3693,6 +3693,7 @@ int btree_readahead_hook(struct extent_buffer *eb, int 
err);
 #ifdef CONFIG_FS_DAX
 /* dax.c */
 ssize_t btrfs_file_dax_read(struct kiocb *iocb, struct iov_iter *to);
+ssize_t btrfs_file_dax_write(struct kiocb *iocb, struct iov_iter *from);
 #endif /* CONFIG_FS_DAX */
 
 static inline int is_fstree(u64 rootid)
diff --git a/fs/btrfs/dax.c b/fs/btrfs/dax.c
index 5a297674adec..4000259a426c 100644
--- a/fs/btrfs/dax.c
+++ b/fs/btrfs/dax.c
@@ -2,6 +2,7 @@
 #include 
 #include "ctree.h"
 #include "btrfs_inode.h"
+#include "extent_io.h"
 
 static ssize_t em_dax_rw(struct inode *inode, struct extent_map *em, u64 pos,
u64 len, struct iov_iter *iter)
@@ -71,3 +72,123 @@ ssize_t btrfs_file_dax_read(struct kiocb *iocb, struct 
iov_iter *to)
 return done ? done : ret;
 }
 
+static int copy_extent_page(struct extent_map *em, void *daddr, u64 pos)
+{
+struct dax_device *dax_dev;
+   void *saddr;
+   sector_t start;
+   size_t len;
+
+   if (em->block_start == EXTENT_MAP_HOLE) {
+   memset(daddr, 0, PAGE_SIZE);
+   } else {
+   dax_dev = fs_dax_get_by_bdev(em->bdev);
+   start = (get_start_sect(em->bdev) << 9) + (em->block_start + 
(pos - em->start));
+   len = dax_direct_access(dax_dev, PHYS_PFN(start), 1, &saddr, 
NULL);
+   memcpy(daddr, saddr, PAGE_SIZE);
+   }
+   free_extent_map(em);
+
+   return 0;
+}
+
+
+ssize_t btrfs_file_dax_write(struct kiocb *iocb, struct iov_iter *from)
+{
+   ssize_t ret, done = 0, count = iov_iter_count(from);
+struct inode *inode = file_inode(iocb->ki_filp);
+   u64 pos = iocb->ki_pos;
+   u64 start = round_down(pos, PAGE_SIZE);
+   u64 end = round_up(pos + count, PAGE_SIZE);
+   struct extent_state *cached_state = NULL;
+   struct extent_changeset *data_reserved = NULL;
+   struct extent_map *first = NULL, *last = NULL;
+
+   ret = btrfs_delalloc_reserve_space(inode, &data_reserved, start, end - 
start);
+   if (ret < 0)
+   return ret;
+
+   /* Grab a reference of the first extent to copy data */
+   if (start < pos) {
+   first = btrfs_get_extent(BTRFS_I(inode), NULL, 0, start, end - 
start, 0);
+   if (IS_ERR(first)) {
+   ret = PTR_ERR(first);
+   goto out2;
+   }
+   }
+
+   /* Grab a reference of the last extent to copy data */
+   if (pos + count < end) {
+   last = btrfs_get_extent(BTRFS_I(inode), NULL, 0, end - 
PAGE_SIZE, PAGE_SIZE, 0);
+   if (IS_ERR(last)) {
+   ret = PTR_ERR(last);
+   goto out2;
+   }
+   }
+
+   lock_extent_bits(&BTRFS_I(inode)->io_tree, start, end, &cached_state);
+   while (done < count) {
+   struct extent_map *em;
+   struct dax_device *dax_dev;
+   int offset = pos & (PAGE_SIZE - 1);
+   u64 estart = round_down(pos, PAGE_SIZE);
+   u64 elen = end - estart;
+   size_t len = count - done;
+   sector_t dstart;
+   void *daddr;
+   ssize_t maplen;
+
+   /* Read the current extent */
+em = btrfs_get_extent(BTRFS_I(inode), NULL, 0, estart, elen, 
0);
+   if (IS_ERR(em)) {
+   ret = PTR_ERR(em);
+   goto out;
+   }
+
+   /* Get a new extent */
+   ret = btrfs_get_extent_map_write(&em, NULL, inode, estart, 
elen);
+   if (ret < 0)
+   goto out;
+
+   dax_dev = fs_dax_get_by_bdev(em->bdev);
+   /* Calculate start address start of destination extent */
+   dstart = (get_start_sect(em->bdev) << 9) + em->block_start;
+   maplen = dax_direct_access(dax_dev, PHYS_PFN(dstart),
+   PHYS_PFN(em->len), &daddr, NULL);
+
+   /* Copy fr

[PATCH 09/10] btrfs: dax support for cow_page/mmap_private and shared

2018-12-05 Thread Goldwyn Rodrigues
From: Goldwyn Rodrigues 

Signed-off-by: Goldwyn Rodrigues 
---
 fs/btrfs/dax.c | 25 -
 1 file changed, 20 insertions(+), 5 deletions(-)

diff --git a/fs/btrfs/dax.c b/fs/btrfs/dax.c
index 88017f8799d1..6d68d39cc5da 100644
--- a/fs/btrfs/dax.c
+++ b/fs/btrfs/dax.c
@@ -198,10 +198,13 @@ vm_fault_t btrfs_dax_fault(struct vm_fault *vmf)
pfn_t pfn;
struct address_space *mapping = vmf->vma->vm_file->f_mapping;
XA_STATE(xas, &mapping->i_pages, vmf->pgoff);
+   unsigned long vaddr = vmf->address;
struct inode *inode = mapping->host;
loff_t pos = (loff_t)vmf->pgoff << PAGE_SHIFT;
void *entry = NULL;
vm_fault_t ret = 0;
+   struct extent_map *em;
+   struct dax_device *dax_dev;
 
if (pos > i_size_read(inode)) {
ret = VM_FAULT_SIGBUS;
@@ -214,21 +217,33 @@ vm_fault_t btrfs_dax_fault(struct vm_fault *vmf)
goto out;
}
 
-   if (!vmf->cow_page) {
+em = btrfs_get_extent(BTRFS_I(inode), NULL, 0, pos, PAGE_SIZE, 0);
+   if (em->block_start != EXTENT_MAP_HOLE)
+   dax_dev = fs_dax_get_by_bdev(em->bdev);
+
+   if (vmf->cow_page) {
+   sector_t sector;
+   if (em->block_start == EXTENT_MAP_HOLE) {
+   clear_user_highpage(vmf->cow_page, vaddr);
+   goto out;
+   }
+   sector = (get_start_sect(em->bdev) << 9) + (em->block_start + 
(pos - em->start));
+   sector >>= 9;
+   ret = copy_user_dax(em->bdev, dax_dev, sector, PAGE_SIZE, 
vmf->cow_page, vaddr);
+   goto out;
+   } else {
sector_t sector;
-   struct extent_map *em;
-em = btrfs_get_extent(BTRFS_I(inode), NULL, 0, pos, PAGE_SIZE, 
0);
if (em->block_start == EXTENT_MAP_HOLE) {
ret = dax_load_hole(&xas, mapping, entry, vmf);
goto out;
}
sector = ((get_start_sect(em->bdev) << 9) +
  (em->block_start + (pos - em->start))) >> 9;
-   ret = dax_pfn(fs_dax_get_by_bdev(em->bdev), em->bdev, sector, 
PAGE_SIZE, &pfn);
+   ret = dax_pfn(dax_dev, em->bdev, sector, PAGE_SIZE, &pfn);
if (ret)
goto out;
dax_insert_entry(&xas, mapping, vmf, entry, pfn, 0, false);
-   ret = vmf_insert_mixed(vmf->vma, vmf->address, pfn);
+   ret = vmf_insert_mixed(vmf->vma, vaddr, pfn);
}
 out:
if (entry)
-- 
2.16.4



[PATCH 07/10] dax: export functions for use with btrfs

2018-12-05 Thread Goldwyn Rodrigues
From: Goldwyn Rodrigues 

These functions are required for btrfs dax support.

Signed-off-by: Goldwyn Rodrigues 
---
 fs/dax.c| 35 ---
 include/linux/dax.h | 16 
 2 files changed, 40 insertions(+), 11 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 9bcce89ea18e..4578640af631 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -244,7 +244,7 @@ static void put_unlocked_entry(struct xa_state *xas, void 
*entry)
  * dropped the xa_lock, so we know the xa_state is stale and must be reset
  * before use.
  */
-static void dax_unlock_entry(struct xa_state *xas, void *entry)
+void dax_unlock_entry(struct xa_state *xas, void *entry)
 {
void *old;
 
@@ -256,6 +256,7 @@ static void dax_unlock_entry(struct xa_state *xas, void 
*entry)
BUG_ON(!dax_is_locked(old));
dax_wake_entry(xas, entry, false);
 }
+EXPORT_SYMBOL(dax_unlock_entry);
 
 /*
  * Return: The entry stored at this location before it was locked.
@@ -448,7 +449,7 @@ void dax_unlock_mapping_entry(struct page *page)
  * a VM_FAULT code, encoded as an xarray internal entry.  The ERR_PTR values
  * overlap with xarray value entries.
  */
-static void *grab_mapping_entry(struct xa_state *xas,
+void *grab_mapping_entry(struct xa_state *xas,
struct address_space *mapping, unsigned long size_flag)
 {
unsigned long index = xas->xa_index;
@@ -531,6 +532,7 @@ static void *grab_mapping_entry(struct xa_state *xas,
xas_unlock_irq(xas);
return xa_mk_internal(VM_FAULT_FALLBACK);
 }
+EXPORT_SYMBOL(grab_mapping_entry);
 
 /**
  * dax_layout_busy_page - find first pinned page in @mapping
@@ -654,7 +656,7 @@ int dax_invalidate_mapping_entry_sync(struct address_space 
*mapping,
return __dax_invalidate_entry(mapping, index, false);
 }
 
-static int copy_user_dax(struct block_device *bdev, struct dax_device *dax_dev,
+int copy_user_dax(struct block_device *bdev, struct dax_device *dax_dev,
sector_t sector, size_t size, struct page *to,
unsigned long vaddr)
 {
@@ -679,6 +681,7 @@ static int copy_user_dax(struct block_device *bdev, struct 
dax_device *dax_dev,
dax_read_unlock(id);
return 0;
 }
+EXPORT_SYMBOL(copy_user_dax);
 
 /*
  * By this point grab_mapping_entry() has ensured that we have a locked entry
@@ -687,7 +690,7 @@ static int copy_user_dax(struct block_device *bdev, struct 
dax_device *dax_dev,
  * already in the tree, we will skip the insertion and just dirty the PMD as
  * appropriate.
  */
-static void *dax_insert_entry(struct xa_state *xas,
+void *dax_insert_entry(struct xa_state *xas,
struct address_space *mapping, struct vm_fault *vmf,
void *entry, pfn_t pfn, unsigned long flags, bool dirty)
 {
@@ -736,6 +739,7 @@ static void *dax_insert_entry(struct xa_state *xas,
xas_unlock_irq(xas);
return entry;
 }
+EXPORT_SYMBOL(dax_insert_entry);
 
 static inline
 unsigned long pgoff_address(pgoff_t pgoff, struct vm_area_struct *vma)
@@ -962,19 +966,18 @@ static sector_t dax_iomap_sector(struct iomap *iomap, 
loff_t pos)
return (iomap->addr + (pos & PAGE_MASK) - iomap->offset) >> 9;
 }
 
-static int dax_iomap_pfn(struct iomap *iomap, loff_t pos, size_t size,
-pfn_t *pfnp)
+int dax_pfn(struct dax_device *dax_dev, struct block_device *bdev,
+   const sector_t sector, size_t size, pfn_t *pfnp)
 {
-   const sector_t sector = dax_iomap_sector(iomap, pos);
pgoff_t pgoff;
int id, rc;
long length;
 
-   rc = bdev_dax_pgoff(iomap->bdev, sector, size, &pgoff);
+   rc = bdev_dax_pgoff(bdev, sector, size, &pgoff);
if (rc)
return rc;
id = dax_read_lock();
-   length = dax_direct_access(iomap->dax_dev, pgoff, PHYS_PFN(size),
+   length = dax_direct_access(dax_dev, pgoff, PHYS_PFN(size),
   NULL, pfnp);
if (length < 0) {
rc = length;
@@ -993,6 +996,14 @@ static int dax_iomap_pfn(struct iomap *iomap, loff_t pos, 
size_t size,
dax_read_unlock(id);
return rc;
 }
+EXPORT_SYMBOL(dax_pfn);
+
+static int dax_iomap_pfn(struct iomap *iomap, loff_t pos, size_t size,
+pfn_t *pfnp)
+{
+   const sector_t sector = dax_iomap_sector(iomap, pos);
+   return dax_pfn(iomap->dax_dev, iomap->bdev, sector, size, pfnp);
+}
 
 /*
  * The user has performed a load from a hole in the file.  Allocating a new
@@ -1001,7 +1012,7 @@ static int dax_iomap_pfn(struct iomap *iomap, loff_t pos, 
size_t size,
  * If this page is ever written to we will re-fault and change the mapping to
  * point to real DAX storage instead.
  */
-static vm_fault_t dax_load_hole(struct xa_state *xas,
+vm_fault_t dax_load_hole(struct xa_state *xas,
struct address_space *mapping, void **entry,
struct vm_fault *vmf)
 {
@@ -1017,6 +1028,7 @@ static vm_fault_t 

[PATCH 03/10] btrfs: dax: read zeros from holes

2018-12-05 Thread Goldwyn Rodrigues
From: Goldwyn Rodrigues 

Signed-off-by: Goldwyn Rodrigues 
---
 fs/btrfs/dax.c | 7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/dax.c b/fs/btrfs/dax.c
index d614bf73bf8e..5a297674adec 100644
--- a/fs/btrfs/dax.c
+++ b/fs/btrfs/dax.c
@@ -54,7 +54,12 @@ ssize_t btrfs_file_dax_read(struct kiocb *iocb, struct 
iov_iter *to)
 
 BUG_ON(em->flags & EXTENT_FLAG_FS_MAPPING);
 
-ret = em_dax_rw(inode, em, pos, len, to);
+   if (em->block_start == EXTENT_MAP_HOLE) {
+   u64 zero_len = min(em->len - (em->start - pos), len);
+   ret = iov_iter_zero(zero_len, to);
+   } else {
+   ret = em_dax_rw(inode, em, pos, len, to);
+   }
 if (ret < 0)
 goto out;
 pos += ret;
-- 
2.16.4



[PATCH 00/10] btrfs: Support for DAX devices

2018-12-05 Thread Goldwyn Rodrigues
This is a support for DAX in btrfs. I understand there have been
previous attempts at it. However, I wanted to make sure copy-on-write
(COW) works on dax as well.

Before I present this to the FS folks I wanted to run this through the
btrfs list. Much as I wish I could, I cannot get it correct the first time
around :/. Here are some questions for which I need suggestions:

Questions:
1. I have been unable to do checksumming for DAX devices. While
checksumming can be done for reads and writes, it is a problem when mmap
is involved because btrfs kernel module does not get back control after
an mmap() writes. Any ideas are appreciated, or we would have to set
nodatasum when dax is enabled.

2. Currently, a user can continue writing on "old" extents of an mmaped file
after a snapshot has been created. How can we enforce writes to be directed
to new extents after snapshots have been created? Do we keep a list of
all mmap()s, and re-mmap them after a snapshot?

Tested by creating a pmem device in RAM with "memmap=2G!4G" kernel
command line parameter.


[PATCH 01/10] btrfs: create a mount option for dax
[PATCH 02/10] btrfs: basic dax read
[PATCH 03/10] btrfs: dax: read zeros from holes
[PATCH 04/10] Rename __endio_write_update_ordered() to
[PATCH 05/10] btrfs: Carve out btrfs_get_extent_map_write() out of
[PATCH 06/10] btrfs: dax write support
[PATCH 07/10] dax: export functions for use with btrfs
[PATCH 08/10] btrfs: dax add read mmap path
[PATCH 09/10] btrfs: dax support for cow_page/mmap_private and shared
[PATCH 10/10] btrfs: dax mmap write

 fs/btrfs/Makefile   |1 
 fs/btrfs/ctree.h    |   17 ++
 fs/btrfs/dax.c  |  303 ++--
 fs/btrfs/file.c |   29 
 fs/btrfs/inode.c|   54 +
 fs/btrfs/ioctl.c    |5 
 fs/btrfs/super.c|   15 ++
 fs/dax.c|   35 --
 include/linux/dax.h |   16 ++
 9 files changed, 430 insertions(+), 45 deletions(-)


-- 
Goldwyn



[PATCH 01/10] btrfs: create a mount option for dax

2018-12-05 Thread Goldwyn Rodrigues
From: Goldwyn Rodrigues 

Also, set the inode->i_flags to S_DAX

Signed-off-by: Goldwyn Rodrigues 
---
 fs/btrfs/ctree.h |  1 +
 fs/btrfs/ioctl.c |  5 -
 fs/btrfs/super.c | 15 +++
 3 files changed, 20 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 68f322f600a0..5cc470fa6a40 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1353,6 +1353,7 @@ static inline u32 BTRFS_MAX_XATTR_SIZE(const struct 
btrfs_fs_info *info)
 #define BTRFS_MOUNT_FREE_SPACE_TREE(1 << 26)
 #define BTRFS_MOUNT_NOLOGREPLAY(1 << 27)
 #define BTRFS_MOUNT_REF_VERIFY (1 << 28)
+#define BTRFS_MOUNT_DAX(1 << 29)
 
 #define BTRFS_DEFAULT_COMMIT_INTERVAL  (30)
 #define BTRFS_DEFAULT_MAX_INLINE   (2048)
diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 802a628e9f7d..e9146c157816 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -149,8 +149,11 @@ void btrfs_sync_inode_flags_to_i_flags(struct inode *inode)
if (binode->flags & BTRFS_INODE_DIRSYNC)
new_fl |= S_DIRSYNC;
 
+   if ((btrfs_test_opt(btrfs_sb(inode->i_sb), DAX)) && 
S_ISREG(inode->i_mode))
+   new_fl |= S_DAX;
+
set_mask_bits(&inode->i_flags,
- S_SYNC | S_APPEND | S_IMMUTABLE | S_NOATIME | S_DIRSYNC,
+ S_SYNC | S_APPEND | S_IMMUTABLE | S_NOATIME | S_DIRSYNC | 
S_DAX,
      new_fl);
 }
 
diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index 645fc81e2a94..035263b61cf5 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -326,6 +326,7 @@ enum {
Opt_treelog, Opt_notreelog,
Opt_usebackuproot,
Opt_user_subvol_rm_allowed,
+   Opt_dax,
 
/* Deprecated options */
Opt_alloc_start,
@@ -393,6 +394,7 @@ static const match_table_t tokens = {
{Opt_notreelog, "notreelog"},
{Opt_usebackuproot, "usebackuproot"},
{Opt_user_subvol_rm_allowed, "user_subvol_rm_allowed"},
+   {Opt_dax, "dax"},
 
/* Deprecated options */
{Opt_alloc_start, "alloc_start=%s"},
@@ -739,6 +741,17 @@ int btrfs_parse_options(struct btrfs_fs_info *info, char 
*options,
case Opt_user_subvol_rm_allowed:
btrfs_set_opt(info->mount_opt, USER_SUBVOL_RM_ALLOWED);
break;
+#ifdef CONFIG_FS_DAX
+   case Opt_dax:
+   if (btrfs_super_num_devices(info->super_copy) > 1) {
+   btrfs_info(info,
+  "dax not supported for multi-device 
btrfs partition\n");
+   ret = -EOPNOTSUPP;
+   goto out;
+   }
+   btrfs_set_opt(info->mount_opt, DAX);
+   break;
+#endif
case Opt_enospc_debug:
btrfs_set_opt(info->mount_opt, ENOSPC_DEBUG);
break;
@@ -1329,6 +1342,8 @@ static int btrfs_show_options(struct seq_file *seq, 
struct dentry *dentry)
seq_puts(seq, ",clear_cache");
if (btrfs_test_opt(info, USER_SUBVOL_RM_ALLOWED))
seq_puts(seq, ",user_subvol_rm_allowed");
+   if (btrfs_test_opt(info, DAX))
+   seq_puts(seq, ",dax");
if (btrfs_test_opt(info, ENOSPC_DEBUG))
seq_puts(seq, ",enospc_debug");
if (btrfs_test_opt(info, AUTO_DEFRAG))
-- 
2.16.4



[PATCH 02/10] btrfs: basic dax read

2018-12-05 Thread Goldwyn Rodrigues
From: Goldwyn Rodrigues 

Signed-off-by: Goldwyn Rodrigues 
---
 fs/btrfs/Makefile |  1 +
 fs/btrfs/ctree.h  |  5 
 fs/btrfs/dax.c| 68 +++
 fs/btrfs/file.c   | 13 ++-
 4 files changed, 86 insertions(+), 1 deletion(-)
 create mode 100644 fs/btrfs/dax.c

diff --git a/fs/btrfs/Makefile b/fs/btrfs/Makefile
index ca693dd554e9..1fa77b875ae9 100644
--- a/fs/btrfs/Makefile
+++ b/fs/btrfs/Makefile
@@ -12,6 +12,7 @@ btrfs-y += super.o ctree.o extent-tree.o print-tree.o 
root-tree.o dir-item.o \
   reada.o backref.o ulist.o qgroup.o send.o dev-replace.o raid56.o \
   uuid-tree.o props.o free-space-tree.o tree-checker.o
 
+btrfs-$(CONFIG_FS_DAX) += dax.o
 btrfs-$(CONFIG_BTRFS_FS_POSIX_ACL) += acl.o
 btrfs-$(CONFIG_BTRFS_FS_CHECK_INTEGRITY) += check-integrity.o
 btrfs-$(CONFIG_BTRFS_FS_REF_VERIFY) += ref-verify.o
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 5cc470fa6a40..038d64ecebe5 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -3685,6 +3685,11 @@ int btrfs_reada_wait(void *handle);
 void btrfs_reada_detach(void *handle);
 int btree_readahead_hook(struct extent_buffer *eb, int err);
 
+#ifdef CONFIG_FS_DAX
+/* dax.c */
+ssize_t btrfs_file_dax_read(struct kiocb *iocb, struct iov_iter *to);
+#endif /* CONFIG_FS_DAX */
+
 static inline int is_fstree(u64 rootid)
 {
if (rootid == BTRFS_FS_TREE_OBJECTID ||
diff --git a/fs/btrfs/dax.c b/fs/btrfs/dax.c
new file mode 100644
index ..d614bf73bf8e
--- /dev/null
+++ b/fs/btrfs/dax.c
@@ -0,0 +1,68 @@
+#include 
+#include 
+#include "ctree.h"
+#include "btrfs_inode.h"
+
+static ssize_t em_dax_rw(struct inode *inode, struct extent_map *em, u64 pos,
+   u64 len, struct iov_iter *iter)
+{
+struct dax_device *dax_dev = fs_dax_get_by_bdev(em->bdev);
+ssize_t map_len;
+pgoff_t blk_pg;
+void *kaddr;
+sector_t blk_start;
+unsigned offset = pos & (PAGE_SIZE - 1);
+
+len = min(len + offset, em->len - (pos - em->start));
+len = ALIGN(len, PAGE_SIZE);
+blk_start = (get_start_sect(em->bdev) << 9) + (em->block_start + (pos 
- em->start));
+blk_pg = blk_start - offset;
+map_len = dax_direct_access(dax_dev, PHYS_PFN(blk_pg), PHYS_PFN(len), 
&kaddr, NULL);
+map_len = PFN_PHYS(map_len);
+kaddr += offset;
+map_len -= offset;
+if (map_len > len)
+map_len = len;
+if (iov_iter_rw(iter) == WRITE)
+return dax_copy_from_iter(dax_dev, blk_pg, kaddr, map_len, 
iter);
+else
+return dax_copy_to_iter(dax_dev, blk_pg, kaddr, map_len, iter);
+}
+
+ssize_t btrfs_file_dax_read(struct kiocb *iocb, struct iov_iter *to)
+{
+size_t ret = 0, done = 0, count = iov_iter_count(to);
+struct extent_map *em;
+u64 pos = iocb->ki_pos;
+u64 end = pos + count;
+struct inode *inode = file_inode(iocb->ki_filp);
+
+if (!count)
+return 0;
+
+end = i_size_read(inode) < end ? i_size_read(inode) : end;
+
+while (pos < end) {
+u64 len = end - pos;
+
+em = btrfs_get_extent(BTRFS_I(inode), NULL, 0, pos, len, 0);
+if (IS_ERR(em)) {
+if (!ret)
+ret = PTR_ERR(em);
+goto out;
+}
+
+BUG_ON(em->flags & EXTENT_FLAG_FS_MAPPING);
+
+ret = em_dax_rw(inode, em, pos, len, to);
+if (ret < 0)
+goto out;
+pos += ret;
+done += ret;
+}
+
+out:
+iocb->ki_pos += done;
+    return done ? done : ret;
+}
+
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 58e93bce3036..ef6ed93f44d1 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -3308,9 +3308,20 @@ static int btrfs_file_open(struct inode *inode, struct 
file *filp)
return generic_file_open(inode, filp);
 }
 
+static ssize_t btrfs_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
+{
+   struct inode *inode = file_inode(iocb->ki_filp);
+
+#ifdef CONFIG_FS_DAX
+   if (IS_DAX(inode))
+   return btrfs_file_dax_read(iocb, to);
+#endif
+   return generic_file_read_iter(iocb, to);
+}
+
 const struct file_operations btrfs_file_operations = {
.llseek = btrfs_file_llseek,
-   .read_iter  = generic_file_read_iter,
+   .read_iter  = btrfs_file_read_iter,
.splice_read= generic_file_splice_read,
.write_iter = btrfs_file_write_iter,
.mmap   = btrfs_file_mmap,
-- 
2.16.4



[PATCH 08/10] btrfs: dax add read mmap path

2018-12-05 Thread Goldwyn Rodrigues
From: Goldwyn Rodrigues 

Signed-off-by: Goldwyn Rodrigues 
---
 fs/btrfs/ctree.h |  1 +
 fs/btrfs/dax.c   | 43 +++
 fs/btrfs/file.c  | 12 +++-
 3 files changed, 55 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index d91ff283a966..33648121ca52 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -3694,6 +3694,7 @@ int btree_readahead_hook(struct extent_buffer *eb, int 
err);
 /* dax.c */
 ssize_t btrfs_file_dax_read(struct kiocb *iocb, struct iov_iter *to);
 ssize_t btrfs_file_dax_write(struct kiocb *iocb, struct iov_iter *from);
+vm_fault_t btrfs_dax_fault(struct vm_fault *vmf);
 #endif /* CONFIG_FS_DAX */
 
 static inline int is_fstree(u64 rootid)
diff --git a/fs/btrfs/dax.c b/fs/btrfs/dax.c
index 4000259a426c..88017f8799d1 100644
--- a/fs/btrfs/dax.c
+++ b/fs/btrfs/dax.c
@@ -190,5 +190,48 @@ ssize_t btrfs_file_dax_write(struct kiocb *iocb, struct 
iov_iter *from)
count - done, true);
extent_changeset_free(data_reserved);
 return done ? done : ret;
+}
+
+/* As copied from dax_iomap_pte_fault() */
+vm_fault_t btrfs_dax_fault(struct vm_fault *vmf)
+{
+   pfn_t pfn;
+   struct address_space *mapping = vmf->vma->vm_file->f_mapping;
+   XA_STATE(xas, &mapping->i_pages, vmf->pgoff);
+   struct inode *inode = mapping->host;
+   loff_t pos = (loff_t)vmf->pgoff << PAGE_SHIFT;
+   void *entry = NULL;
+   vm_fault_t ret = 0;
+
+   if (pos > i_size_read(inode)) {
+   ret = VM_FAULT_SIGBUS;
+   goto out;
+   }
 
+   entry = grab_mapping_entry(&xas, mapping, 0);
+   if (IS_ERR(entry)) {
+   ret = dax_fault_return(PTR_ERR(entry));
+   goto out;
+   }
+
+   if (!vmf->cow_page) {
+   sector_t sector;
+   struct extent_map *em;
+em = btrfs_get_extent(BTRFS_I(inode), NULL, 0, pos, PAGE_SIZE, 
0);
+   if (em->block_start == EXTENT_MAP_HOLE) {
+   ret = dax_load_hole(&xas, mapping, entry, vmf);
+   goto out;
+   }
+   sector = ((get_start_sect(em->bdev) << 9) +
+ (em->block_start + (pos - em->start))) >> 9;
+   ret = dax_pfn(fs_dax_get_by_bdev(em->bdev), em->bdev, sector, 
PAGE_SIZE, &pfn);
+   if (ret)
+   goto out;
+   dax_insert_entry(&xas, mapping, vmf, entry, pfn, 0, false);
+   ret = vmf_insert_mixed(vmf->vma, vmf->address, pfn);
+   }
+out:
+   if (entry)
+   dax_unlock_entry(&xas, entry);
+   return ret;
 }
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 29a3b12e6660..38b494686fb2 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -2227,8 +2227,18 @@ int btrfs_sync_file(struct file *file, loff_t start, 
loff_t end, int datasync)
return ret > 0 ? -EIO : ret;
 }
 
+static vm_fault_t btrfs_fault(struct vm_fault *vmf)
+{
+   struct inode *inode = vmf->vma->vm_file->f_mapping->host;
+#ifdef CONFIG_FS_DAX
+   if (IS_DAX(inode))
+   return btrfs_dax_fault(vmf);
+#endif
+   return filemap_fault(vmf);
+}
+
 static const struct vm_operations_struct btrfs_file_vm_ops = {
-   .fault  = filemap_fault,
+   .fault  = btrfs_fault,
.map_pages  = filemap_map_pages,
.page_mkwrite   = btrfs_page_mkwrite,
 };
-- 
2.16.4



[PATCH 05/10] btrfs: Carve out btrfs_get_extent_map_write() out of btrfs_get_blocks_write()

2018-12-05 Thread Goldwyn Rodrigues
From: Goldwyn Rodrigues 

This makes btrfs_get_extent_map_write() independent of Direct
I/O code.

Signed-off-by: Goldwyn Rodrigues 
---
 fs/btrfs/ctree.h |  2 ++
 fs/btrfs/inode.c | 40 +++-
 2 files changed, 29 insertions(+), 13 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 5144d28216b0..a0d296b0d826 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -3169,6 +3169,8 @@ struct inode *btrfs_iget_path(struct super_block *s, 
struct btrfs_key *location,
  struct btrfs_path *path);
 struct inode *btrfs_iget(struct super_block *s, struct btrfs_key *location,
 struct btrfs_root *root, int *was_new);
+int btrfs_get_extent_map_write(struct extent_map **map, struct buffer_head *bh,
+   struct inode *inode, u64 start, u64 len);
 struct extent_map *btrfs_get_extent(struct btrfs_inode *inode,
struct page *page, size_t pg_offset,
u64 start, u64 end, int create);
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 96e9fe9e4150..4671cd9165c1 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -7485,11 +7485,10 @@ static int btrfs_get_blocks_direct_read(struct 
extent_map *em,
return 0;
 }
 
-static int btrfs_get_blocks_direct_write(struct extent_map **map,
-struct buffer_head *bh_result,
-struct inode *inode,
-struct btrfs_dio_data *dio_data,
-u64 start, u64 len)
+int btrfs_get_extent_map_write(struct extent_map **map,
+   struct buffer_head *bh,
+   struct inode *inode,
+   u64 start, u64 len)
 {
struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
struct extent_map *em = *map;
@@ -7543,22 +7542,38 @@ static int btrfs_get_blocks_direct_write(struct 
extent_map **map,
 */
btrfs_free_reserved_data_space_noquota(inode, start,
   len);
-   goto skip_cow;
+   /* skip COW */
+   goto out;
}
}
 
/* this will cow the extent */
-   len = bh_result->b_size;
+   if (bh)
+   len = bh->b_size;
free_extent_map(em);
*map = em = btrfs_new_extent_direct(inode, start, len);
-   if (IS_ERR(em)) {
-   ret = PTR_ERR(em);
-   goto out;
-   }
+   if (IS_ERR(em))
+   return PTR_ERR(em);
+out:
+   return ret;
+}
 
+static int btrfs_get_blocks_direct_write(struct extent_map **map,
+struct buffer_head *bh_result,
+struct inode *inode,
+struct btrfs_dio_data *dio_data,
+u64 start, u64 len)
+{
+   int ret = 0;
+   struct extent_map *em;
+
+   ret = btrfs_get_extent_map_write(map, bh_result, inode,
+   start, len);
+   if (ret < 0)
+   return ret;
+   em = *map;
len = min(len, em->len - (start - em->start));
 
-skip_cow:
bh_result->b_blocknr = (em->block_start + (start - em->start)) >>
inode->i_blkbits;
bh_result->b_size = len;
@@ -7579,7 +7594,6 @@ static int btrfs_get_blocks_direct_write(struct 
extent_map **map,
dio_data->reserve -= len;
dio_data->unsubmitted_oe_range_end = start + len;
current->journal_info = dio_data;
-out:
return ret;
 }
 
-- 
2.16.4



[PATCH v2] btrfs: Remove unnecessary code from __btrfs_rebalance

2018-12-05 Thread Nikolay Borisov
The first step of the rebalance process is to ensure there is 1MB free on
each device. This number seems rather small. And in fact, when talking
to the original authors, their opinions were:

"man that's a little bonkers"
"i don't think we even need that code anymore"
"I think it was there to make sure we had room for the blank 1M at the
beginning. I bet it goes all the way back to v0"
"we just don't need any of that tho, i say we just delete it"

Clearly, this piece of code has lost its original intent throughout
the years. It doesn't really bring any real practical benefits to the
relocation process. Additionally, this patch makes the balance process
more lightweight by removing a pair of shrink/grow operations which
are rather expensive for heavily populated filesystems. This is mainly due to 
shrink requiring relocating block groups, involving heavy use of the btree.

Signed-off-by: Nikolay Borisov 
Suggested-by: Josef Bacik 
Reviewed-by: Josef Bacik 
---
Changes since v1: 
 * Improved changelog by adding information about reduced runtimes and 
explaining
 where they would come from.

 I did measurements of btrfs balance with and without the patch with 
 funclatency from bcc tools but didn't observe large differences, but this was 
 on a lightly populated filesystem. 

 fs/btrfs/volumes.c | 53 --
 1 file changed, 53 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index d49baad64fe6..19cc31de1e84 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -3699,17 +3699,11 @@ static int __btrfs_balance(struct btrfs_fs_info 
*fs_info)
 {
struct btrfs_balance_control *bctl = fs_info->balance_ctl;
struct btrfs_root *chunk_root = fs_info->chunk_root;
-   struct btrfs_root *dev_root = fs_info->dev_root;
-   struct list_head *devices;
-   struct btrfs_device *device;
-   u64 old_size;
-   u64 size_to_free;
u64 chunk_type;
struct btrfs_chunk *chunk;
struct btrfs_path *path = NULL;
struct btrfs_key key;
struct btrfs_key found_key;
-   struct btrfs_trans_handle *trans;
struct extent_buffer *leaf;
int slot;
int ret;
@@ -3724,53 +3718,6 @@ static int __btrfs_balance(struct btrfs_fs_info *fs_info)
u32 count_sys = 0;
int chunk_reserved = 0;
 
-   /* step one make some room on all the devices */
-   devices = &fs_info->fs_devices->devices;
-   list_for_each_entry(device, devices, dev_list) {
-   old_size = btrfs_device_get_total_bytes(device);
-   size_to_free = div_factor(old_size, 1);
-   size_to_free = min_t(u64, size_to_free, SZ_1M);
-   if (!test_bit(BTRFS_DEV_STATE_WRITEABLE, &device->dev_state) ||
-   btrfs_device_get_total_bytes(device) -
-   btrfs_device_get_bytes_used(device) > size_to_free ||
-   test_bit(BTRFS_DEV_STATE_REPLACE_TGT, &device->dev_state))
-   continue;
-
-   ret = btrfs_shrink_device(device, old_size - size_to_free);
-   if (ret == -ENOSPC)
-   break;
-   if (ret) {
-   /* btrfs_shrink_device never returns ret > 0 */
-   WARN_ON(ret > 0);
-   goto error;
-   }
-
-   trans = btrfs_start_transaction(dev_root, 0);
-   if (IS_ERR(trans)) {
-   ret = PTR_ERR(trans);
-   btrfs_info_in_rcu(fs_info,
-"resize: unable to start transaction after shrinking device %s 
(error %d), old size %llu, new size %llu",
- rcu_str_deref(device->name), ret,
- old_size, old_size - size_to_free);
-   goto error;
-   }
-
-   ret = btrfs_grow_device(trans, device, old_size);
-   if (ret) {
-   btrfs_end_transaction(trans);
-   /* btrfs_grow_device never returns ret > 0 */
-   WARN_ON(ret > 0);
-   btrfs_info_in_rcu(fs_info,
-"resize: unable to grow device after shrinking device %s 
(error %d), old size %llu, new size %llu",
- rcu_str_deref(device->name), ret,
- old_size, old_size - size_to_free);
-   goto error;
-   }
-
-   btrfs_end_transaction(trans);
-   }
-
-   /* step two, relocate all the chunks */
path = btrfs_alloc_path();
if (!path) {
ret = -ENOMEM;
-- 
2.17.1



Re: BTRFS Mount Delay Time Graph

2018-12-04 Thread Nikolay Borisov



On 4.12.18 г. 22:14 ч., Wilson, Ellis wrote:
> On 12/4/18 8:07 AM, Nikolay Borisov wrote:
>> On 3.12.18 г. 20:20 ч., Wilson, Ellis wrote:
>>> With 14TB drives available today, it doesn't take more than a handful of
>>> drives to result in a filesystem that takes around a minute to mount.
>>> As a result of this, I suspect this will become an increasing problem
>>> for serious users of BTRFS as time goes on.  I'm not complaining as I'm
>>> not a contributor so I have no room to do so -- just shedding some light
>>> on a problem that may deserve attention as filesystem sizes continue to
>>> grow.
>> Would it be possible to provide perf traces of the longer-running mount
>> time? Everyone seems to be fixated on reading block groups (which is
>> likely to be the culprit) but before pointing finger I'd like concrete
>> evidence pointed at the offender.
> 
> I am glad to collect such traces -- please advise with commands that 
> would achieve that.  If you just mean block traces, I can do that, but I 
> suspect you mean something more BTRFS-specific.

A command that would be good is :

perf record --all-kernel -g mount /dev/vdc /media/scratch/

of course replace device/mount path appropriately. This will result in a
perf.data file which contains stacktraces of the hottest paths executed
during invocation of mount. If you could send this file to the mailing
list or upload it somwhere for interested people (me and perhaps) Qu to
inspect would be appreciated.

If the file turned out way too big you can use

perf report --stdio to create a text output and you could send that as
well.

> 
> Best,
> 
> ellis
> 


[PATCH v2 11/13] btrfs-progs: Introduce rescue.h to resolve missing-prototypes for chunk and super rescue

2018-12-04 Thread Qu Wenruo
We don't have any header declaring btrfs_recover_chunk_tree() nor
btrfs_recover_superblocks(), thus W=1 gives missing-prototypes warning
on them.

Fix it by introducing a new header, rescue.h, for these two functions, so
that make W=1 is much happier.

Signed-off-by: Qu Wenruo 
Reviewed-by: Nikolay Borisov 
---
 chunk-recover.c |  1 +
 cmds-rescue.c   |  4 +---
 rescue.h| 21 +
 super-recover.c |  1 +
 4 files changed, 24 insertions(+), 3 deletions(-)
 create mode 100644 rescue.h

diff --git a/chunk-recover.c b/chunk-recover.c
index 1d30db51d8ed..1e554b8e8750 100644
--- a/chunk-recover.c
+++ b/chunk-recover.c
@@ -40,6 +40,7 @@
 #include "utils.h"
 #include "btrfsck.h"
 #include "commands.h"
+#include "rescue.h"
 
 struct recover_control {
int verbose;
diff --git a/cmds-rescue.c b/cmds-rescue.c
index 2bc50c0841ed..36e9e1277e40 100644
--- a/cmds-rescue.c
+++ b/cmds-rescue.c
@@ -26,15 +26,13 @@
 #include "commands.h"
 #include "utils.h"
 #include "help.h"
+#include "rescue.h"
 
 static const char * const rescue_cmd_group_usage[] = {
"btrfs rescue  [options] ",
NULL
 };
 
-int btrfs_recover_chunk_tree(const char *path, int verbose, int yes);
-int btrfs_recover_superblocks(const char *path, int verbose, int yes);
-
 static const char * const cmd_rescue_chunk_recover_usage[] = {
"btrfs rescue chunk-recover [options] ",
"Recover the chunk tree by scanning the devices one by one.",
diff --git a/rescue.h b/rescue.h
new file mode 100644
index ..de486e2e2004
--- /dev/null
+++ b/rescue.h
@@ -0,0 +1,21 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Copyright (C) 2018 SUSE.  All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public
+ * License v2 as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ */
+
+#ifndef __BTRFS_RESCUE_H__
+#define __BTRFS_RESCUE_H__
+
+int btrfs_recover_superblocks(const char *path, int verbose, int yes);
+int btrfs_recover_chunk_tree(const char *path, int verbose, int yes);
+
+#endif
diff --git a/super-recover.c b/super-recover.c
index 86b3df9867dc..a1af71786034 100644
--- a/super-recover.c
+++ b/super-recover.c
@@ -34,6 +34,7 @@
 #include "crc32c.h"
 #include "volumes.h"
 #include "commands.h"
+#include "rescue.h"
 
 struct btrfs_recover_superblock {
struct btrfs_fs_devices *fs_devices;
-- 
2.19.2



[PATCH v2 03/13] btrfs-progs: Makefile.extrawarn: Don't warn on sign compare

2018-12-04 Thread Qu Wenruo
In most cases we are just using 'int' for 'unsigned int' and don't
care about the sign.

The Wsign-compare warning is causing tons of false alerts.
Suppressing it makes W=1 less noisy so we can focus on real
problems, while still allowing it in the W=3 build.

Signed-off-by: Qu Wenruo 
---
 Makefile.extrawarn | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/Makefile.extrawarn b/Makefile.extrawarn
index 0c11f2450802..9b4cace01ce4 100644
--- a/Makefile.extrawarn
+++ b/Makefile.extrawarn
@@ -54,6 +54,7 @@ warning-1 += $(call cc-option, -Wmissing-include-dirs)
 warning-1 += $(call cc-option, -Wunused-but-set-variable)
 warning-1 += $(call cc-disable-warning, missing-field-initializers)
 warning-1 += $(call cc-disable-warning, format-truncation)
+warning-1 += $(call cc-disable-warning, sign-compare)
 
 warning-2 := -Waggregate-return
 warning-2 += -Wcast-align
@@ -74,6 +75,7 @@ warning-3 += -Wredundant-decls
 warning-3 += -Wswitch-default
 warning-3 += $(call cc-option, -Wpacked-bitfield-compat)
 warning-3 += $(call cc-option, -Wvla)
+warning-3 += $(call cc-option, -Wsign-compare)
 
 warning := $(warning-$(findstring 1, $(BUILD_ENABLE_EXTRA_GCC_CHECKS)))
 warning += $(warning-$(findstring 2, $(BUILD_ENABLE_EXTRA_GCC_CHECKS)))
-- 
2.19.2



[PATCH v2 09/13] btrfs-progs: Fix missing-prototypes warning caused by non-static functions

2018-12-04 Thread Qu Wenruo
Make the following functions static to avoid the missing-prototypes warning
(a minimal sketch of the warning follows the list):
 - btrfs.c::handle_special_globals()
 - check/mode-lowmem.c::repair_ternary_lowmem()
 - extent-tree.c::btrfs_search_overlap_extent()
 - free-space-tree.c::convert_free_space_to_bitmaps()
 - free-space-tree.c::convert_free_space_to_extents()
 - free-space-tree.c::__remove_from_free_space_tree()
 - free-space-tree.c::__add_to_free_space_tree()
 - free-space-tree.c::btrfs_create_tree()

Signed-off-by: Qu Wenruo 
Reviewed-by: Nikolay Borisov 
---
 btrfs.c |  2 +-
 check/mode-lowmem.c |  6 +++---
 extent-tree.c   |  2 +-
 free-space-tree.c   | 30 +++---
 4 files changed, 20 insertions(+), 20 deletions(-)

diff --git a/btrfs.c b/btrfs.c
index 2d39f2ced3e8..78c468d2e050 100644
--- a/btrfs.c
+++ b/btrfs.c
@@ -210,7 +210,7 @@ static int handle_global_options(int argc, char **argv)
return shift;
 }
 
-void handle_special_globals(int shift, int argc, char **argv)
+static void handle_special_globals(int shift, int argc, char **argv)
 {
int has_help = 0;
int has_full = 0;
diff --git a/check/mode-lowmem.c b/check/mode-lowmem.c
index 14bbc9ee6cb6..f56b5e8d45dc 100644
--- a/check/mode-lowmem.c
+++ b/check/mode-lowmem.c
@@ -953,9 +953,9 @@ out:
  * returns 0 means success.
  * returns not 0 means on error;
  */
-int repair_ternary_lowmem(struct btrfs_root *root, u64 dir_ino, u64 ino,
- u64 index, char *name, int name_len, u8 filetype,
- int err)
+static int repair_ternary_lowmem(struct btrfs_root *root, u64 dir_ino, u64 ino,
+u64 index, char *name, int name_len,
+u8 filetype, int err)
 {
struct btrfs_trans_handle *trans;
int stage = 0;
diff --git a/extent-tree.c b/extent-tree.c
index cd98633992ac..8c9cdeff3b02 100644
--- a/extent-tree.c
+++ b/extent-tree.c
@@ -3749,7 +3749,7 @@ static void __get_extent_size(struct btrfs_root *root, 
struct btrfs_path *path,
  * Return >0 for not found.
  * Return <0 for err
  */
-int btrfs_search_overlap_extent(struct btrfs_root *root,
+static int btrfs_search_overlap_extent(struct btrfs_root *root,
struct btrfs_path *path, u64 bytenr, u64 len)
 {
struct btrfs_key key;
diff --git a/free-space-tree.c b/free-space-tree.c
index 6641cdfa42ba..b3ffa90f704c 100644
--- a/free-space-tree.c
+++ b/free-space-tree.c
@@ -202,9 +202,9 @@ static void le_bitmap_set(unsigned long *map, unsigned int 
start, int len)
}
 }
 
-int convert_free_space_to_bitmaps(struct btrfs_trans_handle *trans,
- struct btrfs_block_group_cache *block_group,
- struct btrfs_path *path)
+static int convert_free_space_to_bitmaps(struct btrfs_trans_handle *trans,
+   struct btrfs_block_group_cache *block_group,
+   struct btrfs_path *path)
 {
struct btrfs_fs_info *fs_info = trans->fs_info;
struct btrfs_root *root = fs_info->free_space_root;
@@ -341,9 +341,9 @@ out:
return ret;
 }
 
-int convert_free_space_to_extents(struct btrfs_trans_handle *trans,
- struct btrfs_block_group_cache *block_group,
- struct btrfs_path *path)
+static int convert_free_space_to_extents(struct btrfs_trans_handle *trans,
+   struct btrfs_block_group_cache *block_group,
+   struct btrfs_path *path)
 {
struct btrfs_fs_info *fs_info = trans->fs_info;
struct btrfs_root *root = fs_info->free_space_root;
@@ -780,9 +780,9 @@ out:
return ret;
 }
 
-int __remove_from_free_space_tree(struct btrfs_trans_handle *trans,
- struct btrfs_block_group_cache *block_group,
- struct btrfs_path *path, u64 start, u64 size)
+static int __remove_from_free_space_tree(struct btrfs_trans_handle *trans,
+   struct btrfs_block_group_cache *block_group,
+   struct btrfs_path *path, u64 start, u64 size)
 {
struct btrfs_free_space_info *info;
u32 flags;
@@ -960,9 +960,9 @@ out:
return ret;
 }
 
-int __add_to_free_space_tree(struct btrfs_trans_handle *trans,
-struct btrfs_block_group_cache *block_group,
-struct btrfs_path *path, u64 start, u64 size)
+static int __add_to_free_space_tree(struct btrfs_trans_handle *trans,
+   struct btrfs_block_group_cache *block_group,
+   struct btrfs_path *path, u64 start, u64 size)
 {
struct btrfs_fs_info *fs_info = trans->fs_info;
struct btrfs_free_space_info *info;
@@ -1420,9 +1420,9 @@ out:
return ret;
 }
 
-struct btrfs_root *btrfs_create_tree(struct btrfs_trans_handle *trans,
- 

[PATCH v2 12/13] btrfs-progs: Add utils.h include to solve missing-prototypes warning

2018-12-04 Thread Qu Wenruo
Prototypes for arg_strtou64() and lookup_path_rootid() live in utils.h,
which utils-lib.c does not include, resulting in a make W=1 warning for them.

Just include that header to make W=1 happier.

Signed-off-by: Qu Wenruo 
Reviewed-by: Nikolay Borisov 
---
 utils-lib.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/utils-lib.c b/utils-lib.c
index 044f93fc4446..5bb89f2f1a8d 100644
--- a/utils-lib.c
+++ b/utils-lib.c
@@ -1,4 +1,5 @@
 #include "kerncompat.h"
+#include "utils.h"
 #include 
 #include 
 #include 
-- 
2.19.2



[PATCH v2 04/13] btrfs-progs: Fix Wempty-body warning

2018-12-04 Thread Qu Wenruo
messages.h:49:24: warning: suggest braces around empty body in an 'if' 
statement [-Wempty-body]
PRINT_TRACE_ON_ERROR;\

Adding braces around the empty body solves the problem.

Signed-off-by: Qu Wenruo 
Reviewed-by: Nikolay Borisov 
---
 messages.h | 15 ++-
 1 file changed, 10 insertions(+), 5 deletions(-)

diff --git a/messages.h b/messages.h
index ec7d93381a36..16f650d19a4b 100644
--- a/messages.h
+++ b/messages.h
@@ -45,13 +45,16 @@
 
 #define error_on(cond, fmt, ...)   \
do {\
-   if ((cond)) \
+   if ((cond)) {   \
PRINT_TRACE_ON_ERROR;   \
-   if ((cond)) \
+   }   \
+   if ((cond)) {   \
PRINT_VERBOSE_ERROR;\
+   }   \
__btrfs_error_on((cond), (fmt), ##__VA_ARGS__); \
-   if ((cond)) \
+   if ((cond)) {   \
DO_ABORT_ON_ERROR;  \
+   }   \
} while (0)
 
 #define error_btrfs_util(err)  \
@@ -76,10 +79,12 @@
 
 #define warning_on(cond, fmt, ...) \
do {\
-   if ((cond)) \
+   if ((cond)) {   \
PRINT_TRACE_ON_ERROR;   \
-   if ((cond)) \
+   }   \
+   if ((cond)) {   \
PRINT_VERBOSE_ERROR;\
+   }   \
__btrfs_warning_on((cond), (fmt), ##__VA_ARGS__);   \
} while (0)
 
-- 
2.19.2



[PATCH v2 10/13] btrfs-progs: Move btrfs_check_nodesize to fsfeatures.c to fix missing-prototypes warning

2018-12-04 Thread Qu Wenruo
And fsfeatures.c is indeed a better location for that function.

Signed-off-by: Qu Wenruo 
Reviewed-by: Nikolay Borisov 
---
 fsfeatures.c | 23 +++
 utils.c  | 23 ---
 2 files changed, 23 insertions(+), 23 deletions(-)

diff --git a/fsfeatures.c b/fsfeatures.c
index 7d85d60f1277..13ad030870cd 100644
--- a/fsfeatures.c
+++ b/fsfeatures.c
@@ -225,3 +225,26 @@ u32 get_running_kernel_version(void)
return version;
 }
 
+int btrfs_check_nodesize(u32 nodesize, u32 sectorsize, u64 features)
+{
+   if (nodesize < sectorsize) {
+   error("illegal nodesize %u (smaller than %u)",
+   nodesize, sectorsize);
+   return -1;
+   } else if (nodesize > BTRFS_MAX_METADATA_BLOCKSIZE) {
+   error("illegal nodesize %u (larger than %u)",
+   nodesize, BTRFS_MAX_METADATA_BLOCKSIZE);
+   return -1;
+   } else if (nodesize & (sectorsize - 1)) {
+   error("illegal nodesize %u (not aligned to %u)",
+   nodesize, sectorsize);
+   return -1;
+   } else if (features & BTRFS_FEATURE_INCOMPAT_MIXED_GROUPS &&
+  nodesize != sectorsize) {
+   error(
+   "illegal nodesize %u (not equal to %u for mixed block group)",
+   nodesize, sectorsize);
+   return -1;
+   }
+   return 0;
+}
diff --git a/utils.c b/utils.c
index b274f46fdd9d..a7e34b804551 100644
--- a/utils.c
+++ b/utils.c
@@ -2266,29 +2266,6 @@ int btrfs_tree_search2_ioctl_supported(int fd)
return ret;
 }
 
-int btrfs_check_nodesize(u32 nodesize, u32 sectorsize, u64 features)
-{
-   if (nodesize < sectorsize) {
-   error("illegal nodesize %u (smaller than %u)",
-   nodesize, sectorsize);
-   return -1;
-   } else if (nodesize > BTRFS_MAX_METADATA_BLOCKSIZE) {
-   error("illegal nodesize %u (larger than %u)",
-   nodesize, BTRFS_MAX_METADATA_BLOCKSIZE);
-   return -1;
-   } else if (nodesize & (sectorsize - 1)) {
-   error("illegal nodesize %u (not aligned to %u)",
-   nodesize, sectorsize);
-   return -1;
-   } else if (features & BTRFS_FEATURE_INCOMPAT_MIXED_GROUPS &&
-  nodesize != sectorsize) {
-   error("illegal nodesize %u (not equal to %u for mixed block 
group)",
-   nodesize, sectorsize);
-   return -1;
-   }
-   return 0;
-}
-
 /*
  * Copy a path argument from SRC to DEST and check the SRC length if it's at
  * most PATH_MAX and fits into DEST. DESTLEN is supposed to be exact size of
-- 
2.19.2



[PATCH v2 08/13] btrfs-progs: Fix Wtype-limits warning

2018-12-04 Thread Qu Wenruo
The only hit is the following code:

tlv_len = le16_to_cpu(tlv_hdr->tlv_len);

if (tlv_type == 0 || tlv_type > BTRFS_SEND_A_MAX
|| tlv_len > BTRFS_SEND_BUF_SIZE) {
error("invalid tlv in cmd tlv_type = %hu, tlv_len = 
%hu",
tlv_type, tlv_len);

@tlv_len is u16, while BTRFS_SEND_BUF_SIZE is 64K.
u16 MAX is 64K - 1, so the final check is always false.

Just remove it.
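
To see why the check can never fire, consider this standalone
reduction (hypothetical, not the send-stream code itself):

    #include <stdint.h>

    #define BTRFS_SEND_BUF_SIZE (64 * 1024)    /* 65536 */

    int tlv_len_too_big(uint16_t tlv_len)
    {
        /* UINT16_MAX is 65535 < 65536, so this is always false,
           which is exactly what -Wtype-limits complains about */
        return tlv_len > BTRFS_SEND_BUF_SIZE;
    }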

Signed-off-by: Qu Wenruo 
Reviewed-by: Nikolay Borisov 
---
 send-stream.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/send-stream.c b/send-stream.c
index 3b8e39c9486a..25461e92c37b 100644
--- a/send-stream.c
+++ b/send-stream.c
@@ -157,8 +157,7 @@ static int read_cmd(struct btrfs_send_stream *sctx)
tlv_type = le16_to_cpu(tlv_hdr->tlv_type);
tlv_len = le16_to_cpu(tlv_hdr->tlv_len);
 
-   if (tlv_type == 0 || tlv_type > BTRFS_SEND_A_MAX
-   || tlv_len > BTRFS_SEND_BUF_SIZE) {
+   if (tlv_type == 0 || tlv_type > BTRFS_SEND_A_MAX) {
error("invalid tlv in cmd tlv_type = %hu, tlv_len = 
%hu",
tlv_type, tlv_len);
ret = -EINVAL;
-- 
2.19.2



[PATCH v2 13/13] btrfs-progs: free-space-tree: Remove unused function

2018-12-04 Thread Qu Wenruo
set_free_space_tree_thresholds() is never used, just remove it to solve
the missing-prototypes warning from make W=1.

Signed-off-by: Qu Wenruo 
Reviewed-by: Nikolay Borisov 
---
 free-space-tree.c | 29 -
 1 file changed, 29 deletions(-)

diff --git a/free-space-tree.c b/free-space-tree.c
index b3ffa90f704c..af141e6e611a 100644
--- a/free-space-tree.c
+++ b/free-space-tree.c
@@ -24,35 +24,6 @@
 #include "bitops.h"
 #include "internal.h"
 
-void set_free_space_tree_thresholds(struct btrfs_block_group_cache *cache,
-   u64 sectorsize)
-{
-   u32 bitmap_range;
-   size_t bitmap_size;
-   u64 num_bitmaps, total_bitmap_size;
-
-   /*
-* We convert to bitmaps when the disk space required for using extents
-* exceeds that required for using bitmaps.
-*/
-   bitmap_range = sectorsize * BTRFS_FREE_SPACE_BITMAP_BITS;
-   num_bitmaps = div_u64(cache->key.offset + bitmap_range - 1,
- bitmap_range);
-   bitmap_size = sizeof(struct btrfs_item) + BTRFS_FREE_SPACE_BITMAP_SIZE;
-   total_bitmap_size = num_bitmaps * bitmap_size;
-   cache->bitmap_high_thresh = div_u64(total_bitmap_size,
-   sizeof(struct btrfs_item));
-
-   /*
-* We allow for a small buffer between the high threshold and low
-* threshold to avoid thrashing back and forth between the two formats.
-*/
-   if (cache->bitmap_high_thresh > 100)
-   cache->bitmap_low_thresh = cache->bitmap_high_thresh - 100;
-   else
-   cache->bitmap_low_thresh = 0;
-}
-
 static struct btrfs_free_space_info *
 search_free_space_info(struct btrfs_trans_handle *trans,
   struct btrfs_fs_info *fs_info,
-- 
2.19.2



[PATCH v2 07/13] btrfs-progs: Fix Wmaybe-uninitialized warning

2018-12-04 Thread Qu Wenruo
GCC 8.2.1 will report the following warning with "make W=1":

  ctree.c: In function 'btrfs_next_sibling_tree_block':
  ctree.c:2990:21: warning: 'slot' may be used uninitialized in this function [-Wmaybe-uninitialized]
path->slots[level] = slot;
~~~^~

The culprit is the following code:

int slot;   << Not initialized
int level = path->lowest_level + 1;
BUG_ON(path->lowest_level + 1 >= BTRFS_MAX_LEVEL);
while(level < BTRFS_MAX_LEVEL) {
slot = path->slots[level] + 1;
^^ but we initialize @slot here.
...
}
path->slots[level] = slot;

It's possible that the compiler doesn't get enough of a hint from the
BUG_ON() on the lowest_level + 1 >= BTRFS_MAX_LEVEL case.

Fix it by using a do {} while () loop rather than a while () {} loop, to
ensure the loop body runs at least once.

Signed-off-by: Qu Wenruo 
---
 ctree.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/ctree.c b/ctree.c
index 46e2ccedc0bf..867e8b60b199 100644
--- a/ctree.c
+++ b/ctree.c
@@ -2966,7 +2966,7 @@ int btrfs_next_sibling_tree_block(struct btrfs_fs_info *fs_info,
struct extent_buffer *next = NULL;
 
BUG_ON(path->lowest_level + 1 >= BTRFS_MAX_LEVEL);
-   while(level < BTRFS_MAX_LEVEL) {
+   do {
if (!path->nodes[level])
return 1;
 
@@ -2986,7 +2986,7 @@ int btrfs_next_sibling_tree_block(struct btrfs_fs_info *fs_info,
if (!extent_buffer_uptodate(next))
return -EIO;
break;
-   }
+   } while (level < BTRFS_MAX_LEVEL);
path->slots[level] = slot;
while(1) {
level--;
-- 
2.19.2



[PATCH v2 05/13] btrfs-progs: Fix Wimplicit-fallthrough warning

2018-12-04 Thread Qu Wenruo
Although most fallthrough cases are pretty obvious, we still need to teach
the dumb compiler that it's an explicit fallthrough.

Also reformat the code to use common indent.
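
A standalone example of the annotation (a minimal sketch; gcc 7 and
later recognize the attribute):

    int scale(int unit)
    {
        int base = 1;

        switch (unit) {
        case 2:
            base *= 1024;
            /* tells -Wimplicit-fallthrough the missing break
               is deliberate */
            __attribute__ ((fallthrough));
        case 1:
            base *= 1024;
            break;
        }
        return base;
    }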

Signed-off-by: Qu Wenruo 
Reviewed-by: Nikolay Borisov 
---
 utils.c | 30 ++
 1 file changed, 22 insertions(+), 8 deletions(-)

diff --git a/utils.c b/utils.c
index a310300829eb..b274f46fdd9d 100644
--- a/utils.c
+++ b/utils.c
@@ -1134,15 +1134,25 @@ int pretty_size_snprintf(u64 size, char *str, size_t str_size, unsigned unit_mod
num_divs = 0;
last_size = size;
switch (unit_mode & UNITS_MODE_MASK) {
-   case UNITS_TBYTES: base *= mult; num_divs++;
-   case UNITS_GBYTES: base *= mult; num_divs++;
-   case UNITS_MBYTES: base *= mult; num_divs++;
-   case UNITS_KBYTES: num_divs++;
-  break;
+   case UNITS_TBYTES:
+   base *= mult;
+   num_divs++;
+   __attribute__ ((fallthrough));
+   case UNITS_GBYTES:
+   base *= mult;
+   num_divs++;
+   __attribute__ ((fallthrough));
+   case UNITS_MBYTES:
+   base *= mult;
+   num_divs++;
+   __attribute__ ((fallthrough));
+   case UNITS_KBYTES:
+   num_divs++;
+   break;
case UNITS_BYTES:
-  base = 1;
-  num_divs = 0;
-  break;
+   base = 1;
+   num_divs = 0;
+   break;
default:
if (negative) {
s64 ssize = (s64)size;
@@ -1907,13 +1917,17 @@ int test_num_disk_vs_raid(u64 metadata_profile, u64 data_profile,
default:
case 4:
allowed |= BTRFS_BLOCK_GROUP_RAID10;
+   __attribute__ ((fallthrough));
case 3:
allowed |= BTRFS_BLOCK_GROUP_RAID6;
+   __attribute__ ((fallthrough));
case 2:
allowed |= BTRFS_BLOCK_GROUP_RAID0 | BTRFS_BLOCK_GROUP_RAID1 |
BTRFS_BLOCK_GROUP_RAID5;
+   __attribute__ ((fallthrough));
case 1:
allowed |= BTRFS_BLOCK_GROUP_DUP;
+   __attribute__ ((fallthrough));
}
 
if (dev_cnt > 1 && profile & BTRFS_BLOCK_GROUP_DUP) {
-- 
2.19.2



[PATCH v2 06/13] btrfs-progs: Fix Wsuggest-attribute=format warning

2018-12-04 Thread Qu Wenruo
Add __attribute__ ((format (printf, 4, 0))) to the vprintf-style
function to fix the warning.
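
As a minimal sketch on a hypothetical va_list helper (the last
attribute argument is 0 because the variadic arguments arrive as a
va_list and cannot be checked individually):

    #include <stdarg.h>
    #include <stdio.h>

    __attribute__ ((format (printf, 2, 0)))
    static void log_at_level(int level, const char *fmt, va_list ap)
    {
        fprintf(stderr, "<%d> ", level);
        vfprintf(stderr, fmt, ap);    /* fmt is checked as a printf format */
    }

    static void log_event(int level, const char *fmt, ...)
    {
        va_list ap;

        va_start(ap, fmt);
        log_at_level(level, fmt, ap);
        va_end(ap);
    }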

Signed-off-by: Qu Wenruo 
Reviewed-by: Nikolay Borisov 
---
 string-table.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/string-table.c b/string-table.c
index 95833768960d..455285702d51 100644
--- a/string-table.c
+++ b/string-table.c
@@ -48,6 +48,7 @@ struct string_table *table_create(int columns, int rows)
  * '>' the text is right aligned. If fmt is equal to '=' the text will
  * be replaced by a '=' dimensioned on the basis of the column width
  */
+__attribute__ ((format (printf, 4, 0)))
 char *table_vprintf(struct string_table *tab, int column, int row,
  const char *fmt, va_list ap)
 {
-- 
2.19.2



[PATCH v2 00/13] btrfs-progs: Make W=1 great (no "again")

2018-12-04 Thread Qu Wenruo
This patchset can be fetched from github:
https://github.com/adam900710/btrfs-progs/tree/warning_fixes
which is based on the v4.19 tag.

This patchset will make "make W=1" report no warnings.

This patchset first introduces a fix to Makefile.extrawarn to make
"cc-disable-warning" work, then disables the sign-compare warning
completely, as we really don't want an extra "unsigned" prefix to slow
our typing.

Then re-use (ok, in fact rework) Yanjun's patch to disable the
format-truncation warning.

Finally, fix all the remaining warnings reported by make W=1.

Now, we make "make W=1" great (may 'again' or not, depending on the
distribution rolling speed).

changelog:
v1.1:
- Use cc-disable-warning instead of putting -Wno-something to improve
  compatibility.
- Better explanation of the uninitialized variable caused by the
  BUG_ON() branch.
- Also cleanup free-space-tree.c

v2:
- Add reviewed-by tags, except the 7th patch, as it goes a different way
  to fix in v2.
- Fix bad port of cc-disable-warning, using $CFLAGS instead of kernel
  flags.
- Make sure fixed warnings still show in W=3.
- Use a do {} while () loop to replace the while () {} loop, so that
  even if the compiler doesn't have enough hints from BUG_ON(), it
  won't report an uninitialized variable warning.

Qu Wenruo (12):
  btrfs-progs: Makefile.extrawarn: Import cc-disable-warning
  btrfs-progs: Makefile.extrawarn: Don't warn on sign compare
  btrfs-progs: Fix Wempty-body warning
  btrfs-progs: Fix Wimplicit-fallthrough warning
  btrfs-progs: Fix Wsuggest-attribute=format warning
  btrfs-progs: Fix Wmaybe-uninitialized warning
  btrfs-progs: Fix Wtype-limits warning
  btrfs-progs: Fix missing-prototypes warning caused by non-static
functions
  btrfs-progs: Move btrfs_check_nodesize to fsfeatures.c to fix
missing-prototypes warning
  btrfs-progs: Introduce rescue.h to resolve missing-prototypes for
chunk and super rescue
  btrfs-progs: Add utils.h include to solve missing-prototypes warning
  btrfs-progs: free-space-tree: Remove unused function

Su Yanjun (1):
  btrfs-progs: fix gcc8 default build warning caused by
'-Wformat-truncation'

 Makefile|  5 
 Makefile.extrawarn  | 10 
 btrfs.c |  2 +-
 check/mode-lowmem.c |  6 ++---
 chunk-recover.c |  1 +
 cmds-rescue.c   |  4 +--
 ctree.c |  4 +--
 extent-tree.c   |  2 +-
 free-space-tree.c   | 59 -
 fsfeatures.c| 23 ++
 messages.h  | 15 
 rescue.h| 21 
 send-stream.c   |  3 +--
 string-table.c  |  1 +
 super-recover.c |  1 +
 utils-lib.c |  1 +
 utils.c | 53 +---
 17 files changed, 119 insertions(+), 92 deletions(-)
 create mode 100644 rescue.h

-- 
2.19.2



[PATCH v2 01/13] btrfs-progs: Makefile.extrawarn: Import cc-disable-warning

2018-12-04 Thread Qu Wenruo
We imported cc-option but forgot to import cc-disable-warning.

Fixes: b556a992c3ad ("btrfs-progs: build: allow to build with various compiler warnings")
Signed-off-by: Qu Wenruo 
---
 Makefile.extrawarn | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/Makefile.extrawarn b/Makefile.extrawarn
index 1f4bda94a167..18a3a860053e 100644
--- a/Makefile.extrawarn
+++ b/Makefile.extrawarn
@@ -19,6 +19,12 @@ try-run = $(shell set -e;   \
  cc-option = $(call try-run,\
  $(CC) $(CFLAGS) $(1) -c -x c /dev/null -o "$$TMP",$(1),$(2))
 
+# cc-disable-warning
+# Usage: cflags-y += $(call cc-disable-warning,unused-but-set-variable)
+cc-disable-warning = $(call try-run,\
+   $(CC) -Werror $(CFLAGS) -W$(strip $(1)) -c -x c /dev/null -o "$$TMP",-Wno-$(strip $(1)))
+
+
 # From linux.git/scripts/Makefile.extrawarn
 # ==
 #
-- 
2.19.2



[PATCH v2 02/13] btrfs-progs: fix gcc8 default build warning caused by '-Wformat-truncation'

2018-12-04 Thread Qu Wenruo
From: Su Yanjun 

When gcc8 + glibc 2.28.5 compiles utils.c, it complains as below:

  utils.c:852:45: warning: '%s' directive output may be truncated writing
  up to 4095 bytes into a region of size 4084 [-Wformat-truncation=]
 snprintf(path, sizeof(path), "/dev/mapper/%s", name);
 ^~   
  In file included from /usr/include/stdio.h:873,
   from utils.c:20:
  /usr/include/bits/stdio2.h:67:10: note: '__builtin___snprintf_chk'
  output between 13 and 4108 bytes into a destination of size 4096
 return __builtin___snprintf_chk (__s, __n, __USE_FORTIFY_LEVEL - 1,
^~~~
  __bos (__s), __fmt, __va_arg_pack ());
  ~

This isn't a type of warning we care about: in particular, when calling
snprintf() here we expect the string to be truncated.

Use the GCC option -Wno-format-truncation to disable this for the
default and W=1 builds, while still keeping it for the W=2/W=3 builds.
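
A self-contained reproduction of the warning (a hypothetical sketch
mirroring the utils.c pattern; compile with gcc 8):

    #include <stdio.h>
    #include <linux/limits.h>    /* PATH_MAX */

    void print_mapper_path(const char *name_src)
    {
        char name[PATH_MAX];
        char path[PATH_MAX];

        snprintf(name, sizeof(name), "%s", name_src);
        /* the 12-byte prefix plus up to PATH_MAX - 1 bytes of name
           plus the terminating nul cannot always fit in PATH_MAX
           bytes; the truncation is intentional, but gcc 8 warns
           about it by default */
        snprintf(path, sizeof(path), "/dev/mapper/%s", name);
        printf("%s\n", path);
    }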

Signed-off-by: Su Yanjun 
[Use cc-disable-warning to fix the not working CFLAGS setting in configure.ac]
[Keep the warning in W=2/W=3 build]
Signed-off-by: Qu Wenruo 
---
 Makefile   | 5 +
 Makefile.extrawarn | 2 ++
 2 files changed, 7 insertions(+)

diff --git a/Makefile b/Makefile
index f4ab14ea74c8..a9e57fecb6e6 100644
--- a/Makefile
+++ b/Makefile
@@ -62,6 +62,10 @@ DEBUG_LDFLAGS :=
 ABSTOPDIR = $(shell pwd)
 TOPDIR := .
 
+# Disable certain GCC 8 + glibc 2.28 warning for snprintf()
+# where string truncation for snprintf() is expected.
+DISABLE_WARNING_FLAGS := $(call cc-disable-warning, format-truncation)
+
 # Common build flags
 CFLAGS = $(SUBST_CFLAGS) \
 $(CSTD) \
@@ -73,6 +77,7 @@ CFLAGS = $(SUBST_CFLAGS) \
 -I$(TOPDIR) \
 -I$(TOPDIR)/kernel-lib \
 -I$(TOPDIR)/libbtrfsutil \
+$(DISABLE_WARNING_FLAGS) \
 $(EXTRAWARN_CFLAGS) \
 $(DEBUG_CFLAGS_INTERNAL) \
 $(EXTRA_CFLAGS)
diff --git a/Makefile.extrawarn b/Makefile.extrawarn
index 18a3a860053e..0c11f2450802 100644
--- a/Makefile.extrawarn
+++ b/Makefile.extrawarn
@@ -53,6 +53,7 @@ warning-1 += -Wold-style-definition
 warning-1 += $(call cc-option, -Wmissing-include-dirs)
 warning-1 += $(call cc-option, -Wunused-but-set-variable)
 warning-1 += $(call cc-disable-warning, missing-field-initializers)
+warning-1 += $(call cc-disable-warning, format-truncation)
 
 warning-2 := -Waggregate-return
 warning-2 += -Wcast-align
@@ -61,6 +62,7 @@ warning-2 += -Wnested-externs
 warning-2 += -Wshadow
 warning-2 += $(call cc-option, -Wlogical-op)
 warning-2 += $(call cc-option, -Wmissing-field-initializers)
+warning-2 += $(call cc-option, -Wformat-truncation)
 
 warning-3 := -Wbad-function-cast
 warning-3 += -Wcast-qual
-- 
2.19.2



Re: [PATCH 7/9] btrfs-progs: Fix Wmaybe-uninitialized warning

2018-12-04 Thread Qu Wenruo


On 2018/12/4 下午8:17, David Sterba wrote:
> On Fri, Nov 16, 2018 at 03:54:24PM +0800, Qu Wenruo wrote:
>> The only location is the following code:
>>
>>  int level = path->lowest_level + 1;
>>  BUG_ON(path->lowest_level + 1 >= BTRFS_MAX_LEVEL);
>>  while(level < BTRFS_MAX_LEVEL) {
>>  slot = path->slots[level] + 1;
>>  ...
>>  }
>>  path->slots[level] = slot;
>>
>> Again, it's the stupid compiler that needs some hint for the fact that
>> we will always enter the while loop at least once, thus @slot should
>> always be initialized.
> 
> Harsh words for the compiler, and I say not deserved. The same code
> pasted into the kernel and built with the same version does not report the
> warning, so it's apparently a missing annotation of BUG_ON in
> btrfs-progs that does not give the right hint.
> 
Well, in fact after the recent gcc8 updates (god knows how many versions
gcc8 got updated in Arch after the patchset), it doesn't report this
error anymore.

But your idea on the BUG_ON() lacking noreturn attribute makes sense.

I'll just add some hint for kerncompat.h.

Thanks,
Qu
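
(For reference, such a hint could look roughly like the sketch below,
assuming an abort-based BUG_ON; the names are illustrative, not the
actual kerncompat.h change:)

    #include <stdio.h>
    #include <stdlib.h>

    /* route the failure through a noreturn function so the compiler
       knows control cannot continue past a failed BUG_ON() */
    static void __attribute__((noreturn)) bugon_trap(const char *expr)
    {
        fprintf(stderr, "assertion failed: %s\n", expr);
        abort();
    }

    #define BUG_ON(cond)                \
        do {                            \
            if (cond)                   \
                bugon_trap(#cond);      \
        } while (0)

With the noreturn attribute the compiler can prune the impossible
branch, which is exactly the hint the uninitialized-variable analysis
was missing.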





Re: [PATCH 2/2] btrfs: scrub: move scrub_setup_ctx allocation out of device_list_mutex

2018-12-04 Thread David Sterba
On Tue, Dec 04, 2018 at 05:22:19PM +0200, Nikolay Borisov wrote:
> > @@ -3874,16 +3882,9 @@ int btrfs_scrub_dev(struct btrfs_fs_info *fs_info, u64 devid, u64 start,
> > if (ret) {
> > mutex_unlock(&fs_info->scrub_lock);
> > mutex_unlock(&fs_info->fs_devices->device_list_mutex);
> > -   return ret;
> > +   goto out_free_ctx;
> 
> Don't we suffer the same issue when calling scrub_workers_get since in
> it we do btrfs_alloc_workqueue which also calls kzalloc with GFP_KERNEL?

Yes, that's right. I instrumented only the allocations in scrub.c to see
if the nofs and lock_not_held assertions work at all so this one did not
get caught directly.

As scrub_workers_get still needs the scrub_lock, fixing it by moving
does not work and would need more restructuring.
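
(For context, the scoped-NOFS API mentioned above works roughly like
this sketch; it is an alternative approach, not what this patch does:)

    struct scrub_ctx *sctx;
    unsigned int nofs_flags;

    mutex_lock(&fs_info->fs_devices->device_list_mutex);
    nofs_flags = memalloc_nofs_save();
    /* GFP_KERNEL allocations in this scope implicitly behave as
       GFP_NOFS, so reclaim cannot re-enter the filesystem while
       the mutex is held */
    sctx = kzalloc(sizeof(*sctx), GFP_KERNEL);
    memalloc_nofs_restore(nofs_flags);
    mutex_unlock(&fs_info->fs_devices->device_list_mutex);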


Re: BTRFS Mount Delay Time Graph

2018-12-04 Thread Wilson, Ellis
On 12/4/18 8:07 AM, Nikolay Borisov wrote:
> On 3.12.18 г. 20:20 ч., Wilson, Ellis wrote:
>> With 14TB drives available today, it doesn't take more than a handful of
>> drives to result in a filesystem that takes around a minute to mount.
>> As a result of this, I suspect this will become an increasing problem
>> for serious users of BTRFS as time goes on.  I'm not complaining as I'm
>> not a contributor so I have no room to do so -- just shedding some light
>> on a problem that may deserve attention as filesystem sizes continue to
>> grow.
> Would it be possible to provide perf traces of the longer-running mount
> time? Everyone seems to be fixated on reading block groups (which is
> likely to be the culprit) but before pointing fingers I'd like concrete
> evidence pointed at the offender.

I am glad to collect such traces -- please advise on the commands that 
would achieve that.  If you just mean block traces, I can do that, but I 
suspect you mean something more BTRFS-specific.

Best,

ellis
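
(For reference, a whole-system profile of a slow mount can be captured
with a generic perf invocation along these lines; device and mount
point are placeholders:)

    # record all CPUs with call graphs for the duration of the mount
    perf record -a -g -- mount /dev/sdb /mnt/test
    # then inspect where the time went
    perf report --stdio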



Re: [PATCH 3/3] btrfs: replace cleaner_delayed_iput_mutex with a waitqueue

2018-12-04 Thread Josef Bacik
On Tue, Dec 04, 2018 at 01:46:58PM +0200, Nikolay Borisov wrote:
> 
> 
> On 3.12.18 г. 18:06 ч., Josef Bacik wrote:
> > The throttle path doesn't take cleaner_delayed_iput_mutex, which means
> > we could think we're done flushing iputs in the data space reservation
> > path when we could have a throttler doing an iput.  There's no real
> > reason to serialize the delayed iput flushing, so instead of taking the
> > cleaner_delayed_iput_mutex whenever we flush the delayed iputs just
> > replace it with an atomic counter and a waitqueue.  This removes the
> > short (or long depending on how big the inode is) window where we think
> > there are no more pending iputs when there really are some.
> > 
> > Signed-off-by: Josef Bacik 
> > ---
> >  fs/btrfs/ctree.h   |  4 +++-
> >  fs/btrfs/disk-io.c |  5 ++---
> >  fs/btrfs/extent-tree.c | 13 -
> >  fs/btrfs/inode.c   | 21 +++++
> >  4 files changed, 34 insertions(+), 9 deletions(-)
> > 
> > diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> > index dc56a4d940c3..20af5d6d81f1 100644
> > --- a/fs/btrfs/ctree.h
> > +++ b/fs/btrfs/ctree.h
> > @@ -915,7 +915,8 @@ struct btrfs_fs_info {
> >  
> > spinlock_t delayed_iput_lock;
> > struct list_head delayed_iputs;
> > -   struct mutex cleaner_delayed_iput_mutex;
> > +   atomic_t nr_delayed_iputs;
> > +   wait_queue_head_t delayed_iputs_wait;
> >  
> 
> Have you considered whether the same could be achieved with a completion
> rather than an open-coded waitqueue? I tried prototyping it and it could
> be done but it becomes messy regarding when the completion should be
> initialised i.e only when it's not in btrfs_add_delayed_iput
> 

Yeah a waitqueue makes more sense here than a completion since it's not a
one-off.

> 
> 
> 
> > @@ -4958,9 +4962,8 @@ static void flush_space(struct btrfs_fs_info *fs_info,
> >  * bunch of pinned space, so make sure we run the iputs before
> >  * we do our pinned bytes check below.
> >  */
> > -   mutex_lock(&fs_info->cleaner_delayed_iput_mutex);
> > btrfs_run_delayed_iputs(fs_info);
> > -   mutex_unlock(&fs_info->cleaner_delayed_iput_mutex);
> > +   btrfs_wait_on_delayed_iputs(fs_info);
> 
> Waiting on delayed iputs here is pointless since they are run
> synchronously from this context.
> 

Unless there are other threads (the cleaner thread) running iputs as well.  We
could be running an iput that is evicting the inode in another thread and we
really want that space, so we need to wait here to make sure everybody is truly
done.  Thanks,

Josef
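
(For readers following along, the counter-plus-waitqueue idiom under
discussion has roughly this shape -- a generic sketch of the kernel
pattern, not the patch itself:)

    atomic_t nr_delayed_iputs = ATOMIC_INIT(0);
    DECLARE_WAIT_QUEUE_HEAD(delayed_iputs_wait);

    /* producer: queue one delayed iput */
    atomic_inc(&nr_delayed_iputs);

    /* consumer: after finishing an iput, wake waiters on the last one */
    if (atomic_dec_and_test(&nr_delayed_iputs))
        wake_up(&delayed_iputs_wait);

    /* flusher: wait until every pending iput is done, no matter
       which thread runs it */
    wait_event(delayed_iputs_wait, atomic_read(&nr_delayed_iputs) == 0);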


Re: [PATCH 2/3] btrfs: wakeup cleaner thread when adding delayed iput

2018-12-04 Thread Josef Bacik
On Tue, Dec 04, 2018 at 11:21:14AM +0200, Nikolay Borisov wrote:
> 
> 
> On 3.12.18 г. 18:06 ч., Josef Bacik wrote:
> > The cleaner thread usually takes care of delayed iputs, with the
> > exception of the btrfs_end_transaction_throttle path.  The cleaner
> > thread only gets woken up every 30 seconds, so instead wake it up to do
> > its work so that we can free up that space as quickly as possible.
> 
> This description misses any rationale whatsoever about why the cleaner
> needs to be woken up more frequently than 30 seconds (and IMO this is
> the most important question that needs answering).
> 

Yeah I'll add that.

> Also have you done any measurements of the number of processed delayed
> inodes with this change. Given the behavior you so desire why not just
> make delayed iputs run via schedule_work on the global workqueue
> and be done with it? I'm sure the latency will be better than the
> current 30 seconds one :)

We already have the cleaner thread to do this work, and it sets up for the
snapshot drop stuff to be run as well.  We could probably add another delayed
work thing, but I would rather do that in a different patch.  Thanks,

Josef


[Mount time bug bounty?] was: BTRFS Mount Delay Time Graph

2018-12-04 Thread Lionel Bouton
On 03/12/2018 at 23:22, Hans van Kranenburg wrote:
> [...]
> Yes, I think that's true. See btrfs_read_block_groups in extent-tree.c:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/btrfs/extent-tree.c#n9982
>
> What the code is doing here is starting at the beginning of the extent
> tree, searching forward until it sees the first BLOCK_GROUP_ITEM (which
> is not that far away), and then based on the information in it, computes
> where the next one will be (just after the end of the vaddr+length of
> it), and then jumps over all normal extent items and searches again near
> where the next block group item has to be. So, yes, that means that they
> depend on each other.
>
> Two possible ways to improve this:
>
> 1. Instead, walk the chunk tree (which has all related items packed
> together) instead to find out at which locations in the extent tree the
> block group items are located and then start getting items in parallel.
> If you have storage with a lot of rotating rust that can deliver much
> more random reads if you ask for more of them at the same time, then
> this can already cause a massive speedup.
>
> 2. Move the block group items somewhere else, where they can nicely be
> grouped together, so that the amount of metadata pages that has to be
> looked up is minimal. Quoting from the link below, "slightly tricky
> [...] but there are no fundamental obstacles".
>
> https://www.spinics.net/lists/linux-btrfs/msg71766.html
>
> I think the main obstacle here is finding a developer with enough
> experience and time to do it. :)

I would definitely be interested in sponsoring at least a part of the
needed time through my company (we are too small to hire kernel
developers full-time but we can make a one-time contribution for
something as valuable to us as faster mount delays).

If needed it could be split into two steps with separate bounties:
- providing a patch for the latest LTS kernel with a substantial
decrease in mount time in our case (ideally less than a minute instead
of 15 minutes but <5 minutes is already worth it).
- having it integrated in mainline.

I don't have any experience with company sponsorship/bounties but I'm
willing to learn (don't hesitate to make suggestions). I'll have to
discuss it with our accountant to make sure we do it correctly.

Is this the right place to discuss this kind of subject, or should I
take the discussion elsewhere?

Best regards,

Lionel
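
(The extent tree walk Hans describes in the quoted message boils down
to roughly the following sketch; find_next_block_group_item() and
read_one_block_group() are illustrative stand-ins, not the actual
kernel helpers:)

    struct btrfs_key key = {
        .objectid = 0,
        .type = BTRFS_BLOCK_GROUP_ITEM_KEY,
        .offset = 0,
    };

    /* A block group item's key is (start vaddr, type, length), so the
       next one cannot start before vaddr + length; everything in
       between is ordinary extent items that can be skipped. */
    while (find_next_block_group_item(extent_root, &key) == 0) {
        read_one_block_group(&key);
        key.objectid += key.offset;    /* jump past this block group */
        key.offset = 0;
    }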


Re: [PATCHv3] btrfs: Fix error handling in btrfs_cleanup_ordered_extents

2018-12-04 Thread Nikolay Borisov



On 21.11.18 г. 17:10 ч., Nikolay Borisov wrote:
> Running btrfs/124 in a loop hung up on me sporadically with the
> following call trace:
>   btrfs   D0  5760   5324 0x
>   Call Trace:
>? __schedule+0x243/0x800
>schedule+0x33/0x90
>btrfs_start_ordered_extent+0x10c/0x1b0 [btrfs]
>? wait_woken+0xa0/0xa0
>btrfs_wait_ordered_range+0xbb/0x100 [btrfs]
>btrfs_relocate_block_group+0x1ff/0x230 [btrfs]
>btrfs_relocate_chunk+0x49/0x100 [btrfs]
>    btrfs_balance+0xbeb/0x1740 [btrfs]
>    btrfs_ioctl_balance+0x2ee/0x380 [btrfs]
>btrfs_ioctl+0x1691/0x3110 [btrfs]
>? lockdep_hardirqs_on+0xed/0x180
>? __handle_mm_fault+0x8e7/0xfb0
>? _raw_spin_unlock+0x24/0x30
>? __handle_mm_fault+0x8e7/0xfb0
>? do_vfs_ioctl+0xa5/0x6e0
>    ? btrfs_ioctl_get_supported_features+0x30/0x30 [btrfs]
>do_vfs_ioctl+0xa5/0x6e0
>? entry_SYSCALL_64_after_hwframe+0x3e/0xbe
>ksys_ioctl+0x3a/0x70
>__x64_sys_ioctl+0x16/0x20
>do_syscall_64+0x60/0x1b0
>entry_SYSCALL_64_after_hwframe+0x49/0xbe
> 
> This happens because during page writeback it's valid for
> writepage_delalloc to instantiate a delalloc range which doesn't
> belong to the page currently being written back.
> 
> The reason this case is valid is due to find_lock_delalloc_range
> returning any available range after the passed delalloc_start and
> ignoring whether the page under writeback is within that range.
> In turn ordered extents (OE) are always created for the returned range
> from find_lock_delalloc_range. If, however, a failure occurs while OE
> are being created then the clean up code in btrfs_cleanup_ordered_extents
> will be called.
> 
> Unfortunately the code in btrfs_cleanup_ordered_extents doesn't consider
> the case of such a 'foreign' range being processed and instead it always
> assumes that the range OE are created for belongs to the page. This
> leads to the first page of such a foreign range not being cleaned up,
> since it's deliberately skipped by the current cleanup code.
> 
> Fix this by correctly checking whether the current page belongs to the
> range being instantiated and if so adjust the range parameters
> passed for cleaning up. If it doesn't, then just clean the whole OE
> range directly.
> 
> Signed-off-by: Nikolay Borisov 
> Reviewed-by: Josef Bacik 
> ---
> V3:
>  * Re-worded the commit for easier comprehension
>  * Added RB tag from Josef
> 
> V2:
>  * Fix compilation failure due to missing parentheses
>  * Fixed the "Fixes" tag.
>  fs/btrfs/inode.c | 29 -
>  1 file changed, 20 insertions(+), 9 deletions(-)
> 

Ping,

Also this patch needs:

Fixes: 524272607e88 ("btrfs: Handle delalloc error correctly to avoid
ordered extent hang") and it needs to be applied to the stable releases 4.14





Re: [PATCH 2/2] btrfs: scrub: move scrub_setup_ctx allocation out of device_list_mutex

2018-12-04 Thread Nikolay Borisov



On 4.12.18 г. 17:11 ч., David Sterba wrote:
> The scrub context is allocated with GFP_KERNEL and called from
> btrfs_scrub_dev under the fs_info::device_list_mutex. This is not safe
> regarding reclaim that could try to flush filesystem data in order to
> get the memory. And the device_list_mutex is held during superblock
> commit, so this would cause a lockup.
> 
> Move the allocation and initialization before any changes that require
> the mutex.
> 
> Signed-off-by: David Sterba 
> ---
>  fs/btrfs/scrub.c | 30 ++
>  1 file changed, 18 insertions(+), 12 deletions(-)
> 
> diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
> index ffcab263e057..051d14c9f013 100644
> --- a/fs/btrfs/scrub.c
> +++ b/fs/btrfs/scrub.c
> @@ -3834,13 +3834,18 @@ int btrfs_scrub_dev(struct btrfs_fs_info *fs_info, 
> u64 devid, u64 start,
>   return -EINVAL;
>   }
>  
> + /* Allocate outside of device_list_mutex */
> + sctx = scrub_setup_ctx(fs_info, is_dev_replace);
> + if (IS_ERR(sctx))
> + return PTR_ERR(sctx);
>  
>   mutex_lock(&fs_info->fs_devices->device_list_mutex);
>   dev = btrfs_find_device(fs_info, devid, NULL, NULL);
>   if (!dev || (test_bit(BTRFS_DEV_STATE_MISSING, &dev->dev_state) &&
>!is_dev_replace)) {
>   mutex_unlock(&fs_info->fs_devices->device_list_mutex);
> - return -ENODEV;
> + ret = -ENODEV;
> + goto out_free_ctx;
>   }
>  
>   if (!is_dev_replace && !readonly &&
> @@ -3848,7 +3853,8 @@ int btrfs_scrub_dev(struct btrfs_fs_info *fs_info, u64 devid, u64 start,
>   mutex_unlock(&fs_info->fs_devices->device_list_mutex);
>   btrfs_err_in_rcu(fs_info, "scrub: device %s is not writable",
>   rcu_str_deref(dev->name));
> - return -EROFS;
> + ret = -EROFS;
> + goto out_free_ctx;
>   }
>  
>   mutex_lock(&fs_info->scrub_lock);
> @@ -3856,7 +3862,8 @@ int btrfs_scrub_dev(struct btrfs_fs_info *fs_info, u64 devid, u64 start,
>   test_bit(BTRFS_DEV_STATE_REPLACE_TGT, &dev->dev_state)) {
>   mutex_unlock(&fs_info->scrub_lock);
>   mutex_unlock(&fs_info->fs_devices->device_list_mutex);
> - return -EIO;
> + ret = -EIO;
> + goto out_free_ctx;
>   }
>  
>   btrfs_dev_replace_read_lock(&fs_info->dev_replace);
> @@ -3866,7 +3873,8 @@ int btrfs_scrub_dev(struct btrfs_fs_info *fs_info, u64 devid, u64 start,
>   btrfs_dev_replace_read_unlock(&fs_info->dev_replace);
>   mutex_unlock(&fs_info->scrub_lock);
>   mutex_unlock(&fs_info->fs_devices->device_list_mutex);
> - return -EINPROGRESS;
> + ret = -EINPROGRESS;
> + goto out_free_ctx;
>   }
>   btrfs_dev_replace_read_unlock(&fs_info->dev_replace);
>  
> @@ -3874,16 +3882,9 @@ int btrfs_scrub_dev(struct btrfs_fs_info *fs_info, u64 devid, u64 start,
>   if (ret) {
>   mutex_unlock(&fs_info->scrub_lock);
>   mutex_unlock(&fs_info->fs_devices->device_list_mutex);
> - return ret;
> + goto out_free_ctx;

Don't we suffer the same issue when calling scrub_workers_get since in
it we do btrfs_alloc_workqueue which also calls kzalloc with GFP_KERNEL?


>   }
>  
> - sctx = scrub_setup_ctx(fs_info, is_dev_replace);
> - if (IS_ERR(sctx)) {
> - mutex_unlock(&fs_info->scrub_lock);
> - mutex_unlock(&fs_info->fs_devices->device_list_mutex);
> - scrub_workers_put(fs_info);
> - return PTR_ERR(sctx);
> - }
>   sctx->readonly = readonly;
>   dev->scrub_ctx = sctx;
>   mutex_unlock(&fs_info->fs_devices->device_list_mutex);
> @@ -3936,6 +3937,11 @@ int btrfs_scrub_dev(struct btrfs_fs_info *fs_info, u64 devid, u64 start,
>  
>   scrub_put_ctx(sctx);
>  
> + return ret;
> +
> +out_free_ctx:
> + scrub_free_ctx(sctx);
> +
>   return ret;
>  }
>  
> 

