Re: [PATCH][v2] btrfs: run delayed items before dropping the snapshot

2018-12-05 Thread Filipe Manana
On Wed, Dec 5, 2018 at 5:14 PM Josef Bacik  wrote:
>
> From: Josef Bacik 
>
> With my delayed refs patches in place we started seeing a large amount
> of aborts in __btrfs_free_extent
>
> BTRFS error (device sdb1): unable to find ref byte nr 91947008 parent 0 root 35964 owner 1 offset 0
> Call Trace:
>  ? btrfs_merge_delayed_refs+0xaf/0x340
>  __btrfs_run_delayed_refs+0x6ea/0xfc0
>  ? btrfs_set_path_blocking+0x31/0x60
>  btrfs_run_delayed_refs+0xeb/0x180
>  btrfs_commit_transaction+0x179/0x7f0
>  ? btrfs_check_space_for_delayed_refs+0x30/0x50
>  ? should_end_transaction.isra.19+0xe/0x40
>  btrfs_drop_snapshot+0x41c/0x7c0
>  btrfs_clean_one_deleted_snapshot+0xb5/0xd0
>  cleaner_kthread+0xf6/0x120
>  kthread+0xf8/0x130
>  ? btree_invalidatepage+0x90/0x90
>  ? kthread_bind+0x10/0x10
>  ret_from_fork+0x35/0x40
>
> This was because btrfs_drop_snapshot depends on the root not being modified
> while it's dropping the snapshot.  It will unlock the root node (and really
> every node) as it walks down the tree, only to re-lock it when it needs to
> do something.  This is a problem because if we modify the tree we could cow
> a block in our path, which frees our reference to that block.  Then once we
> get back to that shared block we'll free our reference to it again, and get
> ENOENT when trying to look up our extent reference to that block in
> __btrfs_free_extent.
>
> This is ultimately happening because we have delayed items left to be
> processed for our deleted snapshot _after_ all of the inodes are closed for
> the snapshot.  We only run the delayed inode item if we're deleting the
> inode, and even then we do not run the delayed insertions or delayed
> removals.  These can be run at any point after our final inode does its
> last iput, which is what triggers the snapshot deletion.  We can end up
> with the snapshot deletion happening and then have the delayed items run
> on that file system, resulting in the above problem.
>
> This problem has existed forever, however my patches made it much easier
> to hit as I wake up the cleaner much more often to deal with delayed
> iputs, which made us more likely to start the snapshot dropping work
> before the transaction commits, which is when the delayed items would
> generally be run.  Before, generally speaking, we would run the delayed
> items, commit the transaction, and wake up the cleaner thread to start
> deleting snapshots, which means we were less likely to hit this problem.
> You could still hit it if you had multiple snapshots to be deleted and
> ended up with lots of delayed items, but it was definitely harder.
>
> Fix for now by simply running all the delayed items before starting to
> drop the snapshot.  We could make this smarter in the future by making
> the delayed items per-root, and then simply drop any delayed items for
> roots that we are going to delete.  But for now just a quick and easy
> solution is the safest.
>
> Cc: sta...@vger.kernel.org
> Signed-off-by: Josef Bacik 

Reviewed-by: Filipe Manana 

Looks good now. Thanks.

> ---
> v1->v2:
> - check for errors from btrfs_run_delayed_items.
> - Dave, I can reroll the series, but the second version of patch 1 is the
>   same; let me know what you want.
>
>  fs/btrfs/extent-tree.c | 4 
>  1 file changed, 4 insertions(+)
>
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index dcb699dd57f3..473084eb7a2d 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -9330,6 +9330,10 @@ int btrfs_drop_snapshot(struct btrfs_root *root,
> goto out_free;
> }
>
> +   err = btrfs_run_delayed_items(trans);
> +   if (err)
> +   goto out_end_trans;
> +
> if (block_rsv)
> trans->block_rsv = block_rsv;
>
> --
> 2.14.3
>


-- 
Filipe David Manana,

“Whether you think you can, or you think you can't — you're right.”


Re: [PATCH 2/2] btrfs: run delayed items before dropping the snapshot

2018-11-30 Thread Filipe Manana
On Fri, Nov 30, 2018 at 5:12 PM Filipe Manana  wrote:
>
> On Fri, Nov 30, 2018 at 4:53 PM Josef Bacik  wrote:
> >
> > From: Josef Bacik 
> >
> > With my delayed refs patches in place we started seeing a large amount
> > of aborts in __btrfs_free_extent
> >
> > BTRFS error (device sdb1): unable to find ref byte nr 91947008 parent 0 root 35964 owner 1 offset 0
> > Call Trace:
> >  ? btrfs_merge_delayed_refs+0xaf/0x340
> >  __btrfs_run_delayed_refs+0x6ea/0xfc0
> >  ? btrfs_set_path_blocking+0x31/0x60
> >  btrfs_run_delayed_refs+0xeb/0x180
> >  btrfs_commit_transaction+0x179/0x7f0
> >  ? btrfs_check_space_for_delayed_refs+0x30/0x50
> >  ? should_end_transaction.isra.19+0xe/0x40
> >  btrfs_drop_snapshot+0x41c/0x7c0
> >  btrfs_clean_one_deleted_snapshot+0xb5/0xd0
> >  cleaner_kthread+0xf6/0x120
> >  kthread+0xf8/0x130
> >  ? btree_invalidatepage+0x90/0x90
> >  ? kthread_bind+0x10/0x10
> >  ret_from_fork+0x35/0x40
> >
> > This was because btrfs_drop_snapshot depends on the root not being modified
> > while it's dropping the snapshot.  It will unlock the root node (and really
> > every node) as it walks down the tree, only to re-lock it when it needs to
> > do something.  This is a problem because if we modify the tree we could cow
> > a block in our path, which frees our reference to that block.  Then once we
> > get back to that shared block we'll free our reference to it again, and get
> > ENOENT when trying to look up our extent reference to that block in
> > __btrfs_free_extent.
> >
> > This is ultimately happening because we have delayed items left to be
> > processed for our deleted snapshot _after_ all of the inodes are closed for
> > the snapshot.  We only run the delayed inode item if we're deleting the
> > inode, and even then we do not run the delayed insertions or delayed
> > removals.  These can be run at any point after our final inode does its
> > last iput, which is what triggers the snapshot deletion.  We can end up
> > with the snapshot deletion happening and then have the delayed items run
> > on that file system, resulting in the above problem.
> >
> > This problem has existed forever, however my patches made it much easier
> > to hit as I wake up the cleaner much more often to deal with delayed
> > iputs, which made us more likely to start the snapshot dropping work
> > before the transaction commits, which is when the delayed items would
> > generally be run.  Before, generally speaking, we would run the delayed
> > items, commit the transaction, and wake up the cleaner thread to start
> > deleting snapshots, which means we were less likely to hit this problem.
> > You could still hit it if you had multiple snapshots to be deleted and
> > ended up with lots of delayed items, but it was definitely harder.
> >
> > Fix for now by simply running all the delayed items before starting to
> > drop the snapshot.  We could make this smarter in the future by making
> > the delayed items per-root, and then simply drop any delayed items for
> > roots that we are going to delete.  But for now just a quick and easy
> > solution is the safest.
> >
> > Cc: sta...@vger.kernel.org
> > Signed-off-by: Josef Bacik 
>
> Reviewed-by: Filipe Manana 
>
> Great catch!
> I've hit this error from __btrfs_free_extent() a handful of times over the
> years, but never managed to reproduce it on demand or figure out it was
> related to snapshot deletion.
>
> > ---
> >  fs/btrfs/extent-tree.c | 2 ++
> >  1 file changed, 2 insertions(+)
> >
> > diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> > index dcb699dd57f3..965702034b22 100644
> > --- a/fs/btrfs/extent-tree.c
> > +++ b/fs/btrfs/extent-tree.c
> > @@ -9330,6 +9330,8 @@ int btrfs_drop_snapshot(struct btrfs_root *root,
> > goto out_free;
> > }
> >
> > +   btrfs_run_delayed_items(trans);
> > +

Btw, we should check the return value of this and return it if it's an error?
There's nothing we can do with it in the context of the cleaner thread,
which is why, I suppose, you chose to ignore the value (besides that
the error might have been for some other root).
But this can also be used in the context of relocation, where we can return
the error back to userspace.
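Something along these lines would do it (a minimal sketch, and what v2 of
the patch ended up doing):

    err = btrfs_run_delayed_items(trans);
    if (err)
            goto out_end_trans;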

Thanks.

> > if (block_rsv)
> > trans->block_rsv = block_rsv;
> >
> > --
> > 2.14.3
> >
>
>
> --
> Filipe David Manana,
>
> “Whether you think you can, or you think you can't — you're right.”



-- 
Filipe David Manana,

“Whether you think you can, or you think you can't — you're right.”


Re: [PATCH 1/2] btrfs: catch cow on deleting snapshots

2018-11-30 Thread Filipe Manana
On Fri, Nov 30, 2018 at 4:53 PM Josef Bacik  wrote:
>
> From: Josef Bacik 
>
> When debugging some weird extent reference bug I suspected that we were
> changing a snapshot while we were deleting it, which could explain my
> bug.  This was indeed what was happening, and this patch helped me
> verify my theory.  It is never correct to modify the snapshot once it's
> being deleted, so mark the root when we are deleting it and make sure we
> complain about it when it happens.
>
> Signed-off-by: Josef Bacik 
> ---
>  fs/btrfs/ctree.c   | 3 +++
>  fs/btrfs/ctree.h   | 1 +
>  fs/btrfs/extent-tree.c | 9 +
>  3 files changed, 13 insertions(+)
>
> diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
> index 5912a97b07a6..5f82f86085e8 100644
> --- a/fs/btrfs/ctree.c
> +++ b/fs/btrfs/ctree.c
> @@ -1440,6 +1440,9 @@ noinline int btrfs_cow_block(struct btrfs_trans_handle 
> *trans,
> u64 search_start;
> int ret;
>
> +   if (test_bit(BTRFS_ROOT_DELETING, &root->state))
> +           WARN(1, KERN_CRIT "cow'ing blocks on a fs root thats being dropped\n");

Please use btrfs_warn(), it makes sure we use a consistent message
style, identifies the fs, etc.
Also, "thats" should be "that is" or "that's".

With that fixed,
Reviewed-by: Filipe Manana 

> +
> if (trans->transaction != fs_info->running_transaction)
> WARN(1, KERN_CRIT "trans %llu running %llu\n",
>            trans->transid,
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index facde70c15ed..5a3a94ccb65c 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -1199,6 +1199,7 @@ enum {
> BTRFS_ROOT_FORCE_COW,
> BTRFS_ROOT_MULTI_LOG_TASKS,
> BTRFS_ROOT_DIRTY,
> +   BTRFS_ROOT_DELETING,
>  };
>
>  /*
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index 581c2a0b2945..dcb699dd57f3 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -9333,6 +9333,15 @@ int btrfs_drop_snapshot(struct btrfs_root *root,
> if (block_rsv)
> trans->block_rsv = block_rsv;
>
> +   /*
> +    * This will help us catch people modifying the fs tree while we're
> +    * dropping it.  It is unsafe to mess with the fs tree while it's being
> +    * dropped as we unlock the root node and parent nodes as we walk down
> +    * the tree, assuming nothing will change.  If something does change
> +    * then we'll have stale information and drop references to blocks we've
> +    * already dropped.
> +    */
> +   set_bit(BTRFS_ROOT_DELETING, &root->state);
>     if (btrfs_disk_key_objectid(&root_item->drop_progress) == 0) {
> level = btrfs_header_level(root->node);
> path->nodes[level] = btrfs_lock_root_node(root);
> --
> 2.14.3
>


-- 
Filipe David Manana,

“Whether you think you can, or you think you can't — you're right.”


Re: [PATCH 2/2] btrfs: run delayed items before dropping the snapshot

2018-11-30 Thread Filipe Manana
On Fri, Nov 30, 2018 at 4:53 PM Josef Bacik  wrote:
>
> From: Josef Bacik 
>
> With my delayed refs patches in place we started seeing a large amount
> of aborts in __btrfs_free_extent
>
> BTRFS error (device sdb1): unable to find ref byte nr 91947008 parent 0 root 35964 owner 1 offset 0
> Call Trace:
>  ? btrfs_merge_delayed_refs+0xaf/0x340
>  __btrfs_run_delayed_refs+0x6ea/0xfc0
>  ? btrfs_set_path_blocking+0x31/0x60
>  btrfs_run_delayed_refs+0xeb/0x180
>  btrfs_commit_transaction+0x179/0x7f0
>  ? btrfs_check_space_for_delayed_refs+0x30/0x50
>  ? should_end_transaction.isra.19+0xe/0x40
>  btrfs_drop_snapshot+0x41c/0x7c0
>  btrfs_clean_one_deleted_snapshot+0xb5/0xd0
>  cleaner_kthread+0xf6/0x120
>  kthread+0xf8/0x130
>  ? btree_invalidatepage+0x90/0x90
>  ? kthread_bind+0x10/0x10
>  ret_from_fork+0x35/0x40
>
> This was because btrfs_drop_snapshot depends on the root not being modified
> while it's dropping the snapshot.  It will unlock the root node (and really
> every node) as it walks down the tree, only to re-lock it when it needs to
> do something.  This is a problem because if we modify the tree we could cow
> a block in our path, which frees our reference to that block.  Then once we
> get back to that shared block we'll free our reference to it again, and get
> ENOENT when trying to look up our extent reference to that block in
> __btrfs_free_extent.
>
> This is ultimately happening because we have delayed items left to be
> processed for our deleted snapshot _after_ all of the inodes are closed for
> the snapshot.  We only run the delayed inode item if we're deleting the
> inode, and even then we do not run the delayed insertions or delayed
> removals.  These can be run at any point after our final inode does its
> last iput, which is what triggers the snapshot deletion.  We can end up
> with the snapshot deletion happening and then have the delayed items run
> on that file system, resulting in the above problem.
>
> This problem has existed forever, however my patches made it much easier
> to hit as I wake up the cleaner much more often to deal with delayed
> iputs, which made us more likely to start the snapshot dropping work
> before the transaction commits, which is when the delayed items would
> generally be run.  Before, generally speaking, we would run the delayed
> items, commit the transaction, and wake up the cleaner thread to start
> deleting snapshots, which means we were less likely to hit this problem.
> You could still hit it if you had multiple snapshots to be deleted and
> ended up with lots of delayed items, but it was definitely harder.
>
> Fix for now by simply running all the delayed items before starting to
> drop the snapshot.  We could make this smarter in the future by making
> the delayed items per-root, and then simply drop any delayed items for
> roots that we are going to delete.  But for now just a quick and easy
> solution is the safest.
>
> Cc: sta...@vger.kernel.org
> Signed-off-by: Josef Bacik 

Reviewed-by: Filipe Manana 

Great catch!
I've hit this error from __btrfs_free_extent() a handful of times over the
years, but never managed to reproduce it on demand or figure out it was
related to snapshot deletion.

> ---
>  fs/btrfs/extent-tree.c | 2 ++
>  1 file changed, 2 insertions(+)
>
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index dcb699dd57f3..965702034b22 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -9330,6 +9330,8 @@ int btrfs_drop_snapshot(struct btrfs_root *root,
> goto out_free;
> }
>
> +   btrfs_run_delayed_items(trans);
> +
> if (block_rsv)
> trans->block_rsv = block_rsv;
>
> --
> 2.14.3
>


-- 
Filipe David Manana,

“Whether you think you can, or you think you can't — you're right.”


Re: [PATCH v2 1/3] btrfs: scrub: maintain the unlock order in scrub thread

2018-11-29 Thread Filipe Manana
On Thu, Nov 29, 2018 at 9:27 AM Anand Jain  wrote:
>
> The device_list_mutex and scrub_lock create nested locks in
> btrfs_scrub_dev().
>
> During lock the order is device_list_mutex and then scrub_lock, while during
> unlock the order is also device_list_mutex and then scrub_lock.
> Fix this by changing the lock order to scrub_lock and then device_list_mutex,
> so the unlock order becomes the reverse of the lock order.
>
> Signed-off-by: Anand Jain 
> ---
> v1->v2: change the lock acquire order to scrub_lock first and then
> device_list_mutex, which matches the order of unlock.
> The extra lines which are now inside the scrub_lock critical section are
> fine to be under the scrub_lock.

I don't get it.
What problem does this patch fix?
It doesn't seem to be a functional fix to me, nor a performance gain (on the
contrary, the scrub_lock is now held for a longer time than needed), nor does
it make anything more readable or "beautiful".

>  fs/btrfs/scrub.c | 13 +++--
>  1 file changed, 7 insertions(+), 6 deletions(-)
>
> diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
> index 902819d3cf41..a9d6fc3b01d4 100644
> --- a/fs/btrfs/scrub.c
> +++ b/fs/btrfs/scrub.c
> @@ -3813,28 +3813,29 @@ int btrfs_scrub_dev(struct btrfs_fs_info *fs_info, u64 devid, u64 start,
>                 return -EINVAL;
>         }
>
> -
> +       mutex_lock(&fs_info->scrub_lock);
>         mutex_lock(&fs_info->fs_devices->device_list_mutex);
>         dev = btrfs_find_device(fs_info, devid, NULL, NULL);
>         if (!dev || (test_bit(BTRFS_DEV_STATE_MISSING, &dev->dev_state) &&
>                      !is_dev_replace)) {
>                 mutex_unlock(&fs_info->fs_devices->device_list_mutex);
> +               mutex_unlock(&fs_info->scrub_lock);
>                 return -ENODEV;
>         }
>
>         if (!is_dev_replace && !readonly &&
>             !test_bit(BTRFS_DEV_STATE_WRITEABLE, &dev->dev_state)) {
>                 mutex_unlock(&fs_info->fs_devices->device_list_mutex);
> +               mutex_unlock(&fs_info->scrub_lock);
>                 btrfs_err_in_rcu(fs_info, "scrub: device %s is not writable",
>                                  rcu_str_deref(dev->name));
>                 return -EROFS;
>         }
>
> -       mutex_lock(&fs_info->scrub_lock);
>         if (!test_bit(BTRFS_DEV_STATE_IN_FS_METADATA, &dev->dev_state) ||
>             test_bit(BTRFS_DEV_STATE_REPLACE_TGT, &dev->dev_state)) {
> -               mutex_unlock(&fs_info->scrub_lock);
>                 mutex_unlock(&fs_info->fs_devices->device_list_mutex);
> +               mutex_unlock(&fs_info->scrub_lock);
>                 return -EIO;
>         }
>
> @@ -3843,23 +3844,23 @@ int btrfs_scrub_dev(struct btrfs_fs_info *fs_info, u64 devid, u64 start,
>             (!is_dev_replace &&
>              btrfs_dev_replace_is_ongoing(&fs_info->dev_replace))) {
>                 btrfs_dev_replace_read_unlock(&fs_info->dev_replace);
> -               mutex_unlock(&fs_info->scrub_lock);
>                 mutex_unlock(&fs_info->fs_devices->device_list_mutex);
> +               mutex_unlock(&fs_info->scrub_lock);
>                 return -EINPROGRESS;
>         }
>         btrfs_dev_replace_read_unlock(&fs_info->dev_replace);
>
>         ret = scrub_workers_get(fs_info, is_dev_replace);
>         if (ret) {
> -               mutex_unlock(&fs_info->scrub_lock);
>                 mutex_unlock(&fs_info->fs_devices->device_list_mutex);
> +               mutex_unlock(&fs_info->scrub_lock);
>                 return ret;
>         }
>
>         sctx = scrub_setup_ctx(dev, is_dev_replace);
>         if (IS_ERR(sctx)) {
> -               mutex_unlock(&fs_info->scrub_lock);
>                 mutex_unlock(&fs_info->fs_devices->device_list_mutex);
> +               mutex_unlock(&fs_info->scrub_lock);
>                 scrub_workers_put(fs_info);
>                 return PTR_ERR(sctx);
>         }
> --
> 1.8.3.1



-- 
Filipe David Manana,

“Whether you think you can, or you think you can't — you're right.”


Re: [PATCH] btrfs: skip file_extent generation check for free_space_inode in run_delalloc_nocow

2018-11-29 Thread Filipe Manana
On Thu, Nov 29, 2018 at 9:32 AM Lu Fengqi  wrote:
>
> The btrfs/001 with inode_cache mount option will encounter the following
> warning:

"The test case btrfs/001 ..."

>
> WARNING: CPU: 1 PID: 23700 at fs/btrfs/inode.c:956 cow_file_range.isra.19+0x32b/0x430 [btrfs]
> CPU: 1 PID: 23700 Comm: btrfs Kdump: loaded Tainted: GW  O  4.20.0-rc4-custom+ #30
> Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
> RIP: 0010:cow_file_range.isra.19+0x32b/0x430 [btrfs]
> Call Trace:
>  ? free_extent_buffer+0x46/0x90 [btrfs]
>  run_delalloc_nocow+0x455/0x900 [btrfs]
>  btrfs_run_delalloc_range+0x1a7/0x360 [btrfs]
>  writepage_delalloc+0xf9/0x150 [btrfs]
>  __extent_writepage+0x125/0x3e0 [btrfs]
>  extent_write_cache_pages+0x1b6/0x3e0 [btrfs]
>  ? __wake_up_common_lock+0x63/0xc0
>  extent_writepages+0x50/0x80 [btrfs]
>  do_writepages+0x41/0xd0
>  ? __filemap_fdatawrite_range+0x9e/0xf0
>  __filemap_fdatawrite_range+0xbe/0xf0
>  btrfs_fdatawrite_range+0x1b/0x50 [btrfs]
>  __btrfs_write_out_cache+0x42c/0x480 [btrfs]
>  btrfs_write_out_ino_cache+0x84/0xd0 [btrfs]
>  btrfs_save_ino_cache+0x551/0x660 [btrfs]
>  commit_fs_roots+0xc5/0x190 [btrfs]
>  btrfs_commit_transaction+0x2bf/0x8d0 [btrfs]
>  btrfs_mksubvol+0x48d/0x4d0 [btrfs]
>  btrfs_ioctl_snap_create_transid+0x170/0x180 [btrfs]
>  btrfs_ioctl_snap_create_v2+0x124/0x180 [btrfs]
>  btrfs_ioctl+0x123f/0x3030 [btrfs]
>
> The file extent generation of the free space inode is equal to the last
> snapshot of the file root, so the inode will be passed to cow_file_range.
> But the inode was created and its extents were preallocated in
> btrfs_save_ino_cache; there are no cow copies on disk.
>
> The preallocated extents don't present on disk, and the

The preallocated extent is not yet in the extent tree, and the ...
(singular, it's only used for each space cache)

> btrfs_cross_ref_exist will ignore the -ENOENT returned by
> check_committed_ref, so we can directly write the inode to the disk.
>
> Fixes: 78d4295b1eee ("btrfs: lift some btrfs_cross_ref_exist checks in nocow path")
> Signed-off-by: Lu Fengqi 

The code changes look good to me.

Reviewed-by: Filipe Manana 

Thanks.

> ---
>  fs/btrfs/inode.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index d54bdef16d8d..9c5e9629eb6c 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -1369,7 +1369,8 @@ static noinline int run_delalloc_nocow(struct inode 
> *inode,
>  * Do the same check as in btrfs_cross_ref_exist but
>  * without the unnecessary search.
>  */
> -   if (btrfs_file_extent_generation(leaf, fi) <=
> +   if (!nolock &&
> +   btrfs_file_extent_generation(leaf, fi) <=
> >         btrfs_root_last_snapshot(&root->root_item))
> goto out_check;
> if (extent_type == BTRFS_FILE_EXTENT_REG && !force)
> --
> 2.19.2
>
>
>


-- 
Filipe David Manana,

“Whether you think you can, or you think you can't — you're right.”


Re: [PATCH 2/3] btrfs: wakeup cleaner thread when adding delayed iput

2018-11-28 Thread Filipe Manana
On Wed, Nov 28, 2018 at 7:09 PM David Sterba  wrote:
>
> On Tue, Nov 27, 2018 at 03:08:08PM -0500, Josef Bacik wrote:
> > On Tue, Nov 27, 2018 at 07:59:42PM +, Chris Mason wrote:
> > > On 27 Nov 2018, at 14:54, Josef Bacik wrote:
> > >
> > > > On Tue, Nov 27, 2018 at 10:26:15AM +0200, Nikolay Borisov wrote:
> > > >>
> > > >>
> > > >> On 21.11.18 г. 21:09 ч., Josef Bacik wrote:
> > > > >>> The cleaner thread usually takes care of delayed iputs, with the
> > > > >>> exception of the btrfs_end_transaction_throttle path.  The cleaner
> > > > >>> thread only gets woken up every 30 seconds, so instead wake it up
> > > > >>> to do its work so that we can free up that space as quickly as
> > > > >>> possible.
> > > > >>
> > > > >> Have you done any measurements of how this affects the overall system?
> > > > >> I suspect this introduces a lot of noise since now we are going to be
> > > > >> doing a thread wakeup on every iput. Does this give a chance to have
> > > > >> nice, large batches of iputs that the cost of wake up can be amortized
> > > > >> across?
> > > >
> > > > > I ran the whole patchset with our A/B testing stuff and the patchset
> > > > > was a 5% win overall, so I'm inclined to think it's fine.  Thanks,
> > >
> > > It's a good point though, a delayed wakeup may be less overhead.
> >
> > > Sure, but how do we go about that without it sometimes messing up?  In
> > > practice the only time we're doing this is at the end of
> > > finish_ordered_io, so likely to not be a constant stream.  I suppose
> > > since we have places where we force it to run that we don't really need
> > > this.  IDK, I'm fine with dropping it.  Thanks,
>
> The transaction thread wakes up cleaner periodically (commit interval,
> 30s by default), so the time to process iputs is not unbounded.
>
> I have the same concerns as Nikolay, coupling the wakeup to all delayed
> iputs could result in smaller batches. But some of the callers of
> btrfs_add_delayed_iput might benefit from the extra wakeup, like
> btrfs_remove_block_group, so I don't want to leave the idea yet.

I'm curious, why do you think it would benefit btrfs_remove_block_group()?



-- 
Filipe David Manana,

“Whether you think you can, or you think you can't — you're right.”


Re: [PATCH v4] Btrfs: fix deadlock with memory reclaim during scrub

2018-11-28 Thread Filipe Manana
On Wed, Nov 28, 2018 at 2:22 PM David Sterba  wrote:
>
> On Mon, Nov 26, 2018 at 08:10:30PM +, Filipe Manana wrote:
> > On Mon, Nov 26, 2018 at 6:17 PM David Sterba  wrote:
> > >
> > > On Fri, Nov 23, 2018 at 06:25:40PM +, fdman...@kernel.org wrote:
> > > > From: Filipe Manana 
> > > >
> > > > When a transaction commit starts, it attempts to pause scrub and it
> > > > blocks until the scrub is paused. So while the transaction is blocked
> > > > waiting for scrub to pause, we can not do memory allocation with
> > > > GFP_KERNEL from scrub, otherwise we risk getting into a deadlock with
> > > > reclaim.
> > > >
> > > > Checking for scrub pause requests is done early at the beginning of the
> > > > while loop of scrub_stripe() and later in the loop, scrub_extent() and
> > > > scrub_raid56_parity() are called, which in turn call scrub_pages() and
> > > > scrub_pages_for_parity() respectively. These last two functions do
> > > > memory allocations using GFP_KERNEL. Same problem could happen while
> > > > scrubbing the super blocks, since it calls scrub_pages().
> > > >
> > > > So make sure GFP_NOFS is used for the memory allocations because at any
> > > > time a scrub pause request can happen from another task that started to
> > > > commit a transaction.
> > > >
> > > > Fixes: 58c4e173847a ("btrfs: scrub: use GFP_KERNEL on the submission path")
> > > > Signed-off-by: Filipe Manana 
> > > > ---
> > > >
> > > > V2: Make using GFP_NOFS unconditional. Previous version was racy, as
> > > > pausing requests might happen just after we checked for them.
> > > >
> > > > V3: Use memalloc_nofs_save() just like V1 did.
> > > >
> > > > V4: Similar problem happened for raid56, which was previously missed, so
> > > > deal with it as well as the case for scrub_supers().
> > >
> > > Enclosing the whole scrub to 'nofs' seems like the best option and
> > > future proof. What I missed in 58c4e173847a was the "don't hold big lock
> > > under GFP_KERNEL allocation" pattern.
> > >
> > > >  fs/btrfs/scrub.c | 12 
> > > >  1 file changed, 12 insertions(+)
> > > >
> > > > diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
> > > > index 3be1456b5116..e08b7502d1f0 100644
> > > > --- a/fs/btrfs/scrub.c
> > > > +++ b/fs/btrfs/scrub.c
> > > > @@ -3779,6 +3779,7 @@ int btrfs_scrub_dev(struct btrfs_fs_info *fs_info, u64 devid, u64 start,
> > > >   struct scrub_ctx *sctx;
> > > >   int ret;
> > > >   struct btrfs_device *dev;
> > > > + unsigned int nofs_flag;
> > > >
> > > >   if (btrfs_fs_closing(fs_info))
> > > >   return -EINVAL;
> > > > @@ -3882,6 +3883,16 @@ int btrfs_scrub_dev(struct btrfs_fs_info *fs_info, u64 devid, u64 start,
> > > >   atomic_inc(&fs_info->scrubs_running);
> > > >   mutex_unlock(&fs_info->scrub_lock);
> > > >
> > > > + /*
> > > > +  * In order to avoid deadlock with reclaim when there is a transaction
> > > > +  * trying to pause scrub, make sure we use GFP_NOFS for all the
> > > > +  * allocations done at btrfs_scrub_pages() and scrub_pages_for_parity()
> > > > +  * invoked by our callees. The pausing request is done when the
> > > > +  * transaction commit starts, and it blocks the transaction until scrub
> > > > +  * is paused (done at specific points at scrub_stripe() or right above
> > > > +  * before incrementing fs_info->scrubs_running).
> > >
> > > This highlights why there's perhaps no point in trying to make the nofs
> > > section smaller, handling all the interactions between scrub and
> > > transaction would be too complex.
> > >
> > > Reviewed-by: David Sterba 
> >
> > Well, the worker tasks can also not use gfp_kernel, since the scrub
> > task waits for them to complete before pausing.
> > I missed this, and 2 reviewers as well, so perhaps it wasn't that
> > trivial and I shouldn't feel that I miserably failed to identify all
>

Re: [RFC PATCH] btrfs: drop file privileges in btrfs_clone_files

2018-11-28 Thread Filipe Manana
On Wed, Nov 28, 2018 at 9:26 AM Lu Fengqi  wrote:
>
> On Wed, Nov 28, 2018 at 09:48:07AM +0200, Nikolay Borisov wrote:
> >
> >
> >On 28.11.18 г. 9:46 ч., Christoph Hellwig wrote:
> >> On Wed, Nov 28, 2018 at 09:44:59AM +0200, Nikolay Borisov wrote:
> >>>
> >>>
> >>> On 28.11.18 г. 5:07 ч., Lu Fengqi wrote:
>  The generic/513 tells that cloning into a file did not strip security
>  privileges (suid, capabilities) like a regular write would.
> 
>  Signed-off-by: Lu Fengqi 
>  ---
>  The xfs and ocfs2 call generic_remap_file_range_prep to drop file
>  privileges, I'm not sure whether btrfs should do the same thing.
> >>>
> >>> Why do you think btrfs shouldn't do the same thing. Looking at
>
> I'm not sure btrfs doesn't use generic check intentionally for some reason.
>
> >>> remap_file_range_prep it seems that btrfs is missing a ton of checks
> >>> that are useful i.e immutable files/aligned offsets etc.
>
> It is indeed.
>
> In addition, generic_remap_file_range_prep will invoke inode_dio_wait and
> filemap_write_and_wait_range for the source and destination inode/range.
> For the dedupe case, it will call vfs_dedupe_file_range_compare.
>
> I still can't judge whether these operations are welcome by btrfs. I
> will go deep into the code.
>
> >>
> >> Any chance we could move btrfs over to use remap_file_range_prep so that
> >> all file systems share the exact same checks?
>
> In theory we can call generic_remap_file_range_prep in
> btrfs_remap_file_range, which gives us the opportunity to clean up the
> duplicate check code in btrfs_extent_same and btrfs_clone_files.
>
> >
>I'm not very familiar with this, Filipe is more familiar so adding to CC.
> >But IMO we should do that provided there are no blockers.
> >
> >Filipe, what do you think, is it feasible?
>
> I'm all ears for the suggestions.

There's no reason why it shouldn't be possible to have them called in
btrfs as well.
There's quite a few changes in vfs and generic functions introduced in
4.20 due to reflink/dedupe bugs, probably either at the time,
or when cloning/dedupe stopped being btrfs specific, someone forgot to
make btrfs use those generic vfs helpers.
I'll take a look as well.

>
> --
> Thanks,
> Lu
>
>


-- 
Filipe David Manana,

“Whether you think you can, or you think you can't — you're right.”


Re: [PATCH] btrfs: only run delayed refs if we're committing

2018-11-27 Thread Filipe Manana
On Tue, Nov 27, 2018 at 7:22 PM Josef Bacik  wrote:
>
> On Fri, Nov 23, 2018 at 04:59:32PM +, Filipe Manana wrote:
> > On Thu, Nov 22, 2018 at 12:35 AM Josef Bacik  wrote:
> > >
> > > I noticed in a giant dbench run that we spent a lot of time on lock
> > > contention while running transaction commit.  This is because dbench
> > > results in a lot of fsync()'s that do a btrfs_transaction_commit(), and
> > > they all run the delayed refs first thing, so they all contend with
> > > each other.  This leads to seconds of 0 throughput.  Change this to only
> > > run the delayed refs if we're the ones committing the transaction.  This
> > > makes the latency go away and we get no more lock contention.
> >
> > Can you share the following in the changelog please?
> >
> > 1) How did you run dbench (parameters, config).
> >
> > 2) What results did you get before and after this change. So that we all get
> > an idea of how good the impact is.
> >
> > While the reduced contention makes all sense and seems valid, I'm not
> > sure this is always a win.
> > It certainly is when multiple tasks are calling
> > btrfs_commit_transaction() simultaneously, but,
> > what about when only one does it?
> >
> > By running all delayed references inside the critical section of the
> > transaction commit
> > (state == TRANS_STATE_COMMIT_START), instead of running most of them
> > outside/before,
> > we will be blocking for a longer time other tasks calling
> > btrfs_start_transaction() (which is used
> > a lot - creating files, unlinking files, adding links, etc, and even fsync).
> >
> > Won't there be any other types of workload and tests other than dbench
> > that can get increased
> > latency and/or smaller throughput?
> >
> > I find that sort of information useful to have in the changelog. If
> > you verified that or you think
> > it's irrelevant to measure/consider, it would be great to have it
> > mentioned in the changelog
> > (and explained).
> >
>
> Yeah I thought about the delayed refs being run in the critical section now,
> that's not awesome.  I'll drop this for now, I think just having a mutex
> around running delayed refs will be good enough, since we want people who
> care about flushing delayed refs to wait around for that to finish
> happening.  Thanks,

Well, I think we can have a solution that doesn't bring such a trade-off
nor introduce a mutex.
We could do what is currently done for writing space caches: make sure only
the first task calling commit transaction does the work and all others do
nothing except wait for the commit to finish:

btrfs_commit_transaction()
   if (!test_and_set_bit(BTRFS_TRANS_COMMIT_START, &cur_trans->flags)) {
           /* run delayed refs here, before entering the critical section */
           ret = btrfs_run_delayed_refs(trans, 0);
   }

thanks

>
> Josef



-- 
Filipe David Manana,

“Whether you think you can, or you think you can't — you're right.”


Re: [PATCH v4] Btrfs: fix deadlock with memory reclaim during scrub

2018-11-26 Thread Filipe Manana
On Mon, Nov 26, 2018 at 6:17 PM David Sterba  wrote:
>
> On Fri, Nov 23, 2018 at 06:25:40PM +, fdman...@kernel.org wrote:
> > From: Filipe Manana 
> >
> > When a transaction commit starts, it attempts to pause scrub and it blocks
> > until the scrub is paused. So while the transaction is blocked waiting for
> > scrub to pause, we can not do memory allocation with GFP_KERNEL from scrub,
> > otherwise we risk getting into a deadlock with reclaim.
> >
> > Checking for scrub pause requests is done early at the beginning of the
> > while loop of scrub_stripe() and later in the loop, scrub_extent() and
> > scrub_raid56_parity() are called, which in turn call scrub_pages() and
> > scrub_pages_for_parity() respectively. These last two functions do memory
> > allocations using GFP_KERNEL. Same problem could happen while scrubbing
> > the super blocks, since it calls scrub_pages().
> >
> > So make sure GFP_NOFS is used for the memory allocations because at any
> > time a scrub pause request can happen from another task that started to
> > commit a transaction.
> >
> > Fixes: 58c4e173847a ("btrfs: scrub: use GFP_KERNEL on the submission path")
> > Signed-off-by: Filipe Manana 
> > ---
> >
> > V2: Make using GFP_NOFS unconditional. Previous version was racy, as
> > pausing requests might happen just after we checked for them.
> >
> > V3: Use memalloc_nofs_save() just like V1 did.
> >
> > V4: Similar problem happened for raid56, which was previously missed, so
> > deal with it as well as the case for scrub_supers().
>
> Enclosing the whole scrub to 'nofs' seems like the best option and
> future proof. What I missed in 58c4e173847a was the "don't hold big lock
> under GFP_KERNEL allocation" pattern.
>
> >  fs/btrfs/scrub.c | 12 
> >  1 file changed, 12 insertions(+)
> >
> > diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
> > index 3be1456b5116..e08b7502d1f0 100644
> > --- a/fs/btrfs/scrub.c
> > +++ b/fs/btrfs/scrub.c
> > @@ -3779,6 +3779,7 @@ int btrfs_scrub_dev(struct btrfs_fs_info *fs_info, u64 devid, u64 start,
> >   struct scrub_ctx *sctx;
> >   int ret;
> >   struct btrfs_device *dev;
> > + unsigned int nofs_flag;
> >
> >   if (btrfs_fs_closing(fs_info))
> >   return -EINVAL;
> > @@ -3882,6 +3883,16 @@ int btrfs_scrub_dev(struct btrfs_fs_info *fs_info, u64 devid, u64 start,
> >   atomic_inc(&fs_info->scrubs_running);
> >   mutex_unlock(&fs_info->scrub_lock);
> >
> > + /*
> > +  * In order to avoid deadlock with reclaim when there is a transaction
> > +  * trying to pause scrub, make sure we use GFP_NOFS for all the
> > +  * allocations done at btrfs_scrub_pages() and scrub_pages_for_parity()
> > +  * invoked by our callees. The pausing request is done when the
> > +  * transaction commit starts, and it blocks the transaction until scrub
> > +  * is paused (done at specific points at scrub_stripe() or right above
> > +  * before incrementing fs_info->scrubs_running).
>
> This highlights why there's perhaps no point in trying to make the nofs
> section smaller, handling all the interactions between scrub and
> transaction would be too complex.
>
> Reviewed-by: David Sterba 

Well, the worker tasks can also not use gfp_kernel, since the scrub
task waits for them to complete before pausing.
I missed this, and 2 reviewers as well, so perhaps it wasn't that
trivial and I shouldn't feel that I miserably failed to identify all
cases for something rather trivial. V5 sent.
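(For reference, the mechanism used by this patch is the scoped NOFS API; a
minimal sketch of the pattern, with the actual scrub work elided and
nofs_flag being the variable the patch adds:)

    nofs_flag = memalloc_nofs_save();
    /*
     * Allocations done between save and restore by this task, e.g. in
     * scrub_pages() and scrub_pages_for_parity(), implicitly behave as
     * GFP_NOFS, so they cannot recurse into the filesystem via reclaim
     * while a transaction commit waits for scrub to pause.  As noted
     * above, worker tasks are separate and need their own handling.
     */
    memalloc_nofs_restore(nofs_flag);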


Re: [PATCH] Btrfs: bring back key search optimization to btrfs_search_old_slot()

2018-11-26 Thread Filipe Manana
On Fri, Nov 16, 2018 at 11:09 AM  wrote:
>
> From: Filipe Manana 
>
> Commit d7396f07358a ("Btrfs: optimize key searches in btrfs_search_slot"),
> dated from August 2013, introduced an optimization to search for keys in a
> node/leaf to both btrfs_search_slot() and btrfs_search_old_slot(). For the
> later, it ended up being reverted in commit d4b4087c43cc ("Btrfs: do a
> full search everytime in btrfs_search_old_slot"), from September 2013,
> because the content of extent buffers were often inconsistent during
> replay. It turned out that the reason why they were often inconsistent was
> because the extent buffer replay stopped being done atomically, and got
> broken after commit c8cc63416537 ("Btrfs: stop using GFP_ATOMIC for the
> tree mod log allocations"), introduced in July 2013. The extent buffer
> replay issue was then found and fixed by commit 5de865eebb83 ("Btrfs: fix
> tree mod logging"), dated from December 2013.
>
> So bring back the optimization to btrfs_search_old_slot(), as skipping it
> and its comment about disabling it are confusing. After all, if unwinding
> extent buffers resulted in some inconsistency, the normal searches (binary
> searches) would also not always work.
>
> Signed-off-by: Filipe Manana 

David, please remove this change from the integration branch.

It turns out that after 3 weeks of stress tests it finally triggered an
assertion failure (hard to hit), and it's indeed not reliable to use the
search optimization because of how the mod log tree currently works.
The idea was just to not make it different from btrfs_search_slot().
Use of the mod log tree is limited to some cases where an occasionally
faster search wouldn't bring much benefit.

Thanks.

> ---
>  fs/btrfs/ctree.c | 8 ++--
>  1 file changed, 2 insertions(+), 6 deletions(-)
>
> diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
> index 089b46c4d97f..cf5487a79c96 100644
> --- a/fs/btrfs/ctree.c
> +++ b/fs/btrfs/ctree.c
> @@ -2966,7 +2966,7 @@ int btrfs_search_old_slot(struct btrfs_root *root, const struct btrfs_key *key,
> int level;
> int lowest_unlock = 1;
> u8 lowest_level = 0;
> -   int prev_cmp = -1;
> +   int prev_cmp;
>
> lowest_level = p->lowest_level;
> WARN_ON(p->nodes[0] != NULL);
> @@ -2977,6 +2977,7 @@ int btrfs_search_old_slot(struct btrfs_root *root, const struct btrfs_key *key,
> }
>
>  again:
> +   prev_cmp = -1;
> b = get_old_root(root, time_seq);
> level = btrfs_header_level(b);
> p->locks[level] = BTRFS_READ_LOCK;
> @@ -2994,11 +2995,6 @@ int btrfs_search_old_slot(struct btrfs_root *root, const struct btrfs_key *key,
>  */
> btrfs_unlock_up_safe(p, level + 1);
>
> -   /*
> -* Since we can unwind ebs we want to do a real search every
> -* time.
> -*/
> -   prev_cmp = -1;
> ret = key_search(b, key, level, &prev_cmp, &slot);
>
> if (level != 0) {
> --
> 2.11.0
>


Re: [PATCH] btrfs: only run delayed refs if we're committing

2018-11-23 Thread Filipe Manana
On Thu, Nov 22, 2018 at 12:35 AM Josef Bacik  wrote:
>
> I noticed in a giant dbench run that we spent a lot of time on lock
> contention while running transaction commit.  This is because dbench
> results in a lot of fsync()'s that do a btrfs_transaction_commit(), and
> they all run the delayed refs first thing, so they all contend with
> each other.  This leads to seconds of 0 throughput.  Change this to only
> run the delayed refs if we're the ones committing the transaction.  This
> makes the latency go away and we get no more lock contention.

Can you share the following in the changelog please?

1) How did you run dbench (parameters, config).

2) What results did you get before and after this change. So that we all get
an idea of how good the impact is.

While the reduced contention makes all sense and seems valid, I'm not
sure this is always a win.
It certainly is when multiple tasks are calling
btrfs_commit_transaction() simultaneously, but,
what about when only one does it?

By running all delayed references inside the critical section of the
transaction commit
(state == TRANS_STATE_COMMIT_START), instead of running most of them
outside/before,
we will be blocking for a longer time other tasks calling
btrfs_start_transaction() (which is used
a lot - creating files, unlinking files, adding links, etc, and even fsync).

Won't there be any other types of workload and tests other than dbench
that can get increased
latency and/or smaller throughput?

I find that sort of information useful to have in the changelog. If
you verified that or you think
it's irrelevant to measure/consider, it would be great to have it
mentioned in the changelog
(and explained).

Thanks.

>
> Reviewed-by: Omar Sandoval 
> Signed-off-by: Josef Bacik 
> ---
>  fs/btrfs/transaction.c | 24 +---
>  1 file changed, 9 insertions(+), 15 deletions(-)
>
> diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
> index 3c1be9db897c..41cc96cc59a3 100644
> --- a/fs/btrfs/transaction.c
> +++ b/fs/btrfs/transaction.c
> @@ -1918,15 +1918,6 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans)
> btrfs_trans_release_metadata(trans);
> trans->block_rsv = NULL;
>
> -   /* make a pass through all the delayed refs we have so far
> -* any runnings procs may add more while we are here
> -*/
> -   ret = btrfs_run_delayed_refs(trans, 0);
> -   if (ret) {
> -   btrfs_end_transaction(trans);
> -   return ret;
> -   }
> -
> cur_trans = trans->transaction;
>
> /*
> @@ -1938,12 +1929,6 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans)
>
> btrfs_create_pending_block_groups(trans);
>
> -   ret = btrfs_run_delayed_refs(trans, 0);
> -   if (ret) {
> -   btrfs_end_transaction(trans);
> -   return ret;
> -   }
> -
> if (!test_bit(BTRFS_TRANS_DIRTY_BG_RUN, &cur_trans->flags)) {
> int run_it = 0;
>
> @@ -2014,6 +1999,15 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans)
> spin_unlock(&fs_info->trans_lock);
> }
>
> +   /*
> +* We are now the only one in the commit area, we can run delayed refs
> +* without hitting a bunch of lock contention from a lot of people
> +* trying to commit the transaction at once.
> +*/
> +   ret = btrfs_run_delayed_refs(trans, 0);
> +   if (ret)
> +   goto cleanup_transaction;
> +
> extwriter_counter_dec(cur_trans, trans->type);
>
> ret = btrfs_start_delalloc_flush(fs_info);
> --
> 2.14.3
>


-- 
Filipe David Manana,

“Whether you think you can, or you think you can't — you're right.”


Re: [PATCH v2] Btrfs: fix deadlock when enabling quotas due to concurrent snapshot creation

2018-11-19 Thread Filipe Manana
On Mon, Nov 19, 2018 at 2:48 PM Qu Wenruo  wrote:
>
>
>
> On 2018/11/19 下午10:15, fdman...@kernel.org wrote:
> > From: Filipe Manana 
> >
> > If the quota enable and snapshot creation ioctls are called concurrently
> > we can get into a deadlock where the task enabling quotas will deadlock
> > on the fs_info->qgroup_ioctl_lock mutex because it attempts to lock it
> > twice, or the task creating a snapshot tries to commit the transaction
> > while the task enabling quota waits for the former task to commit the
> > transaction while holding the mutex. The following time diagrams show how
> > both cases happen.
> >
> > First scenario:
> >
> >CPU 0CPU 1
> >
> >  btrfs_ioctl()
> >   btrfs_ioctl_quota_ctl()
> >btrfs_quota_enable()
> > mutex_lock(fs_info->qgroup_ioctl_lock)
> > btrfs_start_transaction()
> >
> >  btrfs_ioctl()
> >   btrfs_ioctl_snap_create_v2
> >create_snapshot()
> > --> adds snapshot to the
> > list pending_snapshots
> > of the current
> > transaction
> >
> > btrfs_commit_transaction()
> >  create_pending_snapshots()
> >create_pending_snapshot()
> > qgroup_account_snapshot()
> >  btrfs_qgroup_inherit()
> >  mutex_lock(fs_info->qgroup_ioctl_lock)
> >   --> deadlock, mutex already locked
> >   by this task at
> >   btrfs_quota_enable()
> >
> > Second scenario:
> >
> >CPU 0CPU 1
> >
> >  btrfs_ioctl()
> >   btrfs_ioctl_quota_ctl()
> >btrfs_quota_enable()
> > mutex_lock(fs_info->qgroup_ioctl_lock)
> > btrfs_start_transaction()
> >
> >  btrfs_ioctl()
> >   btrfs_ioctl_snap_create_v2
> >create_snapshot()
> > --> adds snapshot to the
> > list pending_snapshots
> > of the current
> > transaction
> >
> > btrfs_commit_transaction()
> >  --> waits for task at
> >  CPU 0 to release
> >  its transaction
> >  handle
> >
> > btrfs_commit_transaction()
> >  --> sees another task started
> >  the transaction commit first
> >  --> releases its transaction
> >  handle
> >  --> waits for the transaction
> >  commit to be completed by
> >  the task at CPU 1
> >
> >  create_pending_snapshot()
> >   qgroup_account_snapshot()
> >    btrfs_qgroup_inherit()
> >     mutex_lock(fs_info->qgroup_ioctl_lock)
> >      --> deadlock, task at CPU 0
> >          has the mutex locked but
> >          it is waiting for us to
> >          finish the transaction
> >          commit
> >
> > So fix this by setting the quota enabled flag in fs_info after committing
> > the transaction at btrfs_quota_enable(). This ends up serializing quota
> > enable and snapshot creation as if the snapshot creation happened just
> > before the quota enable request. The quota rescan task, scheduled after
> > committing the transaction in btrfs_quote_enable(), will do the accounting.
> >
> > Fixes: 6426c7ad697d ("btrfs: qgroup: Fix qgroup accounti
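(A rough sketch of the fix described above — mark quotas as enabled only
after the transaction used by quota enable is committed; simplified, and the
error label is illustrative rather than taken from the actual patch:)

    ret = btrfs_commit_transaction(trans);
    if (ret)
            goto out;
    /*
     * Only now flag quotas as enabled.  A snapshot creation that raced
     * with quota enable either fully completed before this point, in
     * which case the rescan scheduled afterwards does the accounting,
     * or it sees quotas enabled and accounts for itself.
     */
    set_bit(BTRFS_FS_QUOTA_ENABLED, &fs_info->flags);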

Re: [PATCH] Btrfs: fix deadlock when enabling quotas due to concurrent snapshot creation

2018-11-19 Thread Filipe Manana
On Mon, Nov 19, 2018 at 11:52 AM Filipe Manana  wrote:
>
> On Mon, Nov 19, 2018 at 11:35 AM Qu Wenruo  wrote:
> >
> >
> >
> > On 2018/11/19 下午7:13, Filipe Manana wrote:
> > > On Mon, Nov 19, 2018 at 11:09 AM Qu Wenruo  wrote:
> > >>
> > >>
> > >>
> > >> On 2018/11/19 下午5:48, fdman...@kernel.org wrote:
> > >>> From: Filipe Manana 
> > >>>
> > >>> If the quota enable and snapshot creation ioctls are called concurrently
> > >>> we can get into a deadlock where the task enabling quotas will deadlock
> > >>> on the fs_info->qgroup_ioctl_lock mutex because it attempts to lock it
> > >>> twice. The following time diagram shows how this happens.
> > >>>
> > >>>CPU 0CPU 1
> > >>>
> > >>>  btrfs_ioctl()
> > >>>   btrfs_ioctl_quota_ctl()
> > >>>btrfs_quota_enable()
> > >>> mutex_lock(fs_info->qgroup_ioctl_lock)
> > >>> btrfs_start_transaction()
> > >>>
> > >>>  btrfs_ioctl()
> > >>>   btrfs_ioctl_snap_create_v2
> > >>>create_snapshot()
> > >>> --> adds snapshot to the
> > >>> list 
> > >>> pending_snapshots
> > >>> of the current
> > >>> transaction
> > >>>
> > >>> btrfs_commit_transaction()
> > >>>  create_pending_snapshots()
> > >>>create_pending_snapshot()
> > >>> qgroup_account_snapshot()
> > >>>  btrfs_qgroup_inherit()
> > >>>  mutex_lock(fs_info->qgroup_ioctl_lock)
> > >>>   --> deadlock, mutex already locked
> > >>>   by this task at
> > >>>   btrfs_quota_enable()
> > >>
> > >> The backtrace looks valid.
> > >>
> > >>>
> > >>> So fix this by adding a flag to the transaction handle that signals if 
> > >>> the
> > >>> transaction is being used for enabling quotas (only seen by the task 
> > >>> doing
> > >>> it) and do not lock the mutex qgroup_ioctl_lock at 
> > >>> btrfs_qgroup_inherit()
> > >>> if the transaction handle corresponds to the one being used to enable 
> > >>> the
> > >>> quotas.
> > >>>
> > >>> Fixes: 6426c7ad697d ("btrfs: qgroup: Fix qgroup accounting when creating snapshot")
> > >>> Signed-off-by: Filipe Manana 
> > >>> ---
> > >>>  fs/btrfs/qgroup.c  | 10 --
> > >>>  fs/btrfs/transaction.h |  1 +
> > >>>  2 files changed, 9 insertions(+), 2 deletions(-)
> > >>>
> > >>> diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
> > >>> index d4917c0cddf5..3aec3bfa3d70 100644
> > >>> --- a/fs/btrfs/qgroup.c
> > >>> +++ b/fs/btrfs/qgroup.c
> > >>> @@ -908,6 +908,7 @@ int btrfs_quota_enable(struct btrfs_fs_info 
> > >>> *fs_info)
> > >>>   trans = NULL;
> > >>>   goto out;
> > >>>   }
> > >>> + trans->enabling_quotas = true;
> > >>
> > >> Should we put enabling_quotas bit into btrfs_transaction instead of
> > >> btrfs_trans_handle?
> > >
> > > Why?
> > > Only the task which is enabling quotas needs to know about it.
> >
> > But it's btrfs_qgroup_inherit() using the trans handle to avoid the
> > deadlock.
> >
> > What makes sure btrfs_qgroup_inherit() gets exactly the same trans
> > handle allocated here?
>
> If it's the other task (the one creating a snapshot) that starts the
> transaction commit,
> it will have to wait for the task enabling quotas to release the
> transaction - once that task
> also calls commit_transaction(), it will skip doing the commit itself
> and wait for the snapshot
> one to finish the commit, while holding the qgroup mutex (this part I
> missed before).
> So yes we'll h

Re: [PATCH] Btrfs: fix deadlock when enabling quotas due to concurrent snapshot creation

2018-11-19 Thread Filipe Manana
On Mon, Nov 19, 2018 at 11:35 AM Qu Wenruo  wrote:
>
>
>
> On 2018/11/19 下午7:13, Filipe Manana wrote:
> > On Mon, Nov 19, 2018 at 11:09 AM Qu Wenruo  wrote:
> >>
> >>
> >>
> >> On 2018/11/19 下午5:48, fdman...@kernel.org wrote:
> >>> From: Filipe Manana 
> >>>
> >>> If the quota enable and snapshot creation ioctls are called concurrently
> >>> we can get into a deadlock where the task enabling quotas will deadlock
> >>> on the fs_info->qgroup_ioctl_lock mutex because it attempts to lock it
> >>> twice. The following time diagram shows how this happens.
> >>>
> >>>CPU 0CPU 1
> >>>
> >>>  btrfs_ioctl()
> >>>   btrfs_ioctl_quota_ctl()
> >>>btrfs_quota_enable()
> >>> mutex_lock(fs_info->qgroup_ioctl_lock)
> >>> btrfs_start_transaction()
> >>>
> >>>  btrfs_ioctl()
> >>>   btrfs_ioctl_snap_create_v2
> >>>create_snapshot()
> >>> --> adds snapshot to the
> >>> list pending_snapshots
> >>> of the current
> >>> transaction
> >>>
> >>> btrfs_commit_transaction()
> >>>  create_pending_snapshots()
> >>>create_pending_snapshot()
> >>> qgroup_account_snapshot()
> >>>  btrfs_qgroup_inherit()
> >>>  mutex_lock(fs_info->qgroup_ioctl_lock)
> >>>   --> deadlock, mutex already locked
> >>>   by this task at
> >>>   btrfs_quota_enable()
> >>
> >> The backtrace looks valid.
> >>
> >>>
> >>> So fix this by adding a flag to the transaction handle that signals if the
> >>> transaction is being used for enabling quotas (only seen by the task doing
> >>> it) and do not lock the mutex qgroup_ioctl_lock at btrfs_qgroup_inherit()
> >>> if the transaction handle corresponds to the one being used to enable the
> >>> quotas.
> >>>
> >>> Fixes: 6426c7ad697d ("btrfs: qgroup: Fix qgroup accounting when creating snapshot")
> >>> Signed-off-by: Filipe Manana 
> >>> ---
> >>>  fs/btrfs/qgroup.c  | 10 --
> >>>  fs/btrfs/transaction.h |  1 +
> >>>  2 files changed, 9 insertions(+), 2 deletions(-)
> >>>
> >>> diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
> >>> index d4917c0cddf5..3aec3bfa3d70 100644
> >>> --- a/fs/btrfs/qgroup.c
> >>> +++ b/fs/btrfs/qgroup.c
> >>> @@ -908,6 +908,7 @@ int btrfs_quota_enable(struct btrfs_fs_info *fs_info)
> >>>   trans = NULL;
> >>>   goto out;
> >>>   }
> >>> + trans->enabling_quotas = true;
> >>
> >> Should we put enabling_quotas bit into btrfs_transaction instead of
> >> btrfs_trans_handle?
> >
> > Why?
> > Only the task which is enabling quotas needs to know about it.
>
> But it's btrfs_qgroup_inherit() using the trans handle to avoid the
> deadlock.
>
> What makes sure btrfs_qgroup_inherit() gets exactly the same trans
> handle allocated here?

If it's the other task (the one creating a snapshot) that starts the
transaction commit,
it will have to wait for the task enabling quotas to release the
transaction - once that task
also calls commit_transaction(), it will skip doing the commit itself
and wait for the snapshot
one to finish the commit, while holding the qgroup mutex (this part I
missed before).
So yes we'll have to use a bit in the transaction itself instead.

>
> >
> >>
> >> Isn't it possible to have different trans handle pointed to the same
> >> transaction?
> >
> > Yes.
> >
> >>
> >> And I'm not really sure about the naming "enabling_quotas".
> >> What about "quota_ioctl_mutex_hold"? (Well, this also sounds awful)
> >
> > Too long.
>
> Anyway, current naming doesn't really show why we could skip mutex
> locking. Just hope to get some name better.

No name will ever show you that.
You'll always have to see where  and how it's us

Re: [PATCH] Btrfs: fix deadlock when enabling quotas due to concurrent snapshot creation

2018-11-19 Thread Filipe Manana
On Mon, Nov 19, 2018 at 11:09 AM Qu Wenruo  wrote:
>
>
>
> On 2018/11/19 下午5:48, fdman...@kernel.org wrote:
> > From: Filipe Manana 
> >
> > If the quota enable and snapshot creation ioctls are called concurrently
> > we can get into a deadlock where the task enabling quotas will deadlock
> > on the fs_info->qgroup_ioctl_lock mutex because it attempts to lock it
> > twice. The following time diagram shows how this happens.
> >
> >CPU 0CPU 1
> >
> >  btrfs_ioctl()
> >   btrfs_ioctl_quota_ctl()
> >btrfs_quota_enable()
> > mutex_lock(fs_info->qgroup_ioctl_lock)
> > btrfs_start_transaction()
> >
> >  btrfs_ioctl()
> >   btrfs_ioctl_snap_create_v2
> >create_snapshot()
> > --> adds snapshot to the
> > list pending_snapshots
> > of the current
> > transaction
> >
> > btrfs_commit_transaction()
> >  create_pending_snapshots()
> >create_pending_snapshot()
> > qgroup_account_snapshot()
> >  btrfs_qgroup_inherit()
> >  mutex_lock(fs_info->qgroup_ioctl_lock)
> >   --> deadlock, mutex already locked
> >   by this task at
> >   btrfs_quota_enable()
>
> The backtrace looks valid.
>
> >
> > So fix this by adding a flag to the transaction handle that signals if the
> > transaction is being used for enabling quotas (only seen by the task doing
> > it) and do not lock the mutex qgroup_ioctl_lock at btrfs_qgroup_inherit()
> > if the transaction handle corresponds to the one being used to enable the
> > quotas.
> >
> > Fixes: 6426c7ad697d ("btrfs: qgroup: Fix qgroup accounting when creating snapshot")
> > Signed-off-by: Filipe Manana 
> > ---
> >  fs/btrfs/qgroup.c  | 10 --
> >  fs/btrfs/transaction.h |  1 +
> >  2 files changed, 9 insertions(+), 2 deletions(-)
> >
> > diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
> > index d4917c0cddf5..3aec3bfa3d70 100644
> > --- a/fs/btrfs/qgroup.c
> > +++ b/fs/btrfs/qgroup.c
> > @@ -908,6 +908,7 @@ int btrfs_quota_enable(struct btrfs_fs_info *fs_info)
> >   trans = NULL;
> >   goto out;
> >   }
> > + trans->enabling_quotas = true;
>
> Should we put enabling_quotas bit into btrfs_transaction instead of
> btrfs_trans_handle?

Why?
Only the task which is enabling quotas needs to know about it.

>
> Isn't it possible to have different trans handle pointed to the same
> transaction?

Yes.

>
> And I'm not really sure about the naming "enabling_quotas".
> What about "quota_ioctl_mutex_hold"? (Well, this also sounds awful)

Too long.


>
> Thanks,
> Qu
>
> >
> >   fs_info->qgroup_ulist = ulist_alloc(GFP_KERNEL);
> >   if (!fs_info->qgroup_ulist) {
> > @@ -2250,7 +2251,11 @@ int btrfs_qgroup_inherit(struct btrfs_trans_handle *trans, u64 srcid,
> >   u32 level_size = 0;
> >   u64 nums;
> >
> > - mutex_lock(&fs_info->qgroup_ioctl_lock);
> > + if (trans->enabling_quotas)
> > +         lockdep_assert_held(&fs_info->qgroup_ioctl_lock);
> > + else
> > +         mutex_lock(&fs_info->qgroup_ioctl_lock);
> > +
> >   if (!test_bit(BTRFS_FS_QUOTA_ENABLED, &fs_info->flags))
> >           goto out;
> >
> > @@ -2413,7 +2418,8 @@ int btrfs_qgroup_inherit(struct btrfs_trans_handle *trans, u64 srcid,
> >  unlock:
> >   spin_unlock(&fs_info->qgroup_lock);
> >  out:
> > - mutex_unlock(&fs_info->qgroup_ioctl_lock);
> > + if (!trans->enabling_quotas)
> > +         mutex_unlock(&fs_info->qgroup_ioctl_lock);
> >   return ret;
> >  }
> >
> > diff --git a/fs/btrfs/transaction.h b/fs/btrfs/transaction.h
> > index 703d5116a2fc..a5553a1dee30 100644
> > --- a/fs/btrfs/transaction.h
> > +++ b/fs/btrfs/transaction.h
> > @@ -122,6 +122,7 @@ struct btrfs_trans_handle {
> >   bool reloc_reserved;
> >   bool sync;
> >   bool dirty;
> > + bool enabling_quotas;
> >   struct btrfs_root *root;
> >   struct btrfs_fs_info *fs_info;
> >   struct list_head new_bgs;
> >
>


Re: [PATCH] btrfs: introduce feature to forget a btrfs device

2018-11-14 Thread Filipe Manana
On Wed, Nov 14, 2018 at 11:15 AM Filipe Manana  wrote:
>
> On Wed, Nov 14, 2018 at 9:14 AM Anand Jain  wrote:
> >
> > Support for a new command 'btrfs dev forget [dev]' is proposed here
> > to undo the effects of 'btrfs dev scan [dev]'. For this purpose
> > this patch proposes to use ioctl #5 as it was empty.
> > IOW(BTRFS_IOCTL_MAGIC, 5, ..)
> > This patch adds new ioctl BTRFS_IOC_FORGET_DEV which can be sent from
> > the /dev/btrfs-control to forget one or all devices, (devices which are
> > not mounted) from the btrfs kernel.
> >
> > The argument it takes is struct btrfs_ioctl_vol_args, and ::name can be
> > set to specify the device path. And all unmounted devices can be removed
> > from the kernel if no device path is provided.
> >
> > Again, the devices are removed only if the relevant fsid aren't mounted.
> >
> > This new cli can provide:
> >  . Release of unwanted btrfs_fs_devices and btrfs_devices memory if the
> >device is not going to be mounted.
> >  . Ability to mount a device in degraded mode when one of the other
> >devices is corrupted, like in a split-brain raid1.
> >  . Running test cases which require a btrfs.ko reload if the rootfs
> >is btrfs.
> >
> > Signed-off-by: Anand Jain 
> > Reviewed-by: Nikolay Borisov 
> > ---
> >  fs/btrfs/super.c   | 3 +++
> >  fs/btrfs/volumes.c | 9 +
> >  fs/btrfs/volumes.h | 1 +
> >  include/uapi/linux/btrfs.h | 2 ++
> >  4 files changed, 15 insertions(+)
> >
> > diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
> > index 345c64d810d4..f99db6899004 100644
> > --- a/fs/btrfs/super.c
> > +++ b/fs/btrfs/super.c
> > @@ -2246,6 +2246,9 @@ static long btrfs_control_ioctl(struct file *file, 
> > unsigned int cmd,
> > ret = PTR_ERR_OR_ZERO(device);
> > mutex_unlock(&uuid_mutex);
> > break;
> > +   case BTRFS_IOC_FORGET_DEV:
> > +   ret = btrfs_forget_devices(vol->name);
> > +   break;
> > case BTRFS_IOC_DEVICES_READY:
> > mutex_lock(&uuid_mutex);
> > device = btrfs_scan_one_device(vol->name, FMODE_READ,
> > diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> > index f435d397019e..e1365a122657 100644
> > --- a/fs/btrfs/volumes.c
> > +++ b/fs/btrfs/volumes.c
> > @@ -1208,6 +1208,15 @@ static int btrfs_read_disk_super(struct block_device 
> > *bdev, u64 bytenr,
> > return 0;
> >  }
> >
> > +int btrfs_forget_devices(const char *path)
> > +{
> > +   mutex_lock(&uuid_mutex);
> > +   btrfs_free_stale_devices(strlen(path) ? path:NULL, NULL);
>
> One space before : and another one after it please.
>
> Now the more important: don't use strlen, use strnlen. Some malicious
> or sloppy user might have passed a non-null terminated string, you
> don't want strlen to go past the limits of btrfs_ioctl_vol_args for
> obvious reasons.

In fact that's a problem for the entire use of vol->name in
btrfs_control_ioctl. The name's last byte should be set to '\0' to
avoid issues.
I'll send a fix for that, so if David fixes the white spaces on commit
there's no need for a v12.
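
For illustration, a minimal user-space sketch of that kind of defensive
termination (the struct and NAME_LEN below merely mirror
btrfs_ioctl_vol_args and BTRFS_PATH_NAME_MAX for the example; this is not
the actual kernel fix):

    #define _POSIX_C_SOURCE 200809L
    #include <stdio.h>
    #include <string.h>

    #define NAME_LEN 4087   /* mirrors BTRFS_PATH_NAME_MAX (assumption) */

    struct vol_args {
            char name[NAME_LEN + 1];
    };

    /* A sloppy or malicious caller may not have NUL-terminated the buffer. */
    static void sanitize_name(struct vol_args *vol)
    {
            vol->name[NAME_LEN] = '\0';
    }

    int main(void)
    {
            struct vol_args vol;

            memset(vol.name, 'A', sizeof(vol.name)); /* simulate unterminated input */
            sanitize_name(&vol);
            printf("name length: %zu\n", strnlen(vol.name, sizeof(vol.name)));
            return 0;
    }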

>
> Also, please, not just to make a maintainer's life easier but also for
> current and future reviewers, add the patch version to each patch's
> subject and not just the cover letter. Also list (after ---) what changed
> between each patch version in the patch itself and not in the cover
> letter.
>
> V12, here we go.
>
> > +   mutex_unlock(&uuid_mutex);
> > +
> > +   return 0;
> > +}
> > +
> >  /*
> >   * Look for a btrfs signature on a device. This may be called out of the 
> > mount path
> >   * and we are not allowed to call set_blocksize during the scan. The 
> > superblock
> > diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
> > index aefce895e994..180297d04938 100644
> > --- a/fs/btrfs/volumes.h
> > +++ b/fs/btrfs/volumes.h
> > @@ -406,6 +406,7 @@ int btrfs_open_devices(struct btrfs_fs_devices 
> > *fs_devices,
> >fmode_t flags, void *holder);
> >  struct btrfs_device *btrfs_scan_one_device(const char *path,
> >fmode_t flags, void *holder);
> > +int btrfs_forget_devices(const char *path);
> >  int btrfs_close_devices(struct btrfs_fs_devices *fs_devices);
> >  void btrfs_free_extra_devids(struct btrfs_fs_devices *fs_devices, int 
> > step);
> >  void btrfs_assign_next_active_device(struct btrfs_device *device,

Re: [PATCH] btrfs: introduce feature to forget a btrfs device

2018-11-14 Thread Filipe Manana
On Wed, Nov 14, 2018 at 9:14 AM Anand Jain  wrote:
>
> Support for a new command 'btrfs dev forget [dev]' is proposed here
> to undo the effects of 'btrfs dev scan [dev]'. For this purpose
> this patch proposes to use ioctl #5 as it was empty.
> IOW(BTRFS_IOCTL_MAGIC, 5, ..)
> This patch adds new ioctl BTRFS_IOC_FORGET_DEV which can be sent from
> the /dev/btrfs-control to forget one or all devices, (devices which are
> not mounted) from the btrfs kernel.
>
> The argument it takes is struct btrfs_ioctl_vol_args, and ::name can be
> set to specify the device path. And all unmounted devices can be removed
> from the kernel if no device path is provided.
>
> Again, the devices are removed only if the relevant fsid aren't mounted.
>
> This new cli can provide:
>  . Release of unwanted btrfs_fs_devices and btrfs_devices memory if the
>device is not going to be mounted.
>  . Ability to mount a device in degraded mode when one of the other
>devices is corrupted, like in a split-brain raid1.
>  . Running test cases which require a btrfs.ko reload if the rootfs
>is btrfs.
>
> Signed-off-by: Anand Jain 
> Reviewed-by: Nikolay Borisov 
> ---
>  fs/btrfs/super.c   | 3 +++
>  fs/btrfs/volumes.c | 9 +
>  fs/btrfs/volumes.h | 1 +
>  include/uapi/linux/btrfs.h | 2 ++
>  4 files changed, 15 insertions(+)
>
> diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
> index 345c64d810d4..f99db6899004 100644
> --- a/fs/btrfs/super.c
> +++ b/fs/btrfs/super.c
> @@ -2246,6 +2246,9 @@ static long btrfs_control_ioctl(struct file *file, 
> unsigned int cmd,
> ret = PTR_ERR_OR_ZERO(device);
> mutex_unlock(&uuid_mutex);
> break;
> +   case BTRFS_IOC_FORGET_DEV:
> +   ret = btrfs_forget_devices(vol->name);
> +   break;
> case BTRFS_IOC_DEVICES_READY:
> mutex_lock(&uuid_mutex);
> device = btrfs_scan_one_device(vol->name, FMODE_READ,
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index f435d397019e..e1365a122657 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -1208,6 +1208,15 @@ static int btrfs_read_disk_super(struct block_device 
> *bdev, u64 bytenr,
> return 0;
>  }
>
> +int btrfs_forget_devices(const char *path)
> +{
> +   mutex_lock(&uuid_mutex);
> +   btrfs_free_stale_devices(strlen(path) ? path:NULL, NULL);

One space before : and another one after it please.

Now the more important: don't use strlen, use strnlen. Some malicious
or sloppy user might have passed a non-null terminated string, you
don't want strlen to go past the limits of btrfs_ioctl_vol_args for
obvious reasons.

Also, please, not just to make a maintainer's life easier but also for
current and future reviewers, add the patch version to each patch's
subject and not just the cover letter. Also list (after ---) what changed
between each patch version in the patch itself and not in the cover
letter.

V12, here we go.

> +   mutex_unlock(&uuid_mutex);
> +
> +   return 0;
> +}
> +
>  /*
>   * Look for a btrfs signature on a device. This may be called out of the 
> mount path
>   * and we are not allowed to call set_blocksize during the scan. The 
> superblock
> diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
> index aefce895e994..180297d04938 100644
> --- a/fs/btrfs/volumes.h
> +++ b/fs/btrfs/volumes.h
> @@ -406,6 +406,7 @@ int btrfs_open_devices(struct btrfs_fs_devices 
> *fs_devices,
>fmode_t flags, void *holder);
>  struct btrfs_device *btrfs_scan_one_device(const char *path,
>fmode_t flags, void *holder);
> +int btrfs_forget_devices(const char *path);
>  int btrfs_close_devices(struct btrfs_fs_devices *fs_devices);
>  void btrfs_free_extra_devids(struct btrfs_fs_devices *fs_devices, int step);
>  void btrfs_assign_next_active_device(struct btrfs_device *device,
> diff --git a/include/uapi/linux/btrfs.h b/include/uapi/linux/btrfs.h
> index 5ca1d21fc4a7..b1be7f828cb4 100644
> --- a/include/uapi/linux/btrfs.h
> +++ b/include/uapi/linux/btrfs.h
> @@ -836,6 +836,8 @@ enum btrfs_err_code {
>struct btrfs_ioctl_vol_args)
>  #define BTRFS_IOC_SCAN_DEV _IOW(BTRFS_IOCTL_MAGIC, 4, \
>struct btrfs_ioctl_vol_args)
> +#define BTRFS_IOC_FORGET_DEV _IOW(BTRFS_IOCTL_MAGIC, 5, \
> +  struct btrfs_ioctl_vol_args)
>  /* trans start and trans end are dangerous, and only for
>   * use by applications that know how to avoid the
>   * resulting deadlocks
> --
> 1.8.3.1
>


-- 
Filipe David Manana,

“Whether you think you can, or you think you can't — you're right.”


Re: [PATCH v2 0/6] btrfs: qgroup: Delay subtree scan to reduce overhead

2018-11-13 Thread Filipe Manana
On Tue, Nov 13, 2018 at 5:08 PM David Sterba  wrote:
>
> On Mon, Nov 12, 2018 at 10:33:33PM +0100, David Sterba wrote:
> > On Thu, Nov 08, 2018 at 01:49:12PM +0800, Qu Wenruo wrote:
> > > This patchset can be fetched from github:
> > > https://github.com/adam900710/linux/tree/qgroup_delayed_subtree_rebased
> > >
> > > Which is based on v4.20-rc1.
> >
> > Thanks, I'll add it to for-next soon.
>
> During test generic/517, the logs were full of the warning below. The 
> reference
> test on current master, effectively misc-4.20 which was used as base of your
> branch did not get the warning.
>
> [11540.167829] BTRFS: end < start 2519039 2519040
> [11540.170513] WARNING: CPU: 1 PID: 539 at fs/btrfs/extent_io.c:436 
> insert_state+0xd8/0x100 [btrfs]
> [11540.174411] Modules linked in: dm_thin_pool dm_persistent_data dm_bufio 
> dm_bio_prison btrfs libcrc32c xor zstd_decompress zstd_compress xxhash 
> raid6_pq dm_mod loop [last unloaded: libcrc32c]
> [11540.178279] CPU: 1 PID: 539 Comm: xfs_io Tainted: G  D W 
> 4.20.0-rc1-default+ #329
> [11540.180616] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
> rel-1.11.2-0-gf9626cc-prebuilt.qemu-project.org 04/01/2014
> [11540.183754] RIP: 0010:insert_state+0xd8/0x100 [btrfs]
> [11540.189173] RSP: 0018:a0d245eafb20 EFLAGS: 00010282
> [11540.189885] RAX:  RBX: 9f0bb3267320 RCX: 
> 
> [11540.191646] RDX: 0002 RSI: 0001 RDI: 
> a40c400d
> [11540.192942] RBP: 00266fff R08: 0001 R09: 
> 
> [11540.193871] R10:  R11: a629da2d R12: 
> 9f0ba0281c60
> [11540.195527] R13: 00267000 R14: a0d245eafb98 R15: 
> a0d245eafb90
> [11540.197026] FS:  7fa338eb4b80() GS:9f0bbd60() 
> knlGS:
> [11540.198251] CS:  0010 DS:  ES:  CR0: 80050033
> [11540.199698] CR2: 7fa33873bfb8 CR3: 6fb6e000 CR4: 
> 06e0
> [11540.201428] Call Trace:
> [11540.202164]  __set_extent_bit+0x43b/0x5b0 [btrfs]
> [11540.203223]  lock_extent_bits+0x5d/0x210 [btrfs]
> [11540.204346]  ? _raw_spin_unlock+0x24/0x40
> [11540.205381]  ? test_range_bit+0xdf/0x130 [btrfs]
> [11540.206573]  lock_extent_range+0xb8/0x150 [btrfs]
> [11540.207696]  btrfs_double_extent_lock+0x78/0xb0 [btrfs]
> [11540.208988]  btrfs_extent_same_range+0x131/0x4e0 [btrfs]
> [11540.210237]  btrfs_remap_file_range+0x337/0x350 [btrfs]
> [11540.211448]  vfs_dedupe_file_range_one+0x141/0x150
> [11540.212622]  vfs_dedupe_file_range+0x146/0x1a0
> [11540.213795]  do_vfs_ioctl+0x520/0x6c0
> [11540.214711]  ? __fget+0x109/0x1e0
> [11540.215616]  ksys_ioctl+0x3a/0x70
> [11540.216233]  __x64_sys_ioctl+0x16/0x20
> [11540.216860]  do_syscall_64+0x54/0x180
> [11540.217409]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
> [11540.218126] RIP: 0033:0x7fa338a4daa7

That's the infinite loop issue fixed by one of the patches submitted
for 4.20-rc2:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?h=v4.20-rc2&id=11023d3f5fdf89bba5e1142127701ca6e6014587

The branch you used for testing doesn't have that fix?

>


-- 
Filipe David Manana,

“Whether you think you can, or you think you can't — you're right.”


Re: [PATCH] Btrfs: do not set log for full commit when creating non-data block groups

2018-11-13 Thread Filipe Manana
On Tue, Nov 13, 2018 at 2:31 PM David Sterba  wrote:
>
> On Thu, Nov 08, 2018 at 02:48:29PM +, Filipe Manana wrote:
> > On Thu, Nov 8, 2018 at 2:37 PM Filipe Manana  wrote:
> > >
> > > On Thu, Nov 8, 2018 at 2:35 PM Qu Wenruo  wrote:
> > > >
> > > >
> > > >
> > > > On 2018/11/8 下午9:17, fdman...@kernel.org wrote:
> > > > > From: Filipe Manana 
> > > > >
> > > > > When creating a block group we don't need to set the log for full 
> > > > > commit
> > > > > if the new block group is not used for data. Logged items can only 
> > > > > point
> > > > > to logical addresses of data block groups (through file extent items) 
> > > > > so
> > > > > there is no need for the next fsync to fall back to a transaction
> > > > > commit
> > > > > if the new block group is for metadata.
> > > >
> > > > Is it possible for the log tree blocks to be allocated in that new block
> > > > group?
> > >
> > > Yes.
> >
> > Now I realize what might be your concern, and this would cause trouble.
>
> Is this patch ok for for-next or does it need more work? Thanks.

Nop, it's no good (despite not triggering problems initially), due to
Qu's first question.
So just drop it and forget it.
Thanks.


Re: [PATCH] Btrfs: do not set log for full commit when creating non-data block groups

2018-11-09 Thread Filipe Manana
On Fri, Nov 9, 2018 at 12:27 AM Qu Wenruo  wrote:
>
>
>
> On 2018/11/8 下午10:48, Filipe Manana wrote:
> > On Thu, Nov 8, 2018 at 2:37 PM Filipe Manana  wrote:
> >>
> >> On Thu, Nov 8, 2018 at 2:35 PM Qu Wenruo  wrote:
> >>>
> >>>
> >>>
> >>> On 2018/11/8 下午9:17, fdman...@kernel.org wrote:
> >>>> From: Filipe Manana 
> >>>>
> >>>> When creating a block group we don't need to set the log for full commit
> >>>> if the new block group is not used for data. Logged items can only point
> >>>> to logical addresses of data block groups (through file extent items) so
> >>>> there is no need for the next fsync to fall back to a transaction
> >>>> commit
> >>>> if the new block group is for metadata.
> >>>
> >>> Is it possible for the log tree blocks to be allocated in that new block
> >>> group?
> >>
> >> Yes.
> >
> > Now I realize what might be your concern, and this would cause trouble.
> > Surprised this didn't trigger any problem and I had this (together
> > with other changes) running tests for some weeks already.
>
> Maybe it's related metadata chunk pre-allocation so it will be super
> hard to hit in normal case, or some extent allocation policy preventing
> us from allocating tree block of newly created bg.

No, I don't think we have any such policy; we would have noticed it
over the years if we did have one.
Metadata chunk allocation just happens much less frequently for the
workloads the tests exercised.

>
> Thanks,
> Qu
>
> >
> >>
> >>>
> >>> Thanks,
> >>> Qu
> >>>
> >>>>
> >>>> Signed-off-by: Filipe Manana 
> >>>> ---
> >>>>  fs/btrfs/extent-tree.c | 3 ++-
> >>>>  1 file changed, 2 insertions(+), 1 deletion(-)
> >>>>
> >>>> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> >>>> index 577878324799..588fbd1606fb 100644
> >>>> --- a/fs/btrfs/extent-tree.c
> >>>> +++ b/fs/btrfs/extent-tree.c
> >>>> @@ -10112,7 +10112,8 @@ int btrfs_make_block_group(struct 
> >>>> btrfs_trans_handle *trans, u64 bytes_used,
> >>>>   struct btrfs_block_group_cache *cache;
> >>>>   int ret;
> >>>>
> >>>> - btrfs_set_log_full_commit(fs_info, trans);
> >>>> + if (type & BTRFS_BLOCK_GROUP_DATA)
> >>>> + btrfs_set_log_full_commit(fs_info, trans);
> >>>>
> >>>>   cache = btrfs_create_block_group_cache(fs_info, chunk_offset, 
> >>>> size);
> >>>>   if (!cache)
> >>>>
> >>>
>


Re: [PATCH] Btrfs: incremental send, fix infinite loop when apply children dir moves

2018-11-09 Thread Filipe Manana
> >
> > 6. In round 13, while processing inode 270, we delay the rename because
> > 270 has a path loop with 267, and then we add 259 and 265 to the stack,
> > but we do not remove them from the pending_dir_moves rb_tree.
> >
> > 7. In round 15, while processing inode 266, we delay the rename because
> > 266 has a path loop with 270. So we look for a parent_ino equal to 270
> > in pending_dir_moves, and we find ino 259 because it was not removed
> > from pending_dir_moves.
> > Then we create a new pending_dir_move and join it to ino 259. Because
> > ino 259 is currently in the stack, the new pending_dir_move for ino 266
> > is also indirectly added to the stack, placed between 267 and 259.
> >
> > So we fix this problem by removing the node from pending_dir_moves,
> > which avoids adding a new pending_dir_move to the stack list.
> >
>
> Does anyone have any suggestions ?

A better changelog. But for that I'll have to go through it and
understand what's happening (and see if this is the right way to fix
it). Will probably do it next week.

> Later, I will submit the case in xfstest.
>
>
> > Qu Wenruo 於 2018-11-05 22:35 寫到:
> >> On 2018/11/5 下午7:11, Filipe Manana wrote:
> >>> On Mon, Nov 5, 2018 at 4:10 AM robbieko 
> >>> wrote:
> >>>>
> >>>> Filipe Manana 於 2018-10-30 19:36 寫到:
> >>>>> On Tue, Oct 30, 2018 at 7:00 AM robbieko 
> >>>>> wrote:
> >>>>>>
> >>>>>> From: Robbie Ko 
> >>>>>>
> >>>>>> In apply_children_dir_moves, we first create an empty list
> >>>>>> (stack),
> >>>>>> then we get an entry from pending_dir_moves and add it to the
> >>>>>> stack,
> >>>>>> but we didn't delete the entry from rb_tree.
> >>>>>>
> >>>>>> So, in add_pending_dir_move, we create a new entry and then use
> >>>>>> the
> >>>>>> parent_ino in the current rb_tree to find the corresponding entry,
> >>>>>> and if so, add the new entry to the corresponding list.
> >>>>>>
> >>>>>> However, the entry may have been added to the stack, causing new
> >>>>>> entries to be added to the stack as well.
> >>
> >> I'm not a send guy, so I can totally be wrong, but that 'may' word
> >> seems
> >> to hide the demon.
> >>
> >>>>>>
> >>>>>> Finally, each time we take the first entry from the stack and
> >>>>>> start
> >>>>>> processing, it ends up with an infinite loop.
> >>>>>>
> >>>>>> Fix this problem by remove node from pending_dir_moves,
> >>>>>> avoid add new pending_dir_move to error list.
> >>>>>
> >>>>> I can't parse that explanation.
> >>>>> Can you give a concrete example (reproducer) or did this came out
> >>>>> of
> >>>>> thin air?
> >>>>>
> >>>>> Thanks.
> >>>>>
> >>>>
> >>>> I am sorry that I replied so late.
> >>>>
> >>>> I have no way to give a simple example.
> >>>> But I can provide a btrfs image file
> >>>> You can restore the Image via btrfs-image
> >>>> Then directly command "btrfs send -e -p parent send -f dump_file"
> >>
> >> According to the name, it doesn't look like a real world case, but
> >> some
> >> more or less manually crafted image.
> >> It shouldn't be that hard to describe the root cause in details if
> >> it's
> >> crafted.
> >>
> >> Or, if it's a image caused by some stress test, then I really hope you
> >> could locate the direct and root cause, or at least minimize the
> >> image.
> >> The extra noise will really take a lot of time from reviewer.
> >>
> >> IMHO, it shouldn't be that hard to locate the key/key range that send
> >> loops, with that located it should provide some clue to further pin
> >> down
> >> the root cause.
> >>
> >> I totally understand that everyone has their own work, if you can't
> >> really spare time for this, would you please upload the image to
> >> public
> >> for anyone (me for example) to look into the case?
> >>
> >> Thanks,
> >> Qu
> >>
> >>>> Infinite loop will occur.

Re: [PATCH] Btrfs: do not set log for full commit when creating non-data block groups

2018-11-08 Thread Filipe Manana
On Thu, Nov 8, 2018 at 2:37 PM Filipe Manana  wrote:
>
> On Thu, Nov 8, 2018 at 2:35 PM Qu Wenruo  wrote:
> >
> >
> >
> > On 2018/11/8 下午9:17, fdman...@kernel.org wrote:
> > > From: Filipe Manana 
> > >
> > > When creating a block group we don't need to set the log for full commit
> > > if the new block group is not used for data. Logged items can only point
> > > to logical addresses of data block groups (through file extent items) so
> > > there is no need for the next fsync to fall back to a transaction commit
> > > if the new block group is for metadata.
> >
> > Is it possible for the log tree blocks to be allocated in that new block
> > group?
>
> Yes.

Now I realize what might be your concern, and this would cause trouble.
Surprised this didn't trigger any problem and I had this (together
with other changes) running tests for some weeks already.

>
> >
> > Thanks,
> > Qu
> >
> > >
> > > Signed-off-by: Filipe Manana 
> > > ---
> > >  fs/btrfs/extent-tree.c | 3 ++-
> > >  1 file changed, 2 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> > > index 577878324799..588fbd1606fb 100644
> > > --- a/fs/btrfs/extent-tree.c
> > > +++ b/fs/btrfs/extent-tree.c
> > > @@ -10112,7 +10112,8 @@ int btrfs_make_block_group(struct 
> > > btrfs_trans_handle *trans, u64 bytes_used,
> > >   struct btrfs_block_group_cache *cache;
> > >   int ret;
> > >
> > > - btrfs_set_log_full_commit(fs_info, trans);
> > > + if (type & BTRFS_BLOCK_GROUP_DATA)
> > > + btrfs_set_log_full_commit(fs_info, trans);
> > >
> > >   cache = btrfs_create_block_group_cache(fs_info, chunk_offset, size);
> > >   if (!cache)
> > >
> >


Re: [PATCH] Btrfs: do not set log for full commit when creating non-data block groups

2018-11-08 Thread Filipe Manana
On Thu, Nov 8, 2018 at 2:35 PM Qu Wenruo  wrote:
>
>
>
> On 2018/11/8 下午9:17, fdman...@kernel.org wrote:
> > From: Filipe Manana 
> >
> > When creating a block group we don't need to set the log for full commit
> > if the new block group is not used for data. Logged items can only point
> > to logical addresses of data block groups (through file extent items) so
> > there is no need for the next fsync to fall back to a transaction commit
> > if the new block group is for metadata.
>
> Is it possible for the log tree blocks to be allocated in that new block
> group?

Yes.

>
> Thanks,
> Qu
>
> >
> > Signed-off-by: Filipe Manana 
> > ---
> >  fs/btrfs/extent-tree.c | 3 ++-
> >  1 file changed, 2 insertions(+), 1 deletion(-)
> >
> > diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> > index 577878324799..588fbd1606fb 100644
> > --- a/fs/btrfs/extent-tree.c
> > +++ b/fs/btrfs/extent-tree.c
> > @@ -10112,7 +10112,8 @@ int btrfs_make_block_group(struct 
> > btrfs_trans_handle *trans, u64 bytes_used,
> >   struct btrfs_block_group_cache *cache;
> >   int ret;
> >
> > - btrfs_set_log_full_commit(fs_info, trans);
> > + if (type & BTRFS_BLOCK_GROUP_DATA)
> > + btrfs_set_log_full_commit(fs_info, trans);
> >
> >   cache = btrfs_create_block_group_cache(fs_info, chunk_offset, size);
> >   if (!cache)
> >
>


Re: [PATCH v4] Btrfs: fix deadlock on tree root leaf when finding free extent

2018-11-05 Thread Filipe Manana
On Mon, Nov 5, 2018 at 4:34 PM David Sterba  wrote:
>
> On Mon, Nov 05, 2018 at 04:30:35PM +, Filipe Manana wrote:
> > On Mon, Nov 5, 2018 at 4:29 PM David Sterba  wrote:
> > >
> > > On Wed, Oct 24, 2018 at 01:48:40PM +0100, Filipe Manana wrote:
> > > > > Ah ok makes sense.  Well in that case lets just make 
> > > > > btrfs_read_locked_inode()
> > > > > take a path, and allocate it in btrfs_iget, that'll remove the ugly
> > > > >
> > > > > if (path != in_path)
> > > >
> > > > You mean the following on top of v4:
> > > >
> > > > https://friendpaste.com/6XrGXb5p0RSJGixUFYouHg
> > > >
> > > > Not much different, just saves one such if statement. I'm ok with that.
> > >
> > > Now in misc-next with v4 and the friendpaste incremental as
> > >
> > > https://github.com/kdave/btrfs-devel/commit/efcfd6c87d28b3aa9bcba52d7c1e1fc79a2dad69
> >
> > Please don't add the incremental. It's buggy. It was meant to figure
> > out what Josef was saying. That's why I haven't sent a V5.
>
> Ok dropped, I'll will wait for a proper patch.

It's V4, the last sent version. Just forget the incremental.
Thanks.


Re: [PATCH v4] Btrfs: fix deadlock on tree root leaf when finding free extent

2018-11-05 Thread Filipe Manana
On Mon, Nov 5, 2018 at 4:29 PM David Sterba  wrote:
>
> On Wed, Oct 24, 2018 at 01:48:40PM +0100, Filipe Manana wrote:
> > > Ah ok makes sense.  Well in that case lets just make 
> > > btrfs_read_locked_inode()
> > > take a path, and allocate it in btrfs_iget, that'll remove the ugly
> > >
> > > if (path != in_path)
> >
> > You mean the following on top of v4:
> >
> > https://friendpaste.com/6XrGXb5p0RSJGixUFYouHg
> >
> > Not much different, just saves one such if statement. I'm ok with that.
>
> Now in misc-next with v4 and the friendpaste incremental as
>
> https://github.com/kdave/btrfs-devel/commit/efcfd6c87d28b3aa9bcba52d7c1e1fc79a2dad69

Please don't add the incremental. It's buggy. It was meant to figure
out what Josef was saying. That's why I haven't sent a V5.


Re: [PATCH] Btrfs: incremental send, fix infinite loop when apply children dir moves

2018-11-05 Thread Filipe Manana
On Mon, Nov 5, 2018 at 4:10 AM robbieko  wrote:
>
> Filipe Manana 於 2018-10-30 19:36 寫到:
> > On Tue, Oct 30, 2018 at 7:00 AM robbieko  wrote:
> >>
> >> From: Robbie Ko 
> >>
> >> In apply_children_dir_moves, we first create an empty list (stack),
> >> then we get an entry from pending_dir_moves and add it to the stack,
> >> but we didn't delete the entry from rb_tree.
> >>
> >> So, in add_pending_dir_move, we create a new entry and then use the
> >> parent_ino in the current rb_tree to find the corresponding entry,
> >> and if so, add the new entry to the corresponding list.
> >>
> >> However, the entry may have been added to the stack, causing new
> >> entries to be added to the stack as well.
> >>
> >> Finally, each time we take the first entry from the stack and start
> >> processing, it ends up with an infinite loop.
> >>
> >> Fix this problem by remove node from pending_dir_moves,
> >> avoid add new pending_dir_move to error list.
> >
> > I can't parse that explanation.
> > Can you give a concrete example (reproducer) or did this came out of
> > thin air?
> >
> > Thanks.
> >
>
> I am sorry that I replied so late.
>
> I have no way to give a simple example.
> But I can provide a btrfs image file
> You can restore the Image via btrfs-image
> Then directly command "btrfs send -e -p parent send -f dump_file"
> Infinite loop will occur.
> I use ubuntu 16.04, kernel 4.15.0.36-generic can be stable reproduce

You have been occasionally submitting fixes for send/receive for a few
years now, and you know already
that I always ask for a changelog that describes well the problem and
an example/reproducer.

Why did you do this?

What I can read from your answer is that you were too lazy to extract
a reproducer from that image.
You just made some change that fixes the infinite loop, and because it
apparently works you're done with it.
Without an example at least, I don't think you or anyone can fully
understand the problem, or whether what you have (despite somewhat making
theoretical sense) is really a good solution or just a workaround for the
cause of the problem - after all, if you can't give an example, you can't
explain how such a loop of dependencies between directories happens in
practice. This, as with most send/receive problems, is a purely sequential
and deterministic problem, so there's really no excuse for not getting a
reproducer.

Without an example and an explanation of how it happens in the real world,
how does one know that your change fixes the problem in the right place
and does not introduce other problems? Like the receiver not getting some
changes (missing directories, files, or renames, etc).

Tests are not just to prove a change is correct, they exist to catch
and prevent regressions in the future too.

You can do better than that.

>
> Image file, please refer to the attachment.
>
> Thanks.
>
>
> >>
> >> Signed-off-by: Robbie Ko 
> >> ---
> >>  fs/btrfs/send.c | 11 ---
> >>  1 file changed, 8 insertions(+), 3 deletions(-)
> >>
> >> diff --git a/fs/btrfs/send.c b/fs/btrfs/send.c
> >> index 094cc144..5be83b5 100644
> >> --- a/fs/btrfs/send.c
> >> +++ b/fs/btrfs/send.c
> >> @@ -3340,7 +3340,8 @@ static void free_pending_move(struct send_ctx
> >> *sctx, struct pending_dir_move *m)
> >> kfree(m);
> >>  }
> >>
> >> -static void tail_append_pending_moves(struct pending_dir_move *moves,
> >> +static void tail_append_pending_moves(struct send_ctx *sctx,
> >> + struct pending_dir_move *moves,
> >>   struct list_head *stack)
> >>  {
> >> if (list_empty(&moves->list)) {
> >> @@ -3351,6 +3352,10 @@ static void tail_append_pending_moves(struct
> >> pending_dir_move *moves,
> >> list_add_tail(&moves->list, stack);
> >> list_splice_tail(&list, stack);
> >> }
> >> +   if (!RB_EMPTY_NODE(&moves->node)) {
> >> +   rb_erase(&moves->node, &sctx->pending_dir_moves);
> >> +   RB_CLEAR_NODE(&moves->node);
> >> +   }
> >>  }
> >>
> >>  static int apply_children_dir_moves(struct send_ctx *sctx)
> >> @@ -3365,7 +3370,7 @@ static int apply_children_dir_moves(struct
> >> send_ctx *sctx)
> >> return 0;
> >>
> >> INIT_LIST_HEAD(&stack);
> >> -   tail_append_pending_moves(pm, &stack);
> >> +   tail_append_pending_moves(sctx, pm, &stack);
> >>
> >> while (!list_empty(&stack)) {
> >> pm = list_first_entry(&stack, struct pending_dir_move,
> >> list);
> >> @@ -3376,7 +3381,7 @@ static int apply_children_dir_moves(struct
> >> send_ctx *sctx)
> >> goto out;
> >> pm = get_pending_dir_moves(sctx, parent_ino);
> >> if (pm)
> >> -   tail_append_pending_moves(pm, &stack);
> >> +   tail_append_pending_moves(sctx, pm, &stack);
> >> }
> >> return 0;
> >>
> >> --
> >> 1.9.1
> >>



-- 
Filipe David Manana,

“Whether you think you can, or you think you can't — you're right.”


Re: [PATCH] Btrfs: fix cur_offset in the error case for nocow

2018-10-30 Thread Filipe Manana
On Tue, Oct 30, 2018 at 10:05 AM robbieko  wrote:
>
> From: Robbie Ko 
>
> When the cow_file_range fail, the related resources are
> unlocked according to the range (start-end), so the unlock
> cannot be repeated in run_delalloc_nocow.
>
> In some cases (e.g. cur_offset <= end && cow_start!= -1),
> cur_offset is not updated correctly, so move the cur_offset
> update before cow_file_range.
>
> [ cut here ]
> kernel BUG at mm/page-writeback.c:2663!
> Internal error: Oops - BUG: 0 [#1] SMP
> CPU: 3 PID: 31525 Comm: kworker/u8:7 Tainted: P O
> Hardware name: Realtek_RTD1296 (DT)
> Workqueue: writeback wb_workfn (flush-btrfs-1)
> task: ffc076db3380 ti: ffc02e9ac000 task.ti: ffc02e9ac000
> PC is at clear_page_dirty_for_io+0x1bc/0x1e8
> LR is at clear_page_dirty_for_io+0x14/0x1e8
> pc : [] lr : [] pstate: 4145
> sp : ffc02e9af4f0
> Process kworker/u8:7 (pid: 31525, stack limit = 0xffc02e9ac020)
> Call trace:
> [] clear_page_dirty_for_io+0x1bc/0x1e8
> [] extent_clear_unlock_delalloc+0x1e4/0x210 [btrfs]
> [] run_delalloc_nocow+0x3b8/0x948 [btrfs]
> [] run_delalloc_range+0x250/0x3a8 [btrfs]
> [] writepage_delalloc.isra.21+0xbc/0x1d8 [btrfs]
> [] __extent_writepage+0xe8/0x248 [btrfs]
> [] extent_write_cache_pages.isra.17+0x164/0x378 [btrfs]
> [] extent_writepages+0x48/0x68 [btrfs]
> [] btrfs_writepages+0x20/0x30 [btrfs]
> [] do_writepages+0x30/0x88
> [] __writeback_single_inode+0x34/0x198
> [] writeback_sb_inodes+0x184/0x3c0
> [] __writeback_inodes_wb+0x6c/0xc0
> [] wb_writeback+0x1b8/0x1c0
> [] wb_workfn+0x150/0x250
> [] process_one_work+0x1dc/0x388
> [] worker_thread+0x130/0x500
> [] kthread+0x10c/0x110
> [] ret_from_fork+0x10/0x40
> Code: d503201f a9025bb5 a90363b7 f90023b9 (d421)
> ---[ end trace 65fecee7c2296f25 ]---
>
> Signed-off-by: Robbie Ko 
> ---
>  fs/btrfs/inode.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index 181c58b..b62299b 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -1532,10 +1532,10 @@ static noinline int run_delalloc_nocow(struct inode 
> *inode,
>
> if (cur_offset <= end && cow_start == (u64)-1) {
> cow_start = cur_offset;
> -   cur_offset = end;
> }

Also remove the { }

Other than that, it looks good to me and you can add:

Reviewed-by: Filipe Manana 

thanks
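
For reference, this is how that tail end would read with the braces
dropped - a sketch only, assuming the surrounding code of
run_delalloc_nocow() (including its error label) stays as it is today:

    if (cur_offset <= end && cow_start == (u64)-1)
            cow_start = cur_offset;

    if (cow_start != (u64)-1) {
            cur_offset = end;
            ret = cow_file_range(inode, locked_page, cow_start, end, end,
                                 page_started, nr_written, 1, NULL);
            if (ret)
                    goto error;     /* existing error label, unchanged */
    }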

>
> if (cow_start != (u64)-1) {
> +   cur_offset = end;
> ret = cow_file_range(inode, locked_page, cow_start, end, end,
>  page_started, nr_written, 1, NULL);
> if (ret)
> --
> 1.9.1
>


-- 
Filipe David Manana,

“Whether you think you can, or you think you can't — you're right.”


Re: [PATCH] Btrfs: incremental send, fix infinite loop when apply children dir moves

2018-10-30 Thread Filipe Manana
On Tue, Oct 30, 2018 at 7:00 AM robbieko  wrote:
>
> From: Robbie Ko 
>
> In apply_children_dir_moves, we first create an empty list (stack),
> then we get an entry from pending_dir_moves and add it to the stack,
> but we didn't delete the entry from rb_tree.
>
> So, in add_pending_dir_move, we create a new entry and then use the
> parent_ino in the current rb_tree to find the corresponding entry,
> and if so, add the new entry to the corresponding list.
>
> However, the entry may have been added to the stack, causing new
> entries to be added to the stack as well.
>
> Finally, each time we take the first entry from the stack and start
> processing, it ends up with an infinite loop.
>
> Fix this problem by remove node from pending_dir_moves,
> avoid add new pending_dir_move to error list.

I can't parse that explanation.
Can you give a concrete example (reproducer) or did this came out of thin air?

Thanks.

>
> Signed-off-by: Robbie Ko 
> ---
>  fs/btrfs/send.c | 11 ---
>  1 file changed, 8 insertions(+), 3 deletions(-)
>
> diff --git a/fs/btrfs/send.c b/fs/btrfs/send.c
> index 094cc144..5be83b5 100644
> --- a/fs/btrfs/send.c
> +++ b/fs/btrfs/send.c
> @@ -3340,7 +3340,8 @@ static void free_pending_move(struct send_ctx *sctx, 
> struct pending_dir_move *m)
> kfree(m);
>  }
>
> -static void tail_append_pending_moves(struct pending_dir_move *moves,
> +static void tail_append_pending_moves(struct send_ctx *sctx,
> + struct pending_dir_move *moves,
>   struct list_head *stack)
>  {
> if (list_empty(&moves->list)) {
> @@ -3351,6 +3352,10 @@ static void tail_append_pending_moves(struct 
> pending_dir_move *moves,
> list_add_tail(&moves->list, stack);
> list_splice_tail(&list, stack);
> }
> +   if (!RB_EMPTY_NODE(&moves->node)) {
> +   rb_erase(&moves->node, &sctx->pending_dir_moves);
> +   RB_CLEAR_NODE(&moves->node);
> +   }
>  }
>
>  static int apply_children_dir_moves(struct send_ctx *sctx)
> @@ -3365,7 +3370,7 @@ static int apply_children_dir_moves(struct send_ctx 
> *sctx)
> return 0;
>
> INIT_LIST_HEAD(&stack);
> -   tail_append_pending_moves(pm, &stack);
> +   tail_append_pending_moves(sctx, pm, &stack);
>
> while (!list_empty(&stack)) {
> pm = list_first_entry(&stack, struct pending_dir_move, list);
> @@ -3376,7 +3381,7 @@ static int apply_children_dir_moves(struct send_ctx 
> *sctx)
> goto out;
> pm = get_pending_dir_moves(sctx, parent_ino);
> if (pm)
> -   tail_append_pending_moves(pm, &stack);
> +   tail_append_pending_moves(sctx, pm, &stack);
> }
> return 0;
>
> --
> 1.9.1
>


-- 
Filipe David Manana,

“Whether you think you can, or you think you can't — you're right.”


Re: [PATCH] fstests: btrfs/057: Fix false alerts due to orphan files

2018-10-29 Thread Filipe Manana
On Mon, Oct 29, 2018 at 6:35 AM Qu Wenruo  wrote:
>
> For latest kernel, there is a chance that btrfs/057 reports false
> errors.
>
> The false error would look like:
>   btrfs/057 4s ... - output mismatch (see 
> /home/adam/xfstests-dev/results//btrfs/057.out.bad)
>   --- tests/btrfs/057.out   2017-08-21 09:25:33.1 +0800
>   +++ /home/adam/xfstests-dev/results//btrfs/057.out.bad2018-10-29 
> 14:07:28.443651293 +0800
>   @@ -1,3 +1,3 @@
>QA output created by 057
>4096 4096
>   -4096 4096
>   +28672 28672
>
> This is related to the fact that "btrfs subvolume sync" (or a
> vanilla sync) does not ensure that orphan (unlinked but still existing)
> files are removed.
>
> In fact, for that false error case, if the fs is inspected after umount,
> its qgroup numbers are correct and btrfs check won't report a qgroup error.
>
> To fix the false alerts, just skip any manual qgroup number comparison,
> and let the fsck done after the test case detect any problem.
>
> This also eliminates the need for the specific mount and mkfs
> options, allowing us to improve coverage.
>
> Reported-by: Nikolay Borisov 
> Signed-off-by: Qu Wenruo 

Reviewed-by: Filipe Manana 

> ---
>  tests/btrfs/057 | 17 -
>  tests/btrfs/057.out |  3 +--
>  2 files changed, 5 insertions(+), 15 deletions(-)
>
> diff --git a/tests/btrfs/057 b/tests/btrfs/057
> index b019f4e1..0b5a36d3 100755
> --- a/tests/btrfs/057
> +++ b/tests/btrfs/057
> @@ -33,12 +33,9 @@ _require_scratch
>  rm -f $seqres.full
>
>  # use small leaf size to get higher btree height.
> -run_check _scratch_mkfs "-b 1g --nodesize 4096"
> +run_check _scratch_mkfs "-b 1g"

The comment above should go away too.

>
> -# inode cache is saved in the FS tree itself for every
> -# individual FS tree,that affects the sizes reported by qgroup show
> -# so we need to explicitly turn it off to get consistent values.
> -_scratch_mount "-o noinode_cache"
> +_scratch_mount
>
>  # -w ensures that the only ops are ones which cause write I/O
>  run_check $FSSTRESS_PROG -d $SCRATCH_MNT -w -p 5 -n 1000 \
> @@ -53,14 +50,8 @@ run_check $FSSTRESS_PROG -d $SCRATCH_MNT/snap1 -w -p 5 -n 
> 1000 \
>  _run_btrfs_util_prog quota enable $SCRATCH_MNT
>  _run_btrfs_util_prog quota rescan -w $SCRATCH_MNT
>
> -# remove all file/dir other than subvolume
> -rm -rf $SCRATCH_MNT/snap1/* >& /dev/null
> -rm -rf $SCRATCH_MNT/p* >& /dev/null
> -
> -_run_btrfs_util_prog filesystem sync $SCRATCH_MNT
> -units=`_btrfs_qgroup_units`
> -$BTRFS_UTIL_PROG qgroup show $units $SCRATCH_MNT | $SED_PROG -n '/[0-9]/p' \
> -   | $AWK_PROG '{print $2" "$3}'
> +echo "Silence is golden"
> +# btrfs check will detect any qgroup number mismatch.
>
>  status=0
>  exit
> diff --git a/tests/btrfs/057.out b/tests/btrfs/057.out
> index 60cb92d0..185023c7 100644
> --- a/tests/btrfs/057.out
> +++ b/tests/btrfs/057.out
> @@ -1,3 +1,2 @@
>  QA output created by 057
> -4096 4096
> -4096 4096
> +Silence is golden
> --
> 2.18.0
>


-- 
Filipe David Manana,

“Whether you think you can, or you think you can't — you're right.”


Re: [PATCH] fstests: btrfs/057: Fix false alerts due to orphan files

2018-10-29 Thread Filipe Manana
On Mon, Oct 29, 2018 at 11:33 AM Qu Wenruo  wrote:
>
>
>
> On 2018/10/29 下午5:52, Filipe Manana wrote:
> > On Mon, Oct 29, 2018 at 6:31 AM Qu Wenruo  wrote:
> >>
> >> For latest kernel, there is a chance that btrfs/057 reports false
> >> errors.
> >
> > By latest kernel you mean 4.20?
>
> I mean almost all kernels.

So s/For latest kernel/For any recent kernel/ or something like that
which isn't singular.

>
> >
> >>
> >> The false error would look like:
> >>   btrfs/057 4s ... - output mismatch (see 
> >> /home/adam/xfstests-dev/results//btrfs/057.out.bad)
> >>   --- tests/btrfs/057.out   2017-08-21 09:25:33.1 +0800
> >>   +++ /home/adam/xfstests-dev/results//btrfs/057.out.bad2018-10-29 
> >> 14:07:28.443651293 +0800
> >>   @@ -1,3 +1,3 @@
> >>QA output created by 057
> >>4096 4096
> >>   -4096 4096
> >>   +28672 28672
> >>
> >> This is related to the fact that "btrfs subvolume sync" (or a
> >> vanilla sync) does not ensure that orphan (unlinked but still existing)
> >> files are removed.
> >
> > So when did that happen, which commit introduced the behaviour change?
>
> No behavior change, it's always the case.
> Just not that easy to hit.
>
> Thanks,
> Qu
>
> >
> >>
> >> In fact, for that false error case, if the fs is inspected after umount,
> >> its qgroup numbers are correct and btrfs check won't report a qgroup error.
> >>
> >> To fix the false alerts, just skip any manual qgroup number comparison,
> >> and let the fsck done after the test case detect any problem.
> >>
> >> This also eliminates the need for the specific mount and mkfs
> >> options, allowing us to improve coverage.
> >>
> >> Reported-by: Nikolay Borisov 
> >> Signed-off-by: Qu Wenruo 

Anyway, looks good to me.

Reviewed-by: Filipe Manana 

> >> ---
> >>  tests/btrfs/057 | 17 -
> >>  tests/btrfs/057.out |  3 +--
> >>  2 files changed, 5 insertions(+), 15 deletions(-)
> >>
> >> diff --git a/tests/btrfs/057 b/tests/btrfs/057
> >> index b019f4e1..0b5a36d3 100755
> >> --- a/tests/btrfs/057
> >> +++ b/tests/btrfs/057
> >> @@ -33,12 +33,9 @@ _require_scratch
> >>  rm -f $seqres.full
> >>
> >>  # use small leaf size to get higher btree height.
> >> -run_check _scratch_mkfs "-b 1g --nodesize 4096"
> >> +run_check _scratch_mkfs "-b 1g"
> >>
> >> -# inode cache is saved in the FS tree itself for every
> >> -# individual FS tree,that affects the sizes reported by qgroup show
> >> -# so we need to explicitly turn it off to get consistent values.
> >> -_scratch_mount "-o noinode_cache"
> >> +_scratch_mount
> >>
> >>  # -w ensures that the only ops are ones which cause write I/O
> >>  run_check $FSSTRESS_PROG -d $SCRATCH_MNT -w -p 5 -n 1000 \
> >> @@ -53,14 +50,8 @@ run_check $FSSTRESS_PROG -d $SCRATCH_MNT/snap1 -w -p 5 
> >> -n 1000 \
> >>  _run_btrfs_util_prog quota enable $SCRATCH_MNT
> >>  _run_btrfs_util_prog quota rescan -w $SCRATCH_MNT
> >>
> >> -# remove all file/dir other than subvolume
> >> -rm -rf $SCRATCH_MNT/snap1/* >& /dev/null
> >> -rm -rf $SCRATCH_MNT/p* >& /dev/null
> >> -
> >> -_run_btrfs_util_prog filesystem sync $SCRATCH_MNT
> >> -units=`_btrfs_qgroup_units`
> >> -$BTRFS_UTIL_PROG qgroup show $units $SCRATCH_MNT | $SED_PROG -n 
> >> '/[0-9]/p' \
> >> -   | $AWK_PROG '{print $2" "$3}'
> >> +echo "Silence is golden"
> >> +# btrfs check will detect any qgroup number mismatch.
> >>
> >>  status=0
> >>  exit
> >> diff --git a/tests/btrfs/057.out b/tests/btrfs/057.out
> >> index 60cb92d0..185023c7 100644
> >> --- a/tests/btrfs/057.out
> >> +++ b/tests/btrfs/057.out
> >> @@ -1,3 +1,2 @@
> >>  QA output created by 057
> >> -4096 4096
> >> -4096 4096
> >> +Silence is golden
> >> --
> >> 2.18.0
> >>
> >
> >
>


-- 
Filipe David Manana,

“Whether you think you can, or you think you can't — you're right.”


Re: [PATCH] fstests: btrfs/057: Fix false alerts due to orphan files

2018-10-29 Thread Filipe Manana
On Mon, Oct 29, 2018 at 6:31 AM Qu Wenruo  wrote:
>
> For latest kernel, there is a chance that btrfs/057 reports false
> errors.

By latest kernel you mean 4.20?

>
> The false error would look like:
>   btrfs/057 4s ... - output mismatch (see 
> /home/adam/xfstests-dev/results//btrfs/057.out.bad)
>   --- tests/btrfs/057.out   2017-08-21 09:25:33.1 +0800
>   +++ /home/adam/xfstests-dev/results//btrfs/057.out.bad2018-10-29 
> 14:07:28.443651293 +0800
>   @@ -1,3 +1,3 @@
>QA output created by 057
>4096 4096
>   -4096 4096
>   +28672 28672
>
> This is related to the fact that "btrfs subvolume sync" (or a
> vanilla sync) does not ensure that orphan (unlinked but still existing)
> files are removed.

So when did that happen, which commit introduced the behaviour change?

>
> In fact, for that false error case, if the fs is inspected after umount,
> its qgroup numbers are correct and btrfs check won't report a qgroup error.
>
> To fix the false alerts, just skip any manual qgroup number comparison,
> and let the fsck done after the test case detect any problem.
>
> This also eliminates the need for the specific mount and mkfs
> options, allowing us to improve coverage.
>
> Reported-by: Nikolay Borisov 
> Signed-off-by: Qu Wenruo 
> ---
>  tests/btrfs/057 | 17 -
>  tests/btrfs/057.out |  3 +--
>  2 files changed, 5 insertions(+), 15 deletions(-)
>
> diff --git a/tests/btrfs/057 b/tests/btrfs/057
> index b019f4e1..0b5a36d3 100755
> --- a/tests/btrfs/057
> +++ b/tests/btrfs/057
> @@ -33,12 +33,9 @@ _require_scratch
>  rm -f $seqres.full
>
>  # use small leaf size to get higher btree height.
> -run_check _scratch_mkfs "-b 1g --nodesize 4096"
> +run_check _scratch_mkfs "-b 1g"
>
> -# inode cache is saved in the FS tree itself for every
> -# individual FS tree,that affects the sizes reported by qgroup show
> -# so we need to explicitly turn it off to get consistent values.
> -_scratch_mount "-o noinode_cache"
> +_scratch_mount
>
>  # -w ensures that the only ops are ones which cause write I/O
>  run_check $FSSTRESS_PROG -d $SCRATCH_MNT -w -p 5 -n 1000 \
> @@ -53,14 +50,8 @@ run_check $FSSTRESS_PROG -d $SCRATCH_MNT/snap1 -w -p 5 -n 
> 1000 \
>  _run_btrfs_util_prog quota enable $SCRATCH_MNT
>  _run_btrfs_util_prog quota rescan -w $SCRATCH_MNT
>
> -# remove all file/dir other than subvolume
> -rm -rf $SCRATCH_MNT/snap1/* >& /dev/null
> -rm -rf $SCRATCH_MNT/p* >& /dev/null
> -
> -_run_btrfs_util_prog filesystem sync $SCRATCH_MNT
> -units=`_btrfs_qgroup_units`
> -$BTRFS_UTIL_PROG qgroup show $units $SCRATCH_MNT | $SED_PROG -n '/[0-9]/p' \
> -   | $AWK_PROG '{print $2" "$3}'
> +echo "Silence is golden"
> +# btrfs check will detect any qgroup number mismatch.
>
>  status=0
>  exit
> diff --git a/tests/btrfs/057.out b/tests/btrfs/057.out
> index 60cb92d0..185023c7 100644
> --- a/tests/btrfs/057.out
> +++ b/tests/btrfs/057.out
> @@ -1,3 +1,2 @@
>  QA output created by 057
> -4096 4096
> -4096 4096
> +Silence is golden
> --
> 2.18.0
>


-- 
Filipe David Manana,

“Whether you think you can, or you think you can't — you're right.”


Re: [PATCH v4] Btrfs: fix deadlock on tree root leaf when finding free extent

2018-10-24 Thread Filipe Manana
On Wed, Oct 24, 2018 at 1:40 PM Josef Bacik  wrote:
>
> On Wed, Oct 24, 2018 at 12:53:59PM +0100, Filipe Manana wrote:
> > On Wed, Oct 24, 2018 at 12:37 PM Josef Bacik  wrote:
> > >
> > > On Wed, Oct 24, 2018 at 10:13:03AM +0100, fdman...@kernel.org wrote:
> > > > From: Filipe Manana 
> > > >
> > > > When we are writing out a free space cache, during the transaction 
> > > > commit
> > > > phase, we can end up in a deadlock which results in a stack trace like 
> > > > the
> > > > following:
> > > >
> > > >  schedule+0x28/0x80
> > > >  btrfs_tree_read_lock+0x8e/0x120 [btrfs]
> > > >  ? finish_wait+0x80/0x80
> > > >  btrfs_read_lock_root_node+0x2f/0x40 [btrfs]
> > > >  btrfs_search_slot+0xf6/0x9f0 [btrfs]
> > > >  ? evict_refill_and_join+0xd0/0xd0 [btrfs]
> > > >  ? inode_insert5+0x119/0x190
> > > >  btrfs_lookup_inode+0x3a/0xc0 [btrfs]
> > > >  ? kmem_cache_alloc+0x166/0x1d0
> > > >  btrfs_iget+0x113/0x690 [btrfs]
> > > >  __lookup_free_space_inode+0xd8/0x150 [btrfs]
> > > >  lookup_free_space_inode+0x5b/0xb0 [btrfs]
> > > >  load_free_space_cache+0x7c/0x170 [btrfs]
> > > >  ? cache_block_group+0x72/0x3b0 [btrfs]
> > > >  cache_block_group+0x1b3/0x3b0 [btrfs]
> > > >  ? finish_wait+0x80/0x80
> > > >  find_free_extent+0x799/0x1010 [btrfs]
> > > >  btrfs_reserve_extent+0x9b/0x180 [btrfs]
> > > >  btrfs_alloc_tree_block+0x1b3/0x4f0 [btrfs]
> > > >  __btrfs_cow_block+0x11d/0x500 [btrfs]
> > > >  btrfs_cow_block+0xdc/0x180 [btrfs]
> > > >  btrfs_search_slot+0x3bd/0x9f0 [btrfs]
> > > >  btrfs_lookup_inode+0x3a/0xc0 [btrfs]
> > > >  ? kmem_cache_alloc+0x166/0x1d0
> > > >  btrfs_update_inode_item+0x46/0x100 [btrfs]
> > > >  cache_save_setup+0xe4/0x3a0 [btrfs]
> > > >  btrfs_start_dirty_block_groups+0x1be/0x480 [btrfs]
> > > >  btrfs_commit_transaction+0xcb/0x8b0 [btrfs]
> > > >
> > > > At cache_save_setup() we need to update the inode item of a block 
> > > > group's
> > > > cache which is located in the tree root (fs_info->tree_root), which 
> > > > means
> > > > that it may result in COWing a leaf from that tree. If that happens we
> > > > need to find a free metadata extent and while looking for one, if we 
> > > > find
> > > > a block group which was not cached yet we attempt to load its cache by
> > > > calling cache_block_group(). However this function will try to load the
> > > > inode of the free space cache, which requires finding the matching inode
> > > > item in the tree root - if that inode item is located in the same leaf 
> > > > as
> > > > the inode item of the space cache we are updating at cache_save_setup(),
> > > > we end up in a deadlock, since we try to obtain a read lock on the same
> > > > extent buffer that we previously write locked.
> > > >
> > > > So fix this by using the tree root's commit root when searching for a
> > > > block group's free space cache inode item when we are attempting to load
> > > > a free space cache. This is safe since block groups once loaded stay in
> > > > memory forever, as well as their caches, so after they are first loaded
> > > > we will never need to read their inode items again. For new block 
> > > > groups,
> > > > once they are created they get their ->cached field set to
> > > > BTRFS_CACHE_FINISHED meaning we will not need to read their inode item.
> > > >
> > > > Reported-by: Andrew Nelson 
> > > > Link: 
> > > > https://lore.kernel.org/linux-btrfs/captelenq9x5kowuq+fa7h1r3nsjg8vyith8+ifjurc_duhh...@mail.gmail.com/
> > > > Fixes: 9d66e233c704 ("Btrfs: load free space cache if it exists")
> > > > Tested-by: Andrew Nelson 
> > > > Signed-off-by: Filipe Manana 
> > > > ---
> > > >
> > >
> > > Now my goal is to see how many times I can get you to redo this thing.
> > >
> > > Why not instead just do
> > >
> > > if (btrfs_is_free_space_inode(inode))
> > >   path->search_commit_root = 1;
> > >
> > > in read_locked_inode?  That would be cleaner.  If we don't want to do 
> > > that for
> > > the inode cache (I'm not sure if it's ok or not) we could just do
> > >
> > > if (root == fs_info->tree_root)
> >
> > We can't (not just that at least).
> > Tried something like that, but we get into a BUG_ON when writing out
> > the space cache for new block groups (created in the current
> > transaction).
> > Because at cache_save_setup() we have this:
> >
> > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/btrfs/extent-tree.c?h=v4.19#n3342
> >
> > We look up the inode in the normal root; it doesn't exist, so we create
> > it and then repeat - if it's still not found, BUG_ON.
> > Could also make create_free_space_inode() return an inode pointer and
> > make it call btrfs_iget().
> >
>
> Ah ok makes sense.  Well in that case lets just make btrfs_read_locked_inode()
> take a path, and allocate it in btrfs_iget, that'll remove the ugly
>
> if (path != in_path)

You mean the following on top of v4:

https://friendpaste.com/6XrGXb5p0RSJGixUFYouHg

Not much different, just saves one such if statement. I'm ok with that.

>
> stuff.  Thanks,
>
> Josef


Re: [PATCH v4] Btrfs: fix deadlock on tree root leaf when finding free extent

2018-10-24 Thread Filipe Manana
On Wed, Oct 24, 2018 at 12:37 PM Josef Bacik  wrote:
>
> On Wed, Oct 24, 2018 at 10:13:03AM +0100, fdman...@kernel.org wrote:
> > From: Filipe Manana 
> >
> > When we are writing out a free space cache, during the transaction commit
> > phase, we can end up in a deadlock which results in a stack trace like the
> > following:
> >
> >  schedule+0x28/0x80
> >  btrfs_tree_read_lock+0x8e/0x120 [btrfs]
> >  ? finish_wait+0x80/0x80
> >  btrfs_read_lock_root_node+0x2f/0x40 [btrfs]
> >  btrfs_search_slot+0xf6/0x9f0 [btrfs]
> >  ? evict_refill_and_join+0xd0/0xd0 [btrfs]
> >  ? inode_insert5+0x119/0x190
> >  btrfs_lookup_inode+0x3a/0xc0 [btrfs]
> >  ? kmem_cache_alloc+0x166/0x1d0
> >  btrfs_iget+0x113/0x690 [btrfs]
> >  __lookup_free_space_inode+0xd8/0x150 [btrfs]
> >  lookup_free_space_inode+0x5b/0xb0 [btrfs]
> >  load_free_space_cache+0x7c/0x170 [btrfs]
> >  ? cache_block_group+0x72/0x3b0 [btrfs]
> >  cache_block_group+0x1b3/0x3b0 [btrfs]
> >  ? finish_wait+0x80/0x80
> >  find_free_extent+0x799/0x1010 [btrfs]
> >  btrfs_reserve_extent+0x9b/0x180 [btrfs]
> >  btrfs_alloc_tree_block+0x1b3/0x4f0 [btrfs]
> >  __btrfs_cow_block+0x11d/0x500 [btrfs]
> >  btrfs_cow_block+0xdc/0x180 [btrfs]
> >  btrfs_search_slot+0x3bd/0x9f0 [btrfs]
> >  btrfs_lookup_inode+0x3a/0xc0 [btrfs]
> >  ? kmem_cache_alloc+0x166/0x1d0
> >  btrfs_update_inode_item+0x46/0x100 [btrfs]
> >  cache_save_setup+0xe4/0x3a0 [btrfs]
> >  btrfs_start_dirty_block_groups+0x1be/0x480 [btrfs]
> >  btrfs_commit_transaction+0xcb/0x8b0 [btrfs]
> >
> > At cache_save_setup() we need to update the inode item of a block group's
> > cache which is located in the tree root (fs_info->tree_root), which means
> > that it may result in COWing a leaf from that tree. If that happens we
> > need to find a free metadata extent and while looking for one, if we find
> > a block group which was not cached yet we attempt to load its cache by
> > calling cache_block_group(). However this function will try to load the
> > inode of the free space cache, which requires finding the matching inode
> > item in the tree root - if that inode item is located in the same leaf as
> > the inode item of the space cache we are updating at cache_save_setup(),
> > we end up in a deadlock, since we try to obtain a read lock on the same
> > extent buffer that we previously write locked.
> >
> > So fix this by using the tree root's commit root when searching for a
> > block group's free space cache inode item when we are attempting to load
> > a free space cache. This is safe since block groups once loaded stay in
> > memory forever, as well as their caches, so after they are first loaded
> > we will never need to read their inode items again. For new block groups,
> > once they are created they get their ->cached field set to
> > BTRFS_CACHE_FINISHED meaning we will not need to read their inode item.
> >
> > Reported-by: Andrew Nelson 
> > Link: 
> > https://lore.kernel.org/linux-btrfs/captelenq9x5kowuq+fa7h1r3nsjg8vyith8+ifjurc_duhh...@mail.gmail.com/
> > Fixes: 9d66e233c704 ("Btrfs: load free space cache if it exists")
> > Tested-by: Andrew Nelson 
> > Signed-off-by: Filipe Manana 
> > ---
> >
>
> Now my goal is to see how many times I can get you to redo this thing.
>
> Why not instead just do
>
> if (btrfs_is_free_space_inode(inode))
>   path->search_commit_root = 1;
>
> in read_locked_inode?  That would be cleaner.  If we don't want to do that for
> the inode cache (I'm not sure if it's ok or not) we could just do
>
> if (root == fs_info->tree_root)

We can't (not just that at least).
Tried something like that, but we get into a BUG_ON when writing out
the space cache for new block groups (created in the current
transaction).
Because at cache_save_setup() we have this:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/btrfs/extent-tree.c?h=v4.19#n3342

We look up the inode in the normal root; it doesn't exist, so we create it
and then repeat - if it's still not found, BUG_ON.
Could also make create_free_space_inode() return an inode pointer and
make it call btrfs_iget().
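
For reference, a rough sketch of the commit-root idea in isolation
(search_commit_root and skip_locking are real btrfs_path fields; the
helper and its exact shape are just an illustration, not the code that
was merged):

    /* Read the free space cache inode item from the tree root's commit
     * root, so the search takes no locks on (and never COWs) a leaf the
     * running transaction may already have write-locked. */
    static int lookup_free_space_inode_item(struct btrfs_root *tree_root,
                                            struct btrfs_key *location,
                                            struct btrfs_path *path)
    {
            path->search_commit_root = 1;
            path->skip_locking = 1;
            /* NULL trans handle and cow == 0: read-only search. */
            return btrfs_search_slot(NULL, tree_root, location, path, 0, 0);
    }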

>
> instead.  Thanks,
>
> Josef


Re: [PATCH v3] Btrfs: fix deadlock on tree root leaf when finding free extent

2018-10-24 Thread Filipe Manana
On Wed, Oct 24, 2018 at 5:08 AM Josef Bacik  wrote:
>
> On Mon, Oct 22, 2018 at 11:05:08PM +0100, Filipe Manana wrote:
> > On Mon, Oct 22, 2018 at 8:18 PM Josef Bacik  wrote:
> > >
> > > On Mon, Oct 22, 2018 at 08:10:37PM +0100, fdman...@kernel.org wrote:
> > > > From: Filipe Manana 
> > > >
> > > > When we are writing out a free space cache, during the transaction 
> > > > commit
> > > > phase, we can end up in a deadlock which results in a stack trace like 
> > > > the
> > > > following:
> > > >
> > > >  schedule+0x28/0x80
> > > >  btrfs_tree_read_lock+0x8e/0x120 [btrfs]
> > > >  ? finish_wait+0x80/0x80
> > > >  btrfs_read_lock_root_node+0x2f/0x40 [btrfs]
> > > >  btrfs_search_slot+0xf6/0x9f0 [btrfs]
> > > >  ? evict_refill_and_join+0xd0/0xd0 [btrfs]
> > > >  ? inode_insert5+0x119/0x190
> > > >  btrfs_lookup_inode+0x3a/0xc0 [btrfs]
> > > >  ? kmem_cache_alloc+0x166/0x1d0
> > > >  btrfs_iget+0x113/0x690 [btrfs]
> > > >  __lookup_free_space_inode+0xd8/0x150 [btrfs]
> > > >  lookup_free_space_inode+0x5b/0xb0 [btrfs]
> > > >  load_free_space_cache+0x7c/0x170 [btrfs]
> > > >  ? cache_block_group+0x72/0x3b0 [btrfs]
> > > >  cache_block_group+0x1b3/0x3b0 [btrfs]
> > > >  ? finish_wait+0x80/0x80
> > > >  find_free_extent+0x799/0x1010 [btrfs]
> > > >  btrfs_reserve_extent+0x9b/0x180 [btrfs]
> > > >  btrfs_alloc_tree_block+0x1b3/0x4f0 [btrfs]
> > > >  __btrfs_cow_block+0x11d/0x500 [btrfs]
> > > >  btrfs_cow_block+0xdc/0x180 [btrfs]
> > > >  btrfs_search_slot+0x3bd/0x9f0 [btrfs]
> > > >  btrfs_lookup_inode+0x3a/0xc0 [btrfs]
> > > >  ? kmem_cache_alloc+0x166/0x1d0
> > > >  btrfs_update_inode_item+0x46/0x100 [btrfs]
> > > >  cache_save_setup+0xe4/0x3a0 [btrfs]
> > > >  btrfs_start_dirty_block_groups+0x1be/0x480 [btrfs]
> > > >  btrfs_commit_transaction+0xcb/0x8b0 [btrfs]
> > > >
> > > > At cache_save_setup() we need to update the inode item of a block 
> > > > group's
> > > > cache which is located in the tree root (fs_info->tree_root), which 
> > > > means
> > > > that it may result in COWing a leaf from that tree. If that happens we
> > > > need to find a free metadata extent and while looking for one, if we 
> > > > find
> > > > a block group which was not cached yet we attempt to load its cache by
> > > > calling cache_block_group(). However this function will try to load the
> > > > inode of the free space cache, which requires finding the matching inode
> > > > item in the tree root - if that inode item is located in the same leaf 
> > > > as
> > > > the inode item of the space cache we are updating at cache_save_setup(),
> > > > we end up in a deadlock, since we try to obtain a read lock on the same
> > > > extent buffer that we previously write locked.
> > > >
> > > > So fix this by skipping the loading of free space caches of any block
> > > > groups that are not yet cached (rare cases) if we are COWing an extent
> > > > buffer from the root tree and space caching is enabled (-o space_cache
> > > > mount option). This is a rare case and its downside is failure to
> > > > find a free extent (return -ENOSPC) when all the already cached block
> > > > groups have no free extents.
> > > >
> > > > Reported-by: Andrew Nelson 
> > > > Link: 
> > > > https://lore.kernel.org/linux-btrfs/captelenq9x5kowuq+fa7h1r3nsjg8vyith8+ifjurc_duhh...@mail.gmail.com/
> > > > Fixes: 9d66e233c704 ("Btrfs: load free space cache if it exists")
> > > > Tested-by: Andrew Nelson 
> > > > Signed-off-by: Filipe Manana 
> > >
> > > Great, thanks,
> > >
> > > Reviewed-by: Josef Bacik 
> >
> > So this makes many fstests occasionally fail with aborted transaction
> > due to ENOSPC.
> > It's late and I haven't verified yet, but I suppose this is because we
> > always skip loading the cache regardless of currently being COWing an
> > existing leaf or allocating a new one (growing the tree).
> > Needs to be fixed.
> >
>
> How about we just use path->search_commit_root?  If we're loading the cache we
> just want the last committed version, it's not like we read it after we've
> written it.  Then we can go back to business as usual.  Thanks,

Yeah, that works. I had that idea before sending v1, but it felt a bit
dirty at the time.
I left fstests running overnight using the commit root approach and
everything seems fine. Sending a v4.
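
Roughly, the idea looks like this (an illustrative sketch only, not the
exact v4 patch - the is_free_space_inode condition, the skip_locking
and the NULL trans are just assumptions about how it gets wired up):

        path = btrfs_alloc_path();
        if (!path)
                return -ENOMEM;
        if (is_free_space_inode) {
                /*
                 * We only ever read the free space cache inode item here,
                 * and we never need to read it back after writing it in the
                 * same transaction, so the last committed version is
                 * enough.  Searching the commit root (and skipping the
                 * locking) means we never touch the write locked leaf of
                 * the current tree root.
                 */
                path->search_commit_root = 1;
                path->skip_locking = 1;
        }
        /* location is the inode's key (BTRFS_I(inode)->location). */
        ret = btrfs_lookup_inode(NULL, root, path, &location, 0);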

Another alternative, which would solve similar problems for any tree,
would be to allow taking a read lock on an eb that is already (spin)
write locked by the same task, just like we already do for blocking
write locks: https://friendpaste.com/6XrGXb5p0RSJGixUFZ8lCt
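
For context, the existing same-task handling in btrfs_tree_read_lock()
is roughly the following (paraphrased from memory, so details may be
off); the idea would be to also allow it while the write lock is still
spinning, instead of only when it has gone blocking:

        read_lock(&eb->lock);
        if (atomic_read(&eb->blocking_writers) &&
            current->pid == eb->lock_owner) {
                /*
                 * The eb is write locked by our own task and that write
                 * lock is blocking, so taking a nested read lock here is
                 * safe.
                 */
                BUG_ON(eb->lock_nested);
                eb->lock_nested = 1;
                read_unlock(&eb->lock);
                return;
        }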

thanks


>
> Josef


Re: [PATCH v3] Btrfs: fix deadlock on tree root leaf when finding free extent

2018-10-22 Thread Filipe Manana
On Mon, Oct 22, 2018 at 8:18 PM Josef Bacik  wrote:
>
> On Mon, Oct 22, 2018 at 08:10:37PM +0100, fdman...@kernel.org wrote:
> > From: Filipe Manana 
> >
> > When we are writing out a free space cache, during the transaction commit
> > phase, we can end up in a deadlock which results in a stack trace like the
> > following:
> >
> >  schedule+0x28/0x80
> >  btrfs_tree_read_lock+0x8e/0x120 [btrfs]
> >  ? finish_wait+0x80/0x80
> >  btrfs_read_lock_root_node+0x2f/0x40 [btrfs]
> >  btrfs_search_slot+0xf6/0x9f0 [btrfs]
> >  ? evict_refill_and_join+0xd0/0xd0 [btrfs]
> >  ? inode_insert5+0x119/0x190
> >  btrfs_lookup_inode+0x3a/0xc0 [btrfs]
> >  ? kmem_cache_alloc+0x166/0x1d0
> >  btrfs_iget+0x113/0x690 [btrfs]
> >  __lookup_free_space_inode+0xd8/0x150 [btrfs]
> >  lookup_free_space_inode+0x5b/0xb0 [btrfs]
> >  load_free_space_cache+0x7c/0x170 [btrfs]
> >  ? cache_block_group+0x72/0x3b0 [btrfs]
> >  cache_block_group+0x1b3/0x3b0 [btrfs]
> >  ? finish_wait+0x80/0x80
> >  find_free_extent+0x799/0x1010 [btrfs]
> >  btrfs_reserve_extent+0x9b/0x180 [btrfs]
> >  btrfs_alloc_tree_block+0x1b3/0x4f0 [btrfs]
> >  __btrfs_cow_block+0x11d/0x500 [btrfs]
> >  btrfs_cow_block+0xdc/0x180 [btrfs]
> >  btrfs_search_slot+0x3bd/0x9f0 [btrfs]
> >  btrfs_lookup_inode+0x3a/0xc0 [btrfs]
> >  ? kmem_cache_alloc+0x166/0x1d0
> >  btrfs_update_inode_item+0x46/0x100 [btrfs]
> >  cache_save_setup+0xe4/0x3a0 [btrfs]
> >  btrfs_start_dirty_block_groups+0x1be/0x480 [btrfs]
> >  btrfs_commit_transaction+0xcb/0x8b0 [btrfs]
> >
> > At cache_save_setup() we need to update the inode item of a block group's
> > cache which is located in the tree root (fs_info->tree_root), which means
> > that it may result in COWing a leaf from that tree. If that happens we
> > need to find a free metadata extent and while looking for one, if we find
> > a block group which was not cached yet we attempt to load its cache by
> > calling cache_block_group(). However this function will try to load the
> > inode of the free space cache, which requires finding the matching inode
> > item in the tree root - if that inode item is located in the same leaf as
> > the inode item of the space cache we are updating at cache_save_setup(),
> > we end up in a deadlock, since we try to obtain a read lock on the same
> > extent buffer that we previously write locked.
> >
> > So fix this by skipping the loading of free space caches of any block
> > groups that are not yet cached (rare cases) if we are COWing an extent
> > buffer from the root tree and space caching is enabled (-o space_cache
> > mount option). This is a rare case and its downside is failure to
> > find a free extent (return -ENOSPC) when all the already cached block
> > groups have no free extents.
> >
> > Reported-by: Andrew Nelson 
> > Link: 
> > https://lore.kernel.org/linux-btrfs/captelenq9x5kowuq+fa7h1r3nsjg8vyith8+ifjurc_duhh...@mail.gmail.com/
> > Fixes: 9d66e233c704 ("Btrfs: load free space cache if it exists")
> > Tested-by: Andrew Nelson 
> > Signed-off-by: Filipe Manana 
>
> Great, thanks,
>
> Reviewed-by: Josef Bacik 

So this makes many fstests occasionally fail with an aborted transaction
due to ENOSPC.
It's late and I haven't verified it yet, but I suppose this is because we
always skip loading the cache, regardless of whether we are COWing an
existing leaf or allocating a new one (growing the tree).
Needs to be fixed.

>
> Josef


Re: [PATCH v2] Btrfs: fix deadlock on tree root leaf when finding free extent

2018-10-22 Thread Filipe Manana
On Mon, Oct 22, 2018 at 7:56 PM Josef Bacik  wrote:
>
> On Mon, Oct 22, 2018 at 07:48:30PM +0100, fdman...@kernel.org wrote:
> > From: Filipe Manana 
> >
> > When we are writing out a free space cache, during the transaction commit
> > phase, we can end up in a deadlock which results in a stack trace like the
> > following:
> >
> >  schedule+0x28/0x80
> >  btrfs_tree_read_lock+0x8e/0x120 [btrfs]
> >  ? finish_wait+0x80/0x80
> >  btrfs_read_lock_root_node+0x2f/0x40 [btrfs]
> >  btrfs_search_slot+0xf6/0x9f0 [btrfs]
> >  ? evict_refill_and_join+0xd0/0xd0 [btrfs]
> >  ? inode_insert5+0x119/0x190
> >  btrfs_lookup_inode+0x3a/0xc0 [btrfs]
> >  ? kmem_cache_alloc+0x166/0x1d0
> >  btrfs_iget+0x113/0x690 [btrfs]
> >  __lookup_free_space_inode+0xd8/0x150 [btrfs]
> >  lookup_free_space_inode+0x5b/0xb0 [btrfs]
> >  load_free_space_cache+0x7c/0x170 [btrfs]
> >  ? cache_block_group+0x72/0x3b0 [btrfs]
> >  cache_block_group+0x1b3/0x3b0 [btrfs]
> >  ? finish_wait+0x80/0x80
> >  find_free_extent+0x799/0x1010 [btrfs]
> >  btrfs_reserve_extent+0x9b/0x180 [btrfs]
> >  btrfs_alloc_tree_block+0x1b3/0x4f0 [btrfs]
> >  __btrfs_cow_block+0x11d/0x500 [btrfs]
> >  btrfs_cow_block+0xdc/0x180 [btrfs]
> >  btrfs_search_slot+0x3bd/0x9f0 [btrfs]
> >  btrfs_lookup_inode+0x3a/0xc0 [btrfs]
> >  ? kmem_cache_alloc+0x166/0x1d0
> >  btrfs_update_inode_item+0x46/0x100 [btrfs]
> >  cache_save_setup+0xe4/0x3a0 [btrfs]
> >  btrfs_start_dirty_block_groups+0x1be/0x480 [btrfs]
> >  btrfs_commit_transaction+0xcb/0x8b0 [btrfs]
> >
> > At cache_save_setup() we need to update the inode item of a block group's
> > cache which is located in the tree root (fs_info->tree_root), which means
> > that it may result in COWing a leaf from that tree. If that happens we
> > need to find a free metadata extent and while looking for one, if we find
> > a block group which was not cached yet we attempt to load its cache by
> > calling cache_block_group(). However this function will try to load the
> > inode of the free space cache, which requires finding the matching inode
> > item in the tree root - if that inode item is located in the same leaf as
> > the inode item of the space cache we are updating at cache_save_setup(),
> > we end up in a deadlock, since we try to obtain a read lock on the same
> > extent buffer that we previously write locked.
> >
> > So fix this by skipping the loading of free space caches of any block
> > groups that are not yet cached (rare cases) if we are COWing an extent
> > buffer from the root tree and space caching is enabled (-o space_cache
> > mount option). This is a rare case and its downside is failure to
> > find a free extent (return -ENOSPC) when all the already cached block
> > groups have no free extents.
> >
> > Reported-by: Andrew Nelson 
> > Link: 
> > https://lore.kernel.org/linux-btrfs/captelenq9x5kowuq+fa7h1r3nsjg8vyith8+ifjurc_duhh...@mail.gmail.com/
> > Fixes: 9d66e233c704 ("Btrfs: load free space cache if it exists")
> > Tested-by: Andrew Nelson 
> > Signed-off-by: Filipe Manana 
> > ---
> >
> > V2: Made the solution more generic, since the problem could happen in any
> > path COWing an extent buffer from the root tree.
> >
> > Applies on top of a previous patch titled:
> >
> >  "Btrfs: fix deadlock when writing out free space caches"
> >
> >  fs/btrfs/ctree.c   |  4 
> >  fs/btrfs/ctree.h   |  3 +++
> >  fs/btrfs/disk-io.c |  2 ++
> >  fs/btrfs/extent-tree.c | 15 ++-
> >  4 files changed, 23 insertions(+), 1 deletion(-)
> >
> > diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
> > index 089b46c4d97f..646aafda55a3 100644
> > --- a/fs/btrfs/ctree.c
> > +++ b/fs/btrfs/ctree.c
> > @@ -1065,10 +1065,14 @@ static noinline int __btrfs_cow_block(struct 
> > btrfs_trans_handle *trans,
> >   root == fs_info->chunk_root ||
> >   root == fs_info->dev_root)
> >   trans->can_flush_pending_bgs = false;
> > + else if (root == fs_info->tree_root)
> > + atomic_inc(&fs_info->tree_root_cows);
> >
> >   cow = btrfs_alloc_tree_block(trans, root, parent_start,
> >   root->root_key.objectid, &disk_key, level,
> >   search_start, empty_size);
> > + if (root == fs_info->tree_root)
> > + atomic_dec(&fs_info->tree_root_cows);
>
> Do we need this though?  Our root should be the root we're cow'ing the block
> for, and it should be passed all the way down to find_free_extent properly, so
> we really should be able to just do if (root == fs_info->tree_root) and not 
> add
> all this stuff.

Oops, I missed that we could actually pass the root down to find_free_extent().
That's why I made the atomic thing.

Sending v3, thanks.

>
> Not to mention this will race with anybody else doing stuff, so if another
> thread that isn't actually touching the tree_root it would skip caching a 
> block
> group when it's completely ok for that thread to do it.  Thanks,
>
> Josef


Re: [PATCH] Btrfs: fix deadlock on tree root leaf when finding free extent

2018-10-22 Thread Filipe Manana
On Mon, Oct 22, 2018 at 7:07 PM Josef Bacik  wrote:
>
> On Mon, Oct 22, 2018 at 10:09:46AM +0100, fdman...@kernel.org wrote:
> > From: Filipe Manana 
> >
> > When we are writing out a free space cache, during the transaction commit
> > phase, we can end up in a deadlock which results in a stack trace like the
> > following:
> >
> >  schedule+0x28/0x80
> >  btrfs_tree_read_lock+0x8e/0x120 [btrfs]
> >  ? finish_wait+0x80/0x80
> >  btrfs_read_lock_root_node+0x2f/0x40 [btrfs]
> >  btrfs_search_slot+0xf6/0x9f0 [btrfs]
> >  ? evict_refill_and_join+0xd0/0xd0 [btrfs]
> >  ? inode_insert5+0x119/0x190
> >  btrfs_lookup_inode+0x3a/0xc0 [btrfs]
> >  ? kmem_cache_alloc+0x166/0x1d0
> >  btrfs_iget+0x113/0x690 [btrfs]
> >  __lookup_free_space_inode+0xd8/0x150 [btrfs]
> >  lookup_free_space_inode+0x5b/0xb0 [btrfs]
> >  load_free_space_cache+0x7c/0x170 [btrfs]
> >  ? cache_block_group+0x72/0x3b0 [btrfs]
> >  cache_block_group+0x1b3/0x3b0 [btrfs]
> >  ? finish_wait+0x80/0x80
> >  find_free_extent+0x799/0x1010 [btrfs]
> >  btrfs_reserve_extent+0x9b/0x180 [btrfs]
> >  btrfs_alloc_tree_block+0x1b3/0x4f0 [btrfs]
> >  __btrfs_cow_block+0x11d/0x500 [btrfs]
> >  btrfs_cow_block+0xdc/0x180 [btrfs]
> >  btrfs_search_slot+0x3bd/0x9f0 [btrfs]
> >  btrfs_lookup_inode+0x3a/0xc0 [btrfs]
> >  ? kmem_cache_alloc+0x166/0x1d0
> >  btrfs_update_inode_item+0x46/0x100 [btrfs]
> >  cache_save_setup+0xe4/0x3a0 [btrfs]
> >  btrfs_start_dirty_block_groups+0x1be/0x480 [btrfs]
> >  btrfs_commit_transaction+0xcb/0x8b0 [btrfs]
> >
> > At cache_save_setup() we need to update the inode item of a block group's
> > cache which is located in the tree root (fs_info->tree_root), which means
> > that it may result in COWing a leaf from that tree. If that happens we
> > need to find a free metadata extent and while looking for one, if we find
> > a block group which was not cached yet we attempt to load its cache by
> > calling cache_block_group(). However this function will try to load the
> > inode of the free space cache, which requires finding the matching inode
> > item in the tree root - if that inode item is located in the same leaf as
> > the inode item of the space cache we are updating at cache_save_setup(),
> > we end up in a deadlock, since we try to obtain a read lock on the same
> > extent buffer that we previously write locked.
> >
> > So fix this by skipping the loading of free space caches of any block
> > groups that are not yet cached (rare cases) if we are updating the inode
> > of a free space cache. This is a rare case and its downside is failure to
> > find a free extent (return -ENOSPC) when all the already cached block
> > groups have no free extents.
> >
>
> Actually isn't this a problem for anything that tries to allocate an extent
> while in the tree_root?  Like we snapshot or make a subvolume or anything?

Indeed. Initially I considered making it more generic (like the recent
fix for the deadlock when COWing from the extent/chunk/device trees),
but I totally forgot about the other cases you mentioned.

>  We
> should just disallow if root == tree_root.  But even then we only need to do
> this if we're using SPACE_CACHE, using the ye-olde caching or the free space
> tree are both ok.  Let's just limit it to those cases.  Thanks,

Yep, makes all sense.

Thanks! V2 sent out.
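
For the record, a minimal sketch of that direction (illustrative only -
v2 tracks it with a counter in fs_info and v3 simply passes the root
down to find_free_extent(), where the check ends up being roughly the
following; as the v3/v4 discussion above shows, this later turned out
to be too aggressive and the final fix used the commit root approach
instead):

        /*
         * While allocating a metadata extent for the tree root with
         * -o space_cache (v1), don't try to load the free space cache of
         * an uncached block group: doing so could read lock a tree root
         * leaf we already hold write locked.  Just skip the block group,
         * which is why ENOSPC becomes possible if all the cached block
         * groups are full.
         */
        if (!block_group_cache_done(block_group) &&
            root == fs_info->tree_root &&
            btrfs_test_opt(fs_info, SPACE_CACHE))
                goto loop;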

>
> Josef


Re: [PATCH] Btrfs: fix deadlock on tree root leaf when finding free extent

2018-10-22 Thread Filipe Manana
On Mon, Oct 22, 2018 at 10:10 AM  wrote:
>
> From: Filipe Manana 
>
> When we are writing out a free space cache, during the transaction commit
> phase, we can end up in a deadlock which results in a stack trace like the
> following:
>
>  schedule+0x28/0x80
>  btrfs_tree_read_lock+0x8e/0x120 [btrfs]
>  ? finish_wait+0x80/0x80
>  btrfs_read_lock_root_node+0x2f/0x40 [btrfs]
>  btrfs_search_slot+0xf6/0x9f0 [btrfs]
>  ? evict_refill_and_join+0xd0/0xd0 [btrfs]
>  ? inode_insert5+0x119/0x190
>  btrfs_lookup_inode+0x3a/0xc0 [btrfs]
>  ? kmem_cache_alloc+0x166/0x1d0
>  btrfs_iget+0x113/0x690 [btrfs]
>  __lookup_free_space_inode+0xd8/0x150 [btrfs]
>  lookup_free_space_inode+0x5b/0xb0 [btrfs]
>  load_free_space_cache+0x7c/0x170 [btrfs]
>  ? cache_block_group+0x72/0x3b0 [btrfs]
>  cache_block_group+0x1b3/0x3b0 [btrfs]
>  ? finish_wait+0x80/0x80
>  find_free_extent+0x799/0x1010 [btrfs]
>  btrfs_reserve_extent+0x9b/0x180 [btrfs]
>  btrfs_alloc_tree_block+0x1b3/0x4f0 [btrfs]
>  __btrfs_cow_block+0x11d/0x500 [btrfs]
>  btrfs_cow_block+0xdc/0x180 [btrfs]
>  btrfs_search_slot+0x3bd/0x9f0 [btrfs]
>  btrfs_lookup_inode+0x3a/0xc0 [btrfs]
>  ? kmem_cache_alloc+0x166/0x1d0
>  btrfs_update_inode_item+0x46/0x100 [btrfs]
>  cache_save_setup+0xe4/0x3a0 [btrfs]
>  btrfs_start_dirty_block_groups+0x1be/0x480 [btrfs]
>  btrfs_commit_transaction+0xcb/0x8b0 [btrfs]
>
> At cache_save_setup() we need to update the inode item of a block group's
> cache which is located in the tree root (fs_info->tree_root), which means
> that it may result in COWing a leaf from that tree. If that happens we
> need to find a free metadata extent and while looking for one, if we find
> a block group which was not cached yet we attempt to load its cache by
> calling cache_block_group(). However this function will try to load the
> inode of the free space cache, which requires finding the matching inode
> item in the tree root - if that inode item is located in the same leaf as
> the inode item of the space cache we are updating at cache_save_setup(),
> we end up in a deadlock, since we try to obtain a read lock on the same
> extent buffer that we previously write locked.
>
> So fix this by skipping the loading of free space caches of any block
> groups that are not yet cached (rare cases) if we are updating the inode
> of a free space cache. This is a rare case and its downside is failure to
> find a free extent (return -ENOSPC) when all the already cached block
> groups have no free extents.
>
> Reported-by: Andrew Nelson 
> Link: 
> https://lore.kernel.org/linux-btrfs/captelenq9x5kowuq+fa7h1r3nsjg8vyith8+ifjurc_duhh...@mail.gmail.com/
> Fixes: 9d66e233c704 ("Btrfs: load free space cache if it exists")
> Signed-off-by: Filipe Manana 

Tested-by: Andrew Nelson 

> ---
>  fs/btrfs/ctree.h   |  3 +++
>  fs/btrfs/disk-io.c |  2 ++
>  fs/btrfs/extent-tree.c | 22 +-
>  3 files changed, 26 insertions(+), 1 deletion(-)
>
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index 2cddfe7806a4..d23ee26eb17d 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -1121,6 +1121,9 @@ struct btrfs_fs_info {
> u32 sectorsize;
> u32 stripesize;
>
> +   /* The task currently updating a free space cache inode item. */
> +   struct task_struct *space_cache_updater;
> +
>  #ifdef CONFIG_BTRFS_FS_REF_VERIFY
> spinlock_t ref_verify_lock;
> struct rb_root block_tree;
> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index 05dc3c17cb62..aa5e9a91e560 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -2782,6 +2782,8 @@ int open_ctree(struct super_block *sb,
> fs_info->sectorsize = 4096;
> fs_info->stripesize = 4096;
>
> +   fs_info->space_cache_updater = NULL;
> +
> ret = btrfs_alloc_stripe_hash_table(fs_info);
> if (ret) {
> err = ret;
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index 577878324799..e93040449771 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -3364,7 +3364,9 @@ static int cache_save_setup(struct 
> btrfs_block_group_cache *block_group,
>  * time.
>  */
> BTRFS_I(inode)->generation = 0;
> +   fs_info->space_cache_updater = current;
> ret = btrfs_update_inode(trans, root, inode);
> +   fs_info->space_cache_updater = NULL;
> if (ret) {
> /*
>  * So theoretically we could recover from this, simply set the
> @@ -7366,7 +7368,25 @@ static noinline int find_free_extent(struct 
> btrfs_fs_info *fs_info,
>
>  have_bl

Re: [PATCH] Btrfs: fix deadlock on tree root leaf when finding free extent

2018-10-22 Thread Filipe Manana
On Mon, Oct 22, 2018 at 10:10 AM  wrote:
>
> From: Filipe Manana 
>
> When we are writing out a free space cache, during the transaction commit
> phase, we can end up in a deadlock which results in a stack trace like the
> following:
>
>  schedule+0x28/0x80
>  btrfs_tree_read_lock+0x8e/0x120 [btrfs]
>  ? finish_wait+0x80/0x80
>  btrfs_read_lock_root_node+0x2f/0x40 [btrfs]
>  btrfs_search_slot+0xf6/0x9f0 [btrfs]
>  ? evict_refill_and_join+0xd0/0xd0 [btrfs]
>  ? inode_insert5+0x119/0x190
>  btrfs_lookup_inode+0x3a/0xc0 [btrfs]
>  ? kmem_cache_alloc+0x166/0x1d0
>  btrfs_iget+0x113/0x690 [btrfs]
>  __lookup_free_space_inode+0xd8/0x150 [btrfs]
>  lookup_free_space_inode+0x5b/0xb0 [btrfs]
>  load_free_space_cache+0x7c/0x170 [btrfs]
>  ? cache_block_group+0x72/0x3b0 [btrfs]
>  cache_block_group+0x1b3/0x3b0 [btrfs]
>  ? finish_wait+0x80/0x80
>  find_free_extent+0x799/0x1010 [btrfs]
>  btrfs_reserve_extent+0x9b/0x180 [btrfs]
>  btrfs_alloc_tree_block+0x1b3/0x4f0 [btrfs]
>  __btrfs_cow_block+0x11d/0x500 [btrfs]
>  btrfs_cow_block+0xdc/0x180 [btrfs]
>  btrfs_search_slot+0x3bd/0x9f0 [btrfs]
>  btrfs_lookup_inode+0x3a/0xc0 [btrfs]
>  ? kmem_cache_alloc+0x166/0x1d0
>  btrfs_update_inode_item+0x46/0x100 [btrfs]
>  cache_save_setup+0xe4/0x3a0 [btrfs]
>  btrfs_start_dirty_block_groups+0x1be/0x480 [btrfs]
>  btrfs_commit_transaction+0xcb/0x8b0 [btrfs]
>
> At cache_save_setup() we need to update the inode item of a block group's
> cache which is located in the tree root (fs_info->tree_root), which means
> that it may result in COWing a leaf from that tree. If that happens we
> need to find a free metadata extent and while looking for one, if we find
> a block group which was not cached yet we attempt to load its cache by
> calling cache_block_group(). However this function will try to load the
> inode of the free space cache, which requires finding the matching inode
> item in the tree root - if that inode item is located in the same leaf as
> the inode item of the space cache we are updating at cache_save_setup(),
> we end up in a deadlock, since we try to obtain a read lock on the same
> extent buffer that we previously write locked.
>
> So fix this by skipping the loading of free space caches of any block
> groups that are not yet cached (rare cases) if we are updating the inode
> of a free space cache. This is a rare case and its downside is failure to
> find a free extent (return -ENOSPC) when all the already cached block
> groups have no free extents.
>
> Reported-by: Andrew Nelson 
> Link: 
> https://lore.kernel.org/linux-btrfs/captelenq9x5kowuq+fa7h1r3nsjg8vyith8+ifjurc_duhh...@mail.gmail.com/
> Fixes: 9d66e233c704 ("Btrfs: load free space cache if it exists")
> Signed-off-by: Filipe Manana 

Tested-by: Andrew Nelson 


> ---
>  fs/btrfs/ctree.h   |  3 +++
>  fs/btrfs/disk-io.c |  2 ++
>  fs/btrfs/extent-tree.c | 22 +-
>  3 files changed, 26 insertions(+), 1 deletion(-)
>
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index 2cddfe7806a4..d23ee26eb17d 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -1121,6 +1121,9 @@ struct btrfs_fs_info {
> u32 sectorsize;
> u32 stripesize;
>
> +   /* The task currently updating a free space cache inode item. */
> +   struct task_struct *space_cache_updater;
> +
>  #ifdef CONFIG_BTRFS_FS_REF_VERIFY
> spinlock_t ref_verify_lock;
> struct rb_root block_tree;
> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index 05dc3c17cb62..aa5e9a91e560 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -2782,6 +2782,8 @@ int open_ctree(struct super_block *sb,
> fs_info->sectorsize = 4096;
> fs_info->stripesize = 4096;
>
> +   fs_info->space_cache_updater = NULL;
> +
> ret = btrfs_alloc_stripe_hash_table(fs_info);
> if (ret) {
> err = ret;
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index 577878324799..e93040449771 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -3364,7 +3364,9 @@ static int cache_save_setup(struct 
> btrfs_block_group_cache *block_group,
>  * time.
>  */
> BTRFS_I(inode)->generation = 0;
> +   fs_info->space_cache_updater = current;
> ret = btrfs_update_inode(trans, root, inode);
> +   fs_info->space_cache_updater = NULL;
> if (ret) {
> /*
>  * So theoretically we could recover from this, simply set the
> @@ -7366,7 +7368,25 @@ static noinline int find_free_extent(struct 
> btrfs_fs_info *fs_info,
>
>  have_block_group:
> cached = b

Re: Btrfs resize seems to deadlock

2018-10-22 Thread Filipe Manana
On Mon, Oct 22, 2018 at 10:06 AM Andrew Nelson
 wrote:
>
> OK, an update: After unmouting and running btrfs check, the drive
> reverted to reporting the old size. Not sure if this was due to
> unmounting / mounting or doing btrfs check. Btrfs check should have
> been running in readonly mode.

It reverted to the old size because the transaction used for the
resize operation never got committed due to the deadlock, not because
of 'btrfs check'.

>  Since it looked like something was
> wrong with the resize process, I patched my kernel with the posted
> patch. This time the resize operation finished successfully.

Great, thanks for testing!

> On Sun, Oct 21, 2018 at 1:56 AM Filipe Manana  wrote:
> >
> > On Sun, Oct 21, 2018 at 6:05 AM Andrew Nelson  
> > wrote:
> > >
> > > Also, is the drive in a safe state to use? Is there anything I should
> > > run on the drive to check consistency?
> >
> > It should be in a safe state. You can verify it running "btrfs check
> > /dev/" (it's a readonly operation).
> >
> > If you are able to patch and build a kernel, you can also try the
> > patch. I left it running tests overnight and haven't got any
> > regressions.
> >
> > Thanks.
> >
> > > On Sat, Oct 20, 2018 at 10:02 PM Andrew Nelson
> > >  wrote:
> > > >
> > > > I have ran the "btrfs inspect-internal dump-tree -t 1" command, but
> > > > the output is ~55mb. Is there something in particular you are looking
> > > > for in this?
> > > > On Sat, Oct 20, 2018 at 1:34 PM Filipe Manana  
> > > > wrote:
> > > > >
> > > > > On Sat, Oct 20, 2018 at 9:27 PM Liu Bo  wrote:
> > > > > >
> > > > > > On Fri, Oct 19, 2018 at 7:09 PM Andrew Nelson 
> > > > > >  wrote:
> > > > > > >
> > > > > > > I am having an issue with btrfs resize in Fedora 28. I am 
> > > > > > > attempting
> > > > > > > to enlarge my Btrfs partition. Every time I run "btrfs filesystem
> > > > > > > resize max $MOUNT", the command runs for a few minutes and then 
> > > > > > > hangs
> > > > > > > forcing the system to be reset. I am not sure what the state of 
> > > > > > > the
> > > > > > > filesystem really is at this point. Btrfs usage does report the
> > > > > > > correct size for after resizing. Details below:
> > > > > > >
> > > > > >
> > > > > > Thanks for the report, the stack is helpful, but this needs a few
> > > > > > deeper debugging, may I ask you to post "btrfs inspect-internal
> > > > > > dump-tree -t 1 /dev/your_btrfs_disk"?
> > > > >
> > > > > I believe it's actually easy to understand from the trace alone and
> > > > > it's kind of a bad luck scenario.
> > > > > I made this fix a few hours ago:
> > > > >
> > > > > https://git.kernel.org/pub/scm/linux/kernel/git/fdmanana/linux.git/commit/?h=fix_find_free_extent_deadlock
> > > > >
> > > > > But haven't done full testing yet and might have missed something.
> > > > > Bo, can you take a look and let me know what you think?
> > > > >
> > > > > Thanks.
> > > > >
> > > > > >
> > > > > > So I'd like to know what's the height of your tree "1" which refers 
> > > > > > to
> > > > > > root tree in btrfs.
> > > > > >
> > > > > > thanks,
> > > > > > liubo
> > > > > >
> > > > > > > $ sudo btrfs filesystem usage $MOUNT
> > > > > > > Overall:
> > > > > > > Device size:  90.96TiB
> > > > > > > Device allocated: 72.62TiB
> > > > > > > Device unallocated:   18.33TiB
> > > > > > > Device missing:  0.00B
> > > > > > > Used: 72.62TiB
> > > > > > > Free (estimated): 18.34TiB  (min: 9.17TiB)
> > > > > > > Data ratio:   1.00
> > > > > > > Metadata ratio:   2.00
> > > > > > > Global reserve:  512.00MiB  (used: 24.11MiB)
> > > > 

Re: [PATCH] btrfs/154: test for device dynamic rescan

2018-10-21 Thread Filipe Manana
On Sun, Oct 21, 2018 at 10:20 AM Nikolay Borisov  wrote:
>
>
>
> On 21.10.2018 10:16, Filipe Manana wrote:
> > On Mon, Nov 13, 2017 at 2:26 AM Anand Jain  wrote:
> >>
> >> Make sure missing device is included in the alloc list when it is
> >> scanned on a mounted FS.
> >>
> >> This test case needs btrfs kernel patch which is in the ML
> >>   [PATCH] btrfs: handle dynamically reappearing missing device
> >> Without the kernel patch, the test will run, but reports as
> >> failed, as the device scanned won't appear in the alloc_list.
> >
> > So that patch was never merged, at least not with that subject.
> > What happened?
>
> https://git.kernel.org/pub/scm/fs/xfs/xfstests-dev.git/tree/tests/btrfs/154
>
> In my testing of misc-next this test has been failing for me.

Yes, that's why I'm asking.
It's not uncommon for Anand to submit tests for which the corresponding
kernel patch never gets merged or, as with his recent hole punch test,
for which there is no corresponding kernel fix.

Anand, what's the deal here? Are you planning on pushing the fixes to
the kernel? And for the hole punch test, are you working on a fix or
just hoping someone else will fix it?

> >
> >>
> >> Signed-off-by: Anand Jain 
> >> ---
> >>  tests/btrfs/154 | 188 
> >> 
> >>  tests/btrfs/154.out |  10 +++
> >>  tests/btrfs/group   |   1 +
> >>  3 files changed, 199 insertions(+)
> >>  create mode 100755 tests/btrfs/154
> >>  create mode 100644 tests/btrfs/154.out
> >>
> >> diff --git a/tests/btrfs/154 b/tests/btrfs/154
> >> new file mode 100755
> >> index ..8b06fc4d9347
> >> --- /dev/null
> >> +++ b/tests/btrfs/154
> >> @@ -0,0 +1,188 @@
> >> +#! /bin/bash
> >> +# FS QA Test 154
> >> +#
> >> +# Test for reappearing missing device functionality.
> >> +#   This test will fail without the btrfs kernel patch
> >> +#   [PATCH] btrfs: handle dynamically reappearing missing device
> >> +#
> >> +#-
> >> +# Copyright (c) 2017 Oracle.  All Rights Reserved.
> >> +# Author: Anand Jain 
> >> +#
> >> +# This program is free software; you can redistribute it and/or
> >> +# modify it under the terms of the GNU General Public License as
> >> +# published by the Free Software Foundation.
> >> +#
> >> +# This program is distributed in the hope that it would be useful,
> >> +# but WITHOUT ANY WARRANTY; without even the implied warranty of
> >> +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> >> +# GNU General Public License for more details.
> >> +#
> >> +# You should have received a copy of the GNU General Public License
> >> +# along with this program; if not, write the Free Software Foundation,
> >> +# Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
> >> +#-
> >> +#
> >> +
> >> +seq=`basename $0`
> >> +seqres=$RESULT_DIR/$seq
> >> +echo "QA output created by $seq"
> >> +
> >> +here=`pwd`
> >> +tmp=/tmp/$$
> >> +status=1   # failure is the default!
> >> +trap "_cleanup; exit \$status" 0 1 2 3 15
> >> +
> >> +_cleanup()
> >> +{
> >> +   cd /
> >> +   rm -f $tmp.*
> >> +}
> >> +
> >> +# get standard environment, filters and checks
> >> +. ./common/rc
> >> +. ./common/filter
> >> +. ./common/module
> >> +
> >> +# remove previous $seqres.full before test
> >> +rm -f $seqres.full
> >> +
> >> +# real QA test starts here
> >> +
> >> +_supported_fs btrfs
> >> +_supported_os Linux
> >> +_require_scratch_dev_pool 2
> >> +_test_unmount
> >> +_require_loadable_fs_module "btrfs"
> >> +
> >> +_scratch_dev_pool_get 2
> >> +
> >> +DEV1=`echo $SCRATCH_DEV_POOL | awk '{print $1}'`
> >> +DEV2=`echo $SCRATCH_DEV_POOL | awk '{print $2}'`
> >> +
> >> +echo DEV1=$DEV1 >> $seqres.full
> >> +echo DEV2=$DEV2 >> $seqres.full
> >> +
> >> +# Balance won't be successful if filled too much
> >> +DEV1_SZ=`blockdev --getsize64 $DEV1`
> >> +DEV2_SZ=`blockdev --getsize64 $DEV2`
> >> +
> >> +#

Re: [PATCH] btrfs/154: test for device dynamic rescan

2018-10-21 Thread Filipe Manana
On Mon, Nov 13, 2017 at 2:26 AM Anand Jain  wrote:
>
> Make sure missing device is included in the alloc list when it is
> scanned on a mounted FS.
>
> This test case needs btrfs kernel patch which is in the ML
>   [PATCH] btrfs: handle dynamically reappearing missing device
> Without the kernel patch, the test will run, but reports as
> failed, as the device scanned won't appear in the alloc_list.

So that patch was never merged, at least not with that subject.
What happened?

>
> Signed-off-by: Anand Jain 
> ---
>  tests/btrfs/154 | 188 
> 
>  tests/btrfs/154.out |  10 +++
>  tests/btrfs/group   |   1 +
>  3 files changed, 199 insertions(+)
>  create mode 100755 tests/btrfs/154
>  create mode 100644 tests/btrfs/154.out
>
> diff --git a/tests/btrfs/154 b/tests/btrfs/154
> new file mode 100755
> index ..8b06fc4d9347
> --- /dev/null
> +++ b/tests/btrfs/154
> @@ -0,0 +1,188 @@
> +#! /bin/bash
> +# FS QA Test 154
> +#
> +# Test for reappearing missing device functionality.
> +#   This test will fail without the btrfs kernel patch
> +#   [PATCH] btrfs: handle dynamically reappearing missing device
> +#
> +#-
> +# Copyright (c) 2017 Oracle.  All Rights Reserved.
> +# Author: Anand Jain 
> +#
> +# This program is free software; you can redistribute it and/or
> +# modify it under the terms of the GNU General Public License as
> +# published by the Free Software Foundation.
> +#
> +# This program is distributed in the hope that it would be useful,
> +# but WITHOUT ANY WARRANTY; without even the implied warranty of
> +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> +# GNU General Public License for more details.
> +#
> +# You should have received a copy of the GNU General Public License
> +# along with this program; if not, write the Free Software Foundation,
> +# Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
> +#-
> +#
> +
> +seq=`basename $0`
> +seqres=$RESULT_DIR/$seq
> +echo "QA output created by $seq"
> +
> +here=`pwd`
> +tmp=/tmp/$$
> +status=1   # failure is the default!
> +trap "_cleanup; exit \$status" 0 1 2 3 15
> +
> +_cleanup()
> +{
> +   cd /
> +   rm -f $tmp.*
> +}
> +
> +# get standard environment, filters and checks
> +. ./common/rc
> +. ./common/filter
> +. ./common/module
> +
> +# remove previous $seqres.full before test
> +rm -f $seqres.full
> +
> +# real QA test starts here
> +
> +_supported_fs btrfs
> +_supported_os Linux
> +_require_scratch_dev_pool 2
> +_test_unmount
> +_require_loadable_fs_module "btrfs"
> +
> +_scratch_dev_pool_get 2
> +
> +DEV1=`echo $SCRATCH_DEV_POOL | awk '{print $1}'`
> +DEV2=`echo $SCRATCH_DEV_POOL | awk '{print $2}'`
> +
> +echo DEV1=$DEV1 >> $seqres.full
> +echo DEV2=$DEV2 >> $seqres.full
> +
> +# Balance won't be successful if filled too much
> +DEV1_SZ=`blockdev --getsize64 $DEV1`
> +DEV2_SZ=`blockdev --getsize64 $DEV2`
> +
> +# get min
> +MAX_FS_SZ=`echo -e "$DEV1_SZ\n$DEV2_SZ" | sort | head -1`
> +# Need disks with more than 2G
> +if [ $MAX_FS_SZ -lt 20 ]; then
> +   _scratch_dev_pool_put
> +   _test_mount
> +   _notrun "Smallest dev size $MAX_FS_SZ, Need at least 2G"
> +fi
> +
> +MAX_FS_SZ=1
> +bs="1M"
> +COUNT=$(($MAX_FS_SZ / 100))
> +CHECKPOINT1=0
> +CHECKPOINT2=0
> +
> +setup()
> +{
> +   echo >> $seqres.full
> +   echo "MAX_FS_SZ=$MAX_FS_SZ COUNT=$COUNT" >> $seqres.full
> +   echo "setup"
> +   echo "-setup-" >> $seqres.full
> +   _scratch_pool_mkfs "-mraid1 -draid1" >> $seqres.full 2>&1
> +   _scratch_mount >> $seqres.full 2>&1
> +   dd if=/dev/urandom of="$SCRATCH_MNT"/tf bs=$bs count=1 \
> +   >>$seqres.full 2>&1
> +   _run_btrfs_util_prog filesystem show -m ${SCRATCH_MNT}
> +   _run_btrfs_util_prog filesystem df $SCRATCH_MNT
> +   COUNT=$(( $COUNT - 1 ))
> +   echo "unmount" >> $seqres.full
> +   _scratch_unmount
> +}
> +
> +degrade_mount_write()
> +{
> +   echo >> $seqres.full
> +   echo "--degraded mount: max_fs_sz $max_fs_sz bytes--" >> $seqres.full
> +   echo
> +   echo "degraded mount"
> +
> +   echo "clean btrfs ko" >> $seqres.full
> +   # un-scan the btrfs devices
> +   _reload_fs_module "btrfs"
> +   _mount -o degraded $DEV1 $SCRATCH_MNT >>$seqres.full 2>&1
> +   cnt=$(( $COUNT/10 ))
> +   dd if=/dev/urandom of="$SCRATCH_MNT"/tf1 bs=$bs count=$cnt \
> +   >>$seqres.full 2>&1
> +   COUNT=$(( $COUNT - $cnt ))
> +   _run_btrfs_util_prog filesystem show -m $SCRATCH_MNT
> +   _run_btrfs_util_prog filesystem df $SCRATCH_MNT
> +   CHECKPOINT1=`md5sum $SCRATCH_MNT/tf1`
> +   echo $SCRATCH_MNT/tf1:$CHECKPOINT1 >> $seqres.full 2>&1
> +}
> +
> +scan_missing_dev_and_write()
> +{
> +   echo 

Re: Btrfs resize seems to deadlock

2018-10-21 Thread Filipe Manana
On Sun, Oct 21, 2018 at 6:05 AM Andrew Nelson  wrote:
>
> Also, is the drive in a safe state to use? Is there anything I should
> run on the drive to check consistency?

It should be in a safe state. You can verify it by running "btrfs check
/dev/" (it's a readonly operation).

If you are able to patch and build a kernel, you can also try the
patch. I left it running tests overnight and haven't hit any
regressions.

Thanks.

> On Sat, Oct 20, 2018 at 10:02 PM Andrew Nelson
>  wrote:
> >
> > I have ran the "btrfs inspect-internal dump-tree -t 1" command, but
> > the output is ~55mb. Is there something in particular you are looking
> > for in this?
> > On Sat, Oct 20, 2018 at 1:34 PM Filipe Manana  wrote:
> > >
> > > On Sat, Oct 20, 2018 at 9:27 PM Liu Bo  wrote:
> > > >
> > > > On Fri, Oct 19, 2018 at 7:09 PM Andrew Nelson 
> > > >  wrote:
> > > > >
> > > > > I am having an issue with btrfs resize in Fedora 28. I am attempting
> > > > > to enlarge my Btrfs partition. Every time I run "btrfs filesystem
> > > > > resize max $MOUNT", the command runs for a few minutes and then hangs
> > > > > forcing the system to be reset. I am not sure what the state of the
> > > > > filesystem really is at this point. Btrfs usage does report the
> > > > > correct size for after resizing. Details below:
> > > > >
> > > >
> > > > Thanks for the report, the stack is helpful, but this needs a few
> > > > deeper debugging, may I ask you to post "btrfs inspect-internal
> > > > dump-tree -t 1 /dev/your_btrfs_disk"?
> > >
> > > I believe it's actually easy to understand from the trace alone and
> > > it's kind of a bad luck scenario.
> > > I made this fix a few hours ago:
> > >
> > > https://git.kernel.org/pub/scm/linux/kernel/git/fdmanana/linux.git/commit/?h=fix_find_free_extent_deadlock
> > >
> > > But haven't done full testing yet and might have missed something.
> > > Bo, can you take a look and let me know what you think?
> > >
> > > Thanks.
> > >
> > > >
> > > > So I'd like to know what's the height of your tree "1" which refers to
> > > > root tree in btrfs.
> > > >
> > > > thanks,
> > > > liubo
> > > >
> > > > > $ sudo btrfs filesystem usage $MOUNT
> > > > > Overall:
> > > > > Device size:  90.96TiB
> > > > > Device allocated: 72.62TiB
> > > > > Device unallocated:   18.33TiB
> > > > > Device missing:  0.00B
> > > > > Used: 72.62TiB
> > > > > Free (estimated): 18.34TiB  (min: 9.17TiB)
> > > > > Data ratio:   1.00
> > > > > Metadata ratio:   2.00
> > > > > Global reserve:  512.00MiB  (used: 24.11MiB)
> > > > >
> > > > > Data,single: Size:72.46TiB, Used:72.45TiB
> > > > > $MOUNT72.46TiB
> > > > >
> > > > > Metadata,DUP: Size:86.00GiB, Used:84.96GiB
> > > > > $MOUNT   172.00GiB
> > > > >
> > > > > System,DUP: Size:40.00MiB, Used:7.53MiB
> > > > >$MOUNT80.00MiB
> > > > >
> > > > > Unallocated:
> > > > > $MOUNT18.33TiB
> > > > >
> > > > > $ uname -a
> > > > > Linux localhost.localdomain 4.18.14-200.fc28.x86_64 #1 SMP Mon Oct 15
> > > > > 13:16:27 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
> > > > >
> > > > > btrfs-transacti D0  2501  2 0x8000
> > > > > Call Trace:
> > > > >  ? __schedule+0x253/0x860
> > > > >  schedule+0x28/0x80
> > > > >  btrfs_commit_transaction+0x7aa/0x8b0 [btrfs]
> > > > >  ? kmem_cache_alloc+0x166/0x1d0
> > > > >  ? join_transaction+0x22/0x3e0 [btrfs]
> > > > >  ? finish_wait+0x80/0x80
> > > > >  transaction_kthread+0x155/0x170 [btrfs]
> > > > >  ? btrfs_cleanup_transaction+0x550/0x550 [btrfs]
> > > > >  kthread+0x112/0x130
> > > > >  ? kthread_create_worker_on_cpu+0x70/0x70
> > > > >  ret_from_fork+0x35/0x40
> > > > > btrfs   D0  2504   2502 0x000

Re: Btrfs resize seems to deadlock

2018-10-20 Thread Filipe Manana
On Sat, Oct 20, 2018 at 9:27 PM Liu Bo  wrote:
>
> On Fri, Oct 19, 2018 at 7:09 PM Andrew Nelson  
> wrote:
> >
> > I am having an issue with btrfs resize in Fedora 28. I am attempting
> > to enlarge my Btrfs partition. Every time I run "btrfs filesystem
> > resize max $MOUNT", the command runs for a few minutes and then hangs
> > forcing the system to be reset. I am not sure what the state of the
> > filesystem really is at this point. Btrfs usage does report the
> > correct size for after resizing. Details below:
> >
>
> Thanks for the report, the stack is helpful, but this needs a few
> deeper debugging, may I ask you to post "btrfs inspect-internal
> dump-tree -t 1 /dev/your_btrfs_disk"?

I believe it's actually easy to understand from the trace alone and
it's kind of a bad luck scenario.
I made this fix a few hours ago:

https://git.kernel.org/pub/scm/linux/kernel/git/fdmanana/linux.git/commit/?h=fix_find_free_extent_deadlock

But haven't done full testing yet and might have missed something.
Bo, can you take a look and let me know what you think?

Thanks.

>
> So I'd like to know what's the height of your tree "1" which refers to
> root tree in btrfs.
>
> thanks,
> liubo
>
> > $ sudo btrfs filesystem usage $MOUNT
> > Overall:
> > Device size:  90.96TiB
> > Device allocated: 72.62TiB
> > Device unallocated:   18.33TiB
> > Device missing:  0.00B
> > Used: 72.62TiB
> > Free (estimated): 18.34TiB  (min: 9.17TiB)
> > Data ratio:   1.00
> > Metadata ratio:   2.00
> > Global reserve:  512.00MiB  (used: 24.11MiB)
> >
> > Data,single: Size:72.46TiB, Used:72.45TiB
> > $MOUNT72.46TiB
> >
> > Metadata,DUP: Size:86.00GiB, Used:84.96GiB
> > $MOUNT   172.00GiB
> >
> > System,DUP: Size:40.00MiB, Used:7.53MiB
> >$MOUNT80.00MiB
> >
> > Unallocated:
> > $MOUNT18.33TiB
> >
> > $ uname -a
> > Linux localhost.localdomain 4.18.14-200.fc28.x86_64 #1 SMP Mon Oct 15
> > 13:16:27 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
> >
> > btrfs-transacti D0  2501  2 0x8000
> > Call Trace:
> >  ? __schedule+0x253/0x860
> >  schedule+0x28/0x80
> >  btrfs_commit_transaction+0x7aa/0x8b0 [btrfs]
> >  ? kmem_cache_alloc+0x166/0x1d0
> >  ? join_transaction+0x22/0x3e0 [btrfs]
> >  ? finish_wait+0x80/0x80
> >  transaction_kthread+0x155/0x170 [btrfs]
> >  ? btrfs_cleanup_transaction+0x550/0x550 [btrfs]
> >  kthread+0x112/0x130
> >  ? kthread_create_worker_on_cpu+0x70/0x70
> >  ret_from_fork+0x35/0x40
> > btrfs   D0  2504   2502 0x0002
> > Call Trace:
> >  ? __schedule+0x253/0x860
> >  schedule+0x28/0x80
> >  btrfs_tree_read_lock+0x8e/0x120 [btrfs]
> >  ? finish_wait+0x80/0x80
> >  btrfs_read_lock_root_node+0x2f/0x40 [btrfs]
> >  btrfs_search_slot+0xf6/0x9f0 [btrfs]
> >  ? evict_refill_and_join+0xd0/0xd0 [btrfs]
> >  ? inode_insert5+0x119/0x190
> >  btrfs_lookup_inode+0x3a/0xc0 [btrfs]
> >  ? kmem_cache_alloc+0x166/0x1d0
> >  btrfs_iget+0x113/0x690 [btrfs]
> >  __lookup_free_space_inode+0xd8/0x150 [btrfs]
> >  lookup_free_space_inode+0x5b/0xb0 [btrfs]
> >  load_free_space_cache+0x7c/0x170 [btrfs]
> >  ? cache_block_group+0x72/0x3b0 [btrfs]
> >  cache_block_group+0x1b3/0x3b0 [btrfs]
> >  ? finish_wait+0x80/0x80
> >  find_free_extent+0x799/0x1010 [btrfs]
> >  btrfs_reserve_extent+0x9b/0x180 [btrfs]
> >  btrfs_alloc_tree_block+0x1b3/0x4f0 [btrfs]
> >  __btrfs_cow_block+0x11d/0x500 [btrfs]
> >  btrfs_cow_block+0xdc/0x180 [btrfs]
> >  btrfs_search_slot+0x3bd/0x9f0 [btrfs]
> >  btrfs_lookup_inode+0x3a/0xc0 [btrfs]
> >  ? kmem_cache_alloc+0x166/0x1d0
> >  btrfs_update_inode_item+0x46/0x100 [btrfs]
> >  cache_save_setup+0xe4/0x3a0 [btrfs]
> >  btrfs_start_dirty_block_groups+0x1be/0x480 [btrfs]
> >  btrfs_commit_transaction+0xcb/0x8b0 [btrfs]
> >  ? btrfs_release_path+0x13/0x80 [btrfs]
> >  ? btrfs_update_device+0x8d/0x1c0 [btrfs]
> >  btrfs_ioctl_resize.cold.46+0xf4/0xf9 [btrfs]
> >  btrfs_ioctl+0xa25/0x2cf0 [btrfs]
> >  ? tty_write+0x1fc/0x330
> >  ? do_vfs_ioctl+0xa4/0x620
> >  do_vfs_ioctl+0xa4/0x620
> >  ksys_ioctl+0x60/0x90
> >  ? ksys_write+0x4f/0xb0
> >  __x64_sys_ioctl+0x16/0x20
> >  do_syscall_64+0x5b/0x160
> >  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> > RIP: 0033:0x7fcdc0d78c57
> > Code: Bad RIP value.
> > RSP: 002b:7ffdd1ee6cf8 EFLAGS: 0246 ORIG_RAX: 0010
> > RAX: ffda RBX: 7ffdd1ee888a RCX: 7fcdc0d78c57
> > RDX: 7ffdd1ee6da0 RSI: 50009403 RDI: 0003
> > RBP: 7ffdd1ee6da0 R08:  R09: 7ffdd1ee67e0
> > R10:  R11: 0246 R12: 7ffdd1ee888e
> > R13: 0003 R14:  R15: 
> > kworker/u48:1   D0  2505  2 0x8000
> > Workqueue: btrfs-freespace-write btrfs_freespace_write_helper [btrfs]
> > Call Trace:
> >  ? __schedule+0x253/0x860
> >  

Re: [PATCH 42/42] btrfs: don't run delayed_iputs in commit

2018-10-12 Thread Filipe Manana
On Fri, Oct 12, 2018 at 8:35 PM Josef Bacik  wrote:
>
> This could result in a really bad case where we do something like
>
> evict
>   evict_refill_and_join
> btrfs_commit_transaction
>   btrfs_run_delayed_iputs
> evict
>   evict_refill_and_join
> btrfs_commit_transaction
> ... forever
>
> We have plenty of other places where we run delayed iputs that are much
> safer, let those do the work.
>
> Signed-off-by: Josef Bacik 
Reviewed-by: Filipe Manana 

> ---
>  fs/btrfs/transaction.c | 9 -
>  1 file changed, 9 deletions(-)
>
> diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
> index 9168efaca37e..c91dc36fccae 100644
> --- a/fs/btrfs/transaction.c
> +++ b/fs/btrfs/transaction.c
> @@ -2265,15 +2265,6 @@ int btrfs_commit_transaction(struct btrfs_trans_handle 
> *trans)
>
> kmem_cache_free(btrfs_trans_handle_cachep, trans);
>
> -   /*
> -* If fs has been frozen, we can not handle delayed iputs, otherwise
> -* it'll result in deadlock about SB_FREEZE_FS.
> -*/
> -   if (current != fs_info->transaction_kthread &&
> -   current != fs_info->cleaner_kthread &&
> -   !test_bit(BTRFS_FS_FROZEN, &fs_info->flags))
> -   btrfs_run_delayed_iputs(fs_info);
> -
> return ret;
>
>  scrub_continue:
> --
> 2.14.3
>


-- 
Filipe David Manana,

“Whether you think you can, or you think you can't — you're right.”


Re: [PATCH 14/42] btrfs: reset max_extent_size properly

2018-10-12 Thread Filipe Manana
On Thu, Oct 11, 2018 at 8:57 PM Josef Bacik  wrote:
>
> If we use up our block group before allocating a new one we'll easily
> get a max_extent_size that's set really really low, which will result in
> a lot of fragmentation.  We need to make sure we're resetting the
> max_extent_size when we add a new chunk or add new space.
>
> Signed-off-by: Josef Bacik 
Reviewed-by: Filipe Manana 

> ---
>  fs/btrfs/extent-tree.c | 2 ++
>  1 file changed, 2 insertions(+)
>
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index cd2280962c8c..f84537a1d7eb 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -4573,6 +4573,7 @@ static int do_chunk_alloc(struct btrfs_trans_handle 
> *trans, u64 flags,
> goto out;
> } else {
> ret = 1;
> +   space_info->max_extent_size = 0;
> }
>
> space_info->force_alloc = CHUNK_ALLOC_NO_FORCE;
> @@ -6671,6 +6672,7 @@ static int btrfs_free_reserved_bytes(struct 
> btrfs_block_group_cache *cache,
> space_info->bytes_readonly += num_bytes;
> cache->reserved -= num_bytes;
> space_info->bytes_reserved -= num_bytes;
> +   space_info->max_extent_size = 0;
>
> if (delalloc)
> cache->delalloc_bytes -= num_bytes;
> --
> 2.14.3
>


-- 
Filipe David Manana,

“Whether you think you can, or you think you can't — you're right.”


Re: [PATCH 20/42] btrfs: don't use ctl->free_space for max_extent_size

2018-10-12 Thread Filipe Manana
On Thu, Oct 11, 2018 at 8:57 PM Josef Bacik  wrote:
>
> From: Josef Bacik 
>
> max_extent_size is supposed to be the largest contiguous range for the
> space info, and ctl->free_space is the total free space in the block
> group.  We need to keep track of these separately and _only_ use the
> max_free_space if we don't have a max_extent_size, as that means our
> original request was too large to search any of the block groups for and
> therefore wouldn't have a max_extent_size set.
>
> Signed-off-by: Josef Bacik 
Reviewed-by: Filipe Manana 

> ---
>  fs/btrfs/extent-tree.c | 7 +--
>  1 file changed, 5 insertions(+), 2 deletions(-)
>
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index 6e7bc3197737..4f48d047a1ec 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -7496,6 +7496,7 @@ static noinline int find_free_extent(struct 
> btrfs_fs_info *fs_info,
> struct btrfs_block_group_cache *block_group = NULL;
> u64 search_start = 0;
> u64 max_extent_size = 0;
> +   u64 max_free_space = 0;
> u64 empty_cluster = 0;
> struct btrfs_space_info *space_info;
> int loop = 0;
> @@ -7791,8 +7792,8 @@ static noinline int find_free_extent(struct 
> btrfs_fs_info *fs_info,
> > spin_lock(&ctl->tree_lock);
> if (ctl->free_space <
> num_bytes + empty_cluster + empty_size) {
> -   if (ctl->free_space > max_extent_size)
> -   max_extent_size = ctl->free_space;
> +   max_free_space = max(max_free_space,
> +ctl->free_space);
> > spin_unlock(&ctl->tree_lock);
> goto loop;
> }
> @@ -7959,6 +7960,8 @@ static noinline int find_free_extent(struct 
> btrfs_fs_info *fs_info,
> }
>  out:
> if (ret == -ENOSPC) {
> +   if (!max_extent_size)
> +   max_extent_size = max_free_space;
> > spin_lock(&space_info->lock);
> space_info->max_extent_size = max_extent_size;
> > spin_unlock(&space_info->lock);
> --
> 2.14.3
>


-- 
Filipe David Manana,

“Whether you think you can, or you think you can't — you're right.”


Re: [PATCH 19/42] btrfs: set max_extent_size properly

2018-10-12 Thread Filipe Manana
On Thu, Oct 11, 2018 at 8:57 PM Josef Bacik  wrote:
>
> From: Josef Bacik 
>
> We can't use entry->bytes if our entry is a bitmap entry, we need to use
> entry->max_extent_size in that case.  Fix up all the logic to make this
> consistent.
>
> Signed-off-by: Josef Bacik 
Reviewed-by: Filipe Manana 

> ---
>  fs/btrfs/free-space-cache.c | 29 +++--
>  1 file changed, 19 insertions(+), 10 deletions(-)
>
> diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c
> index e077ad3b4549..2e96ee7da3ec 100644
> --- a/fs/btrfs/free-space-cache.c
> +++ b/fs/btrfs/free-space-cache.c
> @@ -1770,6 +1770,18 @@ static int search_bitmap(struct btrfs_free_space_ctl 
> *ctl,
> return -1;
>  }
>
> +static void set_max_extent_size(struct btrfs_free_space *entry,
> +   u64 *max_extent_size)
> +{
> +   if (entry->bitmap) {
> +   if (entry->max_extent_size > *max_extent_size)
> +   *max_extent_size = entry->max_extent_size;
> +   } else {
> +   if (entry->bytes > *max_extent_size)
> +   *max_extent_size = entry->bytes;
> +   }
> +}
> +
>  /* Cache the size of the max extent in bytes */
>  static struct btrfs_free_space *
>  find_free_space(struct btrfs_free_space_ctl *ctl, u64 *offset, u64 *bytes,
> @@ -1791,8 +1803,7 @@ find_free_space(struct btrfs_free_space_ctl *ctl, u64 
> *offset, u64 *bytes,
> > for (node = &entry->offset_index; node; node = rb_next(node)) {
> entry = rb_entry(node, struct btrfs_free_space, offset_index);
> if (entry->bytes < *bytes) {
> -   if (entry->bytes > *max_extent_size)
> -   *max_extent_size = entry->bytes;
> +   set_max_extent_size(entry, max_extent_size);
> continue;
> }
>
> @@ -1810,8 +1821,7 @@ find_free_space(struct btrfs_free_space_ctl *ctl, u64 
> *offset, u64 *bytes,
> }
>
> if (entry->bytes < *bytes + align_off) {
> -   if (entry->bytes > *max_extent_size)
> -   *max_extent_size = entry->bytes;
> +   set_max_extent_size(entry, max_extent_size);
> continue;
> }
>
> @@ -1823,8 +1833,8 @@ find_free_space(struct btrfs_free_space_ctl *ctl, u64 
> *offset, u64 *bytes,
> *offset = tmp;
> *bytes = size;
> return entry;
> -   } else if (size > *max_extent_size) {
> -   *max_extent_size = size;
> +   } else {
> +   set_max_extent_size(entry, max_extent_size);
> }
> continue;
> }
> @@ -2684,8 +2694,7 @@ static u64 btrfs_alloc_from_bitmap(struct 
> btrfs_block_group_cache *block_group,
>
> > err = search_bitmap(ctl, entry, &search_start, &search_bytes, true);
> if (err) {
> -   if (search_bytes > *max_extent_size)
> -   *max_extent_size = search_bytes;
> +   set_max_extent_size(entry, max_extent_size);
> return 0;
> }
>
> @@ -2722,8 +2731,8 @@ u64 btrfs_alloc_from_cluster(struct 
> btrfs_block_group_cache *block_group,
>
> entry = rb_entry(node, struct btrfs_free_space, offset_index);
> while (1) {
> -   if (entry->bytes < bytes && entry->bytes > *max_extent_size)
> -   *max_extent_size = entry->bytes;
> +   if (entry->bytes < bytes)
> +   set_max_extent_size(entry, max_extent_size);
>
> if (entry->bytes < bytes ||
> (!entry->bitmap && entry->offset < min_start)) {
> --
> 2.14.3
>


-- 
Filipe David Manana,

“Whether you think you can, or you think you can't — you're right.”


Re: [PATCH 32/42] btrfs: only free reserved extent if we didn't insert it

2018-10-11 Thread Filipe Manana
On Thu, Oct 11, 2018 at 8:58 PM Josef Bacik  wrote:
>
> When we insert the file extent once the ordered extent completes we free
> the reserved extent reservation as it'll have been migrated to the
> bytes_used counter.  However if we error out after this step we'll still
> clear the reserved extent reservation, resulting in a negative
> accounting of the reserved bytes for the block group and space info.
> Fix this by only doing the free if we didn't successfully insert a file
> extent for this extent.
>
> Signed-off-by: Josef Bacik 
> Reviewed-by: Omar Sandoval 
Reviewed-by: Filipe Manana 

> ---
>  fs/btrfs/inode.c | 10 +-
>  1 file changed, 9 insertions(+), 1 deletion(-)
>
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index 5a91055a13b2..2b257d14bd3d 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -2992,6 +2992,7 @@ static int btrfs_finish_ordered_io(struct 
> btrfs_ordered_extent *ordered_extent)
> bool truncated = false;
> bool range_locked = false;
> bool clear_new_delalloc_bytes = false;
> +   bool clear_reserved_extent = true;
>
> > if (!test_bit(BTRFS_ORDERED_NOCOW, &ordered_extent->flags) &&
> > !test_bit(BTRFS_ORDERED_PREALLOC, &ordered_extent->flags) &&
> @@ -3095,10 +3096,12 @@ static int btrfs_finish_ordered_io(struct 
> btrfs_ordered_extent *ordered_extent)
> logical_len, logical_len,
> compress_type, 0, 0,
> BTRFS_FILE_EXTENT_REG);
> -   if (!ret)
> +   if (!ret) {
> +   clear_reserved_extent = false;
> btrfs_release_delalloc_bytes(fs_info,
>  ordered_extent->start,
>  
> ordered_extent->disk_len);
> +   }
> }
> > unpin_extent_cache(&BTRFS_I(inode)->extent_tree,
>ordered_extent->file_offset, ordered_extent->len,
> @@ -3159,8 +3162,13 @@ static int btrfs_finish_ordered_io(struct 
> btrfs_ordered_extent *ordered_extent)
>  * wrong we need to return the space for this ordered extent
>  * back to the allocator.  We only free the extent in the
>  * truncated case if we didn't write out the extent at all.
> +*
> +* If we made it past insert_reserved_file_extent before we
> +* errored out then we don't need to do this as the accounting
> +* has already been done.
>  */
> if ((ret || !logical_len) &&
> +   clear_reserved_extent &&
> > !test_bit(BTRFS_ORDERED_NOCOW, &ordered_extent->flags) &&
> > !test_bit(BTRFS_ORDERED_PREALLOC, &ordered_extent->flags))
> btrfs_free_reserved_extent(fs_info,
> --
> 2.14.3
>


-- 
Filipe David Manana,

“Whether you think you can, or you think you can't — you're right.”


Re: [PATCH 42/42] btrfs: don't run delayed_iputs in commit

2018-10-11 Thread Filipe Manana
On Thu, Oct 11, 2018 at 8:58 PM Josef Bacik  wrote:
>
> This could result in a really bad case where we do something like
>
> evict
>   evict_refill_and_join
> btrfs_commit_transaction
>   btrfs_run_delayed_iputs
> evict
>   evict_refill_and_join
> btrfs_commit_transaction
> ... forever
>
> We have plenty of other places where we run delayed iputs that are much
> safer, let those do the work.
>
> Signed-off-by: Josef Bacik 
Reviewed-by: Filipe Manana 

Great catch!

> ---
>  fs/btrfs/transaction.c | 9 -
>  1 file changed, 9 deletions(-)
>
> diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
> index 9168efaca37e..c91dc36fccae 100644
> --- a/fs/btrfs/transaction.c
> +++ b/fs/btrfs/transaction.c
> @@ -2265,15 +2265,6 @@ int btrfs_commit_transaction(struct btrfs_trans_handle 
> *trans)
>
> kmem_cache_free(btrfs_trans_handle_cachep, trans);
>
> -   /*
> -* If fs has been frozen, we can not handle delayed iputs, otherwise
> -* it'll result in deadlock about SB_FREEZE_FS.
> -*/
> -   if (current != fs_info->transaction_kthread &&
> -   current != fs_info->cleaner_kthread &&
> -   !test_bit(BTRFS_FS_FROZEN, &fs_info->flags))
> -   btrfs_run_delayed_iputs(fs_info);
> -
> return ret;
>
>  scrub_continue:
> --
> 2.14.3
>


-- 
Filipe David Manana,

“Whether you think you can, or you think you can't — you're right.”


Re: [PATCH 37/42] btrfs: wakeup cleaner thread when adding delayed iput

2018-10-08 Thread Filipe Manana
On Fri, Sep 28, 2018 at 12:21 PM Josef Bacik  wrote:
>
> The cleaner thread usually takes care of delayed iputs, with the
> exception of the btrfs_end_transaction_throttle path.  The cleaner
> thread only gets woken up every 30 seconds, so instead wake it up to do
> it's work so that we can free up that space as quickly as possible.
>
> Signed-off-by: Josef Bacik 
Reviewed-by: Filipe Manana 
> ---
>  fs/btrfs/inode.c | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index 2b257d14bd3d..0a1671fb03bf 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -3323,6 +3323,7 @@ void btrfs_add_delayed_iput(struct inode *inode)
> > ASSERT(list_empty(&binode->delayed_iput));
> > list_add_tail(&binode->delayed_iput, &fs_info->delayed_iputs);
> > spin_unlock(&fs_info->delayed_iput_lock);
> +   wake_up_process(fs_info->cleaner_kthread);
>  }
>
>  void btrfs_run_delayed_iputs(struct btrfs_fs_info *fs_info)
> --
> 2.14.3
>


-- 
Filipe David Manana,

“Whether you think you can, or you think you can't — you're right.”


Re: BTRFS related kernel backtrace on boot on 4.18.7 after blackout due to discharged battery

2018-10-05 Thread Filipe Manana
On Fri, Oct 5, 2018 at 3:23 PM Martin Steigerwald  wrote:
>
> Hello!
>
> On ThinkPad T520 after battery was discharged and machine just blacked
> out.
>
> Is that some sign of regular consistency check / replay or something to
> investigate further?

I think it's harmless; if anything were messed up with link counts or
mismatches between those and dir entries, fsck (btrfs check) should
have reported something.
I'll dig a bit further and remove the warning if it's really harmless.

Thanks.

>
> I already scrubbed all data and there are no errors. Also btrfs device stats
> reports no errors. SMART status appears to be okay as well on both SSD.
>
> [4.524355] BTRFS info (device dm-4): disk space caching is enabled
> [4.524356] BTRFS info (device dm-4): has skinny extents
> [4.563950] BTRFS info (device dm-4): enabling ssd optimizations
> [5.463085] Console: switching to colour frame buffer device 240x67
> [5.492236] i915 :00:02.0: fb0: inteldrmfb frame buffer device
> [5.882661] BTRFS info (device dm-3): disk space caching is enabled
> [5.882664] BTRFS info (device dm-3): has skinny extents
> [5.918579] SGI XFS with ACLs, security attributes, realtime, scrub, no 
> debug enabled
> [5.927421] Adding 20971516k swap on /dev/mapper/sata-swap.  Priority:-2 
> extents:1 across:20971516k SSDsc
> [5.935051] XFS (sdb1): Mounting V5 Filesystem
> [5.935218] XFS (sda1): Mounting V5 Filesystem
> [5.961100] XFS (sda1): Ending clean mount
> [5.970857] BTRFS info (device dm-3): enabling ssd optimizations
> [5.972358] XFS (sdb1): Ending clean mount
> [5.975955] WARNING: CPU: 1 PID: 1104 at fs/inode.c:342 inc_nlink+0x28/0x30
> [5.978271] Modules linked in: xfs msr pktcdvd intel_rapl 
> x86_pkg_temp_thermal intel_powerclamp kvm_intel kvm irqbypass 
> crct10dif_pclmul crc32_pclmul ghash_clmulni_intel snd_hda_codec_hdmi pcbc 
> arc4 snd_hda_codec_conexant snd_hda_codec_generic iwldvm mac80211 iwlwifi 
> aesni_intel snd_hda_intel snd_hda_codec aes_x86_64 crypto_simd cryptd 
> snd_hda_core glue_helper intel_cstate snd_hwdep intel_rapl_perf snd_pcm 
> pcspkr input_leds i915 sg cfg80211 snd_timer thinkpad_acpi nvram 
> drm_kms_helper snd soundcore tpm_tis tpm_tis_core drm rfkill ac tpm 
> i2c_algo_bit fb_sys_fops battery rng_core video syscopyarea sysfillrect 
> sysimgblt button evdev sbs sbshc coretemp bfq hdaps(O) tp_smapi(O) 
> thinkpad_ec(O) loop ecryptfs cbc sunrpc mcryptd sha256_ssse3 sha256_generic 
> encrypted_keys ip_tables x_tables autofs4 dm_mod
> [5.990499]  btrfs xor zstd_decompress zstd_compress xxhash zlib_deflate 
> raid6_pq libcrc32c crc32c_generic sr_mod cdrom sd_mod hid_lenovo hid_generic 
> usbhid hid ahci libahci libata ehci_pci crc32c_intel psmouse i2c_i801 
> sdhci_pci cqhci lpc_ich sdhci ehci_hcd e1000e scsi_mod i2c_core mfd_core 
> mmc_core usbcore usb_common thermal
> [5.990529] CPU: 1 PID: 1104 Comm: mount Tainted: G   O  
> 4.18.7-tp520 #63
> [5.990532] Hardware name: LENOVO 42433WG/42433WG, BIOS 8AET69WW (1.49 ) 
> 06/14/2018
> [6.000153] RIP: 0010:inc_nlink+0x28/0x30
> [6.000154] Code: 00 00 8b 47 48 85 c0 74 07 83 c0 01 89 47 48 c3 f6 87 a1 
> 00 00 00 04 74 11 48 8b 47 28 f0 48 ff 88 98 04 00 00 8b 47 48 eb df <0f> 0b 
> eb eb 0f 1f 40 00 41 54 8b 0d 70 3f aa 00 48 ba eb 83 b5 80
> [6.008573] RSP: 0018:c90002283828 EFLAGS: 00010246
> [6.008575] RAX:  RBX: 8804018bed58 RCX: 
> 00022261
> [6.008576] RDX: 00022251 RSI:  RDI: 
> 8804018bed58
> [6.008577] RBP: c90002283a50 R08: 0002a330 R09: 
> a02f3873
> [6.008578] R10: 30ff R11: 7763 R12: 
> 0011
> [6.008579] R13: 3d5f R14: 880403e19800 R15: 
> 88040a3c69a0
> [6.008580] FS:  7f071598f100() GS:88041e24() 
> knlGS:
> [6.008581] CS:  0010 DS:  ES:  CR0: 80050033
> [6.008589] CR2: 7fda4fbf8218 CR3: 000403e42001 CR4: 
> 000606e0
> [6.008590] Call Trace:
> [6.008614]  replay_one_buffer+0x80e/0x890 [btrfs]
> [6.008632]  walk_up_log_tree+0x1dc/0x260 [btrfs]
> [6.046858]  walk_log_tree+0xaf/0x1e0 [btrfs]
> [6.046872]  btrfs_recover_log_trees+0x21c/0x410 [btrfs]
> [6.046885]  ? btree_read_extent_buffer_pages+0xcd/0x210 [btrfs]
> [6.055941]  ? fixup_inode_link_counts+0x170/0x170 [btrfs]
> [6.055953]  open_ctree+0x1a0d/0x1b60 [btrfs]
> [6.055965]  btrfs_mount_root+0x67b/0x760 [btrfs]
> [6.065039]  ? pcpu_alloc_area+0xdd/0x120
> [6.065040]  ? pcpu_next_unpop+0x32/0x40
> [6.065052]  mount_fs+0x36/0x162
> [6.065055]  vfs_kern_mount.part.34+0x4f/0x120
> [6.065064]  btrfs_mount+0x15f/0x890 [btrfs]
> [6.065067]  ? pcpu_cnt_pop_pages+0x40/0x50
> [6.065069]  ? pcpu_alloc_area+0xdd/0x120
> [6.065071]  ? pcpu_next_unpop+0x32/0x40
> [6.065073]  ? cpumask_next+0x16/0x20
> [ 

Re: [PATCH v6] test unaligned punch hole at ENOSPC

2018-10-05 Thread Filipe Manana
On Sun, Sep 30, 2018 at 2:40 AM Anand Jain  wrote:
>
> Try to punch hole with unaligned size and offset when the FS is
> full. Mainly holes are punched at locations which are unaligned
> with the file extent boundaries when the FS is full by data.
> As the punching holes at unaligned location will involve
> truncating blocks instead of just dropping the extents, it shall
> involve reserving data and metadata space for delalloc and so data
> alloc fails as the FS is full.
>
> btrfs_punch_hole()
>  btrfs_truncate_block()
>btrfs_check_data_free_space() <-- ENOSPC
>
> We don't fail punch hole if the holes are aligned with the file
> extent boundaries as it shall involve just dropping the related
> extents, without truncating data extent blocks.
>
> Signed-off-by: Anand Jain 
Reviewed-by: Filipe Manana 

Looks good, thanks!
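
(For the record, with the common 4096-byte sectorsize the three punches
become [0, 512), [4608, 8192) and [8704, 29184). Every one of them has
at least one edge in the middle of a block, so each goes through
btrfs_truncate_block() and needs the delalloc reservation that used to
fail with ENOSPC here.)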

> ---
> v5->v6:
>  fix comments at two places in the test case
>  drop -f when using xfs_io to punch hole
>  sync after dd is dropped
>  change log is slightly updated
> v4->v5:
>  Update the change log
>  Drop the directio option for xfs_io
> v3->v4:
>  Add to the group punch
> v2->v3:
>  Add _require_xfs_io_command "fpunch"
>  Add more logs to $seqfull.full
>mount options and
>group profile info
>  Add sync after dd upto ENOSPC
>  Drop fallocate -p and use xfs_io punch to create holes
>  Use a testfile instead of filler file so that easy to trace
> v1->v2: Use at least 256MB to test.
> This test case fails on btrfs as of now.
>  tests/btrfs/172 | 73 
> +
>  tests/btrfs/172.out |  2 ++
>  tests/btrfs/group   |  1 +
>  3 files changed, 76 insertions(+)
>  create mode 100755 tests/btrfs/172
>  create mode 100644 tests/btrfs/172.out
>
> diff --git a/tests/btrfs/172 b/tests/btrfs/172
> new file mode 100755
> index ..0dffb2dff40b
> --- /dev/null
> +++ b/tests/btrfs/172
> @@ -0,0 +1,73 @@
> +#! /bin/bash
> +# SPDX-License-Identifier: GPL-2.0
> +# Copyright (c) 2018 Oracle. All Rights Reserved.
> +#
> +# FS QA Test 172
> +#
> +# Test if the unaligned (by size and offset) punch hole is successful when FS
> +# is at ENOSPC.
> +#
> +seq=`basename $0`
> +seqres=$RESULT_DIR/$seq
> +echo "QA output created by $seq"
> +
> +here=`pwd`
> +tmp=/tmp/$$
> +status=1   # failure is the default!
> +trap "_cleanup; exit \$status" 0 1 2 3 15
> +
> +_cleanup()
> +{
> +   cd /
> +   rm -f $tmp.*
> +}
> +
> +# get standard environment, filters and checks
> +. ./common/rc
> +. ./common/filter
> +
> +# remove previous $seqres.full before test
> +rm -f $seqres.full
> +
> +# real QA test starts here
> +
> +# Modify as appropriate.
> +_supported_fs btrfs
> +_supported_os Linux
> +_require_scratch
> +_require_xfs_io_command "fpunch"
> +
> +_scratch_mkfs_sized $((256 * 1024 *1024)) >> $seqres.full
> +
> +# max_inline ensures data is not inlined within metadata extents
> +_scratch_mount "-o max_inline=0,nodatacow"
> +
> +cat /proc/self/mounts | grep $SCRATCH_DEV >> $seqres.full
> +$BTRFS_UTIL_PROG filesystem df $SCRATCH_MNT >> $seqres.full
> +
> +extent_size=$(_scratch_btrfs_sectorsize)
> +unalign_by=512
> +echo extent_size=$extent_size unalign_by=$unalign_by >> $seqres.full
> +
> +$XFS_IO_PROG -f -c "pwrite -S 0xab 0 $((extent_size * 10))" \
> +   $SCRATCH_MNT/testfile >> $seqres.full
> +
> +echo "Fill all space available for data and all unallocated space." >> 
> $seqres.full
> +dd status=none if=/dev/zero of=$SCRATCH_MNT/filler bs=512 >> $seqres.full 
> 2>&1
> +
> +hole_offset=0
> +hole_len=$unalign_by
> +$XFS_IO_PROG -c "fpunch $hole_offset $hole_len" $SCRATCH_MNT/testfile
> +
> +hole_offset=$(($extent_size + $unalign_by))
> +hole_len=$(($extent_size - $unalign_by))
> +$XFS_IO_PROG -c "fpunch $hole_offset $hole_len" $SCRATCH_MNT/testfile
> +
> +hole_offset=$(($extent_size * 2 + $unalign_by))
> +hole_len=$(($extent_size * 5))
> +$XFS_IO_PROG -c "fpunch $hole_offset $hole_len" $SCRATCH_MNT/testfile
> +
> +# success, all done
> +echo "Silence is golden"
> +status=0
> +exit
> diff --git a/tests/btrfs/172.out b/tests/btrfs/172.out
> new file mode 100644
> index ..ce2de3f0d107
> --- /dev/null
> +++ b/tests/btrfs/172.out
> @@ -0,0 +1,2 @@
> +QA output created by 172
> +Silence is golden
> diff --git a/tests/btrfs/group b/tests/btrfs/group
> index feffc45b6564..45782565c3b7 100644
> --- a/tests/btrfs/group
> +++ b/tests/btrfs/group
> @@ -174,3 +174,4 @@
>  169 auto quick send
>  170 auto quick snapshot
>  171 auto quick qgroup
> +172 auto quick punch
> --
> 1.8.3.1
>


-- 
Filipe David Manana,

“Whether you think you can, or you think you can't — you're right.”


Re: [PATCH 18/42] btrfs: move the dio_sem higher up the callchain

2018-10-03 Thread Filipe Manana
On Fri, Sep 28, 2018 at 12:19 PM Josef Bacik  wrote:
>
> We're getting a lockdep splat because we take the dio_sem under the
> log_mutex.  What we really need is to protect fsync() from logging an
> extent map for an extent we never waited on higher up, so just guard the
> whole thing with dio_sem.
>
> Signed-off-by: Josef Bacik 
Reviewed-by: Filipe Manana 

Looks good, thanks. However as David said, it would be nice to have a
sample trace pasted in the changelog (several fstests test cases
trigger this often).
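
(In other words, with this change dio_sem is taken right after the inode
lock for the whole fsync path, instead of being acquired inside
btrfs_log_changed_extents() while the log_mutex is already held, which is
what produced the lockdep splat.)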


> ---
>  fs/btrfs/file.c | 12 
>  fs/btrfs/tree-log.c |  2 --
>  2 files changed, 12 insertions(+), 2 deletions(-)
>
> diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
> index 095f0bb86bb7..c07110edb9de 100644
> --- a/fs/btrfs/file.c
> +++ b/fs/btrfs/file.c
> @@ -2079,6 +2079,14 @@ int btrfs_sync_file(struct file *file, loff_t start, 
> loff_t end, int datasync)
> goto out;
>
> inode_lock(inode);
> +
> +   /*
> +* We take the dio_sem here because the tree log stuff can race with
> +* lockless dio writes and get an extent map logged for an extent we
> +* never waited on.  We need it this high up for lockdep reasons.
> +*/
> +   down_write(&BTRFS_I(inode)->dio_sem);
> +
> atomic_inc(&root->log_batch);
>
> /*
> @@ -2087,6 +2095,7 @@ int btrfs_sync_file(struct file *file, loff_t start, 
> loff_t end, int datasync)
>  */
> ret = btrfs_wait_ordered_range(inode, start, len);
> if (ret) {
> +   up_write(&BTRFS_I(inode)->dio_sem);
> inode_unlock(inode);
> goto out;
> }
> @@ -2110,6 +2119,7 @@ int btrfs_sync_file(struct file *file, loff_t start, 
> loff_t end, int datasync)
>  * checked called fsync.
>  */
> ret = filemap_check_wb_err(inode->i_mapping, file->f_wb_err);
> +   up_write(&BTRFS_I(inode)->dio_sem);
> inode_unlock(inode);
> goto out;
> }
> @@ -2128,6 +2138,7 @@ int btrfs_sync_file(struct file *file, loff_t start, 
> loff_t end, int datasync)
> trans = btrfs_start_transaction(root, 0);
> if (IS_ERR(trans)) {
> ret = PTR_ERR(trans);
> +   up_write(&BTRFS_I(inode)->dio_sem);
> inode_unlock(inode);
> goto out;
> }
> @@ -2149,6 +2160,7 @@ int btrfs_sync_file(struct file *file, loff_t start, 
> loff_t end, int datasync)
>  * file again, but that will end up using the synchronization
>  * inside btrfs_sync_log to keep things safe.
>  */
> +   up_write(&BTRFS_I(inode)->dio_sem);
> inode_unlock(inode);
>
> /*
> diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
> index 1650dc44a5e3..66b7e059b765 100644
> --- a/fs/btrfs/tree-log.c
> +++ b/fs/btrfs/tree-log.c
> @@ -4374,7 +4374,6 @@ static int btrfs_log_changed_extents(struct 
> btrfs_trans_handle *trans,
>
> INIT_LIST_HEAD(&extents);
>
> -   down_write(&inode->dio_sem);
> write_lock(&tree->lock);
> test_gen = root->fs_info->last_trans_committed;
> logged_start = start;
> @@ -4440,7 +4439,6 @@ static int btrfs_log_changed_extents(struct 
> btrfs_trans_handle *trans,
> }
> WARN_ON(!list_empty(&extents));
> write_unlock(&tree->lock);
> -   up_write(&inode->dio_sem);
>
> btrfs_release_path(path);
> if (!ret)
> --
> 2.14.3
>


-- 
Filipe David Manana,

“Whether you think you can, or you think you can't — you're right.”


Re: btrfs send receive ERROR: chown failed: No such file or directory

2018-10-02 Thread Filipe Manana
On Tue, Oct 2, 2018 at 7:02 AM Leonard Lausen  wrote:
>
> Hello,
>
> does anyone have an idea about below issue? It is a severe issue as it
> renders btrfs send / receive dysfunctional and it is not clear if there
> may be a data corruption issue hiding in the current send / receive
> code.
>
> Thank you.
>
> Best regards
> Leonard
>
> Leonard Lausen  writes:
> > Hello!
> >
> > I observe the following issue with btrfs send | btrfs receive in a setup
> > with 2 machines and 3 btrfs file-systems. All machines run Linux 4.18.9.
> > Machine 1 runs btrfs-progs 4.17.1, machine 2 runs btrfs-progs 4.17 (via
> > https://packages.debian.org/stretch-backports/btrfs-progs).
> >
> > 1) Machine 1 takes regular snapshots and sends them to machine 2. btrfs
> >btrfs send ... | ssh user@machine2 "btrfs receive /path1"
> > 2) Machine 2 backups all subvolumes stored at /path1 to a second
> >independent btrfs filesystem. Let /path1/rootsnapshot be the first
> >snapshot stored at /path1 (ie. it has no Parent UUID). Let
> >/path1/incrementalsnapshot be a snapshot that has /path1/rootsnapshot
> >as a parent. Then
> >btrfs send -v /path1/rootsnapshot | btrfs receive /path2
> >works without issues, but
> >btrfs send -v -p /path1/rootsnapshot /path1/incrementalsnapshot | btrfs 
> > receive /path2

-v is useless. Use -vv, which will dump all commands.
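(That is, something along the lines of
btrfs send -vv -p /path1/rootsnapshot /path1/incrementalsnapshot | btrfs receive /path2
so the commands in the stream (mkfile, rename, chown, ...) are printed,
which should show exactly which object the failing chown refers to.)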

> >fails as follows:
> >ERROR: chown o257-4639416-0 failed: No such file or directory
> >
> > No error is shown in dmesg. /path1 and /path2 denote two independent
> > btrfs filesystems.
> >
> > Note that there was no issue with transferring incrementalsnapshot from
> > machine 1 to machine 2. No error is shown in dmesg.
> >
> > Best regards
> > Leonard



-- 
Filipe David Manana,

“Whether you think you can, or you think you can't — you're right.”


Re: [PATCH v5] test unaligned punch hole at ENOSPC

2018-09-29 Thread Filipe Manana
On Sat, Sep 29, 2018 at 1:52 AM Anand Jain  wrote:
>
> Try to punch hole with unaligned size and offset when the FS is
> full. Mainly holes are punched at locations which are unaligned
> with the file extent boundaries when the FS is full by data.
> As the punching holes at unaligned location will involve
> truncating blocks instead of just dropping the extents, it shall
> involve reserving data and metadata space for delalloc and data
> alloc fails as the FS is full.
>
> btrfs_punch_hole()
>  btrfs_truncate_block()
>btrfs_check_data_free_space() <-- ENOSPC
>
> We don't fail punch hole if the holes are aligned with the file
> extent boundaries as it shall involve just dropping the related
> extents.
>
> Signed-off-by: Anand Jain 
> ---
> v4->v5:
>  Update the change log
>  Drop the directio option for xfs_io

Except for the direct IO and the change log, my previous comments and
questions weren't addressed or answered.
Thanks.

> v3->v4:
>  Add to the group punch
> v2->v3:
>  Add _require_xfs_io_command "fpunch"
>  Add more logs to $seqfull.full
>mount options and
>group profile info
>  Add sync after dd upto ENOSPC
>  Drop fallocate -p and use xfs_io punch to create holes
>  Use a testfile instead of filler file so that easy to trace
> v1->v2: Use at least 256MB to test.
> This test case fails on btrfs as of now.
>  tests/btrfs/172 | 74 
> +
>  tests/btrfs/172.out |  2 ++
>  tests/btrfs/group   |  1 +
>  3 files changed, 77 insertions(+)
>  create mode 100755 tests/btrfs/172
>  create mode 100644 tests/btrfs/172.out
>
> diff --git a/tests/btrfs/172 b/tests/btrfs/172
> new file mode 100755
> index ..1ecf01d862a2
> --- /dev/null
> +++ b/tests/btrfs/172
> @@ -0,0 +1,74 @@
> +#! /bin/bash
> +# SPDX-License-Identifier: GPL-2.0
> +# Copyright (c) 2018 Oracle. All Rights Reserved.
> +#
> +# FS QA Test 172
> +#
> +# Test if the unaligned (by size and offset) punch hole is successful when FS
> +# is at ENOSPC.
> +#
> +seq=`basename $0`
> +seqres=$RESULT_DIR/$seq
> +echo "QA output created by $seq"
> +
> +here=`pwd`
> +tmp=/tmp/$$
> +status=1   # failure is the default!
> +trap "_cleanup; exit \$status" 0 1 2 3 15
> +
> +_cleanup()
> +{
> +   cd /
> +   rm -f $tmp.*
> +}
> +
> +# get standard environment, filters and checks
> +. ./common/rc
> +. ./common/filter
> +
> +# remove previous $seqres.full before test
> +rm -f $seqres.full
> +
> +# real QA test starts here
> +
> +# Modify as appropriate.
> +_supported_fs btrfs
> +_supported_os Linux
> +_require_scratch
> +_require_xfs_io_command "fpunch"
> +
> +_scratch_mkfs_sized $((256 * 1024 *1024)) >> $seqres.full
> +
> +# max_inline helps to create regular extent
> +_scratch_mount "-o max_inline=0,nodatacow"
> +
> +cat /proc/self/mounts | grep $SCRATCH_DEV >> $seqres.full
> +$BTRFS_UTIL_PROG filesystem df $SCRATCH_MNT >> $seqres.full
> +
> +extent_size=$(_scratch_btrfs_sectorsize)
> +unalign_by=512
> +echo extent_size=$extent_size unalign_by=$unalign_by >> $seqres.full
> +
> +$XFS_IO_PROG -f -c "pwrite -S 0xab 0 $((extent_size * 10))" \
> +   $SCRATCH_MNT/testfile >> $seqres.full
> +
> +echo "Fill fs upto ENOSPC" >> $seqres.full
> +dd status=none if=/dev/zero of=$SCRATCH_MNT/filler bs=512 >> $seqres.full 
> 2>&1
> +sync
> +
> +hole_offset=0
> +hole_len=$unalign_by
> +$XFS_IO_PROG -f -c "fpunch $hole_offset $hole_len" $SCRATCH_MNT/testfile
> +
> +hole_offset=$(($extent_size + $unalign_by))
> +hole_len=$(($extent_size - $unalign_by))
> +$XFS_IO_PROG -f -c "fpunch $hole_offset $hole_len" $SCRATCH_MNT/testfile
> +
> +hole_offset=$(($extent_size * 2 + $unalign_by))
> +hole_len=$(($extent_size * 5))
> +$XFS_IO_PROG -f -c "fpunch $hole_offset $hole_len" $SCRATCH_MNT/testfile
> +
> +# success, all done
> +echo "Silence is golden"
> +status=0
> +exit
> diff --git a/tests/btrfs/172.out b/tests/btrfs/172.out
> new file mode 100644
> index ..ce2de3f0d107
> --- /dev/null
> +++ b/tests/btrfs/172.out
> @@ -0,0 +1,2 @@
> +QA output created by 172
> +Silence is golden
> diff --git a/tests/btrfs/group b/tests/btrfs/group
> index feffc45b6564..45782565c3b7 100644
> --- a/tests/btrfs/group
> +++ b/tests/btrfs/group
> @@ -174,3 +174,4 @@
>  169 auto quick send
>  170 auto quick snapshot
>  171 auto quick qgroup
> +172 auto quick punch
> --
> 1.8.3.1
>


-- 
Filipe David Manana,

“Whether you think you can, or you think you can't — you're right.”


Re: [PATCH v3] test unaligned punch hole at ENOSPC

2018-09-28 Thread Filipe Manana
On Fri, Sep 28, 2018 at 6:08 PM Filipe Manana  wrote:
>
> On Fri, Sep 28, 2018 at 3:51 PM Anand Jain  wrote:
> >
> > Try to punch hole with unaligned size and offset when the FS
> > returns ENOSPC
>
"The FS returns ENOSPC" is confusing. It's clearer to say "when the
filesystem doesn't have more space available for data allocation".
> >
> > Signed-off-by: Anand Jain 
> > ---
> > v2->v3:
> >  add _require_xfs_io_command "fpunch"
> >  add more logs to $seqfull.full
> >mount options and
> >group profile info
> >  add sync after dd upto ENOSPC
> >  drop fallocate -p and use xfs_io punch to create holes
> > v1->v2: Use at least 256MB to test.
> > This test case fails on btrfs as of now.
> >
> >  tests/btrfs/172 | 74 
> > +
> >  tests/btrfs/172.out |  2 ++
> >  tests/btrfs/group   |  1 +
> >  3 files changed, 77 insertions(+)
> >  create mode 100755 tests/btrfs/172
> >  create mode 100644 tests/btrfs/172.out
> >
> > diff --git a/tests/btrfs/172 b/tests/btrfs/172
> > new file mode 100755
> > index ..59413a5de12f
> > --- /dev/null
> > +++ b/tests/btrfs/172
> > @@ -0,0 +1,74 @@
> > +#! /bin/bash
> > +# SPDX-License-Identifier: GPL-2.0
> > +# Copyright (c) 2018 Oracle. All Rights Reserved.
> > +#
> > +# FS QA Test 172
> > +#
> > +# Test if the unaligned (by size and offset) punch hole is successful when 
> > FS
> > +# is at ENOSPC.
> > +#
> > +seq=`basename $0`
> > +seqres=$RESULT_DIR/$seq
> > +echo "QA output created by $seq"
> > +
> > +here=`pwd`
> > +tmp=/tmp/$$
> > +status=1   # failure is the default!
> > +trap "_cleanup; exit \$status" 0 1 2 3 15
> > +
> > +_cleanup()
> > +{
> > +   cd /
> > +   rm -f $tmp.*
> > +}
> > +
> > +# get standard environment, filters and checks
> > +. ./common/rc
> > +. ./common/filter
> > +
> > +# remove previous $seqres.full before test
> > +rm -f $seqres.full
> > +
> > +# real QA test starts here
> > +
> > +# Modify as appropriate.
> > +_supported_fs btrfs
> > +_supported_os Linux
> > +_require_scratch
> > +_require_xfs_io_command "fpunch"
> > +
> > +_scratch_mkfs_sized $((256 * 1024 *1024)) >> $seqres.full
> > +
> > +# max_inline helps to create regular extent
> max_inline ensures data is not inlined within metadata extents
>
> > +_scratch_mount "-o max_inline=0,nodatacow"
> > +
> > +cat /proc/self/mounts | grep $SCRATCH_DEV >> $seqres.full
> > +$BTRFS_UTIL_PROG filesystem df $SCRATCH_MNT >> $seqres.full
> > +
> > +extent_size=$(_scratch_btrfs_sectorsize)
> > +unalign_by=512
> > +echo extent_size=$extent_size unalign_by=$unalign_by >> $seqres.full
> > +
> > +$XFS_IO_PROG -f -d -c "pwrite -S 0xab 0 $((extent_size * 10))" \
> > +   $SCRATCH_MNT/testfile >> 
> > $seqres.full

Also missing _require_odirect.
Why is direct IO needed? If it's not needed (and I don't see why it
would be), it can be avoided.
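(_require_odirect is needed whenever the test passes -d to xfs_io,
because O_DIRECT support can't be assumed for every filesystem
configuration; if the -d flags go away, that requirement isn't needed
either.)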

> > +
> > +echo "Fill fs upto ENOSPC" >> $seqres.full
> Fill all space available for data and all unallocated space.
>
> > +dd status=none if=/dev/zero of=$SCRATCH_MNT/filler bs=512 >> $seqres.full 
> > 2>&1
> Why do you use dd here and not xfs_io?
>
> > +sync
> Why is the sync needed?
>
> > +
> > +hole_offset=0
> > +hole_len=$unalign_by
> > +$XFS_IO_PROG -f -d -c "fpunch $hole_offset $hole_len" $SCRATCH_MNT/testfile
>
> No need to pass -f anymore. No need for -d either.
>
> > +
> > +hole_offset=$(($extent_size + $unalign_by))
> > +hole_len=$(($extent_size - $unalign_by))
> > +$XFS_IO_PROG -f -d -c "fpunch $hole_offset $hole_len" $SCRATCH_MNT/testfile
>
> No need to pass -f anymore. No need for -d either.
>
> > +
> > +hole_offset=$(($extent_size * 2 + $unalign_by))
> > +hole_len=$(($extent_size * 5))
> > +$XFS_IO_PROG -f -d -c "fpunch $hole_offset $hole_len" $SCRATCH_MNT/testfile
>
> No need to pass -f anymore. No need for -d either.
> > +
> > +# success, all done
> > +echo "Silence is golden"
> > +status=0
> > +exit
> > diff --git a/tests/btrfs/172.out b/tests/btrfs/172.out
> > new file mode 100644
> > index ..ce2de3f0d107
> > --- /dev/null
> > +++ b/tests/btrfs/172.out
> > @@ -0,0 +1,2 @@
> > +QA output created by 172
> > +Silence is golden
> > diff --git a/tests/btrfs/group b/tests/btrfs/group
> > index feffc45b6564..7e1a638ab7e1 100644
> > --- a/tests/btrfs/group
> > +++ b/tests/btrfs/group
> > @@ -174,3 +174,4 @@
> >  169 auto quick send
> >  170 auto quick snapshot
> >  171 auto quick qgroup
> > +172 auto quick
> > --
> > 1.8.3.1
> >
>
>
> --
> Filipe David Manana,
>
> “Whether you think you can, or you think you can't — you're right.”



-- 
Filipe David Manana,

“Whether you think you can, or you think you can't — you're right.”


Re: [PATCH v3] test unaligned punch hole at ENOSPC

2018-09-28 Thread Filipe Manana
On Fri, Sep 28, 2018 at 3:51 PM Anand Jain  wrote:
>
> Try to punch hole with unaligned size and offset when the FS
> returns ENOSPC

"The FS returns ENOSPC" is confusing. It's clearer to say "when the
filesystem doesn't have more space available for data allocation".
>
> Signed-off-by: Anand Jain 
> ---
> v2->v3:
>  add _require_xfs_io_command "fpunch"
>  add more logs to $seqfull.full
>mount options and
>group profile info
>  add sync after dd upto ENOSPC
>  drop fallocate -p and use xfs_io punch to create holes
> v1->v2: Use at least 256MB to test.
> This test case fails on btrfs as of now.
>
>  tests/btrfs/172 | 74 
> +
>  tests/btrfs/172.out |  2 ++
>  tests/btrfs/group   |  1 +
>  3 files changed, 77 insertions(+)
>  create mode 100755 tests/btrfs/172
>  create mode 100644 tests/btrfs/172.out
>
> diff --git a/tests/btrfs/172 b/tests/btrfs/172
> new file mode 100755
> index ..59413a5de12f
> --- /dev/null
> +++ b/tests/btrfs/172
> @@ -0,0 +1,74 @@
> +#! /bin/bash
> +# SPDX-License-Identifier: GPL-2.0
> +# Copyright (c) 2018 Oracle. All Rights Reserved.
> +#
> +# FS QA Test 172
> +#
> +# Test if the unaligned (by size and offset) punch hole is successful when FS
> +# is at ENOSPC.
> +#
> +seq=`basename $0`
> +seqres=$RESULT_DIR/$seq
> +echo "QA output created by $seq"
> +
> +here=`pwd`
> +tmp=/tmp/$$
> +status=1   # failure is the default!
> +trap "_cleanup; exit \$status" 0 1 2 3 15
> +
> +_cleanup()
> +{
> +   cd /
> +   rm -f $tmp.*
> +}
> +
> +# get standard environment, filters and checks
> +. ./common/rc
> +. ./common/filter
> +
> +# remove previous $seqres.full before test
> +rm -f $seqres.full
> +
> +# real QA test starts here
> +
> +# Modify as appropriate.
> +_supported_fs btrfs
> +_supported_os Linux
> +_require_scratch
> +_require_xfs_io_command "fpunch"
> +
> +_scratch_mkfs_sized $((256 * 1024 *1024)) >> $seqres.full
> +
> +# max_inline helps to create regular extent
max_inline ensures data is not inlined within metadata extents

> +_scratch_mount "-o max_inline=0,nodatacow"
> +
> +cat /proc/self/mounts | grep $SCRATCH_DEV >> $seqres.full
> +$BTRFS_UTIL_PROG filesystem df $SCRATCH_MNT >> $seqres.full
> +
> +extent_size=$(_scratch_btrfs_sectorsize)
> +unalign_by=512
> +echo extent_size=$extent_size unalign_by=$unalign_by >> $seqres.full
> +
> +$XFS_IO_PROG -f -d -c "pwrite -S 0xab 0 $((extent_size * 10))" \
> +   $SCRATCH_MNT/testfile >> $seqres.full
> +
> +echo "Fill fs upto ENOSPC" >> $seqres.full
Fill all space available for data and all unallocated space.

> +dd status=none if=/dev/zero of=$SCRATCH_MNT/filler bs=512 >> $seqres.full 
> 2>&1
Why do you use dd here and not xfs_io?

> +sync
Why is the sync needed?

> +
> +hole_offset=0
> +hole_len=$unalign_by
> +$XFS_IO_PROG -f -d -c "fpunch $hole_offset $hole_len" $SCRATCH_MNT/testfile

No need to pass -f anymore. No need for -d either.

> +
> +hole_offset=$(($extent_size + $unalign_by))
> +hole_len=$(($extent_size - $unalign_by))
> +$XFS_IO_PROG -f -d -c "fpunch $hole_offset $hole_len" $SCRATCH_MNT/testfile

No need to pass -f anymore. No need for -d either.

> +
> +hole_offset=$(($extent_size * 2 + $unalign_by))
> +hole_len=$(($extent_size * 5))
> +$XFS_IO_PROG -f -d -c "fpunch $hole_offset $hole_len" $SCRATCH_MNT/testfile

No need to pass -f anymore. No need for -d either.
> +
> +# success, all done
> +echo "Silence is golden"
> +status=0
> +exit
> diff --git a/tests/btrfs/172.out b/tests/btrfs/172.out
> new file mode 100644
> index ..ce2de3f0d107
> --- /dev/null
> +++ b/tests/btrfs/172.out
> @@ -0,0 +1,2 @@
> +QA output created by 172
> +Silence is golden
> diff --git a/tests/btrfs/group b/tests/btrfs/group
> index feffc45b6564..7e1a638ab7e1 100644
> --- a/tests/btrfs/group
> +++ b/tests/btrfs/group
> @@ -174,3 +174,4 @@
>  169 auto quick send
>  170 auto quick snapshot
>  171 auto quick qgroup
> +172 auto quick
> --
> 1.8.3.1
>


-- 
Filipe David Manana,

“Whether you think you can, or you think you can't — you're right.”


Re: Strange behavior (possible bugs) in btrfs

2018-08-29 Thread Filipe Manana
On Tue, Aug 28, 2018 at 9:35 PM Jayashree Mohan  wrote:
>
> Hi Filipe,
>
> This is to follow up the status of crash consistency bugs we reported
> on btrfs. We see that there has been a patch (not in the kernel yet)
> (https://www.mail-archive.com/linux-btrfs@vger.kernel.org/msg77875.html)
> that resolves one of the reported bugs. However, the other bugs we
> reported still exist on the latest kernel (4.19-rc1), even with the
> submitted patch. Here is the list of other inconsistencies we
> reported, along with the workload to reproduce them :
> https://www.spinics.net/lists/linux-btrfs/msg77219.html
>
> We just wanted to ensure that resolving these are on your to-do list.
> Additionally, if there are more patches queued to address these
> issues, please let us know.

Hi,

I go through the issues as time allows. Not all of these are top
priorities for me right now.
Been working on a fix for some of them but they are not yet ready to
submit (need more testing, or cause other problems or are too
complex).
If suddenly there are people hitting any of these issues frequently
and causing trouble I'll give them higher priority.

Thanks.

>
> Thanks,
> Jayashree Mohan
>
> Thanks,
> Jayashree Mohan
>
>
>
> On Fri, May 11, 2018 at 10:45 AM Filipe Manana  wrote:
> >
> > On Mon, Apr 30, 2018 at 5:04 PM, Vijay Chidambaram  
> > wrote:
> > > Hi,
> > >
> > > We found two more cases where the btrfs behavior is a little strange.
> > > In one case, an fsync-ed file goes missing after a crash. In the
> > > other, a renamed file shows up in both directories after a crash.
> > >
> > > Workload 1:
> > >
> > > mkdir A
> > > mkdir B
> > > mkdir A/C
> > > creat B/foo
> > > fsync B/foo
> > > link B/foo A/C/foo
> > > fsync A
> > > -- crash --
> > >
> > > Expected state after recovery:
> > > B B/foo A A/C exist
> > >
> > > What we find:
> > > Only B B/foo exist
> > >
> > > A is lost even after explicit fsync to A.
> > >
> > > Workload 2:
> > >
> > > mkdir A
> > > mkdir A/C
> > > rename A/C B
> > > touch B/bar
> > > fsync B/bar
> > > rename B/bar A/bar
> > > rename A B (replacing B with A at this point)
> > > fsync B/bar
> > > -- crash --
> > >
> > > Expected contents after recovery:
> > > A/bar
> > >
> > > What we find after recovery:
> > > A/bar
> > > B/bar
> > >
> > > We think this breaks rename's atomicity guarantee. bar should be
> > > present in either A or B, but now it is present in both.
> >
> > I'll take a look at these, and all the other potential issues you
> > reported in other threads, next week and let you know.
> > Thanks.
> >
> > >
> > > Thanks,
> > > Vijay
> > > --
> > > To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> > > the body of a message to majord...@vger.kernel.org
> > > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> >
> >
> >
> > --
> > Filipe David Manana,
> >
> > “Whether you think you can, or you think you can't — you're right.”



-- 
Filipe David Manana,

“Whether you think you can, or you think you can't — you're right.”


Re: [PATCH] generic: test for deduplication between different files

2018-08-21 Thread Filipe Manana
On Mon, Aug 20, 2018 at 12:11 AM, Dave Chinner  wrote:
> [cc linux-...@vger.kernel.org]
>
> On Fri, Aug 17, 2018 at 09:39:24AM +0100, fdman...@kernel.org wrote:
>> From: Filipe Manana 
>>
>> Test that deduplication of an entire file that has a size that is not
>> aligned to the filesystem's block size into a different file does not
>> corrupt the destination's file data.
>>
>> This test is motivated by a bug found in Btrfs which is fixed by the
>> following patch for the linux kernel:
>>
>>   "Btrfs: fix data corruption when deduplicating between different files"
>>
>> XFS also fails this test, at least as of linux kernel 4.18-rc7, exactly
>> with the same corruption as in Btrfs - some bytes of a block get replaced
>> with zeroes after the deduplication.
>
> Filipe, in future can please report XFS bugs you find to the XFS
> list the moment you find them. We shouldn't ever find out about a
> data corruption bug we need to fix via a "oh, by the way" comment in
> a commit message for a regression test....

I actually intended to add linux-xfs in CC, but I clearly forgot to do it.

>
> Cheers,
>
> Dave.
>
>> Signed-off-by: Filipe Manana 
>> ---
>>  tests/generic/505 | 84 
>> +++
>>  tests/generic/505.out | 33 
>>  tests/generic/group   |  1 +
>>  3 files changed, 118 insertions(+)
>>  create mode 100755 tests/generic/505
>>  create mode 100644 tests/generic/505.out
>>
>> diff --git a/tests/generic/505 b/tests/generic/505
>> new file mode 100755
>> index ..5ee232a2
>> --- /dev/null
>> +++ b/tests/generic/505
>> @@ -0,0 +1,84 @@
>> +#! /bin/bash
>> +# SPDX-License-Identifier: GPL-2.0
>> +# Copyright (C) 2018 SUSE Linux Products GmbH. All Rights Reserved.
>> +#
>> +# FS QA Test No. 505
>> +#
>> +# Test that deduplication of an entire file that has a size that is not 
>> aligned
>> +# to the filesystem's block size into a different file does not corrupt the
>> +# destination's file data.
>> +#
>> +seq=`basename $0`
>> +seqres=$RESULT_DIR/$seq
>> +echo "QA output created by $seq"
>> +tmp=/tmp/$$
>> +status=1 # failure is the default!
>> +trap "_cleanup; exit \$status" 0 1 2 3 15
>> +
>> +_cleanup()
>> +{
>> + cd /
>> + rm -f $tmp.*
>> +}
>> +
>> +# get standard environment, filters and checks
>> +. ./common/rc
>> +. ./common/filter
>> +. ./common/reflink
>> +
>> +# real QA test starts here
>> +_supported_fs generic
>> +_supported_os Linux
>> +_require_scratch_dedupe
>> +
>> +rm -f $seqres.full
>> +
>> +_scratch_mkfs >>$seqres.full 2>&1
>> +_scratch_mount
>> +
>> +# The first byte with a value of 0xae starts at an offset (2518890) which 
>> is not
>> +# a multiple of the block size.
>> +$XFS_IO_PROG -f \
>> + -c "pwrite -S 0x6b 0 2518890" \
>> + -c "pwrite -S 0xae 2518890 102398" \
>> + $SCRATCH_MNT/foo | _filter_xfs_io
>> +
>> +# Create a second file with a length not aligned to the block size, whose 
>> bytes
>> +# all have the value 0x6b, so that its extent(s) can be deduplicated with 
>> the
>> +# first file.
>> +$XFS_IO_PROG -f -c "pwrite -S 0x6b 0 557771" $SCRATCH_MNT/bar | 
>> _filter_xfs_io
>> +
>> +# The file is filled with bytes having the value 0x6b from offset 0 to 
>> offset
>> +# 2518889 and with the value 0xae from offset 2518890 to offset 2621287.
>> +echo "File content before deduplication:"
>> +od -t x1 $SCRATCH_MNT/foo
>> +
>> +# Now deduplicate the entire second file into a range of the first file that
>> +# also has all bytes with the value 0x6b. The destination range's end offset
>> +# must not be aligned to the block size and must be less then the offset of
>> +# the first byte with the value 0xae (byte at offset 2518890).
>> +$XFS_IO_PROG -c "dedupe $SCRATCH_MNT/bar 0 1957888 557771" $SCRATCH_MNT/foo 
>> \
>> + | _filter_xfs_io
>> +
>> +# The bytes in the range starting at offset 2515659 (end of the 
>> deduplication
>> +# range) and ending at offset 2519040 (start offset rounded up to the block
>> +# size) must all have the value 0xae (and not replaced with 0x00 values).
>> +# In other words, we should have exactly the same data we had before we 
>> asked
>> +# for deduplication.
>> +echo "File content after 

Re: [patch] file dedupe (and maybe clone) data corruption (was Re: [PATCH] generic: test for deduplication between different files)

2018-08-21 Thread Filipe Manana
On Mon, Aug 20, 2018 at 2:09 AM, Dave Chinner  wrote:
> [cc linux-fsdevel now, too]
>
> On Mon, Aug 20, 2018 at 09:11:26AM +1000, Dave Chinner wrote:
>> [cc linux-...@vger.kernel.org]
>>
>> On Fri, Aug 17, 2018 at 09:39:24AM +0100, fdman...@kernel.org wrote:
>> > From: Filipe Manana 
>> >
>> > Test that deduplication of an entire file that has a size that is not
>> > aligned to the filesystem's block size into a different file does not
>> > corrupt the destination's file data.
>
> Ok, I've looked at this now. My first question is where did all the
> magic offsets in this test come from? i.e. how was this bug
> found and who is it affecting?

I found it myself. I'm not aware of any users or applications affected by it.

>
>> > This test is motivated by a bug found in Btrfs which is fixed by the
>> > following patch for the linux kernel:
>> >
>> >   "Btrfs: fix data corruption when deduplicating between different files"
>> >
>> > XFS also fails this test, at least as of linux kernel 4.18-rc7, exactly
>> > with the same corruption as in Btrfs - some bytes of a block get replaced
>> > with zeroes after the deduplication.
>>
>> Filipe, in future can please report XFS bugs you find to the XFS
>> list the moment you find them. We shouldn't ever find out about a
>> data corruption bug we need to fix via a "oh, by the way" comment in
>> a commit message for a regression test
>
> This becomes much more relevant because of what I've just found
>
> .
>
>> > +# The first byte with a value of 0xae starts at an offset (2518890) which 
>> > is not
>> > +# a multiple of the block size.
>> > +$XFS_IO_PROG -f \
>> > +   -c "pwrite -S 0x6b 0 2518890" \
>> > +   -c "pwrite -S 0xae 2518890 102398" \
>> > +   $SCRATCH_MNT/foo | _filter_xfs_io
>> > +
>> > +# Create a second file with a length not aligned to the block size, whose 
>> > bytes
>> > +# all have the value 0x6b, so that its extent(s) can be deduplicated with 
>> > the
>> > +# first file.
>> > +$XFS_IO_PROG -f -c "pwrite -S 0x6b 0 557771" $SCRATCH_MNT/bar | 
>> > _filter_xfs_io
>> > +
>> > +# The file is filled with bytes having the value 0x6b from offset 0 to 
>> > offset
>> > +# 2518889 and with the value 0xae from offset 2518890 to offset 2621287.
>> > +echo "File content before deduplication:"
>> > +od -t x1 $SCRATCH_MNT/foo
>
> Please use "od -Ad -t x1 " so the file offsets reported by od
> match the offsets used in the test (i.e. in decimal, not octal).

Will do, in the next test version after agreement on the fix.
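
(With od's default octal addresses an offset like 2518890 is printed as
11467552, which is painful to match against the decimal offsets used by
the pwrite and dedupe commands; -Ad makes the addresses come out in
decimal.)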

>
>> > +
>> > +# Now deduplicate the entire second file into a range of the first file 
>> > that
>> > +# also has all bytes with the value 0x6b. The destination range's end 
>> > offset
>> > +# must not be aligned to the block size and must be less than the offset 
>> > of
>> > +# the first byte with the value 0xae (byte at offset 2518890).
>> > +$XFS_IO_PROG -c "dedupe $SCRATCH_MNT/bar 0 1957888 557771" 
>> > $SCRATCH_MNT/foo \
>> > +   | _filter_xfs_io
>
> Ok, now it gets fun. dedupe to non-block aligned rtanges is supposed
> to be rejected by the kernel in vfs_clone_file_prep_inodes(). i.e
> this check:
>
> /* Only reflink if we're aligned to block boundaries */
> if (!IS_ALIGNED(pos_in, bs) || !IS_ALIGNED(pos_in + blen, bs) ||
> !IS_ALIGNED(pos_out, bs) || !IS_ALIGNED(pos_out + blen, bs))
> return -EINVAL;
>
> And it's pretty clear that a length of 557771 is not block aligned
> (being an odd number).
>
> So why was this dedupe request even accepted by the kernel? Well,
> I think there's a bug in the check just above this:
>
> /* If we're linking to EOF, continue to the block boundary. */
> if (pos_in + *len == isize)
> blen = ALIGN(isize, bs) - pos_in;
> else
> blen = *len;

Yes, btrfs, for the dedupe call, also has its own place where it does
the same thing,
at fs/btrfs/ioctl.c:extent_same_check_offsets().
And that's precisely what made me suspicious about it, together with
what you note below about the call to btrfs_cmp_data() using the
original, unaligned length.
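
To make that concrete, here is a small standalone sketch (plain
userspace C, not the kernel code) that plugs the offsets from the test
into the two checks quoted above; ALIGN/IS_ALIGNED are reimplemented
here and the 4096-byte block size is an assumption. It prints
"accepted", i.e. the odd 557771-byte length slips through because the
source range ends at the source file's EOF:

#include <stdio.h>
#include <stdint.h>

#define ALIGN(x, a)      ((((x) + (a) - 1) / (a)) * (a))
#define IS_ALIGNED(x, a) ((x) % (a) == 0)

int main(void)
{
        uint64_t bs = 4096;                     /* assumed block size */
        uint64_t pos_in = 0, pos_out = 1957888, len = 557771;
        uint64_t isize = 557771;                /* i_size of the source file "bar" */
        uint64_t blen;

        /* If we're linking to EOF, continue to the block boundary. */
        if (pos_in + len == isize)
                blen = ALIGN(isize, bs) - pos_in;       /* 561152, block aligned */
        else
                blen = len;

        /* Only reflink if we're aligned to block boundaries */
        if (!IS_ALIGNED(pos_in, bs) || !IS_ALIGNED(pos_in + blen, bs) ||
            !IS_ALIGNED(pos_out, bs) || !IS_ALIGNED(pos_out + blen, bs))
                printf("rejected with EINVAL\n");
        else
                printf("accepted: blen=%llu even though len=%llu is unaligned\n",
                       (unsigned long long)blen, (unsigned long long)len);
        return 0;
}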

However, I just ran the same test using reflink and not dedupe and the
same problem happens. In earlier versions of the test/debugging I
either did not notice
or made some mistake because I hadn't seen the same problem for the
r

Re: [PATCH] generic: test for deduplication between different files

2018-08-19 Thread Filipe Manana
On Sun, Aug 19, 2018 at 5:19 PM, Eryu Guan  wrote:
> On Sun, Aug 19, 2018 at 04:41:31PM +0100, Filipe Manana wrote:
>> On Sun, Aug 19, 2018 at 3:07 PM, Eryu Guan  wrote:
>> > On Fri, Aug 17, 2018 at 09:39:24AM +0100, fdman...@kernel.org wrote:
>> >> From: Filipe Manana 
>> >>
>> >> Test that deduplication of an entire file that has a size that is not
>> >> aligned to the filesystem's block size into a different file does not
>> >> corrupt the destination's file data.
>> >>
>> >> This test is motivated by a bug found in Btrfs which is fixed by the
>> >> following patch for the linux kernel:
>> >>
>> >>   "Btrfs: fix data corruption when deduplicating between different files"
>> >>
>> >> XFS also fails this test, at least as of linux kernel 4.18-rc7, exactly
>> >> with the same corruption as in Btrfs - some bytes of a block get replaced
>> >> with zeroes after the deduplication.
>> >>
>> >> Signed-off-by: Filipe Manana 
>> >> ---
>> >>  tests/generic/505 | 84 
>> >> +++
>> >>  tests/generic/505.out | 33 
>> >>  tests/generic/group   |  1 +
>> >>  3 files changed, 118 insertions(+)
>> >>  create mode 100755 tests/generic/505
>> >>  create mode 100644 tests/generic/505.out
>> >>
>> >> diff --git a/tests/generic/505 b/tests/generic/505
>> >> new file mode 100755
>> >> index ..5ee232a2
>> >> --- /dev/null
>> >> +++ b/tests/generic/505
>> >> @@ -0,0 +1,84 @@
>> >> +#! /bin/bash
>> >> +# SPDX-License-Identifier: GPL-2.0
>> >> +# Copyright (C) 2018 SUSE Linux Products GmbH. All Rights Reserved.
>> >> +#
>> >> +# FS QA Test No. 505
>> >> +#
>> >> +# Test that deduplication of an entire file that has a size that is not 
>> >> aligned
>> >> +# to the filesystem's block size into a different file does not corrupt 
>> >> the
>> >> +# destination's file data.
>> >> +#
>> >> +seq=`basename $0`
>> >> +seqres=$RESULT_DIR/$seq
>> >> +echo "QA output created by $seq"
>> >> +tmp=/tmp/$$
>> >> +status=1 # failure is the default!
>> >> +trap "_cleanup; exit \$status" 0 1 2 3 15
>> >> +
>> >> +_cleanup()
>> >> +{
>> >> + cd /
>> >> + rm -f $tmp.*
>> >> +}
>> >> +
>> >> +# get standard environment, filters and checks
>> >> +. ./common/rc
>> >> +. ./common/filter
>> >> +. ./common/reflink
>> >> +
>> >> +# real QA test starts here
>> >> +_supported_fs generic
>> >> +_supported_os Linux
>> >> +_require_scratch_dedupe
>> >> +
>> >> +rm -f $seqres.full
>> >> +
>> >> +_scratch_mkfs >>$seqres.full 2>&1
>> >> +_scratch_mount
>> >> +
>> >> +# The first byte with a value of 0xae starts at an offset (2518890) 
>> >> which is not
>> >> +# a multiple of the block size.
>> >> +$XFS_IO_PROG -f \
>> >> + -c "pwrite -S 0x6b 0 2518890" \
>> >> + -c "pwrite -S 0xae 2518890 102398" \
>> >> + $SCRATCH_MNT/foo | _filter_xfs_io
>> >> +
>> >> +# Create a second file with a length not aligned to the block size, 
>> >> whose bytes
>> >> +# all have the value 0x6b, so that its extent(s) can be deduplicated 
>> >> with the
>> >> +# first file.
>> >> +$XFS_IO_PROG -f -c "pwrite -S 0x6b 0 557771" $SCRATCH_MNT/bar | 
>> >> _filter_xfs_io
>> >> +
>> >> +# The file is filled with bytes having the value 0x6b from offset 0 to 
>> >> offset
>> >> +# 2518889 and with the value 0xae from offset 2518890 to offset 2621287.
>> >> +echo "File content before deduplication:"
>> >> +od -t x1 $SCRATCH_MNT/foo
>> >> +
>> >> +# Now deduplicate the entire second file into a range of the first file 
>> >> that
>> >> +# also has all bytes with the value 0x6b. The destination range's end 
>> >> offset
>> >> +# must not be aligned to the block size and must be less than the offset 
>> >> of
>> >> +# the first byte with the value 0xae (byte at offset 2518890).
>> >> +$XFS_IO_PROG -c "dedupe $SCRATCH_MNT/bar 0 1957888 557771" 
>> >> $SCRATCH_MNT/foo \
>> >> + | _filter_xfs_io
>> >> +
>> >> +# The bytes in the range starting at offset 2515659 (end of the 
>> >> deduplication
>> >> +# range) and ending at offset 2519040 (start offset rounded up to the 
>> >> block
>> >> +# size) must all have the value 0xae (and not replaced with 0x00 values).
>> >
>> > This doesn't seem right to me, range [2515659, 2518890) should be 0x6b
>> > not 0xae, while range [2518890, 2519040) indeed should contain 0xae.
>>
>> Yes, indeed. My mistake (got it right in the comment before the first
>> call to "od").
>> Can you fix it up (if there's nothing else to fix), or do you need me
>> to send a new version?
>
> Sure, I can fix it on commit. But I've already pushed this week's update
> to upstream, so you won't see it until next week :)

No problem.
Thanks!

>
> Thanks,
> Eryu


Re: [PATCH] generic: test for deduplication between different files

2018-08-19 Thread Filipe Manana
On Sun, Aug 19, 2018 at 3:07 PM, Eryu Guan  wrote:
> On Fri, Aug 17, 2018 at 09:39:24AM +0100, fdman...@kernel.org wrote:
>> From: Filipe Manana 
>>
>> Test that deduplication of an entire file that has a size that is not
>> aligned to the filesystem's block size into a different file does not
>> corrupt the destination's file data.
>>
>> This test is motivated by a bug found in Btrfs which is fixed by the
>> following patch for the linux kernel:
>>
>>   "Btrfs: fix data corruption when deduplicating between different files"
>>
>> XFS also fails this test, at least as of linux kernel 4.18-rc7, exactly
>> with the same corruption as in Btrfs - some bytes of a block get replaced
>> with zeroes after the deduplication.
>>
>> Signed-off-by: Filipe Manana 
>> ---
>>  tests/generic/505 | 84 
>> +++
>>  tests/generic/505.out | 33 
>>  tests/generic/group   |  1 +
>>  3 files changed, 118 insertions(+)
>>  create mode 100755 tests/generic/505
>>  create mode 100644 tests/generic/505.out
>>
>> diff --git a/tests/generic/505 b/tests/generic/505
>> new file mode 100755
>> index ..5ee232a2
>> --- /dev/null
>> +++ b/tests/generic/505
>> @@ -0,0 +1,84 @@
>> +#! /bin/bash
>> +# SPDX-License-Identifier: GPL-2.0
>> +# Copyright (C) 2018 SUSE Linux Products GmbH. All Rights Reserved.
>> +#
>> +# FS QA Test No. 505
>> +#
>> +# Test that deduplication of an entire file that has a size that is not 
>> aligned
>> +# to the filesystem's block size into a different file does not corrupt the
>> +# destination's file data.
>> +#
>> +seq=`basename $0`
>> +seqres=$RESULT_DIR/$seq
>> +echo "QA output created by $seq"
>> +tmp=/tmp/$$
>> +status=1 # failure is the default!
>> +trap "_cleanup; exit \$status" 0 1 2 3 15
>> +
>> +_cleanup()
>> +{
>> + cd /
>> + rm -f $tmp.*
>> +}
>> +
>> +# get standard environment, filters and checks
>> +. ./common/rc
>> +. ./common/filter
>> +. ./common/reflink
>> +
>> +# real QA test starts here
>> +_supported_fs generic
>> +_supported_os Linux
>> +_require_scratch_dedupe
>> +
>> +rm -f $seqres.full
>> +
>> +_scratch_mkfs >>$seqres.full 2>&1
>> +_scratch_mount
>> +
>> +# The first byte with a value of 0xae starts at an offset (2518890) which 
>> is not
>> +# a multiple of the block size.
>> +$XFS_IO_PROG -f \
>> + -c "pwrite -S 0x6b 0 2518890" \
>> + -c "pwrite -S 0xae 2518890 102398" \
>> + $SCRATCH_MNT/foo | _filter_xfs_io
>> +
>> +# Create a second file with a length not aligned to the block size, whose 
>> bytes
>> +# all have the value 0x6b, so that its extent(s) can be deduplicated with 
>> the
>> +# first file.
>> +$XFS_IO_PROG -f -c "pwrite -S 0x6b 0 557771" $SCRATCH_MNT/bar | 
>> _filter_xfs_io
>> +
>> +# The file is filled with bytes having the value 0x6b from offset 0 to 
>> offset
>> +# 2518889 and with the value 0xae from offset 2518890 to offset 2621287.
>> +echo "File content before deduplication:"
>> +od -t x1 $SCRATCH_MNT/foo
>> +
>> +# Now deduplicate the entire second file into a range of the first file that
>> +# also has all bytes with the value 0x6b. The destination range's end offset
>> +# must not be aligned to the block size and must be less than the offset of
>> +# the first byte with the value 0xae (byte at offset 2518890).
>> +$XFS_IO_PROG -c "dedupe $SCRATCH_MNT/bar 0 1957888 557771" $SCRATCH_MNT/foo 
>> \
>> + | _filter_xfs_io
>> +
>> +# The bytes in the range starting at offset 2515659 (end of the 
>> deduplication
>> +# range) and ending at offset 2519040 (start offset rounded up to the block
>> +# size) must all have the value 0xae (and not replaced with 0x00 values).
>
> This doesn't seem right to me, range [2515659, 2518890) should be 0x6b
> not 0xae, while range [2518890, 2519040) indeed should contain 0xae.

Yes, indeed. My mistake (got it right in the comment before the first
call to "od").
Can you fix it up (if there's nothing else to fix), or do you need me
to send a new version?

Thanks!
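
(Spelling the arithmetic out, assuming a 4096-byte block size: the
dedupe destination range ends at 1957888 + 557771 = 2515659; rounding
that up to the block size gives 2519040; and the first 0xae byte is at
offset 2518890. So [2515659, 2518890) must still contain 0x6b and
[2518890, 2519040) must still contain 0xae, which is what the corrected
comment says.)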

>
> Thanks,
> Eryu
>
>> +# In other words, we should have exactly the same data we had before we 
>> asked
>> +# for deduplication.
>> +echo "File content after deduplication and before unmounting:"

Re: [PATCH 2/2] Btrfs: sync log after logging new name

2018-08-15 Thread Filipe Manana
On Tue, Aug 14, 2018 at 11:53 PM, David Sterba  wrote:
> On Tue, Aug 14, 2018 at 12:04:05PM -0700, Omar Sandoval wrote:
>> On Mon, Jun 18, 2018 at 01:06:16PM +0200, David Sterba wrote:
>> > On Fri, Jun 15, 2018 at 05:19:07PM +0100, Filipe Manana wrote:
>> > > On Fri, Jun 15, 2018 at 4:54 PM, David Sterba  wrote:
>> > > > On Mon, Jun 11, 2018 at 07:24:28PM +0100, fdman...@kernel.org wrote:
>> > > >> From: Filipe Manana 
>> > > >> Fixes: 12fcfd22fe5b ("Btrfs: tree logging unlink/rename fixes")
>> > > >> Reported-by: Vijay Chidambaram 
>> > > >> Signed-off-by: Filipe Manana 
>> > > >
>> > > > There are some warnings and possible lock up caused by this patch, the
>> > > > 1/2 alone is ok but 1/2 + 2/2 leads to the following warnings. I 
>> > > > checked
>> > > > twice, the patch base was the pull request ie. without any other 4.18
>> > > > stuff.
>> > >
>> > > Are you sure it's this patch?
>> > > On top of for-4.18 it didn't cause any problems here, plus the trace
>> > > below has nothing to do with renames, hard links or fsync at all -
>> > > everything seems stuck on waiting for IO from dev replace.
>> >
>> > It was a false alert, sorry. Strange that the warnings appeared only in
>> > the VM running both patches and not otherwise.
>> >
>> > Though the test did not directly use rename, the possible error scenario
>> > I had in mind was some leftover from locking, error handling or state
>> > that blocked umount of 011.
>>
>> Dave, are you sending this in for 4.19? I don't see it in your first
>> pull request.

In another thread, related to the first patch in the series IIRC, I
specifically asked not to merge it.
That's because twice (over a period of nearly 2 months now) I ran into a
hang which could be caused by this patch. The traces were weird and
contained only inexact lines showing the transaction kthread waiting
forever on a transaction commit.

I recently found that I have hardware problems that were causing
issues with qemu (stalls, occasional crashes), so I'm hoping that's the
cause, but I still need to verify with long stress tests on good
hardware.

I don't mind getting it into linux-next in the meantime, but for 4.19 I
would prefer not to include it yet.

>
> Will send it in 2nd pull for 4.19. The patch is 2 months old and I don't
> remember where it was lost on the way. I had some suspicions but turned
> out to be false. Thanks for the reminder.


Re: [PATCH] fstests: btrfs: Add test for corrupted orphan qgroup numbers

2018-08-10 Thread Filipe Manana
On Fri, Aug 10, 2018 at 9:46 AM, Qu Wenruo  wrote:
>
>
> On 8/9/18 5:26 PM, Filipe Manana wrote:
>> On Thu, Aug 9, 2018 at 8:45 AM, Qu Wenruo  wrote:
>>> This bug is exposed by populating a high level qgroup, and then make it
>>> orphan (high level qgroup without child)
>>
>> Same comment as in the kernel patch:
>>
>> "That sentence is confusing. An orphan, by definition [1], is someone
>> (or something in this case) without parents.
>> But you mention a group without children, so that should be named
>> "childless" or simply say "without children".
>> So one part of the sentence is wrong, either what is in parenthesis or
>> what comes before them.
>>
>> [1] https://www.thefreedictionary.com/orphan
>> "
>>
>>> with old qgroup numbers, and
>>> finally do rescan.
>>>
>>> Normally rescan should zero out all qgroups' accounting number, but due
>>> to a kernel bug which won't mark orphan qgroups dirty, their on-disk
>>> data is not updated, thus old numbers remain and cause qgroup
>>> corruption.
>>>
>>> Fixed by the following kernel patch:
>>> "btrfs: qgroup: Dirty all qgroups before rescan"
>>>
>>> Reported-by: Misono Tomohiro 
>>> Signed-off-by: Qu Wenruo 
>>> ---
>>>  tests/btrfs/170 | 82 +
>>>  tests/btrfs/170.out |  3 ++
>>>  tests/btrfs/group   |  1 +
>>>  3 files changed, 86 insertions(+)
>>>  create mode 100755 tests/btrfs/170
>>>  create mode 100644 tests/btrfs/170.out
>>>
>>> diff --git a/tests/btrfs/170 b/tests/btrfs/170
>>> new file mode 100755
>>> index ..bcf8b5c0e4f3
>>> --- /dev/null
>>> +++ b/tests/btrfs/170
>>> @@ -0,0 +1,82 @@
>>> +#! /bin/bash
>>> +# SPDX-License-Identifier: GPL-2.0
>>> +# Copyright (c) 2018 SUSE Linux Products GmbH.  All Rights Reserved.
>>> +#
>>> +# FS QA Test 170
>>> +#
>>> +# Test if btrfs can clear orphan (high level qgroup without child) qgroup's
>>> +# accounting numbers during rescan.
>>> +# Fixed by the following kernel patch:
>>> +# "btrfs: qgroup: Dirty all qgroups before rescan"
>>> +#
>>> +seq=`basename $0`
>>> +seqres=$RESULT_DIR/$seq
>>> +echo "QA output created by $seq"
>>> +
>>> +here=`pwd`
>>> +tmp=/tmp/$$
>>> +status=1   # failure is the default!
>>> +trap "_cleanup; exit \$status" 0 1 2 3 15
>>> +
>>> +_cleanup()
>>> +{
>>> +   cd /
>>> +   rm -f $tmp.*
>>> +}
>>> +
>>> +# get standard environment, filters and checks
>>> +. ./common/rc
>>> +. ./common/filter
>>> +
>>> +# remove previous $seqres.full before test
>>> +rm -f $seqres.full
>>> +
>>> +# real QA test starts here
>>> +
>>> +# Modify as appropriate.
>>> +_supported_fs btrfs
>>> +_supported_os Linux
>>> +_require_scratch
>>> +
>>> +_scratch_mkfs > /dev/null 2>&1
>>> +_scratch_mount
>>> +
>>> +
>>> +# Populate the fs
>>> +_run_btrfs_util_prog subvolume create "$SCRATCH_MNT/subvol"
>>> +_pwrite_byte 0xcdcd 0 1M "$SCRATCH_MNT/subvol/file1" | _filter_xfs_io > 
>>> /dev/null
>>> +
>>> +# Ensure that file reach disk, so it will also appear in snapshot
>>
>> # Ensure that buffered file data is persisted, so we won't have an
>> empty file in the snapshot.
>>> +sync
>>> +_run_btrfs_util_prog subvolume snapshot "$SCRATCH_MNT/subvol" 
>>> "$SCRATCH_MNT/snapshot"
>>> +
>>> +
>>> +_run_btrfs_util_prog quota enable "$SCRATCH_MNT"
>>> +_run_btrfs_util_prog quota rescan -w "$SCRATCH_MNT"
>>> +
>>> +# Create high level qgroup
>>> +_run_btrfs_util_prog qgroup create 1/0 "$SCRATCH_MNT"
>>> +
>>> +# Don't use _run_btrfs_util_prog here, as it can return 1 to info user
>>> +# that qgroup is marked inconsistent, this is a bug in btrfs-progs, but
>>> +# to ensure it will work, we just ignore the return value.
>>
>> Comment should go away IMHO. The preferred way is to call
>> $BTRFS_UTIL_PROG and have failures noticed
>> through differences in the golden output. There's no point in
>> mentioning something that currently doesn't work
>&

Re: [PATCH v2] fstests: btrfs: Add test for corrupted childless qgroup numbers

2018-08-10 Thread Filipe Manana
On Fri, Aug 10, 2018 at 3:20 AM, Qu Wenruo  wrote:
> This bug is exposed by populating a high level qgroup, and then make it
> childless with old qgroup numbers, and finally do rescan.
>
> Normally rescan should zero out all qgroups' accounting number, but due
> to a kernel bug which won't mark childless qgroups dirty, their on-disk
> data is never updated, thus old numbers remain and cause qgroup
> corruption.
>
> Fixed by the following kernel patch:
> "btrfs: qgroup: Dirty all qgroups before rescan"
>
> Reported-by: Misono Tomohiro 
> Signed-off-by: Qu Wenruo 
> ---
> changelog:
> v2:
>   Change the adjective for the offending group, from "orphan" to
>   "childless"

All the previous comments still apply, as they weren't addressed; for
example, using $BTRFS_UTIL_PROG instead of _run_btrfs_util_prog.
Thanks.
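
(For instance, a direct call such as
$BTRFS_UTIL_PROG quota rescan -w "$SCRATCH_MNT" >> $seqres.full
instead of the _run_btrfs_util_prog wrapper lets any unexpected error
text surface as a golden output difference.)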

> ---
>  tests/btrfs/170 | 83 +
>  tests/btrfs/170.out |  3 ++
>  tests/btrfs/group   |  1 +
>  3 files changed, 87 insertions(+)
>  create mode 100755 tests/btrfs/170
>  create mode 100644 tests/btrfs/170.out
>
> diff --git a/tests/btrfs/170 b/tests/btrfs/170
> new file mode 100755
> index ..3a810e80562f
> --- /dev/null
> +++ b/tests/btrfs/170
> @@ -0,0 +1,83 @@
> +#! /bin/bash
> +# SPDX-License-Identifier: GPL-2.0
> +# Copyright (c) 2018 SUSE Linux Products GmbH.  All Rights Reserved.
> +#
> +# FS QA Test 170
> +#
> +# Test if btrfs can clear high level childless qgroup's accounting numbers
> +# during rescan.
> +#
> +# Fixed by the following kernel patch:
> +# "btrfs: qgroup: Dirty all qgroups before rescan"
> +#
> +seq=`basename $0`
> +seqres=$RESULT_DIR/$seq
> +echo "QA output created by $seq"
> +
> +here=`pwd`
> +tmp=/tmp/$$
> +status=1   # failure is the default!
> +trap "_cleanup; exit \$status" 0 1 2 3 15
> +
> +_cleanup()
> +{
> +   cd /
> +   rm -f $tmp.*
> +}
> +
> +# get standard environment, filters and checks
> +. ./common/rc
> +. ./common/filter
> +
> +# remove previous $seqres.full before test
> +rm -f $seqres.full
> +
> +# real QA test starts here
> +
> +# Modify as appropriate.
> +_supported_fs btrfs
> +_supported_os Linux
> +_require_scratch
> +
> +_scratch_mkfs > /dev/null 2>&1
> +_scratch_mount
> +
> +
> +# Populate the fs
> +_run_btrfs_util_prog subvolume create "$SCRATCH_MNT/subvol"
> +_pwrite_byte 0xcdcd 0 1M "$SCRATCH_MNT/subvol/file1" | _filter_xfs_io > 
> /dev/null
> +
> +# Ensure that file reach disk, so it will also appear in snapshot
> +sync
> +_run_btrfs_util_prog subvolume snapshot "$SCRATCH_MNT/subvol" 
> "$SCRATCH_MNT/snapshot"
> +
> +
> +_run_btrfs_util_prog quota enable "$SCRATCH_MNT"
> +_run_btrfs_util_prog quota rescan -w "$SCRATCH_MNT"
> +
> +# Create high level qgroup
> +_run_btrfs_util_prog qgroup create 1/0 "$SCRATCH_MNT"
> +
> +# Don't use _run_btrfs_util_prog here, as it can return 1 to info user
> +# that qgroup is marked inconsistent, this is a bug in btrfs-progs, but
> +# to ensure it will work, we just ignore the return value.
> +$BTRFS_UTIL_PROG qgroup assign "$SCRATCH_MNT/snapshot" 1/0 "$SCRATCH_MNT"
> +
> +# Above assign will mark qgroup inconsistent due to the shared extents
> +# between subvol/snapshot/high level qgroup, do rescan here
> +_run_btrfs_util_prog quota rescan -w "$SCRATCH_MNT"
> +
> +# Now remove the qgroup relationship and make 1/0 childless
> +# Due to the shared extent outside of 1/0, we will mark qgroup inconsistent
> +# and keep the number of qgroup 1/0
> +$BTRFS_UTIL_PROG qgroup remove "$SCRATCH_MNT/snapshot" 1/0 "$SCRATCH_MNT"
> +
> +# Above removal also marks qgroup inconsistent, rescan again
> +_run_btrfs_util_prog quota rescan -w "$SCRATCH_MNT"
> +
> +# After the test, btrfs check will verify qgroup numbers to catch any
> +# corruption.
> +
> +# success, all done
> +status=0
> +exit
> diff --git a/tests/btrfs/170.out b/tests/btrfs/170.out
> new file mode 100644
> index ..9002199e48ed
> --- /dev/null
> +++ b/tests/btrfs/170.out
> @@ -0,0 +1,3 @@
> +QA output created by 170
> +WARNING: quotas may be inconsistent, rescan needed
> +WARNING: quotas may be inconsistent, rescan needed
> diff --git a/tests/btrfs/group b/tests/btrfs/group
> index b616c73d09bf..339c977135c0 100644
> --- a/tests/btrfs/group
> +++ b/tests/btrfs/group
> @@ -172,3 +172,4 @@
>  167 auto quick replace volume
>  168 auto quick send
>  169 auto quick send
> +170 auto quick qgroup
> --
> 2.18.0
>



-- 
Filipe David Manana,

“Whether you think you can, or you think you can't — you're right.”


Re: [PATCH] fstests: btrfs: Add test for corrupted orphan qgroup numbers

2018-08-09 Thread Filipe Manana
On Thu, Aug 9, 2018 at 8:45 AM, Qu Wenruo  wrote:
> This bug is exposed by populating a high level qgroup, and then make it
> orphan (high level qgroup without child)

Same comment as in the kernel patch:

"That sentence is confusing. An orphan, by definition [1], is someone
(or something in this case) without parents.
But you mention a group without children, so that should be named
"childless" or simply say "without children".
So one part of the sentence is wrong, either what is in parenthesis or
what comes before them.

[1] https://www.thefreedictionary.com/orphan
"

> with old qgroup numbers, and
> finally do rescan.
>
> Normally rescan should zero out all qgroups' accounting number, but due
> to a kernel bug which won't mark orphan qgroups dirty, their on-disk
> data is not updated, thus old numbers remain and cause qgroup
> corruption.
>
> Fixed by the following kernel patch:
> "btrfs: qgroup: Dirty all qgroups before rescan"
>
> Reported-by: Misono Tomohiro 
> Signed-off-by: Qu Wenruo 
> ---
>  tests/btrfs/170 | 82 +
>  tests/btrfs/170.out |  3 ++
>  tests/btrfs/group   |  1 +
>  3 files changed, 86 insertions(+)
>  create mode 100755 tests/btrfs/170
>  create mode 100644 tests/btrfs/170.out
>
> diff --git a/tests/btrfs/170 b/tests/btrfs/170
> new file mode 100755
> index ..bcf8b5c0e4f3
> --- /dev/null
> +++ b/tests/btrfs/170
> @@ -0,0 +1,82 @@
> +#! /bin/bash
> +# SPDX-License-Identifier: GPL-2.0
> +# Copyright (c) 2018 SUSE Linux Products GmbH.  All Rights Reserved.
> +#
> +# FS QA Test 170
> +#
> +# Test if btrfs can clear orphan (high level qgroup without child) qgroup's
> +# accounting numbers during rescan.
> +# Fixed by the following kernel patch:
> +# "btrfs: qgroup: Dirty all qgroups before rescan"
> +#
> +seq=`basename $0`
> +seqres=$RESULT_DIR/$seq
> +echo "QA output created by $seq"
> +
> +here=`pwd`
> +tmp=/tmp/$$
> +status=1   # failure is the default!
> +trap "_cleanup; exit \$status" 0 1 2 3 15
> +
> +_cleanup()
> +{
> +   cd /
> +   rm -f $tmp.*
> +}
> +
> +# get standard environment, filters and checks
> +. ./common/rc
> +. ./common/filter
> +
> +# remove previous $seqres.full before test
> +rm -f $seqres.full
> +
> +# real QA test starts here
> +
> +# Modify as appropriate.
> +_supported_fs btrfs
> +_supported_os Linux
> +_require_scratch
> +
> +_scratch_mkfs > /dev/null 2>&1
> +_scratch_mount
> +
> +
> +# Populate the fs
> +_run_btrfs_util_prog subvolume create "$SCRATCH_MNT/subvol"
> +_pwrite_byte 0xcdcd 0 1M "$SCRATCH_MNT/subvol/file1" | _filter_xfs_io > 
> /dev/null
> +
> +# Ensure that file reach disk, so it will also appear in snapshot

# Ensure that buffered file data is persisted, so we won't have an
empty file in the snapshot.
> +sync
> +_run_btrfs_util_prog subvolume snapshot "$SCRATCH_MNT/subvol" 
> "$SCRATCH_MNT/snapshot"
> +
> +
> +_run_btrfs_util_prog quota enable "$SCRATCH_MNT"
> +_run_btrfs_util_prog quota rescan -w "$SCRATCH_MNT"
> +
> +# Create high level qgroup
> +_run_btrfs_util_prog qgroup create 1/0 "$SCRATCH_MNT"
> +
> +# Don't use _run_btrfs_util_prog here, as it can return 1 to info user
> +# that qgroup is marked inconsistent, this is a bug in btrfs-progs, but
> +# to ensure it will work, we just ignore the return value.

Comment should go away IMHO. The preferred way is to call
$BTRFS_UTIL_PROG and have failures noticed
through differences in the golden output. There's no point in
mentioning something that currently doesn't work
if it's not used here.

> +$BTRFS_UTIL_PROG qgroup assign "$SCRATCH_MNT/snapshot" 1/0 "$SCRATCH_MNT"
> +
> +# Above assign will mark qgroup inconsistent due to the shared extents

assign -> assignment

> +# between subvol/snapshot/high level qgroup, do rescan here
> +_run_btrfs_util_prog quota rescan -w "$SCRATCH_MNT"

Use $BTRFS_UTIL_PROG directly instead, and adjust the golden output if needed.

> +
> +# Now remove the qgroup relationship and make 1/0 orphan
> +# Due to the shared extent outside of 1/0, we will mark qgroup inconsistent
> +# and keep the number of qgroup 1/0

Missing "." at the end of the sentences.

> +$BTRFS_UTIL_PROG qgroup remove "$SCRATCH_MNT/snapshot" 1/0 "$SCRATCH_MNT"
> +
> +# Above removal also marks qgroup inconsistent, rescan again
> +_run_btrfs_util_prog quota rescan -w "$SCRATCH_MNT"

Use $BTRFS_UTIL_PROG directly instead, and adjust the golden output if needed.

Thanks.

> +
> +# After the test, btrfs check will verify qgroup numbers to catch any
> +# corruption.
> +
> +# success, all done
> +status=0
> +exit
> diff --git a/tests/btrfs/170.out b/tests/btrfs/170.out
> new file mode 100644
> index ..9002199e48ed
> --- /dev/null
> +++ b/tests/btrfs/170.out
> @@ -0,0 +1,3 @@
> +QA output created by 170
> +WARNING: quotas may be inconsistent, rescan needed
> +WARNING: quotas may be inconsistent, rescan needed
> diff --git a/tests/btrfs/group b/tests/btrfs/group
> index b616c73d09bf..339c977135c0 100644
> --- 

Re: [PATCH v2] btrfs: qgroup: Dirty all qgroups before rescan

2018-08-09 Thread Filipe Manana
On Thu, Aug 9, 2018 at 8:08 AM, Qu Wenruo  wrote:
> [BUG]
> In the following case, rescan won't zero out the number of qgroup 1/0:
> --
> $ mkfs.btrfs -fq $DEV
> $ mount $DEV /mnt
>
> $ btrfs quota enable /mnt
> $ btrfs qgroup create 1/0 /mnt
> $ btrfs sub create /mnt/sub
> $ btrfs qgroup assign 0/257 1/0 /mnt
>
> $ dd if=/dev/urandom of=/mnt/sub/file bs=1k count=1000
> $ btrfs sub snap /mnt/sub /mnt/snap
> $ btrfs quota rescan -w /mnt
> $ btrfs qgroup show -pcre /mnt
> qgroupid         rfer         excl     max_rfer     max_excl parent  child
> --------         ----         ----     --------     -------- ------  -----
> 0/5          16.00KiB     16.00KiB         none         none ---     ---
> 0/257      1016.00KiB     16.00KiB         none         none 1/0     ---
> 0/258      1016.00KiB     16.00KiB         none         none ---     ---
> 1/0        1016.00KiB     16.00KiB         none         none ---     0/257
>
> so far so good, but:
>
> $ btrfs qgroup remove 0/257 1/0 /mnt
> WARNING: quotas may be inconsistent, rescan needed
> $ btrfs quota rescan -w /mnt
> $ btrfs qgroup show -pcre  /mnt
> qgroupid         rfer         excl     max_rfer     max_excl parent  child
> --------         ----         ----     --------     -------- ------  -----
> 0/5          16.00KiB     16.00KiB         none         none ---     ---
> 0/257      1016.00KiB     16.00KiB         none         none ---     ---
> 0/258      1016.00KiB     16.00KiB         none         none ---     ---
> 1/0        1016.00KiB     16.00KiB         none         none ---     ---
>            ^^^^^^^^^^  not cleared
> --
>
> [CAUSE]
> Before rescan we call qgroup_rescan_zero_tracking() to zero out all
> qgroups' accounting numbers.
>
> However we don't mark all qgroups dirty, but rely on rescan to mark
> qgroups dirty.
>
> If we have any high level qgroup but without any child (orphan group),

That sentence is confusing. An orphan, by definition [1], is someone
(or something in this case) without parents.
But you mention a group without children, so that should be named
"childless" or simply say "without children".
So one part of the sentence is wrong, either what is in parenthesis or
what comes before them.

[1] https://www.thefreedictionary.com/orphan

> it
> won't be marked dirty during rescan, since we can not reach that qgroup.
>
> This will cause QGROUP_INFO items of orphan qgroups never get updated in
> quota tree, thus their numbers will stay the same in "btrfs qgroup show"
> output.
>
> [FIX]
> Just mark all qgroups dirty in qgroup_rescan_zero_tracking(), so even we
> have orphan qgroups their QGROUP_INFO items will still get updated during
> rescan.
>
> Reported-by: Misono Tomohiro 
> Signed-off-by: Qu Wenruo 
> ---
> changelog:
> v2:
>   Fix some grammar errors in commit message.
> ---
>  fs/btrfs/qgroup.c | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
> index 48c1c3e7baf3..5a5372b33d96 100644
> --- a/fs/btrfs/qgroup.c
> +++ b/fs/btrfs/qgroup.c
> @@ -2864,6 +2864,7 @@ qgroup_rescan_zero_tracking(struct btrfs_fs_info 
> *fs_info)
> qgroup->rfer_cmpr = 0;
> qgroup->excl = 0;
> qgroup->excl_cmpr = 0;
> +   qgroup_dirty(fs_info, qgroup);
> }
> spin_unlock(&fs_info->qgroup_lock);
>  }
> --
> 2.18.0
>



-- 
Filipe David Manana,

“Whether you think you can, or you think you can't — you're right.”


Re: [PATCH] Btrfs: optimization to avoid ENOSPC for nocow writes after snapshot when low on data space

2018-08-06 Thread Filipe Manana
On Mon, Aug 6, 2018 at 3:33 AM, robbieko  wrote:
> Filipe Manana wrote on 2018-08-03 18:22:
>
>> On Fri, Aug 3, 2018 at 10:13 AM, robbieko  wrote:
>>>
>>> From: Robbie Ko 
>>>
>>> Commit e9894fd3e3b3 ("Btrfs: fix snapshot vs nocow writting")
>>> forced writeback fallback to COW when subvolume is snapshotted.
>>
>>
>> Commit e9894fd3e3b3 ("Btrfs: fix snapshot vs nocow writting") forced
>> nocow writes to fallback
>> to COW, during writeback, when a snapshot is created. This resulted in
>> writes made before creating
>> the snapshot to unexpectedly fail with ENOSPC during writeback when
>> success (0) was returned
>> to user space through the write system call.
>>
>> The steps leading to this problem are:
>>
>>>
>>> 1. When the space is full, write syscall will check if can
>>> nocow, and space reservation will not happen.
>>
>>
>> 1. When it's not possible to allocate data space for a write, the
>> buffered write path checks if
>> a NOCOW write is possible. If it is, it will not reserve space and
>> success (0) is returned to
>> user space.
>>
>>>
>>> 2. Then snapshot happens before flushing IO (running dealloc),
>>> we will increase will_be_snapshotted, and then when running
>>> dealloc we fallback to COW and fail (ENOSPC).
>>
>>
>> 2. Then when a snapshot is created, the root's will_be_snapshotted
>> atomic is incremented and writeback
>> is triggered for all inode's that belong to the root being
>> snapshotted. Incrementing that atomic forces
>> all previous writes to fallback to COW during writeback (running
>> delalloc).
>>
>> 3. This results in the writeback for the inodes to fail and therefore
>> setting the ENOSPC error in their mappings,
>> so that a subsequent fsync on them will report the error to user
>> space. So it's not a completely silent data loss
>> (since fsync will report ENOSPC) but it's a very unexpected and
>> undesirable behaviour, because if a clean
>> shutdown/unmount of the filesystem happens without previous calls to
>> fsync, it is expected to have the data
>> present in the files after mounting the filesystem again.
>>
>>>
>>> So fix this by we add a snapshot_force_cow, this is used to
>>> distinguish between write and writeback.
>>
>>
>> So fix this by adding a new atomic named snapshot_force_cow to the
>> root structure which prevents
>> this behaviour and works the following way:
>>
>>>
>>> 1. Increase will_be_snapshotted, so that write force to the cow,
>>> always need space reservation.
>>
>>
>> 1. It is incremented when we start to create a snapshot after
>> triggering writeback and
>> before waiting for writeback to finish.
>>
>>>
>>> 2. Flushing all dirty pages (running dealloc), then now writeback
>>> is still flushed in nocow mode, make sure all ditry pages that might
>>> not reserve space previously have flushed this time otherwise they
>>> will fallback to cow mode and fail due to no space.
>>
>>
>> 2. This new atomic is now what is used by writeback (running delalloc)
>> to decide whether we need to
>> fallback to COW or not. Because we incremented this new atomic after
>> triggering writeback in the snapshot
>> creation ioctl, we ensure that all buffered writes that happened
>> before snapshot creation will succeed and
>> not fallback to COW (which would make them fail with ENOSPC).
>>
>>>
>>> 3. Increase snapshot_force_cow, since all new dirty pages are
>>> guaranteed space reservation, when running dealloc we can safely
>>> fallback to COW.
>>
>>
>> 3. The existing atomic, will_be_snapshotted, is kept because it is
>> used to force new buffered writes, that
>> start after we started snapshotting, to reserve data space even when
>> NOCOW is possible.
>> This makes these writes fail early with ENOSPC when there's no
>> available space to allocate, preventing the
>> unexpected behaviour of writeback later failing with ENOSPC due to a
>> fallback to COW mode.
>>
>>>
>>> Fixes: e9894fd3e3b3 ("Btrfs: fix snapshot vs nocow writting")
>>> Signed-off-by: Robbie Ko 
>>> ---
>>>  fs/btrfs/ctree.h   |  1 +
>>>  fs/btrfs/disk-io.c |  1 +
>>>  fs/btrfs/inode.c   | 26 +-
>>>  fs/btrfs/ioctl.c   | 14 ++
>>>  4 files changed, 21 insertions(+), 21 deletions(-)

Re: [PATCH] btrfs: revert fs_devices state on error of btrfs_init_new_device()

2018-08-03 Thread Filipe Manana
On Fri, Aug 3, 2018 at 8:29 AM, Anand Jain  wrote:
>
>
> On 08/03/2018 02:36 PM, Anand Jain wrote:
>>
>>
>>
>>
>> On 07/31/2018 07:47 PM, Filipe Manana wrote:
>>>
>>> On Tue, Jul 31, 2018 at 11:12 AM, Anand Jain 
>>> wrote:
>>>>
>>>>
>>>>
>>>> On 07/27/2018 08:04 AM, Naohiro Aota wrote:
>>>>>
>>>>>
>>>>> When btrfs hits error after modifying fs_devices in
>>>>> btrfs_init_new_device() (such as btrfs_add_dev_item() returns error),
>>>>> it
>>>>> leaves everything as is, but frees allocated btrfs_device. As a result,
>>>>> fs_devices->devices and fs_devices->alloc_list contain already freed
>>>>> btrfs_device, leading to later use-after-free bug.
>>>>
>>>>
>>>>
>>>>   the undo part of the btrfs_init_new_device() is broken for a while
>>>> now.
>>>>   Thanks for the fix, but..
>>>>
>>>>- this patch does not fix the seed device context, its ok to fix that
>>>>  in a separate patch though.
>>>>- and does not undo the effect of
>>>>
>>>> -
>>>>  if (!blk_queue_nonrot(q))
>>>>  fs_info->fs_devices->rotating = 1
>>>> ::
>>>>  btrfs_clear_space_info_full(fs_info);
>>>> 
>>>>   which I think should be handled as part of this patch.
>>>
>>>
>>> Doesn't matter, the filesystem was turned to RO mode (transaction
>>> aborted).
>>
>>
>> . That's not true in all cases. Filesystem can still be in the RW

Yes, if nothing was done yet in the transaction (exactly what happens
in your test), in which case there's no risk of leaving inconsistent
metadata on disk.

Space info being full is rather rare; further, setting it to full only
makes the next allocation attempt do some work looking for space
instead of returning ENOSPC immediately.
Setting the rotating flag currently has no effect for a mounted filesystem.

That is, there are no problems.

>
>
>typo I mean not true in some cases and FS can still be RW able
>after the transaction abort, below is a test case and results.
>
> Thanks. Aannd
>
>
>>mode after the transaction aborted. Tested with the following
>>simulation.
>>
>> --
>> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
>> index f46af7928963..5609d70b4372 100644
>> --- a/fs/btrfs/volumes.c
>> +++ b/fs/btrfs/volumes.c
>> @@ -2458,6 +2458,10 @@ int btrfs_init_new_device(struct btrfs_fs_info
>> *fs_info, const char *device_path
>>  }
>>  }
>>
>> +   ret = -ENOMEM;
>> +   btrfs_abort_transaction(trans, ret);
>> +   goto error_sysfs;
>> +
>>  ret = btrfs_add_dev_item(trans, device);
>>  if (ret) {
>>  btrfs_abort_transaction(trans, ret);
>> ---
>>
>>
>> # mount /dev/sdb /btrfs
>>
>> # btrfs dev add /dev/sdc /btrfs
>> ERROR: error adding device '/dev/sdc': Cannot allocate memory
>>
>> # cat /proc/self/mounts | grep btrfs
>> /dev/sdb /btrfs btrfs rw,relatime,space_cache,subvolid=5,subvol=/ 0 0
>>
>> # echo "test" > /btrfs/tf; echo $?
>> 0
>>
>> . In any case, I would rather put the things right even if it just
>>   theoretical. A core dump taken after this would indicate a wrong
>>   state of the space and fs_devices::rotating.
>>
>>
>> Thanks, Anand
>>
>>>>
>>>> Thanks, Anand
>>>>
>>>>
>>>>
>>>>> Error path also messes the things like ->num_devices. While they go
>>>>> backs
>>>>> to the original value by unscanning btrfs devices, it is safe to revert
>>>>> them here.
>>>>>
>>>>> Fixes: 79787eaab461 ("btrfs: replace many BUG_ONs with proper error
>>>>> handling")
>>>>> Signed-off-by: Naohiro Aota 
>>>>> ---
>>>>>fs/btrfs/volumes.c | 28 +++-
>>>>>1 file changed, 23 insertions(+), 5 deletions(-)
>>>>>
>>>>>This patch applies on master, but not on kdave/for-next because of
>>>>>74b9f4e186eb ("btrfs: declare fs_devices in
>>>>> btrfs_init_new_device()")
>>>>>
>>>>> diff --git a/fs/btrfs/vo

Re: [PATCH] Btrfs: optimization to avoid ENOSPC for nocow writes after snapshot when low on data space

2018-08-03 Thread Filipe Manana
   cur_offset = extent_end;
>
> /*
> diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
> index b077544..42af06b 100644
> --- a/fs/btrfs/ioctl.c
> +++ b/fs/btrfs/ioctl.c
> @@ -761,6 +761,7 @@ static int create_snapshot(struct btrfs_root *root, 
> struct inode *dir,
> struct btrfs_pending_snapshot *pending_snapshot;
> struct btrfs_trans_handle *trans;
> int ret;
> +   bool snapshot_force_cow = false;
>
> if (!test_bit(BTRFS_ROOT_REF_COWS, &root->state))
> return -EINVAL;
> @@ -777,6 +778,10 @@ static int create_snapshot(struct btrfs_root *root, 
> struct inode *dir,
> goto free_pending;
> }
>
> +   /*
> +* We force a new write to reserve space to
> +* avoid the space being full since they'll fallback to cow.
> +*/


Force new buffered writes to reserve space even when NOCOW is
possible. This is to avoid later writeback (running delalloc) falling
back to COW mode and unexpectedly failing with ENOSPC.


> atomic_inc(&root->will_be_snapshotted);
> smp_mb__after_atomic();
> /* wait for no snapshot writes */
> @@ -787,6 +792,13 @@ static int create_snapshot(struct btrfs_root *root, 
> struct inode *dir,
> if (ret)
> goto dec_and_free;
>
> +   /*
> +* When all previous writes are finished,
> +* we can safely convert writeback to cow.
> +*/

All previous writes have started writeback in NOCOW mode, so now we
force future writes to
fallback to COW mode during snapshot creation.


I would also change the subject from:

"Btrfs: optimization to avoid ENOSPC for nocow writes after snapshot
when low on data space"

to:

"Btrfs: fix unexpected failure of nocow buffered writes after
snapshotting when low on space"


With that you can have my:
Reviewed-by: Filipe Manana 

Thanks, great work!
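
As an aside, the combined effect of the two atomics could be summarized
with a rough sketch like the one below (simplified for illustration, not
the exact diff; the helper names here are made up):

	static bool must_reserve_data_space(struct btrfs_root *root)
	{
		/*
		 * Write syscall path: while a snapshot is being created,
		 * will_be_snapshotted forces buffered writes to reserve
		 * data space even when NOCOW would be possible, so they
		 * fail early with ENOSPC instead of failing later during
		 * writeback.
		 */
		return atomic_read(&root->will_be_snapshotted) != 0;
	}

	static bool writeback_must_cow(struct btrfs_root *root, bool nolock)
	{
		/*
		 * Writeback path (run_delalloc_nocow): only the new atomic
		 * decides whether a range written as NOCOW must now fall
		 * back to COW.
		 */
		return !nolock && atomic_read(&root->snapshot_force_cow) != 0;
	}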


> +   atomic_inc(&root->snapshot_force_cow);
> +   snapshot_force_cow = true;
> +
> btrfs_wait_ordered_extents(root, U64_MAX, 0, (u64)-1);
>
> btrfs_init_block_rsv(&pending_snapshot->block_rsv,
> @@ -851,6 +863,8 @@ static int create_snapshot(struct btrfs_root *root, 
> struct inode *dir,
>  fail:
> btrfs_subvolume_release_metadata(fs_info, 
> &pending_snapshot->block_rsv);
>  dec_and_free:
> +   if (snapshot_force_cow)
> +   atomic_dec(&root->snapshot_force_cow);
> if (atomic_dec_and_test(&root->will_be_snapshotted))
> wake_up_var(&root->will_be_snapshotted);
>  free_pending:
> --
> 1.9.1
>



-- 
Filipe David Manana,

“Whether you think you can, or you think you can't — you're right.”


Re: [PATCH] Btrfs: fix data lose with snapshot when nospace

2018-08-01 Thread Filipe Manana
On Wed, Aug 1, 2018 at 1:54 PM, Filipe Manana  wrote:
> On Wed, Aug 1, 2018 at 11:20 AM, robbieko  wrote:
>> Filipe Manana wrote on 2018-07-31 19:33:
>>
>>> On Tue, Jul 31, 2018 at 11:17 AM, robbieko  wrote:
>>>>
>>>> Filipe Manana wrote on 2018-07-30 20:34:
>>>>
>>>>> On Mon, Jul 30, 2018 at 12:28 PM, Filipe Manana 
>>>>> wrote:
>>>>>>
>>>>>>
>>>>>> On Mon, Jul 30, 2018 at 12:08 PM, Filipe Manana 
>>>>>> wrote:
>>>>>>>
>>>>>>>
>>>>>>> On Mon, Jul 30, 2018 at 11:21 AM, robbieko 
>>>>>>> wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>> From: Robbie Ko 
>>>>>>>>
>>>>>>>> Commit e9894fd3e3b3 ("Btrfs: fix snapshot vs nocow writting")
>>>>>>>> modified the nocow writeback mechanism, if you create a snapshot,
>>>>>>>> it will always switch to cow writeback.
>>>>>>>>
>>>>>>>> This will cause data loss when there is no space, because
>>>>>>>> when the space is full, the write will not reserve any space, only
>>>>>>>> check if it can be nocow write.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> This is a bit vague.
>>>>>>> You need to mention where space reservation does not happen (at the
>>>>>>> time of the write syscall) and why,
>>>>>>> and that the snapshot happens before flushing IO (running dealloc).
>>>>>>> Then when running dealloc we fallback
>>>>>>> to COW and fail.
>>>>>>>
>>>>>>> You also need to tell that although the write syscall did not return
>>>>>>> an error, the writeback will
>>>>>>> fail but a subsequent fsync on the file will return an error (ENOSPC)
>>>>>>> because the writeback set the error
>>>>>>> on the inode's mapping, so it's not completely a silent data loss, as
>>>>>>> for buffered writes there's no guarantee
>>>>>>> that if write syscall returns 0 the data will be persisted
>>>>>>> successfully (that can only be guaranteed if a subsequent
>>>>>>> fsync call returns 0).
>>>>>>>
>>>>>>>>
>>>>>>>> So fix this by first flush the nocow data, and then switch to the
>>>>>>>> cow write.
>>>>>>
>>>>>>
>>>>>>
>>>>>> I'm also not seeing how what you've done is better then we have now
>>>>>> using the root->will_be_snapshotted atomic,
>>>>>> which is essentially used the same way as the new atomic you are
>>>>>> adding, and forces the writeback code no nocow
>>>>>> writes as well.
>>>>>
>>>>>
>>>>>
>>>>> So what you have done can be made much more simple by flushing
>>>>> delalloc before incrementing root->will_be_snapshotted instead of
>>>>> after incrementing it:
>>>>>
>>>>> https://friendpaste.com/2LY9eLAR9q0RoOtRK7VYmX
>>>>
>>>>
>>>> There is no way to solve this problem in this modification.
>>>
>>>
>>> It minimizes it. It only gives better guarantees that nocow buffered
>>> writes that happened before calling the snapshot ioctl will not fall
>>> back to cow,
>>> not for the ones that happen while the call to the ioctl is happening.
>>>
>>>>
>>>> When writing and create snapshot at the same time, the write will not
>>>> reserve space,
>>>> and will not return to ENOSPC, because will_be_snapshotted is still 0.
>>>> So when writeback flush data, there will still be problems with ENOSPC.
>>>
>>>
>>> Which is precisely what I proposed does without adding a new atomic
>>> and more changes.
>>> It flushes delalloc before incrementing root->will_be_snapshotted, so
>>> that previous buffered nocow writes will not fallback to cow mode (and
>>> require data space allocation).
>>>
>>> It only leaves a very tiny and very unlikely to hit (but not
>>> impossible) time window where nocow writes will fallback
>>> to cow mo

Re: [PATCH] Btrfs: fix data lose with snapshot when nospace

2018-08-01 Thread Filipe Manana
On Wed, Aug 1, 2018 at 11:20 AM, robbieko  wrote:
> Filipe Manana wrote on 2018-07-31 19:33:
>
>> On Tue, Jul 31, 2018 at 11:17 AM, robbieko  wrote:
>>>
>>> Filipe Manana wrote on 2018-07-30 20:34:
>>>
>>>> On Mon, Jul 30, 2018 at 12:28 PM, Filipe Manana 
>>>> wrote:
>>>>>
>>>>>
>>>>> On Mon, Jul 30, 2018 at 12:08 PM, Filipe Manana 
>>>>> wrote:
>>>>>>
>>>>>>
>>>>>> On Mon, Jul 30, 2018 at 11:21 AM, robbieko 
>>>>>> wrote:
>>>>>>>
>>>>>>>
>>>>>>> From: Robbie Ko 
>>>>>>>
>>>>>>> Commit e9894fd3e3b3 ("Btrfs: fix snapshot vs nocow writting")
>>>>>>> modified the nocow writeback mechanism, if you create a snapshot,
>>>>>>> it will always switch to cow writeback.
>>>>>>>
>>>>>>> This will cause data loss when there is no space, because
>>>>>>> when the space is full, the write will not reserve any space, only
>>>>>>> check if it can be nocow write.
>>>>>>
>>>>>>
>>>>>>
>>>>>> This is a bit vague.
>>>>>> You need to mention where space reservation does not happen (at the
>>>>>> time of the write syscall) and why,
>>>>>> and that the snapshot happens before flushing IO (running dealloc).
>>>>>> Then when running dealloc we fallback
>>>>>> to COW and fail.
>>>>>>
>>>>>> You also need to tell that although the write syscall did not return
>>>>>> an error, the writeback will
>>>>>> fail but a subsequent fsync on the file will return an error (ENOSPC)
>>>>>> because the writeback set the error
>>>>>> on the inode's mapping, so it's not completely a silent data loss, as
>>>>>> for buffered writes there's no guarantee
>>>>>> that if write syscall returns 0 the data will be persisted
>>>>>> successfully (that can only be guaranteed if a subsequent
>>>>>> fsync call returns 0).
>>>>>>
>>>>>>>
>>>>>>> So fix this by first flush the nocow data, and then switch to the
>>>>>>> cow write.
>>>>>
>>>>>
>>>>>
>>>>> I'm also not seeing how what you've done is better then we have now
>>>>> using the root->will_be_snapshotted atomic,
>>>>> which is essentially used the same way as the new atomic you are
>>>>> adding, and forces the writeback code no nocow
>>>>> writes as well.
>>>>
>>>>
>>>>
>>>> So what you have done can be made much more simple by flushing
>>>> delalloc before incrementing root->will_be_snapshotted instead of
>>>> after incrementing it:
>>>>
>>>> https://friendpaste.com/2LY9eLAR9q0RoOtRK7VYmX
>>>
>>>
>>> There is no way to solve this problem in this modification.
>>
>>
>> It minimizes it. It only gives better guarantees that nocow buffered
>> writes that happened before calling the snapshot ioctl will not fall
>> back to cow,
>> not for the ones that happen while the call to the ioctl is happening.
>>
>>>
>>> When writing and create snapshot at the same time, the write will not
>>> reserve space,
>>> and will not return to ENOSPC, because will_be_snapshotted is still 0.
>>> So when writeback flush data, there will still be problems with ENOSPC.
>>
>>
>> Which is precisely what I proposed does without adding a new atomic
>> and more changes.
>> It flushes delalloc before incrementing root->will_be_snapshotted, so
>> that previous buffered nocow writes will not fallback to cow mode (and
>> require data space allocation).
>>
>> It only leaves a very tiny and very unlikely to hit (but not
>> impossible) time window where nocow writes will fall back
>> to cow mode - after calling start_delalloc_inodes() and before
>> incrementing root->will_be_snapshotted a new buffered write can come
>> in and get immediately flushed
>> because someone called fsync() on the file or the VM decided to
>> trigger writeback (due to memory pressure or some other reason).
>>
>
> It is very easy to reproduce. Not a tiny time.
> Because the time of start_d

Re: [PATCH] btrfs: revert fs_devices state on error of btrfs_init_new_device()

2018-07-31 Thread Filipe Manana
On Tue, Jul 31, 2018 at 11:12 AM, Anand Jain  wrote:
>
>
> On 07/27/2018 08:04 AM, Naohiro Aota wrote:
>>
>> When btrfs hits error after modifying fs_devices in
>> btrfs_init_new_device() (such as btrfs_add_dev_item() returns error), it
>> leaves everything as is, but frees allocated btrfs_device. As a result,
>> fs_devices->devices and fs_devices->alloc_list contain already freed
>> btrfs_device, leading to later use-after-free bug.
>
>
>  the undo part of the btrfs_init_new_device() is broken for a while now.
>  Thanks for the fix, but..
>
>   - this patch does not fix the seed device context, its ok to fix that
> in a separate patch though.
>   - and does not undo the effect of
>
> -
> if (!blk_queue_nonrot(q))
> fs_info->fs_devices->rotating = 1
> ::
> btrfs_clear_space_info_full(fs_info);
> 
>  which I think should be handled as part of this patch.

Doesn't matter, the filesystem was turned to RO mode (transaction aborted).

>
> Thanks, Anand
>
>
>
>> Error path also messes the things like ->num_devices. While they go backs
>> to the original value by unscanning btrfs devices, it is safe to revert
>> them here.
>>
>> Fixes: 79787eaab461 ("btrfs: replace many BUG_ONs with proper error
>> handling")
>> Signed-off-by: Naohiro Aota 
>> ---
>>   fs/btrfs/volumes.c | 28 +++-
>>   1 file changed, 23 insertions(+), 5 deletions(-)
>>
>>   This patch applies on master, but not on kdave/for-next because of
>>   74b9f4e186eb ("btrfs: declare fs_devices in btrfs_init_new_device()")
>>
>> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
>> index 1da162928d1a..5f0512fffa52 100644
>> --- a/fs/btrfs/volumes.c
>> +++ b/fs/btrfs/volumes.c
>> @@ -2410,7 +2410,7 @@ int btrfs_init_new_device(struct btrfs_fs_info
>> *fs_info, const char *device_path
>> struct list_head *devices;
>> struct super_block *sb = fs_info->sb;
>> struct rcu_string *name;
>> -   u64 tmp;
>> +   u64 orig_super_total_bytes, orig_super_num_devices;
>> int seeding_dev = 0;
>> int ret = 0;
>> bool unlocked = false;
>> @@ -2509,12 +2509,14 @@ int btrfs_init_new_device(struct btrfs_fs_info
>> *fs_info, const char *device_path
>> if (!blk_queue_nonrot(q))
>> fs_info->fs_devices->rotating = 1;
>>   - tmp = btrfs_super_total_bytes(fs_info->super_copy);
>> +   orig_super_total_bytes =
>> btrfs_super_total_bytes(fs_info->super_copy);
>> btrfs_set_super_total_bytes(fs_info->super_copy,
>> -   round_down(tmp + device->total_bytes,
>> fs_info->sectorsize));
>> +   round_down(orig_super_total_bytes + device->total_bytes,
>> +  fs_info->sectorsize));
>>   - tmp = btrfs_super_num_devices(fs_info->super_copy);
>> -   btrfs_set_super_num_devices(fs_info->super_copy, tmp + 1);
>> +   orig_super_num_devices =
>> btrfs_super_num_devices(fs_info->super_copy);
>> +   btrfs_set_super_num_devices(fs_info->super_copy,
>> +   orig_super_num_devices + 1);
>> /* add sysfs device entry */
>> btrfs_sysfs_add_device_link(fs_info->fs_devices, device);
>> @@ -2594,6 +2596,22 @@ int btrfs_init_new_device(struct btrfs_fs_info
>> *fs_info, const char *device_path
>> error_sysfs:
>> btrfs_sysfs_rm_device_link(fs_info->fs_devices, device);
>> +   mutex_lock(&fs_info->fs_devices->device_list_mutex);
>> +   mutex_lock(&fs_info->chunk_mutex);
>> +   list_del_rcu(&device->dev_list);
>> +   list_del(&device->dev_alloc_list);
>> +   fs_info->fs_devices->num_devices--;
>> +   fs_info->fs_devices->open_devices--;
>> +   fs_info->fs_devices->rw_devices--;
>> +   fs_info->fs_devices->total_devices--;
>> +   fs_info->fs_devices->total_rw_bytes -= device->total_bytes;
>> +   atomic64_sub(device->total_bytes, &fs_info->free_chunk_space);
>> +   btrfs_set_super_total_bytes(fs_info->super_copy,
>> +   orig_super_total_bytes);
>> +   btrfs_set_super_num_devices(fs_info->super_copy,
>> +   orig_super_num_devices);
>> +   mutex_unlock(&fs_info->chunk_mutex);
>> +   mutex_unlock(&fs_info->fs_devices->device_list_mutex);
>>   error_trans:
>> if (seeding_dev)
>> sb->s_flags |= SB_RDONLY;
>>



-- 
Filipe David Manana,

“Whether you think you can, or you think you can't — you're right.”


Re: [PATCH] Btrfs: fix data lose with snapshot when nospace

2018-07-31 Thread Filipe Manana
On Tue, Jul 31, 2018 at 11:17 AM, robbieko  wrote:
> Filipe Manana wrote on 2018-07-30 20:34:
>
>> On Mon, Jul 30, 2018 at 12:28 PM, Filipe Manana 
>> wrote:
>>>
>>> On Mon, Jul 30, 2018 at 12:08 PM, Filipe Manana 
>>> wrote:
>>>>
>>>> On Mon, Jul 30, 2018 at 11:21 AM, robbieko 
>>>> wrote:
>>>>>
>>>>> From: Robbie Ko 
>>>>>
>>>>> Commit e9894fd3e3b3 ("Btrfs: fix snapshot vs nocow writting")
>>>>> modified the nocow writeback mechanism, if you create a snapshot,
>>>>> it will always switch to cow writeback.
>>>>>
>>>>> This will cause data loss when there is no space, because
>>>>> when the space is full, the write will not reserve any space, only
>>>>> check if it can be nocow write.
>>>>
>>>>
>>>> This is a bit vague.
>>>> You need to mention where space reservation does not happen (at the
>>>> time of the write syscall) and why,
>>>> and that the snapshot happens before flushing IO (running dealloc).
>>>> Then when running dealloc we fallback
>>>> to COW and fail.
>>>>
>>>> You also need to tell that although the write syscall did not return
>>>> an error, the writeback will
>>>> fail but a subsequent fsync on the file will return an error (ENOSPC)
>>>> because the writeback set the error
>>>> on the inode's mapping, so it's not completely a silent data loss, as
>>>> for buffered writes there's no guarantee
>>>> that if write syscall returns 0 the data will be persisted
>>>> successfully (that can only be guaranteed if a subsequent
>>>> fsync call returns 0).
>>>>
>>>>>
>>>>> So fix this by first flush the nocow data, and then switch to the
>>>>> cow write.
>>>
>>>
>>> I'm also not seeing how what you've done is better then we have now
>>> using the root->will_be_snapshotted atomic,
>>> which is essentially used the same way as the new atomic you are
>>> adding, and forces the writeback code no nocow
>>> writes as well.
>>
>>
>> So what you have done can be made much more simple by flushing
>> delalloc before incrementing root->will_be_snapshotted instead of
>> after incrementing it:
>>
>> https://friendpaste.com/2LY9eLAR9q0RoOtRK7VYmX
>
> There is no way to solve this problem in this modification.

It minimizes it. It gives better guarantees that nocow buffered
writes which happened before calling the snapshot ioctl will not fall
back to cow, but not for the ones that happen while the ioctl call is
in progress.

>
> When writing and create snapshot at the same time, the write will not
> reserve space,
> and will not return to ENOSPC, because will_be_snapshotted is still 0.
> So when writeback flush data, there will still be problems with ENOSPC.

Which is precisely what I proposed does without adding a new atomic
and more changes.
It flushes delalloc before incrementing root->will_be_snapshotted, so
that previous buffered nocow writes will not fallback to cow mode (and
require data space allocation).

It only leaves a very tiny and very unlikely to hit (but not
impossible) time window where nocow writes will fall back
to cow mode - after calling start_delalloc_inodes() and before
incrementing root->will_be_snapshotted a new buffered write can come
in and get immediately flushed,
because someone called fsync() on the file or the VM decided to
trigger writeback (due to memory pressure or some other reason).

>
> The behavior I changed was to increase will_be_snapshotted first,
> so the following write must have a reserve space,
> otherwise it must be returned to ENOSPC.
> And then go to flush data and flush the diry page with nocow,
> When all the dirty pages are written back, then switch to cow mode.

And why did you write such a vague changelog? It misses all those
important and subtle details of the change.

>
>>
>> Just checked the code and failure to allocate space during writeback
>> after falling back to COW mode does indeed set
>> AS_ENOSPC on the inode's mapping, which makes fsync return ENOSPC
>> (through file_check_and_advance_wb_err()
>> and filemap_check_wb_err()).
>>
>> Since fsync reports the error, I'm unsure to call it data loss but
>> rather an optimization to avoid ENOSPC for nocow writes when running
>> low on space.
>>
>
> If you do not use fsync, you will not find the data loss.

That's one of the reasons why fsync exists.

> I think that as 

Re: [PATCH] btrfs: revert fs_devices state on error of btrfs_init_new_device()

2018-07-30 Thread Filipe Manana
On Fri, Jul 27, 2018 at 1:04 AM, Naohiro Aota  wrote:
> When btrfs hits error after modifying fs_devices in
> btrfs_init_new_device() (such as btrfs_add_dev_item() returns error), it
> leaves everything as is, but frees allocated btrfs_device. As a result,
> fs_devices->devices and fs_devices->alloc_list contain already freed
> btrfs_device, leading to later use-after-free bug.
>
> Error path also messes the things like ->num_devices. While they go backs
> to the original value by unscanning btrfs devices, it is safe to revert
> them here.
>
> Fixes: 79787eaab461 ("btrfs: replace many BUG_ONs with proper error handling")
> Signed-off-by: Naohiro Aota 

Reviewed-by: Filipe Manana 

Looks good, only fs_info->fs_devices->rotating isn't restored but
currently that causes no problems.
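
If one wanted to restore it too, a follow-up could save and put back the
flag in the same error path, along these lines (untested sketch; the
saved variable is made up for illustration):

	int orig_rotating = fs_info->fs_devices->rotating;
	...
	error_sysfs:
		btrfs_sysfs_rm_device_link(fs_info->fs_devices, device);
		mutex_lock(&fs_info->fs_devices->device_list_mutex);
		mutex_lock(&fs_info->chunk_mutex);
		/* existing reverts from the patch ... */
		fs_info->fs_devices->rotating = orig_rotating;
		mutex_unlock(&fs_info->chunk_mutex);
		mutex_unlock(&fs_info->fs_devices->device_list_mutex);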

> ---
>  fs/btrfs/volumes.c | 28 +++-
>  1 file changed, 23 insertions(+), 5 deletions(-)
>
>  This patch applies on master, but not on kdave/for-next because of
>  74b9f4e186eb ("btrfs: declare fs_devices in btrfs_init_new_device()")
>
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index 1da162928d1a..5f0512fffa52 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -2410,7 +2410,7 @@ int btrfs_init_new_device(struct btrfs_fs_info 
> *fs_info, const char *device_path
> struct list_head *devices;
> struct super_block *sb = fs_info->sb;
> struct rcu_string *name;
> -   u64 tmp;
> +   u64 orig_super_total_bytes, orig_super_num_devices;
> int seeding_dev = 0;
> int ret = 0;
> bool unlocked = false;
> @@ -2509,12 +2509,14 @@ int btrfs_init_new_device(struct btrfs_fs_info 
> *fs_info, const char *device_path
> if (!blk_queue_nonrot(q))
> fs_info->fs_devices->rotating = 1;
>
> -   tmp = btrfs_super_total_bytes(fs_info->super_copy);
> +   orig_super_total_bytes = btrfs_super_total_bytes(fs_info->super_copy);
> btrfs_set_super_total_bytes(fs_info->super_copy,
> -   round_down(tmp + device->total_bytes, fs_info->sectorsize));
> +   round_down(orig_super_total_bytes + device->total_bytes,
> +  fs_info->sectorsize));
>
> -   tmp = btrfs_super_num_devices(fs_info->super_copy);
> -   btrfs_set_super_num_devices(fs_info->super_copy, tmp + 1);
> +   orig_super_num_devices = btrfs_super_num_devices(fs_info->super_copy);
> +   btrfs_set_super_num_devices(fs_info->super_copy,
> +   orig_super_num_devices + 1);
>
> /* add sysfs device entry */
> btrfs_sysfs_add_device_link(fs_info->fs_devices, device);
> @@ -2594,6 +2596,22 @@ int btrfs_init_new_device(struct btrfs_fs_info 
> *fs_info, const char *device_path
>
>  error_sysfs:
> btrfs_sysfs_rm_device_link(fs_info->fs_devices, device);
> +   mutex_lock(&fs_info->fs_devices->device_list_mutex);
> +   mutex_lock(&fs_info->chunk_mutex);
> +   list_del_rcu(&device->dev_list);
> +   list_del(&device->dev_alloc_list);
> +   fs_info->fs_devices->num_devices--;
> +   fs_info->fs_devices->open_devices--;
> +   fs_info->fs_devices->rw_devices--;
> +   fs_info->fs_devices->total_devices--;
> +   fs_info->fs_devices->total_rw_bytes -= device->total_bytes;
> +   atomic64_sub(device->total_bytes, &fs_info->free_chunk_space);
> +   btrfs_set_super_total_bytes(fs_info->super_copy,
> +   orig_super_total_bytes);
> +   btrfs_set_super_num_devices(fs_info->super_copy,
> +   orig_super_num_devices);
> +   mutex_unlock(&fs_info->chunk_mutex);
> +   mutex_unlock(&fs_info->fs_devices->device_list_mutex);
>  error_trans:
> if (seeding_dev)
> sb->s_flags |= SB_RDONLY;
> --
> 2.18.0
>



-- 
Filipe David Manana,

“Whether you think you can, or you think you can't — you're right.”


Re: [PATCH] Btrfs: fix data lose with snapshot when nospace

2018-07-30 Thread Filipe Manana
On Mon, Jul 30, 2018 at 12:28 PM, Filipe Manana  wrote:
> On Mon, Jul 30, 2018 at 12:08 PM, Filipe Manana  wrote:
>> On Mon, Jul 30, 2018 at 11:21 AM, robbieko  wrote:
>>> From: Robbie Ko 
>>>
>>> Commit e9894fd3e3b3 ("Btrfs: fix snapshot vs nocow writting")
>>> modified the nocow writeback mechanism, if you create a snapshot,
>>> it will always switch to cow writeback.
>>>
>>> This will cause data loss when there is no space, because
>>> when the space is full, the write will not reserve any space, only
>>> check if it can be nocow write.
>>
>> This is a bit vague.
>> You need to mention where space reservation does not happen (at the
>> time of the write syscall) and why,
>> and that the snapshot happens before flushing IO (running dealloc).
>> Then when running dealloc we fallback
>> to COW and fail.
>>
>> You also need to tell that although the write syscall did not return
>> an error, the writeback will
>> fail but a subsequent fsync on the file will return an error (ENOSPC)
>> because the writeback set the error
>> on the inode's mapping, so it's not completely a silent data loss, as
>> for buffered writes there's no guarantee
>> that if write syscall returns 0 the data will be persisted
>> successfully (that can only be guaranteed if a subsequent
>> fsync call returns 0).
>>
>>>
>>> So fix this by first flush the nocow data, and then switch to the
>>> cow write.
>
> I'm also not seeing how what you've done is better then we have now
> using the root->will_be_snapshotted atomic,
> which is essentially used the same way as the new atomic you are
> adding, and forces the writeback code no nocow
> writes as well.

So what you have done can be made much more simple by flushing
delalloc before incrementing root->will_be_snapshotted instead of
after incrementing it:

https://friendpaste.com/2LY9eLAR9q0RoOtRK7VYmX
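
The contents of that paste aren't included here, but reconstructed from
the description it boils down to reordering create_snapshot() roughly
like this (sketch only; exact function names and arguments may differ
from the kernel version in question):

	/* Flush previous nocow buffered writes while they can still go
	 * through the nocow path, before anything forces them to COW. */
	ret = btrfs_start_delalloc_inodes(root);
	if (ret)
		goto free_pending;

	/* Only then force subsequent writes to reserve data space. */
	atomic_inc(&root->will_be_snapshotted);
	smp_mb__after_atomic();

	/* The rest (waiting for ordered extents, creating the pending
	 * snapshot, ...) would stay as it is today. */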

Just checked the code and failure to allocate space during writeback
after falling back to COW mode does indeed set
AS_ENOSPC on the inode's mapping, which makes fsync return ENOSPC
(through file_check_and_advance_wb_err()
and filemap_check_wb_err()).

Since fsync reports the error, I'm not sure I would call it data loss;
it's rather an optimization to avoid ENOSPC for nocow writes when
running low on space.
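
For reference, the error propagation mentioned above works roughly like
this (simplified sketch, not the exact btrfs code paths):

	/* writeback (running delalloc) fails to allocate data space: */
	mapping_set_error(inode->i_mapping, -ENOSPC);

	/* a later fsync on the file then picks the error up: */
	err = file_check_and_advance_wb_err(file);	/* returns -ENOSPC */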


>
>>
>>
>> This seems easy to reproduce using deterministic steps.
>> Can you please write a test case for fstests?
>>
>> Also the subject "Btrfs: fix data lose with snapshot when nospace",
>> besides the typo (lose -> loss), should be
>> more clear like for example "Btrfs: fix data loss for nocow writes
>> after snapshot when low on data space".
>
> Also I'm not even sure if I would call it data loss.
> If there was no error returned from a subsequent fsync, I would
> definitely call it data loss.
>
> So unless the fsync is not returning ENOSPC, I don't see anything that
> needs to be fixed.
>
>>
>> Thanks.
>>>
>>> Fixes: e9894fd3e3b3 ("Btrfs: fix snapshot vs nocow writting")
>>> Signed-off-by: Robbie Ko 
>>> ---
>>>  fs/btrfs/ctree.h   |  1 +
>>>  fs/btrfs/disk-io.c |  1 +
>>>  fs/btrfs/inode.c   | 26 +-
>>>  fs/btrfs/ioctl.c   |  6 ++
>>>  4 files changed, 13 insertions(+), 21 deletions(-)
>>>
>>> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
>>> index 118346a..663ce05 100644
>>> --- a/fs/btrfs/ctree.h
>>> +++ b/fs/btrfs/ctree.h
>>> @@ -1277,6 +1277,7 @@ struct btrfs_root {
>>> int send_in_progress;
>>> struct btrfs_subvolume_writers *subv_writers;
>>> atomic_t will_be_snapshotted;
>>> +   atomic_t snapshot_force_cow;
>>>
>>> /* For qgroup metadata reserved space */
>>> spinlock_t qgroup_meta_rsv_lock;
>>> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
>>> index 205092d..5573916 100644
>>> --- a/fs/btrfs/disk-io.c
>>> +++ b/fs/btrfs/disk-io.c
>>> @@ -1216,6 +1216,7 @@ static void __setup_root(struct btrfs_root *root, 
>>> struct btrfs_fs_info *fs_info,
>>> atomic_set(&root->log_batch, 0);
>>> refcount_set(&root->refs, 1);
>>> atomic_set(&root->will_be_snapshotted, 0);
>>> +   atomic_set(&root->snapshot_force_cow, 0);
>>> root->log_transid = 0;
>>> root->log_transid_committed = -1;
>>> root->last_log_commit = 0;
>>> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
>

Re: [PATCH] Btrfs: fix data lose with snapshot when nospace

2018-07-30 Thread Filipe Manana
On Mon, Jul 30, 2018 at 12:08 PM, Filipe Manana  wrote:
> On Mon, Jul 30, 2018 at 11:21 AM, robbieko  wrote:
>> From: Robbie Ko 
>>
>> Commit e9894fd3e3b3 ("Btrfs: fix snapshot vs nocow writting")
>> modified the nocow writeback mechanism, if you create a snapshot,
>> it will always switch to cow writeback.
>>
>> This will cause data loss when there is no space, because
>> when the space is full, the write will not reserve any space, only
>> check if it can be nocow write.
>
> This is a bit vague.
> You need to mention where space reservation does not happen (at the
> time of the write syscall) and why,
> and that the snapshot happens before flushing IO (running dealloc).
> Then when running dealloc we fallback
> to COW and fail.
>
> You also need to tell that although the write syscall did not return
> an error, the writeback will
> fail but a subsequent fsync on the file will return an error (ENOSPC)
> because the writeback set the error
> on the inode's mapping, so it's not completely a silent data loss, as
> for buffered writes there's no guarantee
> that if write syscall returns 0 the data will be persisted
> successfully (that can only be guaranteed if a subsequent
> fsync call returns 0).
>
>>
>> So fix this by first flush the nocow data, and then switch to the
>> cow write.

I'm also not seeing how what you've done is better than what we have
now using the root->will_be_snapshotted atomic,
which is essentially used the same way as the new atomic you are
adding, and forces the writeback code to not do nocow
writes as well.

>
>
> This seems easy to reproduce using deterministic steps.
> Can you please write a test case for fstests?
>
> Also the subject "Btrfs: fix data lose with snapshot when nospace",
> besides the typo (lose -> loss), should be
> more clear like for example "Btrfs: fix data loss for nocow writes
> after snapshot when low on data space".

Also I'm not even sure if I would call it data loss.
If there was no error returned from a subsequent fsync, I would
definitely call it data loss.

So unless the fsync is not returning ENOSPC, I don't see anything that
needs to be fixed.

>
> Thanks.
>>
>> Fixes: e9894fd3e3b3 ("Btrfs: fix snapshot vs nocow writting")
>> Signed-off-by: Robbie Ko 
>> ---
>>  fs/btrfs/ctree.h   |  1 +
>>  fs/btrfs/disk-io.c |  1 +
>>  fs/btrfs/inode.c   | 26 +-
>>  fs/btrfs/ioctl.c   |  6 ++
>>  4 files changed, 13 insertions(+), 21 deletions(-)
>>
>> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
>> index 118346a..663ce05 100644
>> --- a/fs/btrfs/ctree.h
>> +++ b/fs/btrfs/ctree.h
>> @@ -1277,6 +1277,7 @@ struct btrfs_root {
>> int send_in_progress;
>> struct btrfs_subvolume_writers *subv_writers;
>> atomic_t will_be_snapshotted;
>> +   atomic_t snapshot_force_cow;
>>
>> /* For qgroup metadata reserved space */
>> spinlock_t qgroup_meta_rsv_lock;
>> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
>> index 205092d..5573916 100644
>> --- a/fs/btrfs/disk-io.c
>> +++ b/fs/btrfs/disk-io.c
>> @@ -1216,6 +1216,7 @@ static void __setup_root(struct btrfs_root *root, 
>> struct btrfs_fs_info *fs_info,
>> atomic_set(&root->log_batch, 0);
>> refcount_set(&root->refs, 1);
>> atomic_set(&root->will_be_snapshotted, 0);
>> +   atomic_set(&root->snapshot_force_cow, 0);
>> root->log_transid = 0;
>> root->log_transid_committed = -1;
>> root->last_log_commit = 0;
>> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
>> index eba61bc..263b852 100644
>> --- a/fs/btrfs/inode.c
>> +++ b/fs/btrfs/inode.c
>> @@ -1275,7 +1275,7 @@ static noinline int run_delalloc_nocow(struct inode 
>> *inode,
>> u64 disk_num_bytes;
>> u64 ram_bytes;
>> int extent_type;
>> -   int ret, err;
>> +   int ret;
>> int type;
>> int nocow;
>> int check_prev = 1;
>> @@ -1407,11 +1407,9 @@ static noinline int run_delalloc_nocow(struct inode 
>> *inode,
>>  * if there are pending snapshots for this root,
>>  * we fall into common COW way.
>>  */
>> -   if (!nolock) {
>> -   err = 
>> btrfs_start_write_no_snapshotting(root);
>> -   if (!err)
>> -   goto out_check;
>> -

Re: [PATCH] Btrfs: fix data lose with snapshot when nospace

2018-07-30 Thread Filipe Manana
On Mon, Jul 30, 2018 at 11:21 AM, robbieko  wrote:
> From: Robbie Ko 
>
> Commit e9894fd3e3b3 ("Btrfs: fix snapshot vs nocow writting")
> modified the nocow writeback mechanism, if you create a snapshot,
> it will always switch to cow writeback.
>
> This will cause data loss when there is no space, because
> when the space is full, the write will not reserve any space, only
> check if it can be nocow write.

This is a bit vague.
You need to mention where space reservation does not happen (at the
time of the write syscall) and why,
and that the snapshot happens before flushing IO (running dealloc).
Then when running dealloc we fallback
to COW and fail.

You also need to tell that although the write syscall did not return
an error, the writeback will
fail but a subsequent fsync on the file will return an error (ENOSPC)
because the writeback set the error
on the inode's mapping, so it's not completely a silent data loss, as
for buffered writes there's no guarantee
that if write syscall returns 0 the data will be persisted
successfully (that can only be guaranteed if a subsequent
fsync call returns 0).

>
> So fix this by first flush the nocow data, and then switch to the
> cow write.


This seems easy to reproduce using deterministic steps.
Can you please write a test case for fstests?

Also the subject "Btrfs: fix data lose with snapshot when nospace",
besides the typo (lose -> loss), should be
more clear like for example "Btrfs: fix data loss for nocow writes
after snapshot when low on data space".

Thanks.
>
> Fixes: e9894fd3e3b3 ("Btrfs: fix snapshot vs nocow writting")
> Signed-off-by: Robbie Ko 
> ---
>  fs/btrfs/ctree.h   |  1 +
>  fs/btrfs/disk-io.c |  1 +
>  fs/btrfs/inode.c   | 26 +-
>  fs/btrfs/ioctl.c   |  6 ++
>  4 files changed, 13 insertions(+), 21 deletions(-)
>
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index 118346a..663ce05 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -1277,6 +1277,7 @@ struct btrfs_root {
> int send_in_progress;
> struct btrfs_subvolume_writers *subv_writers;
> atomic_t will_be_snapshotted;
> +   atomic_t snapshot_force_cow;
>
> /* For qgroup metadata reserved space */
> spinlock_t qgroup_meta_rsv_lock;
> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index 205092d..5573916 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -1216,6 +1216,7 @@ static void __setup_root(struct btrfs_root *root, 
> struct btrfs_fs_info *fs_info,
> atomic_set(&root->log_batch, 0);
> refcount_set(&root->refs, 1);
> atomic_set(&root->will_be_snapshotted, 0);
> +   atomic_set(&root->snapshot_force_cow, 0);
> root->log_transid = 0;
> root->log_transid_committed = -1;
> root->last_log_commit = 0;
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index eba61bc..263b852 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -1275,7 +1275,7 @@ static noinline int run_delalloc_nocow(struct inode 
> *inode,
> u64 disk_num_bytes;
> u64 ram_bytes;
> int extent_type;
> -   int ret, err;
> +   int ret;
> int type;
> int nocow;
> int check_prev = 1;
> @@ -1407,11 +1407,9 @@ static noinline int run_delalloc_nocow(struct inode 
> *inode,
>  * if there are pending snapshots for this root,
>  * we fall into common COW way.
>  */
> -   if (!nolock) {
> -   err = btrfs_start_write_no_snapshotting(root);
> -   if (!err)
> -   goto out_check;
> -   }
> +   if (!nolock &&
> +   
> unlikely(atomic_read(&root->snapshot_force_cow)))
> +   goto out_check;
> /*
>  * force cow if csum exists in the range.
>  * this ensure that csum for a given extent are
> @@ -1420,9 +1418,6 @@ static noinline int run_delalloc_nocow(struct inode 
> *inode,
> ret = csum_exist_in_range(fs_info, disk_bytenr,
>   num_bytes);
> if (ret) {
> -   if (!nolock)
> -   btrfs_end_write_no_snapshotting(root);
> -
> /*
>  * ret could be -EIO if the above fails to 
> read
>  * metadata.
> @@ -1435,11 +1430,8 @@ static noinline int run_delalloc_nocow(struct inode 
> *inode,
> WARN_ON_ONCE(nolock);
> goto out_check;
> }
> -   if (!btrfs_inc_nocow_writers(fs_info, disk_bytenr)) {
> -   if (!nolock)
> - 

Re: [PATCH] btrfs: introduce feature to forget a btrfs device

2018-07-26 Thread Filipe Manana
On Thu, Jul 26, 2018 at 12:32 PM, Anand Jain  wrote:
> Support for a new command 'btrfs dev forget [dev]' is proposed here,
> to undo the effects of 'btrfs dev scan [dev]'. For this purpose,
> this patch proposes to use ioctl #5 as it was empty.
> IOW(BTRFS_IOCTL_MAGIC, 5, ..)
> This patch adds new ioctl BTRFS_IOC_FORGET_DEV which can be sent from
> the /dev/btrfs-control to forget one or all devices, (devices which are
> not mounted) from the btrfs kernel.
>
> The argument it takes is struct btrfs_ioctl_vol_args, and ::name can be
> set to specify the device path. And all unmounted devices can be removed
> from the kernel if no device path is provided.
>
> Again, the devices are removed only if the relevant fsid aren't mounted.

And why is the feature needed? What problems does it solve?
That is missing from the changelog, no matter how obvious it is to you
(or anyone else), it should be mentioned in the changelog.

Thanks.
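
For what it's worth, from user space the proposed ioctl would be used
roughly like this (illustrative sketch only; the btrfs-progs side,
"btrfs device forget", is not part of this patch):

	#include <errno.h>
	#include <fcntl.h>
	#include <string.h>
	#include <sys/ioctl.h>
	#include <unistd.h>
	#include <linux/btrfs.h>

	/* Pass "" (empty string) to forget all unmounted scanned devices. */
	static int btrfs_forget(const char *path)
	{
		struct btrfs_ioctl_vol_args args = { 0 };
		int fd, ret;

		strncpy(args.name, path, BTRFS_PATH_NAME_MAX);
		fd = open("/dev/btrfs-control", O_RDWR);
		if (fd < 0)
			return -errno;
		ret = ioctl(fd, BTRFS_IOC_FORGET_DEV, &args);
		close(fd);
		return ret < 0 ? -errno : 0;
	}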

>
> Signed-off-by: Anand Jain 
> ---
>  fs/btrfs/super.c   | 3 +++
>  fs/btrfs/volumes.c | 9 +
>  fs/btrfs/volumes.h | 1 +
>  include/uapi/linux/btrfs.h | 2 ++
>  4 files changed, 15 insertions(+)
>
> diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
> index 67de3c0fc85b..470a32af474e 100644
> --- a/fs/btrfs/super.c
> +++ b/fs/btrfs/super.c
> @@ -2244,6 +2244,9 @@ static long btrfs_control_ioctl(struct file *file, 
> unsigned int cmd,
> ret = PTR_ERR_OR_ZERO(device);
> mutex_unlock(&uuid_mutex);
> break;
> +   case BTRFS_IOC_FORGET_DEV:
> +   ret = btrfs_forget_devices(vol->name);
> +   break;
> case BTRFS_IOC_DEVICES_READY:
> mutex_lock(&uuid_mutex);
> device = btrfs_scan_one_device(vol->name, FMODE_READ,
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index 8844904f9009..cd54a926141a 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -1208,6 +1208,15 @@ static int btrfs_read_disk_super(struct block_device 
> *bdev, u64 bytenr,
> return 0;
>  }
>
> +int btrfs_forget_devices(const char *path)
> +{
> +   mutex_lock(&uuid_mutex);
> +   btrfs_free_stale_devices(strlen(path) ? path:NULL, NULL);
> +   mutex_unlock(&uuid_mutex);
> +
> +   return 0;
> +}
> +
>  /*
>   * Look for a btrfs signature on a device. This may be called out of the 
> mount path
>   * and we are not allowed to call set_blocksize during the scan. The 
> superblock
> diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
> index 049619176831..1602b5faa7e7 100644
> --- a/fs/btrfs/volumes.h
> +++ b/fs/btrfs/volumes.h
> @@ -405,6 +405,7 @@ int btrfs_open_devices(struct btrfs_fs_devices 
> *fs_devices,
>fmode_t flags, void *holder);
>  struct btrfs_device *btrfs_scan_one_device(const char *path,
>fmode_t flags, void *holder);
> +int btrfs_forget_devices(const char *path);
>  int btrfs_close_devices(struct btrfs_fs_devices *fs_devices);
>  void btrfs_free_extra_devids(struct btrfs_fs_devices *fs_devices, int step);
>  void btrfs_assign_next_active_device(struct btrfs_device *device,
> diff --git a/include/uapi/linux/btrfs.h b/include/uapi/linux/btrfs.h
> index 5ca1d21fc4a7..b1be7f828cb4 100644
> --- a/include/uapi/linux/btrfs.h
> +++ b/include/uapi/linux/btrfs.h
> @@ -836,6 +836,8 @@ enum btrfs_err_code {
>struct btrfs_ioctl_vol_args)
>  #define BTRFS_IOC_SCAN_DEV _IOW(BTRFS_IOCTL_MAGIC, 4, \
>struct btrfs_ioctl_vol_args)
> +#define BTRFS_IOC_FORGET_DEV _IOW(BTRFS_IOCTL_MAGIC, 5, \
> +  struct btrfs_ioctl_vol_args)
>  /* trans start and trans end are dangerous, and only for
>   * use by applications that know how to avoid the
>   * resulting deadlocks
> --
> 2.7.0
>



-- 
Filipe David Manana,

“Whether you think you can, or you think you can't — you're right.”
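
For context, a minimal user-space sketch of how the proposed ioctl could be
driven through /dev/btrfs-control. This assumes a kernel carrying the patch
above; the fallback define simply mirrors the uapi hunk adding
BTRFS_IOC_FORGET_DEV and is not part of the patch itself.

/* forget.c - illustrative sketch only. Build: cc -o forget forget.c */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/btrfs.h>

#ifndef BTRFS_IOC_FORGET_DEV
/* Mirrors the hunk above: ioctl #5 of BTRFS_IOCTL_MAGIC. */
#define BTRFS_IOC_FORGET_DEV _IOW(BTRFS_IOCTL_MAGIC, 5, \
                                  struct btrfs_ioctl_vol_args)
#endif

int main(int argc, char **argv)
{
    struct btrfs_ioctl_vol_args args;
    int fd, ret;

    memset(&args, 0, sizeof(args));
    /* An empty name asks the kernel to forget all unmounted devices. */
    if (argc > 1)
        strncpy(args.name, argv[1], BTRFS_PATH_NAME_MAX);

    fd = open("/dev/btrfs-control", O_RDWR);
    if (fd < 0) {
        perror("open /dev/btrfs-control");
        return 1;
    }
    ret = ioctl(fd, BTRFS_IOC_FORGET_DEV, &args);
    if (ret < 0)
        perror("BTRFS_IOC_FORGET_DEV");
    close(fd);
    return ret < 0;
}

Run as root, optionally passing a device path, e.g. ./forget /dev/sdb1 to
forget a single unmounted device, or no argument to forget all of them.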


Re: [PATCH 17/22] btrfs: don't take the dio_sem in the fsync path

2018-07-19 Thread Filipe Manana
On Thu, Jul 19, 2018 at 4:54 PM, Josef Bacik  wrote:
> On Thu, Jul 19, 2018 at 04:21:58PM +0100, Filipe Manana wrote:
>> On Thu, Jul 19, 2018 at 3:50 PM, Josef Bacik  wrote:
>> > Since we've changed the fsync() path to always run ordered extents
>> > before doing the tree log we no longer need to take the dio_sem in the
>> > tree log path.  This gets rid of a lockdep splat that I was seeing with
>> > the AIO tests.
>>
>> So actually, we still need it (or some other means of sync).
>> Because even after the recent changes to fsync, the fast path still
>> logs extent items based on the extent maps, and the dio write path
>> creates first the extent map and then the ordered extent.
>> So the old problem can still happen between concurrent fsync and
>> lockless dio write, where fsync logs an extent item for an extent map
>> whose ordered extent we never waited for.
>> The solution prior to the introduction of dio_sem solved this - make
>> the dio write create first the ordered extent, and, only after it,
>> create the extent map.
>>
>
> Oooh balls I see.  This is still a problem even if we add the ordered extent
> first, because we can easily just start the lockless dio write after we've
> waited for ordered extents, so the order in which we create the extent map
> and ordered extent doesn't actually matter.  We still have this lockdep
> thing; I think we just move the dio_sem to the start of fsync, so if we're
> fsyncing you just don't get to do lockless dio writes while we're doing the
> fsync.  What do you think?  Thanks,

On first thought, it seems to be the simplest and sanest solution :)
Thanks.

>
> Josef



-- 
Filipe David Manana,

“Whether you think you can, or you think you can't — you're right.”
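
As a stand-alone illustration of the exclusion Josef proposes (fsync holds
the semaphore for its whole duration, so a lockless dio write cannot slip in
between the ordered-extent wait and the log), here is a small compilable
model using a pthread rwlock as a stand-in for dio_sem. The lock directions
and placement are assumptions chosen for illustration, not the kernel code.

/* Model of the proposed exclusion: "fsync" takes the write side for its
 * whole duration; "lockless dio writes" take the read side, so they cannot
 * create extent maps/ordered extents mid-log.  Purely illustrative.
 * Build: cc -o dio_model dio_model.c -pthread
 */
#include <pthread.h>
#include <stdio.h>

static pthread_rwlock_t dio_sem = PTHREAD_RWLOCK_INITIALIZER;

static void *dio_write(void *arg)
{
    (void)arg;
    pthread_rwlock_rdlock(&dio_sem);   /* blocked while an fsync runs */
    printf("dio write: set up extent map / ordered extent\n");
    pthread_rwlock_unlock(&dio_sem);
    return NULL;
}

static void fsync_path(void)
{
    pthread_rwlock_wrlock(&dio_sem);   /* taken at the start of fsync */
    printf("fsync: wait ordered extents, log extents\n");
    pthread_rwlock_unlock(&dio_sem);   /* released after the log sync */
}

int main(void)
{
    pthread_t t;

    pthread_create(&t, NULL, dio_write, NULL);
    fsync_path();
    pthread_join(&t, NULL);
    return 0;
}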


Re: [PATCH 17/22] btrfs: don't take the dio_sem in the fsync path

2018-07-19 Thread Filipe Manana
On Thu, Jul 19, 2018 at 3:50 PM, Josef Bacik  wrote:
> Since we've changed the fsync() path to always run ordered extents
> before doing the tree log we no longer need to take the dio_sem in the
> tree log path.  This gets rid of a lockdep splat that I was seeing with
> the AIO tests.

So actually, we still need it (or some other means of sync).
Because even after the recent changes to fsync, the fast path still
logs extent items based on the extent maps, and the dio write path
creates first the extent map and then the ordered extent.
So the old problem can still happen between concurrent fsync and
lockless dio write, where fsync logs an extent item for an extent map
whose ordered extent we never waited for.
The solution prior to the introduction of dio_sem solved this - make
the dio write create first the ordered extent, and, only after it,
create the extent map.

>
> Signed-off-by: Josef Bacik 
> ---
>  fs/btrfs/tree-log.c | 2 --
>  1 file changed, 2 deletions(-)
>
> diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
> index f8220ec02036..aa06e1954b84 100644
> --- a/fs/btrfs/tree-log.c
> +++ b/fs/btrfs/tree-log.c
> @@ -4439,7 +4439,6 @@ static int btrfs_log_changed_extents(struct 
> btrfs_trans_handle *trans,
>
> INIT_LIST_HEAD();
>
> -   down_write(>dio_sem);
> write_lock(>lock);
> test_gen = root->fs_info->last_trans_committed;
> logged_start = start;
> @@ -4520,7 +4519,6 @@ static int btrfs_log_changed_extents(struct 
> btrfs_trans_handle *trans,
> }
> WARN_ON(!list_empty());
> write_unlock(>lock);
> -   up_write(>dio_sem);
>
> btrfs_release_path(path);
> if (!ret)
> --
> 2.14.3
>



-- 
Filipe David Manana,

“Whether you think you can, or you think you can't — you're right.”


Re: [PATCH 17/22] btrfs: don't take the dio_sem in the fsync path

2018-07-19 Thread Filipe Manana
On Thu, Jul 19, 2018 at 3:50 PM, Josef Bacik  wrote:
> Since we've changed the fsync() path to always run ordered extents
> before doing the tree log we no longer need to take the dio_sem in the
> tree log path.  This gets rid of a lockdep splat that I was seeing with
> the AIO tests.

The dio_sem can be removed completely; it was added to synchronize
lockless dio writes with fsync to avoid races.
I.e., just removing it here in the tree-log makes it useless.

thanks!

>
> Signed-off-by: Josef Bacik 
> ---
>  fs/btrfs/tree-log.c | 2 --
>  1 file changed, 2 deletions(-)
>
> diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
> index f8220ec02036..aa06e1954b84 100644
> --- a/fs/btrfs/tree-log.c
> +++ b/fs/btrfs/tree-log.c
> @@ -4439,7 +4439,6 @@ static int btrfs_log_changed_extents(struct 
> btrfs_trans_handle *trans,
>
> INIT_LIST_HEAD();
>
> -   down_write(>dio_sem);
> write_lock(>lock);
> test_gen = root->fs_info->last_trans_committed;
> logged_start = start;
> @@ -4520,7 +4519,6 @@ static int btrfs_log_changed_extents(struct 
> btrfs_trans_handle *trans,
> }
> WARN_ON(!list_empty());
> write_unlock(>lock);
> -   up_write(>dio_sem);
>
> btrfs_release_path(path);
> if (!ret)
> --
> 2.14.3
>



-- 
Filipe David Manana,

“Whether you think you can, or you think you can't — you're right.”


Re: About hung task on generic/041

2018-07-12 Thread Filipe Manana
On Wed, Jul 11, 2018 at 10:02 AM, Lu Fengqi  wrote:
> Hi,
>
> When I run generic/041 with v4.18-rc3 (with kasan and hung task detection
> turned on), the btrfs-transaction kthread triggers the hung task timeout
> (stalled at wait_event in btrfs_commit_transaction). At the same time, you
> can see that xfs_io -c fsync occupies 100% of the CPU. I am not sure
> whether this is a problem. Any suggestions?

Well, something at 100% cpu that seems to hang forever is definitely
a problem, especially with a workload as simple as the one in generic/041
(it never happened to me, even on vanilla 4.18-rc4).
Do you have the stack trace for the fsync task? What you pasted below
is only for the transaction kthread and that alone doesn't help.

>
> [Wed Jul 11 15:50:08 2018] INFO: task btrfs-transacti:1053 blocked for more 
> than 120 seconds.
> [Wed Jul 11 15:50:08 2018]   Not tainted 4.18.0-rc3-custom #14
> [Wed Jul 11 15:50:08 2018] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" 
> disables this message.
> [Wed Jul 11 15:50:08 2018] btrfs-transacti D0  1053  2 0x8000
> [Wed Jul 11 15:50:08 2018] Call Trace:
> [Wed Jul 11 15:50:08 2018]  ? __schedule+0x5b2/0x1380
> [Wed Jul 11 15:50:08 2018]  ? check_flags.part.23+0x240/0x240
> [Wed Jul 11 15:50:08 2018]  ? firmware_map_remove+0x187/0x187
> [Wed Jul 11 15:50:08 2018]  ? ___preempt_schedule+0x16/0x18
> [Wed Jul 11 15:50:08 2018]  ? mark_held_locks+0x6e/0x90
> [Wed Jul 11 15:50:08 2018]  ? _raw_spin_unlock_irqrestore+0x59/0x70
> [Wed Jul 11 15:50:08 2018]  ? preempt_count_sub+0x14/0xc0
> [Wed Jul 11 15:50:08 2018]  ? _raw_spin_unlock_irqrestore+0x46/0x70
> [Wed Jul 11 15:50:08 2018]  ? prepare_to_wait_event+0x191/0x410
> [Wed Jul 11 15:50:08 2018]  ? prepare_to_wait_exclusive+0x210/0x210
> [Wed Jul 11 15:50:08 2018]  ? print_usage_bug+0x3a0/0x3a0
> [Wed Jul 11 15:50:08 2018]  ? do_raw_spin_unlock+0x10f/0x1e0
> [Wed Jul 11 15:50:08 2018]  ? do_raw_spin_trylock+0x120/0x120
> [Wed Jul 11 15:50:08 2018]  schedule+0xca/0x260
> [Wed Jul 11 15:50:08 2018]  ? rcu_lockdep_current_cpu_online+0x12b/0x160
> [Wed Jul 11 15:50:08 2018]  ? __schedule+0x1380/0x1380
> [Wed Jul 11 15:50:08 2018]  ? ___might_sleep+0x126/0x370
> [Wed Jul 11 15:50:08 2018]  ? init_wait_entry+0xc7/0x100
> [Wed Jul 11 15:50:08 2018]  ? __wake_up_locked_key_bookmark+0x20/0x20
> [Wed Jul 11 15:50:08 2018]  ? __btrfs_run_delayed_items+0x1e5/0x280 [btrfs]
> [Wed Jul 11 15:50:08 2018]  ? __might_sleep+0x31/0xd0
> [Wed Jul 11 15:50:08 2018]  btrfs_commit_transaction+0x122a/0x1640 [btrfs]
> [Wed Jul 11 15:50:08 2018]  ? btrfs_apply_pending_changes+0x90/0x90 [btrfs]
> [Wed Jul 11 15:50:08 2018]  ? wait_woken+0x150/0x150
> [Wed Jul 11 15:50:08 2018]  ? ret_from_fork+0x27/0x50
> [Wed Jul 11 15:50:08 2018]  ? ret_from_fork+0x27/0x50
> [Wed Jul 11 15:50:08 2018]  ? deref_stack_reg+0xe0/0xe0
> [Wed Jul 11 15:50:08 2018]  ? __module_text_address+0x63/0xa0
> [Wed Jul 11 15:50:08 2018]  ? preempt_count_sub+0x14/0xc0
> [Wed Jul 11 15:50:08 2018]  ? transaction_kthread+0x161/0x240 [btrfs]
> [Wed Jul 11 15:50:08 2018]  ? is_module_text_address+0x2b/0x50
> [Wed Jul 11 15:50:08 2018]  ? transaction_kthread+0x1d9/0x240 [btrfs]
> [Wed Jul 11 15:50:08 2018]  ? kernel_text_address+0x5a/0x100
> [Wed Jul 11 15:50:08 2018]  ? deactivate_slab.isra.27+0x64f/0x7a0
> [Wed Jul 11 15:50:08 2018]  ? __save_stack_trace+0x82/0x100
> [Wed Jul 11 15:50:08 2018]  ? kasan_kmalloc+0x142/0x170
> [Wed Jul 11 15:50:08 2018]  ? kmem_cache_alloc+0xfc/0x2e0
> [Wed Jul 11 15:50:08 2018]  ? start_transaction+0x596/0x930 [btrfs]
> [Wed Jul 11 15:50:08 2018]  ? transaction_kthread+0x1d9/0x240 [btrfs]
> [Wed Jul 11 15:50:08 2018]  ? kthread+0x1b9/0x1e0
> [Wed Jul 11 15:50:08 2018]  ? ret_from_fork+0x27/0x50
> [Wed Jul 11 15:50:08 2018]  ? deactivate_slab.isra.27+0x64f/0x7a0
> [Wed Jul 11 15:50:08 2018]  ? mark_lock+0x149/0xa80
> [Wed Jul 11 15:50:08 2018]  ? init_object+0x6b/0x80
> [Wed Jul 11 15:50:08 2018]  ? print_usage_bug+0x3a0/0x3a0
> [Wed Jul 11 15:50:08 2018]  ? ___slab_alloc+0x62a/0x690
> [Wed Jul 11 15:50:08 2018]  ? ___slab_alloc+0x62a/0x690
> [Wed Jul 11 15:50:08 2018]  ? __lock_is_held+0x8c/0xe0
> [Wed Jul 11 15:50:08 2018]  ? start_transaction+0x596/0x930 [btrfs]
> [Wed Jul 11 15:50:08 2018]  ? preempt_count_sub+0x14/0xc0
> [Wed Jul 11 15:50:08 2018]  ? rcu_lockdep_current_cpu_online+0x12b/0x160
> [Wed Jul 11 15:50:08 2018]  ? rcu_oom_callback+0x40/0x40
> [Wed Jul 11 15:50:08 2018]  ? __lock_is_held+0x8c/0xe0
> [Wed Jul 11 15:50:08 2018]  ? start_transaction+0x596/0x930 [btrfs]
> [Wed Jul 11 15:50:08 2018]  ? rcu_read_lock_sched_held+0x8f/0xa0
> [Wed Jul 11 15:50:08 2018]  ? btrfs_record_root_in_trans+0x1f/0xa0 [btrfs]
> [Wed Jul 11 15:50:08 2018]  ? start_transaction+0x26b/0x930 [btrfs]
> [Wed Jul 11 15:50:08 2018]  ? btrfs_commit_transaction+0x1640/0x1640 [btrfs]
> [Wed Jul 11 15:50:08 2018]  ? check_flags.part.23+0x240/0x240
> [Wed Jul 11 15:50:08 2018]  ? lock_downgrade+0x380/0x380
> [Wed Jul 11 15:50:08 2018]  ? 

Re: [PATCH] Btrfs: fix mount failure when qgroup rescan is in progress

2018-06-27 Thread Filipe Manana
On Wed, Jun 27, 2018 at 4:55 PM, Nikolay Borisov  wrote:
>
>
> On 27.06.2018 18:45, Filipe Manana wrote:
>> On Wed, Jun 27, 2018 at 4:44 PM, Nikolay Borisov  wrote:
>>>
>>>
>>> On 27.06.2018 02:43, fdman...@kernel.org wrote:
>>>> From: Filipe Manana 
>>>>
>>>> If a power failure happens while the qgroup rescan kthread is running,
>>>> the next mount operation will always fail. This is because of a recent
>>>> regression that makes qgroup_rescan_init() incorrectly return -EINVAL
>>>> when we are mounting the filesystem (through btrfs_read_qgroup_config()).
>>>> This causes the -EINVAL error to be returned regardless of any qgroup
>>>> flags being set instead of returning the error only when neither of
>>>> the flags BTRFS_QGROUP_STATUS_FLAG_RESCAN nor BTRFS_QGROUP_STATUS_FLAG_ON
>>>> are set.
>>>>
>>>> A test case for fstests follows up soon.
>>>>
>>>> Fixes: 9593bf49675e ("btrfs: qgroup: show more meaningful 
>>>> qgroup_rescan_init error message")
>>>> Signed-off-by: Filipe Manana 
>>>> ---
>>>>  fs/btrfs/qgroup.c | 13 ++---
>>>>  1 file changed, 10 insertions(+), 3 deletions(-)
>>>>
>>>> diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
>>>> index 1874a6d2e6f5..d4171de93087 100644
>>>> --- a/fs/btrfs/qgroup.c
>>>> +++ b/fs/btrfs/qgroup.c
>>>> @@ -2784,13 +2784,20 @@ qgroup_rescan_init(struct btrfs_fs_info *fs_info, 
>>>> u64 progress_objectid,
>>>>
>>>>   if (!init_flags) {
>>>>   /* we're resuming qgroup rescan at mount time */
>>>> - if (!(fs_info->qgroup_flags & 
>>>> BTRFS_QGROUP_STATUS_FLAG_RESCAN))
>>>> + if (!(fs_info->qgroup_flags &
>>>> +   BTRFS_QGROUP_STATUS_FLAG_RESCAN)) {
>>>>   btrfs_warn(fs_info,
>>>>   "qgroup rescan init failed, qgroup is not enabled");
>>>> - else if (!(fs_info->qgroup_flags & 
>>>> BTRFS_QGROUP_STATUS_FLAG_ON))
>>>> + ret = -EINVAL;
>>>> + } else if (!(fs_info->qgroup_flags &
>>>> +  BTRFS_QGROUP_STATUS_FLAG_ON)) {
>>>>   btrfs_warn(fs_info,
>>>>   "qgroup rescan init failed, qgroup rescan is not 
>>>> queued");
>>>> - return -EINVAL;
>>>> + ret = -EINVAL;
>>>> + }
>>>> +
>>>> + if (ret)
>>>> + return ret;
>>>
>>>
>>> How is this patch functionally different from the old code? In both
>>> cases, if either of those 2 flags is not set, a warning is printed and
>>> -EINVAL is returned?
>>
>> It is explained in the changelog:
>
> No need to be snide

No one's being snide. I simply can't see how the changelog doesn't
explain it (beyond what is already quite easy to notice from the code).

>
>>
>> "This is because of a recent
>> regression that makes qgroup_rescan_init() incorrectly return -EINVAL
>> when we are mounting the filesystem (through btrfs_read_qgroup_config()).
>> This causes the -EINVAL error to be returned regardless of any qgroup
>> flags being set instead of returning the error only when neither of
>> the flags BTRFS_QGROUP_STATUS_FLAG_RESCAN nor BTRFS_QGROUP_STATUS_FLAG_ON
>> are set."
>>
>> If you can't understand it, try the test case...
>
> Ok, I see it now; however, your description contradicts the code.
> Currently -EINVAL will be returned when either of the 2 flags is unset, i.e.
>
> !BTRFS_QGROUP_STATUS_FLAG_RESCAN && BTRFS_QGROUP_STATUS_FLAG_ON
> !BTRFS_QGROUP_STATUS_FLAG_ON && BTRFS_QGROUP_STATUS_FLAG_RESCAN
>
> whereas in your description you refer to "neither", that is, both flags
> being unset. Perhaps those combinations are invalid for some other reason
> which is not visible in the code, but in that case the changelog should
> be expanded to cover why those 2 combinations are impossible (because if
> they are possible, -EINVAL is still likely)?

I don't think the changelog is contradictory.
It says that -EINVAL is always returned, independently of which qgroup
flags are set/missing.
Further it says that the error should be returned only when one of
those 2 qgroup flags is not set (or both are not set).

>
>>
>>
>>>
>>>>   }
>>>>
>>>>   mutex_lock(_info->qgroup_rescan_lock);
>>>>
>>
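
To make the point concrete, here is a stripped-down, compilable model of the
two versions of the mount-time check being discussed: the regressed one
returns -EINVAL no matter which flags are set, the fixed one only when a
required flag is missing. The flag names and messages are illustrative
stand-ins; only the control flow mirrors the hunk above.

/* Model of the qgroup_rescan_init() mount-time check. Illustrative only. */
#include <errno.h>
#include <stdio.h>

#define FLAG_RESCAN 0x1   /* stand-in for BTRFS_QGROUP_STATUS_FLAG_RESCAN */
#define FLAG_ON     0x2   /* stand-in for BTRFS_QGROUP_STATUS_FLAG_ON */

/* Regressed version: warns conditionally, but returns -EINVAL always,
 * even when both flags are set and the rescan should simply resume. */
static int check_before(unsigned int flags)
{
    if (!(flags & FLAG_RESCAN))
        fprintf(stderr, "rescan flag not set\n");
    else if (!(flags & FLAG_ON))
        fprintf(stderr, "qgroup not enabled\n");
    return -EINVAL;               /* unconditional: breaks the mount */
}

/* Fixed version: -EINVAL only when one of the two flags is missing. */
static int check_after(unsigned int flags)
{
    int ret = 0;

    if (!(flags & FLAG_RESCAN)) {
        fprintf(stderr, "rescan flag not set\n");
        ret = -EINVAL;
    } else if (!(flags & FLAG_ON)) {
        fprintf(stderr, "qgroup not enabled\n");
        ret = -EINVAL;
    }
    return ret;
}

int main(void)
{
    unsigned int flags = FLAG_RESCAN | FLAG_ON;  /* rescan was in progress */

    printf("before: %d, after: %d\n",
           check_before(flags), check_after(flags));
    return 0;
}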


Re: [PATCH] Btrfs: fix mount failure when qgroup rescan is in progress

2018-06-27 Thread Filipe Manana
On Wed, Jun 27, 2018 at 4:44 PM, Nikolay Borisov  wrote:
>
>
> On 27.06.2018 02:43, fdman...@kernel.org wrote:
>> From: Filipe Manana 
>>
>> If a power failure happens while the qgroup rescan kthread is running,
>> the next mount operation will always fail. This is because of a recent
>> regression that makes qgroup_rescan_init() incorrectly return -EINVAL
>> when we are mounting the filesystem (through btrfs_read_qgroup_config()).
>> This causes the -EINVAL error to be returned regardless of any qgroup
>> flags being set instead of returning the error only when neither of
>> the flags BTRFS_QGROUP_STATUS_FLAG_RESCAN nor BTRFS_QGROUP_STATUS_FLAG_ON
>> are set.
>>
>> A test case for fstests follows up soon.
>>
>> Fixes: 9593bf49675e ("btrfs: qgroup: show more meaningful qgroup_rescan_init 
>> error message")
>> Signed-off-by: Filipe Manana 
>> ---
>>  fs/btrfs/qgroup.c | 13 ++---
>>  1 file changed, 10 insertions(+), 3 deletions(-)
>>
>> diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
>> index 1874a6d2e6f5..d4171de93087 100644
>> --- a/fs/btrfs/qgroup.c
>> +++ b/fs/btrfs/qgroup.c
>> @@ -2784,13 +2784,20 @@ qgroup_rescan_init(struct btrfs_fs_info *fs_info, 
>> u64 progress_objectid,
>>
>>   if (!init_flags) {
>>   /* we're resuming qgroup rescan at mount time */
>> - if (!(fs_info->qgroup_flags & BTRFS_QGROUP_STATUS_FLAG_RESCAN))
>> + if (!(fs_info->qgroup_flags &
>> +   BTRFS_QGROUP_STATUS_FLAG_RESCAN)) {
>>   btrfs_warn(fs_info,
>>   "qgroup rescan init failed, qgroup is not enabled");
>> - else if (!(fs_info->qgroup_flags & 
>> BTRFS_QGROUP_STATUS_FLAG_ON))
>> + ret = -EINVAL;
>> + } else if (!(fs_info->qgroup_flags &
>> +  BTRFS_QGROUP_STATUS_FLAG_ON)) {
>>   btrfs_warn(fs_info,
>>   "qgroup rescan init failed, qgroup rescan is not 
>> queued");
>> - return -EINVAL;
>> + ret = -EINVAL;
>> + }
>> +
>> + if (ret)
>> + return ret;
>
>
> How is this patch functionally different from the old code? In both
> cases, if either of those 2 flags is not set, a warning is printed and
> -EINVAL is returned?

It is explained in the changelog:

"This is because of a recent
regression that makes qgroup_rescan_init() incorrectly return -EINVAL
when we are mounting the filesystem (through btrfs_read_qgroup_config()).
This causes the -EINVAL error to be returned regardless of any qgroup
flags being set instead of returning the error only when neither of
the flags BTRFS_QGROUP_STATUS_FLAG_RESCAN nor BTRFS_QGROUP_STATUS_FLAG_ON
are set."

If you can't understand it, try the test case...


>
>>   }
>>
>>   mutex_lock(_info->qgroup_rescan_lock);
>>


Re: [PATCH v4] btrfs: Don't remove block group still has pinned down bytes

2018-06-22 Thread Filipe Manana
On Fri, Jun 22, 2018 at 5:35 AM, Qu Wenruo  wrote:
> [BUG]
> Under certain KVM loads and LTP tests, it is possible to hit the
> following calltrace if quota is enabled:
> --
> BTRFS critical (device vda2): unable to find logical 8820195328 length 4096
> BTRFS critical (device vda2): unable to find logical 8820195328 length 4096
> [ cut here ]
> WARNING: CPU: 0 PID: 49 at ../block/blk-core.c:172 
> blk_status_to_errno+0x1a/0x30
> CPU: 0 PID: 49 Comm: kworker/u2:1 Not tainted 4.12.14-15-default #1 SLE15 
> (unreleased)
> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
> 1.0.0-prebuilt.qemu-project.org 04/01/2014
> Workqueue: btrfs-endio-write btrfs_endio_write_helper [btrfs]
> task: 9f827b340bc0 task.stack: b4f8c0304000
> RIP: 0010:blk_status_to_errno+0x1a/0x30
> Call Trace:
>  submit_extent_page+0x191/0x270 [btrfs]
>  ? btrfs_create_repair_bio+0x130/0x130 [btrfs]
>  __do_readpage+0x2d2/0x810 [btrfs]
>  ? btrfs_create_repair_bio+0x130/0x130 [btrfs]
>  ? run_one_async_done+0xc0/0xc0 [btrfs]
>  __extent_read_full_page+0xe7/0x100 [btrfs]
>  ? run_one_async_done+0xc0/0xc0 [btrfs]
>  read_extent_buffer_pages+0x1ab/0x2d0 [btrfs]
>  ? run_one_async_done+0xc0/0xc0 [btrfs]
>  btree_read_extent_buffer_pages+0x94/0xf0 [btrfs]
>  read_tree_block+0x31/0x60 [btrfs]
>  read_block_for_search.isra.35+0xf0/0x2e0 [btrfs]
>  btrfs_search_slot+0x46b/0xa00 [btrfs]
>  ? kmem_cache_alloc+0x1a8/0x510
>  ? btrfs_get_token_32+0x5b/0x120 [btrfs]
>  find_parent_nodes+0x11d/0xeb0 [btrfs]
>  ? leaf_space_used+0xb8/0xd0 [btrfs]
>  ? btrfs_leaf_free_space+0x49/0x90 [btrfs]
>  ? btrfs_find_all_roots_safe+0x93/0x100 [btrfs]
>  btrfs_find_all_roots_safe+0x93/0x100 [btrfs]
>  btrfs_find_all_roots+0x45/0x60 [btrfs]
>  btrfs_qgroup_trace_extent_post+0x20/0x40 [btrfs]
>  btrfs_add_delayed_data_ref+0x1a3/0x1d0 [btrfs]
>  btrfs_alloc_reserved_file_extent+0x38/0x40 [btrfs]
>  insert_reserved_file_extent.constprop.71+0x289/0x2e0 [btrfs]
>  btrfs_finish_ordered_io+0x2f4/0x7f0 [btrfs]
>  ? pick_next_task_fair+0x2cd/0x530
>  ? __switch_to+0x92/0x4b0
>  btrfs_worker_helper+0x81/0x300 [btrfs]
>  process_one_work+0x1da/0x3f0
>  worker_thread+0x2b/0x3f0
>  ? process_one_work+0x3f0/0x3f0
>  kthread+0x11a/0x130
>  ? kthread_create_on_node+0x40/0x40
>  ret_from_fork+0x35/0x40
> Code: 00 00 5b c3 0f 1f 00 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 40 80 
> ff 0c 40 0f b6 c7 77 0b 48 c1 e0 04 8b 80 00 bf c8 bd c3 <0f> 0b b8 fb ff ff 
> ff c3 0f 1f 40 00 66 2e 0f 1f 84 00 00 00 00
> ---[ end trace f079fb809e7a862b ]---
> BTRFS critical (device vda2): unable to find logical 8820195328 length 16384
> BTRFS: error (device vda2) in btrfs_finish_ordered_io:3023: errno=-5 IO 
> failure
> BTRFS info (device vda2): forced readonly
> BTRFS error (device vda2): pending csums is 2887680
> --
>
> [CAUSE]
> It's caused by a race with block group auto removal, as in the following
> case:
> - There is a meta block group X, which has only one tree block.
>   The tree block belongs to fs tree 257.
> - In the current transaction, some operation modified fs tree 257.
>   The tree block gets CoWed, so block group X is empty, marked as
>   unused and queued to be deleted.
> - Some workload (like fsync) wakes up cleaner_kthread(),
>   which calls btrfs_delete_unused_bgs() to remove unused block
>   groups.
>   So block group X along with its chunk map gets removed.
> - Some delalloc work finishes for fs tree 257.
>   Quota needs to get the original reference of the extent, which
>   reads tree blocks of the commit root of 257.
>   Then, since the chunk map was removed, the above warning is triggered.
>
> [FIX]
> Just teach btrfs_delete_unused_bgs() to skip block groups that still have
> pinned bytes.
>
> However, there is a minor side effect: since we currently only queue
> empty block groups at update_block_group(), and such an empty block group
> with pinned bytes won't go through update_block_group() again, that block
> group won't be removed until it gets a new extent allocated and removed.
>
> Signed-off-by: Qu Wenruo 
Reviewed-by: Filipe Manana 

thanks

> ---
> changelog:
> v2:
>   Commit message update, to better indicate how pinned byte is used in
>   btrfs and why it's related to quota.
> v3:
>   Commit message update, further explaining the bug with an example.
>   And added the side effect of the fix, and possible further fix.
> v4:
>   Remove unrelated and confusing commit message.
> ---
>  fs/btrfs/extent-tree.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index f190023386a9..7d14c4ca8232 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs
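
A stand-alone sketch of the skip condition described in [FIX]; the struct
fields loosely mirror the btrfs block group counters, but the exact kernel
condition is an assumption here, not a quote of the patch.

/* Model of the reclaim check from [FIX]: an unused block group is only
 * deleted once nothing in it is pinned (or reserved), so a tree block freed
 * by a CoW in the current transaction can still be read through the commit
 * root (e.g. by qgroup backref walking). Not the kernel code.
 */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

struct block_group {
    uint64_t used;      /* bytes referenced by live extents */
    uint64_t pinned;    /* bytes freed this transaction, not yet unpinned */
    uint64_t reserved;  /* bytes reserved for in-flight allocations */
    bool ro;
};

static bool can_delete_unused_bg(const struct block_group *bg)
{
    /* The pinned check is what the fix adds to the existing condition. */
    return bg->used == 0 && bg->pinned == 0 && bg->reserved == 0 && !bg->ro;
}

int main(void)
{
    struct block_group bg = { .used = 0, .pinned = 16384, .reserved = 0,
                              .ro = false };

    printf("delete now? %s\n", can_delete_unused_bg(&bg) ? "yes" : "no");
    bg.pinned = 0;   /* after the transaction commits and unpins */
    printf("delete now? %s\n", can_delete_unused_bg(&bg) ? "yes" : "no");
    return 0;
}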

Re: [PATCH 1/2] Btrfs: fix return value on rename exchange failure

2018-06-21 Thread Filipe Manana
On Tue, Jun 19, 2018 at 2:38 PM, David Sterba  wrote:
> On Mon, Jun 11, 2018 at 07:24:16PM +0100, fdman...@kernel.org wrote:
>> From: Filipe Manana 
>>
>> If we failed during a rename exchange operation after starting/joining a
>> transaction, we would end up replacing the return value, stored in the
>> local 'ret' variable, with the return value from btrfs_end_transaction().
>> So this could end up returning 0 (success) to user space despite the
>> operation having failed and aborted the transaction, because if there are
>> multiple tasks having a reference on the transaction at the time
>> btrfs_end_transaction() is called by the rename exchange, that function
>> returns 0 (otherwise it returns -EIO and not the original error value).
>> So fix this by not overwriting the return value on error after getting
>> a transaction handle.
>>
>> Signed-off-by: Filipe Manana 
>
> 1 and 2 queued for 4.18, thanks.

Please remove the 2nd patch, because I just ran into a deadlock
between syncing the log and the transaction kthread committing the
transaction while a rename was in progress.
I'll send a v2 once I understand the problem better and have a fix. Thanks.
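
For patch 1, the bug pattern described in its changelog (an earlier error
clobbered by the return value of the cleanup call) reduces to something like
the following; the helper names are made up for illustration and are not the
btrfs functions.

/* Model of the error-propagation bug and its fix. Illustrative only. */
#include <errno.h>
#include <stdio.h>

static int end_transaction(void) { return 0; }   /* usually succeeds */

static int op_buggy(int ret)
{
    /* ...rename exchange work already failed, ret holds the error... */
    ret = end_transaction();        /* clobbers the original error */
    return ret;
}

static int op_fixed(int ret)
{
    int ret2 = end_transaction();   /* keep the first error, if any */

    return ret ? ret : ret2;
}

int main(void)
{
    printf("buggy: %d (error lost), fixed: %d\n",
           op_buggy(-ENOMEM), op_fixed(-ENOMEM));
    return 0;
}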


Re: [PATCH v3] btrfs: Don't remove block group still has pinned down bytes

2018-06-21 Thread Filipe Manana
On Wed, Jun 20, 2018 at 12:03 PM, Qu Wenruo  wrote:
>
>
> On 2018-06-20 17:33, Filipe Manana wrote:
>> On Wed, Jun 20, 2018 at 10:22 AM, Qu Wenruo  wrote:
>>>
>>>
>>> On 2018-06-20 17:13, Filipe Manana wrote:
>>>> On Fri, Jun 15, 2018 at 2:35 AM, Qu Wenruo  wrote:
>>>>> [BUG]
>>>>> Under certain KVM load and LTP tests, we are possible to hit the
>>>>> following calltrace if quota is enabled:
>>>>> --
>>>>> BTRFS critical (device vda2): unable to find logical 8820195328 length 
>>>>> 4096
>>>>> BTRFS critical (device vda2): unable to find logical 8820195328 length 
>>>>> 4096
>>>>> [ cut here ]
>>>>> WARNING: CPU: 0 PID: 49 at ../block/blk-core.c:172 
>>>>> blk_status_to_errno+0x1a/0x30
>>>>> CPU: 0 PID: 49 Comm: kworker/u2:1 Not tainted 4.12.14-15-default #1 SLE15 
>>>>> (unreleased)
>>>>> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
>>>>> 1.0.0-prebuilt.qemu-project.org 04/01/2014
>>>>> Workqueue: btrfs-endio-write btrfs_endio_write_helper [btrfs]
>>>>> task: 9f827b340bc0 task.stack: b4f8c0304000
>>>>> RIP: 0010:blk_status_to_errno+0x1a/0x30
>>>>> Call Trace:
>>>>>  submit_extent_page+0x191/0x270 [btrfs]
>>>>>  ? btrfs_create_repair_bio+0x130/0x130 [btrfs]
>>>>>  __do_readpage+0x2d2/0x810 [btrfs]
>>>>>  ? btrfs_create_repair_bio+0x130/0x130 [btrfs]
>>>>>  ? run_one_async_done+0xc0/0xc0 [btrfs]
>>>>>  __extent_read_full_page+0xe7/0x100 [btrfs]
>>>>>  ? run_one_async_done+0xc0/0xc0 [btrfs]
>>>>>  read_extent_buffer_pages+0x1ab/0x2d0 [btrfs]
>>>>>  ? run_one_async_done+0xc0/0xc0 [btrfs]
>>>>>  btree_read_extent_buffer_pages+0x94/0xf0 [btrfs]
>>>>>  read_tree_block+0x31/0x60 [btrfs]
>>>>>  read_block_for_search.isra.35+0xf0/0x2e0 [btrfs]
>>>>>  btrfs_search_slot+0x46b/0xa00 [btrfs]
>>>>>  ? kmem_cache_alloc+0x1a8/0x510
>>>>>  ? btrfs_get_token_32+0x5b/0x120 [btrfs]
>>>>>  find_parent_nodes+0x11d/0xeb0 [btrfs]
>>>>>  ? leaf_space_used+0xb8/0xd0 [btrfs]
>>>>>  ? btrfs_leaf_free_space+0x49/0x90 [btrfs]
>>>>>  ? btrfs_find_all_roots_safe+0x93/0x100 [btrfs]
>>>>>  btrfs_find_all_roots_safe+0x93/0x100 [btrfs]
>>>>>  btrfs_find_all_roots+0x45/0x60 [btrfs]
>>>>>  btrfs_qgroup_trace_extent_post+0x20/0x40 [btrfs]
>>>>>  btrfs_add_delayed_data_ref+0x1a3/0x1d0 [btrfs]
>>>>>  btrfs_alloc_reserved_file_extent+0x38/0x40 [btrfs]
>>>>>  insert_reserved_file_extent.constprop.71+0x289/0x2e0 [btrfs]
>>>>>  btrfs_finish_ordered_io+0x2f4/0x7f0 [btrfs]
>>>>>  ? pick_next_task_fair+0x2cd/0x530
>>>>>  ? __switch_to+0x92/0x4b0
>>>>>  btrfs_worker_helper+0x81/0x300 [btrfs]
>>>>>  process_one_work+0x1da/0x3f0
>>>>>  worker_thread+0x2b/0x3f0
>>>>>  ? process_one_work+0x3f0/0x3f0
>>>>>  kthread+0x11a/0x130
>>>>>  ? kthread_create_on_node+0x40/0x40
>>>>>  ret_from_fork+0x35/0x40
>>>>> Code: 00 00 5b c3 0f 1f 00 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 
>>>>> 40 80 ff 0c 40 0f b6 c7 77 0b 48 c1 e0 04 8b 80 00 bf c8 bd c3 <0f> 0b b8 
>>>>> fb ff ff ff c3 0f 1f 40 00 66 2e 0f 1f 84 00 00 00 00
>>>>> ---[ end trace f079fb809e7a862b ]---
>>>>> BTRFS critical (device vda2): unable to find logical 8820195328 length 
>>>>> 16384
>>>>> BTRFS: error (device vda2) in btrfs_finish_ordered_io:3023: errno=-5 IO 
>>>>> failure
>>>>> BTRFS info (device vda2): forced readonly
>>>>> BTRFS error (device vda2): pending csums is 2887680
>>>>> --
>>>>>
>>>>> [CAUSE]
>>>>> It's caused by race with block group auto removal like the following
>>>>> case:
>>>>> - There is a meta block group X, which has only one tree block
>>>>>   The tree block belongs to fs tree 257.
>>>>> - In current transaction, some operation modified fs tree 257
>>>>>   The tree block get CoWed, so the block group X is empty, and marked as
>>>>>   unused, queued to be deleted.
>>>>> - Some workload (like fsync) wakes up cleaner_kthread()
>>>>>   Which w

Re: [PATCH v3] btrfs: Don't remove block group still has pinned down bytes

2018-06-20 Thread Filipe Manana
On Wed, Jun 20, 2018 at 10:22 AM, Qu Wenruo  wrote:
>
>
>> On 2018-06-20 17:13, Filipe Manana wrote:
>> On Fri, Jun 15, 2018 at 2:35 AM, Qu Wenruo  wrote:
>>> [BUG]
>>> Under certain KVM load and LTP tests, we are possible to hit the
>>> following calltrace if quota is enabled:
>>> --
>>> BTRFS critical (device vda2): unable to find logical 8820195328 length 4096
>>> BTRFS critical (device vda2): unable to find logical 8820195328 length 4096
>>> [ cut here ]
>>> WARNING: CPU: 0 PID: 49 at ../block/blk-core.c:172 
>>> blk_status_to_errno+0x1a/0x30
>>> CPU: 0 PID: 49 Comm: kworker/u2:1 Not tainted 4.12.14-15-default #1 SLE15 
>>> (unreleased)
>>> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
>>> 1.0.0-prebuilt.qemu-project.org 04/01/2014
>>> Workqueue: btrfs-endio-write btrfs_endio_write_helper [btrfs]
>>> task: 9f827b340bc0 task.stack: b4f8c0304000
>>> RIP: 0010:blk_status_to_errno+0x1a/0x30
>>> Call Trace:
>>>  submit_extent_page+0x191/0x270 [btrfs]
>>>  ? btrfs_create_repair_bio+0x130/0x130 [btrfs]
>>>  __do_readpage+0x2d2/0x810 [btrfs]
>>>  ? btrfs_create_repair_bio+0x130/0x130 [btrfs]
>>>  ? run_one_async_done+0xc0/0xc0 [btrfs]
>>>  __extent_read_full_page+0xe7/0x100 [btrfs]
>>>  ? run_one_async_done+0xc0/0xc0 [btrfs]
>>>  read_extent_buffer_pages+0x1ab/0x2d0 [btrfs]
>>>  ? run_one_async_done+0xc0/0xc0 [btrfs]
>>>  btree_read_extent_buffer_pages+0x94/0xf0 [btrfs]
>>>  read_tree_block+0x31/0x60 [btrfs]
>>>  read_block_for_search.isra.35+0xf0/0x2e0 [btrfs]
>>>  btrfs_search_slot+0x46b/0xa00 [btrfs]
>>>  ? kmem_cache_alloc+0x1a8/0x510
>>>  ? btrfs_get_token_32+0x5b/0x120 [btrfs]
>>>  find_parent_nodes+0x11d/0xeb0 [btrfs]
>>>  ? leaf_space_used+0xb8/0xd0 [btrfs]
>>>  ? btrfs_leaf_free_space+0x49/0x90 [btrfs]
>>>  ? btrfs_find_all_roots_safe+0x93/0x100 [btrfs]
>>>  btrfs_find_all_roots_safe+0x93/0x100 [btrfs]
>>>  btrfs_find_all_roots+0x45/0x60 [btrfs]
>>>  btrfs_qgroup_trace_extent_post+0x20/0x40 [btrfs]
>>>  btrfs_add_delayed_data_ref+0x1a3/0x1d0 [btrfs]
>>>  btrfs_alloc_reserved_file_extent+0x38/0x40 [btrfs]
>>>  insert_reserved_file_extent.constprop.71+0x289/0x2e0 [btrfs]
>>>  btrfs_finish_ordered_io+0x2f4/0x7f0 [btrfs]
>>>  ? pick_next_task_fair+0x2cd/0x530
>>>  ? __switch_to+0x92/0x4b0
>>>  btrfs_worker_helper+0x81/0x300 [btrfs]
>>>  process_one_work+0x1da/0x3f0
>>>  worker_thread+0x2b/0x3f0
>>>  ? process_one_work+0x3f0/0x3f0
>>>  kthread+0x11a/0x130
>>>  ? kthread_create_on_node+0x40/0x40
>>>  ret_from_fork+0x35/0x40
>>> Code: 00 00 5b c3 0f 1f 00 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 40 
>>> 80 ff 0c 40 0f b6 c7 77 0b 48 c1 e0 04 8b 80 00 bf c8 bd c3 <0f> 0b b8 fb 
>>> ff ff ff c3 0f 1f 40 00 66 2e 0f 1f 84 00 00 00 00
>>> ---[ end trace f079fb809e7a862b ]---
>>> BTRFS critical (device vda2): unable to find logical 8820195328 length 16384
>>> BTRFS: error (device vda2) in btrfs_finish_ordered_io:3023: errno=-5 IO 
>>> failure
>>> BTRFS info (device vda2): forced readonly
>>> BTRFS error (device vda2): pending csums is 2887680
>>> --
>>>
>>> [CAUSE]
>>> It's caused by race with block group auto removal like the following
>>> case:
>>> - There is a meta block group X, which has only one tree block
>>>   The tree block belongs to fs tree 257.
>>> - In current transaction, some operation modified fs tree 257
>>>   The tree block get CoWed, so the block group X is empty, and marked as
>>>   unused, queued to be deleted.
>>> - Some workload (like fsync) wakes up cleaner_kthread()
>>>   Which will call btrfs_deleted_unused_bgs() to remove unused block
>>>   groups.
>>>   So block group X along its chunk map get removed.
>>> - Some delalloc work finished for fs tree 257
>>>   Quota needs to get the original reference of the extent, which will
>>>   reads tree blocks of commit root of 257.
>>>   Then since the chunk map get removed, above warning get triggered.
>>>
>>> [FIX]
>>> Just teach btrfs_delete_unused_bgs() to skip block group who still has
>>> pinned bytes.
>>>
>>> However there is a minor side effect, since currently we only queue
>>> empty blocks at update_block_group(), and such empty block group with
>>> pinned bytes won't go through update_bloc

Re: [PATCH v3] btrfs: Don't remove block group still has pinned down bytes

2018-06-20 Thread Filipe Manana
On Fri, Jun 15, 2018 at 2:35 AM, Qu Wenruo  wrote:
> [BUG]
> Under certain KVM load and LTP tests, we are possible to hit the
> following calltrace if quota is enabled:
> --
> BTRFS critical (device vda2): unable to find logical 8820195328 length 4096
> BTRFS critical (device vda2): unable to find logical 8820195328 length 4096
> [ cut here ]
> WARNING: CPU: 0 PID: 49 at ../block/blk-core.c:172 
> blk_status_to_errno+0x1a/0x30
> CPU: 0 PID: 49 Comm: kworker/u2:1 Not tainted 4.12.14-15-default #1 SLE15 
> (unreleased)
> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
> 1.0.0-prebuilt.qemu-project.org 04/01/2014
> Workqueue: btrfs-endio-write btrfs_endio_write_helper [btrfs]
> task: 9f827b340bc0 task.stack: b4f8c0304000
> RIP: 0010:blk_status_to_errno+0x1a/0x30
> Call Trace:
>  submit_extent_page+0x191/0x270 [btrfs]
>  ? btrfs_create_repair_bio+0x130/0x130 [btrfs]
>  __do_readpage+0x2d2/0x810 [btrfs]
>  ? btrfs_create_repair_bio+0x130/0x130 [btrfs]
>  ? run_one_async_done+0xc0/0xc0 [btrfs]
>  __extent_read_full_page+0xe7/0x100 [btrfs]
>  ? run_one_async_done+0xc0/0xc0 [btrfs]
>  read_extent_buffer_pages+0x1ab/0x2d0 [btrfs]
>  ? run_one_async_done+0xc0/0xc0 [btrfs]
>  btree_read_extent_buffer_pages+0x94/0xf0 [btrfs]
>  read_tree_block+0x31/0x60 [btrfs]
>  read_block_for_search.isra.35+0xf0/0x2e0 [btrfs]
>  btrfs_search_slot+0x46b/0xa00 [btrfs]
>  ? kmem_cache_alloc+0x1a8/0x510
>  ? btrfs_get_token_32+0x5b/0x120 [btrfs]
>  find_parent_nodes+0x11d/0xeb0 [btrfs]
>  ? leaf_space_used+0xb8/0xd0 [btrfs]
>  ? btrfs_leaf_free_space+0x49/0x90 [btrfs]
>  ? btrfs_find_all_roots_safe+0x93/0x100 [btrfs]
>  btrfs_find_all_roots_safe+0x93/0x100 [btrfs]
>  btrfs_find_all_roots+0x45/0x60 [btrfs]
>  btrfs_qgroup_trace_extent_post+0x20/0x40 [btrfs]
>  btrfs_add_delayed_data_ref+0x1a3/0x1d0 [btrfs]
>  btrfs_alloc_reserved_file_extent+0x38/0x40 [btrfs]
>  insert_reserved_file_extent.constprop.71+0x289/0x2e0 [btrfs]
>  btrfs_finish_ordered_io+0x2f4/0x7f0 [btrfs]
>  ? pick_next_task_fair+0x2cd/0x530
>  ? __switch_to+0x92/0x4b0
>  btrfs_worker_helper+0x81/0x300 [btrfs]
>  process_one_work+0x1da/0x3f0
>  worker_thread+0x2b/0x3f0
>  ? process_one_work+0x3f0/0x3f0
>  kthread+0x11a/0x130
>  ? kthread_create_on_node+0x40/0x40
>  ret_from_fork+0x35/0x40
> Code: 00 00 5b c3 0f 1f 00 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 40 80 
> ff 0c 40 0f b6 c7 77 0b 48 c1 e0 04 8b 80 00 bf c8 bd c3 <0f> 0b b8 fb ff ff 
> ff c3 0f 1f 40 00 66 2e 0f 1f 84 00 00 00 00
> ---[ end trace f079fb809e7a862b ]---
> BTRFS critical (device vda2): unable to find logical 8820195328 length 16384
> BTRFS: error (device vda2) in btrfs_finish_ordered_io:3023: errno=-5 IO 
> failure
> BTRFS info (device vda2): forced readonly
> BTRFS error (device vda2): pending csums is 2887680
> --
>
> [CAUSE]
> It's caused by race with block group auto removal like the following
> case:
> - There is a meta block group X, which has only one tree block
>   The tree block belongs to fs tree 257.
> - In current transaction, some operation modified fs tree 257
>   The tree block get CoWed, so the block group X is empty, and marked as
>   unused, queued to be deleted.
> - Some workload (like fsync) wakes up cleaner_kthread()
>   Which will call btrfs_deleted_unused_bgs() to remove unused block
>   groups.
>   So block group X along its chunk map get removed.
> - Some delalloc work finished for fs tree 257
>   Quota needs to get the original reference of the extent, which will
>   reads tree blocks of commit root of 257.
>   Then since the chunk map get removed, above warning get triggered.
>
> [FIX]
> Just teach btrfs_delete_unused_bgs() to skip block group who still has
> pinned bytes.
>
> However there is a minor side effect, since currently we only queue
> empty blocks at update_block_group(), and such empty block group with
> pinned bytes won't go through update_block_group() again, such block
> group won't be removed, until it get new extent allocated and removed.

So that can be fixed in a separate patch, to add it back to the list
of block groups to be deleted once everything is unpinned and passes
all other necessary criteria.

>
> But please note that, there are more problems related to extent
> allocator with block group auto removal.

The above isn't a problem of the allocator itself but rather of the
way we manage COW, commit roots and unpinning.

>
> Even if a block group is marked unused, the extent allocator can still
> allocate new extents from it.

Why is that a problem?
It's ok (with some good benefits), as long as the cleaner thread (or
anything that attempts to delete block groups in the unused list)
doesn't delete it.

> Thus delaying block group removal to the next transaction won't work.
> (Extents get allocated in the current transaction, and removed again in the
> next transaction.)
>
> So the root fix needs to co-operate with the extent allocator.

What do you mean by co-operation with the extent allocator? I don't

Re: [PATCH] Btrfs: fix physical offset reported by fiemap for inline extents

2018-06-20 Thread Filipe Manana
On Wed, Jun 20, 2018 at 3:55 AM, robbieko  wrote:
> fdman...@kernel.org wrote on 2018-06-19 19:31:
>
>> From: Filipe Manana 
>>
>> Commit 9d311e11fc1f ("Btrfs: fiemap: pass correct bytenr when
>> fm_extent_count is zero") introduced a regression where we no longer
>> report 0 as the physical offset for inline extents. This is because it
>> always sets the variable used to report the physical offset ("disko")
>> as em->block_start plus some offset, and em->block_start has the value
>> 18446744073709551614 ((u64) -2) for inline extents.
>>
>> This made the btrfs test 004 (from fstests) often fail, for example, for
>> a file with an inline extent we have the following items in the subvolume
>> tree:
>>
>> item 101 key (418 INODE_ITEM 0) itemoff 11029 itemsize 160
>>generation 25 transid 38 size 1525 nbytes 1525
>>block group 0 mode 100666 links 1 uid 0 gid 0 rdev 0
>>sequence 0 flags 0x2(none)
>>atime 1529342058.461891730 (2018-06-18 18:14:18)
>>ctime 1529342058.461891730 (2018-06-18 18:14:18)
>>mtime 1529342058.461891730 (2018-06-18 18:14:18)
>>otime 1529342055.869892885 (2018-06-18 18:14:15)
>> item 102 key (418 INODE_REF 264) itemoff 11016 itemsize 13
>>index 25 namelen 3 name: fc7
>> item 103 key (418 EXTENT_DATA 0) itemoff 9470 itemsize 1546
>>generation 38 type 0 (inline)
>>inline extent data size 1525 ram_bytes 1525 compression 0
>> (none)
>>
>> Then when test 004 invoked fiemap against the file it got a non-zero
>> physical offset:
>>
>>  $ filefrag -v /mnt/p0/d4/d7/fc7
>>  Filesystem type is: 9123683e
>>  File size of /mnt/p0/d4/d7/fc7 is 1525 (1 block of 4096 bytes)
>>   ext: logical_offset:physical_offset: length:   expected:
>> flags:
>> 0:0..4095: 18446744073709551614..  4093:   4096:
>>   last,not_aligned,inline,eof
>>  /mnt/p0/d4/d7/fc7: 1 extent found
>>
>> This resulted in the test failing like this:
>>
>> btrfs/004 49s ... [failed, exit status 1]- output mismatch (see
>> /home/fdmanana/git/hub/xfstests/results//btrfs/004.out.bad)
>> --- tests/btrfs/004.out 2016-08-23 10:17:35.027012095 +0100
>> +++
>> /home/fdmanana/git/hub/xfstests/results//btrfs/004.out.bad  2018-06-18
>> 18:15:02.385872155 +0100
>> @@ -1,3 +1,10 @@
>>  QA output created by 004
>>  *** test backref walking
>> -*** done
>> +./tests/btrfs/004: line 227: [: 7.55578637259143e+22: integer
>> expression expected
>> +ERROR: 7.55578637259143e+22 is not a valid numeric value.
>> +unexpected output from
>> +   /home/fdmanana/git/hub/btrfs-progs/btrfs inspect-internal
>> logical-resolve -s 65536 -P 7.55578637259143e+22
>> /home/fdmanana/btrfs-tests/scratch_1
>> ...
>> (Run 'diff -u tests/btrfs/004.out
>> /home/fdmanana/git/hub/xfstests/results//btrfs/004.out.bad'  to see
>> the entire diff)
>> Ran: btrfs/004
>>
>> The large number in scientific notation reported as an invalid numeric
>> value is the result from the filter passed to perl which multiplies the
>> physical offset by the block size reported by fiemap.
>>
>> So fix this by ensuring the physical offset is always set to 0 when we
>> are processing an inline extent.
>>
>> Fixes: 9d311e11fc1f ("Btrfs: fiemap: pass correct bytenr when
>> fm_extent_count is zero")
>> Signed-off-by: Filipe Manana 
>> ---
>>  fs/btrfs/extent_io.c | 1 +
>>  1 file changed, 1 insertion(+)
>>
>> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
>> index 8e4a7cdbc9f5..978327d98fc5 100644
>> --- a/fs/btrfs/extent_io.c
>> +++ b/fs/btrfs/extent_io.c
>> @@ -4559,6 +4559,7 @@ int extent_fiemap(struct inode *inode, struct
>> fiemap_extent_info *fieinfo,
>> end = 1;
>> flags |= FIEMAP_EXTENT_LAST;
>> } else if (em->block_start == EXTENT_MAP_INLINE) {
>> +   disko = 0;
>> flags |= (FIEMAP_EXTENT_DATA_INLINE |
>>   FIEMAP_EXTENT_NOT_ALIGNED);
>> } else if (em->block_start == EXTENT_MAP_DELALLOC) {
>
>
>
> EXTENT_MAP_DELALLOC should have the same problem.
>
> em->block_start has some special values. The following values should not be
> treated as a physical offset (disko):
> #define EXTENT_MAP_LAST_BYTE ((u64)-4)
> #define EX
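
Following the suggestion above, a stand-alone sketch of excluding every
special block_start sentinel from the reported physical offset. The sentinel
values follow fs/btrfs/extent_map.h; folding them into a single range check
is an assumption for illustration, not the actual kernel change.

/* Sketch: sentinel block_start values map to a physical offset of 0 in
 * fiemap output, as test btrfs/004 expects for inline extents. */
#include <stdint.h>
#include <stdio.h>

typedef uint64_t u64;

#define EXTENT_MAP_LAST_BYTE ((u64)-4)
#define EXTENT_MAP_HOLE      ((u64)-3)
#define EXTENT_MAP_INLINE    ((u64)-2)
#define EXTENT_MAP_DELALLOC  ((u64)-1)

static u64 fiemap_physical(u64 block_start, u64 offset_in_extent)
{
    /* Sentinels are not real disk addresses: report 0. */
    if (block_start >= EXTENT_MAP_LAST_BYTE)
        return 0;
    return block_start + offset_in_extent;
}

int main(void)
{
    printf("inline:   %llu\n",
           (unsigned long long)fiemap_physical(EXTENT_MAP_INLINE, 0));
    printf("delalloc: %llu\n",
           (unsigned long long)fiemap_physical(EXTENT_MAP_DELALLOC, 0));
    printf("regular:  %llu\n",
           (unsigned long long)fiemap_physical(65536, 4096));
    return 0;
}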

Re: [PATCH 2/2] Btrfs: sync log after logging new name

2018-06-15 Thread Filipe Manana
On Fri, Jun 15, 2018 at 4:54 PM, David Sterba  wrote:
> On Mon, Jun 11, 2018 at 07:24:28PM +0100, fdman...@kernel.org wrote:
>> From: Filipe Manana 
>> Fixes: 12fcfd22fe5b ("Btrfs: tree logging unlink/rename fixes")
>> Reported-by: Vijay Chidambaram 
>> Signed-off-by: Filipe Manana 
>
> There are some warnings and possible lock up caused by this patch, the
> 1/2 alone is ok but 1/2 + 2/2 leads to the following warnings. I checked
> twice, the patch base was the pull request ie. without any other 4.18
> stuff.

Are you sure it's this patch?
On top of for-4.18 it didn't cause any problems here, plus the trace
below has nothing to do with renames, hard links or fsync at all -
everything seems stuck on waiting for IO from dev replace.

>
> It's a qemu with 4 cpus and 2g of memory.
>
> btrfs/011
>
> [  876.705586] watchdog: BUG: soft lockup - CPU#2 stuck for 77s!
> [od:12857]
> [  876.708167] Modules linked in: btrfs libcrc32c xor 
> zstd_decompresszstd_compress xxhash raid6_pq loop
> [  876.710717] CPU: 2 PID: 12857 Comm: od Not tainted4.17.0-rc7-default+ #143
> [  876.712007] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),BIOS 
> 1.0.0-prebuilt.qemu-project.org 04/01/2014
> [  876.71] RIP: 0010:copy_user_generic_string+0x2c/0x40
> [  876.714586] RSP: 0018:a7968344bc68 EFLAGS: 00010246 
> ORIG_RAX:ff13
> [  876.716411] RAX: 55b128b4a370 RBX: 1000 
> RCX:0200
> [  876.718023] RDX:  RSI: 55b128b49370 
> RDI:88bb1b60
> [  876.719681] RBP: 1000 R08: 88bb1b60 
> R09:88bb
> [  876.721568] R10: 1000 R11: 0030 
> R12:1000
> [  876.723123] R13: 88bb1b601000 R14: a7968344be58 
> R15:
> [  876.724087] FS:  7f417c596540() 
> GS:88bb7fd0()knlGS:
> [  876.725165] CS:  0010 DS:  ES:  CR0: 80050033
> [  876.726375] CR2: 55c99ecbf298 CR3: 68436000 
> CR4:06e0
> [  876.728080] Call Trace:
> [  876.728849]  copyin+0x22/0x30
> [  876.729704]  iov_iter_copy_from_user_atomic+0x19a/0x410
> [  876.730789]  ? ptep_clear_flush+0x40/0x40
> [  876.731391]  btrfs_copy_from_user+0xab/0x120 [btrfs]
> [  876.732058]  __btrfs_buffered_write+0x367/0x710 [btrfs]
> [  876.732747]  btrfs_file_write_iter+0x2b8/0x5d0 [btrfs]
> [  876.733507]  ? touch_atime+0x27/0xb0
> [  876.734257]  __vfs_write+0xd4/0x130
> [  876.734860]  vfs_write+0xad/0x1e0
> [  876.735346]  ksys_write+0x42/0x90
> [  876.735858]  do_syscall_64+0x4f/0xe0
> [  876.736515]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> [  876.737690] RIP: 0033:0x7f417c0a3c94
> [  876.738565] RSP: 002b:7ffc07af7c48 EFLAGS: 0246 
> ORIG_RAX:0001
> [  876.740186] RAX: ffda RBX: 1000 
> RCX:7f417c0a3c94
> [  876.741263] RDX: 1000 RSI: 55b128b49370 
> RDI:0001
> [  876.742819] RBP: 55b128b49370 R08: 7f417c372760 
> R09:
> [  876.744157] R10: 07d0 R11: 0246 
> R12:7f417c372760
> [  876.745507] R13: 1000 R14: 7f417c36d760 
> R15:1000
> [ 1463.260071] INFO: task kworker/u8:12:9261 blocked for more than 480seconds.
> [ 1463.264100]   Tainted: G L4.17.0-rc7-default+#143
> [ 1463.267639] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"disables 
> this message.
> [ 1463.272086] kworker/u8:12   D0  9261  2 0x8000
> [ 1463.274224] Workqueue: btrfs-submit btrfs_submit_helper [btrfs]
> [ 1463.275442] Call Trace:
> [ 1463.276221]  ? __schedule+0x268/0x7f0
> [ 1463.277348]  schedule+0x33/0x90
> [ 1463.278105]  io_schedule+0x16/0x40
> [ 1463.279053]  wbt_wait+0x19b/0x310
> [ 1463.280085]  ? finish_wait+0x80/0x80
> [ 1463.281153]  blk_mq_make_request+0xba/0x6f0
> [ 1463.282354]  generic_make_request+0x173/0x3d0
> [ 1463.283617]  ? submit_bio+0x6c/0x140
> [ 1463.284678]  submit_bio+0x6c/0x140
> [ 1463.285705]  run_scheduled_bios+0x18e/0x480 [btrfs]
> [ 1463.287121]  ? normal_work_helper+0x6a/0x330 [btrfs]
> [ 1463.288543]  normal_work_helper+0x6a/0x330 [btrfs]
> [ 1463.289937]  process_one_work+0x16d/0x380
> [ 1463.291119]  worker_thread+0x2e/0x380
> [ 1463.292205]  ? process_one_work+0x380/0x380
> [ 1463.293420]  kthread+0x111/0x130
> [ 1463.294377]  ? kthread_create_worker_on_cpu+0x50/0x50
> [ 1463.295782]  ret_from_fork+0x1f/0x30
> [ 1463.296827] INFO: task btrfs-transacti:5799 blocked for more than 480 
> seconds.
> [ 1463.298874]   Tainted: G L4.17.0-rc7-default+ #143
> [ 1463.300596] "echo 0 > /proc/sys/kernel/hung

Re: [GIT PULL] Btrfs updates for 4.18

2018-06-11 Thread Filipe Manana
On Mon, Jun 11, 2018 at 9:14 AM, Anand Jain  wrote:
>
>
> On 06/10/2018 12:21 AM, Filipe Manana wrote:
>>
>> On Mon, Jun 4, 2018 at 4:43 PM, David Sterba  wrote:
>>>
>>> Hi,
>>>
>>> there are some new features and a usual load of cleanups, more details
>>> below.
>>>
>>> Specifically, there's a set of new non-privileged ioctls to allow
>>> subvolume listing.  It works but still needs a security review as it's a
>>> new interface and we might need to do some tweaks to the data
>>> structures. The fixes could be considered regressions but may touch the
>>> interfaces too.
>>>
>>> Currently there are no merge conflicts but linux-next has reported a few
>>> in the past, originating from other *FS trees.
>>>
>>> Please pull, thanks.
>>>
>>> ---
>>>
>>> User visible features:
>>>
>>> - added support for the ioctl FS_IOC_FSGETXATTR, per-inode flags,
>>> successor
>>>of GET/SETFLAGS; now supports only existing flags: append, immutable,
>>>noatime, nodump, sync
>>>
>>> - 3 new unprivileged ioctls to allow users to enumerate subvolumes
>>>
>>> - dedupe syscall implementation does not restrict the range to 16MiB,
>>> though it
>>>still splits the whole range to 16MiB chunks
>>>
>>> - on user demand, rmdir() is able to delete an empty subvolume, export
>>> the
>>>capability in sysfs
>>>
>>> - fix inode number types in tracepoints, other cleanups
>>>
>>> - send: improved speed when dealing with a large removed directory,
>>>measurements show decrease from 2000 minutes to 2 minutes on a
>>> directory with
>>>2 million entries
>>>
>>> - pre-commit check of superblock to detect a mysterious in-memory
>>> corruption
>>>
>>> - log message updates
>>>
>>>
>>> Other changes:
>>>
>>> - orphan inode cleanup improved, does not keep long-standing reservations
>>> that
>>>could lead up to early ENOSPC in some cases
>>>
>>> - slight improvement of handling snapshotted NOCOW files by avoiding some
>>>unnecessary tree searches
>>>
>>> - avoid OOM when dealing with many unmergeable small extents at flush
>>> time
>>>
>>> - speedup conversion of free space tree representations from/to
>>> bitmap/tree
>>>
>>> - code refactoring, deletion, cleanups
>>>- delayed refs
>>>- delayed iput
>>>- redundant argument removals
>>>- memory barrier cleanups
>>>- remove a redundant mutex supposedly excluding several ioctls to run
>>> in
>>>  parallel
>>>
>>> - new tracepoints for blockgroup manipulation
>>>
>>> - more sanity checks of compressed headers
>>>
>>> 
>>> The following changes since commit
>>> b04e217704b7f879c6b91222b066983a44a7a09f:
>>>
>>>Linux 4.17-rc7 (2018-05-27 13:01:47 -0700)
>>>
>>> are available in the Git repository at:
>>>
>>>git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux.git
>>> for-4.18-tag
>>>
>>> for you to fetch changes up to 23d0b79dfaed2305b500b0215b0421701ada6b1a:
>>>
>>>btrfs: Add unprivileged version of ino_lookup ioctl (2018-05-31
>>> 11:35:24 +0200)
>>>
>>> 
>>> Al Viro (1):
>>>btrfs: take the last remnants of ->d_fsdata use out
>>>
>>> Anand Jain (19):
>>>btrfs: add comment about BTRFS_FS_EXCL_OP
>>>btrfs: rename struct btrfs_fs_devices::list
>>>btrfs: cleanup __btrfs_open_devices() drop head pointer
>>>btrfs: rename __btrfs_close_devices to close_fs_devices
>>>btrfs: rename __btrfs_open_devices to open_fs_devices
>>>btrfs: cleanup find_device() drop list_head pointer
>>>btrfs: cleanup btrfs_rm_device() promote fs_devices pointer
>>>btrfs: move btrfs_raid_type_names values to btrfs_raid_attr table
>>>btrfs: move btrfs_raid_group values to btrfs_raid_attr table
>>>btrfs: move btrfs_raid_mindev_errorvalues to btrfs_raid_attr table
>>>btrfs: reduce uuid_mutex critical section while scanning devices
>>>btrfs: use existin

Re: [GIT PULL] Btrfs updates for 4.18

2018-06-09 Thread Filipe Manana
On Mon, Jun 4, 2018 at 4:43 PM, David Sterba  wrote:
> Hi,
>
> there are some new features and a usual load of cleanups, more details below.
>
> Specifically, there's a set of new non-privileged ioctls to allow
> subvolume listing.  It works but still needs a security review as it's a
> new interface and we might need to do some tweaks to the data
> structures. The fixes could be considered regressions but may touch the
> interfaces too.
>
> Currently there are no merge conflicts but linux-next has reported a few
> in the past, originating from other *FS trees.
>
> Please pull, thanks.
>
> ---
>
> User visible features:
>
> - added support for the ioctl FS_IOC_FSGETXATTR, per-inode flags, successor
>   of GET/SETFLAGS; now supports only existing flags: append, immutable,
>   noatime, nodump, sync
>
> - 3 new unprivileged ioctls to allow users to enumerate subvolumes
>
> - dedupe syscall implementation does not restrict the range to 16MiB, though 
> it
>   still splits the whole range to 16MiB chunks
>
> - on user demand, rmdir() is able to delete an empty subvolume, export the
>   capability in sysfs
>
> - fix inode number types in tracepoints, other cleanups
>
> - send: improved speed when dealing with a large removed directory,
>   measurements show decrease from 2000 minutes to 2 minutes on a directory 
> with
>   2 million entries
>
> - pre-commit check of superblock to detect a mysterious in-memory corruption
>
> - log message updates
>
>
> Other changes:
>
> - orphan inode cleanup improved, does not keep long-standing reservations that
>   could lead up to early ENOSPC in some cases
>
> - slight improvement of handling snapshotted NOCOW files by avoiding some
>   unnecessary tree searches
>
> - avoid OOM when dealing with many unmergeable small extents at flush time
>
> - speedup conversion of free space tree representations from/to bitmap/tree
>
> - code refactoring, deletion, cleanups
>   - delayed refs
>   - delayed iput
>   - redundant argument removals
>   - memory barrier cleanups
>   - remove a redundant mutex supposedly excluding several ioctls to run in
> parallel
>
> - new tracepoints for blockgroup manipulation
>
> - more sanity checks of compressed headers
>
> 
> The following changes since commit b04e217704b7f879c6b91222b066983a44a7a09f:
>
>   Linux 4.17-rc7 (2018-05-27 13:01:47 -0700)
>
> are available in the Git repository at:
>
>   git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux.git for-4.18-tag
>
> for you to fetch changes up to 23d0b79dfaed2305b500b0215b0421701ada6b1a:
>
>   btrfs: Add unprivileged version of ino_lookup ioctl (2018-05-31 11:35:24 
> +0200)
>
> 
> Al Viro (1):
>   btrfs: take the last remnants of ->d_fsdata use out
>
> Anand Jain (19):
>   btrfs: add comment about BTRFS_FS_EXCL_OP
>   btrfs: rename struct btrfs_fs_devices::list
>   btrfs: cleanup __btrfs_open_devices() drop head pointer
>   btrfs: rename __btrfs_close_devices to close_fs_devices
>   btrfs: rename __btrfs_open_devices to open_fs_devices
>   btrfs: cleanup find_device() drop list_head pointer
>   btrfs: cleanup btrfs_rm_device() promote fs_devices pointer
>   btrfs: move btrfs_raid_type_names values to btrfs_raid_attr table
>   btrfs: move btrfs_raid_group values to btrfs_raid_attr table
>   btrfs: move btrfs_raid_mindev_errorvalues to btrfs_raid_attr table
>   btrfs: reduce uuid_mutex critical section while scanning devices
>   btrfs: use existing cur_devices, cleanup btrfs_rm_device
>   btrfs: document uuid_mutex usage in read_chunk_tree
>   btrfs: replace uuid_mutex by device_list_mutex in btrfs_open_devices

This change (commit 542c5908abfe84f7b4c1717492ecc92ea0ea328d, "btrfs:
replace uuid_mutex by device_list_mutex in btrfs_open_devices"), at
the very least
introduces a lockdep warning:

[  865.021049] ==
[  865.021950] WARNING: possible circular locking dependency detected
[  865.022828] 4.17.0-rc7-btrfs-next-59+ #1 Not tainted
[  865.023491] --
[  865.024342] fsstress/27897 is trying to acquire lock:
[  865.025070] 99260c12 (_info->reloc_mutex){+.+.}, at:
btrfs_record_root_in_trans+0x43/0x62 [btrfs]
[  865.026369]
[  865.026369] but task is already holding lock:
[  865.027206] 8dc17c22 (>mmap_sem){}, at:
vm_mmap_pgoff+0x77/0xe8
[  865.028251]
[  865.028251] which lock already depends on the new lock.
[  865.028251]
[  865.029482]
[  865.029482] the existing dependency chain (in reverse order) is:
[  865.030523]
[  865.030523] -> #7 (>mmap_sem){}:
[  865.031241]_copy_to_user+0x1e/0x63
[  865.031745]filldir+0x9e/0xef
[  865.032285]dir_emit_dots+0x3b/0xbd
[  865.032881]dcache_readdir+0x22/0xbb
[  865.033502]iterate_dir+0xa3/0x13e
[  

Re: [PATCH 1/4] btrfs: always wait on ordered extents at fsync time

2018-05-24 Thread Filipe Manana
On Wed, May 23, 2018 at 4:58 PM, Josef Bacik  wrote:
> From: Josef Bacik 
>
> There's a priority inversion that exists currently with btrfs fsync.  In
> some cases we will collect outstanding ordered extents onto a list and
> only wait on them at the very last second.  However this "very last
> second" falls inside of a transaction handle, so if we are in a lower
> priority cgroup we can end up holding the transaction open for longer
> than needed, so if a high priority cgroup is also trying to fsync()
> it'll see latency.
>
> Signed-off-by: Josef Bacik 
> ---
>  fs/btrfs/file.c | 56 
>  1 file changed, 4 insertions(+), 52 deletions(-)
>
> diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
> index 5772f0cbedef..2b1c36612384 100644
> --- a/fs/btrfs/file.c
> +++ b/fs/btrfs/file.c
> @@ -2069,53 +2069,12 @@ int btrfs_sync_file(struct file *file, loff_t start, 
> loff_t end, int datasync)
> atomic_inc(>log_batch);
> full_sync = test_bit(BTRFS_INODE_NEEDS_FULL_SYNC,
>  _I(inode)->runtime_flags);
> +
> /*
> -* We might have have had more pages made dirty after calling
> -* start_ordered_ops and before acquiring the inode's i_mutex.
> +* We have to do this here to avoid the priority inversion of waiting 
> on
> +* IO of a lower priority task while holding a transaciton open.
>  */
> -   if (full_sync) {
> -   /*
> -* For a full sync, we need to make sure any ordered 
> operations
> -* start and finish before we start logging the inode, so that
> -* all extents are persisted and the respective file extent
> -* items are in the fs/subvol btree.
> -*/
> -   ret = btrfs_wait_ordered_range(inode, start, len);
> -   } else {
> -   /*
> -* Start any new ordered operations before starting to log the
> -* inode. We will wait for them to finish in btrfs_sync_log().
> -*
> -* Right before acquiring the inode's mutex, we might have new
> -* writes dirtying pages, which won't immediately start the
> -* respective ordered operations - that is done through the
> -* fill_delalloc callbacks invoked from the writepage and
> -* writepages address space operations. So make sure we start
> -* all ordered operations before starting to log our inode. Not
> -* doing this means that while logging the inode, writeback
> -* could start and invoke writepage/writepages, which would call
> -* the fill_delalloc callbacks (cow_file_range,
> -* submit_compressed_extents). These callbacks add first an
> -* extent map to the modified list of extents and then create
> -* the respective ordered operation, which means in
> -* tree-log.c:btrfs_log_inode() we might capture all existing
> -* ordered operations (with btrfs_get_logged_extents()) before
> -* the fill_delalloc callback adds its ordered operation, and by
> -* the time we visit the modified list of extent maps (with
> -* btrfs_log_changed_extents()), we see and process the extent
> -* map they created. We then use the extent map to construct a
> -* file extent item for logging without waiting for the
> -* respective ordered operation to finish - this file extent
> -* item points to a disk location that might not have yet been
> -* written to, containing random data - so after a crash a log
> -* replay will make our inode have file extent items that point
> -* to disk locations containing invalid data, as we returned
> -* success to userspace without waiting for the respective
> -* ordered operation to finish, because it wasn't captured by
> -* btrfs_get_logged_extents().
> -*/
> -   ret = start_ordered_ops(inode, start, end);
> -   }
> +   ret = btrfs_wait_ordered_range(inode, start, len);
> if (ret) {
> inode_unlock(inode);
> goto out;
> @@ -2240,13 +2199,6 @@ int btrfs_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
> goto out;
> }
> }
> -   if (!full_sync) {
> -   ret = btrfs_wait_ordered_range(inode, start, len);
> -   if (ret) {
> -   btrfs_end_transaction(trans);
> -   goto out;
> -   }
> -   
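[Editor's note: the inversion described in the changelog has the usual shape of a
low-priority task holding a shared resource (here the open transaction) while it
waits on its own slow IO, so a high-priority task needing the same resource stalls
behind it. A generic userspace sketch of that shape, using pthread primitives and
names of my own choosing rather than btrfs internals:

/*
 * "transaction_lock" stands in for the open transaction, slow_io() for
 * waiting on ordered extents.  Illustrative only, not btrfs code.
 */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t transaction_lock = PTHREAD_MUTEX_INITIALIZER;

static void slow_io(void)
{
	sleep(2);	/* low priority cgroup, throttled writeback, etc. */
}

static void *low_prio_fsync(void *arg)
{
	/* Old ordering: take the "transaction", then wait for IO inside it. */
	pthread_mutex_lock(&transaction_lock);
	slow_io();
	pthread_mutex_unlock(&transaction_lock);
	return NULL;
}

static void *high_prio_fsync(void *arg)
{
	/* Stalls for the whole slow_io() of the low priority task. */
	pthread_mutex_lock(&transaction_lock);
	printf("high prio task finally got the transaction\n");
	pthread_mutex_unlock(&transaction_lock);
	return NULL;
}

int main(void)
{
	pthread_t low, high;

	pthread_create(&low, NULL, low_prio_fsync, NULL);
	usleep(100000);			/* let the low prio task win the lock */
	pthread_create(&high, NULL, high_prio_fsync, NULL);
	pthread_join(low, NULL);
	pthread_join(high, NULL);
	/* The patch's fix is the moral equivalent of calling slow_io()
	 * before taking transaction_lock, so nobody waits behind someone
	 * else's IO while the shared resource is held. */
	return 0;
}

The patch moves btrfs_wait_ordered_range() to before the transaction work for the
same reason: the waiting happens with nothing shared held.]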

Re: [PATCH 2/2] btrfs: always wait on ordered extents at fsync time

2018-05-23 Thread Filipe Manana
On Tue, May 22, 2018 at 6:47 PM, Josef Bacik <jo...@toxicpanda.com> wrote:
> From: Josef Bacik <jba...@fb.com>
>
> There's a priority inversion that exists currently with btrfs fsync.  In
> some cases we will collect outstanding ordered extents onto a list and
> only wait on them at the very last second.  However this "very last
> second" falls inside of a transaction handle, so if we are in a lower
> priority cgroup we can end up holding the transaction open for longer
> than needed, so if a high priority cgroup is also trying to fsync()
> it'll see latency.
>
> Fix this by getting rid of all of the logged extents magic and simply
> wait on ordered extents before we start the tree log stuff.  This code has
> changed a lot since I first wrote it and really isn't the performance
> win it was originally because of the things we had to do around getting
> the right checksums.  Killing all of this makes our lives easier and
> gets rid of the priority inversion.

Much easier!

>
> Signed-off-by: Josef Bacik <jba...@fb.com>
Reviewed-by: Filipe Manana <fdman...@suse.com>

Looks good to me.
Happy to see all that complexity go away and knowing it no longer
offers any benefit.

> ---
>  fs/btrfs/file.c  |  56 ++-
>  fs/btrfs/ordered-data.c  | 123 
>  fs/btrfs/ordered-data.h  |  20 +-
>  fs/btrfs/tree-log.c  | 166 ---
>  include/trace/events/btrfs.h |   1 -
>  5 files changed, 19 insertions(+), 347 deletions(-)
>
> diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
> index 5772f0cbedef..2b1c36612384 100644
> --- a/fs/btrfs/file.c
> +++ b/fs/btrfs/file.c
> @@ -2069,53 +2069,12 @@ int btrfs_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
>        atomic_inc(&root->log_batch);
>        full_sync = test_bit(BTRFS_INODE_NEEDS_FULL_SYNC,
>                             &BTRFS_I(inode)->runtime_flags);
> +
> /*
> -* We might have have had more pages made dirty after calling
> -* start_ordered_ops and before acquiring the inode's i_mutex.
> +* We have to do this here to avoid the priority inversion of waiting on
> +* IO of a lower priority task while holding a transaction open.
>  */
> -   if (full_sync) {
> -   /*
> -* For a full sync, we need to make sure any ordered operations
> -* start and finish before we start logging the inode, so that
> -* all extents are persisted and the respective file extent
> -* items are in the fs/subvol btree.
> -*/
> -   ret = btrfs_wait_ordered_range(inode, start, len);
> -   } else {
> -   /*
> -* Start any new ordered operations before starting to log the
> -* inode. We will wait for them to finish in btrfs_sync_log().
> -*
> -* Right before acquiring the inode's mutex, we might have new
> -* writes dirtying pages, which won't immediately start the
> -* respective ordered operations - that is done through the
> -* fill_delalloc callbacks invoked from the writepage and
> -* writepages address space operations. So make sure we start
> -* all ordered operations before starting to log our inode. Not
> -* doing this means that while logging the inode, writeback
> -* could start and invoke writepage/writepages, which would call
> -* the fill_delalloc callbacks (cow_file_range,
> -* submit_compressed_extents). These callbacks add first an
> -* extent map to the modified list of extents and then create
> -* the respective ordered operation, which means in
> -* tree-log.c:btrfs_log_inode() we might capture all existing
> -* ordered operations (with btrfs_get_logged_extents()) before
> -* the fill_delalloc callback adds its ordered operation, and by
> -* the time we visit the modified list of extent maps (with
> -* btrfs_log_changed_extents()), we see and process the extent
> -* map they created. We then use the extent map to construct a
> -* file extent item for logging without waiting for the
> -* respective ordered operation to finish - this file extent
> -* item points to a disk location that might not have yet been
> -* written to, containing random data - so after a crash a log
> -* repl

Re: Strange behavior (possible bugs) in btrfs

2018-05-23 Thread Filipe Manana
On Mon, Apr 30, 2018 at 5:04 PM, Vijay Chidambaram  wrote:
> Hi,
>
> We found two more cases where the btrfs behavior is a little strange.
> In one case, an fsync-ed file goes missing after a crash. In the
> other, a renamed file shows up in both directories after a crash.
>
> Workload 1:
>
> mkdir A
> mkdir B
> mkdir A/C
> creat B/foo
> fsync B/foo
> link B/foo A/C/foo
> fsync A
> -- crash --
>
> Expected state after recovery:
> B B/foo A A/C exist

Why don't you expect A/C/foo as well? I would expect it to be persisted.
With xfs we don't get A/C/foo persisted, but it's persisted with ext4 and f2fs.

Adding xfs folks in cc to confirm the expected behaviour.

>
> What we find:
> Only B B/foo exist
>
> A is lost even after explicit fsync to A.
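[Editor's note: workload 1 translates to the following standalone reproducer with
plain syscalls. The translation, naming, and error handling are mine; crash
injection (for example dm-log-writes or a power cut) is left to the test harness.

/* Run from an empty directory on the filesystem under test, then crash
 * after it completes and check which names survived. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

static void die(const char *msg)
{
	perror(msg);
	exit(1);
}

int main(void)
{
	int fd, dfd;

	if (mkdir("A", 0755) || mkdir("B", 0755) || mkdir("A/C", 0755))
		die("mkdir");

	fd = open("B/foo", O_CREAT | O_WRONLY, 0644);	/* creat B/foo */
	if (fd < 0)
		die("creat B/foo");
	if (fsync(fd))					/* fsync B/foo */
		die("fsync B/foo");

	if (link("B/foo", "A/C/foo"))			/* link B/foo A/C/foo */
		die("link");

	dfd = open("A", O_RDONLY | O_DIRECTORY);	/* fsync A */
	if (dfd < 0)
		die("open A");
	if (fsync(dfd))
		die("fsync A");

	/* -- crash here -- then check B, B/foo, A, A/C and A/C/foo */
	close(fd);
	close(dfd);
	return 0;
}
]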
>
> Workload 2:
>
> mkdir A
> mkdir A/C
> rename A/C B
> touch B/bar
> fsync B/bar
> rename B/bar A/bar
> rename A B (replacing B with A at this point)
> fsync B/bar
> -- crash --
>
> Expected contents after recovery:
> A/bar
>
> What we find after recovery:
> A/bar
> B/bar
>
> We think this breaks rename's atomicity guarantee. bar should be
> present in either A or B, but now it is present in both.
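[Editor's note: and the same kind of sketch for workload 2, again with my own
naming and error handling. I read the final "fsync B/bar" as fsyncing the same
inode, so the reproducer reuses the open fd, which after the directory rename is
reachable as B/bar.

/* Crash after the last fsync and check whether bar shows up in A, in B,
 * or (incorrectly) in both. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

static void die(const char *msg)
{
	perror(msg);
	exit(1);
}

int main(void)
{
	int fd;

	if (mkdir("A", 0755) || mkdir("A/C", 0755))
		die("mkdir");
	if (rename("A/C", "B"))				/* rename A/C B */
		die("rename A/C -> B");

	fd = open("B/bar", O_CREAT | O_WRONLY, 0644);	/* touch B/bar */
	if (fd < 0)
		die("touch B/bar");
	if (fsync(fd))					/* fsync B/bar */
		die("fsync B/bar");

	if (rename("B/bar", "A/bar"))			/* rename B/bar A/bar */
		die("rename B/bar -> A/bar");
	if (rename("A", "B"))				/* rename A B; B is empty, so it is replaced */
		die("rename A -> B");

	if (fsync(fd))					/* fsync B/bar (same inode as before) */
		die("fsync bar");

	/* -- crash here -- */
	close(fd);
	return 0;
}
]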
>
> Thanks,
> Vijay



-- 
Filipe David Manana,

“Whether you think you can, or you think you can't — you're right.”