Re: Any chance to get snapshot-aware defragmentation?

2018-05-18 Thread Tomasz Pala
On Fri, May 18, 2018 at 13:10:02 -0400, Austin S. Hemmelgarn wrote:

> Personally though, I think the biggest issue with what was done was not 
> the memory consumption, but the fact that there was no switch to turn it 
> on or off.  Making defrag unconditionally snapshot aware removes one of 
> the easiest ways to forcibly unshare data without otherwise altering the 

The "defrag only not-snapshotted data" mode would be enough for many
use cases and wouldn't require more RAM. One could run this before
taking a snapshot and merge _at least_ the new data.

And even with the current approach it should be possible to interleave
defragmentation with some kind of naive deduplication; "naive" in the
sense of comparing blocks only within the same in-subvolume paths.

-- 
Tomasz Pala 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Any chance to get snapshot-aware defragmentation?

2018-05-18 Thread Omar Sandoval
On Fri, May 18, 2018 at 04:26:16PM -0600, Chris Murphy wrote:
> On Fri, May 18, 2018 at 12:33 PM, Austin S. Hemmelgarn
>  wrote:
> > On 2018-05-18 13:18, Niccolò Belli wrote:
> >>
> >> On venerdì 18 maggio 2018 19:10:02 CEST, Austin S. Hemmelgarn wrote:
> >>>
> >>> and also forces the people who have ridiculous numbers of snapshots to
> >>> deal with the memory usage or never defrag
> >>
> >>
> >> Whoever has at least one snapshot is never going to defrag anyway, unless
> >> he is willing to double the used space.
> >>
> > With a bit of work, it's possible to handle things sanely.  You can
> > deduplicate data from snapshots, even if they are read-only (you need to
> > pass the `-A` option to duperemove and run it as root), so it's perfectly
> > reasonable to only defrag the main subvolume, and then deduplicate the
> > snapshots against that (so that they end up all being reflinks to the main
> > subvolume).  Of course, this won't work if you're short on space, but if
> > you're dealing with snapshots, you should have enough space that this will
> > work (because even without defrag, it's fully possible for something to
> > cause the snapshots to suddenly take up a lot more space).
> 
> 
> Curiously, snapshot-aware defragmentation is going to increase free
> space fragmentation. For busy in-use systems, it might be necessary to
> use space cache v2 to avoid performance problems.
> 
> I forget the exact reason why the free space tree is not the default,
> I think it has to do with missing repair support?

Yeah, Nikolay is working on that.


Re: Any chance to get snapshot-aware defragmentation?

2018-05-18 Thread Chris Murphy
On Fri, May 18, 2018 at 12:33 PM, Austin S. Hemmelgarn
 wrote:
> On 2018-05-18 13:18, Niccolò Belli wrote:
>>
>> On venerdì 18 maggio 2018 19:10:02 CEST, Austin S. Hemmelgarn wrote:
>>>
>>> and also forces the people who have ridiculous numbers of snapshots to
>>> deal with the memory usage or never defrag
>>
>>
>> Whoever has at least one snapshot is never going to defrag anyway, unless
>> he is willing to double the used space.
>>
> With a bit of work, it's possible to handle things sanely.  You can
> deduplicate data from snapshots, even if they are read-only (you need to
> pass the `-A` option to duperemove and run it as root), so it's perfectly
> reasonable to only defrag the main subvolume, and then deduplicate the
> snapshots against that (so that they end up all being reflinks to the main
> subvolume).  Of course, this won't work if you're short on space, but if
> you're dealing with snapshots, you should have enough space that this will
> work (because even without defrag, it's fully possible for something to
> cause the snapshots to suddenly take up a lot more space).


Curiously, snapshot-aware defragmentation is going to increase free
space fragmentation. For busy in-use systems, it might be necessary to
use space cache v2 to avoid performance problems.

I forget the exact reason why the free space tree is not the default,
I think it has to do with missing repair support?


-- 
Chris Murphy
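The defrag-then-dedupe workflow Austin describes, together with Chris's space cache suggestion, can be sketched as a command sequence. This is illustrative only: it assumes a btrfs filesystem on /dev/sdb1 mounted at /mnt, a main subvolume at /mnt/@ with read-only snapshots under /mnt/snapshots, and duperemove installed; adjust device and paths to your layout.

```shell
# Defragment only the main (writable) subvolume; -r recurses, -v is verbose.
btrfs filesystem defragment -rv /mnt/@

# Re-share the now-defragmented data with the read-only snapshots.
# -d actually submits dedupe requests, -r recurses, and -A is needed to
# dedupe files inside read-only snapshots (must be run as root).
duperemove -dr -A /mnt/@ /mnt/snapshots

# Optionally switch to the free space tree (space cache v2) to cope with
# the extra free-space fragmentation; conversion happens on this mount.
umount /mnt
mount -o clear_cache,space_cache=v2 /dev/sdb1 /mnt
```

Note the ordering: defragmenting first unshares extents in the main subvolume, and the dedupe pass afterwards turns the snapshot copies back into reflinks against it, so peak space usage occurs between the two steps.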


Re: [PATCH v2 2/2] vfs: dedupe should return EPERM if permission is not granted

2018-05-18 Thread Mark Fasheh
On Fri, May 18, 2018 at 03:04:13PM -0700, Darrick J. Wong wrote:
> On Fri, May 18, 2018 at 02:57:27PM -0700, Mark Fasheh wrote:
> > Right now we return EINVAL if a process does not have permission to dedupe a
> > file. This was an oversight on my part. EPERM gives a true description of
> > the nature of our error, and EINVAL is already used for the case that the
> > filesystem does not support dedupe.
> > 
> > Signed-off-by: Mark Fasheh 
> 
> Looks ok what with all the okays after I squawked last time,
> Reviewed-by: Darrick J. Wong 

Awesome, I'll put that in the patch. Thanks Darrick.
--Mark


Re: [PATCH v2 1/2] vfs: allow dedupe of user owned read-only files

2018-05-18 Thread Mark Fasheh
On Fri, May 18, 2018 at 03:03:38PM -0700, Darrick J. Wong wrote:
> > +/* Check whether we are allowed to dedupe the destination file */
> > +static int allow_file_dedupe(struct file *file)
> 
> Shouldn't this return bool?  It's a predicate, after all...

Yeah that should be bool. Thanks for pointing it out, I'll fix it up in the
next round.
--Mark


Re: [PATCH v2 2/2] vfs: dedupe should return EPERM if permission is not granted

2018-05-18 Thread Darrick J. Wong
On Fri, May 18, 2018 at 02:57:27PM -0700, Mark Fasheh wrote:
> Right now we return EINVAL if a process does not have permission to dedupe a
> file. This was an oversight on my part. EPERM gives a true description of
> the nature of our error, and EINVAL is already used for the case that the
> filesystem does not support dedupe.
> 
> Signed-off-by: Mark Fasheh 

Looks ok what with all the okays after I squawked last time,
Reviewed-by: Darrick J. Wong 

--D

> ---
>  fs/read_write.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/fs/read_write.c b/fs/read_write.c
> index cbea4ce58ad1..2238928ca819 100644
> --- a/fs/read_write.c
> +++ b/fs/read_write.c
> @@ -2050,7 +2050,7 @@ int vfs_dedupe_file_range(struct file *file, struct 
> file_dedupe_range *same)
>   if (info->reserved) {
>   info->status = -EINVAL;
>   } else if (!allow_file_dedupe(dst_file)) {
> - info->status = -EINVAL;
> + info->status = -EPERM;
>   } else if (file->f_path.mnt != dst_file->f_path.mnt) {
>   info->status = -EXDEV;
>   } else if (S_ISDIR(dst->i_mode)) {
> -- 
> 2.15.1
> 


Re: [PATCH v2 1/2] vfs: allow dedupe of user owned read-only files

2018-05-18 Thread Darrick J. Wong
On Fri, May 18, 2018 at 02:57:26PM -0700, Mark Fasheh wrote:
> The permission check in vfs_dedupe_file_range() is too coarse - We
> only allow dedupe of the destination file if the user is root, or
> they have the file open for write.
> 
> This effectively prevents a non-root user from deduping their own read-only
> files. In addition, the write file descriptor that the user is forced to
> hold open can prevent execution of files. As file data during a dedupe
> does not change, the behavior is unexpected and this has caused a number of
> issue reports. For an example, see:
> 
> https://github.com/markfasheh/duperemove/issues/129
> 
> So change the check so we allow dedupe on the target if:
> 
> - the root or admin is asking for it
> - the process has write access
> - the owner of the file is asking for the dedupe
> - the process could get write access
> 
> That way users can open read-only and still get dedupe.
> 
> Signed-off-by: Mark Fasheh 
> ---
>  fs/read_write.c | 17 +++--
>  1 file changed, 15 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/read_write.c b/fs/read_write.c
> index c4eabbfc90df..cbea4ce58ad1 100644
> --- a/fs/read_write.c
> +++ b/fs/read_write.c
> @@ -1964,6 +1964,20 @@ int vfs_dedupe_file_range_compare(struct inode *src, 
> loff_t srcoff,
>  }
>  EXPORT_SYMBOL(vfs_dedupe_file_range_compare);
>  
> +/* Check whether we are allowed to dedupe the destination file */
> +static int allow_file_dedupe(struct file *file)

Shouldn't this return bool?  It's a predicate, after all...

--D

> +{
> + if (capable(CAP_SYS_ADMIN))
> + return 1;
> + if (file->f_mode & FMODE_WRITE)
> + return 1;
> + if (uid_eq(current_fsuid(), file_inode(file)->i_uid))
> + return 1;
> + if (!inode_permission(file_inode(file), MAY_WRITE))
> + return 1;
> + return 0;
> +}
> +
>  int vfs_dedupe_file_range(struct file *file, struct file_dedupe_range *same)
>  {
>   struct file_dedupe_range_info *info;
> @@ -1972,7 +1986,6 @@ int vfs_dedupe_file_range(struct file *file, struct 
> file_dedupe_range *same)
>   u64 len;
>   int i;
>   int ret;
> - bool is_admin = capable(CAP_SYS_ADMIN);
>   u16 count = same->dest_count;
>   struct file *dst_file;
>   loff_t dst_off;
> @@ -2036,7 +2049,7 @@ int vfs_dedupe_file_range(struct file *file, struct 
> file_dedupe_range *same)
>  
>   if (info->reserved) {
>   info->status = -EINVAL;
> - } else if (!(is_admin || (dst_file->f_mode & FMODE_WRITE))) {
> + } else if (!allow_file_dedupe(dst_file)) {
>   info->status = -EINVAL;
>   } else if (file->f_path.mnt != dst_file->f_path.mnt) {
>   info->status = -EXDEV;
> -- 
> 2.15.1
> 


[PATCH v2 0/2] vfs: better dedupe permission check

2018-05-18 Thread Mark Fasheh
Hi,

The following patches fix a couple of issues with the permission
check we do in vfs_dedupe_file_range().

The first patch expands our check to allow dedupe of a file if the
user owns it or otherwise would be allowed to write to it.

Current behavior is that we'll allow dedupe only if:

- the user is an admin (root)
- the user has the file open for write

This makes it impossible for a user to dedupe their own file set
unless they do it as root, or ensure that all files have write
permission. There's a couple of duperemove bugs open for this:

https://github.com/markfasheh/duperemove/issues/129
https://github.com/markfasheh/duperemove/issues/86

The other problem we have is also related to forcing the user to open
target files for write - A process trying to exec a file currently
being deduped gets ETXTBUSY. The answer (as above) is to allow them to
open the targets ro - root can already do this. There was a patch from
Adam Borowski to fix this back in 2016:

https://lkml.org/lkml/2016/7/17/130

which I have incorporated into my changes.


The 2nd patch fixes our return code for permission denied to be
EPERM. For some reason we're returning EINVAL - I think that's
probably my fault. At any rate, we need to be returning something
descriptive of the actual problem, otherwise callers see EINVAL and
can't really make a valid determination of what's gone wrong.

This has also popped up in duperemove, mostly in the form of cryptic
error messages. Because this is a code returned to userspace, I did
check the other users of extent-same that I could find. Both 'bees'
and 'rust-btrfs' do the same as duperemove and simply report the error
(as they should).

The patches are also available in git:

git pull https://github.com/markfasheh/linux dedupe-perms

Thanks,
  --Mark


Changes from V1 to V2:
- Add inode_permission check as suggested by Adam Borowski
- V1 discussion: https://marc.info/?l=linux-xfs&m=152606684017965&w=2


[PATCH v2 1/2] vfs: allow dedupe of user owned read-only files

2018-05-18 Thread Mark Fasheh
The permission check in vfs_dedupe_file_range() is too coarse - We
only allow dedupe of the destination file if the user is root, or
they have the file open for write.

This effectively prevents a non-root user from deduping their own read-only
files. In addition, the write file descriptor that the user is forced to
hold open can prevent execution of files. As file data during a dedupe
does not change, the behavior is unexpected and this has caused a number of
issue reports. For an example, see:

https://github.com/markfasheh/duperemove/issues/129

So change the check so we allow dedupe on the target if:

- the root or admin is asking for it
- the process has write access
- the owner of the file is asking for the dedupe
- the process could get write access

That way users can open read-only and still get dedupe.

Signed-off-by: Mark Fasheh 
---
 fs/read_write.c | 17 +++--
 1 file changed, 15 insertions(+), 2 deletions(-)

diff --git a/fs/read_write.c b/fs/read_write.c
index c4eabbfc90df..cbea4ce58ad1 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -1964,6 +1964,20 @@ int vfs_dedupe_file_range_compare(struct inode *src, 
loff_t srcoff,
 }
 EXPORT_SYMBOL(vfs_dedupe_file_range_compare);
 
+/* Check whether we are allowed to dedupe the destination file */
+static int allow_file_dedupe(struct file *file)
+{
+   if (capable(CAP_SYS_ADMIN))
+   return 1;
+   if (file->f_mode & FMODE_WRITE)
+   return 1;
+   if (uid_eq(current_fsuid(), file_inode(file)->i_uid))
+   return 1;
+   if (!inode_permission(file_inode(file), MAY_WRITE))
+   return 1;
+   return 0;
+}
+
 int vfs_dedupe_file_range(struct file *file, struct file_dedupe_range *same)
 {
struct file_dedupe_range_info *info;
@@ -1972,7 +1986,6 @@ int vfs_dedupe_file_range(struct file *file, struct 
file_dedupe_range *same)
u64 len;
int i;
int ret;
-   bool is_admin = capable(CAP_SYS_ADMIN);
u16 count = same->dest_count;
struct file *dst_file;
loff_t dst_off;
@@ -2036,7 +2049,7 @@ int vfs_dedupe_file_range(struct file *file, struct 
file_dedupe_range *same)
 
if (info->reserved) {
info->status = -EINVAL;
-   } else if (!(is_admin || (dst_file->f_mode & FMODE_WRITE))) {
+   } else if (!allow_file_dedupe(dst_file)) {
info->status = -EINVAL;
} else if (file->f_path.mnt != dst_file->f_path.mnt) {
info->status = -EXDEV;
-- 
2.15.1



[PATCH v2 2/2] vfs: dedupe should return EPERM if permission is not granted

2018-05-18 Thread Mark Fasheh
Right now we return EINVAL if a process does not have permission to dedupe a
file. This was an oversight on my part. EPERM gives a true description of
the nature of our error, and EINVAL is already used for the case that the
filesystem does not support dedupe.

Signed-off-by: Mark Fasheh 
---
 fs/read_write.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/read_write.c b/fs/read_write.c
index cbea4ce58ad1..2238928ca819 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -2050,7 +2050,7 @@ int vfs_dedupe_file_range(struct file *file, struct 
file_dedupe_range *same)
if (info->reserved) {
info->status = -EINVAL;
} else if (!allow_file_dedupe(dst_file)) {
-   info->status = -EINVAL;
+   info->status = -EPERM;
} else if (file->f_path.mnt != dst_file->f_path.mnt) {
info->status = -EXDEV;
} else if (S_ISDIR(dst->i_mode)) {
-- 
2.15.1



Re: [PATCH] Btrfs: fix error handling in btrfs_truncate()

2018-05-18 Thread Omar Sandoval
On Fri, May 18, 2018 at 02:43:02PM -0700, Omar Sandoval wrote:
> From: Omar Sandoval 
> 
> Jun Wu at Facebook reported that an internal service was seeing a return
> value of 1 from ftruncate() on Btrfs when compression is enabled. This
> is coming from the NEED_TRUNCATE_BLOCK return value from
> btrfs_truncate_inode_items().
> 
> btrfs_truncate() uses two variables for error handling, ret and err (if
> this sounds familiar, it's because btrfs_truncate_inode_items() does
> something similar). When btrfs_truncate_inode_items() returns non-zero,
> we set err to the return value, but we don't reset it to zero in the
> successful NEED_TRUNCATE_BLOCK case. We only have err because we don't
> want to mask an error if we call btrfs_update_inode() and
> btrfs_end_transaction(), so let's make that its own scoped return
> variable and use ret everywhere else.

To expand on this, this is bad because userspace that checks for a
non-zero return value will think the truncate failed even though it
succeeded, and we also end up not creating an inotify event for the
truncate.

> Fixes: ddfae63cc8e0 ("btrfs: move btrfs_truncate_block out of trans handle")
> Reported-by: Jun Wu 
> Signed-off-by: Omar Sandoval 
> ---
> This is based on Linus' master rather than my orphan ENOSPC fixes
> because I think we want to get this in for v4.17 and stable, and rebase
> my fixes on top of this.
> 
>  fs/btrfs/inode.c | 34 ++
>  1 file changed, 14 insertions(+), 20 deletions(-)
> 
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index d241285a0d2a..d4a47ae36ed8 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -9031,8 +9031,7 @@ static int btrfs_truncate(struct inode *inode, bool 
> skip_writeback)
>   struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
>   struct btrfs_root *root = BTRFS_I(inode)->root;
>   struct btrfs_block_rsv *rsv;
> - int ret = 0;
> - int err = 0;
> + int ret;
>   struct btrfs_trans_handle *trans;
>   u64 mask = fs_info->sectorsize - 1;
>   u64 min_size = btrfs_calc_trunc_metadata_size(fs_info, 1);
> @@ -9092,7 +9091,7 @@ static int btrfs_truncate(struct inode *inode, bool 
> skip_writeback)
>*/
>   trans = btrfs_start_transaction(root, 2);
>   if (IS_ERR(trans)) {
> - err = PTR_ERR(trans);
> + ret = PTR_ERR(trans);
>   goto out;
>   }
>  
> @@ -9116,23 +9115,19 @@ static int btrfs_truncate(struct inode *inode, bool 
> skip_writeback)
>inode->i_size,
>BTRFS_EXTENT_DATA_KEY);
> 	trans->block_rsv = &fs_info->trans_block_rsv;
> - if (ret != -ENOSPC && ret != -EAGAIN) {
> - err = ret;
> + if (ret != -ENOSPC && ret != -EAGAIN)
>   break;
> - }
>  
>   ret = btrfs_update_inode(trans, root, inode);
> - if (ret) {
> - err = ret;
> + if (ret)
>   break;
> - }
>  
>   btrfs_end_transaction(trans);
>   btrfs_btree_balance_dirty(fs_info);
>  
>   trans = btrfs_start_transaction(root, 2);
>   if (IS_ERR(trans)) {
> - ret = err = PTR_ERR(trans);
> + ret = PTR_ERR(trans);
>   trans = NULL;
>   break;
>   }
> @@ -9168,26 +9163,25 @@ static int btrfs_truncate(struct inode *inode, bool 
> skip_writeback)
>   if (ret == 0 && inode->i_nlink > 0) {
>   trans->block_rsv = root->orphan_block_rsv;
>   ret = btrfs_orphan_del(trans, BTRFS_I(inode));
> - if (ret)
> - err = ret;
>   }
>  
>   if (trans) {
> + int ret2;
> +
> 	trans->block_rsv = &fs_info->trans_block_rsv;
> - ret = btrfs_update_inode(trans, root, inode);
> - if (ret && !err)
> - err = ret;
> + ret2 = btrfs_update_inode(trans, root, inode);
> + if (ret2 && !ret)
> + ret = ret2;
>  
> - ret = btrfs_end_transaction(trans);
> + ret2 = btrfs_end_transaction(trans);
> + if (ret2 && !ret)
> + ret = ret2;
>   btrfs_btree_balance_dirty(fs_info);
>   }
>  out:
>   btrfs_free_block_rsv(fs_info, rsv);
>  
> - if (ret && !err)
> - err = ret;
> -
> - return err;
> + return ret;
>  }
>  
>  /*
> -- 
> 2.17.0
> 


[PATCH] Btrfs: fix error handling in btrfs_truncate()

2018-05-18 Thread Omar Sandoval
From: Omar Sandoval 

Jun Wu at Facebook reported that an internal service was seeing a return
value of 1 from ftruncate() on Btrfs when compression is enabled. This
is coming from the NEED_TRUNCATE_BLOCK return value from
btrfs_truncate_inode_items().

btrfs_truncate() uses two variables for error handling, ret and err (if
this sounds familiar, it's because btrfs_truncate_inode_items() does
something similar). When btrfs_truncate_inode_items() returns non-zero,
we set err to the return value, but we don't reset it to zero in the
successful NEED_TRUNCATE_BLOCK case. We only have err because we don't
want to mask an error if we call btrfs_update_inode() and
btrfs_end_transaction(), so let's make that its own scoped return
variable and use ret everywhere else.

Fixes: ddfae63cc8e0 ("btrfs: move btrfs_truncate_block out of trans handle")
Reported-by: Jun Wu 
Signed-off-by: Omar Sandoval 
---
This is based on Linus' master rather than my orphan ENOSPC fixes
because I think we want to get this in for v4.17 and stable, and rebase
my fixes on top of this.

 fs/btrfs/inode.c | 34 ++
 1 file changed, 14 insertions(+), 20 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index d241285a0d2a..d4a47ae36ed8 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -9031,8 +9031,7 @@ static int btrfs_truncate(struct inode *inode, bool 
skip_writeback)
struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
struct btrfs_root *root = BTRFS_I(inode)->root;
struct btrfs_block_rsv *rsv;
-   int ret = 0;
-   int err = 0;
+   int ret;
struct btrfs_trans_handle *trans;
u64 mask = fs_info->sectorsize - 1;
u64 min_size = btrfs_calc_trunc_metadata_size(fs_info, 1);
@@ -9092,7 +9091,7 @@ static int btrfs_truncate(struct inode *inode, bool 
skip_writeback)
 */
trans = btrfs_start_transaction(root, 2);
if (IS_ERR(trans)) {
-   err = PTR_ERR(trans);
+   ret = PTR_ERR(trans);
goto out;
}
 
@@ -9116,23 +9115,19 @@ static int btrfs_truncate(struct inode *inode, bool 
skip_writeback)
 inode->i_size,
 BTRFS_EXTENT_DATA_KEY);
	trans->block_rsv = &fs_info->trans_block_rsv;
-   if (ret != -ENOSPC && ret != -EAGAIN) {
-   err = ret;
+   if (ret != -ENOSPC && ret != -EAGAIN)
break;
-   }
 
ret = btrfs_update_inode(trans, root, inode);
-   if (ret) {
-   err = ret;
+   if (ret)
break;
-   }
 
btrfs_end_transaction(trans);
btrfs_btree_balance_dirty(fs_info);
 
trans = btrfs_start_transaction(root, 2);
if (IS_ERR(trans)) {
-   ret = err = PTR_ERR(trans);
+   ret = PTR_ERR(trans);
trans = NULL;
break;
}
@@ -9168,26 +9163,25 @@ static int btrfs_truncate(struct inode *inode, bool 
skip_writeback)
if (ret == 0 && inode->i_nlink > 0) {
trans->block_rsv = root->orphan_block_rsv;
ret = btrfs_orphan_del(trans, BTRFS_I(inode));
-   if (ret)
-   err = ret;
}
 
if (trans) {
+   int ret2;
+
	trans->block_rsv = &fs_info->trans_block_rsv;
-   ret = btrfs_update_inode(trans, root, inode);
-   if (ret && !err)
-   err = ret;
+   ret2 = btrfs_update_inode(trans, root, inode);
+   if (ret2 && !ret)
+   ret = ret2;
 
-   ret = btrfs_end_transaction(trans);
+   ret2 = btrfs_end_transaction(trans);
+   if (ret2 && !ret)
+   ret = ret2;
btrfs_btree_balance_dirty(fs_info);
}
 out:
btrfs_free_block_rsv(fs_info, rsv);
 
-   if (ret && !err)
-   err = ret;
-
-   return err;
+   return ret;
 }
 
 /*
-- 
2.17.0



Re: [PATCH 10/10] Dynamic fault injection

2018-05-18 Thread Andreas Dilger
On May 18, 2018, at 1:10 PM, Kent Overstreet  wrote:
> 
> On Fri, May 18, 2018 at 01:05:20PM -0600, Andreas Dilger wrote:
>> On May 18, 2018, at 1:49 AM, Kent Overstreet  
>> wrote:
>>> 
>>> Signed-off-by: Kent Overstreet 
>> 
>> I agree with Christoph that even if there was some explanation in the cover
>> letter, there should be something at least as good in the patch itself.  The
>> cover letter is not saved, but the commit stays around forever, and should
>> explain how this should be added to code, and how to use it from userspace.
>> 
>> 
>> That said, I think this is a useful functionality.  We have something similar
>> in Lustre (OBD_FAIL_CHECK() and friends) that is necessary for being able to
>> test a distributed filesystem, which is just a CPP macro with an unlikely()
>> branch, while this looks more sophisticated.  This looks like it has some
>> added functionality like having more than one fault enabled at a time.
>> If this lands we could likely switch our code over to using this.
> 
> This is pretty much what I was looking for, I just wanted to know if this
> patch was interesting enough to anyone that I should spend more time on it
> or just drop it :) Agreed on documentation. I think it's also worth
> factoring out the functionality for the elf section trick that dynamic
> debug uses too.
> 
>> Some things that are missing from this patch that is in our code:
>> 
>> - in addition to the basic "enabled" and "oneshot" mechanisms, we have:
>>  - timeout: sleep for N msec to simulate network/disk/locking delays
>>  - race: wait with one thread until a second thread hits matching check
>> 
>> We also have a "fail_val" that allows making the check conditional (e.g.
>> only operation on server "N" should fail, only RPC opcode "N", etc).
> 
> Those all sound like good ideas... fail_val especially, I think with that
> we'd have all the functionality the existing fault injection framework has
> (which is way too heavyweight to actually get used, imo)

The other thing we have, slightly orthogonal to your modes and possible
because we use a plain __u32 for the fault location, is that "oneshot" is
just a mask added to the fault location; together with "fail_val" we can
add other masks: "fail N times", "fail randomly 1/N times", or "pass N
times before failure".  Another mask is set in the kernel when the fault
is actually hit, so that test scripts can poll until that happens, and
then continue running.

The "fail randomly 1/N times" was useful for detecting memory allocation
failure handling under load, but that has been superseded by the same
functionality in kmalloc(), and it sounds like your fault injection can
do this deterministically for every allocation?

Cheers, Andreas









Re: [PATCH 10/10] Dynamic fault injection

2018-05-18 Thread Kent Overstreet
On Fri, May 18, 2018 at 01:05:20PM -0600, Andreas Dilger wrote:
> On May 18, 2018, at 1:49 AM, Kent Overstreet  
> wrote:
> > 
> > Signed-off-by: Kent Overstreet 
> 
> I agree with Christoph that even if there was some explanation in the cover
> letter, there should be something at least as good in the patch itself.  The
> cover letter is not saved, but the commit stays around forever, and should
> explain how this should be added to code, and how to use it from userspace.
> 
> 
> That said, I think this is a useful functionality.  We have something similar
> in Lustre (OBD_FAIL_CHECK() and friends) that is necessary for being able to
> test a distributed filesystem, which is just a CPP macro with an unlikely()
> branch, while this looks more sophisticated.  This looks like it has some
> added functionality like having more than one fault enabled at a time.
> If this lands we could likely switch our code over to using this.

This is pretty much what I was looking for, I just wanted to know if this patch
was interesting enough to anyone that I should spend more time on it or just
drop it :) Agreed on documentation. I think it's also worth factoring out the
functionality for the elf section trick that dynamic debug uses too.

> Some things that are missing from this patch that is in our code:
> 
> - in addition to the basic "enabled" and "oneshot" mechanisms, we have:
>   - timeout: sleep for N msec to simulate network/disk/locking delays
>   - race: wait with one thread until a second thread hits matching check
> 
> We also have a "fail_val" that allows making the check conditional (e.g.
> only operation on server "N" should fail, only RPC opcode "N", etc).

Those all sound like good ideas... fail_val especially, I think with that we'd
have all the functionality the existing fault injection framework has (which is
way too heavyweight to actually get used, imo)


Re: [PATCH 10/10] Dynamic fault injection

2018-05-18 Thread Andreas Dilger
On May 18, 2018, at 1:49 AM, Kent Overstreet  wrote:
> 
> Signed-off-by: Kent Overstreet 

I agree with Christoph that even if there was some explanation in the cover
letter, there should be something at least as good in the patch itself.  The
cover letter is not saved, but the commit stays around forever, and should
explain how this should be added to code, and how to use it from userspace.


That said, I think this is a useful functionality.  We have something similar
in Lustre (OBD_FAIL_CHECK() and friends) that is necessary for being able to
test a distributed filesystem, which is just a CPP macro with an unlikely()
branch, while this looks more sophisticated.  This looks like it has some
added functionality like having more than one fault enabled at a time.
If this lands we could likely switch our code over to using this.


Some things that are missing from this patch that is in our code:

- in addition to the basic "enabled" and "oneshot" mechanisms, we have:
  - timeout: sleep for N msec to simulate network/disk/locking delays
  - race: wait with one thread until a second thread hits matching check

We also have a "fail_val" that allows making the check conditional (e.g.
only operation on server "N" should fail, only RPC opcode "N", etc).

Cheers, Andreas

> ---
> include/asm-generic/vmlinux.lds.h |   4 +
> include/linux/dynamic_fault.h | 117 +
> lib/Kconfig.debug |   5 +
> lib/Makefile  |   2 +
> lib/dynamic_fault.c   | 760 ++
> 5 files changed, 888 insertions(+)
> create mode 100644 include/linux/dynamic_fault.h
> create mode 100644 lib/dynamic_fault.c
> 
> diff --git a/include/asm-generic/vmlinux.lds.h 
> b/include/asm-generic/vmlinux.lds.h
> index 1ab0e520d6..a4c9dfcbbd 100644
> --- a/include/asm-generic/vmlinux.lds.h
> +++ b/include/asm-generic/vmlinux.lds.h
> @@ -246,6 +246,10 @@
>   VMLINUX_SYMBOL(__start___verbose) = .;  \
>   KEEP(*(__verbose))  \
>   VMLINUX_SYMBOL(__stop___verbose) = .;   \
> + . = ALIGN(8);   \
> + VMLINUX_SYMBOL(__start___faults) = .;   \
> + *(__faults) \
> + VMLINUX_SYMBOL(__stop___faults) = .;\
>   LIKELY_PROFILE()\
>   BRANCH_PROFILE()\
>   TRACE_PRINTKS() \
> diff --git a/include/linux/dynamic_fault.h b/include/linux/dynamic_fault.h
> new file mode 100644
> index 00..6e7bb56ae8
> --- /dev/null
> +++ b/include/linux/dynamic_fault.h
> @@ -0,0 +1,117 @@
> +#ifndef _DYNAMIC_FAULT_H
> +#define _DYNAMIC_FAULT_H
> +
> +#include 
> +#include 
> +#include 
> +
> +enum dfault_enabled {
> + DFAULT_DISABLED,
> + DFAULT_ENABLED,
> + DFAULT_ONESHOT,
> +};
> +
> +union dfault_state {
> + struct {
> + unsignedenabled:2;
> + unsignedcount:30;
> + };
> +
> + struct {
> + unsignedv;
> + };
> +};
> +
> +/*
> + * An instance of this structure is created in a special
> + * ELF section at every dynamic fault callsite.  At runtime,
> + * the special section is treated as an array of these.
> + */
> +struct _dfault {
> + const char  *modname;
> + const char  *function;
> + const char  *filename;
> + const char  *class;
> +
> + const u16   line;
> +
> + unsignedfrequency;
> + union dfault_state  state;
> +
> + struct static_key   enabled;
> +} __aligned(8);
> +
> +
> +#ifdef CONFIG_DYNAMIC_FAULT
> +
> +int dfault_add_module(struct _dfault *tab, unsigned int n, const char *mod);
> +int dfault_remove_module(char *mod_name);
> +bool __dynamic_fault_enabled(struct _dfault *);
> +
> +#define dynamic_fault(_class)
> \
> +({   \
> + static struct _dfault descriptor\
> + __used __aligned(8) __attribute__((section("__faults"))) = {\
> + .modname= KBUILD_MODNAME,   \
> + .function   = __func__, \
> + .filename   = __FILE__, \
> + .line   = __LINE__, \
> + .class  = _class,   \
> + };  \
> + \
> + 

Re: Any chance to get snapshot-aware defragmentation?

2018-05-18 Thread Austin S. Hemmelgarn

On 2018-05-18 13:18, Niccolò Belli wrote:

On venerdì 18 maggio 2018 19:10:02 CEST, Austin S. Hemmelgarn wrote:
and also forces the people who have ridiculous numbers of snapshots to 
deal with the memory usage or never defrag


Whoever has at least one snapshot is never going to defrag anyway, 
unless he is willing to double the used space.


With a bit of work, it's possible to handle things sanely.  You can 
deduplicate data from snapshots, even if they are read-only (you need to 
pass the `-A` option to duperemove and run it as root), so it's 
perfectly reasonable to only defrag the main subvolume, and then 
deduplicate the snapshots against that (so that they end up all being 
reflinks to the main subvolume).  Of course, this won't work if you're 
short on space, but if you're dealing with snapshots, you should have 
enough space that this will work (because even without defrag, it's 
fully possible for something to cause the snapshots to suddenly take up 
a lot more space).



Re: [PATCH 00/10] RFC: assorted bcachefs patches

2018-05-18 Thread Kent Overstreet
On Fri, May 18, 2018 at 02:03:25PM -0400, Josef Bacik wrote:
> There's nothing stopping us from doing that, it just uses a kprobe to override
> the function with our helper, so we could conceivably put it anywhere in the
> function.  The reason I limited it to individual functions was because it was
> easier than trying to figure out the side-effects of stopping mid-function.  
> If
> I needed to fail mid-function I just added a helper where I needed it and 
> failed
> that instead.  I imagine safety is going to be of larger concern if we allow 
> bpf
> scripts to randomly return anywhere inside a function, even if the function is
> marked as allowing error injection.  Thanks,

Ahh no, that's not what I want... here's an example:

https://evilpiepirate.org/git/bcachefs.git/tree/fs/bcachefs/btree_cache.c#n674

Here we've got to do this thing which can race - which is fine, we just need to
check for and handle the race, on line 709 - but actually exercising that with a
test is difficult since it requires a heavily multithreaded workload with btree
nodes getting evicted to see it happen, so - it pretends the race happened if
race_fault() returns true. The race_fault() invocation shows up in debugfs,
where userspace can tell it to fire.

the way it works is dynamic_fault() is a macro that expands to a static struct
dfault_descriptor, stuck in a particular linker section so the dynamic fault
code can find them and stick them in debugfs (which is also the way dynamic
debug works).

#define dynamic_fault(_class)   \
({  \
static struct _dfault descriptor\
__used __aligned(8) __attribute__((section("__faults"))) = {\
.modname= KBUILD_MODNAME,   \
.function   = __func__, \
.filename   = __FILE__, \
.line   = __LINE__, \
.class  = _class,   \
};  \
\
static_key_false(&descriptor.enabled) &&\
__dynamic_fault_enabled(&descriptor);   \
})

Honestly it still seems like the cleanest and safest way of doing it to me...
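The descriptor-in-a-linker-section mechanism is easy to demonstrate in userspace. A hedged sketch (names are made up, not the kernel code; it assumes GCC/clang on ELF, where the toolchain emits `__start_<section>`/`__stop_<section>` symbols for any section whose name is a valid C identifier, and it uses GNU statement expressions):

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

struct dfault {
	const char *function;
	const char *filename;
	unsigned int line;
	bool enabled;
};

/* Each callsite drops a static descriptor into the "faults" section;
 * the statement expression makes the whole thing a boolean. */
#define dynamic_fault()							\
({									\
	static struct dfault descriptor					\
	__attribute__((section("faults"), used, aligned(8))) = {	\
		.function = __func__,					\
		.filename = __FILE__,					\
		.line     = __LINE__,					\
	};								\
	descriptor.enabled;						\
})

/* Compiler/linker-provided bounds of the "faults" section. */
extern struct dfault __start_faults[], __stop_faults[];

/* Stand-in for the debugfs knob: flip callsites by function name. */
static int dfault_set(const char *function, bool enable)
{
	int hits = 0;

	for (struct dfault *df = __start_faults; df < __stop_faults; df++) {
		if (!strcmp(df->function, function)) {
			df->enabled = enable;
			hits++;
		}
	}
	return hits;
}

/* A fault injection site: pretend a btree node read failed. */
static int read_btree_node(void)
{
	return dynamic_fault() ? -5 /* -EIO */ : 0;
}
```

Because the descriptors are static data, the enumeration works before any fault site has ever been executed, which is what lets the debugfs listing show every callsite up front.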


Re: [PATCH 00/10] RFC: assorted bcachefs patches

2018-05-18 Thread Josef Bacik
On Fri, May 18, 2018 at 01:49:12PM -0400, Kent Overstreet wrote:
> On Fri, May 18, 2018 at 01:45:36PM -0400, Josef Bacik wrote:
> > On Fri, May 18, 2018 at 03:48:58AM -0400, Kent Overstreet wrote:
> > > These are all the remaining patches in my bcachefs tree that touch stuff 
> > > outside
> > > fs/bcachefs. Not all of them are suitable for inclusion as is, I wanted 
> > > to get
> > > some discussion first.
> > > 
> > >  * pagecache add lock
> > > 
> > > This is the only one that touches existing code in nontrivial ways.  The 
> > > problem
> > > it's solving is that there is no existing general mechanism for shooting 
> > > down
> > > pages in the page cache and keeping them removed, which is a real problem if 
> > > you're
> > > doing anything that modifies file data and isn't buffered writes.
> > > 
> > > Historically, the only problematic case has been direct IO, and people 
> > > have been
> > > willing to say "well, if you mix buffered and direct IO you get what you
> > > deserve", and that's probably not unreasonable. But now we have fallocate 
> > > insert
> > > range and collapse range, and those are broken in ways I frankly don't 
> > > want to
> > > think about if they can't ensure consistency with the page cache.
> > > 
> > > Also, the mechanism truncate uses (i_size and sacrificing a goat) has
> > > historically been rather fragile, IMO it might be a good thing if we 
> > > switched it
> > > to a more general rigorous mechanism.
> > > 
> > > I need this solved for bcachefs because without this mechanism, the page 
> > > cache
> > > inconsistencies lead to various assertions popping (primarily when we 
> > > didn't
> > > think we need to get a disk reservation going by page cache state, but 
> > > then do
> > > the actual write and disk space accounting says oops, we did need one). 
> > > And
> > > having to reason about what can happen without a locking mechanism for 
> > > this is
> > > not something I care to spend brain cycles on.
> > > 
> > > That said, my patch is kind of ugly, and it requires filesystem changes 
> > > for
> > > other filesystems to take advantage of it. And unfortunately, since one 
> > > of the
> > > code paths that needs locking is readahead, I don't see any realistic way 
> > > of
> > > implementing the locking within just bcachefs code.
> > > 
> > > So I'm hoping someone has an idea for something cleaner (I think I recall
> > > Matthew Wilcox saying he had an idea for how to use xarray to solve 
> > > this), but
> > > if not I'll polish up my pagecache add lock patch and see what I can do 
> > > to make
> > > it less ugly, and hopefully other people find it palatable or at least 
> > > useful.
> > > 
> > >  * lglocks
> > > 
> > > They were removed by Peter Zijlstra when the last in-kernel user was 
> > > removed,
> > > but I've found them useful. His commit message seems to imply he doesn't 
> > > think
> > > people should be using them, but I'm not sure why. They are a bit niche 
> > > though,
> > > I can move them to fs/bcachefs if people would prefer. 
> > > 
> > >  * Generic radix trees
> > > 
> > > This is a very simple radix tree implementation that can store types of
> > > arbitrary size, not just pointers/unsigned long. It could probably replace
> > > flex arrays.
> > > 
> > >  * Dynamic fault injection
> > > 
> > 
> > I've not looked at this at all so this may not cover your usecase, but I
> > implemented a bpf_override_return() to do focused error injection a year 
> > ago.  I
> > have this script
> > 
> > https://github.com/josefbacik/debug-scripts/blob/master/inject-error.py
> > 
> > that does it generically, all you have to do is tag the function you want 
> > to be
> > error injectable with ALLOW_ERROR_INJECTION() and then you get all these 
> > nice
> > things like a debugfs interface to trigger them or use the above script to
> > trigger specific errors and such.  Thanks,
> 
> That sounds pretty cool...
> 
> What about being able to add a random fault injection point in the middle of 
> an
> existing function? Being able to stick race_fault() in random places was a
> pretty big win in terms of getting good code coverage out of realistic tests.

There's nothing stopping us from doing that, it just uses a kprobe to override
the function with our helper, so we could conceivably put it anywhere in the
function.  The reason I limited it to individual functions was because it was
easier than trying to figure out the side-effects of stopping mid-function.  If
I needed to fail mid-function I just added a helper where I needed it and failed
that instead.  I imagine safety is going to be of larger concern if we allow bpf
scripts to randomly return anywhere inside a function, even if the function is
marked as allowing error injection.  Thanks,

Josef


Re: [PATCH 00/10] RFC: assorted bcachefs patches

2018-05-18 Thread Kent Overstreet
On Fri, May 18, 2018 at 01:45:36PM -0400, Josef Bacik wrote:
> On Fri, May 18, 2018 at 03:48:58AM -0400, Kent Overstreet wrote:
> > These are all the remaining patches in my bcachefs tree that touch stuff 
> > outside
> > fs/bcachefs. Not all of them are suitable for inclusion as is, I wanted to 
> > get
> > some discussion first.
> > 
> >  * pagecache add lock
> > 
> > This is the only one that touches existing code in nontrivial ways.  The 
> > problem
> > it's solving is that there is no existing general mechanism for shooting 
> > down
> > pages in the page cache and keeping them removed, which is a real problem if 
> > you're
> > doing anything that modifies file data and isn't buffered writes.
> > 
> > Historically, the only problematic case has been direct IO, and people have 
> > been
> > willing to say "well, if you mix buffered and direct IO you get what you
> > deserve", and that's probably not unreasonable. But now we have fallocate 
> > insert
> > range and collapse range, and those are broken in ways I frankly don't want 
> > to
> > think about if they can't ensure consistency with the page cache.
> > 
> > Also, the mechanism truncate uses (i_size and sacrificing a goat) has
> > historically been rather fragile, IMO it might be a good thing if we 
> > switched it
> > to a more general rigorous mechanism.
> > 
> > I need this solved for bcachefs because without this mechanism, the page 
> > cache
> > inconsistencies lead to various assertions popping (primarily when we didn't
> > think we need to get a disk reservation going by page cache state, but then 
> > do
> > the actual write and disk space accounting says oops, we did need one). And
> > having to reason about what can happen without a locking mechanism for this 
> > is
> > not something I care to spend brain cycles on.
> > 
> > That said, my patch is kind of ugly, and it requires filesystem changes for
> > other filesystems to take advantage of it. And unfortunately, since one of 
> > the
> > code paths that needs locking is readahead, I don't see any realistic way of
> > implementing the locking within just bcachefs code.
> > 
> > So I'm hoping someone has an idea for something cleaner (I think I recall
> > Matthew Wilcox saying he had an idea for how to use xarray to solve this), 
> > but
> > if not I'll polish up my pagecache add lock patch and see what I can do to 
> > make
> > it less ugly, and hopefully other people find it palatable or at least 
> > useful.
> > 
> >  * lglocks
> > 
> > They were removed by Peter Zijlstra when the last in-kernel user was 
> > removed,
> > but I've found them useful. His commit message seems to imply he doesn't 
> > think
> > people should be using them, but I'm not sure why. They are a bit niche 
> > though,
> > I can move them to fs/bcachefs if people would prefer. 
> > 
> >  * Generic radix trees
> > 
> > This is a very simple radix tree implementation that can store types of
> > arbitrary size, not just pointers/unsigned long. It could probably replace
> > flex arrays.
> > 
> >  * Dynamic fault injection
> > 
> 
> I've not looked at this at all so this may not cover your usecase, but I
> implemented a bpf_override_return() to do focused error injection a year ago.  
> I
> have this script
> 
> https://github.com/josefbacik/debug-scripts/blob/master/inject-error.py
> 
> that does it generically, all you have to do is tag the function you want to 
> be
> error injectable with ALLOW_ERROR_INJECTION() and then you get all these nice
> things like a debugfs interface to trigger them or use the above script to
> trigger specific errors and such.  Thanks,

That sounds pretty cool...

What about being able to add a random fault injection point in the middle of an
existing function? Being able to stick race_fault() in random places was a
pretty big win in terms of getting good code coverage out of realistic tests.


Re: [PATCH 00/10] RFC: assorted bcachefs patches

2018-05-18 Thread Josef Bacik
On Fri, May 18, 2018 at 03:48:58AM -0400, Kent Overstreet wrote:
> These are all the remaining patches in my bcachefs tree that touch stuff 
> outside
> fs/bcachefs. Not all of them are suitable for inclusion as is, I wanted to get
> some discussion first.
> 
>  * pagecache add lock
> 
> This is the only one that touches existing code in nontrivial ways.  The 
> problem
> it's solving is that there is no existing general mechanism for shooting down
> pages in the page cache and keeping them removed, which is a real problem if you're
> doing anything that modifies file data and isn't buffered writes.
> 
> Historically, the only problematic case has been direct IO, and people have 
> been
> willing to say "well, if you mix buffered and direct IO you get what you
> deserve", and that's probably not unreasonable. But now we have fallocate 
> insert
> range and collapse range, and those are broken in ways I frankly don't want to
> think about if they can't ensure consistency with the page cache.
> 
> Also, the mechanism truncate uses (i_size and sacrificing a goat) has
> historically been rather fragile, IMO it might be a good thing if we switched 
> it
> to a more general rigorous mechanism.
> 
> I need this solved for bcachefs because without this mechanism, the page cache
> inconsistencies lead to various assertions popping (primarily when we didn't
> think we need to get a disk reservation going by page cache state, but then do
> the actual write and disk space accounting says oops, we did need one). And
> having to reason about what can happen without a locking mechanism for this is
> not something I care to spend brain cycles on.
> 
> That said, my patch is kind of ugly, and it requires filesystem changes for
> other filesystems to take advantage of it. And unfortunately, since one of the
> code paths that needs locking is readahead, I don't see any realistic way of
> implementing the locking within just bcachefs code.
> 
> So I'm hoping someone has an idea for something cleaner (I think I recall
> Matthew Wilcox saying he had an idea for how to use xarray to solve this), but
> if not I'll polish up my pagecache add lock patch and see what I can do to 
> make
> it less ugly, and hopefully other people find it palatable or at least useful.
> 
>  * lglocks
> 
> They were removed by Peter Zijlstra when the last in-kernel user was removed,
> but I've found them useful. His commit message seems to imply he doesn't think
> people should be using them, but I'm not sure why. They are a bit niche 
> though,
> I can move them to fs/bcachefs if people would prefer. 
> 
>  * Generic radix trees
> 
> This is a very simple radix tree implementation that can store types of
> arbitrary size, not just pointers/unsigned long. It could probably replace
> flex arrays.
> 
>  * Dynamic fault injection
> 

I've not looked at this at all so this may not cover your usecase, but I
implemented a bpf_override_return() to do focused error injection a year ago.  I
have this script

https://github.com/josefbacik/debug-scripts/blob/master/inject-error.py

that does it generically, all you have to do is tag the function you want to be
error injectable with ALLOW_ERROR_INJECTION() and then you get all these nice
things like a debugfs interface to trigger them or use the above script to
trigger specific errors and such.  Thanks,

Josef


Re: [PATCH 01/10] mm: pagecache add lock

2018-05-18 Thread Kent Overstreet
On Fri, May 18, 2018 at 08:53:30AM -0700, Christoph Hellwig wrote:
> On Fri, May 18, 2018 at 06:13:06AM -0700, Matthew Wilcox wrote:
> > > Historically, the only problematic case has been direct IO, and people
> > > have been willing to say "well, if you mix buffered and direct IO you
> > > get what you deserve", and that's probably not unreasonable. But now we
> > > have fallocate insert range and collapse range, and those are broken in
> > > ways I frankly don't want to think about if they can't ensure consistency
> > > with the page cache.
> > 
> > ext4 manages collapse-vs-pagefault with the ext4-specific i_mmap_sem.
> > You may get pushback on the grounds that this ought to be a
> > filesystem-specific lock rather than one embedded in the generic inode.
> 
> Honestly I think this probably should be in the core.  But IFF we move
> it to the core the existing users of per-fs locks need to be moved
> over first.  E.g. XFS as the very first one, and at least ext4 and f2fs
> that copied the approach, and probably more if you audit deep enough.

I didn't know about i_mmap_sem, thanks

But, using a rw semaphore wouldn't work for dio writes, and I do need dio writes
to block pagecache adds in bcachefs since the dio write could overwrite
uncompressed data or a reservation with compressed data, which means new writes
need a disk reservation.

Also I'm guessing ext4 takes the lock at the top of the read path? That sucks
for reads that are already cached, the generic buffered read path can do cached
reads without taking any per inode locks - that's why I pushed the lock down to
only be taken when we have to add a page to the pagecache.

Definitely less ugly doing it that way though...


Re: [PATCH 06/10] Generic radix trees

2018-05-18 Thread Kent Overstreet
On Fri, May 18, 2018 at 09:02:03AM -0700, Christoph Hellwig wrote:
> Completely lacks any explanation, including why this should be
> in lib/.  Also should come in the same series with actual users
> of the infrastructure.

Also in the cover letter...

 * Generic radix trees

 This is a very simple radix tree implementation that can store types of
 arbitrary size, not just pointers/unsigned long. It could probably replace
 flex arrays.
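For those who haven't seen the code, the core idea can be sketched in userspace. This is a hedged illustration with made-up names and only a single level of indirection (the kernel version adds levels as the index grows): fixed-size nodes are allocated on demand, and each node packs objects of arbitrary size, not just pointers.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define NODE_BYTES 4096

struct genradix {
	size_t obj_size;
	size_t nr_nodes;
	void **nodes;	/* each node holds NODE_BYTES/obj_size objects */
};

/* Return a pointer to the object at idx, allocating nodes as needed. */
static void *genradix_ptr_alloc(struct genradix *r, size_t idx)
{
	size_t per_node = NODE_BYTES / r->obj_size;
	size_t node = idx / per_node;

	if (node >= r->nr_nodes) {	/* grow the node-pointer table */
		void **nodes = realloc(r->nodes, (node + 1) * sizeof(*nodes));
		if (!nodes)
			return NULL;
		memset(nodes + r->nr_nodes, 0,
		       (node + 1 - r->nr_nodes) * sizeof(*nodes));
		r->nodes = nodes;
		r->nr_nodes = node + 1;
	}
	if (!r->nodes[node]) {		/* allocate the node lazily */
		r->nodes[node] = calloc(1, NODE_BYTES);
		if (!r->nodes[node])
			return NULL;
	}
	return (char *)r->nodes[node] + (idx % per_node) * r->obj_size;
}
```

Unlike flex arrays, nothing here cares about the element type fitting in a pointer; the tree only needs the object size.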


Re: [PATCH v4 00/12] Btrfs: orphan and truncate fixes

2018-05-18 Thread Omar Sandoval
On Fri, May 18, 2018 at 05:14:35PM +0200, David Sterba wrote:
> On Fri, May 11, 2018 at 01:13:28PM -0700, Omar Sandoval wrote:
> > This is the fourth (and hopefully final) version of the orphan item
> > early ENOSPC and related fixes.
> > 
> > Changes since v3:
> > 
> > - Changed another stale comment in patch 1
> > - Moved BTRFS_INODE_ORPHAN_META_RESERVED flag removal to patch 10
> >   instead of patch 9
> > - Moved inode runtime flag renumbering to a separate patch (patch 11)
> > - Added some more reviewed-bys
> 
> Patchset added to misc-next, thanks. Besides fstests, I've hammered it
> with some stress testing, all fine.

Thanks, Dave!

> > Changes since v2:
> > 
> > - Add patch 5 to get rid of BTRFS_INODE_HAS_ORPHAN_ITEM
> > - Move patch 10 to patch 6
> > - Got rid of patch 5; the bug goes away in the process of removing code
> >   for patches 9 and 10
> > - Rename patch 10 batch to what it was called in v1
> > 
> > Changes since v1:
> > 
> > - Added two extra cleanups, patches 10 and 11
> > - Added a forgotten clear of the orphan bit in patch 8
> > - Reworded titles of patches 6 and 9
> > - Added people's reviewed-bys
> > 
> > Cover letter from v1:
> > 
> > At Facebook we hit an early ENOSPC issue which we tracked down to the
> > reservations for orphan items of deleted-but-still-open files. The
> > primary function of this series is to fix that bug, but I ended up
> > uncovering a pile of other issues in the process, most notably that the
> > orphan items we create for truncate are useless.
> > 
> > I've also posted an xfstest that reproduces this bug.
> > 
> > Thanks!
> > 
> > *** BLURB HERE ***
> 
> Blurb!

Oops ;)


Re: [PATCH 10/10] Dynamic fault injection

2018-05-18 Thread Kent Overstreet
On Fri, May 18, 2018 at 09:02:45AM -0700, Christoph Hellwig wrote:
> On Fri, May 18, 2018 at 03:49:18AM -0400, Kent Overstreet wrote:
> > Signed-off-by: Kent Overstreet 
> 
> Completely lacks any explanation or argument why it would be useful.

It's in the cover letter...


 * Dynamic fault injection

 I've actually had this code sitting in my tree since forever... I know we have
 an existing fault injection framework, but I think this one is quite a bit
 nicer to actually use.

 It works very much like the dynamic debug infrastructure - for those who aren't
 familiar, dynamic debug makes it so you can list and individually
 enable/disable every pr_debug() callsite in debugfs.

 So to add a fault injection site with this, you just stick a call to
 dynamic_fault("foobar") somewhere in your code - dynamic_fault() returns true
 if you should fail whatever it is you're testing. And then it'll show up in
 debugfs, where you can enable/disable faults by file/linenumber, module, name,
 etc.

 The patch then also adds macros that wrap all the various memory allocation
 functions and fail if dynamic_fault("memory") returns true - which means you
 can see in debugfs every place you're allocating memory and fail all of them
 or just individually (I have tests that iterate over all the faults and flip
 them on one by one). I also use it in bcachefs to add fault injection points
 for uncommon error paths in the filesystem startup/recovery path, and for
 various hard to test slowpaths that only happen if we race in weird ways
 (race_fault()).


Re: Any chance to get snapshot-aware defragmentation?

2018-05-18 Thread Niccolò Belli

On venerdì 18 maggio 2018 19:10:02 CEST, Austin S. Hemmelgarn wrote:
and also forces the people who have ridiculous numbers of 
snapshots to deal with the memory usage or never defrag


Whoever has at least one snapshot is never going to defrag anyway, unless 
he is willing to double the used space.


Niccolò


Re: Any chance to get snapshot-aware defragmentation?

2018-05-18 Thread Austin S. Hemmelgarn

On 2018-05-18 12:36, Niccolò Belli wrote:

On venerdì 18 maggio 2018 18:20:51 CEST, David Sterba wrote:

Josef started working on that in 2014 and did not finish it. The patches
can be still found in his tree. The problem is in excessive memory
consumption when there are many snapshots that need to be tracked during
the defragmentation, so there are measures to avoid OOM. There's
infrastructure ready for use (shrinkers), there are maybe some problems
but fundamentally it should work.

I'd like to get the snapshot-aware working again too, we'd need to find
a volunteer to resume the work on the patchset.


Yeah I know of Josef's work, but 4 years have passed since then without 
any news on this front.


What I would really like to know is why nobody resumed his work: is it 
because it's impossible to implement snapshot-aware defrag without 
excessive RAM usage or is it simply because nobody is interested?

I think it's because nobody who is interested has both the time and the 
coding skills to tackle it.


Personally though, I think the biggest issue with what was done was not 
the memory consumption, but the fact that there was no switch to turn it 
on or off.  Making defrag unconditionally snapshot aware removes one of 
the easiest ways to forcibly unshare data without otherwise altering the 
files (which, as stupid as it sounds, is actually really useful for some 
storage setups), and also forces the people who have ridiculous numbers 
of snapshots to deal with the memory usage or never defrag.




Re: [PATCH] btrfs: fix describe_relocation string pointer

2018-05-18 Thread David Sterba
On Thu, May 17, 2018 at 09:25:12PM +0800, Anand Jain wrote:
> Looks like the original idea was to print in hex any remaining flags
> that have no coded flag name. So use the current buf pointer bp
> instead of buf.
> 
> Signed-off-by: Anand Jain 
> ---
>  fs/btrfs/relocation.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
> index 74656d79e511..879b76fa881a 100644
> --- a/fs/btrfs/relocation.c
> +++ b/fs/btrfs/relocation.c
> @@ -4344,7 +4344,7 @@ static void describe_relocation(struct btrfs_fs_info 
> *fs_info,
>   DESCRIBE_FLAG(RAID5,"raid5");
>   DESCRIBE_FLAG(RAID6,"raid6");
>   if (flags)
> - snprintf(buf, buf - bp + sizeof(buf), "|0x%llx", flags);
> + snprintf(bp, buf - bp + sizeof(buf), "|0x%llx", flags);

Good catch.

Reviewed-by: David Sterba 
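The bug is the classic append-into-a-buffer pattern; a minimal userspace sketch of the corrected arithmetic (flag names are illustrative, not the btrfs code): bp advances through buf as names are appended, so each write must go to bp with the *remaining* space, which is size - (bp - buf), i.e. buf - bp + size as in the patch.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

static void describe_flags(char *buf, size_t size, unsigned long long flags)
{
	char *bp = buf;

	bp += snprintf(bp, size, "balance");
	if (flags & 1ULL) {
		bp += snprintf(bp, buf - bp + size, "|data");
		flags &= ~1ULL;
	}
	/* Leftover bits with no name get printed in hex. Writing at
	 * buf here (the bug) would overwrite "balance|data". */
	if (flags)
		snprintf(bp, buf - bp + size, "|0x%llx", flags);
}
```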


Re: Any chance to get snapshot-aware defragmentation?

2018-05-18 Thread Niccolò Belli

On venerdì 18 maggio 2018 18:20:51 CEST, David Sterba wrote:

Josef started working on that in 2014 and did not finish it. The patches
can be still found in his tree. The problem is in excessive memory
consumption when there are many snapshots that need to be tracked during
the defragmentation, so there are measures to avoid OOM. There's
infrastructure ready for use (shrinkers), there are maybe some problems
but fundamentally it should work.

I'd like to get the snapshot-aware working again too, we'd need to find
a volunteer to resume the work on the patchset.


Yeah I know of Josef's work, but 4 years have passed since then without any 
news on this front.


What I would really like to know is why nobody resumed his work: is it 
because it's impossible to implement snapshot-aware defrag without 
excessive RAM usage or is it simply because nobody is interested?


Niccolò


Re: Any chance to get snapshot-aware defragmentation?

2018-05-18 Thread David Sterba
On Fri, May 11, 2018 at 05:22:26PM +0200, Niccolò Belli wrote:
> I'm waiting for this feature since years and initially it seemed like 
> something which would have been worked on, sooner or later.
> A long time had passed without any progress on this, so I would like to 
> know if there is any technical limitation preventing this or if it's 
> something which could possibly land in the near future.

Josef started working on that in 2014 and did not finish it. The patches
can be still found in his tree. The problem is in excessive memory
consumption when there are many snapshots that need to be tracked during
the defragmentation, so there are measures to avoid OOM. There's
infrastructure ready for use (shrinkers), there are maybe some problems
but fundamentally it should work.

I'd like to get the snapshot-aware working again too, we'd need to find
a volunteer to resume the work on the patchset.


Re: [PATCH 08/10] bcache: move closures to lib/

2018-05-18 Thread Christoph Hellwig
On Fri, May 18, 2018 at 03:49:13AM -0400, Kent Overstreet wrote:
> Prep work for bcachefs - being a fork of bcache it also uses closures

Hell no.  This code needs to go away and not actually be promoted to
lib/.


Re: [PATCH 10/10] Dynamic fault injection

2018-05-18 Thread Christoph Hellwig
On Fri, May 18, 2018 at 03:49:18AM -0400, Kent Overstreet wrote:
> Signed-off-by: Kent Overstreet 

Completely lacks any explanation or argument why it would be useful.


Re: [PATCH v2 2/6] Btrfs: use more straightforward extent_buffer_uptodate

2018-05-18 Thread David Sterba
On Fri, May 18, 2018 at 01:17:16PM +0800, Qu Wenruo wrote:
> 
> 
> On 2018年05月18日 11:00, Liu Bo wrote:
> > If parent_transid "0" is passed to btrfs_buffer_uptodate(),
> > btrfs_buffer_uptodate() is equivalent to extent_buffer_uptodate(), but
> > extent_buffer_uptodate() is preferred since we don't have to look into
> > verify_parent_transid().
> > 
> > Signed-off-by: Liu Bo 
> 
> Reviewed-by: Qu Wenruo 
> 
> And considering how little extra work we do in btrfs_buffer_uptodate(),
> what about making it an inline function?

The actual picture as the compiler sees the function can be quite
different, e.g. if there are static functions called from that one, they
could get inlined and suddenly btrfs_buffer_uptodate becomes
btrfs_buffer_uptodate + verify_parent_transid.

IOW, leave the inlining on the compiler and use 'static inline' only for
really lightweight functions.


Re: [PATCH 05/10] don't use spin_lock_irqsave() unnecessarily

2018-05-18 Thread Christoph Hellwig
On Fri, May 18, 2018 at 03:49:08AM -0400, Kent Overstreet wrote:
> Signed-off-by: Kent Overstreet 

Looks generally fine.  A little changelog with an explanation of
how we obviously never could get here with irqs disabled would
be nice, though.


Re: [PATCH 06/10] Generic radix trees

2018-05-18 Thread Christoph Hellwig
Completely lacks any explanation, including why this should be
in lib/.  Also should come in the same series with actual users
of the infrastructure.


Re: [PATCH 02/10] mm: export find_get_pages()

2018-05-18 Thread Christoph Hellwig
On Fri, May 18, 2018 at 03:49:02AM -0400, Kent Overstreet wrote:
> Signed-off-by: Kent Overstreet 
> ---
>  mm/filemap.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 31dd888785..78b99019bf 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -1845,6 +1845,7 @@ unsigned find_get_pages_range(struct address_space 
> *mapping, pgoff_t *start,
>  
>   return ret;
>  }
> +EXPORT_SYMBOL(find_get_pages_range);

EXPORT_SYMBOL_GPL and only together with the actual submission of
a user of the interface.


Re: [PATCH 01/10] mm: pagecache add lock

2018-05-18 Thread Christoph Hellwig
On Fri, May 18, 2018 at 06:13:06AM -0700, Matthew Wilcox wrote:
> > Historically, the only problematic case has been direct IO, and people
> > have been willing to say "well, if you mix buffered and direct IO you
> > get what you deserve", and that's probably not unreasonable. But now we
> > have fallocate insert range and collapse range, and those are broken in
> > ways I frankly don't want to think about if they can't ensure consistency
> > with the page cache.
> 
> ext4 manages collapse-vs-pagefault with the ext4-specific i_mmap_sem.
> You may get pushback on the grounds that this ought to be a
> filesystem-specific lock rather than one embedded in the generic inode.

Honestly I think this probably should be in the core.  But IFF we move
it to the core the existing users of per-fs locks need to be moved
over first.  E.g. XFS as the very first one, and at least ext4 and f2fs
that copied the approach, and probably more if you audit deep enough.


Re: [PATCH V3 0/3] Btrfs: btrfs_dedupe_file_range() ioctl, remove 16MiB restriction

2018-05-18 Thread David Sterba
On Fri, May 11, 2018 at 05:22:58PM +0200, David Sterba wrote:
> On Wed, May 02, 2018 at 08:15:35AM +0300, Timofey Titovets wrote:
> > Currently btrfs_dedupe_file_range() is restricted to a 16MiB range to
> > limit locking time and the memory requirement of the dedupe ioctl().
> > 
> > For a too-big input range, the code silently clamps the range to 16MiB.
> > 
> > Let's remove that restriction by iterating over the dedupe range.
> > That's backward compatible and will not change anything for requests
> > smaller than 16MiB.
> > 
> > Changes:
> >   v1 -> v2:
> > - Refactor btrfs_cmp_data_prepare and btrfs_extent_same
> > - Store memory of pages array between iterations
> > - Lock inodes once, not on each iteration
> > - Small inplace cleanups
> >   v2 -> v3:
> > - Split to several patches
> > 
> > Timofey Titovets (3):
> >   Btrfs: split btrfs_extent_same() for simplification
> >   Btrfs: btrfs_dedupe_file_range() ioctl, remove 16MiB restriction
> >   Btrfs: btrfs_extent_same() reuse cmp workspace
> 
> Looks good to me, thanks. I'll edit the changelogs a bit and add the
> patches to the 4.18 queue.

Patchset added to misc-next, so it's on the way to 4.18, thanks.
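The iteration the changelog describes can be sketched like this (a userspace model, not the kernel patch itself; the 16MiB step mirrors the old per-call cap):

```c
/* Userspace sketch of the iteration described above, not the kernel code. */
#define DEDUPE_STEP (16ULL * 1024 * 1024)	/* the old per-call limit */

/* Walk a dedupe request in <=16MiB steps; returns how many steps were
 * needed, so a request at or under 16MiB still takes exactly one step,
 * which is what keeps the change backward compatible. */
static unsigned long long dedupe_iterate(unsigned long long len)
{
	unsigned long long done = 0, steps = 0;

	while (done < len) {
		unsigned long long step = len - done;

		if (step > DEDUPE_STEP)
			step = DEDUPE_STEP;
		/* real code: compare and dedupe [done, done + step) here,
		 * with the inodes locked once outside this loop */
		done += step;
		steps++;
	}
	return steps;
}
```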


Re: [PATCH v4 00/12] Btrfs: orphan and truncate fixes

2018-05-18 Thread David Sterba
On Fri, May 11, 2018 at 01:13:28PM -0700, Omar Sandoval wrote:
> This is the fourth (and hopefully final) version of the orphan item
> early ENOSPC and related fixes.
> 
> Changes since v3:
> 
> - Changed another stale comment in patch 1
> - Moved BTRFS_INODE_ORPHAN_META_RESERVED flag removal to patch 10
>   instead of patch 9
> - Moved inode runtime flag renumbering to a separate patch (patch 11)
> - Added some more reviewed-bys

Patchset added to misc-next, thanks. Besides fstests, I've hammered it
with some stress testing, all fine.

> Changes since v2:
> 
> - Add patch 5 to get rid of BTRFS_INODE_HAS_ORPHAN_ITEM
> - Move patch 10 to patch 6
> - Got rid of patch 5; the bug goes away in the process of removing code
>   for patches 9 and 10
> - Rename patch 10 batch to what it was called in v1
> 
> Changes since v1:
> 
> - Added two extra cleanups, patches 10 and 11
> - Added a forgotten clear of the orphan bit in patch 8
> - Reworded titles of patches 6 and 9
> - Added people's reviewed-bys
> 
> Cover letter from v1:
> 
> At Facebook we hit an early ENOSPC issue which we tracked down to the
> reservations for orphan items of deleted-but-still-open files. The
> primary function of this series is to fix that bug, but I ended up
> uncovering a pile of other issues in the process, most notably that the
> orphan items we create for truncate are useless.
> 
> I've also posted an xfstest that reproduces this bug.
> 
> Thanks!
> 
> *** BLURB HERE ***

Blurb!


Re: [PATCH v2 5/5] generic: test invalid swap file activation

2018-05-18 Thread Darrick J. Wong
On Wed, May 16, 2018 at 01:38:49PM -0700, Omar Sandoval wrote:
> From: Omar Sandoval 
> 
> Swap files cannot have holes, and they must be at least two pages.
> swapon(8) and mkswap(8) have stricter restrictions, so add versions of
> those commands without any restrictions.
> 
> Signed-off-by: Omar Sandoval 
> ---
>  .gitignore|  2 ++
>  src/Makefile  |  2 +-
>  src/mkswap.c  | 83 +++
>  src/swapon.c  | 24 +
>  tests/generic/490 | 77 +++
>  tests/generic/490.out |  5 +++
>  tests/generic/group   |  1 +
>  7 files changed, 193 insertions(+), 1 deletion(-)
>  create mode 100644 src/mkswap.c
>  create mode 100644 src/swapon.c
>  create mode 100755 tests/generic/490
>  create mode 100644 tests/generic/490.out
> 
> diff --git a/.gitignore b/.gitignore
> index 53029e24..efc73a7c 100644
> --- a/.gitignore
> +++ b/.gitignore
> @@ -92,6 +92,7 @@
>  /src/lstat64
>  /src/makeextents
>  /src/metaperf
> +/src/mkswap
>  /src/mmapcat
>  /src/multi_open_unlink
>  /src/nametest
> @@ -111,6 +112,7 @@
>  /src/seek_sanity_test
>  /src/stale_handle
>  /src/stat_test
> +/src/swapon
>  /src/t_access_root
>  /src/t_dir_offset
>  /src/t_dir_offset2
> diff --git a/src/Makefile b/src/Makefile
> index c42d3bb1..01fe99ef 100644
> --- a/src/Makefile
> +++ b/src/Makefile
> @@ -26,7 +26,7 @@ LINUX_TARGETS = xfsctl bstat t_mtab getdevicesize 
> preallo_rw_pattern_reader \
>   renameat2 t_getcwd e4compact test-nextquota punch-alternating \
>   attr-list-by-handle-cursor-test listxattr dio-interleaved t_dir_type \
>   dio-invalidate-cache stat_test t_encrypted_d_revalidate \
> - attr_replace_test
> + attr_replace_test swapon mkswap
>  
>  SUBDIRS = log-writes perf
>  
> diff --git a/src/mkswap.c b/src/mkswap.c
> new file mode 100644
> index ..d0bce2bd
> --- /dev/null
> +++ b/src/mkswap.c
> @@ -0,0 +1,83 @@
> +/* mkswap(8) without any sanity checks */
> +
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +
> +struct swap_header {
> + charbootbits[1024];
> + uint32_tversion;
> + uint32_tlast_page;
> + uint32_tnr_badpages;
> + unsigned char   sws_uuid[16];
> + unsigned char   sws_volume[16];
> + uint32_tpadding[117];
> + uint32_tbadpages[1];
> +};
> +
> +int main(int argc, char **argv)
> +{
> + struct swap_header *hdr;
> + FILE *file;
> + struct stat st;
> + long page_size;
> + int ret;
> +
> + if (argc != 2) {
> + fprintf(stderr, "usage: %s PATH\n", argv[0]);
> + return EXIT_FAILURE;
> + }
> +
> + page_size = sysconf(_SC_PAGESIZE);
> + if (page_size == -1) {
> + perror("sysconf");
> + return EXIT_FAILURE;
> + }
> +
> + hdr = calloc(1, page_size);
> + if (!hdr) {
> + perror("calloc");
> + return EXIT_FAILURE;
> + }
> +
> + file = fopen(argv[1], "r+");
> + if (!file) {
> + perror("fopen");
> + free(hdr);
> + return EXIT_FAILURE;
> + }
> +
> + ret = fstat(fileno(file), &st);
> + if (ret) {
> + perror("fstat");
> + free(hdr);
> + fclose(file);
> + return EXIT_FAILURE;
> + }
> +
> + hdr->version = 1;
> + hdr->last_page = st.st_size / page_size - 1;
> + memset(&hdr->sws_uuid, 0x99, sizeof(hdr->sws_uuid));
> + memcpy((char *)hdr + page_size - 10, "SWAPSPACE2", 10);
> +
> + if (fwrite(hdr, page_size, 1, file) != 1) {
> + perror("fwrite");
> + free(hdr);
> + fclose(file);
> + return EXIT_FAILURE;
> + }
> +
> + if (fclose(file) == EOF) {
> + perror("fwrite");
> + free(hdr);
> + return EXIT_FAILURE;
> + }
> +
> + free(hdr);
> +
> + return EXIT_SUCCESS;
> +}
> diff --git a/src/swapon.c b/src/swapon.c
> new file mode 100644
> index ..0cb7108a
> --- /dev/null
> +++ b/src/swapon.c
> @@ -0,0 +1,24 @@
> +/* swapon(8) without any sanity checks; simply calls swapon(2) directly. */
> +
> +#include 
> +#include 
> +#include 
> +#include 
> +
> +int main(int argc, char **argv)
> +{
> + int ret;
> +
> + if (argc != 2) {
> + fprintf(stderr, "usage: %s PATH\n", argv[0]);
> + return EXIT_FAILURE;
> + }
> +
> + ret = swapon(argv[1], 0);
> + if (ret) {
> + perror("swapon");
> + return EXIT_FAILURE;
> + }
> +
> + return EXIT_SUCCESS;
> +}
> diff --git a/tests/generic/490 b/tests/generic/490
> new file mode 100755
> index ..6ba2ecb3
> --- /dev/null
> +++ b/tests/generic/490
> @@ -0,0 +1,77 @@
> +#! /bin/bash
> +# FS QA Test 490
> +#
> +# Test invalid swap files.
> +#
> 

Re: [PATCH] Btrfs: fix the corruption by reading stale btree blocks

2018-05-18 Thread Nikolay Borisov


On 18.05.2018 17:06, David Sterba wrote:
> On Fri, May 18, 2018 at 09:15:16PM +0800, Liu Bo wrote:
 Doesn't this warrant a stable tag and
 Fixes: 5bdd3536cbbe ("Btrfs: Fix block generation verification race")
>>>
>>> The patch will not apply to < 4.16 as it depends on the addition of
>>> the first_key check from 581c1760415c4, but yes we need it for stable.
>>
>> True, we can easily fix the conflicts by removing @first_key, can't
>> we?
> 
> Yes the backport is easy but still needs to be done manually and we
> can't and don't expect the stable maintainers to resolve the conflicts
> but rather will send an adapted version ourselves. Nikolay promised to do
> that.

Once the patch gets sent to Linus I will backport it for 4.4/4.9 and 4.14



Re: [PATCH] Btrfs: fix the corruption by reading stale btree blocks

2018-05-18 Thread David Sterba
On Fri, May 18, 2018 at 09:15:16PM +0800, Liu Bo wrote:
> > > Doesn't this warrant a stable tag and
> > > Fixes: 5bdd3536cbbe ("Btrfs: Fix block generation verification race")
> > 
> > The patch will not apply to < 4.16 as it depends on the addition of
> > the first_key check from 581c1760415c4, but yes we need it for stable.
> 
> True, we can easily fix the conflicts by removing @first_key, can't
> we?

Yes the backport is easy but still needs to be done manually and we
can't and don't expect the stable maintainers to resolve the conflicts
but rather will send an adapted version ourselves. Nikolay promised to do
that.


Re: [PATCH] Btrfs: fix the corruption by reading stale btree blocks

2018-05-18 Thread Liu Bo
On Wed, May 16, 2018 at 04:12:21PM +0200, David Sterba wrote:
> On Wed, May 16, 2018 at 11:00:14AM +0300, Nikolay Borisov wrote:
> > 
> > 
> > On 15.05.2018 20:37, Liu Bo wrote:
> > > If a btree block, aka. extent buffer, is not available in the extent
> > > buffer cache, it'll be read out from the disk instead, i.e.
> > > 
> > > btrfs_search_slot()
> > >   read_block_for_search()  # hold parent and its lock, go to read child
> > > btrfs_release_path()
> > > read_tree_block()  # read child
> > > 
> > > Unfortunately, the parent lock got released before reading child, so
> > > commit 5bdd3536cbbe ("Btrfs: Fix block generation verification race") had
> > > used 0 as parent transid to read the child block.  It forces
> > > read_tree_block() not to check whether the parent transid differs from
> > > the generation id of the child that it reads out from disk.
> > > 
> > > A simple PoC is included in btrfs/124,
> > > 
> > > 0. A two-disk raid1 btrfs,
> > > 
> > > 1. Right after mkfs.btrfs, block A is allocated to be device tree's root.
> > > 
> > > 2. Mount this filesystem and put it in use, after a while, device tree's
> > >root got COW but block A hasn't been allocated/overwritten yet.
> > > 
> > > 3. Umount it and reload the btrfs module to remove both disks from the
> > >global @fs_devices list.
> > > 
> > > 4. mount -odegraded dev1 and write some data, so now block A is allocated
> > >to be a leaf in checksum tree.  Note that only dev1 has the latest
> > >metadata of this filesystem.
> > > 
> > > 5. Umount it and mount it again normally (with both disks), since raid1
> > >can pick up one disk by the writer task's pid, if btrfs_search_slot()
> > >needs to read block A, dev2 which does NOT have the latest metadata
> > >might be read for block A, then we got a stale block A.
> > > 
> > > 6. As parent transid is not checked, block A is marked as uptodate and
> > >put into the extent buffer cache, so the future search won't bother
> > >to read disk again, which means it'll make changes on this stale
> > >one and make it dirty and flush it onto disk.
> > > 
> > > To avoid the problem, parent transid needs to be passed to
> > > read_tree_block().
> > > 
> > > In order to get a valid parent transid, we need to hold the parent's
> > > lock until finish reading child.
> > > 
> > > Signed-off-by: Liu Bo 
> > 
> > Doesn't this warrant a stable tag and
> > Fixes: 5bdd3536cbbe ("Btrfs: Fix block generation verification race")
> 
> The patch will not apply to < 4.16 as it depends on the addition of
> the first_key check from 581c1760415c4, but yes we need it for stable.

True, we can easily fix the conflicts by removing @first_key, can't
we?

The problem does exist before it.

thanks,
-liubo
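The failure mode is easy to model in miniature (a toy illustration, not btrfs code; names are invented): a parent transid of 0 means the generation check is skipped, which is exactly what lets a stale copy of block A be accepted.

```c
#include <stdbool.h>

/* Toy model of the race described in this thread; names are invented. */
struct toy_block {
	unsigned long long gen;	/* generation stamped when the block was written */
};

/* Mimics read_tree_block()'s verification: parent_transid == 0 means
 * "don't check", so a stale mirror copy can slip into the extent
 * buffer cache; passing the real parent transid rejects it. */
static bool toy_block_acceptable(const struct toy_block *b,
				 unsigned long long parent_transid)
{
	if (parent_transid == 0)
		return true;
	return b->gen == parent_transid;
}
```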


Re: [PATCH 01/10] mm: pagecache add lock

2018-05-18 Thread Matthew Wilcox
On Fri, May 18, 2018 at 03:49:00AM -0400, Kent Overstreet wrote:
> Add a per address space lock around adding pages to the pagecache - making it
> possible for fallocate INSERT_RANGE/COLLAPSE_RANGE to work correctly, and also
> hopefully making truncate and dio a bit saner.

(moving this section here from the overall description so I can reply
to it in one place)

>  * pagecache add lock
> 
> This is the only one that touches existing code in nontrivial ways.
> The problem it's solving is that there is no existing general mechanism
> for shooting down pages in the page and keeping them removed, which is a
> real problem if you're doing anything that modifies file data and isn't
> buffered writes.
> 
> Historically, the only problematic case has been direct IO, and people
> have been willing to say "well, if you mix buffered and direct IO you
> get what you deserve", and that's probably not unreasonable. But now we
> have fallocate insert range and collapse range, and those are broken in
> ways I frankly don't want to think about if they can't ensure consistency
> with the page cache.

ext4 manages collapse-vs-pagefault with the ext4-specific i_mmap_sem.
You may get pushback on the grounds that this ought to be a
filesystem-specific lock rather than one embedded in the generic inode.

> Also, the mechanism truncate uses (i_size and sacrificing a goat) has
> historically been rather fragile, IMO it might be a good think if we
> switched it to a more general rigorous mechanism.
> 
> I need this solved for bcachefs because without this mechanism, the page
> cache inconsistencies lead to various assertions popping (primarily when
> we didn't think we need to get a disk reservation going by page cache
> state, but then do the actual write and disk space accounting says oops,
> we did need one). And having to reason about what can happen without
> a locking mechanism for this is not something I care to spend brain
> cycles on.
> 
> That said, my patch is kind of ugly, and it requires filesystem changes
> for other filesystems to take advantage of it. And unfortunately, since
> one of the code paths that needs locking is readahead, I don't see any
> realistic way of implementing the locking within just bcachefs code.
> 
> So I'm hoping someone has an idea for something cleaner (I think I recall
> Matthew Wilcox saying he had an idea for how to use xarray to solve this),
> but if not I'll polish up my pagecache add lock patch and see what I can
> do to make it less ugly, and hopefully other people find it palatable
> or at least useful.

My idea with the XArray is that we have a number of reserved entries which
we can use as blocking entries.  I was originally planning on making this
an XArray feature, but I now believe it's a page-cache-special feature.
We can always revisit that decision if it turns out to be useful to
another user.

API:

int filemap_block_range(struct address_space *mapping, loff_t start,
loff_t end);
void filemap_remove_block(struct address_space *mapping, loff_t start,
loff_t end);

 - After removing a block, the pagecache is empty between [start, end].
 - You have to treat the block as a single entity; don't unblock only
   a subrange of the range you originally blocked.
 - Lookups of a page within a blocked range return NULL.
 - Attempts to add a page to a blocked range sleep on one of the
   page_wait_table queues.
 - Attempts to block a blocked range will also sleep on one of the
   page_wait_table queues.  Is this restriction acceptable for your use
   case?  It's clearly not a problem for fallocate insert/collapse.  It
   would only be a problem for Direct I/O if people are doing subpage
   directio from within the same page.  I think that's rare enough to
   not be a problem (but please tell me if I'm wrong!)
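A userspace toy of the proposed semantics might look like this (an assumed model, not kernel code; the real version would sleep on the page_wait_table queues instead of failing, and the `toy_` names are inventions):

```c
#include <stdbool.h>

/* Toy single-range model of the proposed API; all names are assumptions. */
struct toy_mapping {
	bool blocked;
	long long start, end;	/* inclusive blocked byte range */
};

/* Block [start, end]; returns -1 if a block is already held (the real
 * proposal sleeps here instead of failing). */
static int toy_filemap_block_range(struct toy_mapping *m,
				   long long start, long long end)
{
	if (m->blocked)
		return -1;
	m->blocked = true;
	m->start = start;
	m->end = end;
	return 0;
}

/* The block must be removed as a single entity, matching the original
 * range exactly, per the rules above. */
static int toy_filemap_remove_block(struct toy_mapping *m,
				    long long start, long long end)
{
	if (!m->blocked || m->start != start || m->end != end)
		return -1;
	m->blocked = false;
	return 0;
}

/* Lookups/adds inside a blocked range fail (real version: NULL / sleep). */
static bool toy_filemap_can_add_page(const struct toy_mapping *m, long long pos)
{
	return !(m->blocked && pos >= m->start && pos <= m->end);
}
```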



[PATCH] btrfs: Document __btrfs_inc_extent_ref

2018-05-18 Thread Nikolay Borisov
Here is a doc-only patch which tries to deobfuscate the terra incognita
that the delayed-ref arguments are.

Signed-off-by: Nikolay Borisov 
---
Hello, 

This patch needs reviewing since I'm not entirely sure I managed to capture
the semantics of the "parent" and "owner" arguments. Specifically, parent is
passed "ref->parent" only if we have a shared block; however, looking at the
code where parent is passed to the add delayed tree/data refs, it seems that
parent is set to eb->start only if the tree this extent comes from is
the data reloc tree, i.e. the code in
replace_path/replace_file_extents/do_relocation/__btrfs_cow_block.

For the "owner" argument, in case of data extents it's set to the ino; for
metadata extents it's a bit trickier: what I think it always contains for
such extents is the level of the parent block in the tree.


If this function is documented correctly then it will be fairly trivial to
document the btrfs_add_delayed_(tree|data)_ref ones as well.

 fs/btrfs/extent-tree.c | 29 +
 1 file changed, 29 insertions(+)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 2ce32f05812f..5a2f4a86dc71 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -2207,6 +2207,35 @@ int btrfs_inc_extent_ref(struct btrfs_trans_handle 
*trans,
return ret;
 }
 
+/*
+ * __btrfs_inc_extent_ref - insert backreference for a given extent
+ *
+ * @trans: Handle of transaction
+ *
+ * @node:  The delayed ref node used to get the bytenr/length for
+ * extent
+ *
+ * @parent: If this is a shared extent (BTRFS_SHARED_DATA_REF_KEY/
+ * BTRFS_SHARED_BLOCK_REF_KEY) then parent *may* hold the
+ * logical bytenr of the parent block.
+ *
+ * @root_objectid: The id of the root where this modification has originated,
+ * this can be either one of the well-known metadata trees or
+ * the subvolume id which references this extent.
+ *
+ * @owner: For data extents it is the inode number of the owning file.
+ * For metadata extents this parameter holds the level in the
+ * tree of the parent block.
+ *
+ * @offset: For metadata extents this is always 0. For data extents it
+ * is the fileoffset this extent belongs to.
+ *
+ * @refs_to_add: Number of references to add
+ *
+ * @extent_op: Pointer to a structure holding information necessary when
+ * updating a tree block's flags
+ *
+ */
 static int __btrfs_inc_extent_ref(struct btrfs_trans_handle *trans,
  struct btrfs_delayed_ref_node *node,
  u64 parent, u64 root_objectid,
-- 
2.7.4



Re: [PATCH 04/10] locking: export osq_lock()/osq_unlock()

2018-05-18 Thread Kent Overstreet
On Fri, May 18, 2018 at 01:40:54PM +0200, Peter Zijlstra wrote:
> On Fri, May 18, 2018 at 07:32:05AM -0400, Kent Overstreet wrote:
> > It does strike me that the whole optimistic spin algorithm
> > (mutex_optimistic_spin() and rwsem_optimistic_spin()) are ripe for factoring
> > out. They've been growing more optimizations I see, and the optimizations 
> > mostly
> > aren't specific to either locks.
> 
> There's a few unfortunate differences between the two; but yes it would
> be good if we could reduce some of the duplication found there.
> 
> One of the distinct differences is that mutex (now) has the lock state
> and owner in a single atomic word, while rwsem still tracks the owner in
> a separate word and thus needs to account for 'inconsistent' owner
> state.
> 
> And then there's warts such as ww_mutex and rwsem_owner_is_reader and
> similar.

The rwsem code seems workable; osq_optimistic_spin() takes callbacks for
trylock and get_owner - I just added an OWNER_NO_SPIN value that get_owner()
can return.

The mutex code though... wtf...
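The callback shape described above could be factored roughly like this (a simplified userspace sketch with assumed names; a real version would track osq_lock()/owner->on_cpu rather than a plain spin budget):

```c
#include <stdbool.h>
#include <stddef.h>

/* Sketch of the factored-out helper; OWNER_NO_SPIN and the ops names
 * are assumptions, not the actual patch. */
#define OWNER_NO_SPIN ((void *)1UL)

struct spin_ops {
	bool (*trylock)(void *lock);
	void *(*get_owner)(void *lock);	/* NULL, a task, or OWNER_NO_SPIN */
};

static bool toy_optimistic_spin(void *lock, const struct spin_ops *ops,
				int budget)
{
	for (int i = 0; i < budget; i++) {
		if (ops->get_owner(lock) == OWNER_NO_SPIN)
			return false;	/* e.g. rwsem owned by readers */
		if (ops->trylock(lock))
			return true;
		/* real code: cpu_relax() and re-check owner->on_cpu here */
	}
	return false;
}

/* Tiny fake lock to exercise the helper: 0 = free, 1 = held, 2 = no-spin. */
static bool fake_trylock(void *lock)
{
	int *state = lock;

	if (*state == 0) {
		*state = 1;
		return true;
	}
	return false;
}

static void *fake_get_owner(void *lock)
{
	int *state = lock;

	return *state == 2 ? OWNER_NO_SPIN : NULL;
}
```

The differences Peter lists (mutex's combined owner/state word vs rwsem's separate owner word) would live behind the get_owner callback in such a scheme.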


commit 5c7defe81ee722f5087cbeda9be6e7ee715515d7
Author: Kent Overstreet 
Date:   Fri May 18 08:35:55 2018 -0400

Factor out osq_optimistic_spin()

diff --git a/fs/bcachefs/six.c b/fs/bcachefs/six.c
index afa59a476a..dbaf52abef 100644
--- a/fs/bcachefs/six.c
+++ b/fs/bcachefs/six.c
@@ -1,5 +1,6 @@
 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -146,127 +147,26 @@ struct six_lock_waiter {
 /* This is probably up there with the more evil things I've done */
#define waitlist_bitnr(id) ilog2((((union six_lock_state) { .waiters = 1 << (id) }).l))
 
-#ifdef CONFIG_LOCK_SPIN_ON_OWNER
-
-static inline int six_can_spin_on_owner(struct six_lock *lock)
-{
-   struct task_struct *owner;
-   int retval = 1;
-
-   if (need_resched())
-   return 0;
-
-   rcu_read_lock();
-   owner = READ_ONCE(lock->owner);
-   if (owner)
-   retval = owner->on_cpu;
-   rcu_read_unlock();
-   /*
-* if lock->owner is not set, the mutex owner may have just acquired
-* it and not set the owner yet or the mutex has been released.
-*/
-   return retval;
-}
-
-static inline bool six_spin_on_owner(struct six_lock *lock,
-struct task_struct *owner)
+static inline struct task_struct *six_osq_get_owner(struct 
optimistic_spin_queue *osq)
 {
-   bool ret = true;
-
-   rcu_read_lock();
-   while (lock->owner == owner) {
-   /*
-* Ensure we emit the owner->on_cpu, dereference _after_
-* checking lock->owner still matches owner. If that fails,
-* owner might point to freed memory. If it still matches,
-* the rcu_read_lock() ensures the memory stays valid.
-*/
-   barrier();
-
-   if (!owner->on_cpu || need_resched()) {
-   ret = false;
-   break;
-   }
-
-   cpu_relax();
-   }
-   rcu_read_unlock();
+   struct six_lock *lock = container_of(osq, struct six_lock, osq);
 
-   return ret;
+   return lock->owner;
 }
 
-static inline bool six_optimistic_spin(struct six_lock *lock, enum 
six_lock_type type)
+static inline bool six_osq_trylock_read(struct optimistic_spin_queue *osq)
 {
-   struct task_struct *task = current;
-
-   if (type == SIX_LOCK_write)
-   return false;
-
-   preempt_disable();
-   if (!six_can_spin_on_owner(lock))
-   goto fail;
-
-   if (!osq_lock(&lock->osq))
-   goto fail;
-
-   while (1) {
-   struct task_struct *owner;
-
-   /*
-* If there's an owner, wait for it to either
-* release the lock or go to sleep.
-*/
-   owner = READ_ONCE(lock->owner);
-   if (owner && !six_spin_on_owner(lock, owner))
-   break;
+   struct six_lock *lock = container_of(osq, struct six_lock, osq);
 
-   if (do_six_trylock_type(lock, type)) {
-   osq_unlock(&lock->osq);
-   preempt_enable();
-   return true;
-   }
-
-   /*
-* When there's no owner, we might have preempted between the
-* owner acquiring the lock and setting the owner field. If
-* we're an RT task that will live-lock because we won't let
-* the owner complete.
-*/
-   if (!owner && (need_resched() || rt_task(task)))
-   break;
-
-   /*
-* The cpu_relax() call is a compiler barrier which forces
-* everything in this loop to be re-loaded. We don't need
-* memory barriers as we'll eventually observe the right
-* values at the cost of a few 

Re: [PATCH v2 0/3] btrfs: add read mirror policy

2018-05-18 Thread Austin S. Hemmelgarn

On 2018-05-18 04:06, Anand Jain wrote:



Thanks Austin and Jeff for the suggestion.

I am not particularly a fan of a mount option either, mainly because
those options aren't persistent, and host-independent LUNs will
have a tough time keeping them synchronized manually.

Properties are better as they are persistent, and we can apply this
read_mirror_policy property to the fsid object.

But if we are talking about properties, then they can be stored
as extended attributes or as an on-disk key-value pair, and I doubt
an on-disk key-value pair will get a nod.
I can explore the extended-attribute approach but would appreciate more
comments.


Hmm, thinking a bit further, might it be easier to just keep this as a 
mount option, and add something that lets you embed default mount 
options in the volume in a free-form manner?  Then, you could set this 
persistently there, and could specify any others you want too.  Doing 
that would also give very well defined behavior for exactly when changes 
would apply (the next time you mount or remount the volume), though 
handling of whether or not an option came from there or was specified on 
the command-line might be a bit complicated.



On 05/17/2018 10:46 PM, Jeff Mahoney wrote:

On 5/17/18 8:25 AM, Austin S. Hemmelgarn wrote:

On 2018-05-16 22:32, Anand Jain wrote:



On 05/17/2018 06:35 AM, David Sterba wrote:

On Wed, May 16, 2018 at 06:03:56PM +0800, Anand Jain wrote:

Not yet ready for the integration. As I need to introduce
-o no_read_mirror_policy instead of -o read_mirror_policy=-


A mount option is most likely not the right interface for setting such
options, as usual.


   I am OK with making it an ioctl for the final version. What do you think?


   But to reproduce the bug posted in
 "Btrfs: fix the corruption by reading stale btree blocks"
   it needs to be a mount option, as the random pid-based selection
   cannot be relied on to pick the intended disk.


Personally, I'd vote for filesystem property (thus handled through the
standard `btrfs property` command) that can be overridden by a mount
option.  With that approach, no new tool (or change to an existing tool)
would be needed, existing volumes could be converted to use it in a
backwards compatible manner (old kernels would just ignore the
property), and you could still have the behavior you want in tests (and
in theory it could easily be adapted to be a per-subvolume setting if we
ever get per-subvolume chunk profile support).


Properties are a combination of interfaces presented through a single
command.  Although the kernel API would allow a direct-to-property
interface via the btrfs.* extended attributes, those are currently
limited to a single inode.  The label property is set via ioctl and
stored in the superblock.  The read-only subvolume property is also set
by ioctl but stored in the root flags.

As it stands, every property is explicitly defined in the tools, so any
addition would require tools changes.  This is a bigger discussion,
though.  We *could* use the xattr interface to access per-root or
fs-global properties, but we'd need to define that interface.
btrfs_listxattr could get interesting, though I suppose we could
simplify it by only allowing the per-subvolume and fs-global operations
on root inodes.

-Jeff





RE: [PATCH v6 3/3] btrfs: Add unprivileged version of ino_lookup ioctl

2018-05-18 Thread Gu, Jinxiang


> -Original Message-
> From: linux-btrfs-ow...@vger.kernel.org 
> [mailto:linux-btrfs-ow...@vger.kernel.org] On Behalf Of Tomohiro Misono
> Sent: Friday, May 18, 2018 10:55 AM
> To: linux-btrfs@vger.kernel.org
> Subject: [PATCH v6 3/3] btrfs: Add unprivileged version of ino_lookup ioctl
> 
> Add unprivileged version of ino_lookup ioctl BTRFS_IOC_INO_LOOKUP_USER to 
> allow normal users to call "btrfs subvolume list/show" etc.
> in combination with BTRFS_IOC_GET_SUBVOL_INFO/BTRFS_IOC_GET_SUBVOL_ROOTREF.
> 
> This can be used like BTRFS_IOC_INO_LOOKUP but the argument is different.
> This is because it always searches the fs/file tree corresponding to the fd
> with which this ioctl is called and also returns the name of the bottom
> subvolume.
> 
> The main differences from original ino_lookup ioctl are:
>   1. Read + Exec permission will be checked using inode_permission()
>  during path construction. -EACCES will be returned in case
>  of failure.
>   2. Path construction will be stopped at the inode number which
>  corresponds to the fd with which this ioctl is called. If
>  constructed path does not exist under fd's inode, -EACCES
>  will be returned.
>   3. The name of bottom subvolume is also searched and filled.
> 
> Note that the maximum path length is 256 (BTRFS_VOL_NAME_MAX+1) bytes
> shorter than in the ino_lookup ioctl because of the space reserved for the
> subvolume's name.
> 
> Reviewed-by: Gu Jinxiang 
> Reviewed-by: Qu Wenruo 
> Signed-off-by: Tomohiro Misono 
> ---
>  fs/btrfs/ioctl.c   | 204 
> +
>  include/uapi/linux/btrfs.h |  17 
>  2 files changed, 221 insertions(+)
> 
> diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c index 
> 463ddedd90da..5e8dfa816ba9 100644
> --- a/fs/btrfs/ioctl.c
> +++ b/fs/btrfs/ioctl.c
> @@ -2200,6 +2200,166 @@ static noinline int btrfs_search_path_in_tree(struct btrfs_fs_info *info,
>   return ret;
>  }
> 
> +static noinline int btrfs_search_path_in_tree_user(struct inode *inode,
> + struct btrfs_ioctl_ino_lookup_user_args *args)
> +{
> + struct btrfs_fs_info *fs_info = BTRFS_I(inode)->root->fs_info;
> + struct super_block *sb = inode->i_sb;
> + struct btrfs_key upper_limit = BTRFS_I(inode)->location;
> + u64 treeid = BTRFS_I(inode)->root->root_key.objectid;
> + u64 dirid = args->dirid;
> +
> + unsigned long item_off;
> + unsigned long item_len;
> + struct btrfs_inode_ref *iref;
> + struct btrfs_root_ref *rref;
> + struct btrfs_root *root;
> + struct btrfs_path *path;
> + struct btrfs_key key, key2;
> + struct extent_buffer *l;
> + struct inode *temp_inode;
> + char *ptr;
> + int slot;
> + int len;
> + int total_len = 0;
> + int ret;
> +
> + path = btrfs_alloc_path();
> + if (!path)
> + return -ENOMEM;
> +
> + /*
> +  * If the bottom subvolume does not exist directly under upper_limit,
> +  * construct the path in bottomup way.
> +  */
> + if (dirid != upper_limit.objectid) {
> + ptr = &args->path[BTRFS_INO_LOOKUP_USER_PATH_MAX - 1];
> +
> + key.objectid = treeid;
> + key.type = BTRFS_ROOT_ITEM_KEY;
> + key.offset = (u64)-1;
> + root = btrfs_read_fs_root_no_name(fs_info, &key);
> + if (IS_ERR(root)) {
> + ret = -ENOENT;
> + goto out;
> + }
> +
> + key.objectid = dirid;
> + key.type = BTRFS_INODE_REF_KEY;
> + key.offset = (u64)-1;
> + while (1) {
> + ret = btrfs_search_slot(NULL, root, &key, path, 0, 0);
> + if (ret < 0) {
> + goto out;
> + } else if (ret > 0) {
> + ret = btrfs_previous_item(root, path, dirid,
> +   BTRFS_INODE_REF_KEY);
> + if (ret < 0) {
> + goto out;
> + } else if (ret > 0) {
> + ret = -ENOENT;
> + goto out;
> + }
> + }
> +
> + l = path->nodes[0];
> + slot = path->slots[0];
> + btrfs_item_key_to_cpu(l, &key, slot);
> +
> + iref = btrfs_item_ptr(l, slot, struct btrfs_inode_ref);
> + len = btrfs_inode_ref_name_len(l, iref);
> + ptr -= len + 1;
> + total_len += len + 1;
> + if (ptr < args->path) {
> + ret = -ENAMETOOLONG;
> + goto out;
> + }
> +
> + *(ptr + len) = '/';
> + 
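Though the quoted patch is cut off at this point, the bottom-up construction it uses — step the write pointer back by `len + 1`, copy the component name, drop a '/' behind it — is a self-contained idiom. Below is a hypothetical userspace sketch of that technique; the helper name and buffer handling are illustrative, not code from the patch (the kernel version returns -ENAMETOOLONG where this returns NULL, and this sketch assumes buflen >= 1):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical sketch of the bottom-up path construction: components
 * arrive leaf-first and each is copied in front of what has been built
 * so far, followed by a '/'.  Returns a pointer into buf, or NULL where
 * the kernel code would return -ENAMETOOLONG.
 */
static char *build_path_bottom_up(char *buf, size_t buflen,
				  const char *const *names, size_t n)
{
	char *ptr = buf + buflen - 1;	/* start at the terminating NUL */

	*ptr = '\0';
	for (size_t i = 0; i < n; i++) {
		size_t len = strlen(names[i]);

		ptr -= len + 1;		/* room for the name plus '/' */
		if (ptr < buf)
			return NULL;	/* path does not fit in the buffer */
		memcpy(ptr, names[i], len);
		ptr[len] = '/';
	}
	return ptr;
}
```

Feeding components leaf-first, e.g. {"c", "b", "a"}, yields "a/b/c/"; the trailing slash matters because the kernel code later appends the bottom subvolume's name after it.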

RE: [PATCH v6 2/3] btrfs: Add unprivileged ioctl which returns subvolume's ROOT_REF

2018-05-18 Thread Gu, Jinxiang


> -Original Message-
> From: linux-btrfs-ow...@vger.kernel.org 
> [mailto:linux-btrfs-ow...@vger.kernel.org] On Behalf Of Tomohiro Misono
> Sent: Friday, May 18, 2018 10:55 AM
> To: linux-btrfs@vger.kernel.org
> Subject: [PATCH v6 2/3] btrfs: Add unprivileged ioctl which returns 
> subvolume's ROOT_REF
> 
> Add unprivileged ioctl BTRFS_IOC_GET_SUBVOL_ROOTREF which returns ROOT_REF 
> information of the subvolume containing this inode
> except the subvolume name (this is because to prevent potential name leak). 
> The subvolume name will be gained by user version of
> ino_lookup ioctl (BTRFS_IOC_INO_LOOKUP_USER) which also performs permission 
> check.
> 
> The min id of root ref's subvolume to be searched is specified by @min_id in 
> struct btrfs_ioctl_get_subvol_rootref_args. After the search
> ends, @min_id is set to the last searched root ref's subvolid + 1. Also, if 
> there are more root refs than
> BTRFS_MAX_ROOTREF_BUFFER_NUM, -EOVERFLOW is returned. Therefore the caller 
> can just call this ioctl again without changing the
> argument to continue the search.
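The continuation contract described above (re-issue the ioctl untouched after -EOVERFLOW, since @min_id has already been advanced) can be sketched with a mocked fetch function. The ioctl itself is replaced by a stand-in here, so `fake_get_rootref()`, `collect_all()`, and the buffer size are all invented for illustration:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define FAKE_BUF_NUM 2	/* stands in for BTRFS_MAX_ROOTREF_BUFFER_NUM */

struct fake_rootref { uint64_t subvolid; };

struct fake_args {
	uint64_t min_id;	/* in/out: lowest subvolid to search from */
	uint64_t num_items;	/* out: number of entries filled */
	struct fake_rootref rootref[FAKE_BUF_NUM];
};

/*
 * Mocked "kernel side": pretend subvolumes 256..260 exist.  Mirrors the
 * patch's contract: fill at most FAKE_BUF_NUM refs with subvolid >=
 * min_id, bump min_id past the last one returned, and report -EOVERFLOW
 * when more remain.
 */
static int fake_get_rootref(struct fake_args *a)
{
	uint64_t id = a->min_id < 256 ? 256 : a->min_id;
	int ret = 0;

	a->num_items = 0;
	while (id <= 260) {
		if (a->num_items == FAKE_BUF_NUM) {
			ret = -EOVERFLOW;	/* buffer full, more remain */
			break;
		}
		a->rootref[a->num_items++].subvolid = id++;
	}
	if (a->num_items)
		a->min_id = a->rootref[a->num_items - 1].subvolid + 1;
	return ret;
}

/* Caller side: re-issue the call untouched until it stops overflowing. */
static int collect_all(uint64_t *out, size_t max, size_t *count)
{
	struct fake_args args = { .min_id = 0 };
	int ret;

	*count = 0;
	do {
		ret = fake_get_rootref(&args);
		if (ret && ret != -EOVERFLOW)
			return ret;	/* genuine failure */
		for (uint64_t i = 0; i < args.num_items && *count < max; i++)
			out[(*count)++] = args.rootref[i].subvolid;
	} while (ret == -EOVERFLOW);
	return 0;
}
```

The point of the design is that the caller's loop body is identical for the full-buffer and final iterations: no cursor bookkeeping beyond what the kernel already writes back into the args struct.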
> 
> Reviewed-by: Qu Wenruo 
> Signed-off-by: Tomohiro Misono 
> ---
>  v4 -> v5
> - Update error handling of btrfs_next_leaf() to cover all cases
> - Use btrfs_next_item() to reduce the call of btrfs_search_slot()
> 
>  fs/btrfs/ioctl.c   | 102 
> +
>  include/uapi/linux/btrfs.h |  16 +++
>  2 files changed, 118 insertions(+)
> 
> diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
> index 31af6e91c614..463ddedd90da 100644
> --- a/fs/btrfs/ioctl.c
> +++ b/fs/btrfs/ioctl.c
> @@ -2365,6 +2365,106 @@ static noinline int 
> btrfs_ioctl_get_subvol_info(struct file *file,
>   return ret;
>  }
> 
> +/*
> + * Return ROOT_REF information of the subvolume containing this inode
> + * except the subvolume name.
> + */
> +static noinline int btrfs_ioctl_get_subvol_rootref(struct file *file,
> +void __user *argp)
> +{
> + struct btrfs_ioctl_get_subvol_rootref_args *rootrefs;
> + struct btrfs_root_ref *rref;
> + struct btrfs_root *root;
> + struct btrfs_path *path;
> + struct btrfs_key key;
> +
> + struct extent_buffer *l;
> + int slot;
> +
> + struct inode *inode;
> + int ret;
> + u64 objectid;
> + u8 found;
> +
> + path = btrfs_alloc_path();
> + if (!path)
> + return -ENOMEM;
> +
> + rootrefs = memdup_user(argp, sizeof(*rootrefs));
> + if (IS_ERR(rootrefs)) {
> + btrfs_free_path(path);
> + return PTR_ERR(rootrefs);
> + }
> +
> + inode = file_inode(file);
> + root = BTRFS_I(inode)->root->fs_info->tree_root;
> + objectid = BTRFS_I(inode)->root->root_key.objectid;
> +
> + key.objectid = objectid;
> + key.type = BTRFS_ROOT_REF_KEY;
> + key.offset = rootrefs->min_id;
> + found = 0;
> +
> + ret = btrfs_search_slot(NULL, root, &key, path, 0, 0);
> + if (ret < 0) {
> + goto out;
> + } else if (path->slots[0] >=
> + btrfs_header_nritems(path->nodes[0])) {
> + ret = btrfs_next_leaf(root, path);
> + if (ret < 0) {
> + goto out;
> + } else if (ret > 0) {
> + ret = -EUCLEAN;
> + goto out;
> + }
> + }
> + while (1) {
> + l = path->nodes[0];
> + slot = path->slots[0];
> +
> + btrfs_item_key_to_cpu(l, &key, slot);
> + if (key.objectid != objectid ||
> + key.type != BTRFS_ROOT_REF_KEY) {
> + ret = 0;
> + goto out;
> + }
> +
> + if (found == BTRFS_MAX_ROOTREF_BUFFER_NUM) {
> + ret = -EOVERFLOW;
> + goto out;
> + }
> +
> + rref = btrfs_item_ptr(l, slot, struct btrfs_root_ref);
> + rootrefs->rootref[found].subvolid = key.offset;
> + rootrefs->rootref[found].dirid =
> +   btrfs_root_ref_dirid(l, rref);
> + found++;
> +
> + ret = btrfs_next_item(root, path);
> + if (ret < 0) {
> + goto out;
> + } else if (ret > 0) {
> + ret = -EUCLEAN;
> + goto out;
> + }
> + }
> +
> +out:
> + if (!ret || ret == -EOVERFLOW) {
> + rootrefs->num_items = found;
> + /* update min_id for next search */
> + if (found)
> + rootrefs->min_id =
> + rootrefs->rootref[found - 1].subvolid + 1;
> + if (copy_to_user(argp, rootrefs, sizeof(*rootrefs)))
> + ret = -EFAULT;
> + }
> +
> + btrfs_free_path(path);
> + kfree(rootrefs);
> + return ret;
> +}
> +
>  static noinline int 

RE: [PATCH v6 1/3] btrfs: Add unprivileged ioctl which returns subvolume information

2018-05-18 Thread Gu, Jinxiang


> -Original Message-
> From: linux-btrfs-ow...@vger.kernel.org 
> [mailto:linux-btrfs-ow...@vger.kernel.org] On Behalf Of Tomohiro Misono
> Sent: Friday, May 18, 2018 10:55 AM
> To: linux-btrfs@vger.kernel.org
> Subject: [PATCH v6 1/3] btrfs: Add unprivileged ioctl which returns subvolume 
> information
> 
> Add new unprivileged ioctl BTRFS_IOC_GET_SUBVOL_INFO which returns the 
> information of the subvolume containing this inode.
> (i.e. returns the information in ROOT_ITEM and ROOT_BACKREF.)
> 
> Signed-off-by: Tomohiro Misono 
> ---
>  v5 -> v6
> - Use btrfs_read_fs_root_no_name() to get subvolume root and simplify
>   code
>  v4 -> v5
> - Update error handling of btrfs_next_leaf() to cover all cases
> - Return error if ROOT_BACKREF is not found (except top-level)
> 
>  fs/btrfs/ioctl.c   | 125 
> +
>  include/uapi/linux/btrfs.h |  51 ++
>  2 files changed, 176 insertions(+)
> 
> diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
> index 48e2ddff32bd..31af6e91c614 100644
> --- a/fs/btrfs/ioctl.c
> +++ b/fs/btrfs/ioctl.c
> @@ -2242,6 +2242,129 @@ static noinline int btrfs_ioctl_ino_lookup(struct file *file,
>   return ret;
>  }
> 
> +/* Get the subvolume information in BTRFS_ROOT_ITEM and BTRFS_ROOT_BACKREF */
> +static noinline int btrfs_ioctl_get_subvol_info(struct file *file,
> +void __user *argp)
> +{
> + struct btrfs_ioctl_get_subvol_info_args *subvol_info;
> + struct btrfs_fs_info *fs_info;
> + struct btrfs_root *root;
> + struct btrfs_path *path;
> + struct btrfs_key key;
> +
> + struct btrfs_root_item *root_item;
> + struct btrfs_root_ref *rref;
> + struct extent_buffer *l;
> + int slot;
> +
> + unsigned long item_off;
> + unsigned long item_len;
> +
> + struct inode *inode;
> + int ret = 0;
> +
> + path = btrfs_alloc_path();
> + if (!path)
> + return -ENOMEM;
> +
> + subvol_info = kzalloc(sizeof(*subvol_info), GFP_KERNEL);
> + if (!subvol_info) {
> + btrfs_free_path(path);
> + return -ENOMEM;
> + }
> +
> + inode = file_inode(file);
> + fs_info = BTRFS_I(inode)->root->fs_info;
> +
> + /* Get root_item of inode's subvolume */
> + key.objectid = BTRFS_I(inode)->root->root_key.objectid;
> + key.type = BTRFS_ROOT_ITEM_KEY;
> + key.offset = (u64)-1;
> + root = btrfs_read_fs_root_no_name(fs_info, &key);
> + if (IS_ERR(root)) {
> + ret = PTR_ERR(root);
> + goto out;
> + }
> + root_item = &root->root_item;
> +
> + subvol_info->id = key.objectid;
> +
> + subvol_info->generation = btrfs_root_generation(root_item);
> + subvol_info->flags = btrfs_root_flags(root_item);
> +
> + memcpy(subvol_info->uuid, root_item->uuid, BTRFS_UUID_SIZE);
> + memcpy(subvol_info->parent_uuid, root_item->parent_uuid,
> + BTRFS_UUID_SIZE);
> + memcpy(subvol_info->received_uuid, root_item->received_uuid,
> + BTRFS_UUID_SIZE);
> +
> + subvol_info->ctransid = btrfs_root_ctransid(root_item);
> + subvol_info->ctime.sec = btrfs_stack_timespec_sec(&root_item->ctime);
> + subvol_info->ctime.nsec = btrfs_stack_timespec_nsec(&root_item->ctime);
> +
> + subvol_info->otransid = btrfs_root_otransid(root_item);
> + subvol_info->otime.sec = btrfs_stack_timespec_sec(&root_item->otime);
> + subvol_info->otime.nsec = btrfs_stack_timespec_nsec(&root_item->otime);
> +
> + subvol_info->stransid = btrfs_root_stransid(root_item);
> + subvol_info->stime.sec = btrfs_stack_timespec_sec(&root_item->stime);
> + subvol_info->stime.nsec = btrfs_stack_timespec_nsec(&root_item->stime);
> +
> + subvol_info->rtransid = btrfs_root_rtransid(root_item);
> + subvol_info->rtime.sec = btrfs_stack_timespec_sec(&root_item->rtime);
> + subvol_info->rtime.nsec = btrfs_stack_timespec_nsec(&root_item->rtime);
> +
> + if (key.objectid != BTRFS_FS_TREE_OBJECTID) {
> + /* Search root tree for ROOT_BACKREF of this subvolume */
> + root = fs_info->tree_root;
> +
> + key.type = BTRFS_ROOT_BACKREF_KEY;
> + key.offset = 0;
> + ret = btrfs_search_slot(NULL, root, &key, path, 0, 0);
> + if (ret < 0) {
> + goto out;
> + } else if (path->slots[0] >=
> + btrfs_header_nritems(path->nodes[0])) {
> + ret = btrfs_next_leaf(root, path);
> + if (ret < 0) {
> + goto out;
> + } else if (ret > 0) {
> + ret = -EUCLEAN;
> + goto out;
> + }
> + }
> +
> + l = path->nodes[0];
> + slot = path->slots[0];
> +  

Re: [PATCH 04/10] locking: export osq_lock()/osq_unlock()

2018-05-18 Thread Peter Zijlstra
On Fri, May 18, 2018 at 07:32:05AM -0400, Kent Overstreet wrote:
> It does strike me that the whole optimistic spin algorithm
> (mutex_optimistic_spin() and rwsem_optimistic_spin()) are ripe for factoring
> out. They've been growing more optimizations I see, and the optimizations
> mostly aren't specific to either lock.

There's a few unfortunate differences between the two; but yes it would
be good if we could reduce some of the duplication found there.

One of the distinct differences is that mutex (now) has the lock state
and owner in a single atomic word, while rwsem still tracks the owner in
a separate word and thus needs to account for 'inconsistent' owner
state.

And then there's warts such as ww_mutex and rwsem_owner_is_reader and
similar.


Re: [PATCH 03/10] locking: bring back lglocks

2018-05-18 Thread Kent Overstreet
On Fri, May 18, 2018 at 01:03:39PM +0200, Peter Zijlstra wrote:
> On Fri, May 18, 2018 at 06:13:53AM -0400, Kent Overstreet wrote:
> > On Fri, May 18, 2018 at 11:51:02AM +0200, Peter Zijlstra wrote:
> > > On Fri, May 18, 2018 at 03:49:04AM -0400, Kent Overstreet wrote:
> > > > bcachefs makes use of them - also, add a proper lg_lock_init()
> > > 
> > > Why?! lglocks are horrid things, we got rid of them for a reason. They
> > > have terrifying worst case preemption off latencies.
> > 
> > Ah. That was missing from your commit message.
> 
> Yeah, sorry, sometimes it's hard to state what is obvious to oneself :/
> 
> > > Why can't you use something like per-cpu rwsems?
> > 
> > Well,
> > 
> >  a) in my use case, lg_global_lock() pretty much isn't used in normal 
> > operation,
> > it's only called when starting mark and sweep gc (which is not needed
> > anymore and disabled by default, it'll eventually get rolled into online
> > fsck) and for device resize
> > 
> >  b) I'm using it in conjunction with percpu counters, and technically yes I
> > certainly _could_ use per-cpu sleepable locks (mutexes would make more 
> > sense
> > for me than rwsems), there's a bit of a clash there and it's going to 
> > be a
> > bit ugly and messy and it's more work for me. (this_cpu_ptr() no longer
> > makes any sense in that case, so it'd mean auditing and converting all 
> > the
> > code that touches the relevant data structures).
> 
> Well, lg is a reader-writer style lock per definition, as you want
> concurrency on the local and full exclusion against the global, so I'm
> not sure how mutexes fit into this.
> 
> In any case, have a look at percpu_down_read_preempt_disable() and
> percpu_up_read_preempt_enable(); they're a bit of a hack but they should
> work for you I think.
> 
> They will sleep at down_read, but the entire actual critical section
> will be with preemption disabled -- therefore it had better be short and
> bounded, and the RT guys will thank you for not using spinlock_t under
> it (use raw_spinlock_t if you have to).
> 
> The (global) writer side will block and be preemptible like normal.
> 
> > If you're really that dead set against lglocks I might just come up with a 
> > new
> > lock with similar semantics, that doesn't break this_cpu_ptr() but sleeps 
> > if the
> > global lock is held...
> 
> See above, we already have this ;-)

Ok, I think this might work. I'll have to stare awhile and make sure I remember
everything I'm currently depending on the lglock for...


Re: [PATCH 04/10] locking: export osq_lock()/osq_unlock()

2018-05-18 Thread Kent Overstreet
On Fri, May 18, 2018 at 01:08:08PM +0200, Peter Zijlstra wrote:
> On Fri, May 18, 2018 at 06:18:04AM -0400, Kent Overstreet wrote:
> > On Fri, May 18, 2018 at 11:52:04AM +0200, Peter Zijlstra wrote:
> > > On Fri, May 18, 2018 at 03:49:06AM -0400, Kent Overstreet wrote:
> > > 
> > > No.. and most certainly not without a _very_ good reason.
> > 
> > Ok, can I ask why?
> 
> Because it is an internal helper for lock implementations that want to
> do optimistic spinning, it isn't a lock on its own and lacks several
> things you would expect.
> 
> Using it is tricky and I don't trust random module authors to get 1+1
> right, let alone use this thing correctly (no judgement on your code,
> just in general).

Yeah, that's true. I just modelled my usage on the rwsem code.

It does strike me that the whole optimistic spin algorithm
(mutex_optimistic_spin() and rwsem_optimistic_spin()) are ripe for factoring
out. They've been growing more optimizations I see, and the optimizations mostly
aren't specific to either lock.

> > Here's what it's for:
> 
> I'll try and have a look soon :-) But does that really _have_ to live in
> a module?

No, I'd be completely fine with moving six locks out of bcachefs, just don't
know that there'd be any other users. But I suppose we do have other filesystems
that use btrees, and that's what they're for.


Re: [PATCH 04/10] locking: export osq_lock()/osq_unlock()

2018-05-18 Thread Peter Zijlstra
On Fri, May 18, 2018 at 06:18:04AM -0400, Kent Overstreet wrote:
> On Fri, May 18, 2018 at 11:52:04AM +0200, Peter Zijlstra wrote:
> > On Fri, May 18, 2018 at 03:49:06AM -0400, Kent Overstreet wrote:
> > 
> > No.. and most certainly not without a _very_ good reason.
> 
> Ok, can I ask why?

Because it is an internal helper for lock implementations that want to
do optimistic spinning, it isn't a lock on its own and lacks several
things you would expect.

Using it is tricky and I don't trust random module authors to get 1+1
right, let alone use this thing correctly (no judgement on your code,
just in general).

> Here's what it's for:

I'll try and have a look soon :-) But does that really _have_ to live in
a module?


Re: [PATCH 03/10] locking: bring back lglocks

2018-05-18 Thread Peter Zijlstra
On Fri, May 18, 2018 at 06:13:53AM -0400, Kent Overstreet wrote:
> On Fri, May 18, 2018 at 11:51:02AM +0200, Peter Zijlstra wrote:
> > On Fri, May 18, 2018 at 03:49:04AM -0400, Kent Overstreet wrote:
> > > bcachefs makes use of them - also, add a proper lg_lock_init()
> > 
> > Why?! lglocks are horrid things, we got rid of them for a reason. They
> > have terrifying worst case preemption off latencies.
> 
> Ah. That was missing from your commit message.

Yeah, sorry, sometimes it's hard to state what is obvious to oneself :/

> > Why can't you use something like per-cpu rwsems?
> 
> Well,
> 
>  a) in my use case, lg_global_lock() pretty much isn't used in normal 
> operation,
> it's only called when starting mark and sweep gc (which is not needed
> anymore and disabled by default, it'll eventually get rolled into online
> fsck) and for device resize
> 
>  b) I'm using it in conjunction with percpu counters, and technically yes I
> certainly _could_ use per-cpu sleepable locks (mutexes would make more 
> sense
> for me than rwsems), there's a bit of a clash there and it's going to be a
> bit ugly and messy and it's more work for me. (this_cpu_ptr() no longer
> makes any sense in that case, so it'd mean auditing and converting all the
> code that touches the relevant data structures).

Well, lg is a reader-writer style lock per definition, as you want
concurrency on the local and full exclusion against the global, so I'm
not sure how mutexes fit into this.

In any case, have a look at percpu_down_read_preempt_disable() and
percpu_up_read_preempt_enable(); they're a bit of a hack but they should
work for you I think.

They will sleep at down_read, but the entire actual critical section
will be with preemption disabled -- therefore it had better be short and
bounded, and the RT guys will thank you for not using spinlock_t under
it (use raw_spinlock_t if you have to).

The (global) writer side will block and be preemptible like normal.

> If you're really that dead set against lglocks I might just come up with a new
> lock with similar semantics, that doesn't break this_cpu_ptr() but sleeps if 
> the
> global lock is held...

See above, we already have this ;-)


Re: [PATCH 04/10] locking: export osq_lock()/osq_unlock()

2018-05-18 Thread Kent Overstreet
On Fri, May 18, 2018 at 11:52:04AM +0200, Peter Zijlstra wrote:
> On Fri, May 18, 2018 at 03:49:06AM -0400, Kent Overstreet wrote:
> 
> No.. and most certainly not without a _very_ good reason.

Ok, can I ask why?

Here's what it's for:

commit 61782bf71eef83919af100a9747d8d86dfdf3605
Author: Kent Overstreet 
Date:   Fri May 18 06:14:56 2018 -0400

bcachefs: SIX locks (shared/intent/exclusive)

New lock for bcachefs, like read/write locks but with a third state,
intent.

Intent locks conflict with each other, but not with read locks; taking a
write lock requires first holding an intent lock.

diff --git a/fs/bcachefs/six.c b/fs/bcachefs/six.c
new file mode 100644
index 00..afa59a476a
--- /dev/null
+++ b/fs/bcachefs/six.c
@@ -0,0 +1,516 @@
+
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include "six.h"
+
+#define six_acquire(l, t)  lock_acquire(l, 0, t, 0, 0, NULL, _RET_IP_)
+#define six_release(l) lock_release(l, 0, _RET_IP_)
+
+struct six_lock_vals {
+   /* Value we add to the lock in order to take the lock: */
+   u64 lock_val;
+
+   /* If the lock has this value (used as a mask), taking the lock fails: */
+   u64 lock_fail;
+
+   /* Value we add to the lock in order to release the lock: */
+   u64 unlock_val;
+
+   /* Mask that indicates lock is held for this type: */
+   u64 held_mask;
+
+   /* Waitlist we wakeup when releasing the lock: */
+   enum six_lock_type  unlock_wakeup;
+};
+
+#define __SIX_LOCK_HELD_read   __SIX_VAL(read_lock, ~0)
+#define __SIX_LOCK_HELD_intent __SIX_VAL(intent_lock, ~0)
+#define __SIX_LOCK_HELD_write  __SIX_VAL(seq, 1)
+
+#define LOCK_VALS {\
+   [SIX_LOCK_read] = { \
+   .lock_val   = __SIX_VAL(read_lock, 1),  \
+   .lock_fail  = __SIX_LOCK_HELD_write,\
+   .unlock_val = -__SIX_VAL(read_lock, 1), \
+   .held_mask  = __SIX_LOCK_HELD_read, \
+   .unlock_wakeup  = SIX_LOCK_write,   \
+   },  \
+   [SIX_LOCK_intent] = {   \
+   .lock_val   = __SIX_VAL(intent_lock, 1),\
+   .lock_fail  = __SIX_LOCK_HELD_intent,   \
+   .unlock_val = -__SIX_VAL(intent_lock, 1),   \
+   .held_mask  = __SIX_LOCK_HELD_intent,   \
+   .unlock_wakeup  = SIX_LOCK_intent,  \
+   },  \
+   [SIX_LOCK_write] = {\
+   .lock_val   = __SIX_VAL(seq, 1),\
+   .lock_fail  = __SIX_LOCK_HELD_read, \
+   .unlock_val = __SIX_VAL(seq, 1),\
+   .held_mask  = __SIX_LOCK_HELD_write,\
+   .unlock_wakeup  = SIX_LOCK_read,\
+   },  \
+}
+
+static inline void six_set_owner(struct six_lock *lock, enum six_lock_type 
type,
+union six_lock_state old)
+{
+   if (type != SIX_LOCK_intent)
+   return;
+
+   if (!old.intent_lock) {
+   EBUG_ON(lock->owner);
+   lock->owner = current;
+   } else {
+   EBUG_ON(lock->owner != current);
+   }
+}
+
+static inline void six_clear_owner(struct six_lock *lock, enum six_lock_type 
type)
+{
+   if (type != SIX_LOCK_intent)
+   return;
+
+   EBUG_ON(lock->owner != current);
+
+   if (lock->state.intent_lock == 1)
+   lock->owner = NULL;
+}
+
+static __always_inline bool do_six_trylock_type(struct six_lock *lock,
+   enum six_lock_type type)
+{
+   const struct six_lock_vals l[] = LOCK_VALS;
+   union six_lock_state old;
+   u64 v = READ_ONCE(lock->state.v);
+
+   EBUG_ON(type == SIX_LOCK_write && lock->owner != current);
+
+   do {
+   old.v = v;
+
+   EBUG_ON(type == SIX_LOCK_write &&
+   ((old.v & __SIX_LOCK_HELD_write) ||
+!(old.v & __SIX_LOCK_HELD_intent)));
+
+   if (old.v & l[type].lock_fail)
+   return false;
+   } while ((v = atomic64_cmpxchg_acquire(&lock->state.counter,
+   old.v,
+   old.v + l[type].lock_val)) != old.v);
+
+   six_set_owner(lock, type, old);
+   return true;
+}
+
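The trylock fast path quoted above — a table of (lock_val, lock_fail) pairs plus a cmpxchg-acquire retry loop — can be mimicked with C11 atomics. The sketch below is a loose userspace illustration with an invented bit layout; the real six.c additionally tracks the lock owner, uses a sequence field for write locks (which must be preceded by an intent lock), and maintains waitlists, none of which appear here:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Loose mimic of the table-driven SIX trylock.  Bit layout is invented:
 * reader count in bits 0..23, intent count in bits 24..47, one write
 * bit at 48.
 */
enum six_type { SIX_READ, SIX_INTENT, SIX_WRITE };

#define READ_ONE	((uint64_t)1)
#define READ_MASK	((uint64_t)0xffffff)
#define INTENT_ONE	((uint64_t)1 << 24)
#define INTENT_MASK	(((uint64_t)0xffffff) << 24)
#define WRITE_BIT	((uint64_t)1 << 48)

struct vals { uint64_t lock_val, lock_fail; };

static const struct vals table[] = {
	[SIX_READ]   = { READ_ONE,   WRITE_BIT },   /* readers share; fail vs writer */
	[SIX_INTENT] = { INTENT_ONE, INTENT_MASK }, /* intent excludes other intent */
	[SIX_WRITE]  = { WRITE_BIT,  READ_MASK },   /* writer fails while readers held */
};

static bool six_trylock(_Atomic uint64_t *state, enum six_type t)
{
	uint64_t old = atomic_load_explicit(state, memory_order_relaxed);

	do {
		if (old & table[t].lock_fail)
			return false;
		/* on CAS failure, 'old' is reloaded with the current value */
	} while (!atomic_compare_exchange_weak_explicit(state, &old,
			old + table[t].lock_val,
			memory_order_acquire, memory_order_relaxed));
	return true;
}
```

The table is what makes the third lock state cheap: all three acquisition modes share one loop, and only the data (add-value and conflict mask) differs.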

Re: [PATCH 03/10] locking: bring back lglocks

2018-05-18 Thread Kent Overstreet
On Fri, May 18, 2018 at 11:51:02AM +0200, Peter Zijlstra wrote:
> On Fri, May 18, 2018 at 03:49:04AM -0400, Kent Overstreet wrote:
> > bcachefs makes use of them - also, add a proper lg_lock_init()
> 
> Why?! lglocks are horrid things, we got rid of them for a reason. They
> have terrifying worst case preemption off latencies.

Ah. That was missing from your commit message.

> Why can't you use something like per-cpu rwsems?

Well,

 a) in my use case, lg_global_lock() pretty much isn't used in normal operation,
it's only called when starting mark and sweep gc (which is not needed
anymore and disabled by default, it'll eventually get rolled into online
fsck) and for device resize

 b) I'm using it in conjunction with percpu counters, and technically yes I
certainly _could_ use per-cpu sleepable locks (mutexes would make more sense
for me than rwsems), there's a bit of a clash there and it's going to be a
bit ugly and messy and it's more work for me. (this_cpu_ptr() no longer
makes any sense in that case, so it'd mean auditing and converting all the
code that touches the relevant data structures).

If you're really that dead set against lglocks I might just come up with a new
lock with similar semantics, that doesn't break this_cpu_ptr() but sleeps if the
global lock is held...


[PATCH] btrfs-progs: Add DEBUG_CFLAGS_INTERNAL for libbtrfsutil

2018-05-18 Thread Gu Jinxiang
From: Gu JinXiang 

Add DEBUG_CFLAGS_INTERNAL to LIBBTRFSUTIL_CFLAGS for libbtrfsutil's
build.

Signed-off-by: Gu JinXiang 
---
 Makefile | 1 +
 1 file changed, 1 insertion(+)

diff --git a/Makefile b/Makefile
index cbd85533..1e38a56f 100644
--- a/Makefile
+++ b/Makefile
@@ -85,6 +85,7 @@ LIBBTRFSUTIL_CFLAGS = $(SUBST_CFLAGS) \
  -fvisibility=hidden \
  -I$(TOPDIR)/libbtrfsutil \
  $(EXTRAWARN_CFLAGS) \
+ $(DEBUG_CFLAGS_INTERNAL) \
  $(EXTRA_CFLAGS)
 
 LDFLAGS = $(SUBST_LDFLAGS) \
-- 
2.17.0





Re: off-by-one uncompressed invalid ram_bytes corruptions

2018-05-18 Thread Qu Wenruo


On 2018-05-18 17:42, james harvey wrote:
> On Fri, May 18, 2018 at 1:49 AM, Qu Wenruo  wrote:
>> And btrfs check doesn't report the same problem as the default original
>> mode doesn't have such check.
>>
>> Please also post the result of "btrfs check --mode=lowmem /dev/sda1"
> 
> Are you saying "--mode=lowmem" does more checks than without it?

Sometimes it does more checks.

>  "man
> btrfs check" says it's experimental and the difference is just
> original is unoptimized regarding memory consumption and can run out
> of memory, and low memory addresses this with increased IO cost from
> re-reading blocks increasing run time.  It doesn't indicate lowmem is
> a better check.

Well, since the original mode and lowmem mode check in completely
different ways, you had better consider lowmem mode a complete rework.
Thus it will sometimes produce different results. (Although most of the
time it is lowmem causing false alerts.)

Here in this particular case, lowmem does indeed do extra checks.

And overall, lowmem mode provides more human readable error output.

Thanks,
Qu





Re: [PATCH 04/10] locking: export osq_lock()/osq_unlock()

2018-05-18 Thread Peter Zijlstra
On Fri, May 18, 2018 at 03:49:06AM -0400, Kent Overstreet wrote:

No.. and most certainly not without a _very_ good reason.


Re: [PATCH 03/10] locking: bring back lglocks

2018-05-18 Thread Peter Zijlstra
On Fri, May 18, 2018 at 03:49:04AM -0400, Kent Overstreet wrote:
> bcachefs makes use of them - also, add a proper lg_lock_init()

Why?! lglocks are horrid things, we got rid of them for a reason. They
have terrifying worst case preemption off latencies.

Why can't you use something like per-cpu rwsems?


Re: off-by-one uncompressed invalid ram_bytes corruptions

2018-05-18 Thread james harvey
On Fri, May 18, 2018 at 1:49 AM, Qu Wenruo  wrote:
> And btrfs check doesn't report the same problem as the default original
> mode doesn't have such check.
>
> Please also post the result of "btrfs check --mode=lowmem /dev/sda1"

Are you saying "--mode=lowmem" does more checks than without it?  "man
btrfs check" says it's experimental and the difference is just
original is unoptimized regarding memory consumption and can run out
of memory, and low memory addresses this with increased IO cost from
re-reading blocks increasing run time.  It doesn't indicate lowmem is
a better check.


Re: btrfs device replace can cause silent or noisy corruption on compressed NOCOW/NODATASUM

2018-05-18 Thread james harvey
On Thu, May 17, 2018 at 5:46 AM, james harvey  wrote:
> ...
> I of course don't know the extent of this.  I don't know all of the
> situations where NOCOW/NODATASUM extents are compressed anyway.  In my
> real world case, it was journald logs.  We know journald/systemd
> submits those for defragmentation.  I haven't verified if it submits
> the defragmentation asking for compression.  In my reproducing example
> linked above, I had to defragment the file asking for compression to
> cause the file to be compressed.  If that's the extent of the bug,
> probably lots of journald logs out there that have been through a
> replace have corruption, but hopefully no databases.  I don't think
> any databases, and not many database administrators, are going to
> submit the files for defragmentation with compression.  But, if
> compression can be triggered in more situations than this, it's
> possible there's a lot of corruption (sometimes silent) out there on
> important things like databases.
> ...

I hoped to show systemd asked for its journal files to be compressed
when it asked them to be defragmented.  That would limit the exposure
to programs that asked for compressed NOCOW, which hopefully would not
be many.

But, systemd doesn't do this.

The most it does is try turning *off* NOCOW when it's rotating a file
and won't write to it ever again, but it fails doing that without
realizing it, since the file isn't empty.

I'm confused about why its journal files get compressed while still being NOCOW.

systemd sets NOCOW for new journal files.  [1]

There's only 2 places I can see where systemd changes the NOCOW
attribute or asks for defragmentation.  It doesn't do anything with
the lowercase 'c' attribute explicitly asking for compression (ioctl
FS_COMPR_FL.)

src/journal/journal-file.c::journal_file_open_reliably()::3555 - When
opening a journal file, if its journal format is corrupted, set NOCOW
and defrag it before trying one more time.

src/journal/journal-file.c::journal_file_close()::378 - When closing a
journal file marked defrag on close: turn *off* NOCOW and defrag it.
The only time a journal file is marked defrag on close is at
journal_file_rotate()::3484, for the old file being rotated out.  When
rotating out, it's done by renaming the file at :3466, not by copying
the contents to a new file and blanking the existing one.

These places set or turn off NOCOW by ultimately calling "ioctl(fd,
FS_IOC_SETFLAGS" and appropriately handling FS_NOCOW_FL.

These places defrag by ultimately calling "ioctl(fd, BTRFS_IOC_DEFRAG, NULL)".

To request defragmentation with decompression, I believe
BTRFS_IOC_DEFRAG_RANGE has to be used instead so the range struct can
be used with a compression type.  And, systemd doesn't seem to be
calling the _RANGE variant anywhere, so it's not directly asking for
compression.
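For contrast with the plain `ioctl(fd, BTRFS_IOC_DEFRAG, NULL)` call, a defrag that explicitly requests compression goes through BTRFS_IOC_DEFRAG_RANGE with a filled range struct. The struct and constants below are mirrored from include/uapi/linux/btrfs.h so the sketch stands alone; `fill_compress_defrag()` is a made-up helper, and whether to also set BTRFS_DEFRAG_RANGE_START_IO (to kick off writeback) is left to the caller:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Mirrored from include/uapi/linux/btrfs.h for self-containment */
#define BTRFS_DEFRAG_RANGE_COMPRESS	1
#define BTRFS_DEFRAG_RANGE_START_IO	2
#define BTRFS_COMPRESS_ZLIB		1

struct btrfs_ioctl_defrag_range_args {
	uint64_t start;		/* first byte to defrag */
	uint64_t len;		/* length; (uint64_t)-1 means to the end */
	uint64_t flags;
	uint32_t extent_thresh;	/* skip extents larger than this; 0 = default */
	uint32_t compress_type;	/* only meaningful with ..._COMPRESS set */
	uint32_t unused[4];
};

/*
 * Made-up helper: fill args the way "btrfs fi defrag -czlib" asks the
 * kernel to rewrite and compress a whole file.  The actual call would
 * then be ioctl(fd, BTRFS_IOC_DEFRAG_RANGE, &args) on an open fd.
 */
static void fill_compress_defrag(struct btrfs_ioctl_defrag_range_args *args)
{
	memset(args, 0, sizeof(*args));
	args->len = (uint64_t)-1;
	args->flags = BTRFS_DEFRAG_RANGE_COMPRESS;
	args->compress_type = BTRFS_COMPRESS_ZLIB;
}
```

This makes the observation above concrete: a caller using the NULL-argument BTRFS_IOC_DEFRAG variant never gets the chance to pass this struct, so it never explicitly requests compression.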

It tries to be helpful by thinking it's turning off NOCOW once it's
done forever writing to a file, not understanding that since it's a
non-empty file, you can't do that this way.

The only thing I could come up with was that ioctl might be turning
off NOCOW on a non-empty file where "chattr -C" wasn't, but I wrote a
c program that uses systemd's exact chattr_fd() that in turn uses
"ioctl(fd, FS_IOC_SETFLAGS", and it leaves the NOCOW attribute on as
expected.

Unless I'm missing something, systemd triggers NOCOW compression
without explicitly asking for it, in the C-program equivalent of
running "btrfs fi def -c" on a NOCOW file.

If that's correct, that means the number of programs that cause
compressed NOCOW is probably larger than just systemd.



[1] Started in systemd 219, and as of 220, it's done in a way where
the user can modify tmpfiles.d/journal-nocow.conf to change this
behavior.  See systemd NEWS, and
http://systemd-devel.freedesktop.narkive.com/ii8vxOoW/patch-v3-allow-systemd-tmpfiles-to-set-file-directory-attributes


Re: btrfs device replace can cause silent or noisy corruption on compressed NOCOW/NODATASUM

2018-05-18 Thread james harvey
On Thu, May 17, 2018 at 5:46 AM, james harvey  wrote:
> ...
> In short, "btrfs device replace" caused it...
> ...

This should read "btrfs replace".

Ran a 2-year-old ISO, archlinux-2016.04.01-dual.iso.  Kernel 4.4.5,
btrfs-progs v4.4.1.  Same behavior as in my example, so this is not a
regression (at least within the last 2 years).

(Back on current kernel/btrfs...)

Confirmed same behavior using existing device to replace rather than
id, adding a 4th partition to the disk in my example:

# btrfs replace start -B /dev/vda1 /dev/vda4 /mnt
-or even-
# btrfs replace start -Br /dev/vda1 /dev/vda1 /mnt

I realized that in my original example I didn't run "filefrag -v"
again to make sure the physical offsets didn't change.  But I tried
the example again with it, and replace didn't change them.

There's a workaround to use before a fix is available, and for anyone
who can't upgrade to a new version once the fix is available:

# btrfs device add /dev/vda4 /mnt
# btrfs device remove /dev/vda1 /mnt

This does more than just make new mirrored copies on /dev/vda4 of the
data that was on /dev/vda1.  At least in my run of this example, it
moves the mirrored copies for the file from physical offsets starting
269952 to 531904, from device ids 1&2 to 3&4.  So, it might take a bit
longer.  But, who cares, it leaves the data properly mirrored and in
compressed form.



I'll also note that even before any of the patches in the past few
days, this is sufficient to re-write the file uncompressed, respecting
its NOCOW attribute:

# cp zero /root/zero
# cp /root/zero zero
# sync


Re: [PATCH v2 0/3] btrfs: add read mirror policy

2018-05-18 Thread Anand Jain



Thanks Austin and Jeff for the suggestion.

I am not particularly a fan of a mount option either, mainly because
those options aren't persistent, and host-independent LUNs would have
a tough time keeping them synchronized manually.

Properties are better as they are persistent. And we can apply this
read_mirror_policy property to the fsid object.

But if we are talking about properties, then they could be stored
as extended attributes or as an on-disk key-value pair, and I doubt
an on-disk key-value pair will get a nod.
I can explore the extended-attribute approach, but would appreciate
more comments.

-Anand


On 05/17/2018 10:46 PM, Jeff Mahoney wrote:

On 5/17/18 8:25 AM, Austin S. Hemmelgarn wrote:

On 2018-05-16 22:32, Anand Jain wrote:



On 05/17/2018 06:35 AM, David Sterba wrote:

On Wed, May 16, 2018 at 06:03:56PM +0800, Anand Jain wrote:

Not yet ready for the integration. As I need to introduce
-o no_read_mirror_policy instead of -o read_mirror_policy=-


A mount option is most likely not the right interface for setting such
options, as usual.


   I am OK with making it an ioctl for the final version. What do you think?


   But to reproduce the bug posted in
     "Btrfs: fix the corruption by reading stale btree blocks"
   it needs to be a mount option, as the pid can otherwise randomly
   still pick the disk specified in the mount option.


Personally, I'd vote for filesystem property (thus handled through the
standard `btrfs property` command) that can be overridden by a mount
option.  With that approach, no new tool (or change to an existing tool)
would be needed, existing volumes could be converted to use it in a
backwards compatible manner (old kernels would just ignore the
property), and you could still have the behavior you want in tests (and
in theory it could easily be adapted to be a per-subvolume setting if we
ever get per-subvolume chunk profile support).


Properties are a combination of interfaces presented through a single
command.  Although the kernel API would allow a direct-to-property
interface via the btrfs.* extended attributes, those are currently
limited to a single inode.  The label property is set via ioctl and
stored in the superblock.  The read-only subvolume property is also set
by ioctl but stored in the root flags.

As it stands, every property is explicitly defined in the tools, so any
addition would require tools changes.  This is a bigger discussion,
though.  We *could* use the xattr interface to access per-root or
fs-global properties, but we'd need to define that interface.
btrfs_listxattr could get interesting, though I suppose we could
simplify it by only allowing the per-subvolume and fs-global operations
on root inodes.

-Jeff




Re: [PATCH 00/10] RFC: assorted bcachefs patches

2018-05-18 Thread Kent Overstreet
shit, git send-email pebkac error - disregard all the block patches in the
thread


[PATCH 00/10] RFC: assorted bcachefs patches

2018-05-18 Thread Kent Overstreet
These are all the remaining patches in my bcachefs tree that touch stuff outside
fs/bcachefs. Not all of them are suitable for inclusion as is; I wanted to get
some discussion first.

 * pagecache add lock

This is the only one that touches existing code in nontrivial ways.  The problem
it's solving is that there is no existing general mechanism for shooting down
pages in the page cache and keeping them removed, which is a real problem if
you're doing anything that modifies file data and isn't buffered writes.

Historically, the only problematic case has been direct IO, and people have been
willing to say "well, if you mix buffered and direct IO you get what you
deserve", and that's probably not unreasonable. But now we have fallocate insert
range and collapse range, and those are broken in ways I frankly don't want to
think about if they can't ensure consistency with the page cache.

Also, the mechanism truncate uses (i_size and sacrificing a goat) has
historically been rather fragile; IMO it might be a good thing if we switched it
to a more general, rigorous mechanism.

I need this solved for bcachefs because without this mechanism, the page cache
inconsistencies lead to various assertions popping (primarily when we didn't
think we need to get a disk reservation going by page cache state, but then do
the actual write and disk space accounting says oops, we did need one). And
having to reason about what can happen without a locking mechanism for this is
not something I care to spend brain cycles on.

That said, my patch is kind of ugly, and it requires filesystem changes for
other filesystems to take advantage of it. And unfortunately, since one of the
code paths that needs locking is readahead, I don't see any realistic way of
implementing the locking within just bcachefs code.

So I'm hoping someone has an idea for something cleaner (I think I recall
Matthew Wilcox saying he had an idea for how to use xarray to solve this), but
if not I'll polish up my pagecache add lock patch and see what I can do to make
it less ugly, and hopefully other people find it palatable or at least useful.
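To make the lock's semantics concrete, here is a userspace toy model of the patch's two-state lock. This is not the kernel code: C11 atomics stand in for atomic_long_t, and there is no wait queue, so it only demonstrates the tryget/put state machine:

```c
#include <stdatomic.h>
#include <stdbool.h>

/*
 * Two-state shared lock, as in the patch: v > 0 means held for "add"
 * (shared among adders), v < 0 means held for "block" (shared among
 * blockers); the two modes exclude each other.  i is +1 for add mode,
 * -1 for block mode.
 */
struct pagecache_lock {
    atomic_long v;
};

static void lock_init(struct pagecache_lock *lock)
{
    atomic_init(&lock->v, 0);
}

static bool lock_tryget(struct pagecache_lock *lock, long i)
{
    long v = atomic_load(&lock->v);

    do {
        /* Taking one mode fails while the other mode is held. */
        if (i > 0 ? v < 0 : v > 0)
            return false;
        /* On CAS failure, v is reloaded and the mode check re-runs. */
    } while (!atomic_compare_exchange_weak(&lock->v, &v, v + i));
    return true;
}

static void lock_put(struct pagecache_lock *lock, long i)
{
    /* Drop one reference; the real code wakes waiters when v hits 0. */
    atomic_fetch_sub(&lock->v, i);
}
```

The kernel version adds a wait queue so a conflicting caller sleeps instead of spinning, plus the task_struct back-pointer to avoid recursing in the fault path.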

 * lglocks

They were removed by Peter Zijlstra when the last in-kernel user was removed,
but I've found them useful. His commit message seems to imply he doesn't think
people should be using them, but I'm not sure why. They are a bit niche, though;
I can move them to fs/bcachefs if people would prefer.

 * Generic radix trees

This is a very simple radix tree implementation that can store types of
arbitrary size, not just pointers/unsigned long. It could probably replace
flex arrays.
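The core of the indexing is a small, pure calculation; here is a userspace restatement of the patch's __idx_to_offset(), with NODE_SIZE standing in for PAGE_SIZE:

```c
#include <stddef.h>
#include <stdbool.h>

#define NODE_SIZE 4096  /* stands in for PAGE_SIZE; every tree node is one page */

static bool is_power_of_2(size_t n)
{
    return n && !(n & (n - 1));
}

/* Mirrors the patch's __idx_to_offset(): objects whose size doesn't divide
 * the node evenly are packed a whole number per node, wasting the tail. */
static size_t idx_to_offset(size_t idx, size_t obj_size)
{
    if (!is_power_of_2(obj_size)) {
        size_t objs_per_node = NODE_SIZE / obj_size;

        return (idx / objs_per_node) * NODE_SIZE +
               (idx % objs_per_node) * obj_size;
    }
    return idx * obj_size;
}
```

Packing whole objects per node means no object ever straddles a node boundary, which is what lets the accessors hand back direct pointers into the tree.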

 * Dynamic fault injection

I've actually had this code sitting in my tree since forever... I know we have
an existing fault injection framework, but I think this one is quite a bit nicer
to actually use.

It works very much like the dynamic debug infrastructure - for those who aren't
familiar, dynamic debug makes it so you can list and individually enable/disable
every pr_debug() callsite in debugfs.

So to add a fault injection site with this, you just stick a call to
dynamic_fault("foobar") somewhere in your code - dynamic_fault() returns true if
you should fail whatever it is you're testing. And then it'll show up in
debugfs, where you can enable/disable faults by file/linenumber, module, name,
etc.

The patch then also adds macros that wrap all the various memory allocation
functions and fail if dynamic_fault("memory") returns true - which means you can
see in debugfs every place you're allocating memory and fail all of them or just
individually (I have tests that iterate over all the faults and flip them on one
by one). I also use it in bcachefs to add fault injection points for uncommon
error paths in the filesystem startup/recovery path, and for various hard to
test slowpaths that only happen if we race in weird ways (race_fault()).
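Here is a userspace toy model of the idea. It is not the patch's implementation (which registers call sites in a dedicated ELF section and flips them through debugfs); it just illustrates the named-fault-site pattern, with a hypothetical my_kmalloc() wrapper as the consumer:

```c
#include <stdlib.h>
#include <string.h>
#include <stddef.h>
#include <stdbool.h>

/* One descriptor per named fault injection site. */
struct fault_site {
    const char *name;
    bool        enabled;
    unsigned    hits;    /* how often the fault actually fired */
};

static struct fault_site sites[] = {
    { "memory", false, 0 },
    { "io",     false, 0 },
};

/* Returns true when the named site should inject a failure. */
static bool dynamic_fault(const char *name)
{
    for (size_t i = 0; i < sizeof(sites) / sizeof(sites[0]); i++) {
        if (strcmp(sites[i].name, name) == 0 && sites[i].enabled) {
            sites[i].hits++;
            return true;
        }
    }
    return false;
}

/* Stand-in for the debugfs control interface. */
static void fault_enable(const char *name, bool on)
{
    for (size_t i = 0; i < sizeof(sites) / sizeof(sites[0]); i++)
        if (strcmp(sites[i].name, name) == 0)
            sites[i].enabled = on;
}

/* Example consumer: an allocation wrapper that fails while "memory" is on. */
static void *my_kmalloc(size_t size)
{
    if (dynamic_fault("memory"))
        return NULL;
    return malloc(size);
}
```

A test harness can then iterate over the site table, enable each fault in turn, and check that every error path is exercised, which is the workflow described above.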

Kent Overstreet (10):
  mm: pagecache add lock
  mm: export find_get_pages()
  locking: bring back lglocks
  locking: export osq_lock()/osq_unlock()
  don't use spin_lock_irqsave() unnecessarily
  Generic radix trees
  bcache: optimize continue_at_nobarrier()
  bcache: move closures to lib/
  closures: closure_wait_event()
  Dynamic fault injection

 drivers/md/bcache/Kconfig |  10 +-
 drivers/md/bcache/Makefile|   6 +-
 drivers/md/bcache/bcache.h|   2 +-
 drivers/md/bcache/super.c |   1 -
 drivers/md/bcache/util.h  |   3 +-
 fs/inode.c|   1 +
 include/asm-generic/vmlinux.lds.h |   4 +
 .../md/bcache => include/linux}/closure.h |  50 +-
 include/linux/dynamic_fault.h | 117 +++
 include/linux/fs.h|  23 +
 include/linux/generic-radix-tree.h| 131 +++
 include/linux/lglock.h|  97 +++
 include/linux/sched.h |   4 +
 init/init_task.c  |  

[PATCH 01/10] mm: pagecache add lock

2018-05-18 Thread Kent Overstreet
Add a per address space lock around adding pages to the pagecache - making it
possible for fallocate INSERT_RANGE/COLLAPSE_RANGE to work correctly, and also
hopefully making truncate and dio a bit saner.

Signed-off-by: Kent Overstreet 
---
 fs/inode.c|  1 +
 include/linux/fs.h| 23 +++
 include/linux/sched.h |  4 ++
 init/init_task.c  |  1 +
 mm/filemap.c  | 91 ---
 5 files changed, 115 insertions(+), 5 deletions(-)

diff --git a/fs/inode.c b/fs/inode.c
index ef362364d3..e7aaa39adb 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -350,6 +350,7 @@ void address_space_init_once(struct address_space *mapping)
 {
memset(mapping, 0, sizeof(*mapping));
INIT_RADIX_TREE(&mapping->page_tree, GFP_ATOMIC | __GFP_ACCOUNT);
+   pagecache_lock_init(&mapping->add_lock);
spin_lock_init(&mapping->tree_lock);
init_rwsem(&mapping->i_mmap_rwsem);
INIT_LIST_HEAD(&mapping->private_list);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index c6baf76761..18d2886a44 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -388,9 +388,32 @@ int pagecache_write_end(struct file *, struct 
address_space *mapping,
loff_t pos, unsigned len, unsigned copied,
struct page *page, void *fsdata);
 
+/*
+ * Two-state lock - can be taken for add or block - both states are shared,
+ * like read side of rwsem, but conflict with other state:
+ */
+struct pagecache_lock {
+   atomic_long_t   v;
+   wait_queue_head_t   wait;
+};
+
+static inline void pagecache_lock_init(struct pagecache_lock *lock)
+{
+   atomic_long_set(&lock->v, 0);
+   init_waitqueue_head(&lock->wait);
+}
+
+void pagecache_add_put(struct pagecache_lock *);
+void pagecache_add_get(struct pagecache_lock *);
+void __pagecache_block_put(struct pagecache_lock *);
+void __pagecache_block_get(struct pagecache_lock *);
+void pagecache_block_put(struct pagecache_lock *);
+void pagecache_block_get(struct pagecache_lock *);
+
 struct address_space {
struct inode*host;  /* owner: inode, block_device */
struct radix_tree_root  page_tree;  /* radix tree of all pages */
+   struct pagecache_lock   add_lock;   /* protects adding new pages */
spinlock_t  tree_lock;  /* and lock protecting it */
atomic_ti_mmap_writable;/* count VM_SHARED mappings */
struct rb_root_cached   i_mmap; /* tree of private and shared 
mappings */
diff --git a/include/linux/sched.h b/include/linux/sched.h
index b161ef8a90..e58465f61a 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -40,6 +40,7 @@ struct io_context;
 struct mempolicy;
 struct nameidata;
 struct nsproxy;
+struct pagecache_lock;
 struct perf_event_context;
 struct pid_namespace;
 struct pipe_inode_info;
@@ -865,6 +866,9 @@ struct task_struct {
unsigned intin_ubsan;
 #endif
 
+   /* currently held lock, for avoiding recursing in fault path: */
+   struct pagecache_lock *pagecache_lock;
+
/* Journalling filesystem info: */
void*journal_info;
 
diff --git a/init/init_task.c b/init/init_task.c
index 3ac6e754cf..308d46eef9 100644
--- a/init/init_task.c
+++ b/init/init_task.c
@@ -106,6 +106,7 @@ struct task_struct init_task
},
.blocked= {{0}},
.alloc_lock = __SPIN_LOCK_UNLOCKED(init_task.alloc_lock),
+   .pagecache_lock = NULL,
.journal_info   = NULL,
INIT_CPU_TIMERS(init_task)
.pi_lock= __RAW_SPIN_LOCK_UNLOCKED(init_task.pi_lock),
diff --git a/mm/filemap.c b/mm/filemap.c
index 693f62212a..31dd888785 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -111,6 +111,73 @@
  *   ->tasklist_lock(memory_failure, collect_procs_ao)
  */
 
+static void __pagecache_lock_put(struct pagecache_lock *lock, long i)
+{
+   BUG_ON(atomic_long_read(&lock->v) == 0);
+
+   if (atomic_long_sub_return_release(i, &lock->v) == 0)
+   wake_up_all(&lock->wait);
+}
+
+static bool __pagecache_lock_tryget(struct pagecache_lock *lock, long i)
+{
+   long v = atomic_long_read(&lock->v), old;
+
+   do {
+   old = v;
+
+   if (i > 0 ? v < 0 : v > 0)
+   return false;
+   } while ((v = atomic_long_cmpxchg_acquire(&lock->v,
+   old, old + i)) != old);
+   return true;
+}
+
+static void __pagecache_lock_get(struct pagecache_lock *lock, long i)
+{
+   wait_event(lock->wait, __pagecache_lock_tryget(lock, i));
+}
+
+void pagecache_add_put(struct pagecache_lock *lock)
+{
+   __pagecache_lock_put(lock, 1);
+}
+EXPORT_SYMBOL(pagecache_add_put);
+
+void pagecache_add_get(struct pagecache_lock *lock)
+{
+   __pagecache_lock_get(lock, 1);
+}
+EXPORT_SYMBOL(pagecache_add_get);
+
+void __pagecache_block_put(struct pagecache_lock *lock)
+{
+   

[PATCH 02/10] mm: export find_get_pages()

2018-05-18 Thread Kent Overstreet
Signed-off-by: Kent Overstreet 
---
 mm/filemap.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/mm/filemap.c b/mm/filemap.c
index 31dd888785..78b99019bf 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1845,6 +1845,7 @@ unsigned find_get_pages_range(struct address_space 
*mapping, pgoff_t *start,
 
return ret;
 }
+EXPORT_SYMBOL(find_get_pages_range);
 
 /**
  * find_get_pages_contig - gang contiguous pagecache lookup
-- 
2.17.0



[PATCH 03/10] locking: bring back lglocks

2018-05-18 Thread Kent Overstreet
bcachefs makes use of them - also, add a proper lg_lock_init()

Signed-off-by: Kent Overstreet 
---
 include/linux/lglock.h  |  97 +
 kernel/locking/Makefile |   1 +
 kernel/locking/lglock.c | 105 
 3 files changed, 203 insertions(+)
 create mode 100644 include/linux/lglock.h
 create mode 100644 kernel/locking/lglock.c

diff --git a/include/linux/lglock.h b/include/linux/lglock.h
new file mode 100644
index 00..c1fbe64bd2
--- /dev/null
+++ b/include/linux/lglock.h
@@ -0,0 +1,97 @@
+/*
+ * Specialised local-global spinlock. Can only be declared as global variables
+ * to avoid overhead and keep things simple (and we don't want to start using
+ * these inside dynamically allocated structures).
+ *
+ * "local/global locks" (lglocks) can be used to:
+ *
+ * - Provide fast exclusive access to per-CPU data, with exclusive access to
+ *   another CPU's data allowed but possibly subject to contention, and to
+ *   provide very slow exclusive access to all per-CPU data.
+ * - Or to provide very fast and scalable read serialisation, and to provide
+ *   very slow exclusive serialisation of data (not necessarily per-CPU data).
+ *
+ * Brlocks are also implemented as a short-hand notation for the latter use
+ * case.
+ *
+ * Copyright 2009, 2010, Nick Piggin, Novell Inc.
+ */
+#ifndef __LINUX_LGLOCK_H
+#define __LINUX_LGLOCK_H
+
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#ifdef CONFIG_SMP
+
+struct lglock {
+   arch_spinlock_t __percpu *lock;
+#ifdef CONFIG_DEBUG_LOCK_ALLOC
+   struct lockdep_mapdep_map;
+#endif
+};
+
+#define DEFINE_LGLOCK(name)\
+   static DEFINE_PER_CPU(arch_spinlock_t, name ## _lock)   \
+   = __ARCH_SPIN_LOCK_UNLOCKED;\
   struct lglock name = { .lock = &name ## _lock }
+
+#define DEFINE_STATIC_LGLOCK(name) \
+   static DEFINE_PER_CPU(arch_spinlock_t, name ## _lock)   \
+   = __ARCH_SPIN_LOCK_UNLOCKED;\
   static struct lglock name = { .lock = &name ## _lock }
+
+static inline void lg_lock_free(struct lglock *lg)
+{
+   free_percpu(lg->lock);
+}
+
+#define lg_lock_lockdep_init(lock) \
+do {   \
+   static struct lock_class_key __key; \
+   \
+   lockdep_init_map(&(lock)->dep_map, #lock, &__key, 0);   \
+} while (0)
+
+static inline int __lg_lock_init(struct lglock *lg)
+{
+   lg->lock = alloc_percpu(*lg->lock);
+   return lg->lock ? 0 : -ENOMEM;
+}
+
+#define lg_lock_init(lock) \
+({ \
+   lg_lock_lockdep_init(lock); \
+   __lg_lock_init(lock);   \
+})
+
+void lg_local_lock(struct lglock *lg);
+void lg_local_unlock(struct lglock *lg);
+void lg_local_lock_cpu(struct lglock *lg, int cpu);
+void lg_local_unlock_cpu(struct lglock *lg, int cpu);
+
+void lg_double_lock(struct lglock *lg, int cpu1, int cpu2);
+void lg_double_unlock(struct lglock *lg, int cpu1, int cpu2);
+
+void lg_global_lock(struct lglock *lg);
+void lg_global_unlock(struct lglock *lg);
+
+#else
+/* When !CONFIG_SMP, map lglock to spinlock */
+#define lglock spinlock
+#define DEFINE_LGLOCK(name) DEFINE_SPINLOCK(name)
+#define DEFINE_STATIC_LGLOCK(name) static DEFINE_SPINLOCK(name)
+#define lg_lock_init(lg)   ({ spin_lock_init(lg); 0; })
+#define lg_lock_free(lg)   do {} while (0)
+#define lg_local_lock spin_lock
+#define lg_local_unlock spin_unlock
+#define lg_local_lock_cpu(lg, cpu) spin_lock(lg)
+#define lg_local_unlock_cpu(lg, cpu) spin_unlock(lg)
+#define lg_global_lock spin_lock
+#define lg_global_unlock spin_unlock
+#endif
+
+#endif
diff --git a/kernel/locking/Makefile b/kernel/locking/Makefile
index 392c7f23af..e5bb62823d 100644
--- a/kernel/locking/Makefile
+++ b/kernel/locking/Makefile
@@ -19,6 +19,7 @@ obj-$(CONFIG_LOCKDEP) += lockdep_proc.o
 endif
 obj-$(CONFIG_SMP) += spinlock.o
 obj-$(CONFIG_LOCK_SPIN_ON_OWNER) += osq_lock.o
+obj-$(CONFIG_SMP) += lglock.o
 obj-$(CONFIG_PROVE_LOCKING) += spinlock.o
 obj-$(CONFIG_QUEUED_SPINLOCKS) += qspinlock.o
 obj-$(CONFIG_RT_MUTEXES) += rtmutex.o
diff --git a/kernel/locking/lglock.c b/kernel/locking/lglock.c
new file mode 100644
index 00..051feaccc4
--- /dev/null
+++ b/kernel/locking/lglock.c
@@ -0,0 +1,105 @@
+/* See include/linux/lglock.h for description */
+#include 
+#include 
+#include 
+#include 
+
+/*
+ * Note there is no uninit, so lglocks cannot be defined in
+ * modules (but it's fine to use them from there)
+ * Could be added 

[PATCH 04/10] locking: export osq_lock()/osq_unlock()

2018-05-18 Thread Kent Overstreet
Signed-off-by: Kent Overstreet 
---
 kernel/locking/osq_lock.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
index 6ef600aa0f..dfa71347b0 100644
--- a/kernel/locking/osq_lock.c
+++ b/kernel/locking/osq_lock.c
@@ -202,6 +202,7 @@ bool osq_lock(struct optimistic_spin_queue *lock)
 
return false;
 }
+EXPORT_SYMBOL_GPL(osq_lock);
 
 void osq_unlock(struct optimistic_spin_queue *lock)
 {
@@ -229,3 +230,4 @@ void osq_unlock(struct optimistic_spin_queue *lock)
if (next)
WRITE_ONCE(next->locked, 1);
 }
+EXPORT_SYMBOL_GPL(osq_unlock);
-- 
2.17.0



[PATCH 05/10] don't use spin_lock_irqsave() unnecessarily

2018-05-18 Thread Kent Overstreet
Signed-off-by: Kent Overstreet 
---
 mm/page-writeback.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 586f31261c..17ccc294c9 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -2460,20 +2460,19 @@ int __set_page_dirty_nobuffers(struct page *page)
lock_page_memcg(page);
if (!TestSetPageDirty(page)) {
struct address_space *mapping = page_mapping(page);
-   unsigned long flags;
 
if (!mapping) {
unlock_page_memcg(page);
return 1;
}
 
-   spin_lock_irqsave(&mapping->tree_lock, flags);
+   spin_lock_irq(&mapping->tree_lock);
BUG_ON(page_mapping(page) != mapping);
WARN_ON_ONCE(!PagePrivate(page) && !PageUptodate(page));
account_page_dirtied(page, mapping);
radix_tree_tag_set(&mapping->page_tree, page_index(page),
   PAGECACHE_TAG_DIRTY);
-   spin_unlock_irqrestore(&mapping->tree_lock, flags);
+   spin_unlock_irq(&mapping->tree_lock);
unlock_page_memcg(page);
 
if (mapping->host) {
-- 
2.17.0



[PATCH 06/10] Generic radix trees

2018-05-18 Thread Kent Overstreet
Signed-off-by: Kent Overstreet 
---
 include/linux/generic-radix-tree.h | 131 ++
 lib/Makefile   |   3 +-
 lib/generic-radix-tree.c   | 167 +
 3 files changed, 300 insertions(+), 1 deletion(-)
 create mode 100644 include/linux/generic-radix-tree.h
 create mode 100644 lib/generic-radix-tree.c

diff --git a/include/linux/generic-radix-tree.h 
b/include/linux/generic-radix-tree.h
new file mode 100644
index 00..4ede23e55f
--- /dev/null
+++ b/include/linux/generic-radix-tree.h
@@ -0,0 +1,131 @@
+#ifndef _LINUX_GENERIC_RADIX_TREE_H
+#define _LINUX_GENERIC_RADIX_TREE_H
+
+/*
+ * Generic radix trees/sparse arrays:
+ *
+ * A generic radix tree has all nodes of size PAGE_SIZE - both leaves and
+ * interior nodes.
+ */
+
+#include 
+#include 
+#include 
+#include 
+
+struct genradix_node;
+
+struct __genradix {
+   struct genradix_node*root;
+   size_t  depth;
+};
+
+/*
+ * NOTE: currently, sizeof(_type) must be a power of two and not larger than
+ * PAGE_SIZE:
+ */
+
+#define __GENRADIX_INITIALIZER \
+   {   \
+   .tree = {   \
+   .root = NULL,   \
+   .depth = 0, \
+   }   \
+   }
+
+/*
+ * We use a 0 size array to stash the type we're storing without taking any
+ * space at runtime - then the various accessor macros can use typeof() to get
+ * to it for casts/sizeof - we also force the alignment so that storing a type
+ * with a ridiculous alignment doesn't blow up the alignment or size of the
+ * genradix.
+ */
+
+#define GENRADIX(_type)\
+struct {   \
+   struct __genradix   tree;   \
+   _type   type[0] __aligned(1);   \
+}
+
+#define DEFINE_GENRADIX(_name, _type)  \
+   GENRADIX(_type) _name = __GENRADIX_INITIALIZER
+
+#define genradix_init(_radix)  \
+do {   \
+   *(_radix) = (typeof(*_radix)) __GENRADIX_INITIALIZER;   \
+} while (0)
+
+void __genradix_free(struct __genradix *);
+
+#define genradix_free(_radix)  __genradix_free(&(_radix)->tree)
+
+static inline size_t __idx_to_offset(size_t idx, size_t obj_size)
+{
+   BUILD_BUG_ON(obj_size > PAGE_SIZE);
+
+   if (!is_power_of_2(obj_size)) {
+   size_t objs_per_page = PAGE_SIZE / obj_size;
+
+   return (idx / objs_per_page) * PAGE_SIZE +
+   (idx % objs_per_page) * obj_size;
+   } else {
+   return idx * obj_size;
+   }
+}
+
+#define __genradix_cast(_radix)(typeof((_radix)->type[0]) *)
+#define __genradix_obj_size(_radix)sizeof((_radix)->type[0])
+#define __genradix_idx_to_offset(_radix, _idx) \
+   __idx_to_offset(_idx, __genradix_obj_size(_radix))
+
+void *__genradix_ptr(struct __genradix *, size_t);
+
+/* Returns a pointer to element at @_idx */
+#define genradix_ptr(_radix, _idx) \
+   (__genradix_cast(_radix)\
+__genradix_ptr(&(_radix)->tree,\
+   __genradix_idx_to_offset(_radix, _idx)))
+
+void *__genradix_ptr_alloc(struct __genradix *, size_t, gfp_t);
+
+/* Returns a pointer to element at @_idx, allocating it if necessary */
+#define genradix_ptr_alloc(_radix, _idx, _gfp) \
+   (__genradix_cast(_radix)\
+__genradix_ptr_alloc(&(_radix)->tree,  \
+   __genradix_idx_to_offset(_radix, _idx), \
+   _gfp))
+
+struct genradix_iter {
+   size_t  offset;
+   size_t  pos;
+};
+
+#define genradix_iter_init(_radix, _idx)   \
+   ((struct genradix_iter) {   \
+   .pos= (_idx),   \
+   .offset = __genradix_idx_to_offset((_radix), (_idx)),\
+   })
+
+void *__genradix_iter_peek(struct genradix_iter *, struct __genradix *, 
size_t);
+
+#define genradix_iter_peek(_iter, _radix)  \
+   (__genradix_cast(_radix)\
+__genradix_iter_peek(_iter, &(_radix)->tree,   \
+ PAGE_SIZE / __genradix_obj_size(_radix)))
+
+static inline void __genradix_iter_advance(struct genradix_iter *iter,
+  size_t obj_size)
+{
+   iter->offset += obj_size;
+
+  

[PATCH 07/10] bcache: optimize continue_at_nobarrier()

2018-05-18 Thread Kent Overstreet
Signed-off-by: Kent Overstreet 
---
 drivers/md/bcache/closure.h | 11 ---
 1 file changed, 8 insertions(+), 3 deletions(-)

diff --git a/drivers/md/bcache/closure.h b/drivers/md/bcache/closure.h
index 3b9dfc9962..2392a46bcd 100644
--- a/drivers/md/bcache/closure.h
+++ b/drivers/md/bcache/closure.h
@@ -244,7 +244,7 @@ static inline void closure_queue(struct closure *cl)
 != offsetof(struct work_struct, func));
if (wq) {
INIT_WORK(&cl->work, cl->work.func);
-   BUG_ON(!queue_work(wq, &cl->work));
+   queue_work(wq, &cl->work);
} else
cl->fn(cl);
 }
@@ -337,8 +337,13 @@ do {   
\
  */
 #define continue_at_nobarrier(_cl, _fn, _wq)   \
 do {   \
-   set_closure_fn(_cl, _fn, _wq);  \
-   closure_queue(_cl); \
+   closure_set_ip(_cl);\
+   if (_wq) {  \
+   INIT_WORK(&(_cl)->work, (void *) _fn);  \
+   queue_work((_wq), &(_cl)->work);\
+   } else {\
+   (_fn)(_cl); \
+   }   \
 } while (0)
 
 /**
-- 
2.17.0



[PATCH 08/10] bcache: move closures to lib/

2018-05-18 Thread Kent Overstreet
Prep work for bcachefs - being a fork of bcache it also uses closures

Signed-off-by: Kent Overstreet 
---
 drivers/md/bcache/Kconfig  | 10 +-
 drivers/md/bcache/Makefile |  6 +++---
 drivers/md/bcache/bcache.h |  2 +-
 drivers/md/bcache/super.c  |  1 -
 drivers/md/bcache/util.h   |  3 +--
 {drivers/md/bcache => include/linux}/closure.h | 17 -
 lib/Kconfig|  3 +++
 lib/Kconfig.debug  |  9 +
 lib/Makefile   |  2 ++
 {drivers/md/bcache => lib}/closure.c   | 17 -
 10 files changed, 36 insertions(+), 34 deletions(-)
 rename {drivers/md/bcache => include/linux}/closure.h (97%)
 rename {drivers/md/bcache => lib}/closure.c (95%)

diff --git a/drivers/md/bcache/Kconfig b/drivers/md/bcache/Kconfig
index 4d200883c5..45f1094c08 100644
--- a/drivers/md/bcache/Kconfig
+++ b/drivers/md/bcache/Kconfig
@@ -1,6 +1,7 @@
 
 config BCACHE
tristate "Block device as cache"
+   select CLOSURES
---help---
Allows a block device to be used as cache for other devices; uses
a btree for indexing and the layout is optimized for SSDs.
@@ -15,12 +16,3 @@ config BCACHE_DEBUG
 
Enables extra debugging tools, allows expensive runtime checks to be
turned on.
-
-config BCACHE_CLOSURES_DEBUG
-   bool "Debug closures"
-   depends on BCACHE
-   select DEBUG_FS
-   ---help---
-   Keeps all active closures in a linked list and provides a debugfs
-   interface to list them, which makes it possible to see asynchronous
-   operations that get stuck.
diff --git a/drivers/md/bcache/Makefile b/drivers/md/bcache/Makefile
index d26b351958..2b790fb813 100644
--- a/drivers/md/bcache/Makefile
+++ b/drivers/md/bcache/Makefile
@@ -2,8 +2,8 @@
 
 obj-$(CONFIG_BCACHE)   += bcache.o
 
-bcache-y   := alloc.o bset.o btree.o closure.o debug.o extents.o\
-   io.o journal.o movinggc.o request.o stats.o super.o sysfs.o trace.o\
-   util.o writeback.o
+bcache-y   := alloc.o bset.o btree.o debug.o extents.o io.o\
+   journal.o movinggc.o request.o stats.o super.o sysfs.o trace.o util.o\
+   writeback.o
 
 CFLAGS_request.o   += -Iblock
diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h
index 12e5197f18..d954dc44dd 100644
--- a/drivers/md/bcache/bcache.h
+++ b/drivers/md/bcache/bcache.h
@@ -180,6 +180,7 @@
 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -191,7 +192,6 @@
 
 #include "bset.h"
 #include "util.h"
-#include "closure.h"
 
 struct bucket {
atomic_tpin;
diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
index f2273143b3..5f1ac8e0a3 100644
--- a/drivers/md/bcache/super.c
+++ b/drivers/md/bcache/super.c
@@ -2148,7 +2148,6 @@ static int __init bcache_init(void)
mutex_init(&bch_register_lock);
init_waitqueue_head(&unregister_wait);
register_reboot_notifier(&reboot);
-   closure_debug_init();
 
bcache_major = register_blkdev(0, "bcache");
if (bcache_major < 0) {
diff --git a/drivers/md/bcache/util.h b/drivers/md/bcache/util.h
index a6763db7f0..a75523ed0d 100644
--- a/drivers/md/bcache/util.h
+++ b/drivers/md/bcache/util.h
@@ -4,6 +4,7 @@
 #define _BCACHE_UTIL_H
 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -12,8 +13,6 @@
 #include 
 #include 
 
-#include "closure.h"
-
 #define PAGE_SECTORS   (PAGE_SIZE / 512)
 
 struct closure;
diff --git a/drivers/md/bcache/closure.h b/include/linux/closure.h
similarity index 97%
rename from drivers/md/bcache/closure.h
rename to include/linux/closure.h
index 2392a46bcd..1072bf2c13 100644
--- a/drivers/md/bcache/closure.h
+++ b/include/linux/closure.h
@@ -154,7 +154,7 @@ struct closure {
 
atomic_tremaining;
 
-#ifdef CONFIG_BCACHE_CLOSURES_DEBUG
+#ifdef CONFIG_DEBUG_CLOSURES
 #define CLOSURE_MAGIC_DEAD 0xc054dead
 #define CLOSURE_MAGIC_ALIVE0xc054a11e
 
@@ -183,15 +183,13 @@ static inline void closure_sync(struct closure *cl)
__closure_sync(cl);
 }
 
-#ifdef CONFIG_BCACHE_CLOSURES_DEBUG
+#ifdef CONFIG_DEBUG_CLOSURES
 
-void closure_debug_init(void);
 void closure_debug_create(struct closure *cl);
 void closure_debug_destroy(struct closure *cl);
 
 #else
 
-static inline void closure_debug_init(void) {}
 static inline void closure_debug_create(struct closure *cl) {}
 static inline void closure_debug_destroy(struct closure *cl) {}
 
@@ -199,21 +197,21 @@ static inline void closure_debug_destroy(struct closure 
*cl) {}
 
 static inline void closure_set_ip(struct closure *cl)
 {
-#ifdef CONFIG_BCACHE_CLOSURES_DEBUG
+#ifdef CONFIG_DEBUG_CLOSURES
cl->ip = _THIS_IP_;
 #endif
 }
 
 static inline void closure_set_ret_ip(struct closure *cl)
 {
-#ifdef 

[PATCH 09/10] closures: closure_wait_event()

2018-05-18 Thread Kent Overstreet
Signed-off-by: Kent Overstreet 
---
 include/linux/closure.h | 22 ++
 1 file changed, 22 insertions(+)

diff --git a/include/linux/closure.h b/include/linux/closure.h
index 1072bf2c13..ef22d18f7c 100644
--- a/include/linux/closure.h
+++ b/include/linux/closure.h
@@ -375,4 +375,26 @@ static inline void closure_call(struct closure *cl, closure_fn fn,
continue_at_nobarrier(cl, fn, wq);
 }
 
+#define __closure_wait_event(waitlist, _cond)  \
+do {   \
+   struct closure cl;  \
+   \
+   closure_init_stack(&cl);\
+   \
+   while (1) { \
+   closure_wait(waitlist, &cl);\
+   if (_cond)  \
+   break;  \
+   closure_sync(&cl);  \
+   }   \
+   closure_wake_up(waitlist);  \
+   closure_sync(&cl);  \
+} while (0)
+
+#define closure_wait_event(waitlist, _cond)\
+do {   \
+   if (!(_cond))   \
+   __closure_wait_event(waitlist, _cond);  \
+} while (0)
+
 #endif /* _LINUX_CLOSURE_H */
-- 
2.17.0



[PATCH 10/10] Dynamic fault injection

2018-05-18 Thread Kent Overstreet
Signed-off-by: Kent Overstreet 
---
 include/asm-generic/vmlinux.lds.h |   4 +
 include/linux/dynamic_fault.h | 117 +
 lib/Kconfig.debug |   5 +
 lib/Makefile  |   2 +
 lib/dynamic_fault.c   | 760 ++
 5 files changed, 888 insertions(+)
 create mode 100644 include/linux/dynamic_fault.h
 create mode 100644 lib/dynamic_fault.c

diff --git a/include/asm-generic/vmlinux.lds.h b/include/asm-generic/vmlinux.lds.h
index 1ab0e520d6..a4c9dfcbbd 100644
--- a/include/asm-generic/vmlinux.lds.h
+++ b/include/asm-generic/vmlinux.lds.h
@@ -246,6 +246,10 @@
VMLINUX_SYMBOL(__start___verbose) = .;  \
KEEP(*(__verbose))  \
VMLINUX_SYMBOL(__stop___verbose) = .;   \
+   . = ALIGN(8);   \
+   VMLINUX_SYMBOL(__start___faults) = .;   \
+   *(__faults) \
+   VMLINUX_SYMBOL(__stop___faults) = .;\
LIKELY_PROFILE()\
BRANCH_PROFILE()\
TRACE_PRINTKS() \
diff --git a/include/linux/dynamic_fault.h b/include/linux/dynamic_fault.h
new file mode 100644
index 00..6e7bb56ae8
--- /dev/null
+++ b/include/linux/dynamic_fault.h
@@ -0,0 +1,117 @@
+#ifndef _DYNAMIC_FAULT_H
+#define _DYNAMIC_FAULT_H
+
+#include 
+#include 
+#include 
+
+enum dfault_enabled {
+   DFAULT_DISABLED,
+   DFAULT_ENABLED,
+   DFAULT_ONESHOT,
+};
+
+union dfault_state {
+   struct {
+   unsignedenabled:2;
+   unsignedcount:30;
+   };
+
+   struct {
+   unsignedv;
+   };
+};
+
+/*
+ * An instance of this structure is created in a special
+ * ELF section at every dynamic fault callsite.  At runtime,
+ * the special section is treated as an array of these.
+ */
+struct _dfault {
+   const char  *modname;
+   const char  *function;
+   const char  *filename;
+   const char  *class;
+
+   const u16   line;
+
+   unsignedfrequency;
+   union dfault_state  state;
+
+   struct static_key   enabled;
+} __aligned(8);
+
+
+#ifdef CONFIG_DYNAMIC_FAULT
+
+int dfault_add_module(struct _dfault *tab, unsigned int n, const char *mod);
+int dfault_remove_module(char *mod_name);
+bool __dynamic_fault_enabled(struct _dfault *);
+
+#define dynamic_fault(_class)  \
+({ \
+   static struct _dfault descriptor\
+   __used __aligned(8) __attribute__((section("__faults"))) = {\
+   .modname= KBUILD_MODNAME,   \
+   .function   = __func__, \
+   .filename   = __FILE__, \
+   .line   = __LINE__, \
+   .class  = _class,   \
+   };  \
+   \
+   static_key_false(&descriptor.enabled) &&\
+   __dynamic_fault_enabled(&descriptor);   \
+})
+
+#define memory_fault() dynamic_fault("memory")
+#define race_fault()   dynamic_fault("race")
+
+#define kmalloc(...)   \
+   (memory_fault() ? NULL  : kmalloc(__VA_ARGS__))
+#define kzalloc(...)   \
+   (memory_fault() ? NULL  : kzalloc(__VA_ARGS__))
+#define krealloc(...)  \
+   (memory_fault() ? NULL  : krealloc(__VA_ARGS__))
+
+#define mempool_alloc(pool, gfp_mask)  \
+   ((!gfpflags_allow_blocking(gfp_mask) && memory_fault()) \
+   ? NULL : mempool_alloc(pool, gfp_mask))
+
+#define __get_free_pages(...)  \
+   (memory_fault() ? 0 : __get_free_pages(__VA_ARGS__))
+#define alloc_pages_node(...)  \
+   (memory_fault() ? NULL  : alloc_pages_node(__VA_ARGS__))
+#define alloc_pages_nodemask(...)  \
+   (memory_fault() ? NULL  : alloc_pages_nodemask(__VA_ARGS__))
+
+#define bio_alloc_bioset(gfp_mask, ...)\
+   ((!gfpflags_allow_blocking(gfp_mask) && memory_fault())