Re: [PATCH RESEND 0/8] btrfs-progs: sub: Relax the privileges of "subvolume list/show"

2018-12-06 Thread Omar Sandoval
On Tue, Nov 27, 2018 at 02:24:41PM +0900, Misono Tomohiro wrote:
> Hello,
> 
> This is basically the resend of 
>   "[PATCH v2 00/20] btrfs-progs: Rework of "subvolume list/show" and relax the
>   root privileges of them" [1]
> which I submitted in June. The aim of this series is to allow a
> non-privileged user to use basic subvolume functionality
> (create/list/snapshot/delete; this series allows "list").
> 
> They were once in the devel branch with some whitespace/comment
> modifications by David. I rebased them onto the current devel branch.
> 
> github: https://github.com/t-msn/btrfs-progs/tree/rework-sub-list
> 
> Basic logic/code is the same as before. Some differences are:
>  - Use latest libbtrfsutil from Omar [2] (thus drop first part of patches).
>    As a result, "sub list" cannot accept an ordinary directory to be
>    specified (which was allowed in the previous version)
>  - Drop patches which add new options to "sub list"
>  - Use 'nobody' as non-privileged test user just like libbtrfsutil test
>  - Update comments
> 
> Importantly, in order to make output consistent for both root and
> non-privileged users, this changes the behavior of "subvolume list":
>  - (default) Only list subvolumes under the specified path.
>    Path needs to be a subvolume.
>  - (-a) filter is dropped, i.e. its output is the same as the
>    default behavior of "sub list" in progs <= 4.19
> 
> Therefore, existing scripts may need to be updated to add the -a option
> (I believe nobody uses the current -a option).
> If anyone thinks this is not good, please let me know.

I think there are a few options in the case that the path isn't a
subvolume:

1. List all subvolumes in the filesystem with randomly mangled paths,
   which is what we currently do.
2. Error out, which is what this version of the series does.
3. List all subvolumes under the containing subvolume, which is what the
   previous version does.
4. List all subvolumes under the containing subvolume that are
   underneath the given path.

Option 1 won't work well for unprivileged users. Option 2 (this series)
is definitely going to break people's workflows/scripts. Option 3 is
unintuitive. In my opinion, option 4 is the nicest, but it may also
break scripts that expect all subvolumes to be printed.

There's also an option 5, which is to keep the behavior the same for
root (like what my previous patch [1] did) and implement option 4 for
unprivileged users.

I think 4 and 5 are the two main choices: do we want to preserve
backwards compatibility as carefully as possible (at the cost of
consistency), or do we want to risk it and improve the interface?
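A sketch of how option 4 could behave, with a hypothetical helper (this is not libbtrfsutil API): given the subvolume paths yielded under the containing subvolume, keep only the ones at or below the path the user passed.

```python
import os


def filter_subvolumes_under(subvol_paths, containing_subvol, path):
    """Option 4 sketch: subvol_paths are the paths of all subvolumes under
    the containing subvolume (relative to it); keep only those that are
    at or below the given path."""
    rel = os.path.relpath(path, containing_subvol)
    # Normalize so that "a/b" matches "a/b/c" but not "a/bc".
    prefix = '' if rel == '.' else rel + os.sep
    return [p for p in subvol_paths if p == rel or p.startswith(prefix)]


subvols = ['snaps', 'snaps/daily', 'snaps/weekly', 'data']
# Listing under "/mnt/snaps" keeps only the subvolumes below that path;
# listing under "/mnt" itself keeps everything.
under_snaps = filter_subvolumes_under(subvols, '/mnt', '/mnt/snaps')
```

The same filtering, applied only for unprivileged callers, would give option 5.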

1: 
https://github.com/osandov/btrfs-progs/commit/fb61c21aeb998b12c1d02532639083d7f40c41e0


[PATCH] libbtrfsutil: fix unprivileged tests if kernel lacks support

2018-12-06 Thread Omar Sandoval
From: Omar Sandoval 

I apparently didn't test this on a pre-4.18 kernel.
test_subvolume_info_unprivileged() checks for an ENOTTY, but this
doesn't seem to work correctly with subTest().
test_subvolume_iterator_unprivileged() doesn't have a check at all. Add
an explicit check to both before doing the actual test.

Signed-off-by: Omar Sandoval 
---
Based on the devel branch.

 libbtrfsutil/python/tests/test_subvolume.py | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/libbtrfsutil/python/tests/test_subvolume.py b/libbtrfsutil/python/tests/test_subvolume.py
index 99ec97bc..b06a1d3d 100644
--- a/libbtrfsutil/python/tests/test_subvolume.py
+++ b/libbtrfsutil/python/tests/test_subvolume.py
@@ -168,12 +168,13 @@ class TestSubvolume(BtrfsTestCase):
 
         with drop_privs():
             try:
-                self._test_subvolume_info(subvol, snapshot)
+                btrfsutil.subvolume_info(self.mountpoint)
             except OSError as e:
                 if e.errno == errno.ENOTTY:
                     self.skipTest('BTRFS_IOC_GET_SUBVOL_INFO is not available')
                 else:
                     raise
+            self._test_subvolume_info(subvol, snapshot)
 
 def test_read_only(self):
 for arg in self.path_or_fd(self.mountpoint):
@@ -487,6 +488,13 @@ class TestSubvolume(BtrfsTestCase):
         try:
             os.chdir(self.mountpoint)
             with drop_privs():
+                try:
+                    list(btrfsutil.SubvolumeIterator('.'))
+                except OSError as e:
+                    if e.errno == errno.ENOTTY:
+                        self.skipTest('BTRFS_IOC_GET_SUBVOL_ROOTREF is not available')
+                    else:
+                        raise
                 self._test_subvolume_iterator()
         finally:
             os.chdir(pwd)
-- 
2.19.2
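The probe-then-skip pattern the patch adds to both tests can be sketched on its own; the probe function here is a stand-in that simulates a pre-4.18 kernel rather than a real ioctl call:

```python
import errno


def probe_ioctl_supported(probe):
    """Run a cheap probe call first; report False only for ENOTTY
    (the ioctl is not implemented, e.g. on a pre-4.18 kernel) and
    re-raise any other error, mirroring the checks added above."""
    try:
        probe()
    except OSError as e:
        if e.errno == errno.ENOTTY:
            return False
        raise
    return True


def old_kernel_probe():
    # Stand-in for what BTRFS_IOC_GET_SUBVOL_INFO returns on old kernels.
    raise OSError(errno.ENOTTY, 'Inappropriate ioctl for device')


supported = probe_ioctl_supported(old_kernel_probe)
```

Doing the probe explicitly, rather than wrapping the whole test body, is what makes the skip work correctly under subTest().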



Re: [PATCH 9/9] btrfs: drop extra enum initialization where using defaults

2018-11-27 Thread Omar Sandoval
On Tue, Nov 27, 2018 at 08:53:59PM +0100, David Sterba wrote:
> The first auto-assigned value to enum is 0, we can use that and not
> initialize all members where the auto-increment does the same. This is
> used for values that are not part of on-disk format.

Reviewed-by: Omar Sandoval 
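The rule the series relies on, that a C enum auto-assigns 0 to the first member and increments by one, can be spot-checked with a small stand-in (Python used here purely as illustration of the C behavior):

```python
# Values spelled out explicitly, as in the old code...
explicit = {'BTRFS_CACHE_NO': 0, 'BTRFS_CACHE_STARTED': 1,
            'BTRFS_CACHE_FAST': 2, 'BTRFS_CACHE_FINISHED': 3,
            'BTRFS_CACHE_ERROR': 4}

# ...match what a C enum assigns when no initializers are given:
# the first enumerator is 0, each later one is the previous plus 1.
implicit = {name: i for i, name in enumerate(explicit)}
```

Since the two mappings are identical, dropping the initializers is a pure cleanup for values that never hit the disk format.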

> Signed-off-by: David Sterba 
> ---
>  fs/btrfs/btrfs_inode.h |  2 +-
>  fs/btrfs/ctree.h   | 28 ++++++++++++++--------------
>  fs/btrfs/disk-io.h | 10 +++++-----
>  fs/btrfs/qgroup.h  |  2 +-
>  fs/btrfs/sysfs.h   |  2 +-
>  fs/btrfs/transaction.h | 14 +++++++-------
>  6 files changed, 29 insertions(+), 29 deletions(-)
> 
> diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
> index 4de321aee7a5..fc25607304f2 100644
> --- a/fs/btrfs/btrfs_inode.h
> +++ b/fs/btrfs/btrfs_inode.h
> @@ -20,7 +20,7 @@
>   * new data the application may have written before commit.
>   */
>  enum {
> - BTRFS_INODE_ORDERED_DATA_CLOSE = 0,
> + BTRFS_INODE_ORDERED_DATA_CLOSE,
>   BTRFS_INODE_DUMMY,
>   BTRFS_INODE_IN_DEFRAG,
>   BTRFS_INODE_HAS_ASYNC_EXTENT,
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index 4bb0ac3050ff..f1d1c6ba3aa1 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -334,7 +334,7 @@ struct btrfs_node {
>   * The slots array records the index of the item or block pointer
>   * used while walking the tree.
>   */
> -enum { READA_NONE = 0, READA_BACK, READA_FORWARD };
> +enum { READA_NONE, READA_BACK, READA_FORWARD };
>  struct btrfs_path {
>   struct extent_buffer *nodes[BTRFS_MAX_LEVEL];
>   int slots[BTRFS_MAX_LEVEL];
> @@ -532,18 +532,18 @@ struct btrfs_free_cluster {
>  };
>  
>  enum btrfs_caching_type {
> - BTRFS_CACHE_NO  = 0,
> - BTRFS_CACHE_STARTED = 1,
> - BTRFS_CACHE_FAST= 2,
> - BTRFS_CACHE_FINISHED= 3,
> - BTRFS_CACHE_ERROR   = 4,
> + BTRFS_CACHE_NO,
> + BTRFS_CACHE_STARTED,
> + BTRFS_CACHE_FAST,
> + BTRFS_CACHE_FINISHED,
> + BTRFS_CACHE_ERROR,
>  };
>  
>  enum btrfs_disk_cache_state {
> - BTRFS_DC_WRITTEN= 0,
> - BTRFS_DC_ERROR  = 1,
> - BTRFS_DC_CLEAR  = 2,
> - BTRFS_DC_SETUP  = 3,
> + BTRFS_DC_WRITTEN,
> + BTRFS_DC_ERROR,
> + BTRFS_DC_CLEAR,
> + BTRFS_DC_SETUP,
>  };
>  
>  struct btrfs_caching_control {
> @@ -2621,10 +2621,10 @@ static inline gfp_t btrfs_alloc_write_mask(struct address_space *mapping)
>  /* extent-tree.c */
>  
>  enum btrfs_inline_ref_type {
> - BTRFS_REF_TYPE_INVALID = 0,
> - BTRFS_REF_TYPE_BLOCK =   1,
> - BTRFS_REF_TYPE_DATA =2,
> - BTRFS_REF_TYPE_ANY = 3,
> + BTRFS_REF_TYPE_INVALID,
> + BTRFS_REF_TYPE_BLOCK,
> + BTRFS_REF_TYPE_DATA,
> + BTRFS_REF_TYPE_ANY,
>  };
>  
>  int btrfs_get_extent_inline_ref_type(const struct extent_buffer *eb,
> diff --git a/fs/btrfs/disk-io.h b/fs/btrfs/disk-io.h
> index 4cccba22640f..987a64bc0c66 100644
> --- a/fs/btrfs/disk-io.h
> +++ b/fs/btrfs/disk-io.h
> @@ -21,11 +21,11 @@
>  #define BTRFS_BDEV_BLOCKSIZE (4096)
>  
>  enum btrfs_wq_endio_type {
> - BTRFS_WQ_ENDIO_DATA = 0,
> - BTRFS_WQ_ENDIO_METADATA = 1,
> - BTRFS_WQ_ENDIO_FREE_SPACE = 2,
> - BTRFS_WQ_ENDIO_RAID56 = 3,
> - BTRFS_WQ_ENDIO_DIO_REPAIR = 4,
> + BTRFS_WQ_ENDIO_DATA,
> + BTRFS_WQ_ENDIO_METADATA,
> + BTRFS_WQ_ENDIO_FREE_SPACE,
> + BTRFS_WQ_ENDIO_RAID56,
> + BTRFS_WQ_ENDIO_DIO_REPAIR,
>  };
>  
>  static inline u64 btrfs_sb_offset(int mirror)
> diff --git a/fs/btrfs/qgroup.h b/fs/btrfs/qgroup.h
> index d8f78f5ab854..e4e6ee44073a 100644
> --- a/fs/btrfs/qgroup.h
> +++ b/fs/btrfs/qgroup.h
> @@ -70,7 +70,7 @@ struct btrfs_qgroup_extent_record {
>   *   be converted into META_PERTRANS.
>   */
>  enum btrfs_qgroup_rsv_type {
> - BTRFS_QGROUP_RSV_DATA = 0,
> + BTRFS_QGROUP_RSV_DATA,
>   BTRFS_QGROUP_RSV_META_PERTRANS,
>   BTRFS_QGROUP_RSV_META_PREALLOC,
>   BTRFS_QGROUP_RSV_LAST,
> diff --git a/fs/btrfs/sysfs.h b/fs/btrfs/sysfs.h
> index c6ee600aff89..40716b357c1d 100644
> --- a/fs/btrfs/sysfs.h
> +++ b/fs/btrfs/sysfs.h
> @@ -9,7 +9,7 @@
>  extern u64 btrfs_debugfs_test;
>  
>  enum btrfs_feature_set {
> - FEAT_COMPAT = 0,
> + FEAT_COMPAT,
>   FEAT_COMPAT_RO,
>   FEAT_INCOMPAT,
>   FEAT_MAX
> diff --git a/fs/btrfs/transaction.h b/fs/btrfs/transaction.h
> index 703d5116a2fc..f1ba78949d1b 100644
> --- a/fs/btrfs/transaction.h
> +++ b/fs/btrfs/transaction.h
> @@ -12,13 +12,13 @@
>  #include "ctree.h"
>  
>  enum b

Re: [PATCH 8/9] btrfs: switch BTRFS_ORDERED_* to enums

2018-11-27 Thread Omar Sandoval
On Tue, Nov 27, 2018 at 08:53:57PM +0100, David Sterba wrote:
> We can use simple enum for values that are not part of on-disk format:
> ordered extent flags.

Reviewed-by: Omar Sandoval 

> Signed-off-by: David Sterba 
> ---
>  fs/btrfs/ordered-data.h | 45 +++++++++++++++++++++++++--------------------
>  1 file changed, 25 insertions(+), 20 deletions(-)
> 
> diff --git a/fs/btrfs/ordered-data.h b/fs/btrfs/ordered-data.h
> index b10e6765d88f..fb9a161f0215 100644
> --- a/fs/btrfs/ordered-data.h
> +++ b/fs/btrfs/ordered-data.h
> @@ -37,26 +37,31 @@ struct btrfs_ordered_sum {
>   * rbtree, just before waking any waiters.  It is used to indicate the
>   * IO is done and any metadata is inserted into the tree.
>   */
> -#define BTRFS_ORDERED_IO_DONE 0 /* set when all the pages are written */
> -
> -#define BTRFS_ORDERED_COMPLETE 1 /* set when removed from the tree */
> -
> -#define BTRFS_ORDERED_NOCOW 2 /* set when we want to write in place */
> -
> -#define BTRFS_ORDERED_COMPRESSED 3 /* writing a zlib compressed extent */
> -
> -#define BTRFS_ORDERED_PREALLOC 4 /* set when writing to preallocated extent */
> -
> -#define BTRFS_ORDERED_DIRECT 5 /* set when we're doing DIO with this extent */
> -
> -#define BTRFS_ORDERED_IOERR 6 /* We had an io error when writing this out */
> -
> -#define BTRFS_ORDERED_UPDATED_ISIZE 7 /* indicates whether this ordered extent
> -   * has done its due diligence in updating
> -   * the isize. */
> -#define BTRFS_ORDERED_TRUNCATED 8 /* Set when we have to truncate an extent */
> -
> -#define BTRFS_ORDERED_REGULAR 10 /* Regular IO for COW */
> +enum {
> + /* set when all the pages are written */
> + BTRFS_ORDERED_IO_DONE,
> + /* set when removed from the tree */
> + BTRFS_ORDERED_COMPLETE,
> + /* set when we want to write in place */
> + BTRFS_ORDERED_NOCOW,
> + /* writing a zlib compressed extent */
> + BTRFS_ORDERED_COMPRESSED,
> + /* set when writing to preallocated extent */
> + BTRFS_ORDERED_PREALLOC,
> + /* set when we're doing DIO with this extent */
> + BTRFS_ORDERED_DIRECT,
> + /* We had an io error when writing this out */
> + BTRFS_ORDERED_IOERR,
> + /*
> +  * indicates whether this ordered extent has done its due diligence in
> +  * updating the isize
> +  */
> + BTRFS_ORDERED_UPDATED_ISIZE,
> + /* Set when we have to truncate an extent */
> + BTRFS_ORDERED_TRUNCATED,
> + /* Regular IO for COW */
> + BTRFS_ORDERED_REGULAR,
> +};
>  
>  struct btrfs_ordered_extent {
>   /* logical offset in the file */
> -- 
> 2.19.1
> 


Re: [PATCH 7/9] btrfs: switch BTRFS_*_LOCK to enums

2018-11-27 Thread Omar Sandoval
On Tue, Nov 27, 2018 at 08:53:55PM +0100, David Sterba wrote:
> We can use simple enum for values that are not part of on-disk format:
> tree lock types.
> 
> Signed-off-by: David Sterba 
> ---
>  fs/btrfs/locking.h | 10 ++++++----
>  1 file changed, 6 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/btrfs/locking.h b/fs/btrfs/locking.h
> index 29135def468e..684d0ef4faa4 100644
> --- a/fs/btrfs/locking.h
> +++ b/fs/btrfs/locking.h
> @@ -6,10 +6,12 @@
>  #ifndef BTRFS_LOCKING_H
>  #define BTRFS_LOCKING_H
>  
> -#define BTRFS_WRITE_LOCK 1
> -#define BTRFS_READ_LOCK 2
> -#define BTRFS_WRITE_LOCK_BLOCKING 3
> -#define BTRFS_READ_LOCK_BLOCKING 4
> +enum {
> + BTRFS_WRITE_LOCK,

See btrfs_set_path_blocking() and btrfs_release_path(); 0 means no lock,
so this needs to be BTRFS_WRITE_LOCK = 1. I imagine that lockdep would
catch this.
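A minimal stand-in for the hazard (Python, illustrative only; the real values are C constants): 0 is the "not locked" value stored back into path->locks[], so every real lock type has to start at 1.

```python
# btrfs_release_path() resets path->locks[i] to 0, meaning "not locked",
# so no lock type may share that value.
NO_LOCK = 0

# Starting the enumeration at 1, as suggested (BTRFS_WRITE_LOCK = 1),
# keeps every real lock type distinct from the sentinel.
lock_types = {name: value for value, name in enumerate(
    ['BTRFS_WRITE_LOCK', 'BTRFS_READ_LOCK',
     'BTRFS_WRITE_LOCK_BLOCKING', 'BTRFS_READ_LOCK_BLOCKING'], start=1)}
```

With a plain `enum { BTRFS_WRITE_LOCK, ... }`, BTRFS_WRITE_LOCK would become 0 and be indistinguishable from "no lock held".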

> + BTRFS_READ_LOCK,
> + BTRFS_WRITE_LOCK_BLOCKING,
> + BTRFS_READ_LOCK_BLOCKING,
> +};
>  
>  void btrfs_tree_lock(struct extent_buffer *eb);
>  void btrfs_tree_unlock(struct extent_buffer *eb);
> -- 
> 2.19.1
> 


Re: [PATCH 6/9] btrfs: switch EXTENT_FLAG_* to enums

2018-11-27 Thread Omar Sandoval
On Tue, Nov 27, 2018 at 08:53:52PM +0100, David Sterba wrote:
> We can use simple enum for values that are not part of on-disk format:
> extent map flags.

Reviewed-by: Omar Sandoval 

> Signed-off-by: David Sterba 
> ---
>  fs/btrfs/extent_map.h | 21 ++++++++++++++-------
>  1 file changed, 14 insertions(+), 7 deletions(-)
> 
> diff --git a/fs/btrfs/extent_map.h b/fs/btrfs/extent_map.h
> index 31977ffd6190..ef05a0121652 100644
> --- a/fs/btrfs/extent_map.h
> +++ b/fs/btrfs/extent_map.h
> @@ -11,13 +11,20 @@
>  #define EXTENT_MAP_INLINE ((u64)-2)
>  #define EXTENT_MAP_DELALLOC ((u64)-1)
>  
> -/* bits for the flags field */
> -#define EXTENT_FLAG_PINNED 0 /* this entry not yet on disk, don't free it */
> -#define EXTENT_FLAG_COMPRESSED 1
> -#define EXTENT_FLAG_PREALLOC 3 /* pre-allocated extent */
> -#define EXTENT_FLAG_LOGGING 4 /* Logging this extent */
> -#define EXTENT_FLAG_FILLING 5 /* Filling in a preallocated extent */
> -#define EXTENT_FLAG_FS_MAPPING 6 /* filesystem extent mapping type */
> +/* bits for the extent_map::flags field */
> +enum {
> + /* this entry not yet on disk, don't free it */
> + EXTENT_FLAG_PINNED,
> + EXTENT_FLAG_COMPRESSED,
> + /* pre-allocated extent */
> + EXTENT_FLAG_PREALLOC,
> + /* Logging this extent */
> + EXTENT_FLAG_LOGGING,
> + /* Filling in a preallocated extent */
> + EXTENT_FLAG_FILLING,
> + /* filesystem extent mapping type */
> + EXTENT_FLAG_FS_MAPPING,
> +};
>  
>  struct extent_map {
>   struct rb_node rb_node;
> -- 
> 2.19.1
> 


Re: [PATCH 5/9] btrfs: swtich EXTENT_BUFFER_* to enums

2018-11-27 Thread Omar Sandoval
On Tue, Nov 27, 2018 at 08:53:50PM +0100, David Sterba wrote:
> We can use simple enum for values that are not part of on-disk format:
> extent buffer flags.

This one has a "swtich" typo in the subject. Otherwise,

Reviewed-by: Omar Sandoval 

> Signed-off-by: David Sterba 
> ---
>  fs/btrfs/extent_io.h | 28 ++++++++++++++++------------
>  1 file changed, 16 insertions(+), 12 deletions(-)
> 
> diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h
> index a1d3ea5a0d32..fd42492e62e5 100644
> --- a/fs/btrfs/extent_io.h
> +++ b/fs/btrfs/extent_io.h
> @@ -37,18 +37,22 @@
>  #define EXTENT_BIO_COMPRESSED 1
>  #define EXTENT_BIO_FLAG_SHIFT 16
>  
> -/* these are bit numbers for test/set bit */
> -#define EXTENT_BUFFER_UPTODATE 0
> -#define EXTENT_BUFFER_DIRTY 2
> -#define EXTENT_BUFFER_CORRUPT 3
> -#define EXTENT_BUFFER_READAHEAD 4/* this got triggered by readahead */
> -#define EXTENT_BUFFER_TREE_REF 5
> -#define EXTENT_BUFFER_STALE 6
> -#define EXTENT_BUFFER_WRITEBACK 7
> -#define EXTENT_BUFFER_READ_ERR 8/* read IO error */
> -#define EXTENT_BUFFER_UNMAPPED 9
> -#define EXTENT_BUFFER_IN_TREE 10
> -#define EXTENT_BUFFER_WRITE_ERR 11/* write IO error */
> +enum {
> + EXTENT_BUFFER_UPTODATE,
> + EXTENT_BUFFER_DIRTY,
> + EXTENT_BUFFER_CORRUPT,
> + /* this got triggered by readahead */
> + EXTENT_BUFFER_READAHEAD,
> + EXTENT_BUFFER_TREE_REF,
> + EXTENT_BUFFER_STALE,
> + EXTENT_BUFFER_WRITEBACK,
> + /* read IO error */
> + EXTENT_BUFFER_READ_ERR,
> + EXTENT_BUFFER_UNMAPPED,
> + EXTENT_BUFFER_IN_TREE,
> + /* write IO error */
> + EXTENT_BUFFER_WRITE_ERR,
> +};
>  
>  /* these are flags for __process_pages_contig */
>  #define PAGE_UNLOCK  (1 << 0)
> -- 
> 2.19.1
> 


Re: [PATCH 4/9] btrfs: switch BTRFS_ROOT_* to enums

2018-11-27 Thread Omar Sandoval
On Tue, Nov 27, 2018 at 08:53:48PM +0100, David Sterba wrote:
> We can use simple enum for values that are not part of on-disk format:
> root tree flags.

Reviewed-by: Omar Sandoval 

> Signed-off-by: David Sterba 
> ---
>  fs/btrfs/ctree.h | 33 +++++++++++++++++----------------
>  1 file changed, 17 insertions(+), 16 deletions(-)
> 
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index 7176b95b40e7..4bb0ac3050ff 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -1180,22 +1180,23 @@ struct btrfs_subvolume_writers {
>  /*
>   * The state of btrfs root
>   */
> -/*
> - * btrfs_record_root_in_trans is a multi-step process,
> - * and it can race with the balancing code.   But the
> - * race is very small, and only the first time the root
> - * is added to each transaction.  So IN_TRANS_SETUP
> - * is used to tell us when more checks are required
> - */
> -#define BTRFS_ROOT_IN_TRANS_SETUP0
> -#define BTRFS_ROOT_REF_COWS  1
> -#define BTRFS_ROOT_TRACK_DIRTY   2
> -#define BTRFS_ROOT_IN_RADIX  3
> -#define BTRFS_ROOT_ORPHAN_ITEM_INSERTED  4
> -#define BTRFS_ROOT_DEFRAG_RUNNING5
> -#define BTRFS_ROOT_FORCE_COW 6
> -#define BTRFS_ROOT_MULTI_LOG_TASKS   7
> -#define BTRFS_ROOT_DIRTY 8
> +enum {
> + /*
> +  * btrfs_record_root_in_trans is a multi-step process, and it can race
> +  * with the balancing code.   But the race is very small, and only the
> +  * first time the root is added to each transaction.  So IN_TRANS_SETUP
> +  * is used to tell us when more checks are required
> +  */
> + BTRFS_ROOT_IN_TRANS_SETUP,
> + BTRFS_ROOT_REF_COWS,
> + BTRFS_ROOT_TRACK_DIRTY,
> + BTRFS_ROOT_IN_RADIX,
> + BTRFS_ROOT_ORPHAN_ITEM_INSERTED,
> + BTRFS_ROOT_DEFRAG_RUNNING,
> + BTRFS_ROOT_FORCE_COW,
> + BTRFS_ROOT_MULTI_LOG_TASKS,
> + BTRFS_ROOT_DIRTY,
> +};
>  
>  /*
>   * in ram representation of the tree.  extent_root is used for all 
> allocations
> -- 
> 2.19.1
> 


Re: [PATCH 3/9] btrfs: switch BTRFS_FS_* to enums

2018-11-27 Thread Omar Sandoval
On Tue, Nov 27, 2018 at 08:53:45PM +0100, David Sterba wrote:
> We can use simple enum for values that are not part of on-disk format:
> internal filesystem states.

Hah, looks like we never had a bit 0 ;)

Reviewed-by: Omar Sandoval 

> Signed-off-by: David Sterba 
> ---
>  fs/btrfs/ctree.h | 63 +++++++++++++++++++++++++++++++--------------------------------
>  1 file changed, 31 insertions(+), 32 deletions(-)
> 
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index 40c405d74a01..7176b95b40e7 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -757,38 +757,37 @@ struct btrfs_swapfile_pin {
>  
>  bool btrfs_pinned_by_swapfile(struct btrfs_fs_info *fs_info, void *ptr);
>  
> -#define BTRFS_FS_BARRIER 1
> -#define BTRFS_FS_CLOSING_START   2
> -#define BTRFS_FS_CLOSING_DONE3
> -#define BTRFS_FS_LOG_RECOVERING  4
> -#define BTRFS_FS_OPEN5
> -#define BTRFS_FS_QUOTA_ENABLED   6
> -#define BTRFS_FS_UPDATE_UUID_TREE_GEN9
> -#define BTRFS_FS_CREATING_FREE_SPACE_TREE10
> -#define BTRFS_FS_BTREE_ERR   11
> -#define BTRFS_FS_LOG1_ERR12
> -#define BTRFS_FS_LOG2_ERR13
> -#define BTRFS_FS_QUOTA_OVERRIDE  14
> -/* Used to record internally whether fs has been frozen */
> -#define BTRFS_FS_FROZEN  15
> -
> -/*
> - * Indicate that a whole-filesystem exclusive operation is running
> - * (device replace, resize, device add/delete, balance)
> - */
> -#define BTRFS_FS_EXCL_OP 16
> -
> -/*
> - * To info transaction_kthread we need an immediate commit so it doesn't
> - * need to wait for commit_interval
> - */
> -#define BTRFS_FS_NEED_ASYNC_COMMIT   17
> -
> -/*
> - * Indicate that balance has been set up from the ioctl and is in the main
> - * phase. The fs_info::balance_ctl is initialized.
> - */
> -#define BTRFS_FS_BALANCE_RUNNING 18
> +enum {
> + BTRFS_FS_BARRIER,
> + BTRFS_FS_CLOSING_START,
> + BTRFS_FS_CLOSING_DONE,
> + BTRFS_FS_LOG_RECOVERING,
> + BTRFS_FS_OPEN,
> + BTRFS_FS_QUOTA_ENABLED,
> + BTRFS_FS_UPDATE_UUID_TREE_GEN,
> + BTRFS_FS_CREATING_FREE_SPACE_TREE,
> + BTRFS_FS_BTREE_ERR,
> + BTRFS_FS_LOG1_ERR,
> + BTRFS_FS_LOG2_ERR,
> + BTRFS_FS_QUOTA_OVERRIDE,
> + /* Used to record internally whether fs has been frozen */
> + BTRFS_FS_FROZEN,
> + /*
> +  * Indicate that a whole-filesystem exclusive operation is running
> +  * (device replace, resize, device add/delete, balance)
> +  */
> + BTRFS_FS_EXCL_OP,
> + /*
> +  * To info transaction_kthread we need an immediate commit so it
> +  * doesn't need to wait for commit_interval
> +  */
> + BTRFS_FS_NEED_ASYNC_COMMIT,
> + /*
> +  * Indicate that balance has been set up from the ioctl and is in the
> +  * main phase. The fs_info::balance_ctl is initialized.
> +  */
> + BTRFS_FS_BALANCE_RUNNING,
> +};
>  
>  struct btrfs_fs_info {
>   u8 chunk_tree_uuid[BTRFS_UUID_SIZE];
> -- 
> 2.19.1
> 


Re: [PATCH 2/9] btrfs: switch BTRFS_BLOCK_RSV_* to enums

2018-11-27 Thread Omar Sandoval
On Tue, Nov 27, 2018 at 08:53:43PM +0100, David Sterba wrote:
> We can use simple enum for values that are not part of on-disk format:
> block reserve types.

Reviewed-by: Omar Sandoval 

> Signed-off-by: David Sterba 
> ---
>  fs/btrfs/ctree.h | 19 ++++++++++++-------
>  1 file changed, 12 insertions(+), 7 deletions(-)
> 
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index f82ec5e41b0c..40c405d74a01 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -461,13 +461,18 @@ struct btrfs_space_info {
>   struct kobject *block_group_kobjs[BTRFS_NR_RAID_TYPES];
>  };
>  
> -#define  BTRFS_BLOCK_RSV_GLOBAL  1
> -#define  BTRFS_BLOCK_RSV_DELALLOC2
> -#define  BTRFS_BLOCK_RSV_TRANS   3
> -#define  BTRFS_BLOCK_RSV_CHUNK   4
> -#define  BTRFS_BLOCK_RSV_DELOPS  5
> -#define  BTRFS_BLOCK_RSV_EMPTY   6
> -#define  BTRFS_BLOCK_RSV_TEMP7
> +/*
> + * Types of block reserves
> + */
> +enum {
> + BTRFS_BLOCK_RSV_GLOBAL,
> + BTRFS_BLOCK_RSV_DELALLOC,
> + BTRFS_BLOCK_RSV_TRANS,
> + BTRFS_BLOCK_RSV_CHUNK,
> + BTRFS_BLOCK_RSV_DELOPS,
> + BTRFS_BLOCK_RSV_EMPTY,
> + BTRFS_BLOCK_RSV_TEMP,
> +};
>  
>  struct btrfs_block_rsv {
>   u64 size;
> -- 
> 2.19.1
> 


Re: [PATCH 1/9] btrfs: switch BTRFS_FS_STATE_* to enums

2018-11-27 Thread Omar Sandoval
On Tue, Nov 27, 2018 at 08:53:41PM +0100, David Sterba wrote:
> We can use simple enum for values that are not part of on-disk format:
> global filesystem states.

Reviewed-by: Omar Sandoval 

Some typos/wording suggestions below.

> Signed-off-by: David Sterba 
> ---
>  fs/btrfs/ctree.h | 25 +++++++++++++++++++------
>  1 file changed, 19 insertions(+), 6 deletions(-)
> 
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index a98507fa9192..f82ec5e41b0c 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -109,13 +109,26 @@ static inline unsigned long btrfs_chunk_item_size(int num_stripes)
>  }
>  
>  /*
> - * File system states
> + * Runtime (in-memory) states of filesystem
>   */
> -#define BTRFS_FS_STATE_ERROR 0
> -#define BTRFS_FS_STATE_REMOUNTING1
> -#define BTRFS_FS_STATE_TRANS_ABORTED 2
> -#define BTRFS_FS_STATE_DEV_REPLACING 3
> -#define BTRFS_FS_STATE_DUMMY_FS_INFO 4
> +enum {
> + /* Global indicator of serious filesysystem errors */

filesysystem -> filesystem

> + BTRFS_FS_STATE_ERROR,
> + /*
> +  * Filesystem is being remounted, allow to skip some operations, like
> +  * defrag
> +  */
> + BTRFS_FS_STATE_REMOUNTING,
> + /* Track if the transaction abort has been reported */

Which one is "the" transaction abort? This gives me the impression that
this is a flag on the transaction, but it's actually filesystem state.
Maybe "Track if a transaction abort has been reported on this
filesystem"?

> + BTRFS_FS_STATE_TRANS_ABORTED,
> + /*
> +  * Indicate that replace source or target device state is changed and
> +  * allow to block bio operations
> +  */

Again, this makes it sound like it's device state, but it's actually
filesystem state. How about "Bio operations should be blocked on this
filesystem because a source or target device is being destroyed as part
of a device replace"?

> + BTRFS_FS_STATE_DEV_REPLACING,
> + /* The btrfs_fs_info created for self-tests */
> + BTRFS_FS_STATE_DUMMY_FS_INFO,
> +};
>  
>  #define BTRFS_BACKREF_REV_MAX256
>  #define BTRFS_BACKREF_REV_SHIFT  56
> -- 
> 2.19.1
> 


Re: [PATCH 00/10] btrfs-progs: my libbtrfsutil patch queue

2018-11-26 Thread Omar Sandoval
On Mon, Nov 26, 2018 at 05:18:12PM +0100, David Sterba wrote:
> On Tue, Nov 13, 2018 at 11:46:55PM -0800, Omar Sandoval wrote:
> > From: Omar Sandoval 
> > 
> > Hi,
> > 
> > This series contains my backlog of libbtrfsutil changes which I've been
> > collecting over the past few weeks.
> > 
> > Patches 1-4 are fixes. Patches 5-6 add functionality to the unit tests
> > which is needed for patches 7-8. Patches 7-8 add support for the
> > unprivileged ioctls added in Linux 4.18; more on those below. Patch 9
> > bumps the library version. Patch 10 adds documentation for the available
> > API along with examples.
> > 
> > Patches 7-8 are based on Misono Tomohiro's previous patch series [1],
> > with a few important changes.
> > 
> > - Both subvolume_info() and create_subvolume_iterator() now have unit
> >   tests for the unprivileged case.
> > - Both no longer explicitly check that top == 0 in the unprivileged
> >   case, since that will already fail with a clear permission error.
> > - Unprivileged iteration is much simpler: it uses openat() instead of
> >   fchdir() and is based more closely on the original tree search
> >   variant. This fixes a bug in post-order iteration in Misono's version.
> > - Unprivileged iteration does _not_ support passing in a non-subvolume
> >   path; if this behavior is desired, I'd like it to be a separate change
> >   with an explicit flag.
> 
> Series merged to devel, thanks.

Thanks!

> I've added link from the main README now
> that there's the API documentation.

Ah, great idea.

> The test-libbtrfsutil is missing from the travis CI for some reason, I
> was about to add it.  So far the testing environment does not provide
> 'umount' that knows about '-R' so the tests fail. I'll have a look if
> there's a newer base image provided, otherwise a workaround would be
> necessary.

It looks like it was added to util-linux in v2.23 back in 2013. Or maybe
the base image uses busybox? I believe that umount from busybox doesn't
have -R.

> As for the unprivileged subvolume listing ioctls, the functionality in
> the util library is self-contained and the interface is up to you to
> design properly, so this does not depend on the 'btrfs subvolume list'
> command. That one has unfortunately not bubbled high enough in my todo.

That comment is mostly for Misono, since the original version had that
functionality, probably for the subvolume list command.


Re: [PATCH v2] btrfs: add zstd compression level support

2018-11-19 Thread Omar Sandoval
On Tue, Nov 13, 2018 at 01:33:32AM +0100, David Sterba wrote:
> On Wed, Oct 31, 2018 at 11:11:08AM -0700, Nick Terrell wrote:
> > From: Jennifer Liu 
> > 
> > Adds zstd compression level support to btrfs. Zstd requires
> > different amounts of memory for each level, so the design had
> > to be modified to allow set_level() to allocate memory. We
> > preallocate one workspace of the maximum size to guarantee
> > forward progress. This feature is expected to be useful for
> > read-mostly filesystems, or when creating images.
> > 
> > Benchmarks run in qemu on Intel x86 with a single core.
> > The benchmark measures the time to copy the Silesia corpus [0] to
> > a btrfs filesystem 10 times, then read it back.
> > 
> > The two important things to note are:
> > - The decompression speed and memory remains constant.
> >   The memory required to decompress is the same as level 1.
> > - The compression speed and ratio will vary based on the source.
> > 
> > Level   Ratio   Compression   Decompression   Compression Memory
> > 1       2.59    153 MB/s      112 MB/s        0.8 MB
> > 2       2.67    136 MB/s      113 MB/s        1.0 MB
> > 3       2.72    106 MB/s      115 MB/s        1.3 MB
> > 4       2.78    86 MB/s       109 MB/s        0.9 MB
> > 5       2.83    69 MB/s       109 MB/s        1.4 MB
> > 6       2.89    53 MB/s       110 MB/s        1.5 MB
> > 7       2.91    40 MB/s       112 MB/s        1.4 MB
> > 8       2.92    34 MB/s       110 MB/s        1.8 MB
> > 9       2.93    27 MB/s       109 MB/s        1.8 MB
> > 10      2.94    22 MB/s       109 MB/s        1.8 MB
> > 11      2.95    17 MB/s       114 MB/s        1.8 MB
> > 12      2.95    13 MB/s       113 MB/s        1.8 MB
> > 13      2.95    10 MB/s       111 MB/s        2.3 MB
> > 14      2.99    7 MB/s        110 MB/s        2.6 MB
> > 15      3.03    6 MB/s        110 MB/s        2.6 MB
> > 
> > [0] http://sun.aei.polsl.pl/~sdeor/index.php?page=silesia
> > 
> > Signed-off-by: Jennifer Liu 
> > Signed-off-by: Nick Terrell 
> > Reviewed-by: Omar Sandoval 
> > ---
> > v1 -> v2:
> > - Don't reflow the unchanged line.
> > 

[snip]

> > -static struct list_head *zstd_alloc_workspace(void)
> > +static bool zstd_set_level(struct list_head *ws, unsigned int level)
> > +{
> > +   struct workspace *workspace = list_entry(ws, struct workspace, list);
> > +   ZSTD_parameters params;
> > +   int size;
> > +
> > +   if (level > BTRFS_ZSTD_MAX_LEVEL)
> > +   level = BTRFS_ZSTD_MAX_LEVEL;
> > +
> > +   if (level == 0)
> > +   level = BTRFS_ZSTD_DEFAULT_LEVEL;
> > +
> > +   params = ZSTD_getParams(level, ZSTD_BTRFS_MAX_INPUT, 0);
> > +   size = max_t(size_t,
> > +   ZSTD_CStreamWorkspaceBound(params.cParams),
> > +   ZSTD_DStreamWorkspaceBound(ZSTD_BTRFS_MAX_INPUT));
> > +   if (size > workspace->size) {
> > +   if (!zstd_reallocate_mem(workspace, size))
> 
> This can allocate memory and this can happen on the writeout path, i.e.
> one of the reasons for that might be that the system needs more memory.
> 
> By the table above, the size can be up to 2.6MiB, which is a lot in
> terms of kernel memory, as there must either be contiguous unmapped
> memory or the virtual mappings must be created. Both are scarce
> resources and should be treated as such.
> 
> Given that there's no logic that would try to minimize the usage for
> workspaces, this can allocate many workspaces of that size.
> 
> Currently the workspace allocations have been moved to the early module
> loading phase so that they don't happen later and we don't have to
> allocate memory nor handle the failures. Your patch brings that back.

Even before this patch, we may try to allocate a workspace. See
__find_workspace():

https://github.com/kdave/btrfs-devel/blob/fd0f5617a8a2ee92dd461d01cf9c5c37363ccc8d/fs/btrfs/compression.c#L897

We already limit it to one per CPU, and only allocate when needed.
Anything greater than that has to wait. Maybe we should improve that to
also include a limit on the total amount of memory allocated? That would
be more flexible than your approach below of special-casing levels above
the default, and I like it more than Timofey's idea of falling back to a
lower level.
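That suggestion, capping the total memory handed out for workspaces instead of special-casing high levels, could look roughly like this; the names are hypothetical and this is not the kernel's actual workspace code:

```python
class WorkspacePool:
    """Hand out compression workspaces, but refuse to allocate a new one
    once the pool's total memory would exceed a fixed budget; callers
    must then wait for an existing workspace to be freed, just as they
    already wait today when the per-CPU limit is hit."""

    def __init__(self, budget_bytes):
        self.budget = budget_bytes
        self.in_use = 0

    def try_alloc(self, size):
        if self.in_use + size > self.budget:
            return False  # caller waits for a free workspace instead
        self.in_use += size
        return True

    def free(self, size):
        self.in_use -= size


pool = WorkspacePool(budget_bytes=4 * 1024 * 1024)
first = pool.try_alloc(2600 * 1024)   # a ~2.6 MB level-15 workspace fits
second = pool.try_alloc(2600 * 1024)  # a second one would blow the budget
```

The point is only the accounting structure: one knob bounds memory for all levels, so no level needs special treatment.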

> The solution I'm currently thinking about can make the levels work but
> would be limited in throughput as a trade-off for the memory
> consumption.
> 
> - preallocate one workspace for level 15 per mounted filesystem, using

Re: [PATCH v2] btrfs: add zstd compression level support

2018-11-19 Thread Omar Sandoval
On Tue, Nov 13, 2018 at 04:29:33PM +0300, Timofey Titovets wrote:
> On Tue, Nov 13, 2018 at 04:52, Nick Terrell  wrote:
> >
> >
> >
> > > On Nov 12, 2018, at 4:33 PM, David Sterba  wrote:
> > >
> > > On Wed, Oct 31, 2018 at 11:11:08AM -0700, Nick Terrell wrote:
> > >> From: Jennifer Liu 
> > >>
> > >> Adds zstd compression level support to btrfs. Zstd requires
> > >> different amounts of memory for each level, so the design had
> > >> to be modified to allow set_level() to allocate memory. We
> > >> preallocate one workspace of the maximum size to guarantee
> > >> forward progress. This feature is expected to be useful for
> > >> read-mostly filesystems, or when creating images.
> > >>
> > >> Benchmarks run in qemu on Intel x86 with a single core.
> > >> The benchmark measures the time to copy the Silesia corpus [0] to
> > >> a btrfs filesystem 10 times, then read it back.
> > >>
> > >> The two important things to note are:
> > >> - The decompression speed and memory remains constant.
> > >>  The memory required to decompress is the same as level 1.
> > >> - The compression speed and ratio will vary based on the source.
> > >>
> > >> Level   Ratio   Compression   Decompression   Compression Memory
> > >> 1       2.59    153 MB/s      112 MB/s        0.8 MB
> > >> 2       2.67    136 MB/s      113 MB/s        1.0 MB
> > >> 3       2.72    106 MB/s      115 MB/s        1.3 MB
> > >> 4       2.78    86 MB/s       109 MB/s        0.9 MB
> > >> 5       2.83    69 MB/s       109 MB/s        1.4 MB
> > >> 6       2.89    53 MB/s       110 MB/s        1.5 MB
> > >> 7       2.91    40 MB/s       112 MB/s        1.4 MB
> > >> 8       2.92    34 MB/s       110 MB/s        1.8 MB
> > >> 9       2.93    27 MB/s       109 MB/s        1.8 MB
> > >> 10      2.94    22 MB/s       109 MB/s        1.8 MB
> > >> 11      2.95    17 MB/s       114 MB/s        1.8 MB
> > >> 12      2.95    13 MB/s       113 MB/s        1.8 MB
> > >> 13      2.95    10 MB/s       111 MB/s        2.3 MB
> > >> 14      2.99    7 MB/s        110 MB/s        2.6 MB
> > >> 15      3.03    6 MB/s        110 MB/s        2.6 MB
> > >>
> > >> [0] 
> > >> http://sun.aei.polsl.pl/~sdeor/index.php?page=silesia
> > >>
> > >> Signed-off-by: Jennifer Liu 
> > >> Signed-off-by: Nick Terrell 
> > >> Reviewed-by: Omar Sandoval 
> > >> ---
> > >> v1 -> v2:
> > >> - Don't reflow the unchanged line.
> > >>
> > >> fs/btrfs/compression.c | 169 +
> > >> fs/btrfs/compression.h |  18 +++--
> > >> fs/btrfs/lzo.c |   5 +-
> > >> fs/btrfs/super.c   |   7 +-
> > >> fs/btrfs/zlib.c|  33 
> > >> fs/btrfs/zstd.c|  74 +-
> > >> 6 files changed, 202 insertions(+), 104 deletions(-)
> > >>
> > >> diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
> > >> index 2955a4ea2fa8..b46652cb653e 100644
> > >> --- a/fs/btrfs/compression.c
> > >> +++ b/fs/btrfs/compression.c
> > >> @@ -822,9 +822,12 @@ void __init btrfs_init_compress(void)
> > >>
> > >>  /*
> > >>   * Preallocate one workspace for each compression type so
> > >> - * we can guarantee forward progress in the worst case
> > >> + * we can guarantee forward progress in the worst case.
> > >> + * Provide the maximum compression level to guarantee large
> > >> + * enough workspace.
> > >>   */
> > >> -workspace = btrfs_compress_op[i]->alloc_workspace();
> > >> +workspace = btrfs_compress_op[i]->alloc_workspace(
> > >> +btrfs_compress_op[i]->max_level);
> >
> > We provide the max level here, so we have at least one workspace per
> > compression type that is large enough.
> >
> > >> 

Re: [PATCH 5/6] btrfs: remove unused variable tree in end_compressed_bio_write()

2018-11-14 Thread Omar Sandoval
On Wed, Nov 14, 2018 at 02:35:19PM +0100, Johannes Thumshirn wrote:
> Commit 2922040236f9 (btrfs: Remove extent_io_ops::writepage_end_io_hook)
> removed the indirection to extent_io_ops::writepage_end_io_hook but didn't
> remove the tree variable which then became unused.
> 
> Remove 'tree' as well to silence the warning when -Wunused-but-set-variable
> is used to compile btrfs.

Reviewed-by: Omar Sandoval 

> Signed-off-by: Johannes Thumshirn 
> ---
>  fs/btrfs/compression.c | 2 --
>  1 file changed, 2 deletions(-)
> 
> diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
> index bde8d0487bbb..088570c5dfb8 100644
> --- a/fs/btrfs/compression.c
> +++ b/fs/btrfs/compression.c
> @@ -229,7 +229,6 @@ static noinline void end_compressed_writeback(struct inode *inode,
>   */
>  static void end_compressed_bio_write(struct bio *bio)
>  {
> - struct extent_io_tree *tree;
>   struct compressed_bio *cb = bio->bi_private;
>   struct inode *inode;
>   struct page *page;
> @@ -248,7 +247,6 @@ static void end_compressed_bio_write(struct bio *bio)
>* call back into the FS and do all the end_io operations
>*/
>   inode = cb->inode;
> - tree = &BTRFS_I(inode)->io_tree;
>   cb->compressed_pages[0]->mapping = cb->inode->i_mapping;
>   btrfs_writepage_endio_finish_ordered(cb->compressed_pages[0],
>   cb->start, cb->start + cb->len - 1, NULL,
> -- 
> 2.16.4
> 


Re: [PATCH 4/6] btrfs: remove unused variable tree in bio_readpage_error()

2018-11-14 Thread Omar Sandoval
On Wed, Nov 14, 2018 at 02:35:18PM +0100, Johannes Thumshirn wrote:
> Commit 2922040236f9 (btrfs: Remove extent_io_ops::writepage_end_io_hook)
> removed the indirection to extent_io_ops::writepage_end_io_hook but didn't
> remove the tree variable which then became unused.
> 
> Remove 'tree' as well to silence the warning when -Wunused-but-set-variable is
> used to compile btrfs.

The subject says bio_readpage_error() but this is
end_extent_writepage(). Otherwise,

Reviewed-by: Omar Sandoval 

> Signed-off-by: Johannes Thumshirn 
> ---
>  fs/btrfs/extent_io.c | 3 ---
>  1 file changed, 3 deletions(-)
> 
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index 0f8f9c035812..17a15cc6b542 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -2403,11 +2403,8 @@ static int bio_readpage_error(struct bio *failed_bio, u64 phy_offset,
>  void end_extent_writepage(struct page *page, int err, u64 start, u64 end)
>  {
>   int uptodate = (err == 0);
> - struct extent_io_tree *tree;
>   int ret = 0;
>  
> - tree = &BTRFS_I(page->mapping->host)->io_tree;
> -
>   btrfs_writepage_endio_finish_ordered(page, start, end, NULL, uptodate);
>  
>   if (!uptodate) {
> -- 
> 2.16.4
> 


Re: [PATCH 3/6] btrfs: remove unused function btrfs_sysfs_feature_update()

2018-11-14 Thread Omar Sandoval
On Wed, Nov 14, 2018 at 02:35:17PM +0100, Johannes Thumshirn wrote:
> btrfs_sysfs_feature_update() was introduced with commit 444e75169872 (btrfs:
> sysfs: introduce helper for syncing bits with sysfs files) to provide a helper
> which was used in 14e46e04958d (btrfs: synchronize incompat feature bits with
> sysfs files).
> 
> But commit e410e34fad91 (Revert "btrfs: synchronize incompat feature bits with
> sysfs files") reverted 14e46e04958d so btrfs_sysfs_feature_update() ended up
> as an unused function.

Reviewed-by: Omar Sandoval 

> Signed-off-by: Johannes Thumshirn 
> ---
>  fs/btrfs/sysfs.c | 33 -
>  fs/btrfs/sysfs.h |  2 --
>  2 files changed, 35 deletions(-)
> 
> diff --git a/fs/btrfs/sysfs.c b/fs/btrfs/sysfs.c
> index 3717c864ba23..a22a7c5f75eb 100644
> --- a/fs/btrfs/sysfs.c
> +++ b/fs/btrfs/sysfs.c
> @@ -858,39 +858,6 @@ int btrfs_sysfs_add_mounted(struct btrfs_fs_info *fs_info)
>   return error;
>  }
>  
> -
> -/*
> - * Change per-fs features in /sys/fs/btrfs/UUID/features to match current
> - * values in superblock. Call after any changes to incompat/compat_ro flags
> - */
> -void btrfs_sysfs_feature_update(struct btrfs_fs_info *fs_info,
> - u64 bit, enum btrfs_feature_set set)
> -{
> - struct btrfs_fs_devices *fs_devs;
> - struct kobject *fsid_kobj;
> - u64 features;
> - int ret;
> -
> - if (!fs_info)
> - return;
> -
> - features = get_features(fs_info, set);
> - ASSERT(bit & supported_feature_masks[set]);
> -
> - fs_devs = fs_info->fs_devices;
> - fsid_kobj = &fs_devs->fsid_kobj;
> -
> - if (!fsid_kobj->state_initialized)
> - return;
> -
> - /*
> -  * FIXME: this is too heavy to update just one value, ideally we'd like
> -  * to use sysfs_update_group but some refactoring is needed first.
> -  */
> - sysfs_remove_group(fsid_kobj, &btrfs_feature_attr_group);
> - ret = sysfs_create_group(fsid_kobj, &btrfs_feature_attr_group);
> -}
> -
>  static int btrfs_init_debugfs(void)
>  {
>  #ifdef CONFIG_DEBUG_FS
> diff --git a/fs/btrfs/sysfs.h b/fs/btrfs/sysfs.h
> index c6ee600aff89..93feedde8485 100644
> --- a/fs/btrfs/sysfs.h
> +++ b/fs/btrfs/sysfs.h
> @@ -88,7 +88,5 @@ int btrfs_sysfs_add_fsid(struct btrfs_fs_devices *fs_devs,
>   struct kobject *parent);
>  int btrfs_sysfs_add_device(struct btrfs_fs_devices *fs_devs);
>  void btrfs_sysfs_remove_fsid(struct btrfs_fs_devices *fs_devs);
> -void btrfs_sysfs_feature_update(struct btrfs_fs_info *fs_info,
> - u64 bit, enum btrfs_feature_set set);
>  
>  #endif
> -- 
> 2.16.4
> 


Re: [PATCH 1/6] btrfs: remove unused drop_on_err in btrfs_mkdir()

2018-11-14 Thread Omar Sandoval
On Wed, Nov 14, 2018 at 02:35:15PM +0100, Johannes Thumshirn wrote:
> Up to commit 32955c5422a8 (btrfs: switch to discard_new_inode()) the
> drop_on_err variable in btrfs_mkdir() was used to check whether the inode had
> to be dropped via iput().
> 
> After commit 32955c5422a8 (btrfs: switch to discard_new_inode())
> discard_new_inode() is called when err is set and inode is non NULL. Therefore
> drop_on_err is not used anymore and thus causes a warning when building with
> -Wunused-but-set-variable.

Reviewed-by: Omar Sandoval 

> Signed-off-by: Johannes Thumshirn 
> ---
>  fs/btrfs/inode.c | 3 ---
>  1 file changed, 3 deletions(-)
> 
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index 4a2f9f7fd96e..7d17b0a654e6 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -6677,7 +6677,6 @@ static int btrfs_mkdir(struct inode *dir, struct dentry *dentry, umode_t mode)
>   struct btrfs_trans_handle *trans;
>   struct btrfs_root *root = BTRFS_I(dir)->root;
>   int err = 0;
> - int drop_on_err = 0;
>   u64 objectid = 0;
>   u64 index = 0;
>  
> @@ -6703,7 +6702,6 @@ static int btrfs_mkdir(struct inode *dir, struct dentry *dentry, umode_t mode)
>   goto out_fail;
>   }
>  
> - drop_on_err = 1;
>   /* these must be set before we unlock the inode */
>   inode->i_op = _dir_inode_operations;
>   inode->i_fop = _dir_file_operations;
> @@ -6724,7 +6722,6 @@ static int btrfs_mkdir(struct inode *dir, struct dentry *dentry, umode_t mode)
>   goto out_fail;
>  
>   d_instantiate_new(dentry, inode);
> - drop_on_err = 0;
>  
>  out_fail:
>   btrfs_end_transaction(trans);
> -- 
> 2.16.4
> 


[PATCH 09/10] libbtrfsutil: bump version to 1.1.0

2018-11-13 Thread Omar Sandoval
From: Omar Sandoval 

With the previous few fixes and features, we should bump the minor
version.

Signed-off-by: Omar Sandoval 
---
 libbtrfsutil/btrfsutil.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/libbtrfsutil/btrfsutil.h b/libbtrfsutil/btrfsutil.h
index d88c39e5..ad4f043e 100644
--- a/libbtrfsutil/btrfsutil.h
+++ b/libbtrfsutil/btrfsutil.h
@@ -26,7 +26,7 @@
 #include 
 
 #define BTRFS_UTIL_VERSION_MAJOR 1
-#define BTRFS_UTIL_VERSION_MINOR 0
+#define BTRFS_UTIL_VERSION_MINOR 1
 #define BTRFS_UTIL_VERSION_PATCH 0
 
 #ifdef __cplusplus
-- 
2.19.1
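Downstream code can gate optional behavior on the bumped version. A minimal sketch: the tuple transcribes the three macros in this patch, and `have_unprivileged_api` is a hypothetical helper, not part of libbtrfsutil:

```python
# (BTRFS_UTIL_VERSION_MAJOR, BTRFS_UTIL_VERSION_MINOR, BTRFS_UTIL_VERSION_PATCH)
# from the patched btrfsutil.h, as a comparable tuple:
BTRFS_UTIL_VERSION = (1, 1, 0)

def have_unprivileged_api(version=BTRFS_UTIL_VERSION):
    """The features in this series first ship with library version 1.1.0."""
    return version >= (1, 1, 0)
```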



[PATCH 10/10] libbtrfsutil: document API in README

2018-11-13 Thread Omar Sandoval
From: Omar Sandoval 

btrfsutil.h and the Python docstrings are thorough, but I've gotten a
couple of requests for a high-level overview of the available interfaces
and example usages. Add them to README.md.

Signed-off-by: Omar Sandoval 
---
 libbtrfsutil/README.md | 422 -
 1 file changed, 421 insertions(+), 1 deletion(-)

diff --git a/libbtrfsutil/README.md b/libbtrfsutil/README.md
index 0c8eba44..30ae39b6 100644
--- a/libbtrfsutil/README.md
+++ b/libbtrfsutil/README.md
@@ -6,6 +6,425 @@ the LGPL. libbtrfsutil provides interfaces for a subset of the operations
 offered by the `btrfs` command line utility. It also includes official Python
 bindings (Python 3 only).
 
+API Overview
+============
+
+This section provides an overview of the interfaces available in libbtrfsutil
+as well as example usages. Detailed documentation for the C API can be found in
+[`btrfsutil.h`](btrfsutil.h). Detailed documentation for the Python bindings is
+available with `pydoc3 btrfsutil` or in the interpreter:
+
+```
+>>> import btrfsutil
+>>> help(btrfsutil)
+```
+
+Many functions in the C API have a variant taking a path and a variant taking a
+file descriptor. The latter has the same name as the former with an `_fd`
+suffix. The Python bindings for these functions can take a path, a file object,
+or a file descriptor.
+
+Error handling is omitted from most of these examples for brevity. Please
+handle errors in production code.
+
+### Error Handling
+
+In the C API, all functions that can return an error return an `enum
+btrfs_util_error` and set `errno`. `BTRFS_UTIL_OK` (zero) is returned on
+success. `btrfs_util_strerror()` converts an error code to a string
+description suitable for human-friendly error reporting.
+
+```c
+enum btrfs_util_error err;
+
+err = btrfs_util_sync("/");
+if (err)
+   fprintf(stderr, "%s: %m\n", btrfs_util_strerror(err));
+```
+
+In the Python bindings, functions may raise a `BtrfsUtilError`, which is a
+subclass of `OSError` with an added `btrfsutilerror` error code member. Error
+codes are available as `ERROR_*` constants.
+
+```python
+try:
+btrfsutil.sync('/')
+except btrfsutil.BtrfsUtilError as e:
+print(e, file=sys.stderr)
+```
+
+### Filesystem Operations
+
+There are several operations which act on the entire filesystem.
+
+#### Sync
+
+Btrfs can commit all caches for a specific filesystem to disk.
+
+`btrfs_util_sync()` forces a sync on the filesystem containing the given file
+and waits for it to complete.
+
+`btrfs_util_wait_sync()` waits for a previously started transaction to complete. The
+transaction is specified by ID, which may be zero to indicate the current
+transaction.
+
+`btrfs_util_start_sync()` asynchronously starts a sync and returns a transaction ID
+which can then be passed to `btrfs_util_wait_sync()`.
+
+```c
+uint64_t transid;
+btrfs_util_sync("/");
+btrfs_util_start_sync("/", &transid);
+btrfs_util_wait_sync("/", transid);
+btrfs_util_wait_sync("/", 0);
+```
+
+```python
+btrfsutil.sync('/')
+transid = btrfsutil.start_sync('/')
+btrfsutil.wait_sync('/', transid)
+btrfsutil.wait_sync('/')  # equivalent to wait_sync('/', 0)
+```
+
+All of these functions have `_fd` variants.
+
+The equivalent `btrfs-progs` command is `btrfs filesystem sync`.
+
+### Subvolume Operations
+
+Functions which take a file and a subvolume ID can be used in two ways. If zero
+is given as the subvolume ID, then the given file is used as the subvolume.
+Otherwise, the given file can be any file in the filesystem, and the subvolume
+with the given ID is used.
+
+#### Subvolume Information
+
+`btrfs_util_is_subvolume()` returns whether a given file is a subvolume.
+
+`btrfs_util_subvolume_id()` returns the ID of the subvolume containing the
+given file.
+
+```c
+enum btrfs_util_error err;
+err = btrfs_util_is_subvolume("/subvol");
+if (!err)
+   printf("Subvolume\n");
+else if (err == BTRFS_UTIL_ERROR_NOT_BTRFS || err == BTRFS_UTIL_ERROR_NOT_SUBVOLUME)
+   printf("Not subvolume\n");
+uint64_t id;
+btrfs_util_subvolume_id("/subvol", &id);
+```
+
+```python
+if btrfsutil.is_subvolume('/subvol'):
+print('Subvolume')
+else:
+print('Not subvolume')
+id_ = btrfsutil.subvolume_id('/subvol')
+```
+
+`btrfs_util_subvolume_path()` returns the path of the subvolume with the given
+ID relative to the filesystem root. This requires `CAP_SYS_ADMIN`. The path
+must be freed with `free()`.
+
+```c
+char *path;
+btrfs_util_subvolume_path("/", 256, &path);
+free(path);
+btrfs_util_subvolume_path("/subvol", 0, &path);
+free(path);
+```
+
+```python
+path = btrfsutil.subvolume_path('/', 256)
+path = btrfsutil.subvolume_path('/subvol')  # equivalent to subvolume_path('/subvol', 0)
+```
+
+`btrfs_util_subvolume_info()` returns information (including ID, parent ID,
+UUID) about a subvolume. In the C API, this is returned as a `struct
+btrfs_util_subvolume_info`.

[PATCH 07/10] libbtrfsutil: relax the privileges of subvolume_info()

2018-11-13 Thread Omar Sandoval
From: Omar Sandoval 

Attempt to use the BTRFS_IOC_GET_SUBVOL_INFO ioctl (added in kernel
4.18) for subvolume_info() if not root. Also, rename
get_subvolume_info_root() -> get_subvolume_info_privileged() for
consistency with further changes.

This is based on a patch from Misono Tomohiro.

Signed-off-by: Omar Sandoval 
---
 libbtrfsutil/btrfsutil.h|  4 +-
 libbtrfsutil/errors.c   |  2 +
 libbtrfsutil/python/tests/test_subvolume.py | 42 
 libbtrfsutil/subvolume.c| 53 +++--
 4 files changed, 89 insertions(+), 12 deletions(-)

diff --git a/libbtrfsutil/btrfsutil.h b/libbtrfsutil/btrfsutil.h
index 6d655f49..c1925007 100644
--- a/libbtrfsutil/btrfsutil.h
+++ b/libbtrfsutil/btrfsutil.h
@@ -63,6 +63,7 @@ enum btrfs_util_error {
BTRFS_UTIL_ERROR_SYNC_FAILED,
BTRFS_UTIL_ERROR_START_SYNC_FAILED,
BTRFS_UTIL_ERROR_WAIT_SYNC_FAILED,
+   BTRFS_UTIL_ERROR_GET_SUBVOL_INFO_FAILED,
 };
 
 /**
@@ -266,7 +267,8 @@ struct btrfs_util_subvolume_info {
  * to check whether the subvolume exists; %BTRFS_UTIL_ERROR_SUBVOLUME_NOT_FOUND
  * will be returned if it does not.
  *
- * This requires appropriate privilege (CAP_SYS_ADMIN).
+ * This requires appropriate privilege (CAP_SYS_ADMIN) unless @id is zero and
+ * the kernel supports BTRFS_IOC_GET_SUBVOL_INFO (kernel >= 4.18).
  *
  * Return: %BTRFS_UTIL_OK on success, non-zero error code on failure.
  */
diff --git a/libbtrfsutil/errors.c b/libbtrfsutil/errors.c
index 634edc65..cf968b03 100644
--- a/libbtrfsutil/errors.c
+++ b/libbtrfsutil/errors.c
@@ -45,6 +45,8 @@ static const char * const error_messages[] = {
[BTRFS_UTIL_ERROR_SYNC_FAILED] = "Could not sync filesystem",
> [BTRFS_UTIL_ERROR_START_SYNC_FAILED] = "Could not start filesystem sync",
> [BTRFS_UTIL_ERROR_WAIT_SYNC_FAILED] = "Could not wait for filesystem sync",
> +   [BTRFS_UTIL_ERROR_GET_SUBVOL_INFO_FAILED] =
> +   "Could not get subvolume information with BTRFS_IOC_GET_SUBVOL_INFO",
 };
 
 PUBLIC const char *btrfs_util_strerror(enum btrfs_util_error err)
diff --git a/libbtrfsutil/python/tests/test_subvolume.py b/libbtrfsutil/python/tests/test_subvolume.py
index 4049b08e..55ebf34d 100644
--- a/libbtrfsutil/python/tests/test_subvolume.py
+++ b/libbtrfsutil/python/tests/test_subvolume.py
@@ -23,7 +23,12 @@ from pathlib import PurePath
 import traceback
 
 import btrfsutil
-from tests import BtrfsTestCase, HAVE_PATH_LIKE
+from tests import (
+BtrfsTestCase,
+drop_privs,
+HAVE_PATH_LIKE,
+skipUnlessHaveNobody,
+)
 
 
 class TestSubvolume(BtrfsTestCase):
@@ -87,7 +92,7 @@ class TestSubvolume(BtrfsTestCase):
 finally:
 os.chdir(pwd)
 
-def test_subvolume_info(self):
+def _test_subvolume_info(self, subvol, snapshot):
 for arg in self.path_or_fd(self.mountpoint):
 with self.subTest(type=type(arg)):
 info = btrfsutil.subvolume_info(arg)
@@ -100,7 +105,7 @@ class TestSubvolume(BtrfsTestCase):
 self.assertEqual(info.parent_uuid, bytes(16))
 self.assertEqual(info.received_uuid, bytes(16))
 self.assertNotEqual(info.generation, 0)
-self.assertEqual(info.ctransid, 0)
+self.assertGreaterEqual(info.ctransid, 0)
 self.assertEqual(info.otransid, 0)
 self.assertEqual(info.stransid, 0)
 self.assertEqual(info.rtransid, 0)
@@ -109,9 +114,6 @@ class TestSubvolume(BtrfsTestCase):
 self.assertEqual(info.stime, 0)
 self.assertEqual(info.rtime, 0)
 
-subvol = os.path.join(self.mountpoint, 'subvol')
-btrfsutil.create_subvolume(subvol)
-
 info = btrfsutil.subvolume_info(subvol)
 self.assertEqual(info.id, 256)
 self.assertEqual(info.parent_id, 5)
@@ -132,19 +134,43 @@ class TestSubvolume(BtrfsTestCase):
 self.assertEqual(info.rtime, 0)
 
 subvol_uuid = info.uuid
-snapshot = os.path.join(self.mountpoint, 'snapshot')
-btrfsutil.create_snapshot(subvol, snapshot)
 
 info = btrfsutil.subvolume_info(snapshot)
 self.assertEqual(info.parent_uuid, subvol_uuid)
 
 # TODO: test received_uuid, stransid, rtransid, stime, and rtime
 
+def test_subvolume_info(self):
+subvol = os.path.join(self.mountpoint, 'subvol')
+btrfsutil.create_subvolume(subvol)
+snapshot = os.path.join(self.mountpoint, 'snapshot')
+btrfsutil.create_snapshot(subvol, snapshot)
+
+self._test_subvolume_info(subvol, snapshot)
+
 for arg in self.path_or_fd(self.mountpoint):
 with self.subTest(type=type(arg)):
 with self.assertRaises(btrfsutil.BtrfsUtilError) as e:
 # BTRFS_EXTENT_TREE_OBJECTID
 btrfsutil.subvolume_info(arg, 2)
+ 

[PATCH 08/10] libbtrfsutil: relax the privileges of subvolume iterator

2018-11-13 Thread Omar Sandoval
From: Omar Sandoval 

We can use the new BTRFS_IOC_GET_SUBVOL_ROOTREF and
BTRFS_IOC_INO_LOOKUP_USER ioctls to allow non-root users to list
subvolumes.

This is based on a patch from Misono Tomohiro but takes a different
approach (mainly, this approach is more similar to the existing tree
search approach).

Signed-off-by: Omar Sandoval 
---
 libbtrfsutil/btrfsutil.h|  15 +-
 libbtrfsutil/errors.c   |   6 +
 libbtrfsutil/python/tests/test_subvolume.py | 180 +++---
 libbtrfsutil/subvolume.c| 354 +---
 4 files changed, 450 insertions(+), 105 deletions(-)

diff --git a/libbtrfsutil/btrfsutil.h b/libbtrfsutil/btrfsutil.h
index c1925007..d88c39e5 100644
--- a/libbtrfsutil/btrfsutil.h
+++ b/libbtrfsutil/btrfsutil.h
@@ -64,6 +64,9 @@ enum btrfs_util_error {
BTRFS_UTIL_ERROR_START_SYNC_FAILED,
BTRFS_UTIL_ERROR_WAIT_SYNC_FAILED,
BTRFS_UTIL_ERROR_GET_SUBVOL_INFO_FAILED,
+   BTRFS_UTIL_ERROR_GET_SUBVOL_ROOTREF_FAILED,
+   BTRFS_UTIL_ERROR_INO_LOOKUP_USER_FAILED,
+   BTRFS_UTIL_ERROR_FS_INFO_FAILED,
 };
 
 /**
@@ -507,6 +510,12 @@ struct btrfs_util_subvolume_iterator;
  * @flags: Bitmask of BTRFS_UTIL_SUBVOLUME_ITERATOR_* flags.
  * @ret: Returned iterator.
  *
+ * Subvolume iterators require appropriate privilege (CAP_SYS_ADMIN) unless @top
+ * is zero and the kernel supports BTRFS_IOC_GET_SUBVOL_ROOTREF and
+ * BTRFS_IOC_INO_LOOKUP_USER (kernel >= 4.18). In this case, subvolumes which
+ * cannot be accessed (e.g., due to permissions or other mounts) will be
+ * skipped.
+ *
  * The returned iterator must be freed with
  * btrfs_util_destroy_subvolume_iterator().
  *
@@ -555,7 +564,8 @@ int btrfs_util_subvolume_iterator_fd(const struct btrfs_util_subvolume_iterator
  * Must be freed with free().
  * @id_ret: Returned subvolume ID. May be %NULL.
  *
- * This requires appropriate privilege (CAP_SYS_ADMIN).
+ * This requires appropriate privilege (CAP_SYS_ADMIN) for kernels < 4.18. See
+ * btrfs_util_create_subvolume_iterator().
  *
  * Return: %BTRFS_UTIL_OK on success, %BTRFS_UTIL_ERROR_STOP_ITERATION if there
  * are no more subvolumes, non-zero error code on failure.
@@ -574,7 +584,8 @@ enum btrfs_util_error btrfs_util_subvolume_iterator_next(struct btrfs_util_subvo
  * This convenience function basically combines
  * btrfs_util_subvolume_iterator_next() and btrfs_util_subvolume_info().
  *
- * This requires appropriate privilege (CAP_SYS_ADMIN).
+ * This requires appropriate privilege (CAP_SYS_ADMIN) for kernels < 4.18. See
+ * btrfs_util_create_subvolume_iterator().
  *
  * Return: See btrfs_util_subvolume_iterator_next().
  */
diff --git a/libbtrfsutil/errors.c b/libbtrfsutil/errors.c
index cf968b03..d39b38d0 100644
--- a/libbtrfsutil/errors.c
+++ b/libbtrfsutil/errors.c
@@ -47,6 +47,12 @@ static const char * const error_messages[] = {
> [BTRFS_UTIL_ERROR_WAIT_SYNC_FAILED] = "Could not wait for filesystem sync",
> [BTRFS_UTIL_ERROR_GET_SUBVOL_INFO_FAILED] =
> "Could not get subvolume information with BTRFS_IOC_GET_SUBVOL_INFO",
> +   [BTRFS_UTIL_ERROR_GET_SUBVOL_ROOTREF_FAILED] =
> +   "Could not get rootref information with BTRFS_IOC_GET_SUBVOL_ROOTREF",
> +   [BTRFS_UTIL_ERROR_INO_LOOKUP_USER_FAILED] =
> +   "Could not resolve subvolume path with BTRFS_IOC_INO_LOOKUP_USER",
+   [BTRFS_UTIL_ERROR_FS_INFO_FAILED] =
+   "Could not get filesystem information",
 };
 
 PUBLIC const char *btrfs_util_strerror(enum btrfs_util_error err)
diff --git a/libbtrfsutil/python/tests/test_subvolume.py b/libbtrfsutil/python/tests/test_subvolume.py
index 55ebf34d..99ec97bc 100644
--- a/libbtrfsutil/python/tests/test_subvolume.py
+++ b/libbtrfsutil/python/tests/test_subvolume.py
@@ -20,6 +20,7 @@ import errno
 import os
 import os.path
 from pathlib import PurePath
+import subprocess
 import traceback
 
 import btrfsutil
@@ -27,6 +28,8 @@ from tests import (
 BtrfsTestCase,
 drop_privs,
 HAVE_PATH_LIKE,
+NOBODY_UID,
+regain_privs,
 skipUnlessHaveNobody,
 )
 
@@ -354,69 +357,136 @@ class TestSubvolume(BtrfsTestCase):
 with self.subTest(type=type(arg)):
 self.assertEqual(btrfsutil.deleted_subvolumes(arg), [256])
 
-def test_subvolume_iterator(self):
-pwd = os.getcwd()
-try:
-os.chdir(self.mountpoint)
-btrfsutil.create_subvolume('foo')
-
-with btrfsutil.SubvolumeIterator('.', info=True) as it:
-path, subvol = next(it)
-self.assertEqual(path, 'foo')
-self.assertIsInstance(subvol, btrfsutil.SubvolumeInfo)
-self.assertEqual(subvol.id, 256)
-self.assertEqual(subvol.parent_id, 5)
-self.assertRaises(StopIteration, next, it)
-
-btrfsutil.create_subvolume('foo/bar')
-btrfs

[PATCH 06/10] libbtrfsutil: allow tests to create multiple Btrfs instances

2018-11-13 Thread Omar Sandoval
From: Omar Sandoval 

Some upcoming tests will need to create a second Btrfs filesystem, so
add support for this to the test helpers.

Signed-off-by: Omar Sandoval 
---
 libbtrfsutil/python/tests/__init__.py | 35 +--
 1 file changed, 22 insertions(+), 13 deletions(-)

diff --git a/libbtrfsutil/python/tests/__init__.py b/libbtrfsutil/python/tests/__init__.py
index 4bc11990..9fd6f6de 100644
--- a/libbtrfsutil/python/tests/__init__.py
+++ b/libbtrfsutil/python/tests/__init__.py
@@ -57,14 +57,18 @@ def regain_privs():
 
 @unittest.skipIf(os.geteuid() != 0, 'must be run as root')
 class BtrfsTestCase(unittest.TestCase):
-def setUp(self):
-self.mountpoint = tempfile.mkdtemp()
+def __init__(self, *args, **kwds):
+super().__init__(*args, **kwds)
+self._mountpoints = []
+
+def mount_btrfs(self):
+mountpoint = tempfile.mkdtemp()
 try:
 with tempfile.NamedTemporaryFile(delete=False) as f:
 os.truncate(f.fileno(), 1024 * 1024 * 1024)
-self.image = f.name
+image = f.name
 except Exception as e:
-os.rmdir(self.mountpoint)
+os.rmdir(mountpoint)
 raise e
 
 if os.path.exists('../../mkfs.btrfs'):
@@ -72,19 +76,24 @@ class BtrfsTestCase(unittest.TestCase):
 else:
 mkfs = 'mkfs.btrfs'
 try:
-subprocess.check_call([mkfs, '-q', self.image])
> -subprocess.check_call(['mount', '-o', 'loop', '--', self.image, self.mountpoint])
> +subprocess.check_call([mkfs, '-q', image])
> +subprocess.check_call(['mount', '-o', 'loop', '--', image, mountpoint])
 except Exception as e:
-os.remove(self.image)
-os.rmdir(self.mountpoint)
+os.rmdir(mountpoint)
+os.remove(image)
 raise e
 
+self._mountpoints.append((mountpoint, image))
+return mountpoint, image
+
+def setUp(self):
+self.mountpoint, self.image = self.mount_btrfs()
+
 def tearDown(self):
-try:
-subprocess.check_call(['umount', self.mountpoint])
-finally:
-os.remove(self.image)
-os.rmdir(self.mountpoint)
+for mountpoint, image in self._mountpoints:
+subprocess.call(['umount', '-R', mountpoint])
+os.rmdir(mountpoint)
+os.remove(image)
 
 @staticmethod
 def path_or_fd(path, open_flags=os.O_RDONLY):
-- 
2.19.1



[PATCH 03/10] libbtrfsutil: document qgroup_inherit parameter in Python bindings

2018-11-13 Thread Omar Sandoval
From: Omar Sandoval 

This has been supported since day one, but it wasn't documented.

Signed-off-by: Omar Sandoval 
---
 libbtrfsutil/python/module.c | 13 +
 1 file changed, 9 insertions(+), 4 deletions(-)

diff --git a/libbtrfsutil/python/module.c b/libbtrfsutil/python/module.c
index 625cc9c6..f8260c84 100644
--- a/libbtrfsutil/python/module.c
+++ b/libbtrfsutil/python/module.c
@@ -233,15 +233,18 @@ static PyMethodDef btrfsutil_methods[] = {
 "this ID instead of the given path"},
{"create_subvolume", (PyCFunction)create_subvolume,
 METH_VARARGS | METH_KEYWORDS,
-"create_subvolume(path, async_=False)\n\n"
+"create_subvolume(path, async_=False, qgroup_inherit=None)\n\n"
 "Create a new subvolume.\n\n"
 "Arguments:\n"
 "path -- string, bytes, or path-like object\n"
 "async_ -- create the subvolume without waiting for it to commit to\n"
-"disk and return the transaction ID"},
+"disk and return the transaction ID\n"
+"qgroup_inherit -- optional QgroupInherit object of qgroups to\n"
+"inherit from"},
{"create_snapshot", (PyCFunction)create_snapshot,
 METH_VARARGS | METH_KEYWORDS,
-"create_snapshot(source, path, recursive=False, read_only=False, async_=False)\n\n"
+"create_snapshot(source, path, recursive=False, read_only=False,\n"
+"async_=False, qgroup_inherit=None)\n\n"
 "Create a new snapshot.\n\n"
 "Arguments:\n"
 "source -- string, bytes, path-like object, or open file descriptor\n"
@@ -249,7 +252,9 @@ static PyMethodDef btrfsutil_methods[] = {
 "recursive -- also snapshot child subvolumes\n"
 "read_only -- create a read-only snapshot\n"
 "async_ -- create the subvolume without waiting for it to commit to\n"
-"disk and return the transaction ID"},
+"disk and return the transaction ID\n"
+"qgroup_inherit -- optional QgroupInherit object of qgroups to\n"
+"inherit from"},
{"delete_subvolume", (PyCFunction)delete_subvolume,
 METH_VARARGS | METH_KEYWORDS,
 "delete_subvolume(path, recursive=False)\n\n"
-- 
2.19.1



[PATCH 05/10] libbtrfsutil: add test helpers for dropping privileges

2018-11-13 Thread Omar Sandoval
From: Omar Sandoval 

These will be used for testing some upcoming changes which allow
unprivileged operations.

Signed-off-by: Omar Sandoval 
---
 libbtrfsutil/python/tests/__init__.py | 31 ++-
 1 file changed, 30 insertions(+), 1 deletion(-)

diff --git a/libbtrfsutil/python/tests/__init__.py b/libbtrfsutil/python/tests/__init__.py
index 35550e0a..4bc11990 100644
--- a/libbtrfsutil/python/tests/__init__.py
+++ b/libbtrfsutil/python/tests/__init__.py
@@ -15,14 +15,44 @@
 # You should have received a copy of the GNU Lesser General Public License
 # along with libbtrfsutil.  If not, see <http://www.gnu.org/licenses/>.
 
+import contextlib
 import os
 from pathlib import PurePath
+import pwd
 import subprocess
 import tempfile
 import unittest
 
 
 HAVE_PATH_LIKE = hasattr(PurePath, '__fspath__')
+try:
+NOBODY_UID = pwd.getpwnam('nobody').pw_uid
+skipUnlessHaveNobody = lambda func: func
+except KeyError:
+NOBODY_UID = None
+skipUnlessHaveNobody = unittest.skip('must have nobody user')
+
+
+@contextlib.contextmanager
+def drop_privs():
+try:
+os.seteuid(NOBODY_UID)
+yield
+finally:
+os.seteuid(0)
+
+
+@contextlib.contextmanager
+def regain_privs():
+uid = os.geteuid()
+if uid:
+try:
+os.seteuid(0)
+yield
+finally:
+os.seteuid(uid)
+else:
+yield
 
 
 @unittest.skipIf(os.geteuid() != 0, 'must be run as root')
@@ -67,4 +97,3 @@ class BtrfsTestCase(unittest.TestCase):
 yield fd
 finally:
 os.close(fd)
-
-- 
2.19.1



[PATCH 01/10] libbtrfsutil: use top=0 as default for SubvolumeIterator()

2018-11-13 Thread Omar Sandoval
From: Omar Sandoval 

Right now, we're defaulting to top=5 (i.e, all subvolumes). The
documented default is top=0 (i.e, only beneath the given path). This is
the expected behavior. Fix it and make the test cases cover it.

Reported-by: Jonathan Lemon 
Signed-off-by: Omar Sandoval 
---
 libbtrfsutil/python/subvolume.c | 2 +-
 libbtrfsutil/python/tests/test_subvolume.py | 2 ++
 2 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/libbtrfsutil/python/subvolume.c b/libbtrfsutil/python/subvolume.c
index 069e606b..6ecde1f6 100644
--- a/libbtrfsutil/python/subvolume.c
+++ b/libbtrfsutil/python/subvolume.c
@@ -525,7 +525,7 @@ static int SubvolumeIterator_init(SubvolumeIterator *self, PyObject *args,
static char *keywords[] = {"path", "top", "info", "post_order", NULL};
struct path_arg path = {.allow_fd = true};
enum btrfs_util_error err;
-   unsigned long long top = 5;
+   unsigned long long top = 0;
int info = 0;
int post_order = 0;
int flags = 0;
diff --git a/libbtrfsutil/python/tests/test_subvolume.py b/libbtrfsutil/python/tests/test_subvolume.py
index 93396cba..0788a564 100644
--- a/libbtrfsutil/python/tests/test_subvolume.py
+++ b/libbtrfsutil/python/tests/test_subvolume.py
@@ -353,6 +353,7 @@ class TestSubvolume(BtrfsTestCase):
 with self.subTest(type=type(arg)):
>  self.assertEqual(list(btrfsutil.SubvolumeIterator(arg)), subvols)
>  self.assertEqual(list(btrfsutil.SubvolumeIterator('.', top=0)), subvols)
> +self.assertEqual(list(btrfsutil.SubvolumeIterator('foo', top=5)), subvols)
 
>  self.assertEqual(list(btrfsutil.SubvolumeIterator('.', post_order=True)),
  [('foo/bar/baz', 258),
@@ -365,6 +366,7 @@ class TestSubvolume(BtrfsTestCase):
 ]
 
>  self.assertEqual(list(btrfsutil.SubvolumeIterator('.', top=256)), subvols)
+self.assertEqual(list(btrfsutil.SubvolumeIterator('foo')), subvols)
>  self.assertEqual(list(btrfsutil.SubvolumeIterator('foo', top=0)), subvols)
 
 os.rename('foo/bar/baz', 'baz')
-- 
2.19.1



[PATCH 00/10] btrfs-progs: my libbtrfsutil patch queue

2018-11-13 Thread Omar Sandoval
From: Omar Sandoval 

Hi,

This series contains my backlog of libbtrfsutil changes which I've been
collecting over the past few weeks.

Patches 1-4 are fixes. Patches 5-6 add functionality to the unit tests
which is needed for patches 7-8. Patches 7-8 add support for the
unprivileged ioctls added in Linux 4.18; more on those below. Patch 9
bumps the library version. Patch 10 adds documentation for the available
API along with examples.

Patches 7-8 are based on Misono Tomohiro's previous patch series [1],
with a few important changes.

- Both subvolume_info() and create_subvolume_iterator() now have unit
  tests for the unprivileged case.
- Both no longer explicitly check that top == 0 in the unprivileged
  case, since that will already fail with a clear permission error.
- Unprivileged iteration is much simpler: it uses openat() instead of
  fchdir() and is based more closely on the original tree search
  variant. This fixes a bug in post-order iteration in Misono's version.
- Unprivileged iteration does _not_ support passing in a non-subvolume
  path; if this behavior is desired, I'd like it to be a separate change
  with an explicit flag.

Please take a look.

Thanks!

1: https://www.spinics.net/lists/linux-btrfs/msg79285.html

Omar Sandoval (10):
  libbtrfsutil: use top=0 as default for SubvolumeIterator()
  libbtrfsutil: change async parameters to async_ in Python bindings
  libbtrfsutil: document qgroup_inherit parameter in Python bindings
  libbtrfsutil: use SubvolumeIterator as context manager in tests
  libbtrfsutil: add test helpers for dropping privileges
  libbtrfsutil: allow tests to create multiple Btrfs instances
  libbtrfsutil: relax the privileges of subvolume_info()
  libbtrfsutil: relax the privileges of subvolume iterator
  libbtrfsutil: bump version to 1.1.0
  libbtrfsutil: document API in README

 libbtrfsutil/README.md  | 422 +++-
 libbtrfsutil/btrfsutil.h|  21 +-
 libbtrfsutil/errors.c   |   8 +
 libbtrfsutil/python/module.c|  17 +-
 libbtrfsutil/python/subvolume.c |   6 +-
 libbtrfsutil/python/tests/__init__.py   |  66 ++-
 libbtrfsutil/python/tests/test_subvolume.py | 215 +++---
 libbtrfsutil/subvolume.c| 407 ---
 8 files changed, 1029 insertions(+), 133 deletions(-)

-- 
2.19.1



[PATCH 02/10] libbtrfsutil: change async parameters to async_ in Python bindings

2018-11-13 Thread Omar Sandoval
From: Omar Sandoval 

async became a keyword in Python 3.7, so, e.g., create_subvolume('foo',
async=True) is now a syntax error. Fix it with the Python convention of
adding a trailing underscore to the keyword (async -> async_). This is
what several other Python libraries did to handle this.
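
For illustration, a sketch of the problem and the rename; the `create_subvolume` signature below is a simplified stand-in for the real binding, not its actual implementation:

```python
import keyword

# Since Python 3.7, "async" is a hard keyword, so f(async=True) fails at
# compile time, before any code runs.
assert keyword.iskeyword("async")
compile("f(async_=True)", "<t>", "eval")     # fine
try:
    compile("f(async=True)", "<t>", "eval")  # rejected by the parser
except SyntaxError:
    pass
else:
    raise AssertionError("expected a SyntaxError")

# PEP 8 convention: append a trailing underscore to dodge the keyword.
def create_subvolume(path, async_=False):    # simplified stand-in
    return (path, async_)
```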

Signed-off-by: Omar Sandoval 
---
 libbtrfsutil/python/module.c| 8 
 libbtrfsutil/python/subvolume.c | 4 ++--
 libbtrfsutil/python/tests/test_subvolume.py | 4 ++--
 3 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/libbtrfsutil/python/module.c b/libbtrfsutil/python/module.c
index 2dbdc7be..625cc9c6 100644
--- a/libbtrfsutil/python/module.c
+++ b/libbtrfsutil/python/module.c
@@ -233,22 +233,22 @@ static PyMethodDef btrfsutil_methods[] = {
 "this ID instead of the given path"},
{"create_subvolume", (PyCFunction)create_subvolume,
 METH_VARARGS | METH_KEYWORDS,
-"create_subvolume(path, async=False)\n\n"
+"create_subvolume(path, async_=False)\n\n"
 "Create a new subvolume.\n\n"
 "Arguments:\n"
 "path -- string, bytes, or path-like object\n"
-"async -- create the subvolume without waiting for it to commit to\n"
+"async_ -- create the subvolume without waiting for it to commit to\n"
 "disk and return the transaction ID"},
{"create_snapshot", (PyCFunction)create_snapshot,
 METH_VARARGS | METH_KEYWORDS,
-"create_snapshot(source, path, recursive=False, read_only=False, async=False)\n\n"
+"create_snapshot(source, path, recursive=False, read_only=False, async_=False)\n\n"
 "Create a new snapshot.\n\n"
 "Arguments:\n"
 "source -- string, bytes, path-like object, or open file descriptor\n"
 "path -- string, bytes, or path-like object\n"
 "recursive -- also snapshot child subvolumes\n"
 "read_only -- create a read-only snapshot\n"
-"async -- create the subvolume without waiting for it to commit to\n"
+"async_ -- create the subvolume without waiting for it to commit to\n"
 "disk and return the transaction ID"},
{"delete_subvolume", (PyCFunction)delete_subvolume,
 METH_VARARGS | METH_KEYWORDS,
diff --git a/libbtrfsutil/python/subvolume.c b/libbtrfsutil/python/subvolume.c
index 6ecde1f6..0f893b91 100644
--- a/libbtrfsutil/python/subvolume.c
+++ b/libbtrfsutil/python/subvolume.c
@@ -322,7 +322,7 @@ PyObject *set_default_subvolume(PyObject *self, PyObject *args, PyObject *kwds)
 
 PyObject *create_subvolume(PyObject *self, PyObject *args, PyObject *kwds)
 {
-   static char *keywords[] = {"path", "async", "qgroup_inherit", NULL};
+   static char *keywords[] = {"path", "async_", "qgroup_inherit", NULL};
struct path_arg path = {.allow_fd = false};
enum btrfs_util_error err;
int async = 0;
@@ -352,7 +352,7 @@ PyObject *create_subvolume(PyObject *self, PyObject *args, PyObject *kwds)
 PyObject *create_snapshot(PyObject *self, PyObject *args, PyObject *kwds)
 {
static char *keywords[] = {
-   "source", "path", "recursive", "read_only", "async",
+   "source", "path", "recursive", "read_only", "async_",
"qgroup_inherit", NULL,
};
struct path_arg src = {.allow_fd = true}, dst = {.allow_fd = false};
diff --git a/libbtrfsutil/python/tests/test_subvolume.py b/libbtrfsutil/python/tests/test_subvolume.py
index 0788a564..f2a4cdb8 100644
--- a/libbtrfsutil/python/tests/test_subvolume.py
+++ b/libbtrfsutil/python/tests/test_subvolume.py
@@ -202,7 +202,7 @@ class TestSubvolume(BtrfsTestCase):
 btrfsutil.create_subvolume(subvol + '6//')
 self.assertTrue(btrfsutil.is_subvolume(subvol + '6'))
 
-transid = btrfsutil.create_subvolume(subvol + '7', async=True)
+transid = btrfsutil.create_subvolume(subvol + '7', async_=True)
 self.assertTrue(btrfsutil.is_subvolume(subvol + '7'))
 self.assertGreater(transid, 0)
 
@@ -265,7 +265,7 @@ class TestSubvolume(BtrfsTestCase):
 btrfsutil.create_snapshot(subvol, snapshot + '2', recursive=True)
 self.assertTrue(os.path.exists(os.path.join(snapshot + '2', 'nested/more_nested/nested_dir')))
 
-transid = btrfsutil.create_snapshot(subvol, snapshot + '3', recursive=True, async=True)
+transid = btrfsutil.create_snapshot(subvol, snapshot + '3', recursive=True, async_=True)
 self.assertTrue(os.path.exists(os.path.join(snapshot + '3', 'nested/more_nested/nested_dir')))
 self.assertGreater(transid, 0)
 
-- 
2.19.1



[PATCH 04/10] libbtrfsutil: use SubvolumeIterator as context manager in tests

2018-11-13 Thread Omar Sandoval
From: Omar Sandoval 

We're leaking file descriptors, which makes it impossible to clean up
the temporary mount point created by the test.
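
The fix leans on the iterator's context-manager protocol so the underlying file descriptor is always released. A minimal sketch of that pattern; `FdIterator` is a hypothetical stand-in for `SubvolumeIterator`, not the real class:

```python
import os
import tempfile

class FdIterator:
    """Hypothetical stand-in for SubvolumeIterator: owns a file
    descriptor and releases it deterministically via the
    context-manager protocol instead of waiting for GC."""
    def __init__(self, path):
        self._fd = os.open(path, os.O_RDONLY)

    def fileno(self):
        return self._fd

    def close(self):
        if self._fd is not None:
            os.close(self._fd)
            self._fd = None

    def __enter__(self):
        return self

    def __exit__(self, *exc):
        self.close()

with tempfile.TemporaryDirectory() as d:
    with FdIterator(d) as it:
        assert it.fileno() >= 0
    # Leaving the with-block closed the fd, even if the body had raised;
    # a leaked fd would keep the test's temporary mountpoint busy.
    assert it._fd is None
```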

Signed-off-by: Omar Sandoval 
---
 libbtrfsutil/python/tests/test_subvolume.py | 51 -
 1 file changed, 30 insertions(+), 21 deletions(-)

diff --git a/libbtrfsutil/python/tests/test_subvolume.py b/libbtrfsutil/python/tests/test_subvolume.py
index f2a4cdb8..4049b08e 100644
--- a/libbtrfsutil/python/tests/test_subvolume.py
+++ b/libbtrfsutil/python/tests/test_subvolume.py
@@ -334,11 +334,13 @@ class TestSubvolume(BtrfsTestCase):
 os.chdir(self.mountpoint)
 btrfsutil.create_subvolume('foo')
 
-path, subvol = next(btrfsutil.SubvolumeIterator('.', info=True))
-self.assertEqual(path, 'foo')
-self.assertIsInstance(subvol, btrfsutil.SubvolumeInfo)
-self.assertEqual(subvol.id, 256)
-self.assertEqual(subvol.parent_id, 5)
+with btrfsutil.SubvolumeIterator('.', info=True) as it:
+path, subvol = next(it)
+self.assertEqual(path, 'foo')
+self.assertIsInstance(subvol, btrfsutil.SubvolumeInfo)
+self.assertEqual(subvol.id, 256)
+self.assertEqual(subvol.parent_id, 5)
+self.assertRaises(StopIteration, next, it)
 
 btrfsutil.create_subvolume('foo/bar')
 btrfsutil.create_subvolume('foo/bar/baz')
@@ -350,30 +352,37 @@ class TestSubvolume(BtrfsTestCase):
 ]
 
 for arg in self.path_or_fd('.'):
-with self.subTest(type=type(arg)):
-self.assertEqual(list(btrfsutil.SubvolumeIterator(arg)), subvols)
-self.assertEqual(list(btrfsutil.SubvolumeIterator('.', top=0)), subvols)
-self.assertEqual(list(btrfsutil.SubvolumeIterator('foo', top=5)), subvols)
-
-self.assertEqual(list(btrfsutil.SubvolumeIterator('.', post_order=True)),
- [('foo/bar/baz', 258),
-  ('foo/bar', 257),
-  ('foo', 256)])
+with self.subTest(type=type(arg)), btrfsutil.SubvolumeIterator(arg) as it:
+self.assertEqual(list(it), subvols)
+with btrfsutil.SubvolumeIterator('.', top=0) as it:
+self.assertEqual(list(it), subvols)
+with btrfsutil.SubvolumeIterator('foo', top=5) as it:
+self.assertEqual(list(it), subvols)
+
+with btrfsutil.SubvolumeIterator('.', post_order=True) as it:
+self.assertEqual(list(it),
+ [('foo/bar/baz', 258),
+  ('foo/bar', 257),
+  ('foo', 256)])
 
 subvols = [
 ('bar', 257),
 ('bar/baz', 258),
 ]
 
-self.assertEqual(list(btrfsutil.SubvolumeIterator('.', top=256)), subvols)
-self.assertEqual(list(btrfsutil.SubvolumeIterator('foo')), subvols)
-self.assertEqual(list(btrfsutil.SubvolumeIterator('foo', top=0)), subvols)
+with btrfsutil.SubvolumeIterator('.', top=256) as it:
+self.assertEqual(list(it), subvols)
+with btrfsutil.SubvolumeIterator('foo') as it:
+self.assertEqual(list(it), subvols)
+with btrfsutil.SubvolumeIterator('foo', top=0) as it:
+self.assertEqual(list(it), subvols)
 
 os.rename('foo/bar/baz', 'baz')
-self.assertEqual(sorted(btrfsutil.SubvolumeIterator('.')),
- [('baz', 258),
-  ('foo', 256),
-  ('foo/bar', 257)])
+with btrfsutil.SubvolumeIterator('.') as it:
+self.assertEqual(sorted(it),
+ [('baz', 258),
+  ('foo', 256),
+  ('foo/bar', 257)])
 
 with btrfsutil.SubvolumeIterator('.') as it:
 self.assertGreaterEqual(it.fileno(), 0)
-- 
2.19.1



Re: [PATCH v9 0/6] Btrfs: implement swap file support

2018-11-09 Thread Omar Sandoval
On Wed, Nov 07, 2018 at 04:28:10PM +0100, David Sterba wrote:
> On Wed, Nov 07, 2018 at 05:07:00PM +0200, Nikolay Borisov wrote:
> > 
> > 
> > On 7.11.18 г. 16:49 ч., David Sterba wrote:
> > > On Tue, Nov 06, 2018 at 10:54:51AM +0100, David Sterba wrote:
> > >> On Thu, Sep 27, 2018 at 11:17:32AM -0700, Omar Sandoval wrote:
> > >>> From: Omar Sandoval 
> > >>> This series implements swap file support for Btrfs.
> > >>>
> > >>> Changes from v8 [1]:
> > >>>
> > >>> - Fixed a bug in btrfs_swap_activate() which would cause us to miss some
> > >>>   file extents if they were merged into one extent map entry.
> > >>> - Fixed build for !CONFIG_SWAP.
> > >>> - Changed all error messages to KERN_WARN.
> > >>> - Unindented long error messages.
> > >>>
> > >>> I've Cc'd Jon and Al on patch 3 this time, so hopefully we can get an
> > >>> ack for that one, too.
> > >>>
> > >>> Thanks!
> > >>>
> > >>> 1: https://www.spinics.net/lists/linux-btrfs/msg82267.html
> > >>>
> > >>> Omar Sandoval (6):
> > >>>   mm: split SWP_FILE into SWP_ACTIVATED and SWP_FS
> > >>>   mm: export add_swap_extent()
> > >>>   vfs: update swap_{,de}activate documentation
> > >>>   Btrfs: prevent ioctls from interfering with a swap file
> > >>>   Btrfs: rename get_chunk_map() and make it non-static
> > >>>   Btrfs: support swap files
> > >>
> > >> fstest generic/472 reports an assertion failure. This is on the updated 
> > >> fstests
> > >> git (70c4067285b0bc076), though it should not matter:
> > >>
> > >> [16597.002190] assertion failed: IS_ALIGNED(start, fs_info->sectorsize) 
> > >> && IS_ALIGNED(end + 1, fs_info->sectorsize), file: fs/btrfs/file-item.c, 
> > >> line: 319
> > > 
> > > I have to revert the patch for now as it kills the testing machines.
> > 
> > The reason is that the isize is not aligned to a sectorsize. Ie it
> > should be:
> > 
> > +   u64 isize = ALIGN_DOWN(i_size_read(inode), fs_info->sectorsize);
> > 
> > With this fixlet generic/472 succeeds.
> 
> Thanks for the fix, I'll fold it in.

Thanks, Nikolay, I missed that. I don't think i_size_read() is
necessary, though, since the inode is locked.
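
The fixlet's ALIGN_DOWN rounds the inode size down to a sector boundary; for a power-of-two alignment that is plain bit masking. A quick sketch of the same arithmetic (hypothetical values, Python standing in for the kernel macro):

```python
def align_down(x, a):
    """Round x down to a multiple of a, where a is a power of two --
    the same arithmetic as the kernel's ALIGN_DOWN() macro."""
    return x & ~(a - 1)

sectorsize = 4096
# An i_size that is not sector-aligned tripped the assertion in
# fs/btrfs/file-item.c; rounding down restores the invariant.
isize = align_down(10000, sectorsize)
assert isize == 8192 and isize % sectorsize == 0
```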


Re: [PATCH v2] Btrfs: fix missing delayed iputs on unmount

2018-11-09 Thread Omar Sandoval
On Wed, Nov 07, 2018 at 05:01:19PM +0100, David Sterba wrote:
> On Wed, Oct 31, 2018 at 10:06:08AM -0700, Omar Sandoval wrote:
> > From: Omar Sandoval 
> > 
> > There's a race between close_ctree() and cleaner_kthread().
> > close_ctree() sets btrfs_fs_closing(), and the cleaner stops when it
> > sees it set, but this is racy; the cleaner might have already checked
> > the bit and could be cleaning stuff. In particular, if it deletes unused
> > block groups, it will create delayed iputs for the free space cache
> > inodes. As of "btrfs: don't run delayed_iputs in commit", we're no
> > longer running delayed iputs after a commit. Therefore, if the cleaner
> > creates more delayed iputs after delayed iputs are run in
> > btrfs_commit_super(), we will leak inodes on unmount and get a busy
> > inode crash from the VFS.
> > 
> > Fix it by parking the cleaner before we actually close anything. Then,
> > any remaining delayed iputs will always be handled in
> > btrfs_commit_super(). This also ensures that the commit in close_ctree()
> > is really the last commit, so we can get rid of the commit in
> > cleaner_kthread().
> > 
> > Fixes: 30928e9baac2 ("btrfs: don't run delayed_iputs in commit")
> > Signed-off-by: Omar Sandoval 
> 
> I'll queue this patch for rc2 as it fixes crashes I see during testing.
> My version does more changes and would be more suitable for a series,
> that could actually document the shutdown sequence and add a few
> assertions on top.

Thanks, Dave! I'll keep an eye out for the further cleanups.

> Reviewed-by: David Sterba 
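
The shutdown race described in the commit message can be sketched outside the kernel. A minimal Python analogy, with `threading.Event` standing in for the `btrfs_fs_closing()` bit and `Thread.join()` standing in for `kthread_park()` (all names here are hypothetical):

```python
import threading

closing = threading.Event()   # analogue of the fs closing bit
delayed_iputs = []            # work the cleaner may still queue

def cleaner():
    while not closing.is_set():
        # The cleaner can pass this check just before close sets the
        # bit, then still queue work below -- that is the race.
        delayed_iputs.append(object())

def close_fs(cleaner_thread):
    closing.set()
    # Parking (here: joining) the cleaner BEFORE tearing anything down
    # guarantees no more delayed iputs appear after this point...
    cleaner_thread.join()
    delayed_iputs.clear()     # ...so this final flush really is final

t = threading.Thread(target=cleaner)
t.start()
close_fs(t)
assert not t.is_alive() and not delayed_iputs
```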


[PATCH 5/7] btrfs: test swap files on multiple devices

2018-11-02 Thread Omar Sandoval
From: Omar Sandoval 

Btrfs swap files must reside on exactly one device in exactly one place,
so activation should fail for RAID1, DUP, and single data spread across
multiple devices, and succeed when the file sits entirely on one device.

Signed-off-by: Omar Sandoval 
---
 tests/btrfs/175 | 73 +
 tests/btrfs/175.out |  8 +
 tests/btrfs/group   |  1 +
 3 files changed, 82 insertions(+)
 create mode 100755 tests/btrfs/175
 create mode 100644 tests/btrfs/175.out

diff --git a/tests/btrfs/175 b/tests/btrfs/175
new file mode 100755
index ..64afc4f0
--- /dev/null
+++ b/tests/btrfs/175
@@ -0,0 +1,73 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0
+# Copyright (c) 2018 Facebook.  All Rights Reserved.
+#
+# FS QA Test 175
+#
+# Test swap file activation on multiple devices.
+#
+seq=`basename $0`
+seqres=$RESULT_DIR/$seq
+echo "QA output created by $seq"
+
+here=`pwd`
+tmp=/tmp/$$
+status=1   # failure is the default!
+trap "_cleanup; exit \$status" 0 1 2 3 15
+
+_cleanup()
+{
+   cd /
+   rm -f $tmp.*
+}
+
+. ./common/rc
+. ./common/filter
+
+rm -f $seqres.full
+
+_supported_fs btrfs
+_supported_os Linux
+_require_scratch_dev_pool 2
+_require_scratch_swapfile
+
+cycle_swapfile() {
+   local sz=${1:-$(($(get_page_size) * 10))}
+   _format_swapfile "$SCRATCH_MNT/swap" "$sz"
+   swapon "$SCRATCH_MNT/swap" 2>&1 | _filter_scratch
+   swapoff "$SCRATCH_MNT/swap" > /dev/null 2>&1
+}
+
+echo "RAID 1"
+_scratch_pool_mkfs -d raid1 -m raid1 >> $seqres.full 2>&1
+_scratch_mount
+cycle_swapfile
+_scratch_unmount
+
+echo "DUP"
+_scratch_pool_mkfs -d dup -m dup >> $seqres.full 2>&1
+_scratch_mount
+cycle_swapfile
+_scratch_unmount
+
+echo "Single on multiple devices"
+_scratch_pool_mkfs -d single -m raid1 -b $((1024 * 1024 * 1024)) >> $seqres.full 2>&1
+_scratch_mount
+# Each device is only 1 GB, so 1.5 GB must be split across multiple devices.
+cycle_swapfile $((3 * 1024 * 1024 * 1024 / 2))
+_scratch_unmount
+
+echo "Single on one device"
+_scratch_mkfs >> $seqres.full 2>&1
+_scratch_mount
+# Create the swap file, then add the device. That way we know it's all on one
+# device.
+_format_swapfile "$SCRATCH_MNT/swap" $(($(get_page_size) * 10))
+scratch_dev2="$(echo "${SCRATCH_DEV_POOL}" | awk '{ print $2 }')"
+$BTRFS_UTIL_PROG device add -f "$scratch_dev2" "$SCRATCH_MNT"
+swapon "$SCRATCH_MNT/swap" 2>&1 | _filter_scratch
+swapoff "$SCRATCH_MNT/swap" > /dev/null 2>&1
+_scratch_unmount
+
+status=0
+exit
diff --git a/tests/btrfs/175.out b/tests/btrfs/175.out
new file mode 100644
index ..ce2e5992
--- /dev/null
+++ b/tests/btrfs/175.out
@@ -0,0 +1,8 @@
+QA output created by 175
+RAID 1
+swapon: SCRATCH_MNT/swap: swapon failed: Invalid argument
+DUP
+swapon: SCRATCH_MNT/swap: swapon failed: Invalid argument
+Single on multiple devices
+swapon: SCRATCH_MNT/swap: swapon failed: Invalid argument
+Single on one device
diff --git a/tests/btrfs/group b/tests/btrfs/group
index 2e10f7df..b6160b72 100644
--- a/tests/btrfs/group
+++ b/tests/btrfs/group
@@ -177,3 +177,4 @@
 172 auto quick punch
 173 auto quick swap
 174 auto quick swap
+175 auto quick swap
-- 
2.19.1



[PATCH 4/7] btrfs: test invalid operations on a swap file

2018-11-02 Thread Omar Sandoval
From: Omar Sandoval 

Btrfs forbids some operations which should not be done on a swap file.

Signed-off-by: Omar Sandoval 
---
 tests/btrfs/174 | 66 +
 tests/btrfs/174.out | 10 +++
 tests/btrfs/group   |  1 +
 3 files changed, 77 insertions(+)
 create mode 100755 tests/btrfs/174
 create mode 100644 tests/btrfs/174.out

diff --git a/tests/btrfs/174 b/tests/btrfs/174
new file mode 100755
index ..a26e6669
--- /dev/null
+++ b/tests/btrfs/174
@@ -0,0 +1,66 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0
+# Copyright (c) 2018 Facebook.  All Rights Reserved.
+#
+# FS QA Test 174
+#
+# Test restrictions on operations that can be done on an active swap file
+# specific to Btrfs.
+#
+seq=`basename $0`
+seqres=$RESULT_DIR/$seq
+echo "QA output created by $seq"
+
+here=`pwd`
+tmp=/tmp/$$
+status=1   # failure is the default!
+trap "_cleanup; exit \$status" 0 1 2 3 15
+
+_cleanup()
+{
+   cd /
+   rm -f $tmp.*
+}
+
+. ./common/rc
+. ./common/filter
+
+rm -f $seqres.full
+
+_supported_fs btrfs
+_supported_os Linux
+_require_scratch_swapfile
+
+_scratch_mkfs >> $seqres.full 2>&1
+_scratch_mount
+
+$BTRFS_UTIL_PROG subvol create "$SCRATCH_MNT/swapvol" >> $seqres.full
+swapfile="$SCRATCH_MNT/swapvol/swap"
+_format_swapfile "$swapfile" $(($(get_page_size) * 10))
+swapon "$swapfile"
+
+# Turning off nocow doesn't do anything because the file is not empty, not
+# because the file is a swap file, but make sure this works anyway.
+echo "Disable nocow"
+$CHATTR_PROG -C "$swapfile"
+lsattr -l "$swapfile" | _filter_scratch | _filter_spaces
+
+# Compression we reject outright.
+echo "Enable compression"
+$CHATTR_PROG +c "$swapfile" 2>&1 | grep -o "Text file busy"
+lsattr -l "$swapfile" | _filter_scratch | _filter_spaces
+
+echo "Snapshot"
+$BTRFS_UTIL_PROG subvol snap "$SCRATCH_MNT/swapvol" \
+   "$SCRATCH_MNT/swapsnap" 2>&1 | grep -o "Text file busy"
+
+echo "Defrag"
+# We pass the -c (compress) flag to force defrag even if the file isn't
+# fragmented.
+$BTRFS_UTIL_PROG filesystem defrag -c "$swapfile" 2>&1 | grep -o "Text file busy"
+
+swapoff "$swapfile"
+_scratch_unmount
+
+status=0
+exit
diff --git a/tests/btrfs/174.out b/tests/btrfs/174.out
new file mode 100644
index ..bc24f1fb
--- /dev/null
+++ b/tests/btrfs/174.out
@@ -0,0 +1,10 @@
+QA output created by 174
+Disable nocow
+SCRATCH_MNT/swapvol/swap No_COW
+Enable compression
+Text file busy
+SCRATCH_MNT/swapvol/swap No_COW
+Snapshot
+Text file busy
+Defrag
+Text file busy
diff --git a/tests/btrfs/group b/tests/btrfs/group
index 3525014f..2e10f7df 100644
--- a/tests/btrfs/group
+++ b/tests/btrfs/group
@@ -176,3 +176,4 @@
 171 auto quick qgroup
 172 auto quick punch
 173 auto quick swap
+174 auto quick swap
-- 
2.19.1



[PATCH 6/7] btrfs: test device add/remove/replace with an active swap file

2018-11-02 Thread Omar Sandoval
From: Omar Sandoval 

Make sure that we don't remove or replace a device with an active swap
file but can add, remove, and replace other devices.

Signed-off-by: Omar Sandoval 
---
 tests/btrfs/176 | 82 +
 tests/btrfs/176.out |  5 +++
 tests/btrfs/group   |  1 +
 3 files changed, 88 insertions(+)
 create mode 100755 tests/btrfs/176
 create mode 100644 tests/btrfs/176.out

diff --git a/tests/btrfs/176 b/tests/btrfs/176
new file mode 100755
index ..1e576149
--- /dev/null
+++ b/tests/btrfs/176
@@ -0,0 +1,82 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0
+# Copyright (c) 2018 Facebook.  All Rights Reserved.
+#
+# FS QA Test 176
+#
+# Test device remove/replace with an active swap file.
+#
+seq=`basename $0`
+seqres=$RESULT_DIR/$seq
+echo "QA output created by $seq"
+
+here=`pwd`
+tmp=/tmp/$$
+status=1   # failure is the default!
+trap "_cleanup; exit \$status" 0 1 2 3 15
+
+_cleanup()
+{
+   cd /
+   rm -f $tmp.*
+}
+
+# get standard environment, filters and checks
+. ./common/rc
+. ./common/filter
+
+# remove previous $seqres.full before test
+rm -f $seqres.full
+
+# real QA test starts here
+
+# Modify as appropriate.
+_supported_fs btrfs
+_supported_os Linux
+_require_scratch_dev_pool 3
+_require_scratch_swapfile
+
+# We check the filesystem manually because we move devices around.
+rm -f "${RESULT_DIR}/require_scratch"
+
+scratch_dev1="$(echo "${SCRATCH_DEV_POOL}" | awk '{ print $1 }')"
+scratch_dev2="$(echo "${SCRATCH_DEV_POOL}" | awk '{ print $2 }')"
+scratch_dev3="$(echo "${SCRATCH_DEV_POOL}" | awk '{ print $3 }')"
+
+echo "Remove device"
+_scratch_mkfs >> $seqres.full 2>&1
+_scratch_mount
+_format_swapfile "$SCRATCH_MNT/swap" $(($(get_page_size) * 10))
+$BTRFS_UTIL_PROG device add -f "$scratch_dev2" "$SCRATCH_MNT"
+swapon "$SCRATCH_MNT/swap" 2>&1 | _filter_scratch
+# We know the swap file is on device 1 because we added device 2 after it was
+# already created.
+$BTRFS_UTIL_PROG device delete "$scratch_dev1" "$SCRATCH_MNT" 2>&1 | grep -o "Text file busy"
+# Deleting/readding device 2 should still work.
+$BTRFS_UTIL_PROG device delete "$scratch_dev2" "$SCRATCH_MNT"
+$BTRFS_UTIL_PROG device add -f "$scratch_dev2" "$SCRATCH_MNT"
+swapoff "$SCRATCH_MNT/swap" > /dev/null 2>&1
+# Deleting device 1 should work again after swapoff.
+$BTRFS_UTIL_PROG device delete "$scratch_dev1" "$SCRATCH_MNT"
+_scratch_unmount
+_check_scratch_fs "$scratch_dev2"
+
+echo "Replace device"
+_scratch_mkfs >> $seqres.full 2>&1
+_scratch_mount
+_format_swapfile "$SCRATCH_MNT/swap" $(($(get_page_size) * 10))
+$BTRFS_UTIL_PROG device add -f "$scratch_dev2" "$SCRATCH_MNT"
+swapon "$SCRATCH_MNT/swap" 2>&1 | _filter_scratch
+# Again, we know the swap file is on device 1.
+$BTRFS_UTIL_PROG replace start -fB "$scratch_dev1" "$scratch_dev3" "$SCRATCH_MNT" 2>&1 | grep -o "Text file busy"
+# Replacing device 2 should still work.
+$BTRFS_UTIL_PROG replace start -fB "$scratch_dev2" "$scratch_dev3" "$SCRATCH_MNT"
+swapoff "$SCRATCH_MNT/swap" > /dev/null 2>&1
+# Replacing device 1 should work again after swapoff.
+$BTRFS_UTIL_PROG replace start -fB "$scratch_dev1" "$scratch_dev2" "$SCRATCH_MNT"
+_scratch_unmount
+_check_scratch_fs "$scratch_dev2"
+
+# success, all done
+status=0
+exit
diff --git a/tests/btrfs/176.out b/tests/btrfs/176.out
new file mode 100644
index ..5c99e0fd
--- /dev/null
+++ b/tests/btrfs/176.out
@@ -0,0 +1,5 @@
+QA output created by 176
+Remove device
+Text file busy
+Replace device
+Text file busy
diff --git a/tests/btrfs/group b/tests/btrfs/group
index b6160b72..3562420b 100644
--- a/tests/btrfs/group
+++ b/tests/btrfs/group
@@ -178,3 +178,4 @@
 173 auto quick swap
 174 auto quick swap
 175 auto quick swap
+176 auto quick swap
-- 
2.19.1



[PATCH 0/7] fstests: test Btrfs swapfile support

2018-11-02 Thread Omar Sandoval
From: Omar Sandoval 

This series fixes a couple of generic swapfile tests and adds some
Btrfs-specific swapfile tests. Btrfs swapfile support is scheduled for
4.21 [1].

1: https://www.spinics.net/lists/linux-btrfs/msg83454.html

Thanks!

Omar Sandoval (7):
  generic/{472,496,497}: fix $seeqres typo
  generic/{472,496}: fix swap file creation on Btrfs
  btrfs: test swap file activation restrictions
  btrfs: test invalid operations on a swap file
  btrfs: test swap files on multiple devices
  btrfs: test device add/remove/replace with an active swap file
  btrfs: test balance and resize with an active swap file

 tests/btrfs/173 | 55 ++
 tests/btrfs/173.out |  5 +++
 tests/btrfs/174 | 66 
 tests/btrfs/174.out | 10 ++
 tests/btrfs/175 | 73 
 tests/btrfs/175.out |  8 +
 tests/btrfs/176 | 82 +
 tests/btrfs/176.out |  5 +++
 tests/btrfs/177 | 64 +++
 tests/btrfs/177.out |  6 
 tests/btrfs/group   |  5 +++
 tests/generic/472   | 16 -
 tests/generic/496   |  8 ++---
 tests/generic/497   |  2 +-
 14 files changed, 391 insertions(+), 14 deletions(-)
 create mode 100755 tests/btrfs/173
 create mode 100644 tests/btrfs/173.out
 create mode 100755 tests/btrfs/174
 create mode 100644 tests/btrfs/174.out
 create mode 100755 tests/btrfs/175
 create mode 100644 tests/btrfs/175.out
 create mode 100755 tests/btrfs/176
 create mode 100644 tests/btrfs/176.out
 create mode 100755 tests/btrfs/177
 create mode 100644 tests/btrfs/177.out

-- 
2.19.1



[PATCH 7/7] btrfs: test balance and resize with an active swap file

2018-11-02 Thread Omar Sandoval
From: Omar Sandoval 

Make sure we don't shrink the device past an active swap file, but allow
shrinking otherwise, as well as growing and balance.

Signed-off-by: Omar Sandoval 
---
 tests/btrfs/177 | 64 +
 tests/btrfs/177.out |  6 +
 tests/btrfs/group   |  1 +
 3 files changed, 71 insertions(+)
 create mode 100755 tests/btrfs/177
 create mode 100644 tests/btrfs/177.out

diff --git a/tests/btrfs/177 b/tests/btrfs/177
new file mode 100755
index ..12dad8fc
--- /dev/null
+++ b/tests/btrfs/177
@@ -0,0 +1,64 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0
+# Copyright (c) 2018 Facebook.  All Rights Reserved.
+#
+# FS QA Test 177
+#
+# Test relocation (balance and resize) with an active swap file.
+#
+seq=`basename $0`
+seqres=$RESULT_DIR/$seq
+echo "QA output created by $seq"
+
+here=`pwd`
+tmp=/tmp/$$
+status=1   # failure is the default!
+trap "_cleanup; exit \$status" 0 1 2 3 15
+
+_cleanup()
+{
+   cd /
+   rm -f $tmp.*
+}
+
+. ./common/rc
+. ./common/filter
+. ./common/btrfs
+
+rm -f $seqres.full
+
+# Modify as appropriate.
+_supported_fs btrfs
+_supported_os Linux
+_require_scratch_swapfile
+
+swapfile="$SCRATCH_MNT/swap"
+
+# First, create a 1GB filesystem and fill it up.
+_scratch_mkfs_sized $((1024 * 1024 * 1024)) >> $seqres.full 2>&1
+_scratch_mount
+dd if=/dev/zero of="$SCRATCH_MNT/fill" bs=1024k >> $seqres.full 2>&1
+# Now add more space and create a swap file. We know that the first 1GB of the
+# filesystem was used, so the swap file must be in the new part of the
+# filesystem.
+$BTRFS_UTIL_PROG filesystem resize 2G "$SCRATCH_MNT" | _filter_scratch
+_format_swapfile "$swapfile" $((32 * 1024 * 1024))
+swapon "$swapfile"
+# Add even more space which we know is unused.
+$BTRFS_UTIL_PROG filesystem resize 3G "$SCRATCH_MNT" | _filter_scratch
+# Free up the first 1GB of the filesystem.
+rm -f "$SCRATCH_MNT/fill"
+# Get rid of empty block groups and also make sure that balance skips block
+# groups containing active swap files.
+_run_btrfs_balance_start "$SCRATCH_MNT"
+# Shrink away the unused space.
+$BTRFS_UTIL_PROG filesystem resize 2G "$SCRATCH_MNT" | _filter_scratch
+# Try to shrink away the area occupied by the swap file, which should fail.
+$BTRFS_UTIL_PROG filesystem resize 1G "$SCRATCH_MNT" 2>&1 | grep -o "Text file busy"
+swapoff "$swapfile"
+# It should work again after swapoff.
+$BTRFS_UTIL_PROG filesystem resize 1G "$SCRATCH_MNT" | _filter_scratch
+_scratch_unmount
+
+status=0
+exit
diff --git a/tests/btrfs/177.out b/tests/btrfs/177.out
new file mode 100644
index ..6ced01da
--- /dev/null
+++ b/tests/btrfs/177.out
@@ -0,0 +1,6 @@
+QA output created by 177
+Resize 'SCRATCH_MNT' of '2G'
+Resize 'SCRATCH_MNT' of '3G'
+Resize 'SCRATCH_MNT' of '2G'
+Text file busy
+Resize 'SCRATCH_MNT' of '1G'
diff --git a/tests/btrfs/group b/tests/btrfs/group
index 3562420b..0b62e58a 100644
--- a/tests/btrfs/group
+++ b/tests/btrfs/group
@@ -179,3 +179,4 @@
 174 auto quick swap
 175 auto quick swap
 176 auto quick swap
+177 auto quick swap
-- 
2.19.1



[PATCH 1/7] generic/{472,496,497}: fix $seeqres typo

2018-11-02 Thread Omar Sandoval
From: Omar Sandoval 

Signed-off-by: Omar Sandoval 
---
 tests/generic/472 | 2 +-
 tests/generic/496 | 2 +-
 tests/generic/497 | 2 +-
 3 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/tests/generic/472 b/tests/generic/472
index c74d6c70..04ed3e73 100755
--- a/tests/generic/472
+++ b/tests/generic/472
@@ -51,7 +51,7 @@ swapfile_cycle() {
$CHATTR_PROG +C $swapfile >> $seqres.full 2>&1
"$here/src/mkswap" $swapfile >> $seqres.full
"$here/src/swapon" $swapfile 2>&1 | _filter_scratch
-   swapoff $swapfile 2>> $seeqres.full
+   swapoff $swapfile 2>> $seqres.full
rm -f $swapfile
 }
 
diff --git a/tests/generic/496 b/tests/generic/496
index 1c9651ad..968b8012 100755
--- a/tests/generic/496
+++ b/tests/generic/496
@@ -53,7 +53,7 @@ swapfile_cycle() {
$CHATTR_PROG +C $swapfile >> $seqres.full 2>&1
"$here/src/mkswap" $swapfile >> $seqres.full
"$here/src/swapon" $swapfile 2>&1 | _filter_scratch
-   swapoff $swapfile 2>> $seeqres.full
+   swapoff $swapfile 2>> $seqres.full
rm -f $swapfile
 }
 
diff --git a/tests/generic/497 b/tests/generic/497
index 584af58a..3d5502ef 100755
--- a/tests/generic/497
+++ b/tests/generic/497
@@ -53,7 +53,7 @@ swapfile_cycle() {
$CHATTR_PROG +C $swapfile >> $seqres.full 2>&1
"$here/src/mkswap" $swapfile >> $seqres.full
"$here/src/swapon" $swapfile 2>&1 | _filter_scratch
-   swapoff $swapfile 2>> $seeqres.full
+   swapoff $swapfile 2>> $seqres.full
rm -f $swapfile
 }
 
-- 
2.19.1



[PATCH 2/7] generic/{472,496}: fix swap file creation on Btrfs

2018-11-02 Thread Omar Sandoval
From: Omar Sandoval 

The swap file must be set nocow before it is written to, otherwise it is
ignored and Btrfs refuses to activate it as swap.

Fixes: 25ce9740065e ("generic: test swapfile creation, activation, and deactivation")
Signed-off-by: Omar Sandoval 
---
 tests/generic/472 | 14 ++
 tests/generic/496 |  6 +++---
 2 files changed, 9 insertions(+), 11 deletions(-)

diff --git a/tests/generic/472 b/tests/generic/472
index 04ed3e73..aba4a007 100755
--- a/tests/generic/472
+++ b/tests/generic/472
@@ -42,13 +42,15 @@ _scratch_mount >>$seqres.full 2>&1
 
 swapfile=$SCRATCH_MNT/swap
 len=$((2 * 1048576))
-page_size=$(get_page_size)
 
 swapfile_cycle() {
local swapfile="$1"
+   local len="$2"
 
+   touch $swapfile
# Swap files must be nocow on Btrfs.
$CHATTR_PROG +C $swapfile >> $seqres.full 2>&1
+   _pwrite_byte 0x58 0 $len $swapfile >> $seqres.full
"$here/src/mkswap" $swapfile >> $seqres.full
"$here/src/swapon" $swapfile 2>&1 | _filter_scratch
swapoff $swapfile 2>> $seqres.full
@@ -57,20 +59,16 @@ swapfile_cycle() {
 
 # Create a regular swap file
 echo "regular swap" | tee -a $seqres.full
-_pwrite_byte 0x58 0 $len $swapfile >> $seqres.full
-swapfile_cycle $swapfile
+swapfile_cycle $swapfile $len
 
 # Create a swap file with a little too much junk on the end
 echo "too long swap" | tee -a $seqres.full
-_pwrite_byte 0x58 0 $((len + 3)) $swapfile >> $seqres.full
-swapfile_cycle $swapfile
+swapfile_cycle $swapfile $((len + 3))
 
 # Create a ridiculously small swap file.  Each swap file must have at least
 # two pages after the header page.
 echo "tiny swap" | tee -a $seqres.full
-tiny_len=$((page_size * 3))
-_pwrite_byte 0x58 0 $tiny_len $swapfile >> $seqres.full
-swapfile_cycle $swapfile
+swapfile_cycle $swapfile $(($(get_page_size) * 3))
 
 status=0
 exit
diff --git a/tests/generic/496 b/tests/generic/496
index 968b8012..3083eef0 100755
--- a/tests/generic/496
+++ b/tests/generic/496
@@ -49,8 +49,6 @@ page_size=$(get_page_size)
 swapfile_cycle() {
local swapfile="$1"
 
-   # Swap files must be nocow on Btrfs.
-   $CHATTR_PROG +C $swapfile >> $seqres.full 2>&1
"$here/src/mkswap" $swapfile >> $seqres.full
"$here/src/swapon" $swapfile 2>&1 | _filter_scratch
swapoff $swapfile 2>> $seqres.full
@@ -59,8 +57,10 @@ swapfile_cycle() {
 
 # Create a fallocated swap file
 echo "fallocate swap" | tee -a $seqres.full
-$XFS_IO_PROG -f -c "falloc 0 $len" $swapfile >> $seqres.full
+touch $swapfile
+# Swap files must be nocow on Btrfs.
 $CHATTR_PROG +C $swapfile >> $seqres.full 2>&1
+$XFS_IO_PROG -f -c "falloc 0 $len" $swapfile >> $seqres.full
 "$here/src/mkswap" $swapfile
 "$here/src/swapon" $swapfile >> $seqres.full 2>&1 || \
_notrun "fallocated swap not supported here"
-- 
2.19.1



[PATCH 3/7] btrfs: test swap file activation restrictions

2018-11-02 Thread Omar Sandoval
From: Omar Sandoval 

Swap files on Btrfs have some restrictions not applicable to other
filesystems.

Signed-off-by: Omar Sandoval 
---
 tests/btrfs/173 | 55 +
 tests/btrfs/173.out |  5 +
 tests/btrfs/group   |  1 +
 3 files changed, 61 insertions(+)
 create mode 100755 tests/btrfs/173
 create mode 100644 tests/btrfs/173.out

diff --git a/tests/btrfs/173 b/tests/btrfs/173
new file mode 100755
index ..665bec39
--- /dev/null
+++ b/tests/btrfs/173
@@ -0,0 +1,55 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0
+# Copyright (c) 2018 Facebook.  All Rights Reserved.
+#
+# FS QA Test 173
+#
+# Test swap file activation restrictions specific to Btrfs.
+#
+seq=`basename $0`
+seqres=$RESULT_DIR/$seq
+echo "QA output created by $seq"
+
+here=`pwd`
+tmp=/tmp/$$
+status=1   # failure is the default!
+trap "_cleanup; exit \$status" 0 1 2 3 15
+
+_cleanup()
+{
+   cd /
+   rm -f $tmp.*
+}
+
+. ./common/rc
+. ./common/filter
+
+rm -f $seqres.full
+
+_supported_fs btrfs
+_supported_os Linux
+_require_scratch_swapfile
+
+_scratch_mkfs >> $seqres.full 2>&1
+_scratch_mount
+
+echo "COW file"
+# We can't use _format_swapfile because we don't want chattr +C, and we can't
+# unset it after the swap file has been created.
+rm -f "$SCRATCH_MNT/swap"
+touch "$SCRATCH_MNT/swap"
+chmod 0600 "$SCRATCH_MNT/swap"
+_pwrite_byte 0x61 0 $(($(get_page_size) * 10)) "$SCRATCH_MNT/swap" >> $seqres.full
+mkswap "$SCRATCH_MNT/swap" >> $seqres.full
+swapon "$SCRATCH_MNT/swap" 2>&1 | _filter_scratch
+swapoff "$SCRATCH_MNT/swap" >/dev/null 2>&1
+
+echo "Compressed file"
+rm -f "$SCRATCH_MNT/swap"
+_format_swapfile "$SCRATCH_MNT/swap" $(($(get_page_size) * 10))
+$CHATTR_PROG +c "$SCRATCH_MNT/swap"
+swapon "$SCRATCH_MNT/swap" 2>&1 | _filter_scratch
+swapoff "$SCRATCH_MNT/swap" >/dev/null 2>&1
+
+status=0
+exit
diff --git a/tests/btrfs/173.out b/tests/btrfs/173.out
new file mode 100644
index ..6d7856bf
--- /dev/null
+++ b/tests/btrfs/173.out
@@ -0,0 +1,5 @@
+QA output created by 173
+COW file
+swapon: SCRATCH_MNT/swap: swapon failed: Invalid argument
+Compressed file
+swapon: SCRATCH_MNT/swap: swapon failed: Invalid argument
diff --git a/tests/btrfs/group b/tests/btrfs/group
index a490d7eb..3525014f 100644
--- a/tests/btrfs/group
+++ b/tests/btrfs/group
@@ -175,3 +175,4 @@
 170 auto quick snapshot
 171 auto quick qgroup
 172 auto quick punch
+173 auto quick swap
-- 
2.19.1



Re: [PATCH v2] Btrfs: fix missing delayed iputs on unmount

2018-11-01 Thread Omar Sandoval
On Thu, Nov 01, 2018 at 04:29:48PM +0100, David Sterba wrote:
> On Thu, Nov 01, 2018 at 08:24:25AM -0700, Omar Sandoval wrote:
> > On Thu, Nov 01, 2018 at 04:22:29PM +0100, David Sterba wrote:
> > > On Thu, Nov 01, 2018 at 04:08:32PM +0100, David Sterba wrote:
> > > > On Thu, Nov 01, 2018 at 01:31:18PM +, Chris Mason wrote:
> > > > > On 1 Nov 2018, at 6:15, David Sterba wrote:
> > > > > 
> > > > > > On Wed, Oct 31, 2018 at 10:06:08AM -0700, Omar Sandoval wrote:
> > > > > >> From: Omar Sandoval 
> > > > > >>
> > > > > >> There's a race between close_ctree() and cleaner_kthread().
> > > > > >> close_ctree() sets btrfs_fs_closing(), and the cleaner stops when 
> > > > > >> it
> > > > > >> sees it set, but this is racy; the cleaner might have already 
> > > > > >> checked
> > > > > >> the bit and could be cleaning stuff. In particular, if it deletes 
> > > > > >> unused
> > > > > >> block groups, it will create delayed iputs for the free space cache
> > > > > >> inodes. As of "btrfs: don't run delayed_iputs in commit", we're no
> > > > > >> longer running delayed iputs after a commit. Therefore, if the 
> > > > > >> cleaner
> > > > > >> creates more delayed iputs after delayed iputs are run in
> > > > > >> btrfs_commit_super(), we will leak inodes on unmount and get a busy
> > > > > >> inode crash from the VFS.
> > > > > >>
> > > > > >> Fix it by parking the cleaner
> > > > > >
> > > > > > Ouch, that's IMO wrong way to fix it. The bug is on a higher level,
> > > > > > we're missing a commit or clean up data structures. Messing with 
> > > > > > state
> > > > > > of a thread would be the last thing I'd try after proving that it's 
> > > > > > not
> > > > > > possible to fix in the logic of btrfs itself.
> > > > > >
> > > > > > The shutdown sequence in close_ctree is quite tricky and we've had 
> > > > > > bugs
> > > > > > there. The interdependencies of thread and data structures and other
> > > > > > subsystems cannot have loops that could not find an ordering that 
> > > > > > will
> > > > > > not leak something.
> > > > > >
> > > > > > It's not a big problem if some step is done more than once, like
> > > > > > committing or cleaning up some other structures if we know that
> > > > > > it could create new.
> > > > > 
> > > > > The problem is the cleaner thread needs to be told to stop doing new 
> > > > > work, and we need to wait for the work it's already doing to be 
> > > > > finished.  We're getting "stop doing new work" already because the 
> > > > > cleaner thread checks to see if the FS is closing, but we don't have 
> > > > > a 
> > > > > way today to wait for him to finish what he's already doing.
> > > > > 
> > > > > kthread_park() is basically the same as adding another mutex or 
> > > > > synchronization point.  I'm not sure how we could change close_ctree() 
> > > > > or 
> > > > > the final commit to pick this up more effectively?
> > > > 
> > > > The idea is:
> > > > 
> > > > cleaner                        close_ctree thread
> > > > 
> > > >                                tell cleaner to stop
> > > >                                wait
> > > > start work
> > > > if should_stop, then exit
> > > >                                cleaner is stopped
> > > > 
> > > > [does not run: finish work]
> > > > [does not run: loop]
> > > >                                pick up the work or make
> > > >                                sure there's nothing in
> > > >                                progress anymore
> > > > 
> > > > 
> > > > A simplified version in code:
> > > > 
> > > >   set_bit(BTRFS_FS_CLOSING_START, &fs_info->flags);
> > > > 
> > > >   wait for defrag - could be started from cleaner but next iteration will
> > > >   see the fs closed and will not continue

Re: [PATCH v2] Btrfs: fix missing delayed iputs on unmount

2018-11-01 Thread Omar Sandoval
On Thu, Nov 01, 2018 at 08:24:25AM -0700, Omar Sandoval wrote:
> On Thu, Nov 01, 2018 at 04:22:29PM +0100, David Sterba wrote:
> > On Thu, Nov 01, 2018 at 04:08:32PM +0100, David Sterba wrote:
> > > On Thu, Nov 01, 2018 at 01:31:18PM +, Chris Mason wrote:
> > > > On 1 Nov 2018, at 6:15, David Sterba wrote:
> > > > 
> > > > > On Wed, Oct 31, 2018 at 10:06:08AM -0700, Omar Sandoval wrote:
> > > > >> From: Omar Sandoval 
> > > > >>
> > > > >> There's a race between close_ctree() and cleaner_kthread().
> > > > >> close_ctree() sets btrfs_fs_closing(), and the cleaner stops when it
> > > > >> sees it set, but this is racy; the cleaner might have already checked
> > > > >> the bit and could be cleaning stuff. In particular, if it deletes 
> > > > >> unused
> > > > >> block groups, it will create delayed iputs for the free space cache
> > > > >> inodes. As of "btrfs: don't run delayed_iputs in commit", we're no
> > > > >> longer running delayed iputs after a commit. Therefore, if the 
> > > > >> cleaner
> > > > >> creates more delayed iputs after delayed iputs are run in
> > > > >> btrfs_commit_super(), we will leak inodes on unmount and get a busy
> > > > >> inode crash from the VFS.
> > > > >>
> > > > >> Fix it by parking the cleaner
> > > > >
> > > > > Ouch, that's IMO wrong way to fix it. The bug is on a higher level,
> > > > > we're missing a commit or clean up data structures. Messing with state
> > > > > of a thread would be the last thing I'd try after proving that it's 
> > > > > not
> > > > > possible to fix in the logic of btrfs itself.
> > > > >
> > > > > The shutdown sequence in close_ctree is quite tricky and we've had bugs
> > > > > there. The interdependencies of thread and data structures and other
> > > > > subsystems cannot have loops that could not find an ordering that will
> > > > > not leak something.
> > > > >
> > > > > It's not a big problem if some step is done more than once, like
> > > > > committing or cleaning up some other structures if we know that
> > > > > it could create new.
> > > > 
> > > > The problem is the cleaner thread needs to be told to stop doing new 
> > > > work, and we need to wait for the work it's already doing to be 
> > > > finished.  We're getting "stop doing new work" already because the 
> > > > cleaner thread checks to see if the FS is closing, but we don't have a 
> > > > way today to wait for him to finish what he's already doing.
> > > > 
> > > > kthread_park() is basically the same as adding another mutex or 
> > > > synchronization point.  I'm not sure how we could change close_ctree() 
> > > > or 
> > > > the final commit to pick this up more effectively?
> > > 
> > > The idea is:
> > > 
> > > cleaner                        close_ctree thread
> > > 
> > >                                tell cleaner to stop
> > >                                wait
> > > start work
> > > if should_stop, then exit
> > >                                cleaner is stopped
> > > 
> > > [does not run: finish work]
> > > [does not run: loop]
> > >                                pick up the work or make
> > >                                sure there's nothing in
> > >                                progress anymore
> > > 
> > > 
> > > A simplified version in code:
> > > 
> > >   set_bit(BTRFS_FS_CLOSING_START, &fs_info->flags);
> > > 
> > >   wait for defrag - could be started from cleaner but next iteration will
> > >   see the fs closed and will not continue
> > > 
> > >   kthread_stop(transaction_kthread)
> > > 
> > >   kthread_stop(cleaner_kthread)
> > > 
> > >   /* next, everything that could be left from cleaner should be finished 
> > > */
> > > 
> > >   btrfs_delete_unused_bgs();
> > >   assert there are no defrags
> > >   assert there are no delayed iputs
> > >   commit if necessary
> > > 
> > > IOW the unused block groups are removed from close_ctree too early,
> > > moving that after the threads stop AND making sure that it does not need
> > > either of them should work.
> > > 
> > > The "AND" above is not currently implemented as btrfs_delete_unused_bgs
> > > calls plain btrfs_end_transaction that wakes up transaction kthread, so
> > > there would need to be an argument passed to tell it to do full commit.
> > 
> > Not perfect, relies on the fact that wake_up_process(thread) on a stopped
> > thread is a no-op,
> 
> How is that? kthread_stop() frees the task struct, so wake_up_process()
> would be a use-after-free.

(Indirectly, through the kthread calling do_exit() -> do_task_dead() ->
being put in the scheduler)


Re: [PATCH v2] Btrfs: fix missing delayed iputs on unmount

2018-11-01 Thread Omar Sandoval
On Thu, Nov 01, 2018 at 04:22:29PM +0100, David Sterba wrote:
> On Thu, Nov 01, 2018 at 04:08:32PM +0100, David Sterba wrote:
> > On Thu, Nov 01, 2018 at 01:31:18PM +, Chris Mason wrote:
> > > On 1 Nov 2018, at 6:15, David Sterba wrote:
> > > 
> > > > On Wed, Oct 31, 2018 at 10:06:08AM -0700, Omar Sandoval wrote:
> > > >> From: Omar Sandoval 
> > > >>
> > > >> There's a race between close_ctree() and cleaner_kthread().
> > > >> close_ctree() sets btrfs_fs_closing(), and the cleaner stops when it
> > > >> sees it set, but this is racy; the cleaner might have already checked
> > > >> the bit and could be cleaning stuff. In particular, if it deletes 
> > > >> unused
> > > >> block groups, it will create delayed iputs for the free space cache
> > > >> inodes. As of "btrfs: don't run delayed_iputs in commit", we're no
> > > >> longer running delayed iputs after a commit. Therefore, if the 
> > > >> cleaner
> > > >> creates more delayed iputs after delayed iputs are run in
> > > >> btrfs_commit_super(), we will leak inodes on unmount and get a busy
> > > >> inode crash from the VFS.
> > > >>
> > > >> Fix it by parking the cleaner
> > > >
> > > > Ouch, that's IMO wrong way to fix it. The bug is on a higher level,
> > > > we're missing a commit or clean up data structures. Messing with state
> > > > of a thread would be the last thing I'd try after proving that it's 
> > > > not
> > > > possible to fix in the logic of btrfs itself.
> > > >
> > > > The shutdown sequence in close_ctree is quite tricky and we've had bugs
> > > > there. The interdependencies of thread and data structures and other
> > > > subsystems cannot have loops that could not find an ordering that will
> > > > not leak something.
> > > >
> > > > It's not a big problem if some step is done more than once, like
> > > > committing or cleaning up some other structures if we know that
> > > > it could create new.
> > > 
> > > The problem is the cleaner thread needs to be told to stop doing new 
> > > work, and we need to wait for the work it's already doing to be 
> > > finished.  We're getting "stop doing new work" already because the 
> > > cleaner thread checks to see if the FS is closing, but we don't have a 
> > > way today to wait for him to finish what he's already doing.
> > > 
> > > kthread_park() is basically the same as adding another mutex or 
> > > synchronization point.  I'm not sure how we could change close_ctree() or 
> > > the final commit to pick this up more effectively?
> > 
> > The idea is:
> > 
> > cleaner                        close_ctree thread
> > 
> >                                tell cleaner to stop
> >                                wait
> > start work
> > if should_stop, then exit
> >                                cleaner is stopped
> > 
> > [does not run: finish work]
> > [does not run: loop]
> >                                pick up the work or make
> >                                sure there's nothing in
> >                                progress anymore
> > 
> > 
> > A simplified version in code:
> > 
> >   set_bit(BTRFS_FS_CLOSING_START, &fs_info->flags);
> > 
> >   wait for defrag - could be started from cleaner but next iteration will
> > see the fs closed and will not continue
> > 
> >   kthread_stop(transaction_kthread)
> > 
> >   kthread_stop(cleaner_kthread)
> > 
> >   /* next, everything that could be left from cleaner should be finished */
> > 
> >   btrfs_delete_unused_bgs();
> >   assert there are no defrags
> >   assert there are no delayed iputs
> >   commit if necessary
> > 
> > IOW the unused block groups are removed from close_ctree too early,
> > moving that after the threads stop AND making sure that it does not need
> > either of them should work.
> > 
> > The "AND" above is not currently implemented as btrfs_delete_unused_bgs
> > calls plain btrfs_end_transaction that wakes up transaction kthread, so
> > there would need to be an argument passed to tell it to do full commit.
> 
> Not perfect, relies on the fact that wake_up_process(thread) on a stopped
> thread is a no-op,

How is that? kthread_stop() frees the task struct, so wake_up_process()
would be a use-after-free.


Re: [PATCH v2] Btrfs: fix missing delayed iputs on unmount

2018-11-01 Thread Omar Sandoval
On Thu, Nov 01, 2018 at 04:08:32PM +0100, David Sterba wrote:
> On Thu, Nov 01, 2018 at 01:31:18PM +, Chris Mason wrote:
> > On 1 Nov 2018, at 6:15, David Sterba wrote:
> > 
> > > On Wed, Oct 31, 2018 at 10:06:08AM -0700, Omar Sandoval wrote:
> > >> From: Omar Sandoval 
> > >>
> > >> There's a race between close_ctree() and cleaner_kthread().
> > >> close_ctree() sets btrfs_fs_closing(), and the cleaner stops when it
> > >> sees it set, but this is racy; the cleaner might have already checked
> > >> the bit and could be cleaning stuff. In particular, if it deletes 
> > >> unused
> > >> block groups, it will create delayed iputs for the free space cache
> > >> inodes. As of "btrfs: don't run delayed_iputs in commit", we're no
> > >> longer running delayed iputs after a commit. Therefore, if the 
> > >> cleaner
> > >> creates more delayed iputs after delayed iputs are run in
> > >> btrfs_commit_super(), we will leak inodes on unmount and get a busy
> > >> inode crash from the VFS.
> > >>
> > >> Fix it by parking the cleaner
> > >
> > > Ouch, that's IMO wrong way to fix it. The bug is on a higher level,
> > > we're missing a commit or clean up data structures. Messing with state
> > > of a thread would be the last thing I'd try after proving that it's 
> > > not
> > > possible to fix in the logic of btrfs itself.
> > >
> > > The shutdown sequence in close_ctree is quite tricky and we've had bugs
> > > there. The interdependencies of thread and data structures and other
> > > subsystems cannot have loops that could not find an ordering that will
> > > not leak something.
> > >
> > > It's not a big problem if some step is done more than once, like
> > > committing or cleaning up some other structures if we know that
> > > it could create new.
> > 
> > The problem is the cleaner thread needs to be told to stop doing new 
> > work, and we need to wait for the work it's already doing to be 
> > finished.  We're getting "stop doing new work" already because the 
> > cleaner thread checks to see if the FS is closing, but we don't have a 
> > way today to wait for him to finish what he's already doing.
> > 
> > kthread_park() is basically the same as adding another mutex or 
> > synchronization point.  I'm not sure how we could change close_ctree() or 
> > the final commit to pick this up more effectively?
> 
> The idea is:
> 
> cleaner                        close_ctree thread
> 
>                                tell cleaner to stop
>                                wait
> start work
> if should_stop, then exit
>                                cleaner is stopped
> 
> [does not run: finish work]
> [does not run: loop]
>                                pick up the work or make
>                                sure there's nothing in
>                                progress anymore
> 
> 
> A simplified version in code:
> 
>   set_bit(BTRFS_FS_CLOSING_START, &fs_info->flags);
> 
>   wait for defrag - could be started from cleaner but next iteration will
>   see the fs closed and will not continue
> 
>   kthread_stop(transaction_kthread)
> 
>   kthread_stop(cleaner_kthread)
> 
>   /* next, everything that could be left from cleaner should be finished */
> 
>   btrfs_delete_unused_bgs();
>   assert there are no defrags
>   assert there are no delayed iputs
>   commit if necessary
> 
> IOW the unused block groups are removed from close_ctree too early,
> moving that after the threads stop AND making sure that it does not need
> either of them should work.
> 
> The "AND" above is not currently implemented as btrfs_delete_unused_bgs
> calls plain btrfs_end_transaction that wakes up transaction kthread, so
> there would need to be an argument passed to tell it to do full commit.

It's too fragile to run around in the filesystem with these threads
freed. We can probably make it now, but I'm worried that we'll add
another wakeup somewhere and blow up.


[PATCH v2] Btrfs: fix missing delayed iputs on unmount

2018-10-31 Thread Omar Sandoval
From: Omar Sandoval 

There's a race between close_ctree() and cleaner_kthread().
close_ctree() sets btrfs_fs_closing(), and the cleaner stops when it
sees it set, but this is racy; the cleaner might have already checked
the bit and could be cleaning stuff. In particular, if it deletes unused
block groups, it will create delayed iputs for the free space cache
inodes. As of "btrfs: don't run delayed_iputs in commit", we're no
longer running delayed iputs after a commit. Therefore, if the cleaner
creates more delayed iputs after delayed iputs are run in
btrfs_commit_super(), we will leak inodes on unmount and get a busy
inode crash from the VFS.

Fix it by parking the cleaner before we actually close anything. Then,
any remaining delayed iputs will always be handled in
btrfs_commit_super(). This also ensures that the commit in close_ctree()
is really the last commit, so we can get rid of the commit in
cleaner_kthread().

Fixes: 30928e9baac2 ("btrfs: don't run delayed_iputs in commit")
Signed-off-by: Omar Sandoval 
---
Changes from v1:

- Add a comment explaining why it needs to be a kthread_park(), not
  kthread_stop()
- Update later comment now that the cleaner thread is definitely stopped

 fs/btrfs/disk-io.c | 51 ++
 1 file changed, 15 insertions(+), 36 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index b0ab41da91d1..40bcc45d827d 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1664,9 +1664,8 @@ static int cleaner_kthread(void *arg)
struct btrfs_root *root = arg;
struct btrfs_fs_info *fs_info = root->fs_info;
int again;
-   struct btrfs_trans_handle *trans;
 
-   do {
+   while (1) {
again = 0;
 
/* Make the cleaner go to sleep early. */
@@ -1715,42 +1714,16 @@ static int cleaner_kthread(void *arg)
 */
btrfs_delete_unused_bgs(fs_info);
 sleep:
+   if (kthread_should_park())
+   kthread_parkme();
+   if (kthread_should_stop())
+   return 0;
if (!again) {
set_current_state(TASK_INTERRUPTIBLE);
-   if (!kthread_should_stop())
-   schedule();
+   schedule();
__set_current_state(TASK_RUNNING);
}
-   } while (!kthread_should_stop());
-
-   /*
-* Transaction kthread is stopped before us and wakes us up.
-* However we might have started a new transaction and COWed some
-* tree blocks when deleting unused block groups for example. So
-* make sure we commit the transaction we started to have a clean
-* shutdown when evicting the btree inode - if it has dirty pages
-* when we do the final iput() on it, eviction will trigger a
-* writeback for it which will fail with null pointer dereferences
-* since work queues and other resources were already released and
-* destroyed by the time the iput/eviction/writeback is made.
-*/
-   trans = btrfs_attach_transaction(root);
-   if (IS_ERR(trans)) {
-   if (PTR_ERR(trans) != -ENOENT)
-   btrfs_err(fs_info,
- "cleaner transaction attach returned %ld",
- PTR_ERR(trans));
-   } else {
-   int ret;
-
-   ret = btrfs_commit_transaction(trans);
-   if (ret)
-   btrfs_err(fs_info,
- "cleaner open transaction commit returned %d",
- ret);
}
-
-   return 0;
 }
 
 static int transaction_kthread(void *arg)
@@ -3931,6 +3904,13 @@ void close_ctree(struct btrfs_fs_info *fs_info)
int ret;
 
	set_bit(BTRFS_FS_CLOSING_START, &fs_info->flags);
+   /*
+* We don't want the cleaner to start new transactions, add more delayed
+* iputs, etc. while we're closing. We can't use kthread_stop() yet
+* because that frees the task_struct, and the transaction kthread might
+* still try to wake up the cleaner.
+*/
+   kthread_park(fs_info->cleaner_kthread);
 
/* wait for the qgroup rescan worker to stop */
btrfs_qgroup_wait_for_completion(fs_info, false);
@@ -3958,9 +3938,8 @@ void close_ctree(struct btrfs_fs_info *fs_info)
 
if (!sb_rdonly(fs_info->sb)) {
/*
-* If the cleaner thread is stopped and there are
-* block groups queued for removal, the deletion will be
-* skipped when we quit the cleaner thread.
+* The cleaner kthread is stopped, so do one final pass over
+* unused block groups.
 */
btrfs_delete_unused_bgs(fs_info);
 
-- 
2.19.1



Re: [PATCH] Btrfs: fix missing delayed iputs on unmount

2018-10-31 Thread Omar Sandoval
On Wed, Oct 31, 2018 at 06:40:29PM +0200, Nikolay Borisov wrote:
> 
> 
> On 31.10.18 г. 18:35 ч., Omar Sandoval wrote:
> > On Wed, Oct 31, 2018 at 10:43:02AM +0200, Nikolay Borisov wrote:
> >>
> >>
> >> On 31.10.18 г. 2:14 ч., Omar Sandoval wrote:
> >>> From: Omar Sandoval 
> >>>
> >>> There's a race between close_ctree() and cleaner_kthread().
> >>> close_ctree() sets btrfs_fs_closing(), and the cleaner stops when it
> >>> sees it set, but this is racy; the cleaner might have already checked
> >>> the bit and could be cleaning stuff. In particular, if it deletes unused
> >>> block groups, it will create delayed iputs for the free space cache
> >>> inodes. As of "btrfs: don't run delayed_iputs in commit", we're no
> >>> longer running delayed iputs after a commit. Therefore, if the cleaner
> >>> creates more delayed iputs after delayed iputs are run in
> >>> btrfs_commit_super(), we will leak inodes on unmount and get a busy
> >>> inode crash from the VFS.
> >>>
> >>> Fix it by parking the cleaner before we actually close anything. Then,
> >>> any remaining delayed iputs will always be handled in
> >>> btrfs_commit_super(). This also ensures that the commit in close_ctree()
> >>> is really the last commit, so we can get rid of the commit in
> >>> cleaner_kthread().
> >>>
> >>> Fixes: 30928e9baac2 ("btrfs: don't run delayed_iputs in commit")
> >>> Signed-off-by: Omar Sandoval 
> >>> ---
> >>> We found this with a stress test that our containers team runs. I'm
> >>> wondering if this same race could have caused any other issues other
> >>> than this new iput thing, but I couldn't identify any.
> >>>
> >>>  fs/btrfs/disk-io.c | 40 +++-
> >>>  1 file changed, 7 insertions(+), 33 deletions(-)
> >>>
> >>> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> >>> index b0ab41da91d1..7c17284ae3c2 100644
> >>> --- a/fs/btrfs/disk-io.c
> >>> +++ b/fs/btrfs/disk-io.c
> >>> @@ -1664,9 +1664,8 @@ static int cleaner_kthread(void *arg)
> >>>   struct btrfs_root *root = arg;
> >>>   struct btrfs_fs_info *fs_info = root->fs_info;
> >>>   int again;
> >>> - struct btrfs_trans_handle *trans;
> >>>  
> >>> - do {
> >>> + while (1) {
> >>>   again = 0;
> >>>  
> >>>   /* Make the cleaner go to sleep early. */
> >>> @@ -1715,42 +1714,16 @@ static int cleaner_kthread(void *arg)
> >>>*/
> >>>   btrfs_delete_unused_bgs(fs_info);
> >>>  sleep:
> >>> + if (kthread_should_park())
> >>> + kthread_parkme();
> >>> + if (kthread_should_stop())
> >>> + return 0;
> >>>   if (!again) {
> >>>   set_current_state(TASK_INTERRUPTIBLE);
> >>> - if (!kthread_should_stop())
> >>> - schedule();
> >>> + schedule();
> >>>   __set_current_state(TASK_RUNNING);
> >>>   }
> >>> - } while (!kthread_should_stop());
> >>> -
> >>> - /*
> >>> -  * Transaction kthread is stopped before us and wakes us up.
> >>> -  * However we might have started a new transaction and COWed some
> >>> -  * tree blocks when deleting unused block groups for example. So
> >>> -  * make sure we commit the transaction we started to have a clean
> >>> -  * shutdown when evicting the btree inode - if it has dirty pages
> >>> -  * when we do the final iput() on it, eviction will trigger a
> >>> -  * writeback for it which will fail with null pointer dereferences
> >>> -  * since work queues and other resources were already released and
> >>> -  * destroyed by the time the iput/eviction/writeback is made.
> >>> -  */
> >>> - trans = btrfs_attach_transaction(root);
> >>> - if (IS_ERR(trans)) {
> >>> - if (PTR_ERR(trans) != -ENOENT)
> >>> - btrfs_err(fs_info,
> >>> -   "cleaner transaction attach returned %ld",
> >>> -   PTR_ERR(trans));
> >>> - } else {
> >>> - int ret;

Re: [PATCH] Btrfs: fix missing delayed iputs on unmount

2018-10-31 Thread Omar Sandoval
On Wed, Oct 31, 2018 at 03:41:47PM +0800, Lu Fengqi wrote:
> On Tue, Oct 30, 2018 at 05:14:42PM -0700, Omar Sandoval wrote:
> >From: Omar Sandoval 
> >
> >There's a race between close_ctree() and cleaner_kthread().
> >close_ctree() sets btrfs_fs_closing(), and the cleaner stops when it
> >sees it set, but this is racy; the cleaner might have already checked
> >the bit and could be cleaning stuff. In particular, if it deletes unused
> >block groups, it will create delayed iputs for the free space cache
> >inodes. As of "btrfs: don't run delayed_iputs in commit", we're no
> >longer running delayed iputs after a commit. Therefore, if the cleaner
> >creates more delayed iputs after delayed iputs are run in
> >btrfs_commit_super(), we will leak inodes on unmount and get a busy
> 
> Since the assert added via commit e187831e1875 ("btrfs: assert on non-empty
> delayed iputs") wasn't triggered, it doesn't seem to be the cause of inode
> leak.

This was in our production build without CONFIG_BTRFS_ASSERT.


Re: [PATCH] Btrfs: fix missing delayed iputs on unmount

2018-10-31 Thread Omar Sandoval
On Wed, Oct 31, 2018 at 10:43:02AM +0200, Nikolay Borisov wrote:
> 
> 
> On 31.10.18 г. 2:14 ч., Omar Sandoval wrote:
> > From: Omar Sandoval 
> > 
> > There's a race between close_ctree() and cleaner_kthread().
> > close_ctree() sets btrfs_fs_closing(), and the cleaner stops when it
> > sees it set, but this is racy; the cleaner might have already checked
> > the bit and could be cleaning stuff. In particular, if it deletes unused
> > block groups, it will create delayed iputs for the free space cache
> > inodes. As of "btrfs: don't run delayed_iputs in commit", we're no
> > longer running delayed iputs after a commit. Therefore, if the cleaner
> > creates more delayed iputs after delayed iputs are run in
> > btrfs_commit_super(), we will leak inodes on unmount and get a busy
> > inode crash from the VFS.
> > 
> > Fix it by parking the cleaner before we actually close anything. Then,
> > any remaining delayed iputs will always be handled in
> > btrfs_commit_super(). This also ensures that the commit in close_ctree()
> > is really the last commit, so we can get rid of the commit in
> > cleaner_kthread().
> > 
> > Fixes: 30928e9baac2 ("btrfs: don't run delayed_iputs in commit")
> > Signed-off-by: Omar Sandoval 
> > ---
> > We found this with a stress test that our containers team runs. I'm
> > wondering if this same race could have caused any other issues other
> > than this new iput thing, but I couldn't identify any.
> > 
> >  fs/btrfs/disk-io.c | 40 +++-
> >  1 file changed, 7 insertions(+), 33 deletions(-)
> > 
> > diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> > index b0ab41da91d1..7c17284ae3c2 100644
> > --- a/fs/btrfs/disk-io.c
> > +++ b/fs/btrfs/disk-io.c
> > @@ -1664,9 +1664,8 @@ static int cleaner_kthread(void *arg)
> > struct btrfs_root *root = arg;
> > struct btrfs_fs_info *fs_info = root->fs_info;
> > int again;
> > -   struct btrfs_trans_handle *trans;
> >  
> > -   do {
> > +   while (1) {
> > again = 0;
> >  
> > /* Make the cleaner go to sleep early. */
> > @@ -1715,42 +1714,16 @@ static int cleaner_kthread(void *arg)
> >  */
> > btrfs_delete_unused_bgs(fs_info);
> >  sleep:
> > +   if (kthread_should_park())
> > +   kthread_parkme();
> > +   if (kthread_should_stop())
> > +   return 0;
> > if (!again) {
> > set_current_state(TASK_INTERRUPTIBLE);
> > -   if (!kthread_should_stop())
> > -   schedule();
> > +   schedule();
> > __set_current_state(TASK_RUNNING);
> > }
> > -   } while (!kthread_should_stop());
> > -
> > -   /*
> > -* Transaction kthread is stopped before us and wakes us up.
> > -* However we might have started a new transaction and COWed some
> > -* tree blocks when deleting unused block groups for example. So
> > -* make sure we commit the transaction we started to have a clean
> > -* shutdown when evicting the btree inode - if it has dirty pages
> > -* when we do the final iput() on it, eviction will trigger a
> > -* writeback for it which will fail with null pointer dereferences
> > -* since work queues and other resources were already released and
> > -* destroyed by the time the iput/eviction/writeback is made.
> > -*/
> > -   trans = btrfs_attach_transaction(root);
> > -   if (IS_ERR(trans)) {
> > -   if (PTR_ERR(trans) != -ENOENT)
> > -   btrfs_err(fs_info,
> > - "cleaner transaction attach returned %ld",
> > - PTR_ERR(trans));
> > -   } else {
> > -   int ret;
> > -
> > -   ret = btrfs_commit_transaction(trans);
> > -   if (ret)
> > -   btrfs_err(fs_info,
> > - "cleaner open transaction commit returned %d",
> > - ret);
> > }
> > -
> > -   return 0;
> >  }
> >  
> >  static int transaction_kthread(void *arg)
> > @@ -3931,6 +3904,7 @@ void close_ctree(struct btrfs_fs_info *fs_info)
> > int ret;
> >  
> > set_bit(BTRFS_FS_CLOSING_START, &fs_info->flags);
> > +   kthread_park(fs_info->cleaner_kthread);
> 
> Can't you directly call kthread_stop here? When you park the thread it
> will sleep and then when you call kthread_stop that function will unpark
> the thread and the cleaner kthread will see KTHREAD_SHOULD_STOP bit and
> just return 0. So the from the moment the thread is parked until it's
> stopped it doesn't have a chance to do useful work.

kthread_stop() frees the task_struct, but we might still try to wake up
the cleaner kthread from somewhere (e.g., from the transaction kthread).
So we really need to keep the cleaner alive but not doing work.


[PATCH] Btrfs: fix missing delayed iputs on unmount

2018-10-30 Thread Omar Sandoval
From: Omar Sandoval 

There's a race between close_ctree() and cleaner_kthread().
close_ctree() sets btrfs_fs_closing(), and the cleaner stops when it
sees it set, but this is racy; the cleaner might have already checked
the bit and could be cleaning stuff. In particular, if it deletes unused
block groups, it will create delayed iputs for the free space cache
inodes. As of "btrfs: don't run delayed_iputs in commit", we're no
longer running delayed iputs after a commit. Therefore, if the cleaner
creates more delayed iputs after delayed iputs are run in
btrfs_commit_super(), we will leak inodes on unmount and get a busy
inode crash from the VFS.

Fix it by parking the cleaner before we actually close anything. Then,
any remaining delayed iputs will always be handled in
btrfs_commit_super(). This also ensures that the commit in close_ctree()
is really the last commit, so we can get rid of the commit in
cleaner_kthread().

Fixes: 30928e9baac2 ("btrfs: don't run delayed_iputs in commit")
Signed-off-by: Omar Sandoval 
---
We found this with a stress test that our containers team runs. I'm
wondering if this same race could have caused any other issues other
than this new iput thing, but I couldn't identify any.

 fs/btrfs/disk-io.c | 40 +++-
 1 file changed, 7 insertions(+), 33 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index b0ab41da91d1..7c17284ae3c2 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1664,9 +1664,8 @@ static int cleaner_kthread(void *arg)
struct btrfs_root *root = arg;
struct btrfs_fs_info *fs_info = root->fs_info;
int again;
-   struct btrfs_trans_handle *trans;
 
-   do {
+   while (1) {
again = 0;
 
/* Make the cleaner go to sleep early. */
@@ -1715,42 +1714,16 @@ static int cleaner_kthread(void *arg)
 */
btrfs_delete_unused_bgs(fs_info);
 sleep:
+   if (kthread_should_park())
+   kthread_parkme();
+   if (kthread_should_stop())
+   return 0;
if (!again) {
set_current_state(TASK_INTERRUPTIBLE);
-   if (!kthread_should_stop())
-   schedule();
+   schedule();
__set_current_state(TASK_RUNNING);
}
-   } while (!kthread_should_stop());
-
-   /*
-* Transaction kthread is stopped before us and wakes us up.
-* However we might have started a new transaction and COWed some
-* tree blocks when deleting unused block groups for example. So
-* make sure we commit the transaction we started to have a clean
-* shutdown when evicting the btree inode - if it has dirty pages
-* when we do the final iput() on it, eviction will trigger a
-* writeback for it which will fail with null pointer dereferences
-* since work queues and other resources were already released and
-* destroyed by the time the iput/eviction/writeback is made.
-*/
-   trans = btrfs_attach_transaction(root);
-   if (IS_ERR(trans)) {
-   if (PTR_ERR(trans) != -ENOENT)
-   btrfs_err(fs_info,
- "cleaner transaction attach returned %ld",
- PTR_ERR(trans));
-   } else {
-   int ret;
-
-   ret = btrfs_commit_transaction(trans);
-   if (ret)
-   btrfs_err(fs_info,
- "cleaner open transaction commit returned %d",
- ret);
}
-
-   return 0;
 }
 
 static int transaction_kthread(void *arg)
@@ -3931,6 +3904,7 @@ void close_ctree(struct btrfs_fs_info *fs_info)
int ret;
 
set_bit(BTRFS_FS_CLOSING_START, &fs_info->flags);
+   kthread_park(fs_info->cleaner_kthread);
 
/* wait for the qgroup rescan worker to stop */
btrfs_qgroup_wait_for_completion(fs_info, false);
-- 
2.19.1



Re: [PATCH] btrfs: add zstd compression level support

2018-10-30 Thread Omar Sandoval
On Tue, Oct 30, 2018 at 12:06:21PM -0700, Nick Terrell wrote:
> From: Jennifer Liu 
> 
> Adds zstd compression level support to btrfs. Zstd requires
> different amounts of memory for each level, so the design had
> to be modified to allow set_level() to allocate memory. We
> preallocate one workspace of the maximum size to guarantee
> forward progress. This feature is expected to be useful for
> read-mostly filesystems, or when creating images.
> 
> Benchmarks run in qemu on Intel x86 with a single core.
> The benchmark measures the time to copy the Silesia corpus [0] to
> a btrfs filesystem 10 times, then read it back.
> 
> The two important things to note are:
> - The decompression speed and memory remains constant.
>   The memory required to decompress is the same as level 1.
> - The compression speed and ratio will vary based on the source.
> 
> Level  Ratio  Compression  Decompression  Compression Memory
> 1      2.59   153 MB/s     112 MB/s       0.8 MB
> 2      2.67   136 MB/s     113 MB/s       1.0 MB
> 3      2.72   106 MB/s     115 MB/s       1.3 MB
> 4      2.78    86 MB/s     109 MB/s       0.9 MB
> 5      2.83    69 MB/s     109 MB/s       1.4 MB
> 6      2.89    53 MB/s     110 MB/s       1.5 MB
> 7      2.91    40 MB/s     112 MB/s       1.4 MB
> 8      2.92    34 MB/s     110 MB/s       1.8 MB
> 9      2.93    27 MB/s     109 MB/s       1.8 MB
> 10     2.94    22 MB/s     109 MB/s       1.8 MB
> 11     2.95    17 MB/s     114 MB/s       1.8 MB
> 12     2.95    13 MB/s     113 MB/s       1.8 MB
> 13     2.95    10 MB/s     111 MB/s       2.3 MB
> 14     2.99     7 MB/s     110 MB/s       2.6 MB
> 15     3.03     6 MB/s     110 MB/s       2.6 MB
> 
> [0] http://sun.aei.polsl.pl/~sdeor/index.php?page=silesia

Reviewed-by: Omar Sandoval 

> Signed-off-by: Jennifer Liu 
> Signed-off-by: Nick Terrell 
> ---
>  fs/btrfs/compression.c | 172 +
>  fs/btrfs/compression.h |  18 +++--
>  fs/btrfs/lzo.c |   5 +-
>  fs/btrfs/super.c   |   7 +-
>  fs/btrfs/zlib.c|  33 
>  fs/btrfs/zstd.c|  74 ++
>  6 files changed, 204 insertions(+), 105 deletions(-)
> 
> diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
> index 2955a4ea2fa8..bd8e69381dc9 100644
> --- a/fs/btrfs/compression.c
> +++ b/fs/btrfs/compression.c
> @@ -822,11 +822,15 @@ void __init btrfs_init_compress(void)
>  
>   /*
>* Preallocate one workspace for each compression type so
> -  * we can guarantee forward progress in the worst case
> +  * we can guarantee forward progress in the worst case.
> +  * Provide the maximum compression level to guarantee large
> +  * enough workspace.
>*/
> - workspace = btrfs_compress_op[i]->alloc_workspace();
> + workspace = btrfs_compress_op[i]->alloc_workspace(
> + btrfs_compress_op[i]->max_level);
>   if (IS_ERR(workspace)) {
> - pr_warn("BTRFS: cannot preallocate compression 
> workspace, will try later\n");
> + pr_warn("BTRFS: cannot preallocate compression "
> + "workspace, will try later\n");

Nit: since you didn't change this line, don't rewrap it.


Re: [PATCH v9 0/6] Btrfs: implement swap file support

2018-10-22 Thread Omar Sandoval
On Fri, Oct 19, 2018 at 05:43:18PM +0200, David Sterba wrote:
> On Thu, Sep 27, 2018 at 11:17:32AM -0700, Omar Sandoval wrote:
> > From: Omar Sandoval 
> > This series implements swap file support for Btrfs.
> > 
> > Changes from v8 [1]:
> > 
> > - Fixed a bug in btrfs_swap_activate() which would cause us to miss some
> >   file extents if they were merged into one extent map entry.
> > - Fixed build for !CONFIG_SWAP.
> > - Changed all error messages to KERN_WARN.
> > - Unindented long error messages.
> > 
> > I've Cc'd Jon and Al on patch 3 this time, so hopefully we can get an
> > ack for that one, too.
> > 
> > Thanks!
> > 
> > 1: https://www.spinics.net/lists/linux-btrfs/msg82267.html
> > 
> > Omar Sandoval (6):
> >   mm: split SWP_FILE into SWP_ACTIVATED and SWP_FS
> >   mm: export add_swap_extent()
> >   vfs: update swap_{,de}activate documentation
> >   Btrfs: prevent ioctls from interfering with a swap file
> >   Btrfs: rename get_chunk_map() and make it non-static
> >   Btrfs: support swap files
> 
> Patches 1 and 2 now going through Andrew's tree, the btrfs part will be
> delayed and not merged to 4.20. This is a bit unfortuante, I was busy
> with the non-feature patches and other things, sorry.

That's perfectly fine with me, thanks, Dave!


Re: [PATCH 10/10] btrfs-progs: check: Fix wrong error message in case of corrupted bitmap

2018-10-04 Thread Omar Sandoval
On Mon, Oct 01, 2018 at 05:46:21PM +0300, Nikolay Borisov wrote:
> Similarly to the fix in e444c7bfa65f ("btrfs-progs: check: Fix wrong
> error message in case of corrupted extent") this commits addresses the
> same problem but for corrupted bitmap objects.

Oops.

Reviewed-by: Omar Sandoval 

> Signed-off-by: Nikolay Borisov 
> ---
>  free-space-tree.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/free-space-tree.c b/free-space-tree.c
> index 3b7e8a3fe4f5..690da44a739d 100644
> --- a/free-space-tree.c
> +++ b/free-space-tree.c
> @@ -1302,7 +1302,7 @@ static int load_free_space_bitmaps(struct btrfs_fs_info 
> *fs_info,
>   if (key.objectid + key.offset > end) {
>   fprintf(stderr,
>   "free space bitmap ends at %llu, beyond end of block group %llu-%llu\n",
> - key.objectid, start, end);
> + key.objectid + key.offset, start, end);
>   (*errors)++;
>   break;
>   }
> -- 
> 2.7.4
> 


Re: [PATCH 08/10] btrfs-progs: check: Add support for freespace tree fixing

2018-10-04 Thread Omar Sandoval
On Mon, Oct 01, 2018 at 05:46:19PM +0300, Nikolay Borisov wrote:
> Now that all the prerequisite code for proper support of free space
> tree repair is in, it's time to wire it in. This is achieved by first
> hooking the freespace tree to the __free_extent/alloc_reserved_tree_block
> functions. And then introducing a wrapper function to contain the
> existing check_space_cache and the newly introduced repair code.
> Finally, it's important to note that FST repair code first clears the
> existing FST in case of any problem found and rebuilds it from scratch.

Reviewed-by: Omar Sandoval 

A couple of really trivial nitpicks below that you should feel free to
ignore ;)

> Signed-off-by: Nikolay Borisov 
> ---
>  check/main.c | 47 ++-
>  1 file changed, 30 insertions(+), 17 deletions(-)
> 
> diff --git a/check/main.c b/check/main.c
> index b361cd7e26a0..4daf85aad82c 100644
> --- a/check/main.c
> +++ b/check/main.c
> @@ -5392,14 +5392,6 @@ static int check_space_cache(struct btrfs_root *root)
>   int ret;
>   int error = 0;
>  
> - if (btrfs_super_cache_generation(root->fs_info->super_copy) != -1ULL &&
> - btrfs_super_generation(root->fs_info->super_copy) !=
> - btrfs_super_cache_generation(root->fs_info->super_copy)) {
> - printf("cache and super generation don't match, space cache "
> -"will be invalidated\n");
> - return 0;
> - }
> -
>   while (1) {
>   ctx.item_count++;
>   cache = btrfs_lookup_first_block_group(root->fs_info, start);
> @@ -9417,7 +9409,6 @@ static int do_clear_free_space_cache(struct 
> btrfs_fs_info *fs_info,
>   ret = 1;
>   goto close_out;
>   }
> - printf("Clearing free space cache\n");

Just out of curiosity, why did you delete this message? The one in the
v2 case is still there.

>   ret = clear_free_space_cache(fs_info);
>   if (ret) {
>   error("failed to clear free space cache");
> @@ -9444,6 +9435,35 @@ static int do_clear_free_space_cache(struct 
> btrfs_fs_info *fs_info,
>   return ret;
>  }
>  
> +static int validate_free_space_cache(struct btrfs_root *root)

At first glance, I wouldn't know what the difference is between
check_space_cache() and validate_free_space_cache(); they sound like the
same thing. Maybe rename this to check_and_repair_space_cache() or just
fold the rebuild into check_space_cache(), to be more in line with the
other check steps in fsck?


Re: [PATCH 07/10] btrfs-progs: Add freespace tree as compat_ro supported feature

2018-10-04 Thread Omar Sandoval
On Mon, Oct 01, 2018 at 05:46:18PM +0300, Nikolay Borisov wrote:
> The RO_FREE_SPACE_TREE(_VALID) flags are required in order to be able
> to open an FST filesystem in repair mode. Add them to
> BTRFS_FEATURE_COMPAT_RO_SUPP.
> 
> Signed-off-by: Nikolay Borisov 
> ---
>  ctree.h | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/ctree.h b/ctree.h
> index a6d6c3decd87..3c396e7d293d 100644
> --- a/ctree.h
> +++ b/ctree.h
> @@ -497,7 +497,9 @@ struct btrfs_super_block {
>   * added here until read-write support for the free space tree is 
> implemented in
>   * btrfs-progs.
>   */

This comment should go away.

> -#define BTRFS_FEATURE_COMPAT_RO_SUPP 0ULL
> +#define BTRFS_FEATURE_COMPAT_RO_SUPP \
> + (BTRFS_FEATURE_COMPAT_RO_FREE_SPACE_TREE |  \
> +  BTRFS_FEATURE_COMPAT_RO_FREE_SPACE_TREE_VALID)
>  
>  #define BTRFS_FEATURE_INCOMPAT_SUPP  \
>   (BTRFS_FEATURE_INCOMPAT_MIXED_BACKREF | \

To repeat my question from before, did you test whether we can properly
change the filesystem with, e.g., btrfstune or btrfs fi label? Given
that some critical code was missing in the free space tree code, I'd be
surprised if it worked correctly.


Re: [PATCH 05/10] btrfs-progs: Pull free space tree related code from kernel

2018-10-04 Thread Omar Sandoval
On Mon, Oct 01, 2018 at 05:46:16PM +0300, Nikolay Borisov wrote:
> To help implement a free space tree checker in user space, some kernel
> functions are necessary, namely iterating/deleting/adding freespace
> items, some internal search functions, and functions to populate a
> block group based on the extent tree. The code is largely copied from
> the kernel with locking eliminated (i.e. free_space_lock). It supports
> reading/writing of both bitmap and extent based FST trees.

For some reason, a lot of this added code uses spaces instead of tabs,
so I had to fix that in order to compare it to the kernel code (some of
the functions were reordered, too).

The only functional difference I noticed was that this is missing the
code to insert the block group header in the free space tree:

if (block_group->needs_free_space) {
ret = __add_block_group_free_space(trans, block_group, path);
if (ret)
return ret;
}

Was that intentionally omitted? Without it, the free space tree is
pretty broken :(

> Signed-off-by: Nikolay Borisov 
> ---
>  ctree.c   |   77 
>  ctree.h   |   15 +
>  free-space-tree.c | 1253 
> -
>  free-space-tree.h |   13 +-
>  kerncompat.h  |6 +
>  5 files changed, 1357 insertions(+), 7 deletions(-)
> 
> diff --git a/ctree.c b/ctree.c
> index d8a6883aa85f..aa1568620205 100644
> --- a/ctree.c
> +++ b/ctree.c
> @@ -1226,6 +1226,83 @@ int btrfs_search_slot(struct btrfs_trans_handle 
> *trans, struct btrfs_root
>  }
>  
>  /*
> + * helper to use instead of search slot if no exact match is needed but
> + * instead the next or previous item should be returned.
> + * When find_higher is true, the next higher item is returned, the next lower
> + * otherwise.
> + * When return_any and find_higher are both true, and no higher item is 
> found,
> + * return the next lower instead.
> + * When return_any is true and find_higher is false, and no lower item is 
> found,
> + * return the next higher instead.
> + * It returns 0 if any item is found, 1 if none is found (tree empty), and
> + * < 0 on error
> + */
> +int btrfs_search_slot_for_read(struct btrfs_root *root,
> +   const struct btrfs_key *key,
> +   struct btrfs_path *p, int find_higher,
> +   int return_any)
> +{
> +int ret;
> +struct extent_buffer *leaf;
> +
> +again:
> +ret = btrfs_search_slot(NULL, root, key, p, 0, 0);
> +if (ret <= 0)
> +return ret;
> +/*
> + * a return value of 1 means the path is at the position where the
> + * item should be inserted. Normally this is the next bigger item,
> + * but in case the previous item is the last in a leaf, path points
> + * to the first free slot in the previous leaf, i.e. at an invalid
> + * item.
> + */
> +leaf = p->nodes[0];
> +
> +if (find_higher) {
> +if (p->slots[0] >= btrfs_header_nritems(leaf)) {
> +ret = btrfs_next_leaf(root, p);
> +if (ret <= 0)
> +return ret;
> +if (!return_any)
> +return 1;
> +/*
> + * no higher item found, return the next
> + * lower instead
> + */
> +return_any = 0;
> +find_higher = 0;
> +btrfs_release_path(p);
> +goto again;
> +}
> +} else {
> +if (p->slots[0] == 0) {
> +ret = btrfs_prev_leaf(root, p);
> +if (ret < 0)
> +return ret;
> +if (!ret) {
> +leaf = p->nodes[0];
> +if (p->slots[0] == 
> btrfs_header_nritems(leaf))
> +p->slots[0]--;
> +return 0;
> +}
> +if (!return_any)
> +return 1;
> +/*
> + * no lower item found, return the next
> + * higher instead
> + */
> +return_any = 0;
> +find_higher = 1;
> +btrfs_release_path(p);
> +goto again;
> +} else {
> +--p->slots[0];
> +}
> +}
> +return 0;
> +}
> +
> +/*
>   * adjust the pointers going up the tree, starting at level
>   * making sure the right key of each node is points to 'key'.
>   * This is used after shifting pointers 

Re: [PATCH 04/10] btrfs-progs: Implement find_*_bit_le operations

2018-10-04 Thread Omar Sandoval
On Mon, Oct 01, 2018 at 05:46:15PM +0300, Nikolay Borisov wrote:
> This commit introduces explicit little endian bit operations. The only
> difference with the existing bitops implementation is that bswap(32|64)
> is called when the _le versions are invoked on a big-endian machine.
> This is in preparation for adding free space tree conversion support.

I had to check, but it looks like these are also pulled from the kernel
source, so

Reviewed-by: Omar Sandoval 

> Signed-off-by: Nikolay Borisov 
> ---
>  kernel-lib/bitops.h | 82 
> +
>  1 file changed, 82 insertions(+)


Re: [PATCH 03/10] btrfs-progs: Replace homegrown bitops related functions with kernel counterparts

2018-10-02 Thread Omar Sandoval
On Mon, Oct 01, 2018 at 05:46:14PM +0300, Nikolay Borisov wrote:
> Replace existing find_*_bit functions with kernel equivalent. This
> reduces duplication, simplifies the code (we really have one worker
> function _find_next_bit) and is quite likely faster. No functional
> changes.

Reviewed-by: Omar Sandoval 

> Signed-off-by: Nikolay Borisov 
> ---
>  kernel-lib/bitops.h | 142 
> +---
>  1 file changed, 46 insertions(+), 96 deletions(-)
> 
> diff --git a/kernel-lib/bitops.h b/kernel-lib/bitops.h
> index 5b35f9fc5213..78256adf55be 100644
> --- a/kernel-lib/bitops.h
> +++ b/kernel-lib/bitops.h
> @@ -2,6 +2,7 @@
>  #define _PERF_LINUX_BITOPS_H_
>  
>  #include 
> +#include "internal.h"
>  
>  #ifndef DIV_ROUND_UP
>  #define DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))
> @@ -109,116 +110,65 @@ static __always_inline unsigned long __ffs(unsigned 
> long word)
>  
>  #define ffz(x) __ffs(~(x))
>  
> +#define BITMAP_FIRST_WORD_MASK(start) (~0UL << ((start) & (BITS_PER_LONG - 
> 1))) 
> +#define BITMAP_LAST_WORD_MASK(nbits) (~0UL >> (-(nbits) & (BITS_PER_LONG - 
> 1)))
> +
>  /*
> - * Find the first set bit in a memory region.
> + * This is a common helper function for find_next_bit, find_next_zero_bit, 
> and
> + * find_next_and_bit. The differences are:
> + *  - The "invert" argument, which is XORed with each fetched word before
> + *searching it for one bits.
> + *  - The optional "addr2", which is anded with "addr1" if present.
>   */
> -static inline unsigned long
> -find_first_bit(const unsigned long *addr, unsigned long size)
> +static inline unsigned long _find_next_bit(const unsigned long *addr1,
> + const unsigned long *addr2, unsigned long nbits,
> + unsigned long start, unsigned long invert)
>  {
> - const unsigned long *p = addr;
> - unsigned long result = 0;
>   unsigned long tmp;
>  
> - while (size & ~(BITS_PER_LONG-1)) {
> - if ((tmp = *(p++)))
> - goto found;
> - result += BITS_PER_LONG;
> - size -= BITS_PER_LONG;
> + if (start >= nbits)
> + return nbits;
> +
> + tmp = addr1[start / BITS_PER_LONG];
> + if (addr2)
> + tmp &= addr2[start / BITS_PER_LONG];
> + tmp ^= invert;
> +
> + /* Handle 1st word. */
> + tmp &= BITMAP_FIRST_WORD_MASK(start);
> + start = round_down(start, BITS_PER_LONG);
> +
> + while (!tmp) {
> + start += BITS_PER_LONG;
> + if (start >= nbits)
> + return nbits;
> +
> + tmp = addr1[start / BITS_PER_LONG];
> + if (addr2)
> + tmp &= addr2[start / BITS_PER_LONG];
> + tmp ^= invert;
>   }
> - if (!size)
> - return result;
> -
> - tmp = (*p) & (~0UL >> (BITS_PER_LONG - size));
> - if (tmp == 0UL) /* Are any bits set? */
> - return result + size;   /* Nope. */
> -found:
> - return result + __ffs(tmp);
> +
> + return min(start + __ffs(tmp), nbits);
>  }
>  
>  /*
>   * Find the next set bit in a memory region.
>   */
> -static inline unsigned long
> -find_next_bit(const unsigned long *addr, unsigned long size,
> -   unsigned long offset)
> +static inline unsigned long find_next_bit(const unsigned long *addr,
> +   unsigned long size,
> +   unsigned long offset)
>  {
> - const unsigned long *p = addr + BITOP_WORD(offset);
> - unsigned long result = offset & ~(BITS_PER_LONG-1);
> - unsigned long tmp;
> -
> - if (offset >= size)
> - return size;
> - size -= result;
> - offset %= BITS_PER_LONG;
> - if (offset) {
> - tmp = *(p++);
> - tmp &= (~0UL << offset);
> - if (size < BITS_PER_LONG)
> - goto found_first;
> - if (tmp)
> - goto found_middle;
> - size -= BITS_PER_LONG;
> - result += BITS_PER_LONG;
> - }
> - while (size & ~(BITS_PER_LONG-1)) {
> - if ((tmp = *(p++)))
> - goto found_middle;
> - result += BITS_PER_LONG;
> - size -= BITS_PER_LONG;
> - }
> - if (!size)
> - return result;
> - tmp = *p;
> -
> -found_first:
> - tmp &= (~0UL >> (BITS_PER_LONG - size));
> - if (tmp == 0UL) /* Are an

Re: [PATCH 02/10] btrfs-progs: Add extent buffer bitmap manipulation infrastructure

2018-10-02 Thread Omar Sandoval
On Mon, Oct 01, 2018 at 05:46:13PM +0300, Nikolay Borisov wrote:
> Those functions are in preparation for adding the freespace tree
> repair code since it needs to be able to deal with bitmap based fsts.
> This patch adds extent_buffer_bitmap_set and extent_buffer_bitmap_clear
> functions. Since in userspace we don't have to deal with page mappings,
> their implementation is vastly simplified by simply setting each bit in
> the passed range.

Reviewed-by: Omar Sandoval 

> Signed-off-by: Nikolay Borisov 
> ---
>  extent_io.c | 56 
>  extent_io.h |  4 
>  2 files changed, 60 insertions(+)
> 
> diff --git a/extent_io.c b/extent_io.c
> index 198492699438..de47c2c59ae9 100644
> --- a/extent_io.c
> +++ b/extent_io.c
> @@ -204,6 +204,62 @@ static int clear_state_bit(struct extent_io_tree *tree,
>   return ret;
>  }
>  
> +/**
> + * extent_buffer_bitmap_set - set an area of a bitmap
> + * @eb: the extent buffer
> + * @start: offset of the bitmap item in the extent buffer
> + * @pos: bit number of the first bit
> + * @len: number of bits to set
> + */
> +void extent_buffer_bitmap_set(struct extent_buffer *eb, unsigned long start,
> +  unsigned long pos, unsigned long len)
> +{
> + u8 *p = (u8 *)eb->data + start + BIT_BYTE(pos);
> + const unsigned int size = pos + len;
> + int bits_to_set = BITS_PER_BYTE - (pos % BITS_PER_BYTE);
> + u8 mask_to_set = BITMAP_FIRST_BYTE_MASK(pos);
> +
> + while (len >= bits_to_set) {
> + *p |= mask_to_set;
> + len -= bits_to_set;
> + bits_to_set = BITS_PER_BYTE;
> + mask_to_set = ~0;
> + p++;
> + }
> + if (len) {
> + mask_to_set &= BITMAP_LAST_BYTE_MASK(size);
> + *p |= mask_to_set;
> + }
> +}
> +
> +
> +/**
> + * extent_buffer_bitmap_clear - clear an area of a bitmap
> + * @eb: the extent buffer
> + * @start: offset of the bitmap item in the extent buffer
> + * @pos: bit number of the first bit
> + * @len: number of bits to clear
> + */
> +void extent_buffer_bitmap_clear(struct extent_buffer *eb, unsigned long 
> start,
> +unsigned long pos, unsigned long len)
> +{
> + u8 *p = (u8 *)eb->data + start + BIT_BYTE(pos);
> + const unsigned int size = pos + len;
> + int bits_to_clear = BITS_PER_BYTE - (pos % BITS_PER_BYTE);
> + u8 mask_to_clear = BITMAP_FIRST_BYTE_MASK(pos);
> +
> + while (len >= bits_to_clear) {
> + *p &= ~mask_to_clear;
> + len -= bits_to_clear;
> + bits_to_clear = BITS_PER_BYTE;
> + mask_to_clear = ~0;
> + p++;
> + }
> + if (len) {
> + mask_to_clear &= BITMAP_LAST_BYTE_MASK(size);
> + *p &= ~mask_to_clear;
> + }
> +}
>  /*
>   * clear some bits on a range in the tree.
>   */
> diff --git a/extent_io.h b/extent_io.h
> index d407d93d617e..b67c6fc40e89 100644
> --- a/extent_io.h
> +++ b/extent_io.h
> @@ -175,4 +175,8 @@ int read_data_from_disk(struct btrfs_fs_info *info, void 
> *buf, u64 offset,
>   u64 bytes, int mirror);
>  int write_data_to_disk(struct btrfs_fs_info *info, void *buf, u64 offset,
>  u64 bytes, int mirror);
> +void extent_buffer_bitmap_clear(struct extent_buffer *eb, unsigned long 
> start,
> +unsigned long pos, unsigned long len);
> +void extent_buffer_bitmap_set(struct extent_buffer *eb, unsigned long start,
> +  unsigned long pos, unsigned long len);
>  #endif
> -- 
> 2.7.4
> 


Re: [PATCH 01/10] btrfs-progs: Add support for freespace tree in btrfs_read_fs_root

2018-10-02 Thread Omar Sandoval
On Mon, Oct 01, 2018 at 05:46:12PM +0300, Nikolay Borisov wrote:
> For completeness sake add code to btrfs_read_fs_root so that it can
> handle the freespace tree.

Reviewed-by: Omar Sandoval 

> Signed-off-by: Nikolay Borisov 
> ---
>  disk-io.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/disk-io.c b/disk-io.c
> index 2e6d56a36af9..14f0fd5c2f0c 100644
> --- a/disk-io.c
> +++ b/disk-io.c
> @@ -668,6 +668,9 @@ struct btrfs_root *btrfs_read_fs_root(struct 
> btrfs_fs_info *fs_info,
>   if (location->objectid == BTRFS_QUOTA_TREE_OBJECTID)
>   return fs_info->quota_enabled ? fs_info->quota_root :
>   ERR_PTR(-ENOENT);
> + if (location->objectid == BTRFS_FREE_SPACE_TREE_OBJECTID)
> +   return fs_info->free_space_root ? fs_info->free_space_root :
> +   ERR_PTR(-ENOENT);
>  
>   BUG_ON(location->objectid == BTRFS_TREE_RELOC_OBJECTID ||
>  location->offset != (u64)-1);
> -- 
> 2.7.4
> 


Re: [PATCH v2 1/9] fstests: btrfs: _scratch_mkfs_sized fix min size without mixed option

2018-09-27 Thread Omar Sandoval
On Wed, Sep 26, 2018 at 09:34:27AM +0300, Nikolay Borisov wrote:
> 
> 
> On 26.09.2018 07:07, Anand Jain wrote:
> > 
> > 
> > On 09/25/2018 06:51 PM, Nikolay Borisov wrote:
> >>
> >>
> >> On 25.09.2018 07:24, Anand Jain wrote:
> >>> As of now _scratch_mkfs_sized() checks if the requested size is below 1G
> >>> and forces the --mixed option for the mkfs.btrfs. Well the correct size
> >>> considering all possible group profiles at which we need to force the
> >>> mixed option is roughly 256Mbytes. So fix that.
> >>>
> >>> Signed-off-by: Anand Jain 
> >>
> >> Have you considered the implications of this w.r.t commit
> >> d4da414a9a9d ("common/rc: raise btrfs mixed mode threshold to 1GB")
> >>
> >> Initially this threshold was 100mb then Omar changed it to 1g. Does this
> >> change affect generic/427?
> > 
> > d4da414a9a9d does not explain what was the problem that Omar wanted to
> > address, mainly what was the failure about.
> 
> I just retested on upstream 4.19.0-rc3 with Omar's patch reverted (so
> anything above 100m for fs size is created with non-mixed block groups)
> and the test succeeded. So indeed your change seems to not make a
> difference for this test.
> 
> > 
> > And no it does not affect. I have verified generic/427 with kernel 4.1
> > and 4.19-rc5 with  btrfs-progs 4.1, 4.9 and latest from kdave they all
> > run fine. Good to integrate.

I had to double check, but it only happens with -m dup. If I apply the
following patch:

diff --git a/common/rc b/common/rc
index d5bb1fe..989b846 100644
--- a/common/rc
+++ b/common/rc
@@ -969,7 +969,7 @@ _scratch_mkfs_sized()
;;
 btrfs)
local mixed_opt=
-   (( fssize <= 1024 * 1024 * 1024 )) && mixed_opt='--mixed'
+   (( fssize <= 100 * 1024 * 1024 )) && mixed_opt='--mixed'
$MKFS_BTRFS_PROG $MKFS_OPTIONS $mixed_opt -b $fssize $SCRATCH_DEV
;;
 jfs)
diff --git a/tests/generic/427 b/tests/generic/427
index e8ebffe..206cf08 100755
--- a/tests/generic/427
+++ b/tests/generic/427
@@ -65,6 +65,7 @@ fi
 # start a background aio writer, which does several extending loops
 # internally and check data integrality
 $AIO_TEST -s $fsize -b 65536 $SCRATCH_MNT/tst-aio-dio-eof-race.$seq
+btrfs fi usage $SCRATCH_MNT
 status=$?
 
 kill $open_close_pid

And run with MKFS_OPTIONS="-m dup", then we don't have enough data space
for the test:

--- /root/linux/xfstests/tests/generic/427.out  2017-11-28 16:05:46.811435644 
-0800
+++ /root/linux/xfstests/results/generic/427.out.bad2018-09-27 
13:01:00.540510385 -0700
@@ -1,2 +1,24 @@
 QA output created by 427
-Success, all done.
+pwrite: No space left on device
+Overall:
+Device size:256.00MiB
+Device allocated:   255.00MiB
+Device unallocated:   1.00MiB
+Device missing: 0.00B
+Used:   179.03MiB
+Free (estimated):   0.00B  (min: 0.00B)
+Data ratio:  1.00
+Metadata ratio:  2.00
+Global reserve:  16.00MiB  (used: 0.00B)
+
+Data,single: Size:175.00MiB, Used:175.00MiB
+   /dev/nvme0n1p2   175.00MiB
+
+Metadata,DUP: Size:32.00MiB, Used:2.00MiB
+   /dev/nvme0n1p264.00MiB
+
+System,DUP: Size:8.00MiB, Used:16.00KiB
+   /dev/nvme0n1p216.00MiB
+
+Unallocated:
+   /dev/nvme0n1p2 1.00MiB


Re: [PATCH] btrfs: list usage cleanup

2018-09-27 Thread Omar Sandoval
On Wed, Sep 26, 2018 at 04:35:45PM +0800, zhong jiang wrote:
> Trivial cleanup: list_move_tail() does the same thing that
> list_del() + list_add_tail() do, hence just replace them.
> 
> Signed-off-by: zhong jiang 
> ---
>  fs/btrfs/send.c | 3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
> 
> diff --git a/fs/btrfs/send.c b/fs/btrfs/send.c
> index 094cc144..d87f416 100644
> --- a/fs/btrfs/send.c
> +++ b/fs/btrfs/send.c
> @@ -2075,8 +2075,7 @@ static struct name_cache_entry 
> *name_cache_search(struct send_ctx *sctx,
>   */
>  static void name_cache_used(struct send_ctx *sctx, struct name_cache_entry 
> *nce)
>  {
> - list_del(&nce->list);
> - list_add_tail(&nce->list, &sctx->name_cache_list);
> + list_move_tail(&nce->list, &sctx->name_cache_list);
>  }

At that point do we even need such a trivial helper, considering that
this is only called in one place?


[PATCH v9 6/6] Btrfs: support swap files

2018-09-27 Thread Omar Sandoval
From: Omar Sandoval 

Btrfs has not allowed swap files since commit 35054394c4b3 ("Btrfs: stop
providing a bmap operation to avoid swapfile corruptions"). However, now
that the proper restrictions are in place, Btrfs can support swap files
through the swap file a_ops, similar to iomap in commit 67482129cdab
("iomap: add a swapfile activation function").

For Btrfs, activation needs to make sure that the file can be used as a
swap file, which currently means that it must be fully allocated as
nocow with no compression on one device. It must also do the proper
tracking so that ioctls will not interfere with the swap file.
Deactivation clears this tracking.

Signed-off-by: Omar Sandoval 
---
 fs/btrfs/inode.c | 338 +++
 1 file changed, 338 insertions(+)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 3ea5339603cf..8f8b7079e1ba 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -27,6 +27,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include "ctree.h"
 #include "disk-io.h"
@@ -10488,6 +10489,341 @@ void btrfs_set_range_writeback(struct extent_io_tree 
*tree, u64 start, u64 end)
}
 }
 
+#ifdef CONFIG_SWAP
+/*
+ * Add an entry indicating a block group or device which is pinned by a
+ * swapfile. Returns 0 on success, 1 if there is already an entry for it, or a
+ * negative errno on failure.
+ */
+static int btrfs_add_swapfile_pin(struct inode *inode, void *ptr,
+ bool is_block_group)
+{
+   struct btrfs_fs_info *fs_info = BTRFS_I(inode)->root->fs_info;
+   struct btrfs_swapfile_pin *sp, *entry;
+   struct rb_node **p;
+   struct rb_node *parent = NULL;
+
+   sp = kmalloc(sizeof(*sp), GFP_NOFS);
+   if (!sp)
+   return -ENOMEM;
+   sp->ptr = ptr;
+   sp->inode = inode;
+   sp->is_block_group = is_block_group;
+
+   spin_lock(&fs_info->swapfile_pins_lock);
+   p = &fs_info->swapfile_pins.rb_node;
+   while (*p) {
+   parent = *p;
+   entry = rb_entry(parent, struct btrfs_swapfile_pin, node);
+   if (sp->ptr < entry->ptr ||
+   (sp->ptr == entry->ptr && sp->inode < entry->inode)) {
+   p = &(*p)->rb_left;
+   } else if (sp->ptr > entry->ptr ||
+  (sp->ptr == entry->ptr && sp->inode > entry->inode)) {
+   p = &(*p)->rb_right;
+   } else {
+   spin_unlock(&fs_info->swapfile_pins_lock);
+   kfree(sp);
+   return 1;
+   }
+   }
+   rb_link_node(&sp->node, parent, p);
+   rb_insert_color(&sp->node, &fs_info->swapfile_pins);
+   spin_unlock(&fs_info->swapfile_pins_lock);
+   return 0;
+}
+
+/* Free all of the entries pinned by this swapfile. */
+static void btrfs_free_swapfile_pins(struct inode *inode)
+{
+   struct btrfs_fs_info *fs_info = BTRFS_I(inode)->root->fs_info;
+   struct btrfs_swapfile_pin *sp;
+   struct rb_node *node, *next;
+
+   spin_lock(&fs_info->swapfile_pins_lock);
+   node = rb_first(&fs_info->swapfile_pins);
+   while (node) {
+   next = rb_next(node);
+   sp = rb_entry(node, struct btrfs_swapfile_pin, node);
+   if (sp->inode == inode) {
+   rb_erase(&sp->node, &fs_info->swapfile_pins);
+   if (sp->is_block_group)
+   btrfs_put_block_group(sp->ptr);
+   kfree(sp);
+   }
+   node = next;
+   }
+   spin_unlock(&fs_info->swapfile_pins_lock);
+}
+
+struct btrfs_swap_info {
+   u64 start;
+   u64 block_start;
+   u64 block_len;
+   u64 lowest_ppage;
+   u64 highest_ppage;
+   unsigned long nr_pages;
+   int nr_extents;
+};
+
+static int btrfs_add_swap_extent(struct swap_info_struct *sis,
+struct btrfs_swap_info *bsi)
+{
+   unsigned long nr_pages;
+   u64 first_ppage, first_ppage_reported, next_ppage;
+   int ret;
+
+   first_ppage = ALIGN(bsi->block_start, PAGE_SIZE) >> PAGE_SHIFT;
+   next_ppage = ALIGN_DOWN(bsi->block_start + bsi->block_len,
+   PAGE_SIZE) >> PAGE_SHIFT;
+
+   if (first_ppage >= next_ppage)
+   return 0;
+   nr_pages = next_ppage - first_ppage;
+
+   first_ppage_reported = first_ppage;
+   if (bsi->start == 0)
+   first_ppage_reported++;
+   if (bsi->lowest_ppage > first_ppage_reported)
+   bsi->lowest_ppage = first_ppage_reported;
+   if (bsi->highest_ppage < (next_ppage - 1))
+   bsi->highest_ppage = next_ppage - 1;
+
+   ret = add_swap_ex

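The partial-page trimming in btrfs_add_swap_extent() above (round the start up, round the end down, and drop the extent entirely if no whole page survives) can be illustrated with a self-contained userspace sketch. The macro and function names here are illustrative, not the kernel's; a 4 KiB page size is assumed.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT 12
#define PAGE_SIZE (1UL << PAGE_SHIFT)
/* Simplified versions of the kernel's ALIGN/ALIGN_DOWN for power-of-two a. */
#define ALIGN_UP(x, a)   (((x) + (a) - 1) & ~((uint64_t)(a) - 1))
#define ALIGN_DOWN(x, a) ((x) & ~((uint64_t)(a) - 1))

/*
 * Number of whole pages usable for swap in [block_start, block_start +
 * block_len): round the start up and the end down to page boundaries,
 * mirroring the first_ppage/next_ppage computation in the patch.
 */
static uint64_t swap_extent_pages(uint64_t block_start, uint64_t block_len,
                                  uint64_t *first_ppage, uint64_t *next_ppage)
{
    *first_ppage = ALIGN_UP(block_start, PAGE_SIZE) >> PAGE_SHIFT;
    *next_ppage = ALIGN_DOWN(block_start + block_len, PAGE_SIZE) >> PAGE_SHIFT;
    if (*first_ppage >= *next_ppage)
        return 0; /* extent too small to contain one whole page */
    return *next_ppage - *first_ppage;
}
```

An extent starting 6 KiB into the device and spanning 12 KiB thus contributes only the two whole pages in the middle; anything shorter than a page after trimming contributes nothing.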
[PATCH v9 5/6] Btrfs: rename get_chunk_map() and make it non-static

2018-09-27 Thread Omar Sandoval
From: Omar Sandoval 

The Btrfs swap code is going to need it, so give it a btrfs_ prefix and
make it non-static.

Reviewed-by: Nikolay Borisov 
Signed-off-by: Omar Sandoval 
---
 fs/btrfs/volumes.c | 29 ++---
 fs/btrfs/volumes.h |  2 ++
 2 files changed, 20 insertions(+), 11 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index aa37ae30bf62..20c26afdd330 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -2714,8 +2714,15 @@ static int btrfs_del_sys_chunk(struct btrfs_fs_info 
*fs_info, u64 chunk_offset)
return ret;
 }
 
-static struct extent_map *get_chunk_map(struct btrfs_fs_info *fs_info,
-   u64 logical, u64 length)
+/*
+ * btrfs_get_chunk_map() - Find the mapping containing the given logical extent.
+ * @logical: Logical block offset in bytes.
+ * @length: Length of extent in bytes.
+ *
+ * Return: Chunk mapping or ERR_PTR.
+ */
+struct extent_map *btrfs_get_chunk_map(struct btrfs_fs_info *fs_info,
+  u64 logical, u64 length)
 {
struct extent_map_tree *em_tree;
struct extent_map *em;
@@ -2752,7 +2759,7 @@ int btrfs_remove_chunk(struct btrfs_trans_handle *trans, 
u64 chunk_offset)
int i, ret = 0;
struct btrfs_fs_devices *fs_devices = fs_info->fs_devices;
 
-   em = get_chunk_map(fs_info, chunk_offset, 1);
+   em = btrfs_get_chunk_map(fs_info, chunk_offset, 1);
if (IS_ERR(em)) {
/*
 * This is a logic error, but we don't want to just rely on the
@@ -4902,7 +4909,7 @@ int btrfs_finish_chunk_alloc(struct btrfs_trans_handle 
*trans,
int i = 0;
int ret = 0;
 
-   em = get_chunk_map(fs_info, chunk_offset, chunk_size);
+   em = btrfs_get_chunk_map(fs_info, chunk_offset, chunk_size);
if (IS_ERR(em))
return PTR_ERR(em);
 
@@ -5044,7 +5051,7 @@ int btrfs_chunk_readonly(struct btrfs_fs_info *fs_info, 
u64 chunk_offset)
int miss_ndevs = 0;
int i;
 
-   em = get_chunk_map(fs_info, chunk_offset, 1);
+   em = btrfs_get_chunk_map(fs_info, chunk_offset, 1);
if (IS_ERR(em))
return 1;
 
@@ -5104,7 +5111,7 @@ int btrfs_num_copies(struct btrfs_fs_info *fs_info, u64 
logical, u64 len)
struct map_lookup *map;
int ret;
 
-   em = get_chunk_map(fs_info, logical, len);
+   em = btrfs_get_chunk_map(fs_info, logical, len);
if (IS_ERR(em))
/*
 * We could return errors for these cases, but that could get
@@ -5150,7 +5157,7 @@ unsigned long btrfs_full_stripe_len(struct btrfs_fs_info 
*fs_info,
struct map_lookup *map;
unsigned long len = fs_info->sectorsize;
 
-   em = get_chunk_map(fs_info, logical, len);
+   em = btrfs_get_chunk_map(fs_info, logical, len);
 
if (!WARN_ON(IS_ERR(em))) {
map = em->map_lookup;
@@ -5167,7 +5174,7 @@ int btrfs_is_parity_mirror(struct btrfs_fs_info *fs_info, 
u64 logical, u64 len)
struct map_lookup *map;
int ret = 0;
 
-   em = get_chunk_map(fs_info, logical, len);
+   em = btrfs_get_chunk_map(fs_info, logical, len);
 
	if (!WARN_ON(IS_ERR(em))) {
map = em->map_lookup;
@@ -5326,7 +5333,7 @@ static int __btrfs_map_block_for_discard(struct 
btrfs_fs_info *fs_info,
/* discard always return a bbio */
ASSERT(bbio_ret);
 
-   em = get_chunk_map(fs_info, logical, length);
+   em = btrfs_get_chunk_map(fs_info, logical, length);
if (IS_ERR(em))
return PTR_ERR(em);
 
@@ -5652,7 +5659,7 @@ static int __btrfs_map_block(struct btrfs_fs_info 
*fs_info,
return __btrfs_map_block_for_discard(fs_info, logical,
 *length, bbio_ret);
 
-   em = get_chunk_map(fs_info, logical, *length);
+   em = btrfs_get_chunk_map(fs_info, logical, *length);
if (IS_ERR(em))
return PTR_ERR(em);
 
@@ -5951,7 +5958,7 @@ int btrfs_rmap_block(struct btrfs_fs_info *fs_info, u64 
chunk_start,
u64 rmap_len;
int i, j, nr = 0;
 
-   em = get_chunk_map(fs_info, chunk_start, 1);
+   em = btrfs_get_chunk_map(fs_info, chunk_start, 1);
if (IS_ERR(em))
return -EIO;
 
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index 23e9285d88de..f4c190c2ab84 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -465,6 +465,8 @@ unsigned long btrfs_full_stripe_len(struct btrfs_fs_info 
*fs_info,
 int btrfs_finish_chunk_alloc(struct btrfs_trans_handle *trans,
 u64 chunk_offset, u64 chunk_size);
 int btrfs_remove_chunk(struct btrfs_trans_handle *trans, u64 chunk_offset);
+struct extent_map *btrfs_get_chunk_map(struct btrfs_fs_info *fs_info,
+  u64 logical, u64 length);
 
 static inline void btrfs_dev_stat_inc(struct btrfs

[PATCH v9 4/6] Btrfs: prevent ioctls from interfering with a swap file

2018-09-27 Thread Omar Sandoval
From: Omar Sandoval 

A later patch will implement swap file support for Btrfs, but before we
do that, we need to make sure that the various Btrfs ioctls cannot
change a swap file.

When a swap file is active, we must make sure that the extents of the
file are not moved and that they don't become shared. That means that
the following are not safe:

- chattr +c (enable compression)
- reflink
- dedupe
- snapshot
- defrag

Don't allow those to happen on an active swap file.

Additionally, balance, resize, device remove, and device replace are
also unsafe if they affect an active swapfile. Add a red-black tree of
block groups and devices which contain an active swapfile. Relocation
checks each block group against this tree and skips it or errors out for
balance or resize, respectively. Device remove and device replace check
the tree for the device they will operate on.

Note that we don't have to worry about chattr -C (disable nocow), which
we ignore for non-empty files, because an active swapfile must be
non-empty and can't be truncated. We also don't have to worry about
autodefrag because it's only done on COW files. Truncate and fallocate
are already taken care of by the generic code. Device add doesn't do
relocation so it's not an issue, either.
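The rbtree described above is keyed on the raw (ptr, inode) pointer pair, since only existence matters, not what the object is. A userspace sketch of that ordering (names are illustrative, and the pointers are compared as integers, as the kernel effectively does):

```c
#include <assert.h>
#include <stdint.h>

/* Userspace model of a swapfile-pin key: a block group or device pointer
 * plus the inode of the swapfile pinning it. */
struct swapfile_pin_key {
    const void *ptr;   /* block group or device */
    const void *inode; /* swapfile inode */
};

/* qsort-style comparator: order by ptr, then by inode. A return of 0 means
 * this (object, swapfile) pair is already pinned. */
static int swapfile_pin_cmp(const struct swapfile_pin_key *a,
                            const struct swapfile_pin_key *b)
{
    uintptr_t pa = (uintptr_t)a->ptr, pb = (uintptr_t)b->ptr;
    uintptr_t ia = (uintptr_t)a->inode, ib = (uintptr_t)b->inode;

    if (pa != pb)
        return pa < pb ? -1 : 1;
    if (ia != ib)
        return ia < ib ? -1 : 1;
    return 0;
}
```

This is why a single block group or device can appear more than once in the tree: each active swapfile on it gets its own (ptr, inode) entry.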

Signed-off-by: Omar Sandoval 
---
 fs/btrfs/ctree.h   | 29 +++
 fs/btrfs/dev-replace.c |  8 +++
 fs/btrfs/disk-io.c |  4 
 fs/btrfs/ioctl.c   | 31 +---
 fs/btrfs/relocation.c  | 18 ++
 fs/btrfs/volumes.c | 53 ++
 6 files changed, 131 insertions(+), 12 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 2cddfe7806a4..08df61b8fc87 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -716,6 +716,28 @@ struct btrfs_fs_devices;
 struct btrfs_balance_control;
 struct btrfs_delayed_root;
 
+/*
+ * Block group or device which contains an active swapfile. Used for preventing
+ * unsafe operations while a swapfile is active.
+ *
+ * These are sorted on (ptr, inode) (note that a block group or device can
+ * contain more than one swapfile). We compare the pointer values because we
+ * don't actually care what the object is, we just need a quick check whether
+ * the object exists in the rbtree.
+ */
+struct btrfs_swapfile_pin {
+   struct rb_node node;
+   void *ptr;
+   struct inode *inode;
+   /*
+* If true, ptr points to a struct btrfs_block_group_cache. Otherwise,
+* ptr points to a struct btrfs_device.
+*/
+   bool is_block_group;
+};
+
+bool btrfs_pinned_by_swapfile(struct btrfs_fs_info *fs_info, void *ptr);
+
 #define BTRFS_FS_BARRIER   1
 #define BTRFS_FS_CLOSING_START 2
 #define BTRFS_FS_CLOSING_DONE  3
@@ -1121,6 +1143,10 @@ struct btrfs_fs_info {
u32 sectorsize;
u32 stripesize;
 
+   /* Block groups and devices containing active swapfiles. */
+   spinlock_t swapfile_pins_lock;
+   struct rb_root swapfile_pins;
+
 #ifdef CONFIG_BTRFS_FS_REF_VERIFY
spinlock_t ref_verify_lock;
struct rb_root block_tree;
@@ -1286,6 +1312,9 @@ struct btrfs_root {
spinlock_t qgroup_meta_rsv_lock;
u64 qgroup_meta_rsv_pertrans;
u64 qgroup_meta_rsv_prealloc;
+
+   /* Number of active swapfiles */
+   atomic_t nr_swapfiles;
 };
 
 struct btrfs_file_private {
diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c
index dec01970d8c5..781006b6fca3 100644
--- a/fs/btrfs/dev-replace.c
+++ b/fs/btrfs/dev-replace.c
@@ -414,6 +414,14 @@ int btrfs_dev_replace_start(struct btrfs_fs_info *fs_info,
if (ret)
return ret;
 
+   if (btrfs_pinned_by_swapfile(fs_info, src_device)) {
+   btrfs_warn_in_rcu(fs_info,
+ "cannot replace device %s (devid %llu) due to active swapfile",
+ btrfs_dev_name(src_device),
+ src_device->devid);
+   return -ETXTBSY;
+   }
+
ret = btrfs_init_dev_replace_tgtdev(fs_info, tgtdev_name,
	    src_device, &tgt_device);
if (ret)
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 05dc3c17cb62..2428a73067d2 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1188,6 +1188,7 @@ static void __setup_root(struct btrfs_root *root, struct 
btrfs_fs_info *fs_info,
	refcount_set(&root->refs, 1);
atomic_set(>will_be_snapshotted, 0);
atomic_set(>snapshot_force_cow, 0);
+   atomic_set(>nr_swapfiles, 0);
root->log_transid = 0;
root->log_transid_committed = -1;
root->last_log_commit = 0;
@@ -2782,6 +2783,9 @@ int open_ctree(struct super_block *sb,
fs_info->sectorsize = 4096;
fs_info->stripesize = 4096;
 
+   spin_lock_init(&fs_info->swapfile_pins_lock);
+   fs_i

[PATCH v9 0/6] Btrfs: implement swap file support

2018-09-27 Thread Omar Sandoval
From: Omar Sandoval 

Hi,

This series implements swap file support for Btrfs.

Changes from v8 [1]:

- Fixed a bug in btrfs_swap_activate() which would cause us to miss some
  file extents if they were merged into one extent map entry.
- Fixed build for !CONFIG_SWAP.
- Changed all error messages to KERN_WARN.
- Unindented long error messages.

I've Cc'd Jon and Al on patch 3 this time, so hopefully we can get an
ack for that one, too.

Thanks!

1: https://www.spinics.net/lists/linux-btrfs/msg82267.html

Omar Sandoval (6):
  mm: split SWP_FILE into SWP_ACTIVATED and SWP_FS
  mm: export add_swap_extent()
  vfs: update swap_{,de}activate documentation
  Btrfs: prevent ioctls from interfering with a swap file
  Btrfs: rename get_chunk_map() and make it non-static
  Btrfs: support swap files

 Documentation/filesystems/Locking |  17 +-
 Documentation/filesystems/vfs.txt |  12 +-
 fs/btrfs/ctree.h  |  29 +++
 fs/btrfs/dev-replace.c|   8 +
 fs/btrfs/disk-io.c|   4 +
 fs/btrfs/inode.c  | 338 ++
 fs/btrfs/ioctl.c  |  31 ++-
 fs/btrfs/relocation.c |  18 +-
 fs/btrfs/volumes.c|  82 ++--
 fs/btrfs/volumes.h|   2 +
 include/linux/swap.h  |  13 +-
 mm/page_io.c  |   6 +-
 mm/swapfile.c |  14 +-
 13 files changed, 523 insertions(+), 51 deletions(-)

-- 
2.19.0



[PATCH v9 2/6] mm: export add_swap_extent()

2018-09-27 Thread Omar Sandoval
From: Omar Sandoval 

Btrfs currently does not support swap files because swap's use of bmap
does not work with copy-on-write and multiple devices. See commit
35054394c4b3 ("Btrfs: stop providing a bmap operation to avoid swapfile
corruptions"). However, the swap code has a mechanism for the filesystem
to manually add swap extents using add_swap_extent() from the
->swap_activate() aop. iomap has done this since commit 67482129cdab
("iomap: add a swapfile activation function"). Btrfs will do the same in
a later patch, so export add_swap_extent().
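The mechanism this export enables can be sketched in userspace: a ->swap_activate() implementation walks the file's contiguous on-disk runs and reports each one, then returns the number of extents it added. The mock function and struct names below are illustrative stand-ins, not the kernel's add_swap_extent() signature.

```c
#include <assert.h>
#include <stddef.h>

/* One contiguous on-disk run of a swapfile, in page units. */
struct run {
    unsigned long start_page;  /* offset within the swapfile */
    unsigned long nr_pages;
    unsigned long start_block; /* on-disk location */
};

/* Stand-in for add_swap_extent(): record the run and report one extent
 * added (the real function may merge with the previous extent and
 * report zero). */
static int mock_add_swap_extent(unsigned long start_page,
                                unsigned long nr_pages,
                                unsigned long start_block,
                                unsigned long *total_pages)
{
    (void)start_page;
    (void)start_block;
    *total_pages += nr_pages;
    return 1;
}

/* Model of a ->swap_activate() loop: hand every run to the swap code and
 * return the extent count, as iomap (and later Btrfs) do. */
static int activate_sketch(const struct run *runs, size_t n,
                           unsigned long *total_pages)
{
    int nr_extents = 0;
    size_t i;

    *total_pages = 0;
    for (i = 0; i < n; i++)
        nr_extents += mock_add_swap_extent(runs[i].start_page,
                                           runs[i].nr_pages,
                                           runs[i].start_block,
                                           total_pages);
    return nr_extents;
}
```

Because the swap core then has the physical block ranges up front, swap I/O can bypass the filesystem entirely, which is what lets Btrfs avoid bmap.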

Acked-by: Johannes Weiner 
Signed-off-by: Omar Sandoval 
---
 mm/swapfile.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/mm/swapfile.c b/mm/swapfile.c
index d3f95833d12e..51cb30de17bc 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -2365,6 +2365,7 @@ add_swap_extent(struct swap_info_struct *sis, unsigned 
long start_page,
	list_add_tail(&new_se->list, &sis->first_swap_extent.list);
return 1;
 }
+EXPORT_SYMBOL_GPL(add_swap_extent);
 
 /*
  * A `swap extent' is a simple thing which maps a contiguous range of pages
-- 
2.19.0



[PATCH v9 1/6] mm: split SWP_FILE into SWP_ACTIVATED and SWP_FS

2018-09-27 Thread Omar Sandoval
From: Omar Sandoval 

The SWP_FILE flag serves two purposes: to make swap_{read,write}page()
go through the filesystem, and to make swapoff() call
->swap_deactivate(). For Btrfs, we want the latter but not the former,
so split this flag into two. This makes us always call
->swap_deactivate() if ->swap_activate() succeeded, not just if it
didn't add any swap extents itself.

This also resolves the issue of the very misleading name of SWP_FILE,
which is only used for swap files over NFS.
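The split can be summarized in a small sketch: SWP_ACTIVATED alone means ->swap_activate() succeeded (so swapoff must call ->swap_deactivate()), while SWP_FS additionally routes swap I/O through the filesystem. The flag values mirror the patch; the helper functions are illustrative, not kernel API.

```c
#include <assert.h>

enum {
    SWP_ACTIVATED = 1 << 7, /* set after swap_activate success */
    SWP_FS        = 1 << 8, /* swap file goes through fs */
};

/* NFS-style activation: swap I/O is proxied through the filesystem. */
static unsigned int activate_through_fs(unsigned int flags)
{
    return flags | SWP_ACTIVATED | SWP_FS;
}

/* Btrfs-style activation: extents were added directly via
 * add_swap_extent(), so I/O bypasses the fs, but swapoff must still
 * call ->swap_deactivate(). */
static unsigned int activate_direct(unsigned int flags)
{
    return flags | SWP_ACTIVATED;
}

static int needs_deactivate(unsigned int flags)
{
    return !!(flags & SWP_ACTIVATED);
}

static int io_through_fs(unsigned int flags)
{
    return !!(flags & SWP_FS);
}
```

With the single SWP_FILE flag both questions had the same answer, which is exactly what Btrfs needed to decouple.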

Reviewed-by: Nikolay Borisov 
Acked-by: Johannes Weiner 
Signed-off-by: Omar Sandoval 
---
 include/linux/swap.h | 13 +++--
 mm/page_io.c |  6 +++---
 mm/swapfile.c| 13 -
 3 files changed, 18 insertions(+), 14 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 8e2c11e692ba..0fda0aa743f0 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -167,13 +167,14 @@ enum {
SWP_SOLIDSTATE  = (1 << 4), /* blkdev seeks are cheap */
SWP_CONTINUED   = (1 << 5), /* swap_map has count continuation */
SWP_BLKDEV  = (1 << 6), /* its a block device */
-   SWP_FILE= (1 << 7), /* set after swap_activate success */
-   SWP_AREA_DISCARD = (1 << 8),/* single-time swap area discards */
-   SWP_PAGE_DISCARD = (1 << 9),/* freed swap page-cluster discards */
-   SWP_STABLE_WRITES = (1 << 10),  /* no overwrite PG_writeback pages */
-   SWP_SYNCHRONOUS_IO = (1 << 11), /* synchronous IO is efficient */
+   SWP_ACTIVATED   = (1 << 7), /* set after swap_activate success */
+   SWP_FS  = (1 << 8), /* swap file goes through fs */
+   SWP_AREA_DISCARD = (1 << 9),/* single-time swap area discards */
+   SWP_PAGE_DISCARD = (1 << 10),   /* freed swap page-cluster discards */
+   SWP_STABLE_WRITES = (1 << 11),  /* no overwrite PG_writeback pages */
+   SWP_SYNCHRONOUS_IO = (1 << 12), /* synchronous IO is efficient */
/* add others here before... */
-   SWP_SCANNING= (1 << 12),/* refcount in scan_swap_map */
+   SWP_SCANNING= (1 << 13),/* refcount in scan_swap_map */
 };
 
 #define SWAP_CLUSTER_MAX 32UL
diff --git a/mm/page_io.c b/mm/page_io.c
index aafd19ec1db4..e8653c368069 100644
--- a/mm/page_io.c
+++ b/mm/page_io.c
@@ -283,7 +283,7 @@ int __swap_writepage(struct page *page, struct 
writeback_control *wbc,
struct swap_info_struct *sis = page_swap_info(page);
 
VM_BUG_ON_PAGE(!PageSwapCache(page), page);
-   if (sis->flags & SWP_FILE) {
+   if (sis->flags & SWP_FS) {
struct kiocb kiocb;
struct file *swap_file = sis->swap_file;
struct address_space *mapping = swap_file->f_mapping;
@@ -365,7 +365,7 @@ int swap_readpage(struct page *page, bool synchronous)
goto out;
}
 
-   if (sis->flags & SWP_FILE) {
+   if (sis->flags & SWP_FS) {
struct file *swap_file = sis->swap_file;
struct address_space *mapping = swap_file->f_mapping;
 
@@ -423,7 +423,7 @@ int swap_set_page_dirty(struct page *page)
 {
struct swap_info_struct *sis = page_swap_info(page);
 
-   if (sis->flags & SWP_FILE) {
+   if (sis->flags & SWP_FS) {
struct address_space *mapping = sis->swap_file->f_mapping;
 
VM_BUG_ON_PAGE(!PageSwapCache(page), page);
diff --git a/mm/swapfile.c b/mm/swapfile.c
index d954b71c4f9c..d3f95833d12e 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -989,7 +989,7 @@ int get_swap_pages(int n_goal, swp_entry_t swp_entries[], 
int entry_size)
goto nextsi;
}
if (size == SWAPFILE_CLUSTER) {
-   if (!(si->flags & SWP_FILE))
+   if (!(si->flags & SWP_FS))
n_ret = swap_alloc_cluster(si, swp_entries);
} else
n_ret = scan_swap_map_slots(si, SWAP_HAS_CACHE,
@@ -2310,12 +2310,13 @@ static void destroy_swap_extents(struct 
swap_info_struct *sis)
kfree(se);
}
 
-   if (sis->flags & SWP_FILE) {
+   if (sis->flags & SWP_ACTIVATED) {
struct file *swap_file = sis->swap_file;
struct address_space *mapping = swap_file->f_mapping;
 
-   sis->flags &= ~SWP_FILE;
-   mapping->a_ops->swap_deactivate(swap_file);
+   sis->flags &= ~SWP_ACTIVATED;
+   if (mapping->a_ops->swap_deactivate)
+   mapping->a_ops->swap_deactivate(swap_file);
}
 }
 
@@ -2411,8 +2412,10 @@ static int setup_swap_extents(struct swap_info_struct 
*sis, sector_t *span

[PATCH v9 3/6] vfs: update swap_{,de}activate documentation

2018-09-27 Thread Omar Sandoval
From: Omar Sandoval 

The documentation for these functions is wrong in several ways:

- swap_activate() is called with the inode locked
- swap_activate() takes a swap_info_struct * and a sector_t *
- swap_activate() can also return a positive number of extents it added
  itself
- swap_deactivate() does not return anything

Cc: Jonathan Corbet 
Cc: Al Viro 
Reviewed-by: Nikolay Borisov 
Signed-off-by: Omar Sandoval 
---
Hi, Jon, Al, could I get an ack on this patch? Thanks!

 Documentation/filesystems/Locking | 17 +++--
 Documentation/filesystems/vfs.txt | 12 
 2 files changed, 15 insertions(+), 14 deletions(-)

diff --git a/Documentation/filesystems/Locking 
b/Documentation/filesystems/Locking
index efea228ccd8a..b970c8c2ee22 100644
--- a/Documentation/filesystems/Locking
+++ b/Documentation/filesystems/Locking
@@ -210,8 +210,9 @@ prototypes:
int (*launder_page)(struct page *);
int (*is_partially_uptodate)(struct page *, unsigned long, unsigned 
long);
int (*error_remove_page)(struct address_space *, struct page *);
-   int (*swap_activate)(struct file *);
-   int (*swap_deactivate)(struct file *);
+   int (*swap_activate)(struct swap_info_struct *, struct file *,
+sector_t *);
+   void (*swap_deactivate)(struct file *);
 
 locking rules:
All except set_page_dirty and freepage may block
@@ -235,8 +236,8 @@ putback_page:   yes
 launder_page:  yes
 is_partially_uptodate: yes
 error_remove_page: yes
-swap_activate: no
-swap_deactivate:   no
+swap_activate: yes
+swap_deactivate:   no
 
->write_begin(), ->write_end() and ->readpage() may be called from
 the request handler (/dev/loop).
@@ -333,14 +334,10 @@ cleaned, or an error value if not. Note that in order to 
prevent the page
 getting mapped back in and redirtied, it needs to be kept locked
 across the entire operation.
 
-   ->swap_activate will be called with a non-zero argument on
-files backing (non block device backed) swapfiles. A return value
-of zero indicates success, in which case this file can be used for
-backing swapspace. The swapspace operations will be proxied to the
-address space operations.
+   ->swap_activate is called from sys_swapon() with the inode locked.
 
->swap_deactivate() will be called in the sys_swapoff()
-path after ->swap_activate() returned success.
+path after ->swap_activate() returned success. The inode is not locked.
 
 --- file_lock_operations --
 prototypes:
diff --git a/Documentation/filesystems/vfs.txt 
b/Documentation/filesystems/vfs.txt
index a6c6a8af48a2..6e14db053eaa 100644
--- a/Documentation/filesystems/vfs.txt
+++ b/Documentation/filesystems/vfs.txt
@@ -652,8 +652,9 @@ struct address_space_operations {
unsigned long);
void (*is_dirty_writeback) (struct page *, bool *, bool *);
int (*error_remove_page) (struct mapping *mapping, struct page *page);
-   int (*swap_activate)(struct file *);
-   int (*swap_deactivate)(struct file *);
+   int (*swap_activate)(struct swap_info_struct *, struct file *,
+sector_t *);
+   void (*swap_deactivate)(struct file *);
 };
 
   writepage: called by the VM to write a dirty page to backing store.
@@ -830,8 +831,11 @@ struct address_space_operations {
 
   swap_activate: Called when swapon is used on a file to allocate
space if necessary and pin the block lookup information in
-   memory. A return value of zero indicates success,
-   in which case this file can be used to back swapspace.
+   memory. If this returns zero, the swap system will call the address
+   space operations ->readpage() and ->direct_IO(). Alternatively, this
+   may call add_swap_extent() and return the number of extents added, in
+   which case the swap system will use the provided blocks directly
+   instead of going through the filesystem.
 
   swap_deactivate: Called during swapoff on files where swap_activate
was successful.
-- 
2.19.0



[PATCH] Btrfs: get rid of btrfs_symlink_aops

2018-09-24 Thread Omar Sandoval
From: Omar Sandoval 

The only aops we define for symlinks are identical to the aops for
regular files. This has been the case since symlink support was added in
commit 2b8d99a723a3 ("Btrfs: symlinks and hard links"). As far as I can
tell, there wasn't a good reason to have separate aops then, and there
isn't now, so let's just do what most other filesystems do and reuse the
same structure.

Signed-off-by: Omar Sandoval 
---
Based on v4.19-rc5.

 fs/btrfs/inode.c | 12 ++--
 1 file changed, 2 insertions(+), 10 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 3ea5339603cf..590063b0b6dc 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -64,7 +64,6 @@ static const struct inode_operations 
btrfs_dir_ro_inode_operations;
 static const struct inode_operations btrfs_special_inode_operations;
 static const struct inode_operations btrfs_file_inode_operations;
 static const struct address_space_operations btrfs_aops;
-static const struct address_space_operations btrfs_symlink_aops;
 static const struct file_operations btrfs_dir_file_operations;
 static const struct extent_io_ops btrfs_extent_io_ops;
 
@@ -3738,7 +3737,7 @@ static int btrfs_read_locked_inode(struct inode *inode)
case S_IFLNK:
inode->i_op = _symlink_inode_operations;
inode_nohighmem(inode);
-   inode->i_mapping->a_ops = &btrfs_symlink_aops;
+   inode->i_mapping->a_ops = &btrfs_aops;
break;
default:
inode->i_op = _special_inode_operations;
@@ -10191,7 +10190,7 @@ static int btrfs_symlink(struct inode *dir, struct 
dentry *dentry,
 
inode->i_op = _symlink_inode_operations;
inode_nohighmem(inode);
-   inode->i_mapping->a_ops = &btrfs_symlink_aops;
+   inode->i_mapping->a_ops = &btrfs_aops;
inode_set_bytes(inode, name_len);
btrfs_i_size_write(BTRFS_I(inode), name_len);
err = btrfs_update_inode(trans, root, inode);
@@ -10567,13 +10566,6 @@ static const struct address_space_operations 
btrfs_aops = {
.error_remove_page = generic_error_remove_page,
 };
 
-static const struct address_space_operations btrfs_symlink_aops = {
-   .readpage   = btrfs_readpage,
-   .writepage  = btrfs_writepage,
-   .invalidatepage = btrfs_invalidatepage,
-   .releasepage= btrfs_releasepage,
-};
-
 static const struct inode_operations btrfs_file_inode_operations = {
.getattr= btrfs_getattr,
.setattr= btrfs_setattr,
-- 
2.19.0



Re: [PATCH 5/6] btrfs-progs: check: Add support for freespace tree fixing

2018-09-21 Thread Omar Sandoval
On Fri, Jun 15, 2018 at 02:06:01PM +0300, Nikolay Borisov wrote:
> Now that all the prerequisite code for proper support of free space
> tree repair is in, it's time to wire it in. This is achieved by first
> hooking the freespace tree to the __free_extent/alloc_reserved_tree_block
> functions. And then introducing a wrapper function to contains the
> existing check_space_cache and the newly introduced repair code.
> Finally, it's important to note that FST repair code first clears the
> existing FST in case of any problem found and rebuilds it from scratch.
> 
> Signed-off-by: Nikolay Borisov 
> ---
>  check/main.c  | 61 
> +--
>  extent-tree.c |  9 +
>  2 files changed, 47 insertions(+), 23 deletions(-)
> 
> diff --git a/check/main.c b/check/main.c
> index 3a5efaf615a9..44d734ff4254 100644
> --- a/check/main.c
> +++ b/check/main.c
> @@ -5321,19 +5321,6 @@ static int check_space_cache(struct btrfs_root *root)
>   int ret;
>   int error = 0;
>  
> - if (btrfs_super_cache_generation(root->fs_info->super_copy) != -1ULL &&
> - btrfs_super_generation(root->fs_info->super_copy) !=
> - btrfs_super_cache_generation(root->fs_info->super_copy)) {
> - printf("cache and super generation don't match, space cache "
> -"will be invalidated\n");
> - return 0;
> - }
> -
> - if (ctx.progress_enabled) {
> - ctx.tp = TASK_FREE_SPACE;
> - task_start(ctx.info);
> - }
> -
>   while (1) {
>   cache = btrfs_lookup_first_block_group(root->fs_info, start);
>   if (!cache)
> @@ -5383,11 +5370,11 @@ static int check_space_cache(struct btrfs_root *root)
>   }
>   }
>  
> - task_stop(ctx.info);
>  
>   return error ? -EINVAL : 0;
>  }
>  
> +

Stray newline.

>  /*
>   * Check data checksum for [@bytenr, @bytenr + @num_bytes).
>   *
> @@ -9338,7 +9325,6 @@ static int do_clear_free_space_cache(struct 
> btrfs_fs_info *fs_info,
>   ret = 1;
>   goto close_out;
>   }
> - printf("Clearing free space cache\n");
>   ret = clear_free_space_cache(fs_info);
>   if (ret) {
>   error("failed to clear free space cache");
> @@ -9365,6 +9351,41 @@ static int do_clear_free_space_cache(struct 
> btrfs_fs_info *fs_info,
>   return ret;
>  }
>  
> +static int validate_free_space_cache(struct btrfs_root *root)
> +{
> +
> + int ret;
> +
> + if (btrfs_super_cache_generation(root->fs_info->super_copy) != -1ULL &&
> + btrfs_super_generation(root->fs_info->super_copy) !=
> + btrfs_super_cache_generation(root->fs_info->super_copy)) {
> + printf("cache and super generation don't match, space cache "
> +"will be invalidated\n");
> + return 0;
> + }
> +
> + if (ctx.progress_enabled) {
> + ctx.tp = TASK_FREE_SPACE;
> + task_start(ctx.info);
> + }
> +
> + ret = check_space_cache(root);
> + if (ret && btrfs_fs_compat_ro(global_info, FREE_SPACE_TREE)
> + && repair) {
> + ret = do_clear_free_space_cache(global_info, 2);
> + if (ret)
> + goto out;
> +
> + ret = btrfs_create_free_space_tree(global_info);
> + if (ret)
> + error("couldn't repair freespace tree");
> + }
> +
> +out:
> + task_stop(ctx.info);
> + return ret ? -EINVAL : 0;
> +}
> +
>  const char * const cmd_check_usage[] = {
>   "btrfs check [options] ",
>   "Check structural integrity of a filesystem (unmounted).",
> @@ -9768,15 +9789,9 @@ int cmd_check(int argc, char **argv)
>   else
>   fprintf(stderr, "checking free space cache\n");
>   }
> - ret = check_space_cache(root);
> +
> + ret = validate_free_space_cache(root);
>   err |= !!ret;
> - if (ret) {
> - if (btrfs_fs_compat_ro(info, FREE_SPACE_TREE))
> - error("errors found in free space tree");
> - else
> - error("errors found in free space cache");
> - goto out;
> - }
>  
>   /*
>* We used to have to have these hole extents in between our real

This approach seems reasonable.

> diff --git a/extent-tree.c b/extent-tree.c
> index b9d51b388c9a..40117f81352e 100644
> --- a/extent-tree.c
> +++ b/extent-tree.c
> @@ -29,6 +29,7 @@
>  #include "crc32c.h"
>  #include "volumes.h"
>  #include "free-space-cache.h"
> +#include "free-space-tree.h"
>  #include "utils.h"
>  
>  #define PENDING_EXTENT_INSERT 0
> @@ -2292,6 +2293,11 @@ static int __free_extent(struct btrfs_trans_handle 
> *trans,
>   BUG_ON(ret);
>   }
>  
> + ret = add_to_free_space_tree(trans, bytenr, num_bytes);
> + if (ret) {
> + 

Re: [PATCH 4/6] btrfs-progs: Add freespace tree as compat_ro supported feature

2018-09-21 Thread Omar Sandoval
On Fri, Jun 15, 2018 at 02:06:00PM +0300, Nikolay Borisov wrote:
> The RO_FREE_SPACE_TREE(_VALID) flags are required in order to be able
> to open an FST filesystem in repair mode. Add them to
> BTRFS_FEATURE_COMPAT_RO_SUPP.
> 
> Signed-off-by: Nikolay Borisov 
> ---
>  ctree.h | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/ctree.h b/ctree.h
> index ade883fecbd6..ef05e8122982 100644
> --- a/ctree.h
> +++ b/ctree.h
> @@ -497,7 +497,9 @@ struct btrfs_super_block {
>   * added here until read-write support for the free space tree is 
> implemented in
>   * btrfs-progs.
>   */
> -#define BTRFS_FEATURE_COMPAT_RO_SUPP 0ULL
> +#define BTRFS_FEATURE_COMPAT_RO_SUPP \
> + (BTRFS_FEATURE_COMPAT_RO_FREE_SPACE_TREE |  \
> +  BTRFS_FEATURE_COMPAT_RO_FREE_SPACE_TREE_VALID)


Have you tested whether btrfs-progs commands that modify the filesystem
(e.g., btrfstune or btrfs fi label) work with this series? Because that
is a requirement for claiming that we support this bit (at which point
we can delete the comment above). Also, this needs to happen _after_ we
hook up the free space tree with the extent tree.

See here for some historical context:
https://www.mail-archive.com/linux-btrfs@vger.kernel.org/msg57738.html


Re: [PATCH 3/6] btrfs-progs: Pull free space tree related code from kernel

2018-09-21 Thread Omar Sandoval
On Fri, Jun 15, 2018 at 02:05:59PM +0300, Nikolay Borisov wrote:
> To help implement a free space tree checker in user space, some kernel
> functions are necessary, namely iterating/deleting/adding freespace
> items, some internal search functions. Functions to populate a block
> group based on the extent tree. The code is largely copy/paste from
> the kernel with locking eliminated (i.e free_space_lock). It supports
> reading/writing of both bitmap and extent based FST trees.
> 
> Signed-off-by: Nikolay Borisov 

Why doesn't this include the bitmap <-> extent conversions? If we end up
rebuilding the free space tree, we're never going to use the bitmap
format, which sucks if the free space is fragmented.


Re: [PATCH 1/6] btrfs-progs: Add support for freespace tree in btrfs_read_fs_root

2018-09-21 Thread Omar Sandoval
On Fri, Jun 15, 2018 at 02:05:57PM +0300, Nikolay Borisov wrote:
> For completeness sake add code to btrfs_read_fs_root so that it can
> handle the freespace tree.

Reviewed-by: Omar Sandoval 

> Signed-off-by: Nikolay Borisov 
> ---
>  disk-io.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/disk-io.c b/disk-io.c
> index 8da6e3ce5fc8..9ad826b83b3e 100644
> --- a/disk-io.c
> +++ b/disk-io.c
> @@ -664,6 +664,9 @@ struct btrfs_root *btrfs_read_fs_root(struct 
> btrfs_fs_info *fs_info,
>   if (location->objectid == BTRFS_QUOTA_TREE_OBJECTID)
>   return fs_info->quota_enabled ? fs_info->quota_root :
>   ERR_PTR(-ENOENT);
> + if (location->objectid == BTRFS_FREE_SPACE_TREE_OBJECTID)
> +return fs_info->free_space_root ? fs_info->free_space_root :
> +  ERR_PTR(-ENOENT);
>  
>   BUG_ON(location->objectid == BTRFS_TREE_RELOC_OBJECTID ||
>  location->offset != (u64)-1);
> -- 
> 2.7.4
> 


Re: [PATCH v8 0/6] Btrfs: implement swap file support

2018-09-21 Thread Omar Sandoval
On Fri, Sep 21, 2018 at 05:17:35PM +0200, David Sterba wrote:
> On Thu, Sep 20, 2018 at 10:41:24AM -0700, Omar Sandoval wrote:
> > On Thu, Sep 20, 2018 at 07:22:55PM +0200, David Sterba wrote:
> > > On Wed, Sep 19, 2018 at 10:02:11PM -0700, Omar Sandoval wrote:
> > > > From: Omar Sandoval 
> > > > Changes from v7 [1]:
> > > > 
> > > > - Expanded a few commit messages
> > > > - Added Johannes' acked-by on patches 1 and 2
> > > > - Rebased on v4.19-rc4
> > > 
> > > I've sent my comments, it's mostly about the usability or small
> > > enhancements. As you've got acks from MM people, I hope it would be ok
> > > if I add this series to for-next so we can give it some testing.
> > 
> > That'd be great. Feel free to grab it from my git tree
> > (https://github.com/osandov/linux/tree/btrfs-swap) if you want the
> > version with your comments addressed.
> 
> Updates looks good, branch added to the for-next snapshot and will be in
> upcoming for-next.

I got a kbuild error when building with CONFIG_SWAP=n, just pushed the
fix below on patch 6:

diff --git b/fs/btrfs/inode.c a/fs/btrfs/inode.c
index ffe266e612e3..6de98bb30c27 100644
--- b/fs/btrfs/inode.c
+++ a/fs/btrfs/inode.c
@@ -10489,6 +10489,7 @@ void btrfs_set_range_writeback(struct extent_io_tree 
*tree, u64 start, u64 end)
}
 }
 
+#ifdef CONFIG_SWAP
 /*
  * Add an entry indicating a block group or device which is pinned by a
  * swapfile. Returns 0 on success, 1 if there is already an entry for it, or a
@@ -10812,6 +10813,17 @@ static int btrfs_swap_activate(struct swap_info_struct 
*sis, struct file *file,
sis->highest_bit = bsi.nr_pages - 1;
return bsi.nr_extents;
 }
+#else
+static void btrfs_swap_deactivate(struct file *file)
+{
+}
+
+static int btrfs_swap_activate(struct swap_info_struct *sis, struct file *file,
+  sector_t *span)
+{
+   return -EOPNOTSUPP;
+}
+#endif
 
 static const struct inode_operations btrfs_dir_inode_operations = {
.getattr= btrfs_getattr,

> > > The MM patches would better go separately to 4.20 via the mm tree.  I
> > > did only build tests so 4.20 target is still feasible but given that
> > > it's rc4 it's a bit too close. There are some limitations posed by the
> > > swapfiles so I'd like to have a chance to do some actual tests myself
> > > and check the usability status.
> > 
> > 4.20 would be nice, but I could live with 4.21. I'll just be backporting
> > it to our internal kernel here anyways ;) Let me know how the tests go
> > and which way you want to go.
> 
> Backporting to your kernel is fine, your users will complain to you, but
> once it's in the mainline the complaints will go my way :)
> 
> As for the merge of the non-btrfs patches, I checked again and there are
> the VFS/documentation patches that haven't been CCed to the relevant
> people.  For that reason I'm not very comfortable taking them through
> my tree for the final merge. The MM part looks fine from that
> perspective.

There aren't any VFS changes, just the trivial documentation fixes.
fsdevel was Cc'd for the first four versions, but it's hard enough to
get Al to look at actual changes, let alone a documentation fix.


Re: [PATCH v8 0/6] Btrfs: implement swap file support

2018-09-20 Thread Omar Sandoval
On Thu, Sep 20, 2018 at 07:22:55PM +0200, David Sterba wrote:
> On Wed, Sep 19, 2018 at 10:02:11PM -0700, Omar Sandoval wrote:
> > From: Omar Sandoval 
> > Changes from v7 [1]:
> > 
> > - Expanded a few commit messages
> > - Added Johannes' acked-by on patches 1 and 2
> > - Rebased on v4.19-rc4
> 
> I've sent my comments, it's mostly about the usability or small
> enhancements. As you've got acks from MM people, I hope it would be ok
> if I add this series to for-next so we can give it some testing.

That'd be great. Feel free to grab it from my git tree
(https://github.com/osandov/linux/tree/btrfs-swap) if you want the
version with your comments addressed.

> The MM patches would better go separately to 4.20 via the mm tree.  I
> did only build tests so 4.20 target is still feasible but given that
> it's rc4 it's a bit too close. There are some limitations posed by the
> swapfiles so I'd like to have a chance to do some actual tests myself
> and check the usability status.

4.20 would be nice, but I could live with 4.21. I'll just be backporting
it to our internal kernel here anyways ;) Let me know how the tests go
and which way you want to go.

Thanks! It's nice to finally have the end in sight for this series; it's
almost 4 years old, although it's changed quite a bit since
https://lkml.org/lkml/2014/11/17/141.


Re: [PATCH v8 6/6] Btrfs: support swap files

2018-09-20 Thread Omar Sandoval
On Thu, Sep 20, 2018 at 07:15:41PM +0200, David Sterba wrote:
> On Wed, Sep 19, 2018 at 10:02:17PM -0700, Omar Sandoval wrote:
> > From: Omar Sandoval 
> > 
> > Btrfs has not allowed swap files since commit 35054394c4b3 ("Btrfs: stop
> > providing a bmap operation to avoid swapfile corruptions"). However, now
> > that the proper restrictions are in place, Btrfs can support swap files
> > through the swap file a_ops, similar to iomap in commit 67482129cdab
> > ("iomap: add a swapfile activation function").
> > 
> > For Btrfs, activation needs to make sure that the file can be used as a
> > swap file, which currently means that it must be fully allocated as
> > nocow with no compression on one device. It must also do the proper
> > tracking so that ioctls will not interfere with the swap file.
> > Deactivation clears this tracking.
> > 
> > Signed-off-by: Omar Sandoval 
> > ---
> >  fs/btrfs/inode.c | 317 +++
> >  1 file changed, 317 insertions(+)
> > 
> > diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> > index 3ea5339603cf..0586285b1d9f 100644
> > --- a/fs/btrfs/inode.c
> > +++ b/fs/btrfs/inode.c

[snip]

> > +static int btrfs_swap_activate(struct swap_info_struct *sis, struct file 
> > *file,
> > +  sector_t *span)
> > +{
> > +   struct inode *inode = file_inode(file);
> > +   struct btrfs_fs_info *fs_info = BTRFS_I(inode)->root->fs_info;
> > +   struct extent_io_tree *io_tree = &BTRFS_I(inode)->io_tree;
> > +   struct extent_state *cached_state = NULL;
> > +   struct extent_map *em = NULL;
> > +   struct btrfs_device *device = NULL;
> > +   struct btrfs_swap_info bsi = {
> > +   .lowest_ppage = (sector_t)-1ULL,
> > +   };
> > +   int ret = 0;
> > +   u64 isize = inode->i_size;
> > +   u64 start;
> > +
> > +   /*
> > +* If the swap file was just created, make sure delalloc is done. If the
> > +* file changes again after this, the user is doing something stupid and
> > +* we don't really care.
> > +*/
> > +   ret = btrfs_wait_ordered_range(inode, 0, (u64)-1);
> > +   if (ret)
> > +   return ret;
> > +
> > +   /*
> > +* The inode is locked, so these flags won't change after we check them.
> > +*/
> > +   if (BTRFS_I(inode)->flags & BTRFS_INODE_COMPRESS) {
> > +   btrfs_info(fs_info, "swapfile must not be compressed");
> > +   return -EINVAL;
> > +   }
> > +   if (!(BTRFS_I(inode)->flags & BTRFS_INODE_NODATACOW)) {
> > +   btrfs_info(fs_info, "swapfile must not be copy-on-write");
> > +   return -EINVAL;
> > +   }
> 
> I wonder if we should also explicitly check for the checkums flag, ie.
> that NODATASUM is present. Right now it's bound to NODATACOW, but as
> with other sanity checks, it does not hurt to have it here.
> 
> > +
> > +   /*
> > +* Balance or device remove/replace/resize can move stuff around from
> > +* under us. The EXCL_OP flag makes sure they aren't running/won't run
> > +* concurrently while we are mapping the swap extents, and
> > +* fs_info->swapfile_pins prevents them from running while the swap file
> > +* is active and moving the extents. Note that this also prevents a
> > +* concurrent device add which isn't actually necessary, but it's not
> > +* really worth the trouble to allow it.
> > +*/
> > +   if (test_and_set_bit(BTRFS_FS_EXCL_OP, &fs_info->flags))
> > +   return -EBUSY;
> 
> This could be also accompanied by a message, "why does not my swapfile
> activate?" -> "there's an exclusive operation running". I've checked if
> there are similar messages for the other exclusive ops. There are.

Sounds good. I addressed all of your comments and pushed to
https://github.com/osandov/linux/tree/btrfs-swap. The only thing I
didn't change was the btrfs_info about not being able to relocate an
active swapfile. I think it makes sense as btrfs_info since we already
log every block group we are relocating as info (see
describe_relocation()).


[PATCH v8 5/6] Btrfs: rename get_chunk_map() and make it non-static

2018-09-19 Thread Omar Sandoval
From: Omar Sandoval 

The Btrfs swap code is going to need it, so give it a btrfs_ prefix and
make it non-static.

Reviewed-by: Nikolay Borisov 
Signed-off-by: Omar Sandoval 
---
 fs/btrfs/volumes.c | 29 ++---
 fs/btrfs/volumes.h |  2 ++
 2 files changed, 20 insertions(+), 11 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index a2761395ed22..fe66b635c023 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -2714,8 +2714,15 @@ static int btrfs_del_sys_chunk(struct btrfs_fs_info 
*fs_info, u64 chunk_offset)
return ret;
 }
 
-static struct extent_map *get_chunk_map(struct btrfs_fs_info *fs_info,
-   u64 logical, u64 length)
+/**
+ * btrfs_get_chunk_map() - Find the mapping containing the given logical 
extent.
+ * @logical: Logical block offset in bytes.
+ * @length: Length of extent in bytes.
+ *
+ * Return: Chunk mapping or ERR_PTR.
+ */
+struct extent_map *btrfs_get_chunk_map(struct btrfs_fs_info *fs_info,
+  u64 logical, u64 length)
 {
struct extent_map_tree *em_tree;
struct extent_map *em;
@@ -2752,7 +2759,7 @@ int btrfs_remove_chunk(struct btrfs_trans_handle *trans, 
u64 chunk_offset)
int i, ret = 0;
struct btrfs_fs_devices *fs_devices = fs_info->fs_devices;
 
-   em = get_chunk_map(fs_info, chunk_offset, 1);
+   em = btrfs_get_chunk_map(fs_info, chunk_offset, 1);
if (IS_ERR(em)) {
/*
 * This is a logic error, but we don't want to just rely on the
@@ -4902,7 +4909,7 @@ int btrfs_finish_chunk_alloc(struct btrfs_trans_handle 
*trans,
int i = 0;
int ret = 0;
 
-   em = get_chunk_map(fs_info, chunk_offset, chunk_size);
+   em = btrfs_get_chunk_map(fs_info, chunk_offset, chunk_size);
if (IS_ERR(em))
return PTR_ERR(em);
 
@@ -5044,7 +5051,7 @@ int btrfs_chunk_readonly(struct btrfs_fs_info *fs_info, 
u64 chunk_offset)
int miss_ndevs = 0;
int i;
 
-   em = get_chunk_map(fs_info, chunk_offset, 1);
+   em = btrfs_get_chunk_map(fs_info, chunk_offset, 1);
if (IS_ERR(em))
return 1;
 
@@ -5104,7 +5111,7 @@ int btrfs_num_copies(struct btrfs_fs_info *fs_info, u64 
logical, u64 len)
struct map_lookup *map;
int ret;
 
-   em = get_chunk_map(fs_info, logical, len);
+   em = btrfs_get_chunk_map(fs_info, logical, len);
if (IS_ERR(em))
/*
 * We could return errors for these cases, but that could get
@@ -5150,7 +5157,7 @@ unsigned long btrfs_full_stripe_len(struct btrfs_fs_info 
*fs_info,
struct map_lookup *map;
unsigned long len = fs_info->sectorsize;
 
-   em = get_chunk_map(fs_info, logical, len);
+   em = btrfs_get_chunk_map(fs_info, logical, len);
 
if (!WARN_ON(IS_ERR(em))) {
map = em->map_lookup;
@@ -5167,7 +5174,7 @@ int btrfs_is_parity_mirror(struct btrfs_fs_info *fs_info, 
u64 logical, u64 len)
struct map_lookup *map;
int ret = 0;
 
-   em = get_chunk_map(fs_info, logical, len);
+   em = btrfs_get_chunk_map(fs_info, logical, len);
 
if(!WARN_ON(IS_ERR(em))) {
map = em->map_lookup;
@@ -5326,7 +5333,7 @@ static int __btrfs_map_block_for_discard(struct 
btrfs_fs_info *fs_info,
/* discard always return a bbio */
ASSERT(bbio_ret);
 
-   em = get_chunk_map(fs_info, logical, length);
+   em = btrfs_get_chunk_map(fs_info, logical, length);
if (IS_ERR(em))
return PTR_ERR(em);
 
@@ -5652,7 +5659,7 @@ static int __btrfs_map_block(struct btrfs_fs_info 
*fs_info,
return __btrfs_map_block_for_discard(fs_info, logical,
 *length, bbio_ret);
 
-   em = get_chunk_map(fs_info, logical, *length);
+   em = btrfs_get_chunk_map(fs_info, logical, *length);
if (IS_ERR(em))
return PTR_ERR(em);
 
@@ -5951,7 +5958,7 @@ int btrfs_rmap_block(struct btrfs_fs_info *fs_info, u64 
chunk_start,
u64 rmap_len;
int i, j, nr = 0;
 
-   em = get_chunk_map(fs_info, chunk_start, 1);
+   em = btrfs_get_chunk_map(fs_info, chunk_start, 1);
if (IS_ERR(em))
return -EIO;
 
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index 23e9285d88de..f4c190c2ab84 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -465,6 +465,8 @@ unsigned long btrfs_full_stripe_len(struct btrfs_fs_info 
*fs_info,
 int btrfs_finish_chunk_alloc(struct btrfs_trans_handle *trans,
 u64 chunk_offset, u64 chunk_size);
 int btrfs_remove_chunk(struct btrfs_trans_handle *trans, u64 chunk_offset);
+struct extent_map *btrfs_get_chunk_map(struct btrfs_fs_info *fs_info,
+  u64 logical, u64 length);
 
 static inline void btrfs_dev_stat_inc(struct btrfs

[PATCH v8 0/6] Btrfs: implement swap file support

2018-09-19 Thread Omar Sandoval
From: Omar Sandoval 

Hi,

This series implements swap file support for Btrfs.

Changes from v7 [1]:

- Expanded a few commit messages
- Added Johannes' acked-by on patches 1 and 2
- Rebased on v4.19-rc4

No functional changes.

Thanks!

1: https://www.spinics.net/lists/linux-btrfs/msg81933.html

Omar Sandoval (6):
  mm: split SWP_FILE into SWP_ACTIVATED and SWP_FS
  mm: export add_swap_extent()
  vfs: update swap_{,de}activate documentation
  Btrfs: prevent ioctls from interfering with a swap file
  Btrfs: rename get_chunk_map() and make it non-static
  Btrfs: support swap files

 Documentation/filesystems/Locking |  17 +-
 Documentation/filesystems/vfs.txt |  12 +-
 fs/btrfs/ctree.h  |  29 +++
 fs/btrfs/dev-replace.c|   8 +
 fs/btrfs/disk-io.c|   4 +
 fs/btrfs/inode.c  | 317 ++
 fs/btrfs/ioctl.c  |  31 ++-
 fs/btrfs/relocation.c |  18 +-
 fs/btrfs/volumes.c|  82 ++--
 fs/btrfs/volumes.h|   2 +
 include/linux/swap.h  |  13 +-
 mm/page_io.c  |   6 +-
 mm/swapfile.c |  14 +-
 13 files changed, 502 insertions(+), 51 deletions(-)

-- 
2.19.0



[PATCH v8 4/6] Btrfs: prevent ioctls from interfering with a swap file

2018-09-19 Thread Omar Sandoval
From: Omar Sandoval 

A later patch will implement swap file support for Btrfs, but before we
do that, we need to make sure that the various Btrfs ioctls cannot
change a swap file.

When a swap file is active, we must make sure that the extents of the
file are not moved and that they don't become shared. That means that
the following are not safe:

- chattr +c (enable compression)
- reflink
- dedupe
- snapshot
- defrag

Don't allow those to happen on an active swap file.

Additionally, balance, resize, device remove, and device replace are
also unsafe if they affect an active swapfile. Add a red-black tree of
block groups and devices which contain an active swapfile. Relocation
checks each block group against this tree and skips it or errors out for
balance or resize, respectively. Device remove and device replace check
the tree for the device they will operate on.

Note that we don't have to worry about chattr -C (disable nocow), which
we ignore for non-empty files, because an active swapfile must be
non-empty and can't be truncated. We also don't have to worry about
autodefrag because it's only done on COW files. Truncate and fallocate
are already taken care of by the generic code. Device add doesn't do
relocation so it's not an issue, either.

Signed-off-by: Omar Sandoval 
---
 fs/btrfs/ctree.h   | 29 +++
 fs/btrfs/dev-replace.c |  8 +++
 fs/btrfs/disk-io.c |  4 
 fs/btrfs/ioctl.c   | 31 +---
 fs/btrfs/relocation.c  | 18 ++
 fs/btrfs/volumes.c | 53 ++
 6 files changed, 131 insertions(+), 12 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 2cddfe7806a4..08df61b8fc87 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -716,6 +716,28 @@ struct btrfs_fs_devices;
 struct btrfs_balance_control;
 struct btrfs_delayed_root;
 
+/*
+ * Block group or device which contains an active swapfile. Used for preventing
+ * unsafe operations while a swapfile is active.
+ *
+ * These are sorted on (ptr, inode) (note that a block group or device can
+ * contain more than one swapfile). We compare the pointer values because we
+ * don't actually care what the object is, we just need a quick check whether
+ * the object exists in the rbtree.
+ */
+struct btrfs_swapfile_pin {
+   struct rb_node node;
+   void *ptr;
+   struct inode *inode;
+   /*
+* If true, ptr points to a struct btrfs_block_group_cache. Otherwise,
+* ptr points to a struct btrfs_device.
+*/
+   bool is_block_group;
+};
+
+bool btrfs_pinned_by_swapfile(struct btrfs_fs_info *fs_info, void *ptr);
+
 #define BTRFS_FS_BARRIER   1
 #define BTRFS_FS_CLOSING_START 2
 #define BTRFS_FS_CLOSING_DONE  3
@@ -1121,6 +1143,10 @@ struct btrfs_fs_info {
u32 sectorsize;
u32 stripesize;
 
+   /* Block groups and devices containing active swapfiles. */
+   spinlock_t swapfile_pins_lock;
+   struct rb_root swapfile_pins;
+
 #ifdef CONFIG_BTRFS_FS_REF_VERIFY
spinlock_t ref_verify_lock;
struct rb_root block_tree;
@@ -1286,6 +1312,9 @@ struct btrfs_root {
spinlock_t qgroup_meta_rsv_lock;
u64 qgroup_meta_rsv_pertrans;
u64 qgroup_meta_rsv_prealloc;
+
+   /* Number of active swapfiles */
+   atomic_t nr_swapfiles;
 };
 
 struct btrfs_file_private {
diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c
index dec01970d8c5..09d2cee2635b 100644
--- a/fs/btrfs/dev-replace.c
+++ b/fs/btrfs/dev-replace.c
@@ -414,6 +414,14 @@ int btrfs_dev_replace_start(struct btrfs_fs_info *fs_info,
if (ret)
return ret;
 
+   if (btrfs_pinned_by_swapfile(fs_info, src_device)) {
+   btrfs_info_in_rcu(fs_info,
+ "cannot replace device %s (devid %llu) due to 
active swapfile",
+ btrfs_dev_name(src_device),
+ src_device->devid);
+   return -ETXTBSY;
+   }
+
ret = btrfs_init_dev_replace_tgtdev(fs_info, tgtdev_name,
src_device, &tgt_device);
if (ret)
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 05dc3c17cb62..2428a73067d2 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1188,6 +1188,7 @@ static void __setup_root(struct btrfs_root *root, struct 
btrfs_fs_info *fs_info,
refcount_set(&root->refs, 1);
atomic_set(&root->will_be_snapshotted, 0);
atomic_set(&root->snapshot_force_cow, 0);
+   atomic_set(&root->nr_swapfiles, 0);
root->log_transid = 0;
root->log_transid_committed = -1;
root->last_log_commit = 0;
@@ -2782,6 +2783,9 @@ int open_ctree(struct super_block *sb,
fs_info->sectorsize = 4096;
fs_info->stripesize = 4096;
 
+   spin_lock_init(&fs_info->swapfile_pins_lock);
+   fs_i

[PATCH v8 6/6] Btrfs: support swap files

2018-09-19 Thread Omar Sandoval
From: Omar Sandoval 

Btrfs has not allowed swap files since commit 35054394c4b3 ("Btrfs: stop
providing a bmap operation to avoid swapfile corruptions"). However, now
that the proper restrictions are in place, Btrfs can support swap files
through the swap file a_ops, similar to iomap in commit 67482129cdab
("iomap: add a swapfile activation function").

For Btrfs, activation needs to make sure that the file can be used as a
swap file, which currently means that it must be fully allocated as
nocow with no compression on one device. It must also do the proper
tracking so that ioctls will not interfere with the swap file.
Deactivation clears this tracking.

Signed-off-by: Omar Sandoval 
---
 fs/btrfs/inode.c | 317 +++
 1 file changed, 317 insertions(+)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 3ea5339603cf..0586285b1d9f 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -27,6 +27,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include "ctree.h"
 #include "disk-io.h"
@@ -10488,6 +10489,320 @@ void btrfs_set_range_writeback(struct extent_io_tree 
*tree, u64 start, u64 end)
}
 }
 
+/*
+ * Add an entry indicating a block group or device which is pinned by a
+ * swapfile. Returns 0 on success, 1 if there is already an entry for it, or a
+ * negative errno on failure.
+ */
+static int btrfs_add_swapfile_pin(struct inode *inode, void *ptr,
+ bool is_block_group)
+{
+   struct btrfs_fs_info *fs_info = BTRFS_I(inode)->root->fs_info;
+   struct btrfs_swapfile_pin *sp, *entry;
+   struct rb_node **p;
+   struct rb_node *parent = NULL;
+
+   sp = kmalloc(sizeof(*sp), GFP_NOFS);
+   if (!sp)
+   return -ENOMEM;
+   sp->ptr = ptr;
+   sp->inode = inode;
+   sp->is_block_group = is_block_group;
+
+   spin_lock(&fs_info->swapfile_pins_lock);
+   p = &fs_info->swapfile_pins.rb_node;
+   while (*p) {
+   parent = *p;
+   entry = rb_entry(parent, struct btrfs_swapfile_pin, node);
+   if (sp->ptr < entry->ptr ||
+   (sp->ptr == entry->ptr && sp->inode < entry->inode)) {
+   p = &(*p)->rb_left;
+   } else if (sp->ptr > entry->ptr ||
+  (sp->ptr == entry->ptr && sp->inode > entry->inode)) 
{
+   p = &(*p)->rb_right;
+   } else {
+   spin_unlock(&fs_info->swapfile_pins_lock);
+   kfree(sp);
+   return 1;
+   }
+   }
+   rb_link_node(&sp->node, parent, p);
+   rb_insert_color(&sp->node, &fs_info->swapfile_pins);
+   spin_unlock(&fs_info->swapfile_pins_lock);
+   return 0;
+}
+
+/* Free all of the entries pinned by this swapfile. */
+static void btrfs_free_swapfile_pins(struct inode *inode)
+{
+   struct btrfs_fs_info *fs_info = BTRFS_I(inode)->root->fs_info;
+   struct btrfs_swapfile_pin *sp;
+   struct rb_node *node, *next;
+
+   spin_lock(&fs_info->swapfile_pins_lock);
+   node = rb_first(&fs_info->swapfile_pins);
+   while (node) {
+   next = rb_next(node);
+   sp = rb_entry(node, struct btrfs_swapfile_pin, node);
+   if (sp->inode == inode) {
+   rb_erase(&sp->node, &fs_info->swapfile_pins);
+   if (sp->is_block_group)
+   btrfs_put_block_group(sp->ptr);
+   kfree(sp);
+   }
+   node = next;
+   }
+   spin_unlock(&fs_info->swapfile_pins_lock);
+}
+
+struct btrfs_swap_info {
+   u64 start;
+   u64 block_start;
+   u64 block_len;
+   u64 lowest_ppage;
+   u64 highest_ppage;
+   unsigned long nr_pages;
+   int nr_extents;
+};
+
+static int btrfs_add_swap_extent(struct swap_info_struct *sis,
+struct btrfs_swap_info *bsi)
+{
+   unsigned long nr_pages;
+   u64 first_ppage, first_ppage_reported, next_ppage;
+   int ret;
+
+   first_ppage = ALIGN(bsi->block_start, PAGE_SIZE) >> PAGE_SHIFT;
+   next_ppage = ALIGN_DOWN(bsi->block_start + bsi->block_len,
+   PAGE_SIZE) >> PAGE_SHIFT;
+
+   if (first_ppage >= next_ppage)
+   return 0;
+   nr_pages = next_ppage - first_ppage;
+
+   first_ppage_reported = first_ppage;
+   if (bsi->start == 0)
+   first_ppage_reported++;
+   if (bsi->lowest_ppage > first_ppage_reported)
+   bsi->lowest_ppage = first_ppage_reported;
+   if (bsi->highest_ppage < (next_ppage - 1))
+   bsi->highest_ppage = next_ppage - 1;
+
+   ret = add_swap_extent(sis, b

[PATCH v8 1/6] mm: split SWP_FILE into SWP_ACTIVATED and SWP_FS

2018-09-19 Thread Omar Sandoval
From: Omar Sandoval 

The SWP_FILE flag serves two purposes: to make swap_{read,write}page()
go through the filesystem, and to make swapoff() call
->swap_deactivate(). For Btrfs, we want the latter but not the former,
so split this flag into two. This makes us always call
->swap_deactivate() if ->swap_activate() succeeded, not just if it
didn't add any swap extents itself.

This also resolves the issue of the very misleading name of SWP_FILE,
which is only used for swap files over NFS.

Reviewed-by: Nikolay Borisov 
Acked-by: Johannes Weiner 
Signed-off-by: Omar Sandoval 
---
 include/linux/swap.h | 13 +++--
 mm/page_io.c |  6 +++---
 mm/swapfile.c| 13 -
 3 files changed, 18 insertions(+), 14 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 8e2c11e692ba..0fda0aa743f0 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -167,13 +167,14 @@ enum {
SWP_SOLIDSTATE  = (1 << 4), /* blkdev seeks are cheap */
SWP_CONTINUED   = (1 << 5), /* swap_map has count continuation */
SWP_BLKDEV  = (1 << 6), /* its a block device */
-   SWP_FILE= (1 << 7), /* set after swap_activate success */
-   SWP_AREA_DISCARD = (1 << 8),/* single-time swap area discards */
-   SWP_PAGE_DISCARD = (1 << 9),/* freed swap page-cluster discards */
-   SWP_STABLE_WRITES = (1 << 10),  /* no overwrite PG_writeback pages */
-   SWP_SYNCHRONOUS_IO = (1 << 11), /* synchronous IO is efficient */
+   SWP_ACTIVATED   = (1 << 7), /* set after swap_activate success */
+   SWP_FS  = (1 << 8), /* swap file goes through fs */
+   SWP_AREA_DISCARD = (1 << 9),/* single-time swap area discards */
+   SWP_PAGE_DISCARD = (1 << 10),   /* freed swap page-cluster discards */
+   SWP_STABLE_WRITES = (1 << 11),  /* no overwrite PG_writeback pages */
+   SWP_SYNCHRONOUS_IO = (1 << 12), /* synchronous IO is efficient */
/* add others here before... */
-   SWP_SCANNING= (1 << 12),/* refcount in scan_swap_map */
+   SWP_SCANNING= (1 << 13),/* refcount in scan_swap_map */
 };
 
 #define SWAP_CLUSTER_MAX 32UL
diff --git a/mm/page_io.c b/mm/page_io.c
index aafd19ec1db4..e8653c368069 100644
--- a/mm/page_io.c
+++ b/mm/page_io.c
@@ -283,7 +283,7 @@ int __swap_writepage(struct page *page, struct 
writeback_control *wbc,
struct swap_info_struct *sis = page_swap_info(page);
 
VM_BUG_ON_PAGE(!PageSwapCache(page), page);
-   if (sis->flags & SWP_FILE) {
+   if (sis->flags & SWP_FS) {
struct kiocb kiocb;
struct file *swap_file = sis->swap_file;
struct address_space *mapping = swap_file->f_mapping;
@@ -365,7 +365,7 @@ int swap_readpage(struct page *page, bool synchronous)
goto out;
}
 
-   if (sis->flags & SWP_FILE) {
+   if (sis->flags & SWP_FS) {
struct file *swap_file = sis->swap_file;
struct address_space *mapping = swap_file->f_mapping;
 
@@ -423,7 +423,7 @@ int swap_set_page_dirty(struct page *page)
 {
struct swap_info_struct *sis = page_swap_info(page);
 
-   if (sis->flags & SWP_FILE) {
+   if (sis->flags & SWP_FS) {
struct address_space *mapping = sis->swap_file->f_mapping;
 
VM_BUG_ON_PAGE(!PageSwapCache(page), page);
diff --git a/mm/swapfile.c b/mm/swapfile.c
index d954b71c4f9c..d3f95833d12e 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -989,7 +989,7 @@ int get_swap_pages(int n_goal, swp_entry_t swp_entries[], 
int entry_size)
goto nextsi;
}
if (size == SWAPFILE_CLUSTER) {
-   if (!(si->flags & SWP_FILE))
+   if (!(si->flags & SWP_FS))
n_ret = swap_alloc_cluster(si, swp_entries);
} else
n_ret = scan_swap_map_slots(si, SWAP_HAS_CACHE,
@@ -2310,12 +2310,13 @@ static void destroy_swap_extents(struct 
swap_info_struct *sis)
kfree(se);
}
 
-   if (sis->flags & SWP_FILE) {
+   if (sis->flags & SWP_ACTIVATED) {
struct file *swap_file = sis->swap_file;
struct address_space *mapping = swap_file->f_mapping;
 
-   sis->flags &= ~SWP_FILE;
-   mapping->a_ops->swap_deactivate(swap_file);
+   sis->flags &= ~SWP_ACTIVATED;
+   if (mapping->a_ops->swap_deactivate)
+   mapping->a_ops->swap_deactivate(swap_file);
}
 }
 
@@ -2411,8 +2412,10 @@ static int setup_swap_extents(struct swap_info_struct 
*sis, sector_t *span

[PATCH v8 2/6] mm: export add_swap_extent()

2018-09-19 Thread Omar Sandoval
From: Omar Sandoval 

Btrfs currently does not support swap files because swap's use of bmap
does not work with copy-on-write and multiple devices. See commit
35054394c4b3 ("Btrfs: stop providing a bmap operation to avoid swapfile
corruptions"). However, the swap code has a mechanism for the filesystem
to manually add swap extents using add_swap_extent() from the
->swap_activate() aop. iomap has done this since commit 67482129cdab
("iomap: add a swapfile activation function"). Btrfs will do the same in
a later patch, so export add_swap_extent().

Acked-by: Johannes Weiner 
Signed-off-by: Omar Sandoval 
---
 mm/swapfile.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/mm/swapfile.c b/mm/swapfile.c
index d3f95833d12e..51cb30de17bc 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -2365,6 +2365,7 @@ add_swap_extent(struct swap_info_struct *sis, unsigned 
long start_page,
list_add_tail(&new_se->list, &sis->first_swap_extent.list);
return 1;
 }
+EXPORT_SYMBOL_GPL(add_swap_extent);
 
 /*
  * A `swap extent' is a simple thing which maps a contiguous range of pages
-- 
2.19.0



[PATCH v8 3/6] vfs: update swap_{,de}activate documentation

2018-09-19 Thread Omar Sandoval
From: Omar Sandoval 

The documentation for these functions is wrong in several ways:

- swap_activate() is called with the inode locked
- swap_activate() takes a swap_info_struct * and a sector_t *
- swap_activate() can also return a positive number of extents it added
  itself
- swap_deactivate() does not return anything

Reviewed-by: Nikolay Borisov 
Signed-off-by: Omar Sandoval 
---
 Documentation/filesystems/Locking | 17 +++--
 Documentation/filesystems/vfs.txt | 12 
 2 files changed, 15 insertions(+), 14 deletions(-)

diff --git a/Documentation/filesystems/Locking 
b/Documentation/filesystems/Locking
index efea228ccd8a..b970c8c2ee22 100644
--- a/Documentation/filesystems/Locking
+++ b/Documentation/filesystems/Locking
@@ -210,8 +210,9 @@ prototypes:
int (*launder_page)(struct page *);
int (*is_partially_uptodate)(struct page *, unsigned long, unsigned 
long);
int (*error_remove_page)(struct address_space *, struct page *);
-   int (*swap_activate)(struct file *);
-   int (*swap_deactivate)(struct file *);
+   int (*swap_activate)(struct swap_info_struct *, struct file *,
+sector_t *);
+   void (*swap_deactivate)(struct file *);
 
 locking rules:
All except set_page_dirty and freepage may block
@@ -235,8 +236,8 @@ putback_page:   yes
 launder_page:  yes
 is_partially_uptodate: yes
 error_remove_page: yes
-swap_activate: no
-swap_deactivate:   no
+swap_activate: yes
+swap_deactivate:   no
 
->write_begin(), ->write_end() and ->readpage() may be called from
 the request handler (/dev/loop).
@@ -333,14 +334,10 @@ cleaned, or an error value if not. Note that in order to 
prevent the page
 getting mapped back in and redirtied, it needs to be kept locked
 across the entire operation.
 
-   ->swap_activate will be called with a non-zero argument on
-files backing (non block device backed) swapfiles. A return value
-of zero indicates success, in which case this file can be used for
-backing swapspace. The swapspace operations will be proxied to the
-address space operations.
+   ->swap_activate is called from sys_swapon() with the inode locked.
 
->swap_deactivate() will be called in the sys_swapoff()
-path after ->swap_activate() returned success.
+path after ->swap_activate() returned success. The inode is not locked.
 
 --- file_lock_operations --
 prototypes:
diff --git a/Documentation/filesystems/vfs.txt 
b/Documentation/filesystems/vfs.txt
index a6c6a8af48a2..6e14db053eaa 100644
--- a/Documentation/filesystems/vfs.txt
+++ b/Documentation/filesystems/vfs.txt
@@ -652,8 +652,9 @@ struct address_space_operations {
unsigned long);
void (*is_dirty_writeback) (struct page *, bool *, bool *);
int (*error_remove_page) (struct mapping *mapping, struct page *page);
-   int (*swap_activate)(struct file *);
-   int (*swap_deactivate)(struct file *);
+   int (*swap_activate)(struct swap_info_struct *, struct file *,
+sector_t *);
+   void (*swap_deactivate)(struct file *);
 };
 
   writepage: called by the VM to write a dirty page to backing store.
@@ -830,8 +831,11 @@ struct address_space_operations {
 
   swap_activate: Called when swapon is used on a file to allocate
space if necessary and pin the block lookup information in
-   memory. A return value of zero indicates success,
-   in which case this file can be used to back swapspace.
+   memory. If this returns zero, the swap system will call the address
+   space operations ->readpage() and ->direct_IO(). Alternatively, this
+   may call add_swap_extent() and return the number of extents added, in
+   which case the swap system will use the provided blocks directly
+   instead of going through the filesystem.
 
   swap_deactivate: Called during swapoff on files where swap_activate
was successful.
-- 
2.19.0



Re: [PATCH v7 2/6] mm: export add_swap_extent()

2018-09-19 Thread Omar Sandoval
On Wed, Sep 19, 2018 at 02:09:09PM -0400, Johannes Weiner wrote:
> On Tue, Sep 11, 2018 at 03:34:45PM -0700, Omar Sandoval wrote:
> > From: Omar Sandoval 
> > 
> > Btrfs will need this for swap file support.
> > 
> > Signed-off-by: Omar Sandoval 
> 
> That looks reasonable. After reading the last patch, it's somewhat
> understandable why you cannot simply implement ->bmap and use the
> generic activation code. But it would be good to explain the reason(s)
> for why you can't here briefly to justify this patch.

I'll rewrite it to:

Btrfs currently does not support swap files because swap's use of bmap
does not work with copy-on-write and multiple devices. See 35054394c4b3
("Btrfs: stop providing a bmap operation to avoid swapfile
corruptions"). However, the swap code has a mechanism for the filesystem
to manually add swap extents using add_swap_extent() from the
->swap_activate() aop. iomap has done this since 67482129cdab ("iomap:
add a swapfile activation function"). Btrfs will do the same in a later
patch, so export add_swap_extent().


Re: [PATCH v7 1/6] mm: split SWP_FILE into SWP_ACTIVATED and SWP_FS

2018-09-19 Thread Omar Sandoval
On Wed, Sep 19, 2018 at 02:02:32PM -0400, Johannes Weiner wrote:
> On Tue, Sep 11, 2018 at 03:34:44PM -0700, Omar Sandoval wrote:
> > @@ -2411,8 +2412,10 @@ static int setup_swap_extents(struct 
> > swap_info_struct *sis, sector_t *span)
> >  
> > if (mapping->a_ops->swap_activate) {
> > ret = mapping->a_ops->swap_activate(sis, swap_file, span);
> > +   if (ret >= 0)
> > +   sis->flags |= SWP_ACTIVATED;
> > if (!ret) {
> > -   sis->flags |= SWP_FILE;
> > +   sis->flags |= SWP_FS;
> > ret = add_swap_extent(sis, 0, sis->max, 0);
> 
> Won't this single, linear extent be in conflict with the discontiguous
> extents you set up in your swap_activate callback in the last patch?

That's only in the case that ->swap_activate() returned 0, which only
nfs_swap_activate() will do. btrfs_swap_activate() and
iomap_swapfile_activate() both return the number of extents they set up.


Re: [PATCH v7 0/6] Btrfs: implement swap file support

2018-09-19 Thread Omar Sandoval
On Tue, Sep 11, 2018 at 03:34:43PM -0700, Omar Sandoval wrote:
> From: Omar Sandoval 
> 
> Hi,
> 
> This series implements swap file support for Btrfs.
> 
> Changes from v6 [1]:
> 
> - Moved btrfs_get_chunk_map() comment to function body
> - Added more comments about pinned block group/device rbtree
> - Fixed bug in patch 4 which broke resize
> 
> Based on v4.19-rc3.
> 
> Thanks!
> 
> 1: https://www.spinics.net/lists/linux-btrfs/msg81732.html
> 
> Omar Sandoval (6):
>   mm: split SWP_FILE into SWP_ACTIVATED and SWP_FS
>   mm: export add_swap_extent()
>   vfs: update swap_{,de}activate documentation
>   Btrfs: prevent ioctls from interfering with a swap file
>   Btrfs: rename get_chunk_map() and make it non-static
>   Btrfs: support swap files
> 
>  Documentation/filesystems/Locking |  17 +-
>  Documentation/filesystems/vfs.txt |  12 +-
>  fs/btrfs/ctree.h  |  29 +++
>  fs/btrfs/dev-replace.c|   8 +
>  fs/btrfs/disk-io.c|   4 +
>  fs/btrfs/inode.c  | 317 ++
>  fs/btrfs/ioctl.c  |  31 ++-
>  fs/btrfs/relocation.c |  18 +-
>  fs/btrfs/volumes.c|  82 ++--
>  fs/btrfs/volumes.h|   2 +
>  include/linux/swap.h  |  13 +-
>  mm/page_io.c  |   6 +-
>  mm/swapfile.c |  14 +-
>  13 files changed, 502 insertions(+), 51 deletions(-)

Ping, any other comments on this version?


Re: [PATCH 32/36] btrfs: clear delayed_refs_rsv for dirty bg cleanup

2018-09-18 Thread Omar Sandoval
On Tue, Sep 11, 2018 at 01:58:03PM -0400, Josef Bacik wrote:
> We keep track of dirty bgs as a reservation in the delayed_refs_rsv, so
> when we abort and we clean up those dirty bgs we need to drop their
> reservation so we don't have accounting issues and lots of scary
> messages on umount.

Shouldn't this just be part of patch 6?

> Signed-off-by: Josef Bacik 
> ---
>  fs/btrfs/disk-io.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index caaca8154a1a..54fbdc944a3f 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -4412,6 +4412,7 @@ void btrfs_cleanup_dirty_bgs(struct btrfs_transaction 
> *cur_trans,
>  
>   spin_unlock(&cur_trans->dirty_bgs_lock);
>   btrfs_put_block_group(cache);
> + btrfs_delayed_refs_rsv_release(fs_info, 1);
>   spin_lock(&cur_trans->dirty_bgs_lock);
>   }
>   spin_unlock(&cur_trans->dirty_bgs_lock);
> -- 
> 2.14.3
> 


Re: [PATCH 33/36] btrfs: only free reserved extent if we didn't insert it

2018-09-18 Thread Omar Sandoval
On Tue, Sep 11, 2018 at 01:58:04PM -0400, Josef Bacik wrote:
> When we insert the file extent once the ordered extent completes we free
> the reserved extent reservation as it'll have been migrated to the
> bytes_used counter.  However if we error out after this step we'll still
> clear the reserved extent reservation, resulting in a negative
> accounting of the reserved bytes for the block group and space info.
> Fix this by only doing the free if we didn't successfully insert a file
> extent for this extent.

Reviewed-by: Omar Sandoval 

> Signed-off-by: Josef Bacik 
> ---
>  fs/btrfs/inode.c | 10 +-
>  1 file changed, 9 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index 60bcad901857..fd6ade4680b5 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -2992,6 +2992,7 @@ static int btrfs_finish_ordered_io(struct 
> btrfs_ordered_extent *ordered_extent)
>   bool truncated = false;
>   bool range_locked = false;
>   bool clear_new_delalloc_bytes = false;
> + bool clear_reserved_extent = true;
>  
>   if (!test_bit(BTRFS_ORDERED_NOCOW, &ordered_extent->flags) &&
>   !test_bit(BTRFS_ORDERED_PREALLOC, &ordered_extent->flags) &&
> @@ -3095,10 +3096,12 @@ static int btrfs_finish_ordered_io(struct 
> btrfs_ordered_extent *ordered_extent)
>   logical_len, logical_len,
>   compress_type, 0, 0,
>   BTRFS_FILE_EXTENT_REG);
> - if (!ret)
> + if (!ret) {
> + clear_reserved_extent = false;
>   btrfs_release_delalloc_bytes(fs_info,
>ordered_extent->start,
>ordered_extent->disk_len);
> + }
>   }
>   unpin_extent_cache(&BTRFS_I(inode)->extent_tree,
>  ordered_extent->file_offset, ordered_extent->len,
> @@ -3159,8 +3162,13 @@ static int btrfs_finish_ordered_io(struct 
> btrfs_ordered_extent *ordered_extent)
>* wrong we need to return the space for this ordered extent
>* back to the allocator.  We only free the extent in the
>* truncated case if we didn't write out the extent at all.
> +  *
> +  * If we made it past insert_reserved_file_extent before we
> +  * errored out then we don't need to do this as the accounting
> +  * has already been done.
>*/
>   if ((ret || !logical_len) &&
> + clear_reserved_extent &&
>   !test_bit(BTRFS_ORDERED_NOCOW, &ordered_extent->flags) &&
>   !test_bit(BTRFS_ORDERED_PREALLOC, &ordered_extent->flags))
>   btrfs_free_reserved_extent(fs_info,
> -- 
> 2.14.3
> 


Re: [PATCH 17/36] btrfs: loop in inode_rsv_refill

2018-09-18 Thread Omar Sandoval
On Tue, Sep 11, 2018 at 01:57:48PM -0400, Josef Bacik wrote:
> With severe fragmentation we can end up with our inode rsv size being
> huge during writeout, which would cause us to need to make very large
> metadata reservations.  However we may not actually need that much once
> writeout is complete.  So instead try to make our reservation, and if we
> couldn't make it re-calculate our new reservation size and try again.
> If our reservation size doesn't change between tries then we know we are
> actually out of space and can error out.
> 
> Signed-off-by: Josef Bacik 
> ---
>  fs/btrfs/extent-tree.c | 19 +--
>  1 file changed, 17 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index 57567d013447..e43834380ce6 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -5790,10 +5790,11 @@ static int btrfs_inode_rsv_refill(struct btrfs_inode 
> *inode,
>  {
>   struct btrfs_root *root = inode->root;
>   struct btrfs_block_rsv *block_rsv = &inode->block_rsv;
> - u64 num_bytes = 0;
> + u64 num_bytes = 0, last = 0;
>   u64 qgroup_num_bytes = 0;
>   int ret = -ENOSPC;
>  
> +again:
>   spin_lock(&block_rsv->lock);
>   if (block_rsv->reserved < block_rsv->size)
>   num_bytes = block_rsv->size - block_rsv->reserved;
> @@ -5818,8 +5819,22 @@ static int btrfs_inode_rsv_refill(struct btrfs_inode 
> *inode,
>   spin_lock(&block_rsv->lock);
>   block_rsv->qgroup_rsv_reserved += qgroup_num_bytes;
>   spin_unlock(&block_rsv->lock);
> - } else
> + } else {
>   btrfs_qgroup_free_meta_prealloc(root, qgroup_num_bytes);
> +
> + /*
> +  * If we are fragmented we can end up with a lot of outstanding
> +  * extents which will make our size be much larger than our
> +  * reserved amount.  If we happen to try to do a reservation
> +  * here that may result in us trying to do a pretty hefty
> +  * reservation, which we may not need once delalloc flushing
> +  * happens.  If this is the case try and do the reserve again.
> +  */
> + if (flush == BTRFS_RESERVE_FLUSH_ALL && last != num_bytes) {

Is there any point in retrying the reservation if num_bytes didn't
change? As this is written, we will:

1. Calculate num_bytes
2. Try reservation, say it fails
3. Recalculate num_bytes, say it doesn't change
4. Retry the reservation anyways, and it fails again

Maybe we should check if it changed before we retry the reservation? So
then we'd have

1. Calculate num_bytes
2. Try reservation, fails
3. Recalculate num_bytes, it doesn't change, bail out

Also, is it possible that num_bytes > last because of other operations
happening at the same time, and should we still retry in that case?
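The suggested ordering can be sketched in userspace C. This is a hedged model of the retry policy, not the kernel function; `calc_needed()` and `try_reserve()` are hypothetical stand-ins for recomputing `block_rsv->size - block_rsv->reserved` and for the actual reservation attempt:

```c
#include <assert.h>

static unsigned long long needed[] = { 100, 40, 40 };	/* shrinks, then stalls */
static int step;
static int attempts;

static unsigned long long calc_needed(void) { return needed[step]; }

static int try_reserve(unsigned long long bytes)
{
	(void)bytes;
	attempts++;
	step++;		/* flushing ran; the needed size may have changed */
	return -1;	/* always fail, to exercise the retry path */
}

/* Recalculate first, and bail *before* retrying with an unchanged size. */
static int inode_rsv_refill(void)
{
	unsigned long long last = 0, num_bytes;

	for (;;) {
		num_bytes = calc_needed();
		if (num_bytes == last)
			return -28;	/* -ENOSPC: nothing freed up */
		if (try_reserve(num_bytes) == 0)
			return 0;
		last = num_bytes;
	}
}
```

With the check moved before the attempt, only two reservations are tried (100 then 40); the third attempt, which the patch as posted would issue with the same `num_bytes`, is skipped.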

> + last = num_bytes;
> + goto again;
> + }
> + }
>   return ret;
>  }
>  
> -- 
> 2.14.3
> 


Re: [PATCH 16/36] btrfs: run delayed iputs before committing

2018-09-18 Thread Omar Sandoval
On Tue, Sep 11, 2018 at 01:57:47PM -0400, Josef Bacik wrote:
> Delayed iputs mean we can have final iputs of deleted inodes in the
> queue, which could potentially generate a lot of pinned space that could
> be freed.  So before we decide to commit the transaction for ENOSPC
> reasons, run the delayed iputs so that any potential space is freed up.
> If we freed enough, we can then commit the transaction and potentially
> be able to make our reservation.

Reviewed-by: Omar Sandoval 

> Signed-off-by: Josef Bacik 
> ---
>  fs/btrfs/extent-tree.c | 9 +
>  1 file changed, 9 insertions(+)
> 
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index 76941fc5af79..57567d013447 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -4823,6 +4823,15 @@ static int may_commit_transaction(struct btrfs_fs_info 
> *fs_info,
>   if (!bytes)
>   return 0;
>  
> + /*
> +  * If we have pending delayed iputs then we could free up a bunch of
> +  * pinned space, so make sure we run the iputs before we do our pinned
> +  * bytes check below.
> +  */
> + mutex_lock(&fs_info->cleaner_delayed_iput_mutex);
> + btrfs_run_delayed_iputs(fs_info);
> + mutex_unlock(&fs_info->cleaner_delayed_iput_mutex);
> +
>   trans = btrfs_join_transaction(fs_info->extent_root);
>   if (IS_ERR(trans))
>   return -ENOSPC;
> -- 
> 2.14.3
> 


Re: [PATCH 14/36] btrfs: reset max_extent_size properly

2018-09-18 Thread Omar Sandoval
On Tue, Sep 11, 2018 at 01:57:45PM -0400, Josef Bacik wrote:
> If we use up our block group before allocating a new one we'll easily
> get a max_extent_size that's set really really low, which will result in
> a lot of fragmentation.  We need to make sure we're resetting the
> max_extent_size when we add a new chunk or add new space.
> 
> Signed-off-by: Josef Bacik 
> ---
>  fs/btrfs/extent-tree.c | 7 +++
>  1 file changed, 7 insertions(+)
> 
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index 13441a293c73..44d59bee6e5e 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -4573,6 +4573,7 @@ static int do_chunk_alloc(struct btrfs_trans_handle 
> *trans, u64 flags,
>   goto out;
>   } else {
>   ret = 1;
> + space_info->max_extent_size = 0;
>   }
>  
>   space_info->force_alloc = CHUNK_ALLOC_NO_FORCE;
> @@ -8084,11 +8085,17 @@ static int __btrfs_free_reserved_extent(struct 
> btrfs_fs_info *fs_info,
>   if (pin)
>   pin_down_extent(fs_info, cache, start, len, 1);
>   else {
> + struct btrfs_space_info *space_info = cache->space_info;
> +
>   if (btrfs_test_opt(fs_info, DISCARD))
>   ret = btrfs_discard_extent(fs_info, start, len, NULL,
>   BTRFS_CLEAR_OP_DISCARD);
>   btrfs_add_free_space(cache, start, len);
>   btrfs_free_reserved_bytes(cache, len, delalloc);
> +
> + spin_lock(&space_info->lock);
> + space_info->max_extent_size = 0;
> + spin_unlock(&space_info->lock);
>   trace_btrfs_reserved_extent_free(fs_info, start, len);
>   }

Do we need to do the same for btrfs_free_tree_block()? If so, maybe it
can go in btrfs_free_reserved_bytes() instead?


Re: [PATCH 1/4 v2] btrfs: tests: add separate stub for find_lock_delalloc_range

2018-09-17 Thread Omar Sandoval
On Fri, Sep 14, 2018 at 06:38:44PM +0200, David Sterba wrote:
> The helper find_lock_delalloc_range is now conditionally built static,
> depending on whether the self-tests are enabled or not. There's a macro
> that is supposed to hide the export, used only once. To discourage
> further use, drop it and add a public wrapper for the helper needed by
> tests.

Reviewed-by: Omar Sandoval 

> Signed-off-by: David Sterba 
> ---
> 
> v2:
> - add noinline_for_stack back
>  fs/btrfs/ctree.h |  6 --
>  fs/btrfs/extent_io.c | 13 -
>  fs/btrfs/extent_io.h |  2 +-
>  fs/btrfs/tests/extent-io-tests.c | 10 +-
>  4 files changed, 18 insertions(+), 13 deletions(-)
> 
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index 2cddfe7806a4..45b7029d0f23 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -41,12 +41,6 @@ extern struct kmem_cache *btrfs_path_cachep;
>  extern struct kmem_cache *btrfs_free_space_cachep;
>  struct btrfs_ordered_sum;
>  
> -#ifdef CONFIG_BTRFS_FS_RUN_SANITY_TESTS
> -#define STATIC noinline
> -#else
> -#define STATIC static noinline
> -#endif
> -
>  #define BTRFS_MAGIC 0x4D5F53665248425FULL /* ascii _BHRfS_M, no null */
>  
>  #define BTRFS_MAX_MIRRORS 3
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index 4dd6faab02bb..93108b18b231 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -1568,7 +1568,7 @@ static noinline int lock_delalloc_pages(struct inode 
> *inode,
>   *
>   * 1 is returned if we find something, 0 if nothing was in the tree
>   */
> -STATIC u64 find_lock_delalloc_range(struct inode *inode,
> +static noinline_for_stack u64 find_lock_delalloc_range(struct inode *inode,
>   struct extent_io_tree *tree,
>   struct page *locked_page, u64 *start,
>   u64 *end, u64 max_bytes)
> @@ -1648,6 +1648,17 @@ STATIC u64 find_lock_delalloc_range(struct inode 
> *inode,
>   return found;
>  }
>  
> +#ifdef CONFIG_BTRFS_FS_RUN_SANITY_TESTS
> +u64 btrfs_find_lock_delalloc_range(struct inode *inode,
> + struct extent_io_tree *tree,
> + struct page *locked_page, u64 *start,
> + u64 *end, u64 max_bytes)
> +{
> + return find_lock_delalloc_range(inode, tree, locked_page, start, end,
> + max_bytes);
> +}
> +#endif
> +
>  static int __process_pages_contig(struct address_space *mapping,
> struct page *locked_page,
> pgoff_t start_index, pgoff_t end_index,
> diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h
> index b4d03e677e1d..1a7fdcbca49b 100644
> --- a/fs/btrfs/extent_io.h
> +++ b/fs/btrfs/extent_io.h
> @@ -546,7 +546,7 @@ int free_io_failure(struct extent_io_tree *failure_tree,
>   struct extent_io_tree *io_tree,
>   struct io_failure_record *rec);
>  #ifdef CONFIG_BTRFS_FS_RUN_SANITY_TESTS
> -noinline u64 find_lock_delalloc_range(struct inode *inode,
> +u64 btrfs_find_lock_delalloc_range(struct inode *inode,
> struct extent_io_tree *tree,
> struct page *locked_page, u64 *start,
> u64 *end, u64 max_bytes);
> diff --git a/fs/btrfs/tests/extent-io-tests.c 
> b/fs/btrfs/tests/extent-io-tests.c
> index d9269a531a4d..9e0f4a01be14 100644
> --- a/fs/btrfs/tests/extent-io-tests.c
> +++ b/fs/btrfs/tests/extent-io-tests.c
> @@ -106,7 +106,7 @@ static int test_find_delalloc(u32 sectorsize)
>   set_extent_delalloc(&tmp, 0, sectorsize - 1, 0, NULL);
>   start = 0;
>   end = 0;
> - found = find_lock_delalloc_range(inode, &tmp, locked_page, &start,
> + found = btrfs_find_lock_delalloc_range(inode, &tmp, locked_page, &start,
>    &end, max_bytes);
>   if (!found) {
>   test_err("should have found at least one delalloc");
> @@ -137,7 +137,7 @@ static int test_find_delalloc(u32 sectorsize)
>   set_extent_delalloc(&tmp, sectorsize, max_bytes - 1, 0, NULL);
>   start = test_start;
>   end = 0;
> - found = find_lock_delalloc_range(inode, &tmp, locked_page, &start,
> + found = btrfs_find_lock_delalloc_range(inode, &tmp, locked_page, &start,
>    &end, max_bytes);
>   if (!found) {
>   test_err("couldn't find delalloc in our range");
> @@ -171,7 +171,7 @@ static int test_find_delalloc(u32 sectorsize)
>   }
>   start = test_start;
>   end = 0;
>

Re: [PATCH 4/4 v2] btrfs: tests: polish ifdefs around testing helper

2018-09-17 Thread Omar Sandoval
On Fri, Sep 14, 2018 at 06:42:03PM +0200, David Sterba wrote:
> Avoid the inline ifdefs and use two sections for self-tests enabled and
> disabled.
> 
> Though we could drop the ifdef and do an unconditional test_bit of
> BTRFS_FS_STATE_DUMMY_FS_INFO, the static inline helps to optimize out
> any code that would depend on conditions using btrfs_is_testing().
> 
> As this is only for the testing code, drop unlikely().

Reviewed-by: Omar Sandoval 

> Signed-off-by: David Sterba 
> ---
> 
> v2:
> - remove unlikely
> - simplify to: return test_bit(...)
> 
>  fs/btrfs/ctree.h | 12 ++--
>  1 file changed, 6 insertions(+), 6 deletions(-)
> 
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index 32d2fce4ac53..1656ada9200b 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -3708,17 +3708,17 @@ static inline int btrfs_defrag_cancelled(struct 
> btrfs_fs_info *fs_info)
>  #ifdef CONFIG_BTRFS_FS_RUN_SANITY_TESTS
>  void btrfs_test_inode_set_ops(struct inode *inode);
>  void btrfs_test_destroy_inode(struct inode *inode);
> -#endif
>  
>  static inline int btrfs_is_testing(struct btrfs_fs_info *fs_info)
>  {
> -#ifdef CONFIG_BTRFS_FS_RUN_SANITY_TESTS
> - if (unlikely(test_bit(BTRFS_FS_STATE_DUMMY_FS_INFO,
> -   &fs_info->fs_state)))
> - return 1;
> -#endif
> + return test_bit(BTRFS_FS_STATE_DUMMY_FS_INFO, &fs_info->fs_state);
> +}
> +#else
> +static inline int btrfs_is_testing(struct btrfs_fs_info *fs_info)
> +{
>   return 0;
>  }
> +#endif
>  
>  static inline void cond_wake_up(struct wait_queue_head *wq)
>  {
> -- 
> 2.18.0
> 


Re: [PATCH] btrfs: wait on caching when putting the bg cache

2018-09-12 Thread Omar Sandoval
On Wed, Sep 12, 2018 at 10:45:45AM -0400, Josef Bacik wrote:
> While testing my backport I noticed there was a panic if I ran
> generic/416 generic/417 generic/418 all in a row.  This just happened to
> uncover a race where we had outstanding IO after we destroy all of our
> workqueues, and then we'd go to queue the endio work on those free'd
> workqueues.  This is because we aren't waiting for the caching threads
> to be done before freeing everything up, so to fix this make sure we
> wait on any outstanding caching that's being done before we free up the
> block group, so we're sure to be done with all IO by the time we get to
> btrfs_stop_all_workers().  This fixes the panic I was seeing
> consistently in testing.

Reviewed-by: Omar Sandoval 

> Signed-off-by: Josef Bacik 
> ---
>  fs/btrfs/extent-tree.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index 414492a18f1e..2eb2e37f2354 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -9889,6 +9889,7 @@ void btrfs_put_block_group_cache(struct btrfs_fs_info 
> *info)
>  
>   block_group = btrfs_lookup_first_block_group(info, last);
>   while (block_group) {
> + wait_block_group_cache_done(block_group);
>   spin_lock(&block_group->lock);
>   if (block_group->iref)
>   break;
> -- 
> 2.14.3
> 


Re: [PATCH 03/36] btrfs: cleanup extent_op handling

2018-09-11 Thread Omar Sandoval
On Tue, Sep 11, 2018 at 01:57:34PM -0400, Josef Bacik wrote:
> From: Josef Bacik 
> 
> The cleanup_extent_op function actually would run the extent_op if it
> needed running, which made the name sort of a misnomer.  Change it to
> run_and_cleanup_extent_op, and move the actual cleanup work to
> cleanup_extent_op so it can be used by check_ref_cleanup() in order to
> unify the extent op handling.
> 
> Signed-off-by: Josef Bacik 
> ---
>  fs/btrfs/extent-tree.c | 36 +++-
>  1 file changed, 23 insertions(+), 13 deletions(-)
> 
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index a44d55e36e11..98f36dfeccb0 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -2442,19 +2442,33 @@ static void unselect_delayed_ref_head(struct 
> btrfs_delayed_ref_root *delayed_ref
>   btrfs_delayed_ref_unlock(head);
>  }
>  
> -static int cleanup_extent_op(struct btrfs_trans_handle *trans,
> -  struct btrfs_delayed_ref_head *head)
> +static struct btrfs_delayed_extent_op *
> +cleanup_extent_op(struct btrfs_trans_handle *trans,
> +   struct btrfs_delayed_ref_head *head)
>  {
>   struct btrfs_delayed_extent_op *extent_op = head->extent_op;
> - int ret;
>  
>   if (!extent_op)
> - return 0;
> - head->extent_op = NULL;
> + return NULL;
> +
>   if (head->must_insert_reserved) {
> + head->extent_op = NULL;
>   btrfs_free_delayed_extent_op(extent_op);
> - return 0;
> + return NULL;
>   }

Now we don't set head->extent_op = NULL in this case when we call it
from check_ref_cleanup(), is that a problem?

> + return extent_op;
> +}
> +
> +static int run_and_cleanup_extent_op(struct btrfs_trans_handle *trans,
> +  struct btrfs_delayed_ref_head *head)
> +{
> + struct btrfs_delayed_extent_op *extent_op =
> + cleanup_extent_op(trans, head);
> + int ret;
> +
> + if (!extent_op)
> + return 0;
> + head->extent_op = NULL;
>   spin_unlock(&head->lock);
>   ret = run_delayed_extent_op(trans, head, extent_op);
>   btrfs_free_delayed_extent_op(extent_op);
> @@ -2506,7 +2520,7 @@ static int cleanup_ref_head(struct btrfs_trans_handle 
> *trans,
>  
>   delayed_refs = &trans->transaction->delayed_refs;
>  
> - ret = cleanup_extent_op(trans, head);
> + ret = run_and_cleanup_extent_op(trans, head);
>   if (ret < 0) {
>   unselect_delayed_ref_head(delayed_refs, head);
>   btrfs_debug(fs_info, "run_delayed_extent_op returned %d", ret);
> @@ -6977,12 +6991,8 @@ static noinline int check_ref_cleanup(struct 
> btrfs_trans_handle *trans,
>   if (!RB_EMPTY_ROOT(&head->ref_tree))
>   goto out;
>  
> - if (head->extent_op) {
> - if (!head->must_insert_reserved)
> - goto out;
> - btrfs_free_delayed_extent_op(head->extent_op);
> - head->extent_op = NULL;
> - }
> + if (cleanup_extent_op(trans, head) != NULL)
> + goto out;
>  
>   /*
>* waiting for the lock here would deadlock.  If someone else has it
> -- 
> 2.14.3
> 


Re: [PATCH 08/36] btrfs: dump block_rsv when dumping space info

2018-09-11 Thread Omar Sandoval
On Tue, Sep 11, 2018 at 01:57:39PM -0400, Josef Bacik wrote:
> For enospc_debug having the block rsvs is super helpful to see if we've
> done something wrong.
> 
> Signed-off-by: Josef Bacik 
> ---
>  fs/btrfs/extent-tree.c | 16 
>  1 file changed, 16 insertions(+)
> 
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index a3baa16d456f..1cf66a92829b 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -7918,6 +7918,16 @@ static noinline int find_free_extent(struct 
> btrfs_fs_info *fs_info,
>   return ret;
>  }
>  
> +static void dump_block_rsv(struct btrfs_fs_info *fs_info,
> +struct btrfs_block_rsv *rsv)
> +{
> + spin_lock(&rsv->lock);
> + btrfs_info(fs_info, "%d: size %llu reserved %llu\n",
> +rsv->type, (unsigned long long)rsv->size,
> +(unsigned long long)rsv->reserved);

How about passing a string name for each of these instead of an ID which
we have to cross-reference with the source?

Besides that,

Reviewed-by: Omar Sandoval 

> + spin_unlock(&rsv->lock);
> +}
> +
>  static void dump_space_info(struct btrfs_fs_info *fs_info,
>   struct btrfs_space_info *info, u64 bytes,
>   int dump_block_groups)
> @@ -7937,6 +7947,12 @@ static void dump_space_info(struct btrfs_fs_info 
> *fs_info,
>   info->bytes_readonly);
>   spin_unlock(&info->lock);
>  
> + dump_block_rsv(fs_info, &fs_info->global_block_rsv);
> + dump_block_rsv(fs_info, &fs_info->trans_block_rsv);
> + dump_block_rsv(fs_info, &fs_info->chunk_block_rsv);
> + dump_block_rsv(fs_info, &fs_info->delayed_block_rsv);
> + dump_block_rsv(fs_info, &fs_info->delayed_refs_rsv);
> +
>   if (!dump_block_groups)
>   return;
>  
> -- 
> 2.14.3
> 


Re: [PATCH 07/36] btrfs: check if free bgs for commit

2018-09-11 Thread Omar Sandoval
On Tue, Sep 11, 2018 at 01:57:38PM -0400, Josef Bacik wrote:
> may_commit_transaction will skip committing the transaction if we don't
> have enough pinned space or if we're trying to find space for a SYSTEM
> chunk.  However if we have pending free block groups in this transaction
> we still want to commit as we may be able to allocate a chunk to make
> our reservation.  So instead of just returning ENOSPC, check if we have
> free block groups pending, and if so commit the transaction to allow us
> to use that free space.

Reviewed-by: Omar Sandoval 

> Signed-off-by: Josef Bacik 
> ---
>  fs/btrfs/extent-tree.c | 33 +++--
>  1 file changed, 19 insertions(+), 14 deletions(-)
> 
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index 7168e2476944..a3baa16d456f 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -4830,10 +4830,18 @@ static int may_commit_transaction(struct 
> btrfs_fs_info *fs_info,
>   if (!bytes)
>   return 0;
>  
> - /* See if there is enough pinned space to make this reservation */
> - if (__percpu_counter_compare(&fs_info->total_bytes_pinned,
> -bytes,
> -BTRFS_TOTAL_BYTES_PINNED_BATCH) >= 0)
> + trans = btrfs_join_transaction(fs_info->extent_root);
> + if (IS_ERR(trans))
> + return -ENOSPC;
> +
> + /*
> +  * See if there is enough pinned space to make this reservation, or if
> +  * we have bg's that are going to be freed, allowing us to possibly do a
> +  * chunk allocation the next loop through.
> +  */
> + if (test_bit(BTRFS_TRANS_HAVE_FREE_BGS, &trans->transaction->flags) ||
> + __percpu_counter_compare(&fs_info->total_bytes_pinned, bytes,
> +  BTRFS_TOTAL_BYTES_PINNED_BATCH) >= 0)
>   goto commit;
>  
>   /*
> @@ -4841,7 +4849,7 @@ static int may_commit_transaction(struct btrfs_fs_info 
> *fs_info,
>* this reservation.
>*/
>   if (space_info != delayed_rsv->space_info)
> - return -ENOSPC;
> + goto enospc;
>  
>   spin_lock(&delayed_rsv->lock);
>   reclaim_bytes += delayed_rsv->reserved;
> @@ -4855,17 +4863,14 @@ static int may_commit_transaction(struct 
> btrfs_fs_info *fs_info,
>   bytes -= reclaim_bytes;
>  
>   if (__percpu_counter_compare(&fs_info->total_bytes_pinned,
> -bytes,
> -BTRFS_TOTAL_BYTES_PINNED_BATCH) < 0) {
> - return -ENOSPC;
> - }
> -
> +  bytes,
> +  BTRFS_TOTAL_BYTES_PINNED_BATCH) < 0)
> + goto enospc;
>  commit:
> - trans = btrfs_join_transaction(fs_info->extent_root);
> - if (IS_ERR(trans))
> - return -ENOSPC;
> -
>   return btrfs_commit_transaction(trans);
> +enospc:
> + btrfs_end_transaction(trans);
> + return -ENOSPC;
>  }
>  
>  /*
> -- 
> 2.14.3
> 


Re: [PATCH 05/36] btrfs: only count ref heads run in __btrfs_run_delayed_refs

2018-09-11 Thread Omar Sandoval
On Tue, Sep 11, 2018 at 01:57:36PM -0400, Josef Bacik wrote:
> We pick the number of refs to run based on the number of ref heads, and
> only make the decision to stop once we've processed entire ref heads, so
> only count the ref heads we've run and bail once we've hit the number of
> ref heads we wanted to process.

Despite Nikolay's comment, it seems wrong to me to split this patch up
from the previous one. After the first one, you have this nonsensical
middle ground where the counter is number of heads but this counter is
number of refs.
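The head-vs-ref distinction can be made concrete with a small userspace model (illustrative only; `heads[]` stands in for the delayed-ref head list, with each entry holding the number of refs queued on that head):

```c
#include <assert.h>

static const int heads[] = { 3, 1, 5, 2 };	/* refs queued per ref head */
#define NR_HEADS ((int)(sizeof(heads) / sizeof(heads[0])))

/*
 * Model of the loop after the patch: `count` ticks once per ref head as
 * soon as the head is selected, so the loop stops after the requested
 * number of heads no matter how many individual refs each one holds.
 */
static int run_delayed_refs(int nr_heads_limit, int *refs_run)
{
	int count = 0, i;

	for (i = 0; i < NR_HEADS && count < nr_heads_limit; i++) {
		count++;		/* per head, not per ref */
		*refs_run += heads[i];	/* process every ref on the head */
	}
	return count;			/* heads fully processed */
}
```

Asking for two heads processes all four refs on them; counting per ref instead (as in the intermediate state Omar describes) would have stopped partway through the first head's refs.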

> Signed-off-by: Josef Bacik 
> ---
>  fs/btrfs/extent-tree.c | 5 +
>  1 file changed, 1 insertion(+), 4 deletions(-)
> 
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index 98f36dfeccb0..b32bd38390dd 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -2592,6 +2592,7 @@ static noinline int __btrfs_run_delayed_refs(struct 
> btrfs_trans_handle *trans,
>   spin_unlock(&delayed_refs->lock);
>   break;
>   }
> + count++;
>  
>   /* grab the lock that says we are going to process
>* all the refs for this head */
> @@ -2605,7 +2606,6 @@ static noinline int __btrfs_run_delayed_refs(struct 
> btrfs_trans_handle *trans,
>*/
>   if (ret == -EAGAIN) {
>   locked_ref = NULL;
> - count++;
>   continue;
>   }
>   }
> @@ -2633,7 +2633,6 @@ static noinline int __btrfs_run_delayed_refs(struct 
> btrfs_trans_handle *trans,
>   unselect_delayed_ref_head(delayed_refs, locked_ref);
>   locked_ref = NULL;
>   cond_resched();
> - count++;
>   continue;
>   }
>  
> @@ -2651,7 +2650,6 @@ static noinline int __btrfs_run_delayed_refs(struct 
> btrfs_trans_handle *trans,
>   return ret;
>   }
>   locked_ref = NULL;
> - count++;
>   continue;
>   }
>  
> @@ -2702,7 +2700,6 @@ static noinline int __btrfs_run_delayed_refs(struct 
> btrfs_trans_handle *trans,
>   }
>  
>   btrfs_put_delayed_ref(ref);
> - count++;
>   cond_resched();
>   }
>  
> -- 
> 2.14.3
> 


Re: [PATCH 01/36] btrfs: add btrfs_delete_ref_head helper

2018-09-11 Thread Omar Sandoval
On Tue, Sep 11, 2018 at 01:57:32PM -0400, Josef Bacik wrote:
> From: Josef Bacik 
> 
> We do this dance in cleanup_ref_head and check_ref_cleanup, unify it
> into a helper and cleanup the calling functions.

Reviewed-by: Omar Sandoval 

> Signed-off-by: Josef Bacik 
> ---
>  fs/btrfs/delayed-ref.c | 14 ++
>  fs/btrfs/delayed-ref.h |  3 ++-
>  fs/btrfs/extent-tree.c | 22 +++---
>  3 files changed, 19 insertions(+), 20 deletions(-)
> 
> diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c
> index 62ff545ba1f7..3a9e4ac21794 100644
> --- a/fs/btrfs/delayed-ref.c
> +++ b/fs/btrfs/delayed-ref.c
> @@ -393,6 +393,20 @@ btrfs_select_ref_head(struct btrfs_trans_handle *trans)
>   return head;
>  }
>  
> +void btrfs_delete_ref_head(struct btrfs_delayed_ref_root *delayed_refs,
> +struct btrfs_delayed_ref_head *head)
> +{
> + lockdep_assert_held(&delayed_refs->lock);
> + lockdep_assert_held(&head->lock);
> +
> + rb_erase(&head->href_node, &delayed_refs->href_root);
> + RB_CLEAR_NODE(&head->href_node);
> + atomic_dec(&delayed_refs->num_entries);
> + delayed_refs->num_heads--;
> + if (head->processing == 0)
> + delayed_refs->num_heads_ready--;
> +}
> +
>  /*
>   * Helper to insert the ref_node to the tail or merge with tail.
>   *
> diff --git a/fs/btrfs/delayed-ref.h b/fs/btrfs/delayed-ref.h
> index d9f2a4ebd5db..7769177b489e 100644
> --- a/fs/btrfs/delayed-ref.h
> +++ b/fs/btrfs/delayed-ref.h
> @@ -261,7 +261,8 @@ static inline void btrfs_delayed_ref_unlock(struct 
> btrfs_delayed_ref_head *head)
>  {
>   mutex_unlock(&head->mutex);
>  }
> -
> +void btrfs_delete_ref_head(struct btrfs_delayed_ref_root *delayed_refs,
> +struct btrfs_delayed_ref_head *head);
>  
>  struct btrfs_delayed_ref_head *
>  btrfs_select_ref_head(struct btrfs_trans_handle *trans);
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index f77226d8020a..d24a0de4a2e7 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -2492,12 +2492,9 @@ static int cleanup_ref_head(struct btrfs_trans_handle 
> *trans,
>   spin_unlock(&delayed_refs->lock);
>   return 1;
>   }
> - delayed_refs->num_heads--;
> - rb_erase(&head->href_node, &delayed_refs->href_root);
> - RB_CLEAR_NODE(&head->href_node);
> + btrfs_delete_ref_head(delayed_refs, head);
>   spin_unlock(&head->lock);
>   spin_unlock(&delayed_refs->lock);
> - atomic_dec(&delayed_refs->num_entries);
>  
>   trace_run_delayed_ref_head(fs_info, head, 0);
>  
> @@ -6984,22 +6981,9 @@ static noinline int check_ref_cleanup(struct 
> btrfs_trans_handle *trans,
>   if (!mutex_trylock(&head->mutex))
>   goto out;
>  
> - /*
> -  * at this point we have a head with no other entries.  Go
> -  * ahead and process it.
> -  */
> - rb_erase(&head->href_node, &delayed_refs->href_root);
> - RB_CLEAR_NODE(&head->href_node);
> - atomic_dec(&delayed_refs->num_entries);
> -
> - /*
> -  * we don't take a ref on the node because we're removing it from the
> -  * tree, so we just steal the ref the tree was holding.
> -  */
> - delayed_refs->num_heads--;
> - if (head->processing == 0)
> - delayed_refs->num_heads_ready--;
> + btrfs_delete_ref_head(delayed_refs, head);
>   head->processing = 0;
> +
>   spin_unlock(&head->lock);
>   spin_unlock(&delayed_refs->lock);
>  
> -- 
> 2.14.3
> 


[PATCH v7 1/6] mm: split SWP_FILE into SWP_ACTIVATED and SWP_FS

2018-09-11 Thread Omar Sandoval
From: Omar Sandoval 

The SWP_FILE flag serves two purposes: to make swap_{read,write}page()
go through the filesystem, and to make swapoff() call
->swap_deactivate(). For Btrfs, we want the latter but not the former,
so split this flag into two. This makes us always call
->swap_deactivate() if ->swap_activate() succeeded, not just if it
didn't add any swap extents itself.

This also resolves the issue of the very misleading name of SWP_FILE,
which is only used for swap files over NFS.

Reviewed-by: Nikolay Borisov 
Signed-off-by: Omar Sandoval 
---
 include/linux/swap.h | 13 +++--
 mm/page_io.c |  6 +++---
 mm/swapfile.c| 13 -
 3 files changed, 18 insertions(+), 14 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 8e2c11e692ba..0fda0aa743f0 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -167,13 +167,14 @@ enum {
SWP_SOLIDSTATE  = (1 << 4), /* blkdev seeks are cheap */
SWP_CONTINUED   = (1 << 5), /* swap_map has count continuation */
SWP_BLKDEV  = (1 << 6), /* its a block device */
-   SWP_FILE= (1 << 7), /* set after swap_activate success */
-   SWP_AREA_DISCARD = (1 << 8),/* single-time swap area discards */
-   SWP_PAGE_DISCARD = (1 << 9),/* freed swap page-cluster discards */
-   SWP_STABLE_WRITES = (1 << 10),  /* no overwrite PG_writeback pages */
-   SWP_SYNCHRONOUS_IO = (1 << 11), /* synchronous IO is efficient */
+   SWP_ACTIVATED   = (1 << 7), /* set after swap_activate success */
+   SWP_FS  = (1 << 8), /* swap file goes through fs */
+   SWP_AREA_DISCARD = (1 << 9),/* single-time swap area discards */
+   SWP_PAGE_DISCARD = (1 << 10),   /* freed swap page-cluster discards */
+   SWP_STABLE_WRITES = (1 << 11),  /* no overwrite PG_writeback pages */
+   SWP_SYNCHRONOUS_IO = (1 << 12), /* synchronous IO is efficient */
/* add others here before... */
-   SWP_SCANNING= (1 << 12),/* refcount in scan_swap_map */
+   SWP_SCANNING= (1 << 13),/* refcount in scan_swap_map */
 };
 
 #define SWAP_CLUSTER_MAX 32UL
diff --git a/mm/page_io.c b/mm/page_io.c
index aafd19ec1db4..e8653c368069 100644
--- a/mm/page_io.c
+++ b/mm/page_io.c
@@ -283,7 +283,7 @@ int __swap_writepage(struct page *page, struct writeback_control *wbc,
struct swap_info_struct *sis = page_swap_info(page);
 
VM_BUG_ON_PAGE(!PageSwapCache(page), page);
-   if (sis->flags & SWP_FILE) {
+   if (sis->flags & SWP_FS) {
struct kiocb kiocb;
struct file *swap_file = sis->swap_file;
struct address_space *mapping = swap_file->f_mapping;
@@ -365,7 +365,7 @@ int swap_readpage(struct page *page, bool synchronous)
goto out;
}
 
-   if (sis->flags & SWP_FILE) {
+   if (sis->flags & SWP_FS) {
struct file *swap_file = sis->swap_file;
struct address_space *mapping = swap_file->f_mapping;
 
@@ -423,7 +423,7 @@ int swap_set_page_dirty(struct page *page)
 {
struct swap_info_struct *sis = page_swap_info(page);
 
-   if (sis->flags & SWP_FILE) {
+   if (sis->flags & SWP_FS) {
struct address_space *mapping = sis->swap_file->f_mapping;
 
VM_BUG_ON_PAGE(!PageSwapCache(page), page);
diff --git a/mm/swapfile.c b/mm/swapfile.c
index d954b71c4f9c..d3f95833d12e 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -989,7 +989,7 @@ int get_swap_pages(int n_goal, swp_entry_t swp_entries[], 
int entry_size)
goto nextsi;
}
if (size == SWAPFILE_CLUSTER) {
-   if (!(si->flags & SWP_FILE))
+   if (!(si->flags & SWP_FS))
n_ret = swap_alloc_cluster(si, swp_entries);
} else
n_ret = scan_swap_map_slots(si, SWAP_HAS_CACHE,
@@ -2310,12 +2310,13 @@ static void destroy_swap_extents(struct swap_info_struct *sis)
kfree(se);
}
 
-   if (sis->flags & SWP_FILE) {
+   if (sis->flags & SWP_ACTIVATED) {
struct file *swap_file = sis->swap_file;
struct address_space *mapping = swap_file->f_mapping;
 
-   sis->flags &= ~SWP_FILE;
-   mapping->a_ops->swap_deactivate(swap_file);
+   sis->flags &= ~SWP_ACTIVATED;
+   if (mapping->a_ops->swap_deactivate)
+   mapping->a_ops->swap_deactivate(swap_file);
}
 }
 
@@ -2411,8 +2412,10 @@ static int setup_swap_extents(struct swap_info_struct *sis, sector_t *span

[PATCH v7 0/6] Btrfs: implement swap file support

2018-09-11 Thread Omar Sandoval
From: Omar Sandoval 

Hi,

This series implements swap file support for Btrfs.

Changes from v6 [1]:

- Moved btrfs_get_chunk_map() comment to function body
- Added more comments about pinned block group/device rbtree
- Fixed bug in patch 4 which broke resize

Based on v4.19-rc3.

Thanks!

1: https://www.spinics.net/lists/linux-btrfs/msg81732.html

Omar Sandoval (6):
  mm: split SWP_FILE into SWP_ACTIVATED and SWP_FS
  mm: export add_swap_extent()
  vfs: update swap_{,de}activate documentation
  Btrfs: prevent ioctls from interfering with a swap file
  Btrfs: rename get_chunk_map() and make it non-static
  Btrfs: support swap files

 Documentation/filesystems/Locking |  17 +-
 Documentation/filesystems/vfs.txt |  12 +-
 fs/btrfs/ctree.h  |  29 +++
 fs/btrfs/dev-replace.c|   8 +
 fs/btrfs/disk-io.c|   4 +
 fs/btrfs/inode.c  | 317 ++
 fs/btrfs/ioctl.c  |  31 ++-
 fs/btrfs/relocation.c |  18 +-
 fs/btrfs/volumes.c|  82 ++--
 fs/btrfs/volumes.h|   2 +
 include/linux/swap.h  |  13 +-
 mm/page_io.c  |   6 +-
 mm/swapfile.c |  14 +-
 13 files changed, 502 insertions(+), 51 deletions(-)

-- 
2.18.0


