date:20180306

[PATCH 4.4 1/2] btrfs: Don't clear SGID when inheriting ACLs

2018-03-06 Thread Nikolay Borisov

From: Jan Kara 

When new directory 'DIR1' is created in a directory 'DIR0' with SGID bit
set, DIR1 is expected to have SGID bit set (and owning group equal to
the owning group of 'DIR0'). However when 'DIR0' also has some default
ACLs that 'DIR1' inherits, setting these ACLs will result in SGID bit on
'DIR1' to get cleared if user is not member of the owning group.

Fix the problem by moving posix_acl_update_mode() out of
__btrfs_set_acl() into btrfs_set_acl(). That way the function will not be
called when inheriting ACLs which is what we want as it prevents SGID
bit clearing and the mode has been properly set by posix_acl_create()
anyway.

Fixes: 073931017b49d9458aa351605b43a7e34598caef
CC: sta...@vger.kernel.org
CC: linux-btrfs@vger.kernel.org
CC: David Sterba 
Signed-off-by: Jan Kara 
Signed-off-by: David Sterba 
Signed-off-by: Nikolay Borisov 
---
 fs/btrfs/acl.c | 13 +++--
 1 file changed, 7 insertions(+), 6 deletions(-)

diff --git a/fs/btrfs/acl.c b/fs/btrfs/acl.c
index fb3e64d37cb4..233bbc8789e0 100644
--- a/fs/btrfs/acl.c
+++ b/fs/btrfs/acl.c
@@ -82,12 +82,6 @@ static int __btrfs_set_acl(struct btrfs_trans_handle *trans,
switch (type) {
case ACL_TYPE_ACCESS:
name = POSIX_ACL_XATTR_ACCESS;
-   if (acl) {
-   ret = posix_acl_update_mode(inode, >i_mode, 
);
-   if (ret)
-   return ret;
-   }
-   ret = 0;
break;
case ACL_TYPE_DEFAULT:
if (!S_ISDIR(inode->i_mode))
@@ -123,6 +117,13 @@ static int __btrfs_set_acl(struct btrfs_trans_handle 
*trans,
 
 int btrfs_set_acl(struct inode *inode, struct posix_acl *acl, int type)
 {
+   int ret;
+
+   if (type == ACL_TYPE_ACCESS && acl) {
+   ret = posix_acl_update_mode(inode, >i_mode, );
+   if (ret)
+   return ret;
+   }
return __btrfs_set_acl(NULL, inode, acl, type);
 }
 
-- 
2.7.4

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] fstests: btrfs/004: increase the buffer size of logical-resolve to the maximum value 64K

2018-03-06 Thread Eryu Guan

On Tue, Mar 06, 2018 at 03:02:31PM +0800, Lu Fengqi wrote:
> Because of commit e76e13ce8c0b ("fsstress: implement the
> clonerange/deduperange ioctls"), dedupe makes the number of references to
> the same extent item increase so much that the default 4K buffer of
> logical-resolve is no longer sufficient.
> 
> Signed-off-by: Lu Fengqi 

This looks fine to me. But I'd like an explicit ack from btrfs
developers.

Thanks,
Eryu

> ---
>  tests/btrfs/004 | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/tests/btrfs/004 b/tests/btrfs/004
> index de583cc355d4..0d2efb91dba7 100755
> --- a/tests/btrfs/004
> +++ b/tests/btrfs/004
> @@ -103,7 +103,7 @@ _btrfs_inspect_addr()
>   expect_addr=$3
>   expect_inum=$4
>   file=$5
> - cmd="$BTRFS_UTIL_PROG inspect-internal logical-resolve -P $addr $mp"
> + cmd="$BTRFS_UTIL_PROG inspect-internal logical-resolve -s 65536 -P 
> $addr $mp"
>   echo "# $cmd" >> $seqres.full
>   out=`$cmd`
>   echo "$out" >> $seqres.full
> -- 
> 2.16.2
> 
> 
> 
> --
> To unsubscribe from this list: send the line "unsubscribe fstests" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 8/8] btrfs-progs: qgroups: export qgroups usage information as JSON

2018-03-06 Thread Qu Wenruo



On 2018年03月03日 02:47, je...@suse.com wrote:
> From: Jeff Mahoney 
> 
> One of the common requests I receive is for 'df' like facilities
> for subvolume usage.  Really, the request is for monitoring tools to be
> able to understand when subvolumes may be approaching quota in the same
> manner traditional file systems approach ENOSPC.
> 
> This patch allows us to export the qgroups data in a machine-readable
> format so that monitoring tools can parse it easily.
> 
> There are two modes since JSON can technically handle 64-bit numbers
> but JavaScript proper cannot.  show -j enables JSON mode using 64-bit
> integers directly.  --json-compat presents 64-bit numbers as an array
> of two 32-bit numbers (high followed by low).
> 
> Signed-off-by: Jeff Mahoney 
> ---
>  Documentation/btrfs-qgroup.asciidoc |   4 +
>  Makefile.inc.in |   4 +-
>  cmds-qgroup.c   |  36 +-
>  configure.ac|   6 +
>  qgroup.c| 211 
> 
>  qgroup.h|   3 +
>  6 files changed, 258 insertions(+), 6 deletions(-)
> 
> diff --git a/Documentation/btrfs-qgroup.asciidoc 
> b/Documentation/btrfs-qgroup.asciidoc
> index 360b3269..22a9c2a7 100644
> --- a/Documentation/btrfs-qgroup.asciidoc
> +++ b/Documentation/btrfs-qgroup.asciidoc
> @@ -105,6 +105,10 @@ list all qgroups which impact the given path(include 
> ancestral qgroups)
>  list all qgroups which impact the given path(exclude ancestral qgroups)
>  -v
>  Be more verbose.  Print pathnames of member qgroups when nested.
> +-j
> +If enabled, export qgroup usage information in JSON format.  This implies 
> --raw.
> +--json-compat
> +By default, JSON output contains full 64-bit integers, which may be 
> incompatible with some JSON parsers.  This option exports those values as an 
> array of 32-bit numbers in [high, low] format.
>  --raw
>  raw numbers in bytes, without the 'B' suffix.
>  --human-readable
> diff --git a/Makefile.inc.in b/Makefile.inc.in
> index 56271903..68bddbed 100644
> --- a/Makefile.inc.in
> +++ b/Makefile.inc.in
> @@ -18,9 +18,9 @@ BTRFSRESTORE_ZSTD = @BTRFSRESTORE_ZSTD@
>  SUBST_CFLAGS = @CFLAGS@
>  SUBST_LDFLAGS = @LDFLAGS@
>  
> -LIBS_BASE = @UUID_LIBS@ @BLKID_LIBS@ -L. -pthread
> +LIBS_BASE = @UUID_LIBS@ @BLKID_LIBS@ @JSON_LIBS@ -L. -pthread
>  LIBS_COMP = @ZLIB_LIBS@ @LZO2_LIBS@ @ZSTD_LIBS@
> -STATIC_LIBS_BASE = @UUID_LIBS_STATIC@ @BLKID_LIBS_STATIC@ -L. -pthread
> +STATIC_LIBS_BASE = @UUID_LIBS_STATIC@ @BLKID_LIBS_STATIC@ @JSON_LIBS_STATIC@ 
> -L. -pthread
>  STATIC_LIBS_COMP = @ZLIB_LIBS_STATIC@ @LZO2_LIBS_STATIC@ @ZSTD_LIBS_STATIC@
>  
>  prefix ?= @prefix@
> diff --git a/cmds-qgroup.c b/cmds-qgroup.c
> index 94cd0fd3..eee15ef1 100644
> --- a/cmds-qgroup.c
> +++ b/cmds-qgroup.c
> @@ -282,6 +282,10 @@ static const char * const cmd_qgroup_show_usage[] = {
>   "   (excluding ancestral qgroups)",
>   "-P print first-level qgroups using pathname",
>   "-v verbose, prints all nested subvolumes",
> +#ifdef HAVE_JSON
> + "-j export in JSON format",
> + "--json-compat  export in JSON compatibility mode",
> +#endif
>   HELPINFO_UNITS_LONG,
>   "--sort=qgroupid,rfer,excl,max_rfer,max_excl,pathname",
>   "   list qgroups sorted by specified items",
> @@ -302,6 +306,8 @@ static int cmd_qgroup_show(int argc, char **argv)
>   unsigned unit_mode;
>   int sync = 0;
>   bool verbose = false;
> + bool export_json = false;
> + bool compat_json = false;
>  
>   struct btrfs_qgroup_comparer_set *comparer_set;
>   struct btrfs_qgroup_filter_set *filter_set;
> @@ -314,16 +320,26 @@ static int cmd_qgroup_show(int argc, char **argv)
>   int c;
>   enum {
>   GETOPT_VAL_SORT = 256,
> - GETOPT_VAL_SYNC
> + GETOPT_VAL_SYNC,
> + GETOPT_VAL_JSCOMPAT,
>   };
>   static const struct option long_options[] = {
>   {"sort", required_argument, NULL, GETOPT_VAL_SORT},
>   {"sync", no_argument, NULL, GETOPT_VAL_SYNC},
>   {"verbose", no_argument, NULL, 'v'},
> +#ifdef HAVE_JSON
> + {"json-compat", no_argument, NULL, GETOPT_VAL_JSCOMPAT},
> +#endif
>   { NULL, 0, NULL, 0 }
>   };
> -
> - c = getopt_long(argc, argv, "pPcreFfv", long_options, NULL);
> + const char getopt_chars[] = {
> + 'p', 'P', 'c', 'r', 'e', 'F', 'f', 'v',
> +#ifdef HAVE_JSON
> + 'j',
> +#endif
> + '\0' };
> +
> + c = getopt_long(argc, argv, getopt_chars, long_options, NULL);
>   if (c < 0)
>   break;
>   switch (c) {
> @@ -353,6 +369,14 @@

Re: [PATCH 0/8] btrfs-progs: qgroups usability [corrected]

2018-03-06 Thread Qu Wenruo



On 2018年03月03日 02:46, je...@suse.com wrote:
> From: Jeff Mahoney 
> 
> Hi all -
> 
> The following series addresses some usability issues with the qgroups UI.
> 
> 1) Adds -W option so we can wait on a rescan completing without starting one.
> 2) Adds qgroup information to 'btrfs subvolume show'
> 3) Adds a -P option to show pathnames for first-level qgroups (or member
>of nested qgroups with -v)
> 4) Allows exporting the qgroup table in JSON format for use by external
>programs/scripts.
> 
> -Jeff
> 
> Jeff Mahoney (8):
>   btrfs-progs: quota: Add -W option to rescan to wait without starting
> rescan
>   btrfs-progs: qgroups: fix misleading index check
>   btrfs-progs: constify pathnames passed as arguments

For patch 1~3 looks good.

Reviewed-by: Qu Wenruo 

Thanks,
Qu

>   btrfs-progs: qgroups: add pathname to show output
>   btrfs-progs: qgroups: introduce and use info and limit structures
>   btrfs-progs: qgroups: introduce btrfs_qgroup_query
>   btrfs-progs: subvolume: add quota info to btrfs sub show
>   btrfs-progs: qgroups: export qgroups usage information as JSON
> 
>  Documentation/btrfs-qgroup.asciidoc |   8 +
>  Documentation/btrfs-quota.asciidoc  |  10 +-
>  Makefile.inc.in |   4 +-
>  chunk-recover.c |   4 +-
>  cmds-device.c   |   2 +-
>  cmds-fi-usage.c |   6 +-
>  cmds-qgroup.c   |  49 +++-
>  cmds-quota.c|  21 +-
>  cmds-rescue.c   |   4 +-
>  cmds-subvolume.c|  46 
>  configure.ac|   6 +
>  kerncompat.h|   1 +
>  qgroup.c| 526 
> ++--
>  qgroup.h|  22 +-
>  send-utils.c|   4 +-
>  utils.c |  22 +-
>  utils.h |   2 +
>  17 files changed, 621 insertions(+), 116 deletions(-)
> 



signature.asc
Description: OpenPGP digital signature

Re: [PATCH 7/8] btrfs-progs: subvolume: add quota info to btrfs sub show

2018-03-06 Thread Qu Wenruo



On 2018年03月03日 02:47, je...@suse.com wrote:
> From: Jeff Mahoney 
> 
> This patch reports on the first-level qgroup, if any, associated with
> a particular subvolume.  It displays the usage and limit, subject
> to the usual unit parameters.
> 
> Signed-off-by: Jeff Mahoney 
> ---
>  cmds-subvolume.c | 46 ++
>  1 file changed, 46 insertions(+)
> 
> diff --git a/cmds-subvolume.c b/cmds-subvolume.c
> index 8a473f7a..29d0e0e5 100644
> --- a/cmds-subvolume.c
> +++ b/cmds-subvolume.c
> @@ -972,6 +972,7 @@ static const char * const cmd_subvol_show_usage[] = {
>   "Show more information about the subvolume",
>   "-r|--rootid   rootid of the subvolume",
>   "-u|--uuid uuid of the subvolume",
> + HELPINFO_UNITS_SHORT_LONG,
>   "",
>   "If no option is specified,  will be shown, otherwise",
>   "the rootid or uuid are resolved relative to the  path.",
> @@ -993,6 +994,13 @@ static int cmd_subvol_show(int argc, char **argv)
>   int by_uuid = 0;
>   u64 rootid_arg;
>   u8 uuid_arg[BTRFS_UUID_SIZE];
> + struct btrfs_qgroup_stats stats;
> + unsigned int unit_mode;
> + const char *referenced_size;
> + const char *referenced_limit_size = "-";
> + unsigned field_width = 0;
> +
> + unit_mode = get_unit_mode_from_arg(, argv, 1);
>  
>   while (1) {
>   int c;
> @@ -1112,6 +1120,44 @@ static int cmd_subvol_show(int argc, char **argv)
>   btrfs_list_subvols_print(fd, filter_set, NULL, BTRFS_LIST_LAYOUT_RAW,
>   1, raw_prefix);
>  
> + ret = btrfs_qgroup_query(fd, get_ri.root_id, );
> + if (ret < 0) {
> + if (ret == -ENODATA)
> + printf("Quotas must be enabled for per-subvolume 
> usage\n");

This seems a little confusing.
If quota is disabled, we get ENOTTY not ENODATA.

For ENODATA we know quota is enabled but just no info for this qgroup.

Thanks,
Qu

> + else if (ret != -ENOTTY)
> + fprintf(stderr,
> + "\nERROR: BTRFS_IOC_QUOTA_QUERY failed: %s\n",
> + strerror(errno));
> + goto out;
> + }
> +
> + printf("\tQuota Usage:\t\t");
> + fflush(stdout);
> +
> + referenced_size = pretty_size_mode(stats.info.referenced, unit_mode);
> + if (stats.limit.max_referenced)
> +referenced_limit_size = pretty_size_mode(
> + stats.limit.max_referenced,
> + unit_mode);
> + field_width = max(strlen(referenced_size),
> +   strlen(referenced_limit_size));
> +
> + printf("%-*s referenced, %s exclusive\n ", field_width,
> +referenced_size,
> +pretty_size_mode(stats.info.exclusive, unit_mode));
> +
> + printf("\tQuota Limits:\t\t");
> + if (stats.limit.max_referenced || stats.limit.max_exclusive) {
> + const char *excl = "-";
> +
> + if (stats.limit.max_exclusive)
> +excl = pretty_size_mode(stats.limit.max_exclusive,
> +unit_mode);
> + printf("%-*s referenced, %s exclusive\n", field_width,
> +referenced_limit_size, excl);
> + } else
> + printf("None\n");
> +
>  out:
>   /* clean up */
>   free(get_ri.path);
> 



signature.asc
Description: OpenPGP digital signature

Re: [PATCH 6/8] btrfs-progs: qgroups: introduce btrfs_qgroup_query

2018-03-06 Thread Qu Wenruo



On 2018年03月03日 02:47, je...@suse.com wrote:
> From: Jeff Mahoney 
> 
> The only mechanism we have in the progs for searching qgroups is to load
> all of them and filter the results.  This works for qgroup show but
> to add quota information to 'btrfs subvoluem show' it's pretty wasteful.
> 
> This patch splits out setting up the search and performing the search so
> we can search for a single qgroupid more easily.
> 
> Signed-off-by: Jeff Mahoney 
> ---
>  qgroup.c | 98 
> +---
>  qgroup.h |  7 +
>  2 files changed, 77 insertions(+), 28 deletions(-)
> 
> diff --git a/qgroup.c b/qgroup.c
> index b1be3311..2d0a6947 100644
> --- a/qgroup.c
> +++ b/qgroup.c
> @@ -1146,11 +1146,11 @@ static inline void print_status_flag_warning(u64 
> flags)
>   warning("qgroup data inconsistent, rescan recommended");
>  }
>  
> -static int __qgroups_search(int fd, struct qgroup_lookup *qgroup_lookup)
> +static int __qgroups_search(int fd, struct btrfs_ioctl_search_args *args,
> + struct qgroup_lookup *qgroup_lookup)
>  {
>   int ret;
> - struct btrfs_ioctl_search_args args;
> - struct btrfs_ioctl_search_key *sk = 
> + struct btrfs_ioctl_search_key *sk = >key;
>   struct btrfs_ioctl_search_header *sh;
>   unsigned long off = 0;
>   unsigned int i;
> @@ -1161,30 +1161,12 @@ static int __qgroups_search(int fd, struct 
> qgroup_lookup *qgroup_lookup)
>   u64 qgroupid;
>   u64 qgroupid1;
>  
> - memset(, 0, sizeof(args));
> -
> - sk->tree_id = BTRFS_QUOTA_TREE_OBJECTID;
> - sk->max_type = BTRFS_QGROUP_RELATION_KEY;
> - sk->min_type = BTRFS_QGROUP_STATUS_KEY;
> - sk->max_objectid = (u64)-1;
> - sk->max_offset = (u64)-1;
> - sk->max_transid = (u64)-1;
> - sk->nr_items = 4096;
> -
>   qgroup_lookup_init(qgroup_lookup);
>  
>   while (1) {
> - ret = ioctl(fd, BTRFS_IOC_TREE_SEARCH, );
> + ret = ioctl(fd, BTRFS_IOC_TREE_SEARCH, args);
>   if (ret < 0) {
> - if (errno == ENOENT) {
> - error("can't list qgroups: quotas not enabled");
> - ret = -ENOTTY;
> - } else {
> - error("can't list qgroups: %s",
> -strerror(errno));
> - ret = -errno;
> - }
> -
> + ret = -errno;
>   break;
>   }
>  
> @@ -1198,14 +1180,14 @@ static int __qgroups_search(int fd, struct 
> qgroup_lookup *qgroup_lookup)
>* read the root_ref item it contains
>*/
>   for (i = 0; i < sk->nr_items; i++) {
> - sh = (struct btrfs_ioctl_search_header *)(args.buf +
> + sh = (struct btrfs_ioctl_search_header *)(args->buf +
> off);
>   off += sizeof(*sh);
>  
>   switch (btrfs_search_header_type(sh)) {
>   case BTRFS_QGROUP_STATUS_KEY:
>   si = (struct btrfs_qgroup_status_item *)
> -  (args.buf + off);
> +  (args->buf + off);
>   flags = btrfs_stack_qgroup_status_flags(si);
>  
>   print_status_flag_warning(flags);
> @@ -1213,7 +1195,7 @@ static int __qgroups_search(int fd, struct 
> qgroup_lookup *qgroup_lookup)
>   case BTRFS_QGROUP_INFO_KEY:
>   qgroupid = btrfs_search_header_offset(sh);
>   info = (struct btrfs_qgroup_info_item *)
> -(args.buf + off);
> +(args->buf + off);
>  
>   ret = update_qgroup_info(fd, qgroup_lookup,
>qgroupid, info);
> @@ -1221,7 +1203,7 @@ static int __qgroups_search(int fd, struct 
> qgroup_lookup *qgroup_lookup)
>   case BTRFS_QGROUP_LIMIT_KEY:
>   qgroupid = btrfs_search_header_offset(sh);
>   limit = (struct btrfs_qgroup_limit_item *)
> - (args.buf + off);
> + (args->buf + off);
>  
>   ret = update_qgroup_limit(fd, qgroup_lookup,
> qgroupid, limit);
> @@ -1267,6 +1249,66 @@ static int __qgroups_search(int fd, struct 
> qgroup_lookup *qgroup_lookup)
>   return ret;
>  }
>  
> +static int qgroups_search_all(int fd, struct qgroup_lookup *qgroup_lookup)
> +{
> + struct btrfs_ioctl_search_args args = {
> + .key = {
> +

Re: [PATCH 6/8] btrfs-progs: qgroups: introduce btrfs_qgroup_query

2018-03-06 Thread Qu Wenruo



On 2018年03月03日 02:47, je...@suse.com wrote:
> From: Jeff Mahoney 
> 
> The only mechanism we have in the progs for searching qgroups is to load
> all of them and filter the results.  This works for qgroup show but
> to add quota information to 'btrfs subvoluem show' it's pretty wasteful.
> 
> This patch splits out setting up the search and performing the search so
> we can search for a single qgroupid more easily.
> 
> Signed-off-by: Jeff Mahoney 
> ---
>  qgroup.c | 98 
> +---
>  qgroup.h |  7 +
>  2 files changed, 77 insertions(+), 28 deletions(-)
> 
> diff --git a/qgroup.c b/qgroup.c
> index b1be3311..2d0a6947 100644
> --- a/qgroup.c
> +++ b/qgroup.c
> @@ -1146,11 +1146,11 @@ static inline void print_status_flag_warning(u64 
> flags)
>   warning("qgroup data inconsistent, rescan recommended");
>  }
>  
> -static int __qgroups_search(int fd, struct qgroup_lookup *qgroup_lookup)
> +static int __qgroups_search(int fd, struct btrfs_ioctl_search_args *args,
> + struct qgroup_lookup *qgroup_lookup)
>  {
>   int ret;
> - struct btrfs_ioctl_search_args args;
> - struct btrfs_ioctl_search_key *sk = 
> + struct btrfs_ioctl_search_key *sk = >key;
>   struct btrfs_ioctl_search_header *sh;
>   unsigned long off = 0;
>   unsigned int i;
> @@ -1161,30 +1161,12 @@ static int __qgroups_search(int fd, struct 
> qgroup_lookup *qgroup_lookup)
>   u64 qgroupid;
>   u64 qgroupid1;
>  
> - memset(, 0, sizeof(args));
> -
> - sk->tree_id = BTRFS_QUOTA_TREE_OBJECTID;
> - sk->max_type = BTRFS_QGROUP_RELATION_KEY;
> - sk->min_type = BTRFS_QGROUP_STATUS_KEY;
> - sk->max_objectid = (u64)-1;
> - sk->max_offset = (u64)-1;
> - sk->max_transid = (u64)-1;
> - sk->nr_items = 4096;
> -
>   qgroup_lookup_init(qgroup_lookup);
>  
>   while (1) {
> - ret = ioctl(fd, BTRFS_IOC_TREE_SEARCH, );
> + ret = ioctl(fd, BTRFS_IOC_TREE_SEARCH, args);
>   if (ret < 0) {
> - if (errno == ENOENT) {
> - error("can't list qgroups: quotas not enabled");
> - ret = -ENOTTY;
> - } else {
> - error("can't list qgroups: %s",
> -strerror(errno));
> - ret = -errno;
> - }
> -
> + ret = -errno;
>   break;
>   }
>  
> @@ -1198,14 +1180,14 @@ static int __qgroups_search(int fd, struct 
> qgroup_lookup *qgroup_lookup)
>* read the root_ref item it contains
>*/
>   for (i = 0; i < sk->nr_items; i++) {
> - sh = (struct btrfs_ioctl_search_header *)(args.buf +
> + sh = (struct btrfs_ioctl_search_header *)(args->buf +
> off);
>   off += sizeof(*sh);
>  
>   switch (btrfs_search_header_type(sh)) {
>   case BTRFS_QGROUP_STATUS_KEY:
>   si = (struct btrfs_qgroup_status_item *)
> -  (args.buf + off);
> +  (args->buf + off);
>   flags = btrfs_stack_qgroup_status_flags(si);
>  
>   print_status_flag_warning(flags);
> @@ -1213,7 +1195,7 @@ static int __qgroups_search(int fd, struct 
> qgroup_lookup *qgroup_lookup)
>   case BTRFS_QGROUP_INFO_KEY:
>   qgroupid = btrfs_search_header_offset(sh);
>   info = (struct btrfs_qgroup_info_item *)
> -(args.buf + off);
> +(args->buf + off);
>  
>   ret = update_qgroup_info(fd, qgroup_lookup,
>qgroupid, info);
> @@ -1221,7 +1203,7 @@ static int __qgroups_search(int fd, struct 
> qgroup_lookup *qgroup_lookup)
>   case BTRFS_QGROUP_LIMIT_KEY:
>   qgroupid = btrfs_search_header_offset(sh);
>   limit = (struct btrfs_qgroup_limit_item *)
> - (args.buf + off);
> + (args->buf + off);
>  
>   ret = update_qgroup_limit(fd, qgroup_lookup,
> qgroupid, limit);
> @@ -1267,6 +1249,66 @@ static int __qgroups_search(int fd, struct 
> qgroup_lookup *qgroup_lookup)
>   return ret;
>  }
>  
> +static int qgroups_search_all(int fd, struct qgroup_lookup *qgroup_lookup)
> +{
> + struct btrfs_ioctl_search_args args = {
> + .key = {
> +

Re: [PATCH 4/8] btrfs-progs: qgroups: add pathname to show output

2018-03-06 Thread Qu Wenruo



On 2018年03月03日 02:47, je...@suse.com wrote:
> From: Jeff Mahoney 
> 
> The btrfs qgroup show command currently only exports qgroup IDs,
> forcing the user to resolve which subvolume each corresponds to.
> 
> This patch adds pathname resolution to qgroup show so that when
> the -P option is used, the last column contains the pathname of
> the root of the subvolume it describes.  In the case of nested
> qgroups, it will show the number of member qgroups or the paths
> of the members if the -v option is used.
> 
> Pathname can also be used as a sort parameter.
> 
> Signed-off-by: Jeff Mahoney 
> ---
>  Documentation/btrfs-qgroup.asciidoc |   4 +
>  cmds-qgroup.c   |  17 -
>  kerncompat.h|   1 +
>  qgroup.c| 142 
> 
>  qgroup.h|   4 +-
>  utils.c |  22 --
>  utils.h |   2 +
>  7 files changed, 166 insertions(+), 26 deletions(-)
> 
> diff --git a/Documentation/btrfs-qgroup.asciidoc 
> b/Documentation/btrfs-qgroup.asciidoc
> index 3108457c..360b3269 100644
> --- a/Documentation/btrfs-qgroup.asciidoc
> +++ b/Documentation/btrfs-qgroup.asciidoc
> @@ -97,10 +97,14 @@ print child qgroup id.
>  print limit of referenced size of qgroup.
>  -e
>  print limit of exclusive size of qgroup.
> +-P
> +print pathname to the root of the subvolume managed by qgroup.  For nested 
> qgroups, the number of members will be printed unless -v is specified.
>  -F
>  list all qgroups which impact the given path(include ancestral qgroups)
>  -f
>  list all qgroups which impact the given path(exclude ancestral qgroups)
> +-v
> +Be more verbose.  Print pathnames of member qgroups when nested.
>  --raw
>  raw numbers in bytes, without the 'B' suffix.
>  --human-readable
> diff --git a/cmds-qgroup.c b/cmds-qgroup.c
> index 48686436..94cd0fd3 100644
> --- a/cmds-qgroup.c
> +++ b/cmds-qgroup.c
> @@ -280,8 +280,10 @@ static const char * const cmd_qgroup_show_usage[] = {
>   "   (including ancestral qgroups)",
>   "-f list all qgroups which impact the given path",
>   "   (excluding ancestral qgroups)",
> + "-P print first-level qgroups using pathname",
> + "-v verbose, prints all nested subvolumes",

Did you mean the subvolume paths of all children qgroups?

>   HELPINFO_UNITS_LONG,
> - "--sort=qgroupid,rfer,excl,max_rfer,max_excl",
> + "--sort=qgroupid,rfer,excl,max_rfer,max_excl,pathname",
>   "   list qgroups sorted by specified items",
>   "   you can use '+' or '-' in front of each item.",
>   "   (+:ascending, -:descending, ascending default)",
> @@ -299,6 +301,7 @@ static int cmd_qgroup_show(int argc, char **argv)
>   int filter_flag = 0;
>   unsigned unit_mode;
>   int sync = 0;
> + bool verbose = false;
>  
>   struct btrfs_qgroup_comparer_set *comparer_set;
>   struct btrfs_qgroup_filter_set *filter_set;
> @@ -316,10 +319,11 @@ static int cmd_qgroup_show(int argc, char **argv)
>   static const struct option long_options[] = {
>   {"sort", required_argument, NULL, GETOPT_VAL_SORT},
>   {"sync", no_argument, NULL, GETOPT_VAL_SYNC},
> + {"verbose", no_argument, NULL, 'v'},
>   { NULL, 0, NULL, 0 }
>   };
>  
> - c = getopt_long(argc, argv, "pcreFf", long_options, NULL);
> + c = getopt_long(argc, argv, "pPcreFfv", long_options, NULL);
>   if (c < 0)
>   break;
>   switch (c) {
> @@ -327,6 +331,10 @@ static int cmd_qgroup_show(int argc, char **argv)
>   btrfs_qgroup_setup_print_column(
>   BTRFS_QGROUP_PARENT);
>   break;
> + case 'P':
> + btrfs_qgroup_setup_print_column(
> + BTRFS_QGROUP_PATHNAME);
> + break;
>   case 'c':
>   btrfs_qgroup_setup_print_column(
>   BTRFS_QGROUP_CHILD);
> @@ -354,6 +362,9 @@ static int cmd_qgroup_show(int argc, char **argv)
>   case GETOPT_VAL_SYNC:
>   sync = 1;
>   break;
> + case 'v':
> + verbose = true;
> + break;
>   default:
>   usage(cmd_qgroup_show_usage);
>   }
> @@ -394,7 +405,7 @@ static int cmd_qgroup_show(int argc, char **argv)
>   BTRFS_QGROUP_FILTER_PARENT,
>   qgroupid);
>   }
> - ret = btrfs_show_qgroups(fd, filter_set, comparer_set);
> + ret = btrfs_show_qgroups(fd,

Re: spurious full btrfs corruption

2018-03-06 Thread Duncan

Christoph Anton Mitterer posted on Tue, 06 Mar 2018 01:57:58 +0100 as
excerpted:

> In the meantime I had a look of the remaining files that I got from the
> btrfs-restore (haven't run it again so far, from the OLD notebook, so
> only the results from the NEW notebook here:):
> 
> The remaining ones were multi-GB qcow2 images for some qemu VMs.
> I think I had non of these files open (i.e. VMs running) while in the
> final corruption phase... but at least I'm sure that not *all* of them
> were running.
> 
> However, all the qcow2 files from the restore are more or less garbage.
> During the btrfs-restore it already complained on them, that it would
> loop too often on them and whether I want to continue or not (I choose n
> and on another full run I choose y).
> 
> Some still contain a partition table, some partitions even filesystems
> (btrfs again)... but I cannot mount them.

Just a note on format choices FWIW, nothing at all to do with your 
current problem...

As my own use-case doesn't involve VMs I'm /far/ from an expert here, but 
if I'm screwing things up I'm sure someone will correct me and I'll learn 
something too, but it does /sound/ reasonable, so assuming I'm 
remembering correctly from a discussion here...

Tip: Btrfs and qcow2 are both copy-on-write/COW (it's in the qcow2 name, 
after all), and doing multiple layers of COW is both inefficient and a 
good candidate to test for corner-case bugs that wouldn't show up in 
more normal use-cases.  Assuming bug-free it /should/ work properly, of 
course, but equally of course, bug-free isn't an entirely realistic 
assumption. =8^0

... And you're putting btrfs on qcow2 on btrfs... THREE layers of COW!

The recommendation was thus to pick what layer you wish to COW at, and 
use something that's not COW-based at the other layers.  Apparently, qemu 
has raw-format as a choice as well as qcow2, and that was recommended as 
preferred for use with btrfs (and IIRC what the recommender was using 
himself).

But of course that still leaves cow-based btrfs on both the top and the 
bottom layers.  I suppose which of those is best to remain btrfs, while 
making the other say ext4 as widest used and hopefully safest general 
purpose non-COW alternative, depends on the use-case.

Of course keeping btrfs at both levels but nocowing the image files on 
the host btrfs is a possibility as well, but nocow on btrfs has enough 
limits and caveats that I consider it a second-class "really should have 
used a different filesystem for this but didn't want to bother setting up 
a dedicated one" choice, and as such, don't consider it a viable option 
here.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 2/2] btrfs: Add unprivileged version of ino_lookup ioctl

2018-03-06 Thread Misono, Tomohiro

On 2018/03/06 17:31, Misono, Tomohiro wrote:
> Add unprivileged version of ino_lookup ioctl (BTRFS_IOC_INO_LOOKUP_USER)
> to allow normal users to call "btrfs subvololume list/show" etc. in
> combination with BTRFS_IOC_GET_SUBVOL_INFO.
> 
> This can be used like BTRFS_IOC_INO_LOOKUP but the argument is
> different because it also returns the name of bottom subvolume,
> which is ommited by BTRFS_IOC_GET_SUBVOL_INFO ioctl.
> Also, this must be called with fd which exists under the tree of
> args->treeid to prevent for user to search any tree.
> 
> The main differences from original ino_lookup ioctl are:
>   1. Read permission will be checked using inode_permission()
>  during path construction. -EACCES will be returned in case
>  of failure.
>   2. Path construction will be stopped at the inode number which
>  corresponds to the fd with which this ioctl is called. If
>  constructed path does not exist under fd's inode, -EACCES
>  will be returned.
>   3. The name of bottom subvolume is also searched and filled.
> 
> Note that the maximum length of path is shorter 272 bytes than
> ino_lookup ioctl because of space of subvolume's id and name.
> 
> Signed-off-by: Tomohiro Misono 
> ---
>  fs/btrfs/ioctl.c   | 218 
> +
>  include/uapi/linux/btrfs.h |  16 
>  2 files changed, 234 insertions(+)
> 
> diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
> index 1dba309dce31..ac23da98b7e7 100644
> --- a/fs/btrfs/ioctl.c
> +++ b/fs/btrfs/ioctl.c
> @@ -2425,6 +2425,179 @@ static noinline int btrfs_search_path_in_tree(struct 
> btrfs_fs_info *info,
>   return ret;
>  }
>  
> +static noinline int btrfs_search_path_in_tree_user(struct btrfs_fs_info 
> *info,
> + struct super_block *sb,
> + struct btrfs_key upper_limit,
> + struct btrfs_ioctl_ino_lookup_user_args *args)
> +{
> + struct btrfs_root *root;
> + struct btrfs_key key, key2;
> + char *ptr;
> + int ret = -1;
> + int slot;
> + int len;
> + int total_len = 0;
> + u64 dirid = args->dirid;
> + struct btrfs_inode_ref *iref;
> + struct btrfs_root_ref rref;
> +
> + unsigned long item_off;
> + unsigned long item_len;
> +
> + struct extent_buffer *l;
> + struct btrfs_path *path = NULL;
> +
> + struct inode *inode;
> + int *new = 0;
> +
> + path = btrfs_alloc_path();
> + if (!path)
> + return -ENOMEM;
> +
> + if (dirid == upper_limit.objectid)
> + /*
> +  * If the bottom subvolume exists directly under upper limit,
> +  * there is no need to construct the path and just get the
> +  * subvolume's name
> +  */
> + goto get_subvol_name;
> + if (dirid == BTRFS_FIRST_FREE_OBJECTID)
> + /* The subvolume does not exist under upper_limit */
> + goto access_err;
> +
> + ptr = >path[BTRFS_INO_LOOKUP_PATH_MAX - 1];
> +
> + key.objectid = args->treeid;
> + key.type = BTRFS_ROOT_ITEM_KEY;
> + key.offset = (u64)-1;
> + root = btrfs_read_fs_root_no_name(info, );
> + if (IS_ERR(root)) {
> + btrfs_err(info, "could not find root %llu", args->treeid);
> + ret = -ENOENT;
> + goto out;
> + }
> +
> + key.objectid = dirid;
> + key.type = BTRFS_INODE_REF_KEY;
> + key.offset = (u64)-1;
> +
> + while (1) {
> + ret = btrfs_search_slot(NULL, root, , path, 0, 0);
> + if (ret < 0)
> + goto out;
> + else if (ret > 0) {
> + ret = btrfs_previous_item(root, path, dirid,
> +   BTRFS_INODE_REF_KEY);
> + if (ret < 0)
> + goto out;
> + else if (ret > 0) {
> + ret = -ENOENT;
> + goto out;
> + }
> + }
> +
> + l = path->nodes[0];
> + slot = path->slots[0];
> + btrfs_item_key_to_cpu(l, , slot);
> +
> + iref = btrfs_item_ptr(l, slot, struct btrfs_inode_ref);
> + len = btrfs_inode_ref_name_len(l, iref);
> + ptr -= len + 1;
> + total_len += len + 1;
> + if (ptr < args->path) {
> + ret = -ENAMETOOLONG;
> + goto out;
> + }
> +
> + *(ptr + len) = '/';
> + read_extent_buffer(l, ptr, (unsigned long)(iref + 1), len);
> +
> + /* Check the read permission */
> + ret = btrfs_previous_item(root, path, dirid,
> +   BTRFS_INODE_ITEM_KEY);
> + if (ret < 0) {
> + goto out;
> + } else if (ret > 0) {
> + ret = -ENOENT;
>

Re: [PATCH 1/2] btrfs: Add unprivileged subvolume search ioctl

2018-03-06 Thread Misono, Tomohiro

On 2018/03/07 5:29, Goffredo Baroncelli wrote:
> On 03/06/2018 09:30 AM, Misono, Tomohiro wrote:
>> Add new unprivileged ioctl (BTRFS_IOC_GET_SUBVOL_INFO) which searches
>> and returns only subvolume related item (ROOT_ITEM/ROOT_BACKREF/ROOT_REF)
>> from root tree. The arguments of this ioctl are the same as treesearch
>> ioctl and can be used like treesearch ioctl.
> 
> Is it a pro ? The treesearch ioctl is tightly coupled to the btrfs internal 
> structure, this means that if we would change the btrfs internal structure, 
> we have to update all the clients of this api. For the treesearch it is an 
> acceptable compromise between flexibility and speed of developing. But for a 
> more specialized API, I think that it would make sense to provide a less 
> coupled api to the internal structure.

Thanks for the comments.

The reason I choose the same api is just to minimize the code change in 
btrfs-progs.
For tree search part, it works just switching the ioctl number from 
BTRFS_IOC_TREE_SEARCH
to BTRFS_IOC_GET_SUBVOL_INFO in list_subvol_search()[1].

[1] https://marc.info/?l=linux-btrfs=152032537911218=2

> 
> Below some comments
> 
> 
>>
>> Since treesearch ioctl requires root privilege, this ioctl is needed
>> to allow normal users to call "btrfs subvolume list/show" etc.
>>
>> Note that the subvolume name in ROOT_BACKREF/ROOT_REF will not be copied to
>> prevent potential name leak. The name can be obtained by calling
>> user version's ino_lookup ioctl (BTRFS_IOC_INO_LOOKUP_USER).
>>
>> Signed-off-by: Tomohiro Misono 
>> ---
>>  fs/btrfs/ioctl.c   | 254 
>> +
>>  include/uapi/linux/btrfs.h |   2 +
>>  2 files changed, 256 insertions(+)
>>
>> diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
>> index 111ee282b777..1dba309dce31 100644
>> --- a/fs/btrfs/ioctl.c
>> +++ b/fs/btrfs/ioctl.c
>> @@ -1921,6 +1921,28 @@ static noinline int key_in_sk(struct btrfs_key *key,
>>  return 1;
>>  }
>>  
>> +/*
>> + * check if key is in sk and subvolume related range
>> + */
>> +static noinline int key_in_sk_and_subvol(struct btrfs_key *key,
>> +  struct btrfs_ioctl_search_key *sk)
>> +{
>> +int ret;
>> +
>> +ret = key_in_sk(key, sk);
>> +if (!ret)
>> +return 0;
>> +
>> +if ((key->objectid == BTRFS_FS_TREE_OBJECTID ||
>> +(key->objectid >= BTRFS_FIRST_FREE_OBJECTID &&
>> + key->objectid <= BTRFS_LAST_FREE_OBJECTID)) &&
>> +key->type >= BTRFS_ROOT_ITEM_KEY &&
>> +key->type <= BTRFS_ROOT_BACKREF_KEY)
> 
> Why returning all the range [BTRFS_ROOT_ITEM_KEY...BTRFS_ROOT_BACKREF_KEY]. 
> It would be sufficient to return only
> 
>   +   (key->type == BTRFS_ROOT_ITEM_KEY ||
>   +key->type == BTRFS_ROOT_BACKREF_KEY))

Sorry, it is a mistake, I mean "key->type <= BTRFS_ROOTREF_KEY".
Although btrfs-progs only uses BTRFS_ROOT_BACKREF_KEY, I notice libbtrfs 
uses BTRFS_ROOT_REF_KEY instead. So, I think it is better to return both
ROOT_BACKREF_KEY and ROOT_REF_KEY. I will fix this in v2.

> 
> 
>> +return 1;
>> +
>> +return 0;
>> +}
>> +
>>  static noinline int copy_to_sk(struct btrfs_path *path,
>> struct btrfs_key *key,
>> struct btrfs_ioctl_search_key *sk,
>> @@ -2045,6 +2067,142 @@ static noinline int copy_to_sk(struct btrfs_path 
>> *path,
>>  return ret;
>>  }
>>  
>> +/*
>> + * Basically the same as copy_to_sk() but restricts the copied item
>> + * within subvolume related one using key_in_sk_and_subvol().
>> + * Also, the name of subvolume will be omitted from
>> + * ROOT_BACKREF/ROOT_REF item.
>> + */
>> +static noinline int copy_subvol_item(struct btrfs_path *path,
>> +   struct btrfs_key *key,
>> +   struct btrfs_ioctl_search_key *sk,
>> +   size_t *buf_size,
>> +   char __user *ubuf,
>> +   unsigned long *sk_offset,
>> +   int *num_found)
>> +{
>> +u64 found_transid;
>> +struct extent_buffer *leaf;
>> +struct btrfs_ioctl_search_header sh;
>> +struct btrfs_key test;
>> +unsigned long item_off;
>> +unsigned long item_len;
>> +int nritems;
>> +int i;
>> +int slot;
>> +int ret = 0;
>> +
>> +leaf = path->nodes[0];
>> +slot = path->slots[0];
>> +nritems = btrfs_header_nritems(leaf);
>> +
>> +if (btrfs_header_generation(leaf) > sk->max_transid) {
>> +i = nritems;
>> +goto advance_key;
>> +}
>> +found_transid = btrfs_header_generation(leaf);
>> +
>> +for (i = slot; i < nritems; i++) {
>> +item_off = btrfs_item_ptr_offset(leaf, i);
>> +item_len = btrfs_item_size_nr(leaf, i);
>> +
>> +btrfs_item_key_to_cpu(leaf, key, i);
>> +if (!key_in_sk_and_subvol(key, sk))
>> +continue;

Re: ERROR: unsupported checksum algorithm 35355

2018-03-06 Thread Ken Swenson

Thank you so much for fixing that superblock! Wouldn't even know where
to begin myself.

btrfs check returned

Checking filesystem on /dev/mapper/x
UUID: 513f07e1-08be-4d94-a55c-11c6251f6c2f
checking extents
checking free space cache
checking fs roots
checking csums
checking root refs
found 420849201152 bytes used, no error found
total csum bytes: 405265360
total tree bytes: 583565312
total fs tree bytes: 56459264
total extent tree bytes: 56328192
btree space waste bytes: 74766514
file data blocks allocated: 440751226880
referenced 419705761792

So I believe everything else looks good. I've successfully mounted the
file system and can access all my data.

Thanks again,

Ken


On 03/06/2018 03:45 AM, Qu Wenruo wrote:
> Here is the fixed superblock.
>
> csum type and incompat flags get fixed.
>
> I'm not sure if they are the only problems, but I strongly recommend to
> run btrfs check before mount.
>
> Thanks,
> Qu
>
> On 2018年03月06日 10:22, Ken Swenson wrote:
>> Hi Qu,
>>
>> attached is the binary super block as requested.
>>
>> Thank you,
>>
>> Ken
>>
>> On 03/05/2018 09:07 PM, Qu Wenruo wrote:
>>> On 2018年03月06日 09:51, Ken Swenson wrote:
 Hello,
  
 Somehow it appears the csum_type on my btrfs file system got corrupted.
 I cannot mount the system in recovery or read only. btrfs check just
 returns "ERROR: unsupported checksum algorithm 35355" as well as btrfs
 recover. The only command I was able to successfully run was btrfs
 inspect-internal dump-super, which I've pasted the output at the end of
 this message.

 Requested information from the wiki:
 Linux 4.15.7-1-ARCH #1 SMP PREEMPT Wed Feb 28 19:01:57 UTC 2018 x86_64
 GNU/Linux
 btrfs-progs v4.15.1
 btrfs fi show: ERROR: unsupported checksum algorithm 35355 ERROR: cannot
 scan /dev/mapper/x: Input/output error

 dmesg:
 [ 11.232399] Btrfs loaded, crc32c=crc32c-intel
 [ 11.233229] BTRFS: device fsid 513f07e1-08be-4d94-a55c-11c6251f6c2f
 devid 1 transid  /dev/dm-2
 [ 488.372891] BTRFS error (device dm-2): unsupported checksum algorithm
 35355
 [ 488.372894] BTRFS error (device dm-2): superblock checksum mismatch
 [ 488.384902] BTRFS error (device dm-2): open_ctree failed

 Is there anything I can do to recovery from this or am I out of luck? To
 give some background the disk was working fine until I upgraded to
 Kernel 4.15.7 and rebooted.

 superblock: bytenr=65536, device=/dev/mapper/x
 -
 csum_type        35355 (INVALID)
>>> This is obviously corrupted.
>>>
>>> Btrfs only supports csum_type 0 (CRC32) yet.
>>>
>>> And the value seems to be some garbage.
>>>
 csum_size        32
>>> So is the csum size.
>>>
 csum           
 0xf0dbeddd [match]
>>> Surprised to see the csum even matched.
>>>
 bytenr            65536
 flags            0x1
             ( WRITTEN )
 magic            _BHRfS_M [match]
 fsid            513f07e1-08be-4d94-a55c-11c6251f6c2f
 label           
 generation        
 root            186466304
 sys_array_size        129
 chunk_root_generation    7450
 root_level        1
 chunk_root        21004288
 chunk_root_level    1
 log_root        0
 log_root_transid    0
 log_root_level        0
 total_bytes        5000947523584
 bytes_used        420849201152
 sectorsize        4096
 nodesize        16384
 leafsize (deprecated)        16384
 stripesize        4096
 root_dir        6
 num_devices        1
 compat_flags        0x0
 compat_ro_flags        0x0
 incompat_flags        0x176d2169
             ( MIXED_BACKREF |
               COMPRESS_LZO |
               BIG_METADATA |
               EXTENDED_IREF |
               SKINNY_METADATA |
               unknown flag: 0x176d2000 )
>>> And unknown flags also exists.
>>>
>>> And according to later output, all backup super blocks have the same
>>> corruption while csum still matches.
>>>
>>> I'm wondering if the memory is corrupted.
>>>
>>> It's possible to manually modify the superblock to a valid status.
>>> As all the corruption is obvious, but I'm not 100% sure if there is
>>> other corruption.
>>>
>>> Please provide the binary superblock dump by:
>>>
>>> dd if= bs=1 count=4K skip=64K of=super_copy
>>>
>>> Thanks,
>>> Qu
>>>
 cache_generation    
 uuid_tree_generation    
 dev_item.uuid        880c692f-5270-4c7a-908d-8b3956fb3790
 dev_item.fsid        513f07e1-08be-4d94-a55c-11c6251f6c2f [match]
 dev_item.type        0
 dev_item.total_bytes    5000947523584
 dev_item.bytes_used    457439182848
 dev_item.io_align    4096
 dev_item.io_width    4096
 dev_item.sector_size    4096
 dev_item.devid        1
 dev_item.dev_group    0

Re: [PATCH 1/2] btrfs: Add unprivileged subvolume search ioctl

2018-03-06 Thread Goffredo Baroncelli

On 03/06/2018 09:30 AM, Misono, Tomohiro wrote:
> Add new unprivileged ioctl (BTRFS_IOC_GET_SUBVOL_INFO) which searches
> and returns only subvolume related item (ROOT_ITEM/ROOT_BACKREF/ROOT_REF)
> from root tree. The arguments of this ioctl are the same as treesearch
> ioctl and can be used like treesearch ioctl.

Is it a pro ? The treesearch ioctl is tightly coupled to the btrfs internal 
structure, this means that if we would change the btrfs internal structure, we 
have to update all the clients of this api. For the treesearch it is an 
acceptable compromise between flexibility and speed of developing. But for a 
more specialized API, I think that it would make sense to provide a less 
coupled api to the internal structure.

Below some comments


> 
> Since treesearch ioctl requires root privilege, this ioctl is needed
> to allow normal users to call "btrfs subvolume list/show" etc.
> 
> Note that the subvolume name in ROOT_BACKREF/ROOT_REF will not be copied to
> prevent potential name leak. The name can be obtained by calling
> user version's ino_lookup ioctl (BTRFS_IOC_INO_LOOKUP_USER).
> 
> Signed-off-by: Tomohiro Misono 
> ---
>  fs/btrfs/ioctl.c   | 254 
> +
>  include/uapi/linux/btrfs.h |   2 +
>  2 files changed, 256 insertions(+)
> 
> diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
> index 111ee282b777..1dba309dce31 100644
> --- a/fs/btrfs/ioctl.c
> +++ b/fs/btrfs/ioctl.c
> @@ -1921,6 +1921,28 @@ static noinline int key_in_sk(struct btrfs_key *key,
>   return 1;
>  }
>  
> +/*
> + * check if key is in sk and subvolume related range
> + */
> +static noinline int key_in_sk_and_subvol(struct btrfs_key *key,
> +   struct btrfs_ioctl_search_key *sk)
> +{
> + int ret;
> +
> + ret = key_in_sk(key, sk);
> + if (!ret)
> + return 0;
> +
> + if ((key->objectid == BTRFS_FS_TREE_OBJECTID ||
> + (key->objectid >= BTRFS_FIRST_FREE_OBJECTID &&
> +  key->objectid <= BTRFS_LAST_FREE_OBJECTID)) &&
> + key->type >= BTRFS_ROOT_ITEM_KEY &&
> + key->type <= BTRFS_ROOT_BACKREF_KEY)

Why returning all the range [BTRFS_ROOT_ITEM_KEY...BTRFS_ROOT_BACKREF_KEY]. It 
would be sufficient to return only

  + (key->type == BTRFS_ROOT_ITEM_KEY ||
  +  key->type == BTRFS_ROOT_BACKREF_KEY))


> + return 1;
> +
> + return 0;
> +}
> +
>  static noinline int copy_to_sk(struct btrfs_path *path,
>  struct btrfs_key *key,
>  struct btrfs_ioctl_search_key *sk,
> @@ -2045,6 +2067,142 @@ static noinline int copy_to_sk(struct btrfs_path 
> *path,
>   return ret;
>  }
>  
> +/*
> + * Basically the same as copy_to_sk() but restricts the copied item
> + * within subvolume related one using key_in_sk_and_subvol().
> + * Also, the name of subvolume will be omitted from
> + * ROOT_BACKREF/ROOT_REF item.
> + */
> +static noinline int copy_subvol_item(struct btrfs_path *path,
> +struct btrfs_key *key,
> +struct btrfs_ioctl_search_key *sk,
> +size_t *buf_size,
> +char __user *ubuf,
> +unsigned long *sk_offset,
> +int *num_found)
> +{
> + u64 found_transid;
> + struct extent_buffer *leaf;
> + struct btrfs_ioctl_search_header sh;
> + struct btrfs_key test;
> + unsigned long item_off;
> + unsigned long item_len;
> + int nritems;
> + int i;
> + int slot;
> + int ret = 0;
> +
> + leaf = path->nodes[0];
> + slot = path->slots[0];
> + nritems = btrfs_header_nritems(leaf);
> +
> + if (btrfs_header_generation(leaf) > sk->max_transid) {
> + i = nritems;
> + goto advance_key;
> + }
> + found_transid = btrfs_header_generation(leaf);
> +
> + for (i = slot; i < nritems; i++) {
> + item_off = btrfs_item_ptr_offset(leaf, i);
> + item_len = btrfs_item_size_nr(leaf, i);
> +
> + btrfs_item_key_to_cpu(leaf, key, i);
> + if (!key_in_sk_and_subvol(key, sk))
> + continue;
> +
> + /* will not copy the name of subvolume */
> + if (key->type == BTRFS_ROOT_BACKREF_KEY ||
> + key->type == BTRFS_ROOT_REF_KEY)
> + item_len = sizeof(struct btrfs_root_ref);
> +
> + if (sizeof(sh) + item_len > *buf_size) {
> + if (*num_found) {
> + ret = 1;
> + goto out;
> + }
> +
> + /*
> +  * return one empty item back for v1, which does not
> +  * handle -EOVERFLOW
> +  */
It is still applicable ?
> +
> + *buf_size = sizeof(sh) + item_len;

Re: How to change/fix 'Received UUID'

2018-03-06 Thread Marc MERLIN

On Tue, Mar 06, 2018 at 08:12:15PM +0100, Hans van Kranenburg wrote:
> On 05/03/2018 20:47, Marc MERLIN wrote:
> > On Mon, Mar 05, 2018 at 10:38:16PM +0300, Andrei Borzenkov wrote:
> >>> If I absolutely know that the data is the same on both sides, how do I
> >>> either
> >>> 1) force back in a 'Received UUID' value on the destination
> >>
> >> I suppose the most simple is to write small program that does it using
> >> BTRFS_IOC_SET_RECEIVED_SUBVOL.
> > 
> > Understdood.
> > Given that I have not worked with the code at all, what is the best 
> > tool in btrfs progs, to add this to?
> > 
> > btrfstune?
> > btrfs propery set?
> > other?
> > 
> > David, is this something you'd be willing to add support for?
> > (to be honest, it'll be quicker for someone who knows the code to add than
> > for me, but if no one has the time, I'l see if I can have a shot at it)
> 
> If you want something right now that works, so you can continue doing
> your backups, python-btrfs also has the ioctl, since v9, together with
> an example of using it:
> 
> https://github.com/knorrie/python-btrfs/commit/1ace623f95300ecf581b1182780fd6432a46b24d

Well, I had never heard about it until now, thank you.

I'll see if I can make it work when I get a bit of time.

Dear btrfs-progs folks, this would be great to add to the canonical
btrfs-progs too :)

Thanks,
Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems 
   what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/   | PGP 7F55D5F27AAF9D08
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH v8 00/63] XArray v8

2018-03-06 Thread Matthew Wilcox

From: Matthew Wilcox <mawil...@microsoft.com>

This patchset is, I believe, appropriate for merging for 4.17.
It contains the XArray implementation, to eventually replace the radix
tree, and converts the page cache to use it.

This conversion keeps the radix tree and XArray data structures in sync
at all times.  That allows us to convert the page cache one function at
a time and should allow for easier bisection.  Other than renaming some
elements of the structures, the data structures are fundamentally
unchanged; a radix tree walk and an XArray walk will touch the same
number of cachelines.  I have changes planned to the XArray data
structure, but those will happen in future patches.

Improvements the XArray has over the radix tree:

 - The radix tree provides operations like other trees do; 'insert' and
   'delete'.  But what most users really want is an automatically resizing
   array, and so it makes more sense to give users an API that is like
   an array -- 'load' and 'store'.  We still have an 'insert' operation
   for users that really want that semantic.
 - The XArray considers locking as part of its API.  This simplifies a lot
   of users who formerly had to manage their own locking just for the
   radix tree.  It also improves code generation as we can now tell RCU
   that we're holding a lock and it doesn't need to generate as much
   fencing code.  The other advantage is that tree nodes can be moved
   (not yet implemented).
 - GFP flags are now parameters to calls which may need to allocate
   memory.  The radix tree forced users to decide what the allocation
   flags would be at creation time.  It's much clearer to specify them
   at allocation time.
 - Memory is not preloaded; we don't tie up dozens of pages on the
   off chance that the slab allocator fails.  Instead, we drop the lock,
   allocate a new node and retry the operation.  We have to convert all
   the radix tree, IDA and IDR preload users before we can realise this
   benefit, but I have not yet found a user which cannot be converted.
 - The XArray provides a cmpxchg operation.  The radix tree forces users
   to roll their own (and at least four have).
 - Iterators take a 'max' parameter.  That simplifies many users and
   will reduce the amount of iteration done.
 - Iteration can proceed backwards.  We only have one user for this, but
   since it's called as part of the pagefault readahead algorithm, that
   seemed worth mentioning.
 - RCU-protected pointers are not exposed as part of the API.  There are
   some fun bugs where the page cache forgets to use rcu_dereference()
   in the current codebase.
 - Value entries gain an extra bit compared to radix tree exceptional
   entries.  That gives us the extra bit we need to put huge page swap
   entries in the page cache.
 - Some iterators now take a 'filter' argument instead of having
   separate iterators for tagged/untagged iterations.

The page cache is improved by this:
 - Shorter, easier to read code
 - More efficient iterations
 - Reduction in size of struct address_space
 - Fewer walks from the top of the data structure; the XArray API
   encourages staying at the leaf node and conducting operations there.

Changes since v7:
 - Added acks from Jeff Layton (thanks!)
 - Renamed the address_space ->pages to ->i_pages
 - Used GFP_ZONEMASK instead of the more obtuse shifting by 4 (Jeff Layton)
 - Realised that page_cache_range_empty() and filemap_range_has_page()
   were essentially the same function, so redid that pair of patches
 - A few checkpatch fixes
 - Added an SPDX tag to a missed file
 - Rebased on next-20180306
   - memfd moved out of shmem
   - nds32 needed its flush_dcache_mmap_lock fixed
   - mac80211_hwsim had added a use of IDA_INIT
 - Improved some documentation and commit messages
 - Split the rearrangement of struct address_space into its own patch (Kirill)
 - address_space documentation improvements had somehow worked their way
   into an unrelated patch; move them into the rearrangement patch.
 - Removed chunks of radix tree functionality that are not used any more.

Matthew Wilcox (63):
  mac80211_hwsim: Use DEFINE_IDA
  radix tree: Use GFP_ZONEMASK bits of gfp_t for flags
  arm64: Turn flush_dcache_mmap_lock into a no-op
  unicore32: Turn flush_dcache_mmap_lock into a no-op
  Export __set_page_dirty
  btrfs: Use filemap_range_has_page()
  xfs: Rename xa_ elements to ail_
  fscache: Use appropriate radix tree accessors
  xarray: Add the xa_lock to the radix_tree_root
  page cache: Use xa_lock
  xarray: Replace exceptional entries
  xarray: Change definition of sibling entries
  xarray: Add definition of struct xarray
  xarray: Define struct xa_node
  xarray: Add documentation
  xarray: Add xa_load
  xarray: Add xa_get_tag, xa_set_tag and xa_clear_tag
  xarray: Add xa_store
  xarray: Add xa_cmpxchg and xa_insert
  xarray: Add xa_for_each
  xarray: Add xa_extract
  xarray: Add xa_destroy
  xarray: Add xas_next and xas_prev
  xarray: Add xas_

[PATCH v8 03/63] arm64: Turn flush_dcache_mmap_lock into a no-op

2018-03-06 Thread Matthew Wilcox

From: Matthew Wilcox 

ARM64 doesn't walk the VMA tree in its flush_dcache_page()
implementation, so has no need to take the tree_lock.

Signed-off-by: Matthew Wilcox 
Reviewed-by: Will Deacon 
---
 arch/arm64/include/asm/cacheflush.h | 6 ++
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/arch/arm64/include/asm/cacheflush.h 
b/arch/arm64/include/asm/cacheflush.h
index bef9f418f089..550b0abea953 100644
--- a/arch/arm64/include/asm/cacheflush.h
+++ b/arch/arm64/include/asm/cacheflush.h
@@ -137,10 +137,8 @@ static inline void __flush_icache_all(void)
dsb(ish);
 }
 
-#define flush_dcache_mmap_lock(mapping) \
-   spin_lock_irq(&(mapping)->tree_lock)
-#define flush_dcache_mmap_unlock(mapping) \
-   spin_unlock_irq(&(mapping)->tree_lock)
+#define flush_dcache_mmap_lock(mapping)do { } while (0)
+#define flush_dcache_mmap_unlock(mapping)  do { } while (0)
 
 /*
  * We don't appear to need to do anything here.  In fact, if we did, we'd
-- 
2.16.1

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH v8 05/63] Export __set_page_dirty

2018-03-06 Thread Matthew Wilcox

From: Matthew Wilcox 

XFS currently contains a copy-and-paste of __set_page_dirty().  Export
it from buffer.c instead.

Signed-off-by: Matthew Wilcox 
Acked-by: Jeff Layton 
---
 fs/buffer.c|  3 ++-
 fs/xfs/xfs_aops.c  | 15 ++-
 include/linux/mm.h |  1 +
 3 files changed, 5 insertions(+), 14 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index 3b2b415f1fcd..a17d47b55de1 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -594,7 +594,7 @@ EXPORT_SYMBOL(mark_buffer_dirty_inode);
  *
  * The caller must hold lock_page_memcg().
  */
-static void __set_page_dirty(struct page *page, struct address_space *mapping,
+void __set_page_dirty(struct page *page, struct address_space *mapping,
 int warn)
 {
unsigned long flags;
@@ -608,6 +608,7 @@ static void __set_page_dirty(struct page *page, struct 
address_space *mapping,
}
spin_unlock_irqrestore(>tree_lock, flags);
 }
+EXPORT_SYMBOL_GPL(__set_page_dirty);
 
 /*
  * Add a page to the dirty page list.
diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 9c6a830da0ee..31f2c4895a46 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -1472,19 +1472,8 @@ xfs_vm_set_page_dirty(
newly_dirty = !TestSetPageDirty(page);
spin_unlock(>private_lock);
 
-   if (newly_dirty) {
-   /* sigh - __set_page_dirty() is static, so copy it here, too */
-   unsigned long flags;
-
-   spin_lock_irqsave(>tree_lock, flags);
-   if (page->mapping) {/* Race with truncate? */
-   WARN_ON_ONCE(!PageUptodate(page));
-   account_page_dirtied(page, mapping);
-   radix_tree_tag_set(>page_tree,
-   page_index(page), PAGECACHE_TAG_DIRTY);
-   }
-   spin_unlock_irqrestore(>tree_lock, flags);
-   }
+   if (newly_dirty)
+   __set_page_dirty(page, mapping, 1);
unlock_page_memcg(page);
if (newly_dirty)
__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 9f1270360983..8cf4714bfec8 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1454,6 +1454,7 @@ extern int try_to_release_page(struct page * page, gfp_t 
gfp_mask);
 extern void do_invalidatepage(struct page *page, unsigned int offset,
  unsigned int length);
 
+void __set_page_dirty(struct page *, struct address_space *, int warn);
 int __set_page_dirty_nobuffers(struct page *page);
 int __set_page_dirty_no_writeback(struct page *page);
 int redirty_page_for_writepage(struct writeback_control *wbc,
-- 
2.16.1

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH v8 04/63] unicore32: Turn flush_dcache_mmap_lock into a no-op

2018-03-06 Thread Matthew Wilcox

From: Matthew Wilcox 

Unicore doesn't walk the VMA tree in its flush_dcache_page()
implementation, so has no need to take the tree_lock.

Signed-off-by: Matthew Wilcox 
---
 arch/unicore32/include/asm/cacheflush.h | 6 ++
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/arch/unicore32/include/asm/cacheflush.h 
b/arch/unicore32/include/asm/cacheflush.h
index a5e08e2d5d6d..1d9132b66039 100644
--- a/arch/unicore32/include/asm/cacheflush.h
+++ b/arch/unicore32/include/asm/cacheflush.h
@@ -170,10 +170,8 @@ extern void flush_cache_page(struct vm_area_struct *vma,
 #define ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE 1
 extern void flush_dcache_page(struct page *);
 
-#define flush_dcache_mmap_lock(mapping)\
-   spin_lock_irq(&(mapping)->tree_lock)
-#define flush_dcache_mmap_unlock(mapping)  \
-   spin_unlock_irq(&(mapping)->tree_lock)
+#define flush_dcache_mmap_lock(mapping)do { } while (0)
+#define flush_dcache_mmap_unlock(mapping)  do { } while (0)
 
 #define flush_icache_user_range(vma, page, addr, len)  \
flush_dcache_page(page)
-- 
2.16.1

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH v8 08/63] fscache: Use appropriate radix tree accessors

2018-03-06 Thread Matthew Wilcox

From: Matthew Wilcox 

Don't open-code accesses to data structure internals.

Signed-off-by: Matthew Wilcox 
Reviewed-by: Jeff Layton 
---
 fs/fscache/cookie.c | 2 +-
 fs/fscache/object.c | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/fscache/cookie.c b/fs/fscache/cookie.c
index ff84258132bb..e9054e0c1a49 100644
--- a/fs/fscache/cookie.c
+++ b/fs/fscache/cookie.c
@@ -608,7 +608,7 @@ void __fscache_relinquish_cookie(struct fscache_cookie 
*cookie, bool retire)
/* Clear pointers back to the netfs */
cookie->netfs_data  = NULL;
cookie->def = NULL;
-   BUG_ON(cookie->stores.rnode);
+   BUG_ON(!radix_tree_empty(>stores));
 
if (cookie->parent) {
ASSERTCMP(atomic_read(>parent->usage), >, 0);
diff --git a/fs/fscache/object.c b/fs/fscache/object.c
index 7a182c87f378..aa0e71f02c33 100644
--- a/fs/fscache/object.c
+++ b/fs/fscache/object.c
@@ -956,7 +956,7 @@ static const struct fscache_state 
*_fscache_invalidate_object(struct fscache_obj
 * retire the object instead.
 */
if (!fscache_use_cookie(object)) {
-   ASSERT(object->cookie->stores.rnode == NULL);
+   ASSERT(radix_tree_empty(>cookie->stores));
set_bit(FSCACHE_OBJECT_RETIRED, >flags);
_leave(" [no cookie]");
return transit_to(KILL_OBJECT);
-- 
2.16.1

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH v8 06/63] btrfs: Use filemap_range_has_page()

2018-03-06 Thread Matthew Wilcox

From: Matthew Wilcox 

The current implementation of btrfs_page_exists_in_range() gives the
wrong answer if the workingset code has stored a shadow entry in the
page cache.  The filemap_range_has_page() function does not have this
problem, and it's shared code, so use it instead.

Signed-off-by: Matthew Wilcox 
---
 fs/btrfs/btrfs_inode.h |  6 -
 fs/btrfs/inode.c   | 70 --
 2 files changed, 5 insertions(+), 71 deletions(-)

diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
index f527e99c9f8d..078a53e01ece 100644
--- a/fs/btrfs/btrfs_inode.h
+++ b/fs/btrfs/btrfs_inode.h
@@ -364,6 +364,10 @@ static inline void btrfs_print_data_csum_error(struct 
btrfs_inode *inode,
logical_start, csum, csum_expected, mirror_num);
 }
 
-bool btrfs_page_exists_in_range(struct inode *inode, loff_t start, loff_t end);
+static inline bool btrfs_page_exists_in_range(struct inode *inode,
+   loff_t start, loff_t end)
+{
+   return filemap_range_has_page(inode->i_mapping, start, end);
+}
 
 #endif
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 1f5b93ecffca..3340de232944 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -7476,76 +7476,6 @@ noinline int can_nocow_extent(struct inode *inode, u64 
offset, u64 *len,
return ret;
 }
 
-bool btrfs_page_exists_in_range(struct inode *inode, loff_t start, loff_t end)
-{
-   struct radix_tree_root *root = >i_mapping->page_tree;
-   bool found = false;
-   void **pagep = NULL;
-   struct page *page = NULL;
-   unsigned long start_idx;
-   unsigned long end_idx;
-
-   start_idx = start >> PAGE_SHIFT;
-
-   /*
-* end is the last byte in the last page.  end == start is legal
-*/
-   end_idx = end >> PAGE_SHIFT;
-
-   rcu_read_lock();
-
-   /* Most of the code in this while loop is lifted from
-* find_get_page.  It's been modified to begin searching from a
-* page and return just the first page found in that range.  If the
-* found idx is less than or equal to the end idx then we know that
-* a page exists.  If no pages are found or if those pages are
-* outside of the range then we're fine (yay!) */
-   while (page == NULL &&
-  radix_tree_gang_lookup_slot(root, , NULL, start_idx, 1)) {
-   page = radix_tree_deref_slot(pagep);
-   if (unlikely(!page))
-   break;
-
-   if (radix_tree_exception(page)) {
-   if (radix_tree_deref_retry(page)) {
-   page = NULL;
-   continue;
-   }
-   /*
-* Otherwise, shmem/tmpfs must be storing a swap entry
-* here as an exceptional entry: so return it without
-* attempting to raise page count.
-*/
-   page = NULL;
-   break; /* TODO: Is this relevant for this use case? */
-   }
-
-   if (!page_cache_get_speculative(page)) {
-   page = NULL;
-   continue;
-   }
-
-   /*
-* Has the page moved?
-* This is part of the lockless pagecache protocol. See
-* include/linux/pagemap.h for details.
-*/
-   if (unlikely(page != *pagep)) {
-   put_page(page);
-   page = NULL;
-   }
-   }
-
-   if (page) {
-   if (page->index <= end_idx)
-   found = true;
-   put_page(page);
-   }
-
-   rcu_read_unlock();
-   return found;
-}
-
 static int lock_extent_direct(struct inode *inode, u64 lockstart, u64 lockend,
  struct extent_state **cached_state, int writing)
 {
-- 
2.16.1

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH v8 07/63] xfs: Rename xa_ elements to ail_

2018-03-06 Thread Matthew Wilcox

From: Matthew Wilcox 

This is a simple rename, except that xa_ail becomes ail_head.

Signed-off-by: Matthew Wilcox 
Reviewed-by: Darrick J. Wong 
---
 fs/xfs/xfs_buf_item.c|  10 ++--
 fs/xfs/xfs_dquot.c   |   4 +-
 fs/xfs/xfs_dquot_item.c  |  11 ++--
 fs/xfs/xfs_inode_item.c  |  22 +++
 fs/xfs/xfs_log.c |   6 +-
 fs/xfs/xfs_log_recover.c |  80 -
 fs/xfs/xfs_trans.c   |  18 +++---
 fs/xfs/xfs_trans_ail.c   | 152 +++
 fs/xfs/xfs_trans_buf.c   |   4 +-
 fs/xfs/xfs_trans_priv.h  |  42 ++---
 10 files changed, 175 insertions(+), 174 deletions(-)

diff --git a/fs/xfs/xfs_buf_item.c b/fs/xfs/xfs_buf_item.c
index 270ddb4d2313..82ad270e390e 100644
--- a/fs/xfs/xfs_buf_item.c
+++ b/fs/xfs/xfs_buf_item.c
@@ -460,7 +460,7 @@ xfs_buf_item_unpin(
list_del_init(>b_li_list);
bp->b_iodone = NULL;
} else {
-   spin_lock(>xa_lock);
+   spin_lock(>ail_lock);
xfs_trans_ail_delete(ailp, lip, SHUTDOWN_LOG_IO_ERROR);
xfs_buf_item_relse(bp);
ASSERT(bp->b_log_item == NULL);
@@ -1057,12 +1057,12 @@ xfs_buf_do_callbacks_fail(
lip = list_first_entry(>b_li_list, struct xfs_log_item,
li_bio_list);
ailp = lip->li_ailp;
-   spin_lock(>xa_lock);
+   spin_lock(>ail_lock);
list_for_each_entry(lip, >b_li_list, li_bio_list) {
if (lip->li_ops->iop_error)
lip->li_ops->iop_error(lip, bp);
}
-   spin_unlock(>xa_lock);
+   spin_unlock(>ail_lock);
 }
 
 static bool
@@ -1226,7 +1226,7 @@ xfs_buf_iodone(
 *
 * Either way, AIL is useless if we're forcing a shutdown.
 */
-   spin_lock(>xa_lock);
+   spin_lock(>ail_lock);
xfs_trans_ail_delete(ailp, lip, SHUTDOWN_CORRUPT_INCORE);
xfs_buf_item_free(BUF_ITEM(lip));
 }
@@ -1246,7 +1246,7 @@ xfs_buf_resubmit_failed_buffers(
/*
 * Clear XFS_LI_FAILED flag from all items before resubmit
 *
-* XFS_LI_FAILED set/clear is protected by xa_lock, caller  this
+* XFS_LI_FAILED set/clear is protected by ail_lock, caller  this
 * function already have it acquired
 */
list_for_each_entry(lip, >b_li_list, li_bio_list)
diff --git a/fs/xfs/xfs_dquot.c b/fs/xfs/xfs_dquot.c
index 43572f8a1b8e..c4041b676b7b 100644
--- a/fs/xfs/xfs_dquot.c
+++ b/fs/xfs/xfs_dquot.c
@@ -920,7 +920,7 @@ xfs_qm_dqflush_done(
 (lip->li_flags & XFS_LI_FAILED))) {
 
/* xfs_trans_ail_delete() drops the AIL lock. */
-   spin_lock(>xa_lock);
+   spin_lock(>ail_lock);
if (lip->li_lsn == qip->qli_flush_lsn) {
xfs_trans_ail_delete(ailp, lip, 
SHUTDOWN_CORRUPT_INCORE);
} else {
@@ -930,7 +930,7 @@ xfs_qm_dqflush_done(
 */
if (lip->li_flags & XFS_LI_FAILED)
xfs_clear_li_failed(lip);
-   spin_unlock(>xa_lock);
+   spin_unlock(>ail_lock);
}
}
 
diff --git a/fs/xfs/xfs_dquot_item.c b/fs/xfs/xfs_dquot_item.c
index 96eaa6933709..4b331e354da7 100644
--- a/fs/xfs/xfs_dquot_item.c
+++ b/fs/xfs/xfs_dquot_item.c
@@ -157,8 +157,9 @@ xfs_dquot_item_error(
 STATIC uint
 xfs_qm_dquot_logitem_push(
struct xfs_log_item *lip,
-   struct list_head*buffer_list) __releases(>li_ailp->xa_lock)
- __acquires(>li_ailp->xa_lock)
+   struct list_head*buffer_list)
+   __releases(>li_ailp->ail_lock)
+   __acquires(>li_ailp->ail_lock)
 {
struct xfs_dquot*dqp = DQUOT_ITEM(lip)->qli_dquot;
struct xfs_buf  *bp = lip->li_buf;
@@ -205,7 +206,7 @@ xfs_qm_dquot_logitem_push(
goto out_unlock;
}
 
-   spin_unlock(>li_ailp->xa_lock);
+   spin_unlock(>li_ailp->ail_lock);
 
error = xfs_qm_dqflush(dqp, );
if (error) {
@@ -217,7 +218,7 @@ xfs_qm_dquot_logitem_push(
xfs_buf_relse(bp);
}
 
-   spin_lock(>li_ailp->xa_lock);
+   spin_lock(>li_ailp->ail_lock);
 out_unlock:
xfs_dqunlock(dqp);
return rval;
@@ -400,7 +401,7 @@ xfs_qm_qoffend_logitem_committed(
 * Delete the qoff-start logitem from the AIL.
 * xfs_trans_ail_delete() drops the AIL lock.
 */
-   spin_lock(>xa_lock);
+   spin_lock(>ail_lock);
xfs_trans_ail_delete(ailp, >qql_item, SHUTDOWN_LOG_IO_ERROR);
 
kmem_free(qfs->qql_item.li_lv_shadow);
diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c
index d5037f060d6f..7666eae8844f 100644
---

[PATCH v8 02/63] radix tree: Use GFP_ZONEMASK bits of gfp_t for flags

2018-03-06 Thread Matthew Wilcox

From: Matthew Wilcox 

None of these bits may be used for slab allocations, so we can use them
as radix tree flags as long as we mask them off before passing them
to the slab allocator.  Move the IDR flag from the high bits to the
GFP_ZONEMASK bits.

Signed-off-by: Matthew Wilcox 
Acked-by: Jeff Layton 
---
 include/linux/idr.h  | 3 ++-
 include/linux/radix-tree.h   | 7 ---
 lib/radix-tree.c | 3 ++-
 tools/testing/radix-tree/linux/gfp.h | 1 +
 4 files changed, 9 insertions(+), 5 deletions(-)

diff --git a/include/linux/idr.h b/include/linux/idr.h
index 7d6a6313f0ab..913c335054f0 100644
--- a/include/linux/idr.h
+++ b/include/linux/idr.h
@@ -29,7 +29,8 @@ struct idr {
 #define IDR_FREE   0
 
 /* Set the IDR flag and the IDR_FREE tag */
-#define IDR_RT_MARKER  ((__force gfp_t)(3 << __GFP_BITS_SHIFT))
+#define IDR_RT_MARKER  (ROOT_IS_IDR | (__force gfp_t)  \
+   (1 << (ROOT_TAG_SHIFT + IDR_FREE)))
 
 #define IDR_INIT_BASE(base) {  \
.idr_rt = RADIX_TREE_INIT(IDR_RT_MARKER),   \
diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h
index fc55ff31eca7..6c4e2e716dac 100644
--- a/include/linux/radix-tree.h
+++ b/include/linux/radix-tree.h
@@ -104,9 +104,10 @@ struct radix_tree_node {
unsigned long   tags[RADIX_TREE_MAX_TAGS][RADIX_TREE_TAG_LONGS];
 };
 
-/* The top bits of gfp_mask are used to store the root tags and the IDR flag */
-#define ROOT_IS_IDR((__force gfp_t)(1 << __GFP_BITS_SHIFT))
-#define ROOT_TAG_SHIFT (__GFP_BITS_SHIFT + 1)
+/* The IDR tag is stored in the low bits of the GFP flags */
+#define ROOT_IS_IDR((__force gfp_t)4)
+/* The top bits of gfp_mask are used to store the root tags */
+#define ROOT_TAG_SHIFT (__GFP_BITS_SHIFT)
 
 struct radix_tree_root {
gfp_t   gfp_mask;
diff --git a/lib/radix-tree.c b/lib/radix-tree.c
index 8e00138d593f..da9e10c827df 100644
--- a/lib/radix-tree.c
+++ b/lib/radix-tree.c
@@ -146,7 +146,7 @@ static unsigned int radix_tree_descend(const struct 
radix_tree_node *parent,
 
 static inline gfp_t root_gfp_mask(const struct radix_tree_root *root)
 {
-   return root->gfp_mask & __GFP_BITS_MASK;
+   return root->gfp_mask & (__GFP_BITS_MASK & ~GFP_ZONEMASK);
 }
 
 static inline void tag_set(struct radix_tree_node *node, unsigned int tag,
@@ -2285,6 +2285,7 @@ void __init radix_tree_init(void)
int ret;
 
BUILD_BUG_ON(RADIX_TREE_MAX_TAGS + __GFP_BITS_SHIFT > 32);
+   BUILD_BUG_ON(ROOT_IS_IDR & ~GFP_ZONEMASK);
radix_tree_node_cachep = kmem_cache_create("radix_tree_node",
sizeof(struct radix_tree_node), 0,
SLAB_PANIC | SLAB_RECLAIM_ACCOUNT,
diff --git a/tools/testing/radix-tree/linux/gfp.h 
b/tools/testing/radix-tree/linux/gfp.h
index e3201ccf54c3..32159c08a52e 100644
--- a/tools/testing/radix-tree/linux/gfp.h
+++ b/tools/testing/radix-tree/linux/gfp.h
@@ -19,6 +19,7 @@
 
 #define __GFP_RECLAIM  (__GFP_DIRECT_RECLAIM|__GFP_KSWAPD_RECLAIM)
 
+#define GFP_ZONEMASK   0x0fu
 #define GFP_ATOMIC (__GFP_HIGH|__GFP_ATOMIC|__GFP_KSWAPD_RECLAIM)
 #define GFP_KERNEL (__GFP_RECLAIM | __GFP_IO | __GFP_FS)
 #define GFP_NOWAIT (__GFP_KSWAPD_RECLAIM)
-- 
2.16.1

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH v8 12/63] xarray: Change definition of sibling entries

2018-03-06 Thread Matthew Wilcox

From: Matthew Wilcox 

Instead of storing a pointer to the slot containing the canonical entry,
store the offset of the slot.  Produces slightly more efficient code
(~300 bytes) and simplifies the implementation.

Signed-off-by: Matthew Wilcox 
---
 include/linux/xarray.h | 93 ++
 lib/radix-tree.c   | 66 +++
 2 files changed, 112 insertions(+), 47 deletions(-)

diff --git a/include/linux/xarray.h b/include/linux/xarray.h
index f61806fd8002..283beb5aac58 100644
--- a/include/linux/xarray.h
+++ b/include/linux/xarray.h
@@ -22,6 +22,12 @@
  * x1: Value entry
  *
  * Attempting to store internal entries in the XArray is a bug.
+ *
+ * Most internal entries are pointers to the next node in the tree.
+ * The following internal entries have a special meaning:
+ *
+ * 0-62: Sibling entries
+ * 256: Retry entry
  */
 
 #define BITS_PER_XA_VALUE  (BITS_PER_LONG - 1)
@@ -63,6 +69,42 @@ static inline bool xa_is_value(const void *entry)
return (unsigned long)entry & 1;
 }
 
+/*
+ * xa_mk_internal() - Create an internal entry.
+ * @v: Value to turn into an internal entry.
+ *
+ * Context: Any context.
+ * Return: An XArray internal entry corresponding to this value.
+ */
+static inline void *xa_mk_internal(unsigned long v)
+{
+   return (void *)((v << 2) | 2);
+}
+
+/*
+ * xa_to_internal() - Extract the value from an internal entry.
+ * @entry: XArray entry.
+ *
+ * Context: Any context.
+ * Return: The value which was stored in the internal entry.
+ */
+static inline unsigned long xa_to_internal(const void *entry)
+{
+   return (unsigned long)entry >> 2;
+}
+
+/*
+ * xa_is_internal() - Is the entry an internal entry?
+ * @entry: XArray entry.
+ *
+ * Context: Any context.
+ * Return: %true if the entry is an internal entry.
+ */
+static inline bool xa_is_internal(const void *entry)
+{
+   return ((unsigned long)entry & 3) == 2;
+}
+
 #define xa_trylock(xa) spin_trylock(&(xa)->xa_lock)
 #define xa_lock(xa)spin_lock(&(xa)->xa_lock)
 #define xa_unlock(xa)  spin_unlock(&(xa)->xa_lock)
@@ -75,4 +117,55 @@ static inline bool xa_is_value(const void *entry)
 #define xa_unlock_irqrestore(xa, flags) \
spin_unlock_irqrestore(&(xa)->xa_lock, flags)
 
+/* Everything below here is the Advanced API.  Proceed with caution. */
+
+/*
+ * The xarray is constructed out of a set of 'chunks' of pointers.  Choosing
+ * the best chunk size requires some tradeoffs.  A power of two recommends
+ * itself so that we can walk the tree based purely on shifts and masks.
+ * Generally, the larger the better; as the number of slots per level of the
+ * tree increases, the less tall the tree needs to be.  But that needs to be
+ * balanced against the memory consumption of each node.  On a 64-bit system,
+ * xa_node is currently 576 bytes, and we get 7 of them per 4kB page.  If we
+ * doubled the number of slots per node, we'd get only 3 nodes per 4kB page.
+ */
+#ifndef XA_CHUNK_SHIFT
+#define XA_CHUNK_SHIFT (CONFIG_BASE_SMALL ? 4 : 6)
+#endif
+#define XA_CHUNK_SIZE  (1UL << XA_CHUNK_SHIFT)
+#define XA_CHUNK_MASK  (XA_CHUNK_SIZE - 1)
+
+/* Private */
+static inline bool xa_is_node(const void *entry)
+{
+   return xa_is_internal(entry) && (unsigned long)entry > 4096;
+}
+
+/* Private */
+static inline void *xa_mk_sibling(unsigned int offset)
+{
+   return xa_mk_internal(offset);
+}
+
+/* Private */
+static inline unsigned long xa_to_sibling(const void *entry)
+{
+   return xa_to_internal(entry);
+}
+
+/**
+ * xa_is_sibling() - Is the entry a sibling entry?
+ * @entry: Entry retrieved from the XArray
+ *
+ * Return: %true if the entry is a sibling entry.
+ */
+static inline bool xa_is_sibling(const void *entry)
+{
+   return IS_ENABLED(CONFIG_RADIX_TREE_MULTIORDER) &&
+   xa_is_internal(entry) &&
+   (entry < xa_mk_sibling(XA_CHUNK_SIZE - 1));
+}
+
+#define XA_RETRY_ENTRY xa_mk_internal(256)
+
 #endif /* _LINUX_XARRAY_H */
diff --git a/lib/radix-tree.c b/lib/radix-tree.c
index c7246de3f367..dd4669b3c30e 100644
--- a/lib/radix-tree.c
+++ b/lib/radix-tree.c
@@ -38,6 +38,7 @@
 #include 
 #include 
 #include 
+#include 
 
 
 /* Number of nodes in fully populated tree of given height */
@@ -98,24 +99,7 @@ static inline void *node_to_entry(void *ptr)
return (void *)((unsigned long)ptr | RADIX_TREE_INTERNAL_NODE);
 }
 
-#define RADIX_TREE_RETRY   node_to_entry(NULL)
-
-#ifdef CONFIG_RADIX_TREE_MULTIORDER
-/* Sibling slots point directly to another slot in the same node */
-static inline
-bool is_sibling_entry(const struct radix_tree_node *parent, void *node)
-{
-   void __rcu **ptr = node;
-   return (parent->slots <= ptr) &&
-   (ptr < parent->slots + RADIX_TREE_MAP_SIZE);
-}
-#else
-static inline
-bool is_sibling_entry(const struct

[PATCH v8 09/63] xarray: Add the xa_lock to the radix_tree_root

2018-03-06 Thread Matthew Wilcox

From: Matthew Wilcox 

This results in no change in structure size on 64-bit machines as it
fits in the padding between the gfp_t and the void *.  32-bit machines
will grow the structure from 8 to 12 bytes.  Almost all radix trees
are protected with (at least) a spinlock, so as they are converted from
radix trees to xarrays, the data structures will shrink again.

Initialising the spinlock requires a name for the benefit of lockdep,
so RADIX_TREE_INIT() now needs to know the name of the radix tree it's
initialising, and so do IDR_INIT() and IDA_INIT().

Also add the xa_lock() and xa_unlock() family of wrappers to make it
easier to use the lock.  If we could rely on -fplan9-extensions in
the compiler, we could avoid all of this syntactic sugar, but that
wasn't added until gcc 4.6.

Signed-off-by: Matthew Wilcox 
Reviewed-by: Jeff Layton 
---
 fs/f2fs/gc.c   |  2 +-
 include/linux/idr.h| 19 ++-
 include/linux/radix-tree.h |  7 +--
 include/linux/xarray.h | 24 
 kernel/pid.c   |  2 +-
 tools/include/linux/spinlock.h |  1 +
 6 files changed, 42 insertions(+), 13 deletions(-)
 create mode 100644 include/linux/xarray.h

diff --git a/fs/f2fs/gc.c b/fs/f2fs/gc.c
index aa720cc44509..7aa15134180e 100644
--- a/fs/f2fs/gc.c
+++ b/fs/f2fs/gc.c
@@ -1006,7 +1006,7 @@ int f2fs_gc(struct f2fs_sb_info *sbi, bool sync,
unsigned int init_segno = segno;
struct gc_inode_list gc_list = {
.ilist = LIST_HEAD_INIT(gc_list.ilist),
-   .iroot = RADIX_TREE_INIT(GFP_NOFS),
+   .iroot = RADIX_TREE_INIT(gc_list.iroot, GFP_NOFS),
};
 
trace_f2fs_gc_begin(sbi->sb, sync, background,
diff --git a/include/linux/idr.h b/include/linux/idr.h
index 913c335054f0..e856f4e0ab35 100644
--- a/include/linux/idr.h
+++ b/include/linux/idr.h
@@ -32,27 +32,28 @@ struct idr {
 #define IDR_RT_MARKER  (ROOT_IS_IDR | (__force gfp_t)  \
(1 << (ROOT_TAG_SHIFT + IDR_FREE)))
 
-#define IDR_INIT_BASE(base) {  \
-   .idr_rt = RADIX_TREE_INIT(IDR_RT_MARKER),   \
+#define IDR_INIT_BASE(name, base) {\
+   .idr_rt = RADIX_TREE_INIT(name, IDR_RT_MARKER), \
.idr_base = (base), \
.idr_next = 0,  \
 }
 
 /**
  * IDR_INIT() - Initialise an IDR.
+ * @name: Name of IDR.
  *
  * A freshly-initialised IDR contains no IDs.
  */
-#define IDR_INIT   IDR_INIT_BASE(0)
+#define IDR_INIT(name) IDR_INIT_BASE(name, 0)
 
 /**
- * DEFINE_IDR() - Define a statically-allocated IDR
- * @name: Name of IDR
+ * DEFINE_IDR() - Define a statically-allocated IDR.
+ * @name: Name of IDR.
  *
  * An IDR defined using this macro is ready for use with no additional
  * initialisation required.  It contains no IDs.
  */
-#define DEFINE_IDR(name)   struct idr name = IDR_INIT
+#define DEFINE_IDR(name)   struct idr name = IDR_INIT(name)
 
 /**
  * idr_get_cursor - Return the current position of the cyclic allocator
@@ -219,10 +220,10 @@ struct ida {
struct radix_tree_root  ida_rt;
 };
 
-#define IDA_INIT   {   \
-   .ida_rt = RADIX_TREE_INIT(IDR_RT_MARKER | GFP_NOWAIT),  \
+#define IDA_INIT(name) {   \
+   .ida_rt = RADIX_TREE_INIT(name, IDR_RT_MARKER | GFP_NOWAIT),\
 }
-#define DEFINE_IDA(name)   struct ida name = IDA_INIT
+#define DEFINE_IDA(name)   struct ida name = IDA_INIT(name)
 
 int ida_pre_get(struct ida *ida, gfp_t gfp_mask);
 int ida_get_new_above(struct ida *ida, int starting_id, int *p_id);
diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h
index 6c4e2e716dac..34149e8b5f73 100644
--- a/include/linux/radix-tree.h
+++ b/include/linux/radix-tree.h
@@ -110,20 +110,23 @@ struct radix_tree_node {
 #define ROOT_TAG_SHIFT (__GFP_BITS_SHIFT)
 
 struct radix_tree_root {
+   spinlock_t  xa_lock;
gfp_t   gfp_mask;
struct radix_tree_node  __rcu *rnode;
 };
 
-#define RADIX_TREE_INIT(mask)  {   \
+#define RADIX_TREE_INIT(name, mask){   \
+   .xa_lock = __SPIN_LOCK_UNLOCKED(name.xa_lock),  \
.gfp_mask = (mask), \
.rnode = NULL,  \
 }
 
 #define RADIX_TREE(name, mask) \
-   struct radix_tree_root name = RADIX_TREE_INIT(mask)
+   struct radix_tree_root name = RADIX_TREE_INIT(name, mask)
 
 #define INIT_RADIX_TREE(root, mask)\
 do {

[PATCH v8 11/63] xarray: Replace exceptional entries

2018-03-06 Thread Matthew Wilcox

From: Matthew Wilcox 

Introduce xarray value entries to replace the radix tree exceptional
entry code.  This is a slight change in encoding to allow the use of an
extra bit (we can now store BITS_PER_LONG - 1 bits in a value entry).
It is also a change in emphasis; exceptional entries are intimidating
and different.  As the comment explains, you can choose to store values
or pointers in the xarray and they are both first-class citizens.

Signed-off-by: Matthew Wilcox 
---
 arch/powerpc/include/asm/book3s/64/pgtable.h|   4 +-
 arch/powerpc/include/asm/nohash/64/pgtable.h|   4 +-
 drivers/gpu/drm/i915/i915_gem.c |  17 ++--
 drivers/staging/lustre/lustre/mdc/mdc_request.c |   2 +-
 fs/btrfs/compression.c  |   2 +-
 fs/dax.c| 107 
 fs/proc/task_mmu.c  |   2 +-
 include/linux/radix-tree.h  |  36 ++--
 include/linux/swapops.h |  19 ++---
 include/linux/xarray.h  |  54 
 lib/idr.c   |  61 ++
 lib/radix-tree.c|  21 ++---
 mm/filemap.c|  10 +--
 mm/khugepaged.c |   2 +-
 mm/madvise.c|   2 +-
 mm/memcontrol.c |   2 +-
 mm/mincore.c|   2 +-
 mm/readahead.c  |   2 +-
 mm/shmem.c  |  10 +--
 mm/swap.c   |   2 +-
 mm/truncate.c   |  12 +--
 mm/workingset.c |  12 ++-
 tools/testing/radix-tree/idr-test.c |   6 +-
 tools/testing/radix-tree/linux/radix-tree.h |   1 +
 tools/testing/radix-tree/multiorder.c   |  47 +--
 tools/testing/radix-tree/test.c |   2 +-
 26 files changed, 223 insertions(+), 218 deletions(-)

diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h 
b/arch/powerpc/include/asm/book3s/64/pgtable.h
index a6b9f1d74600..c3a5c809779c 100644
--- a/arch/powerpc/include/asm/book3s/64/pgtable.h
+++ b/arch/powerpc/include/asm/book3s/64/pgtable.h
@@ -709,9 +709,7 @@ static inline bool pte_user(pte_t pte)
BUILD_BUG_ON(_PAGE_HPTEFLAGS & (0x1f << _PAGE_BIT_SWAP_TYPE)); \
BUILD_BUG_ON(_PAGE_HPTEFLAGS & _PAGE_SWP_SOFT_DIRTY);   \
} while (0)
-/*
- * on pte we don't need handle RADIX_TREE_EXCEPTIONAL_SHIFT;
- */
+
 #define SWP_TYPE_BITS 5
 #define __swp_type(x)  (((x).val >> _PAGE_BIT_SWAP_TYPE) \
& ((1UL << SWP_TYPE_BITS) - 1))
diff --git a/arch/powerpc/include/asm/nohash/64/pgtable.h 
b/arch/powerpc/include/asm/nohash/64/pgtable.h
index 5c5f75d005ad..22f519687b74 100644
--- a/arch/powerpc/include/asm/nohash/64/pgtable.h
+++ b/arch/powerpc/include/asm/nohash/64/pgtable.h
@@ -330,9 +330,7 @@ static inline void __ptep_set_access_flags(struct mm_struct 
*mm,
 */ \
BUILD_BUG_ON(_PAGE_HPTEFLAGS & (0x1f << _PAGE_BIT_SWAP_TYPE)); \
} while (0)
-/*
- * on pte we don't need handle RADIX_TREE_EXCEPTIONAL_SHIFT;
- */
+
 #define SWP_TYPE_BITS 5
 #define __swp_type(x)  (((x).val >> _PAGE_BIT_SWAP_TYPE) \
& ((1UL << SWP_TYPE_BITS) - 1))
diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index a5bd07338b46..cd3ae7c236ae 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -5771,7 +5771,8 @@ i915_gem_object_get_sg(struct drm_i915_gem_object *obj,
count = __sg_page_count(sg);
 
while (idx + count <= n) {
-   unsigned long exception, i;
+   void *entry;
+   unsigned long i;
int ret;
 
/* If we cannot allocate and insert this entry, or the
@@ -5786,12 +5787,9 @@ i915_gem_object_get_sg(struct drm_i915_gem_object *obj,
if (ret && ret != -EEXIST)
goto scan;
 
-   exception =
-   RADIX_TREE_EXCEPTIONAL_ENTRY |
-   idx << RADIX_TREE_EXCEPTIONAL_SHIFT;
+   entry = xa_mk_value(idx);
for (i = 1; i < count; i++) {
-   ret = radix_tree_insert(>radix, idx + i,
-   (void *)exception);
+   ret = radix_tree_insert(>radix, idx + i, entry);
if (ret && ret != -EEXIST)
goto scan;
}
@@ -5829,15 +5827,14 @@ i915_gem_object_get_sg(struct drm_i915_gem_object *obj,
GEM_BUG_ON(!sg);
 
/* If this index is in the middle of

[PATCH v8 10/63] page cache: Use xa_lock

2018-03-06 Thread Matthew Wilcox

From: Matthew Wilcox 

Remove the address_space ->tree_lock and use the xa_lock newly added to
the radix_tree_root.  Rename the address_space ->page_tree to ->i_pages,
since we don't really care that it's a tree.

Signed-off-by: Matthew Wilcox 
Acked-by: Jeff Layton 
---
 Documentation/cgroup-v1/memory.txt  |   2 +-
 Documentation/vm/page_migration |  14 +--
 arch/arm/include/asm/cacheflush.h   |   6 +-
 arch/nds32/include/asm/cacheflush.h |   4 +-
 arch/nios2/include/asm/cacheflush.h |   6 +-
 arch/parisc/include/asm/cacheflush.h|   6 +-
 drivers/staging/lustre/lustre/llite/glimpse.c   |   2 +-
 drivers/staging/lustre/lustre/mdc/mdc_request.c |   8 +-
 fs/afs/write.c  |   9 +-
 fs/btrfs/compression.c  |   2 +-
 fs/btrfs/extent_io.c|  16 +--
 fs/buffer.c |  13 ++-
 fs/cifs/file.c  |   9 +-
 fs/dax.c| 126 
 fs/f2fs/data.c  |   6 +-
 fs/f2fs/dir.c   |   6 +-
 fs/f2fs/inline.c|   6 +-
 fs/f2fs/node.c  |   8 +-
 fs/fs-writeback.c   |  22 ++---
 fs/inode.c  |  11 +--
 fs/nilfs2/btnode.c  |  20 ++--
 fs/nilfs2/page.c|  22 ++---
 include/linux/backing-dev.h |  14 +--
 include/linux/fs.h  |   8 +-
 include/linux/mm.h  |   2 +-
 include/linux/pagemap.h |   4 +-
 mm/filemap.c|  84 
 mm/huge_memory.c|  10 +-
 mm/khugepaged.c |  49 +
 mm/memcontrol.c |   4 +-
 mm/memfd.c  |  18 ++--
 mm/migrate.c|  32 +++---
 mm/page-writeback.c |  43 
 mm/readahead.c  |   2 +-
 mm/rmap.c   |   4 +-
 mm/shmem.c  |  42 
 mm/swap_state.c |  17 ++--
 mm/truncate.c   |  22 ++---
 mm/vmscan.c |  12 +--
 mm/workingset.c |  22 ++---
 40 files changed, 346 insertions(+), 367 deletions(-)

diff --git a/Documentation/cgroup-v1/memory.txt 
b/Documentation/cgroup-v1/memory.txt
index a4af2e124e24..3682e99234c2 100644
--- a/Documentation/cgroup-v1/memory.txt
+++ b/Documentation/cgroup-v1/memory.txt
@@ -262,7 +262,7 @@ When oom event notifier is registered, event will be 
delivered.
 2.6 Locking
 
lock_page_cgroup()/unlock_page_cgroup() should not be called under
-   mapping->tree_lock.
+   the i_pages lock.
 
Other lock order is following:
PG_locked.
diff --git a/Documentation/vm/page_migration b/Documentation/vm/page_migration
index 0478ae2ad44a..496868072e24 100644
--- a/Documentation/vm/page_migration
+++ b/Documentation/vm/page_migration
@@ -90,7 +90,7 @@ Steps:
 
 1. Lock the page to be migrated
 
-2. Insure that writeback is complete.
+2. Ensure that writeback is complete.
 
 3. Lock the new page that we want to move to. It is locked so that accesses to
this (not yet uptodate) page immediately lock while the move is in progress.
@@ -100,8 +100,8 @@ Steps:
mapcount is not zero then we do not migrate the page. All user space
processes that attempt to access the page will now wait on the page lock.
 
-5. The radix tree lock is taken. This will cause all processes trying
-   to access the page via the mapping to block on the radix tree spinlock.
+5. The i_pages lock is taken. This will cause all processes trying
+   to access the page via the mapping to block on the spinlock.
 
 6. The refcount of the page is examined and we back out if references remain
otherwise we know that we are the only one referencing this page.
@@ -114,12 +114,12 @@ Steps:
 
 9. The radix tree is changed to point to the new page.
 
-10. The reference count of the old page is dropped because the radix tree
+10. The reference count of the old page is dropped because the address space
 reference is gone. A reference to the new page is established because
-the new page is referenced to by the radix tree.
+the new page is referenced by the address space.
 
-11. The radix tree lock is dropped. With that lookups in the mapping
-become possible again. Processes will move from spinning on the tree_lock
+11. The i_pages lock is dropped. With

[PATCH v8 17/63] xarray: Add xa_get_tag, xa_set_tag and xa_clear_tag

2018-03-06 Thread Matthew Wilcox

From: Matthew Wilcox 

XArray tags are slightly more strongly typed than the radix tree tags,
but occupy the same bits.  This commit also adds the xas_ family of tag
operations, for cases where the caller is already holding the lock, and
xa_tagged() to ask whether any array member has a particular tag set.

Signed-off-by: Matthew Wilcox 
---
 include/linux/xarray.h |  41 +++
 lib/xarray.c   | 235 +
 tools/include/linux/spinlock.h |   6 ++
 3 files changed, 282 insertions(+)

diff --git a/include/linux/xarray.h b/include/linux/xarray.h
index 5845187c1ce8..1cf012256eab 100644
--- a/include/linux/xarray.h
+++ b/include/linux/xarray.h
@@ -11,6 +11,7 @@
 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -149,6 +150,20 @@ static inline int xa_err(void *entry)
return 0;
 }
 
+typedef unsigned __bitwise xa_tag_t;
+#define XA_TAG_0   ((__force xa_tag_t)0U)
+#define XA_TAG_1   ((__force xa_tag_t)1U)
+#define XA_TAG_2   ((__force xa_tag_t)2U)
+#define XA_PRESENT ((__force xa_tag_t)8U)
+#define XA_TAG_MAX XA_TAG_2
+
+/*
+ * Values for xa_flags.  The radix tree stores its GFP flags in the xa_flags,
+ * and we remain compatible with that.
+ */
+#define XA_FLAGS_TAG(tag)  ((__force gfp_t)((1U << __GFP_BITS_SHIFT) << \
+   (__force unsigned)(tag)))
+
 /**
  * struct xarray - The anchor of the XArray.
  * @xa_lock: Lock that protects the contents of the XArray.
@@ -195,6 +210,9 @@ struct xarray {
 
 void xa_init_flags(struct xarray *, gfp_t flags);
 void *xa_load(struct xarray *, unsigned long index);
+bool xa_get_tag(struct xarray *, unsigned long index, xa_tag_t);
+void xa_set_tag(struct xarray *, unsigned long index, xa_tag_t);
+void xa_clear_tag(struct xarray *, unsigned long index, xa_tag_t);
 
 /**
  * xa_init() - Initialise an empty XArray.
@@ -209,6 +227,19 @@ static inline void xa_init(struct xarray *xa)
xa_init_flags(xa, 0);
 }
 
+/**
+ * xa_tagged() - Inquire whether any entry in this array has a tag set
+ * @xa: Array
+ * @tag: Tag value
+ *
+ * Context: Any context.
+ * Return: %true if any entry has this tag set.
+ */
+static inline bool xa_tagged(const struct xarray *xa, xa_tag_t tag)
+{
+   return xa->xa_flags & XA_FLAGS_TAG(tag);
+}
+
 #define xa_trylock(xa) spin_trylock(&(xa)->xa_lock)
 #define xa_lock(xa)spin_lock(&(xa)->xa_lock)
 #define xa_unlock(xa)  spin_unlock(&(xa)->xa_lock)
@@ -221,6 +252,12 @@ static inline void xa_init(struct xarray *xa)
 #define xa_unlock_irqrestore(xa, flags) \
spin_unlock_irqrestore(&(xa)->xa_lock, flags)
 
+/*
+ * Versions of the normal API which require the caller to hold the xa_lock.
+ */
+void __xa_set_tag(struct xarray *, unsigned long index, xa_tag_t);
+void __xa_clear_tag(struct xarray *, unsigned long index, xa_tag_t);
+
 /* Everything below here is the Advanced API.  Proceed with caution. */
 
 /*
@@ -534,6 +571,10 @@ static inline bool xas_retry(struct xa_state *xas, const 
void *entry)
 
 void *xas_load(struct xa_state *);
 
+bool xas_get_tag(const struct xa_state *, xa_tag_t);
+void xas_set_tag(const struct xa_state *, xa_tag_t);
+void xas_clear_tag(const struct xa_state *, xa_tag_t);
+
 /**
  * xas_reload() - Refetch an entry from the xarray.
  * @xas: XArray operation state.
diff --git a/lib/xarray.c b/lib/xarray.c
index 195cb130d53d..ca25a7a4a4fa 100644
--- a/lib/xarray.c
+++ b/lib/xarray.c
@@ -5,6 +5,7 @@
  * Author: Matthew Wilcox 
  */
 
+#include 
 #include 
 #include 
 
@@ -24,6 +25,55 @@
  * @entry refers to something stored in a slot in the xarray
  */
 
+static inline struct xa_node *xa_parent(struct xarray *xa,
+   const struct xa_node *node)
+{
+   return rcu_dereference_check(node->parent,
+   lockdep_is_held(>xa_lock));
+}
+
+static inline struct xa_node *xa_parent_locked(struct xarray *xa,
+   const struct xa_node *node)
+{
+   return rcu_dereference_protected(node->parent,
+   lockdep_is_held(>xa_lock));
+}
+
+static inline void xa_tag_set(struct xarray *xa, xa_tag_t tag)
+{
+   if (!(xa->xa_flags & XA_FLAGS_TAG(tag)))
+   xa->xa_flags |= XA_FLAGS_TAG(tag);
+}
+
+static inline void xa_tag_clear(struct xarray *xa, xa_tag_t tag)
+{
+   if (xa->xa_flags & XA_FLAGS_TAG(tag))
+   xa->xa_flags &= ~(XA_FLAGS_TAG(tag));
+}
+
+static inline bool node_get_tag(const struct xa_node *node, unsigned int 
offset,
+   xa_tag_t tag)
+{
+   return test_bit(offset, node->tags[(__force unsigned)tag]);
+}
+
+static inline void node_set_tag(struct xa_node *node, unsigned int offset,
+

[PATCH v8 16/63] xarray: Add xa_load

2018-03-06 Thread Matthew Wilcox

From: Matthew Wilcox 

This first function in the XArray API brings with it a lot of support
infrastructure.  The advanced API is based around the xa_state which is
a more capable version of the radix_tree_iter.

As the test-suite demonstrates, it is possible to use the xarray and
radix tree APIs on the same data structure.

Signed-off-by: Matthew Wilcox 
---
 include/linux/xarray.h  | 304 
 lib/radix-tree.c|  43 
 lib/xarray.c| 191 +
 tools/testing/radix-tree/.gitignore |   1 +
 tools/testing/radix-tree/Makefile   |   7 +-
 tools/testing/radix-tree/linux/kernel.h |   1 +
 tools/testing/radix-tree/linux/radix-tree.h |   1 -
 tools/testing/radix-tree/linux/rcupdate.h   |   1 +
 tools/testing/radix-tree/linux/xarray.h |   1 +
 tools/testing/radix-tree/xarray-test.c  |  49 +
 10 files changed, 553 insertions(+), 46 deletions(-)
 create mode 100644 tools/testing/radix-tree/xarray-test.c

diff --git a/include/linux/xarray.h b/include/linux/xarray.h
index b51f354dfbf0..5845187c1ce8 100644
--- a/include/linux/xarray.h
+++ b/include/linux/xarray.h
@@ -12,6 +12,8 @@
 #include 
 #include 
 #include 
+#include 
+#include 
 #include 
 #include 
 
@@ -30,6 +32,10 @@
  *
  * 0-62: Sibling entries
  * 256: Retry entry
+ *
+ * Errors are also represented as internal entries, but use the negative
+ * space (-4094 to -2).  They're never stored in the slots array; only
+ * returned by the normal API.
  */
 
 #define BITS_PER_XA_VALUE  (BITS_PER_LONG - 1)
@@ -107,6 +113,42 @@ static inline bool xa_is_internal(const void *entry)
return ((unsigned long)entry & 3) == 2;
 }
 
+/**
+ * xa_is_err() - Report whether an XArray operation returned an error
+ * @entry: Result from calling an XArray function
+ *
+ * If an XArray operation cannot complete an operation, it will return
+ * a special value indicating an error.  This function tells you
+ * whether an error occurred; xa_err() tells you which error occurred.
+ *
+ * Context: Any context.
+ * Return: %true if the entry indicates an error.
+ */
+static inline bool xa_is_err(const void *entry)
+{
+   return unlikely(xa_is_internal(entry));
+}
+
+/**
+ * xa_err() - Turn an XArray result into an errno.
+ * @entry: Result from calling an XArray function.
+ *
+ * If an XArray operation cannot complete an operation, it will return
+ * a special pointer value which encodes an errno.  This function extracts
+ * the errno from the pointer value, or returns 0 if the pointer does not
+ * represent an errno.
+ *
+ * Context: Any context.
+ * Return: A negative errno or 0.
+ */
+static inline int xa_err(void *entry)
+{
+   /* xa_to_internal() would not do sign extension. */
+   if (xa_is_err(entry))
+   return (long)entry >> 2;
+   return 0;
+}
+
 /**
  * struct xarray - The anchor of the XArray.
  * @xa_lock: Lock that protects the contents of the XArray.
@@ -152,6 +194,7 @@ struct xarray {
struct xarray name = XARRAY_INIT_FLAGS(name, flags)
 
 void xa_init_flags(struct xarray *, gfp_t flags);
+void *xa_load(struct xarray *, unsigned long index);
 
 /**
  * xa_init() - Initialise an empty XArray.
@@ -220,6 +263,62 @@ struct xa_node {
unsigned long   tags[XA_MAX_TAGS][XA_TAG_LONGS];
 };
 
+#ifdef XA_DEBUG
+void xa_dump(const struct xarray *);
+void xa_dump_node(const struct xa_node *);
+#define XA_BUG_ON(xa, x) do { \
+   if (x) \
+   xa_dump(xa); \
+   BUG_ON(x); \
+   } while (0)
+#define XA_NODE_BUG_ON(node, x) do { \
+   if ((x) && (node)) \
+   xa_dump_node(node); \
+   BUG_ON(x); \
+   } while (0)
+#else
+#define XA_BUG_ON(xa, x)   do { } while (0)
+#define XA_NODE_BUG_ON(node, x)do { } while (0)
+#endif
+
+/* Private */
+static inline void *xa_head(struct xarray *xa)
+{
+   return rcu_dereference_check(xa->xa_head,
+   lockdep_is_held(>xa_lock));
+}
+
+/* Private */
+static inline void *xa_head_locked(struct xarray *xa)
+{
+   return rcu_dereference_protected(xa->xa_head,
+   lockdep_is_held(>xa_lock));
+}
+
+/* Private */
+static inline void *xa_entry(struct xarray *xa,
+   const struct xa_node *node, unsigned int offset)
+{
+   XA_NODE_BUG_ON(node, offset >= XA_CHUNK_SIZE);
+   return rcu_dereference_check(node->slots[offset],
+   lockdep_is_held(>xa_lock));
+}
+
+/* Private */
+static inline void *xa_entry_locked(struct xarray *xa,
+   const struct xa_node *node, unsigned int offset)
+{
+   XA_NODE_BUG_ON(node, offset >= XA_CHUNK_SIZE);
+   return rcu_dereference_protected(node->slots[offset],

[PATCH v8 15/63] xarray: Add documentation

2018-03-06 Thread Matthew Wilcox

From: Matthew Wilcox 

This is documentation on how to use the XArray, not details about its
internal implementation.

Signed-off-by: Matthew Wilcox 
---
 Documentation/core-api/index.rst  |   1 +
 Documentation/core-api/xarray.rst | 361 ++
 2 files changed, 362 insertions(+)
 create mode 100644 Documentation/core-api/xarray.rst

diff --git a/Documentation/core-api/index.rst b/Documentation/core-api/index.rst
index c670a8031786..e4e15f0f608b 100644
--- a/Documentation/core-api/index.rst
+++ b/Documentation/core-api/index.rst
@@ -20,6 +20,7 @@ Core utilities
local_ops
workqueue
genericirq
+   xarray
flexible-arrays
librs
genalloc
diff --git a/Documentation/core-api/xarray.rst 
b/Documentation/core-api/xarray.rst
new file mode 100644
index ..914999c0bf3f
--- /dev/null
+++ b/Documentation/core-api/xarray.rst
@@ -0,0 +1,361 @@
+.. SPDX-License-Identifier: CC-BY-SA-4.0
+
+==
+XArray
+==
+
+:Author: Matthew Wilcox
+
+Overview
+
+
+The XArray is an abstract data type which behaves like a very large array
+of pointers.  It meets many of the same needs as a hash or a conventional
+resizable array.  Unlike a hash, it allows you to sensibly go to the
+next or previous entry in a cache-efficient manner.  In contrast to
+a resizable array, there is no need for copying data or changing MMU
+mappings in order to grow the array.  It is more memory-efficient,
+parallelisable and cache friendly than a doubly-linked list.  It takes
+advantage of RCU to perform lookups without locking.
+
+The XArray implementation is efficient when the indices used are densely
+clustered; hashing the object and using the hash as the index will not
+perform well.  The XArray is optimised for small indices, but still has
+good performance with large indices.  If your index can be larger than
+``ULONG_MAX`` then the XArray is not the data type for you.  The most
+important user of the XArray is the page cache.
+
+A freshly-initialised XArray contains a ``NULL`` pointer at every index.
+Each non-``NULL`` entry in the array has three bits associated with it
+called tags.  Each tag may be set or cleared independently of the others.
+You can iterate over entries which are tagged.
+
+Normal pointers may be stored in the XArray directly.  They must be 4-byte
+aligned, which is true for any pointer returned from :c:func:`kmalloc` and
+:c:func:`alloc_page`.  It isn't true for arbitrary user-space pointers,
+nor for function pointers.  You can store pointers to statically allocated
+objects, as long as those objects have an alignment of at least 4.
+
+You can also store integers between 0 and ``LONG_MAX`` in the XArray.
+You must first convert it into an entry using :c:func:`xa_mk_value`.
+When you retrieve an entry from the XArray, you can check whether it is
+a value entry by calling :c:func:`xa_is_value`, and convert it back to
+an integer by calling :c:func:`xa_to_value`.
+
+The XArray does not support storing :c:func:`IS_ERR` pointers as some
+conflict with value entries or internal entries.
+
+An unusual feature of the XArray is the ability to create entries which
+occupy a range of indices.  Once stored to, looking up any index in
+the range will return the same entry as looking up any other index in
+the range.  Setting a tag on one index will set it on all of them.
+Storing to any index will store to all of them.  Multi-index entries can
+be explicitly split into smaller entries, or storing ``NULL`` into any
+entry will cause the XArray to forget about the range.
+
+Normal API
+==
+
+Start by initialising an XArray, either with :c:func:`DEFINE_XARRAY`
+for statically allocated XArrays or :c:func:`xa_init` for dynamically
+allocated ones.
+
+You can then set entries using :c:func:`xa_store` and get entries
+using :c:func:`xa_load`.  xa_store will overwrite any entry with the
+new entry and return the previous entry stored at that index.  You can
+use :c:func:`xa_erase` instead of calling :c:func:`xa_store` with a
+%NULL entry.  There is no difference between an entry that has never
+been stored to and one that has most recently had ``NULL`` stored to it.
+
+You can conditionally replace an entry at an index by using
+:c:func:`xa_cmpxchg`.  Like :c:func:`cmpxchg`, it will only succeed if
+the entry at that index has the 'old' value.  It also returns the entry
+which was at that index; if it returns the same entry which was passed as
+'old', then :c:func:`xa_cmpxchg` succeeded.
+
+If you want to only store a new entry to an index if the current entry
+at that index is ``NULL``, you can use :c:func:`xa_insert` which
+returns ``-EEXIST`` if the entry is not empty.
+
+Calling :c:func:`xa_reserve` ensures that there is enough memory allocated
+to store an entry at the specified index.  This is not normally needed,
+but some users have a complicated locking scheme.
+
+You can enquire whether a tag is set on an

[PATCH v8 19/63] xarray: Add xa_cmpxchg and xa_insert

2018-03-06 Thread Matthew Wilcox

From: Matthew Wilcox 

Like cmpxchg(), xa_cmpxchg will only store to the index if the current
entry matches the old entry.  It returns the current entry, which is
usually more useful than the errno returned by radix_tree_insert().
For the users who really only want the errno, the xa_insert() wrapper
provides a more convenient calling convention.

Signed-off-by: Matthew Wilcox 
---
 include/linux/xarray.h | 60 
 lib/xarray.c   | 71 ++
 tools/testing/radix-tree/xarray-test.c | 10 +
 3 files changed, 141 insertions(+)

diff --git a/include/linux/xarray.h b/include/linux/xarray.h
index 38e290df2ff0..e95ebe2488f9 100644
--- a/include/linux/xarray.h
+++ b/include/linux/xarray.h
@@ -218,6 +218,8 @@ struct xarray {
 void xa_init_flags(struct xarray *, gfp_t flags);
 void *xa_load(struct xarray *, unsigned long index);
 void *xa_store(struct xarray *, unsigned long index, void *entry, gfp_t);
+void *xa_cmpxchg(struct xarray *, unsigned long index,
+   void *old, void *entry, gfp_t);
 bool xa_get_tag(struct xarray *, unsigned long index, xa_tag_t);
 void xa_set_tag(struct xarray *, unsigned long index, xa_tag_t);
 void xa_clear_tag(struct xarray *, unsigned long index, xa_tag_t);
@@ -277,6 +279,34 @@ static inline bool xa_tagged(const struct xarray *xa, 
xa_tag_t tag)
return xa->xa_flags & XA_FLAGS_TAG(tag);
 }
 
+/**
+ * xa_insert() - Store this entry in the XArray unless another entry is
+ * already present.
+ * @xa: XArray.
+ * @index: Index into array.
+ * @entry: New entry.
+ * @gfp: Memory allocation flags.
+ *
+ * If you would rather see the existing entry in the array, use xa_cmpxchg().
+ * This function is for users who don't care what the entry is, only that
+ * one is present.
+ *
+ * Context: Process context.  Takes and releases the xa_lock.
+ * May sleep if the @gfp flags permit.
+ * Return: 0 if the store succeeded.  -EEXIST if another entry was present.
+ *-ENOMEM if memory could not be allocated.
+ */
+static inline int xa_insert(struct xarray *xa, unsigned long index,
+   void *entry, gfp_t gfp)
+{
+   void *curr = xa_cmpxchg(xa, index, NULL, entry, gfp);
+   if (!curr)
+   return 0;
+   if (xa_is_err(curr))
+   return xa_err(curr);
+   return -EEXIST;
+}
+
 #define xa_trylock(xa) spin_trylock(&(xa)->xa_lock)
 #define xa_lock(xa)spin_lock(&(xa)->xa_lock)
 #define xa_unlock(xa)  spin_unlock(&(xa)->xa_lock)
@@ -296,9 +326,39 @@ static inline bool xa_tagged(const struct xarray *xa, 
xa_tag_t tag)
  */
 void *__xa_erase(struct xarray *, unsigned long index);
 void *__xa_store(struct xarray *, unsigned long index, void *entry, gfp_t);
+void *__xa_cmpxchg(struct xarray *, unsigned long index, void *old,
+   void *entry, gfp_t);
 void __xa_set_tag(struct xarray *, unsigned long index, xa_tag_t);
 void __xa_clear_tag(struct xarray *, unsigned long index, xa_tag_t);
 
+/**
+ * __xa_insert() - Store this entry in the XArray unless another entry is
+ * already present.
+ * @xa: XArray.
+ * @index: Index into array.
+ * @entry: New entry.
+ * @gfp: Memory allocation flags.
+ *
+ * If you would rather see the existing entry in the array, use __xa_cmpxchg().
+ * This function is for users who don't care what the entry is, only that
+ * one is present.
+ *
+ * Context: Any context.  Expects xa_lock to be held on entry.  May
+ * release and reacquire xa_lock if the @gfp flags permit.
+ * Return: 0 if the store succeeded.  -EEXIST if another entry was present.
+ *-ENOMEM if memory could not be allocated.
+ */
+static inline int __xa_insert(struct xarray *xa, unsigned long index,
+   void *entry, gfp_t gfp)
+{
+   void *curr = __xa_cmpxchg(xa, index, NULL, entry, gfp);
+   if (!curr)
+   return 0;
+   if (xa_is_err(curr))
+   return xa_err(curr);
+   return -EEXIST;
+}
+
 /* Everything below here is the Advanced API.  Proceed with caution. */
 
 /*
diff --git a/lib/xarray.c b/lib/xarray.c
index 9e50804f168c..a231699d894a 100644
--- a/lib/xarray.c
+++ b/lib/xarray.c
@@ -937,6 +937,77 @@ void *__xa_store(struct xarray *xa, unsigned long index, 
void *entry, gfp_t gfp)
 }
 EXPORT_SYMBOL(__xa_store);
 
+/**
+ * xa_cmpxchg() - Conditionally replace an entry in the XArray.
+ * @xa: XArray.
+ * @index: Index into array.
+ * @old: Old value to test against.
+ * @entry: New value to place in array.
+ * @gfp: Memory allocation flags.
+ *
+ * If the entry at @index is the same as @old, replace it with @entry.
+ * If the return value is equal to @old, then the exchange was successful.
+ *
+ * Context: Process context.  Takes and releases the xa_lock.  May sleep
+ * if the @gfp flags permit.
+ * Return: The old value at this index or

[PATCH v8 18/63] xarray: Add xa_store

2018-03-06 Thread Matthew Wilcox

From: Matthew Wilcox 

xa_store() differs from radix_tree_insert() in that it will overwrite an
existing element in the array rather than returning an error.  This is
the behaviour which most users want, and those that want more complex
behaviour generally want to use the xas family of routines anyway.

For memory allocation, xa_store() will first attempt to request memory
from the slab allocator; if memory is not immediately available, it will
drop the xa_lock and allocate memory, keeping a pointer in the xa_state.
It does not use the per-CPU cache, although those will continue to exist
until all radix tree users are converted to the xarray.

This patch also includes xa_erase() and __xa_erase() for a streamlined
way to store NULL.  Since there is no need to allocate memory in order
to store a NULL in the XArray, we do not need to trouble the user with
deciding what memory allocation flags to use.

Signed-off-by: Matthew Wilcox 
---
 include/linux/xarray.h| 109 +
 lib/radix-tree.c  |   4 +-
 lib/xarray.c  | 648 ++
 tools/include/linux/spinlock.h|   2 +
 tools/testing/radix-tree/linux/kernel.h   |   4 +
 tools/testing/radix-tree/linux/lockdep.h  |  11 +
 tools/testing/radix-tree/linux/rcupdate.h |   1 +
 tools/testing/radix-tree/test.c   |  32 ++
 tools/testing/radix-tree/test.h   |   5 +
 tools/testing/radix-tree/xarray-test.c| 113 +-
 10 files changed, 925 insertions(+), 4 deletions(-)
 create mode 100644 tools/testing/radix-tree/linux/lockdep.h

diff --git a/include/linux/xarray.h b/include/linux/xarray.h
index 1cf012256eab..38e290df2ff0 100644
--- a/include/linux/xarray.h
+++ b/include/linux/xarray.h
@@ -157,10 +157,17 @@ typedef unsigned __bitwise xa_tag_t;
 #define XA_PRESENT ((__force xa_tag_t)8U)
 #define XA_TAG_MAX XA_TAG_2
 
+enum xa_lock_type {
+   XA_LOCK_IRQ = 1,
+   XA_LOCK_BH = 2,
+};
+
 /*
  * Values for xa_flags.  The radix tree stores its GFP flags in the xa_flags,
  * and we remain compatible with that.
  */
+#define XA_FLAGS_LOCK_IRQ  ((__force gfp_t)XA_LOCK_IRQ)
+#define XA_FLAGS_LOCK_BH   ((__force gfp_t)XA_LOCK_BH)
 #define XA_FLAGS_TAG(tag)  ((__force gfp_t)((1U << __GFP_BITS_SHIFT) << \
(__force unsigned)(tag)))
 
@@ -210,6 +217,7 @@ struct xarray {
 
 void xa_init_flags(struct xarray *, gfp_t flags);
 void *xa_load(struct xarray *, unsigned long index);
+void *xa_store(struct xarray *, unsigned long index, void *entry, gfp_t);
 bool xa_get_tag(struct xarray *, unsigned long index, xa_tag_t);
 void xa_set_tag(struct xarray *, unsigned long index, xa_tag_t);
 void xa_clear_tag(struct xarray *, unsigned long index, xa_tag_t);
@@ -227,6 +235,35 @@ static inline void xa_init(struct xarray *xa)
xa_init_flags(xa, 0);
 }
 
+/**
+ * xa_erase() - Erase this entry from the XArray.
+ * @xa: XArray.
+ * @index: Index of entry.
+ *
+ * This function is the equivalent of calling xa_store() with %NULL as
+ * the third argument.  The XArray does not need to allocate memory, so
+ * the user does not need to provide GFP flags.
+ *
+ * Context: Process context.  Takes and releases the xa_lock.
+ * Return: The entry which used to be at this index.
+ */
+static inline void *xa_erase(struct xarray *xa, unsigned long index)
+{
+   return xa_store(xa, index, NULL, 0);
+}
+
+/**
+ * xa_empty() - Determine if an array has any present entries.
+ * @xa: XArray.
+ *
+ * Context: Any context.
+ * Return: %true if the array contains only NULL pointers.
+ */
+static inline bool xa_empty(const struct xarray *xa)
+{
+   return xa->xa_head == NULL;
+}
+
 /**
  * xa_tagged() - Inquire whether any entry in this array has a tag set
  * @xa: Array
@@ -254,7 +291,11 @@ static inline bool xa_tagged(const struct xarray *xa, 
xa_tag_t tag)
 
 /*
  * Versions of the normal API which require the caller to hold the xa_lock.
+ * If the GFP flags allow it, will drop the lock in order to allocate
+ * memory, then reacquire it afterwards.
  */
+void *__xa_erase(struct xarray *, unsigned long index);
+void *__xa_store(struct xarray *, unsigned long index, void *entry, gfp_t);
 void __xa_set_tag(struct xarray *, unsigned long index, xa_tag_t);
 void __xa_clear_tag(struct xarray *, unsigned long index, xa_tag_t);
 
@@ -350,6 +391,12 @@ static inline void *xa_entry_locked(struct xarray *xa,
lockdep_is_held(>xa_lock));
 }
 
+/* Private */
+static inline void *xa_mk_node(const struct xa_node *node)
+{
+   return (void *)((unsigned long)node | 2);
+}
+
 /* Private */
 static inline struct xa_node *xa_to_node(const void *entry)
 {
@@ -534,6 +581,12 @@ static inline bool xas_valid(const struct xa_state *xas)
return !xas_invalid(xas);
 }
 
+/* True if the node represents head-of-tree,

[PATCH v8 23/63] xarray: Add xas_next and xas_prev

2018-03-06 Thread Matthew Wilcox

From: Matthew Wilcox 

These two functions move the xas index by one position, and adjust the
rest of the iterator state to match it.  This is more efficient than
calling xas_set() as it keeps the iterator at the leaves of the tree
instead of walking the iterator from the root each time.

Signed-off-by: Matthew Wilcox 
---
 include/linux/xarray.h |  67 +
 lib/xarray.c   |  74 ++
 tools/testing/radix-tree/xarray-test.c | 261 -
 3 files changed, 401 insertions(+), 1 deletion(-)

diff --git a/include/linux/xarray.h b/include/linux/xarray.h
index 96773f83ae03..c8a0ddc1b3df 100644
--- a/include/linux/xarray.h
+++ b/include/linux/xarray.h
@@ -683,6 +683,12 @@ static inline bool xas_not_node(struct xa_node *node)
return ((unsigned long)node & 3) || !node;
 }
 
+/* True if the node represents RESTART or an error */
+static inline bool xas_frozen(struct xa_node *node)
+{
+   return (unsigned long)node & 2;
+}
+
 /* True if the node represents head-of-tree, RESTART or BOUNDS */
 static inline bool xas_top(struct xa_node *node)
 {
@@ -940,6 +946,67 @@ enum {
for (entry = xas_find_tag(xas, max, tag); entry; \
 entry = xas_next_tag(xas, max, tag))
 
+void *__xas_next(struct xa_state *);
+void *__xas_prev(struct xa_state *);
+
+/**
+ * xas_prev() - Move iterator to previous index.
+ * @xas: XArray operation state.
+ *
+ * If the @xas was in an error state, it will remain in an error state
+ * and this function will return %NULL.  If the @xas has never been walked,
+ * it will have the effect of calling xas_load().  Otherwise one will be
+ * subtracted from the index and the state will be walked to the correct
+ * location in the array for the next operation.
+ *
+ * If the iterator was referencing index 0, this function wraps
+ * around to %ULONG_MAX.
+ *
+ * Return: The entry at the new index.  This may be %NULL or an internal
+ * entry, although it should never be a node entry.
+ */
+static inline void *xas_prev(struct xa_state *xas)
+{
+   struct xa_node *node = xas->xa_node;
+
+   if (unlikely(xas_not_node(node) || node->shift ||
+   xas->xa_offset == 0))
+   return __xas_prev(xas);
+
+   xas->xa_index--;
+   xas->xa_offset--;
+   return xa_entry(xas->xa, node, xas->xa_offset);
+}
+
+/**
+ * xas_next() - Move state to next index.
+ * @xas: XArray operation state.
+ *
+ * If the @xas was in an error state, it will remain in an error state
+ * and this function will return %NULL.  If the @xas has never been walked,
+ * it will have the effect of calling xas_load().  Otherwise one will be
+ * added to the index and the state will be walked to the correct
+ * location in the array for the next operation.
+ *
+ * If the iterator was referencing index %ULONG_MAX, this function wraps
+ * around to 0.
+ *
+ * Return: The entry at the new index.  This may be %NULL or an internal
+ * entry, although it should never be a node entry.
+ */
+static inline void *xas_next(struct xa_state *xas)
+{
+   struct xa_node *node = xas->xa_node;
+
+   if (unlikely(xas_not_node(node) || node->shift ||
+   xas->xa_offset == XA_CHUNK_MASK))
+   return __xas_next(xas);
+
+   xas->xa_index++;
+   xas->xa_offset++;
+   return xa_entry(xas->xa, node, xas->xa_offset);
+}
+
 /* Internal functions, mostly shared between radix-tree.c, xarray.c and idr.c 
*/
 void xas_destroy(struct xa_state *);
 
diff --git a/lib/xarray.c b/lib/xarray.c
index 080ed0fc3feb..7cf195b6e740 100644
--- a/lib/xarray.c
+++ b/lib/xarray.c
@@ -838,6 +838,80 @@ void xas_pause(struct xa_state *xas)
 }
 EXPORT_SYMBOL_GPL(xas_pause);
 
+/*
+ * __xas_prev() - Find the previous entry in the XArray.
+ * @xas: XArray operation state.
+ *
+ * Helper function for xas_prev() which handles all the complex cases
+ * out of line.
+ */
+void *__xas_prev(struct xa_state *xas)
+{
+   void *entry;
+
+   if (!xas_frozen(xas->xa_node))
+   xas->xa_index--;
+   if (xas_not_node(xas->xa_node))
+   return xas_load(xas);
+
+   if (xas->xa_offset != get_offset(xas->xa_index, xas->xa_node))
+   xas->xa_offset--;
+
+   while (xas->xa_offset == 255) {
+   xas->xa_offset = xas->xa_node->offset - 1;
+   xas->xa_node = xa_parent(xas->xa, xas->xa_node);
+   if (!xas->xa_node)
+   return set_bounds(xas);
+   }
+
+   for (;;) {
+   entry = xa_entry(xas->xa, xas->xa_node, xas->xa_offset);
+   if (!xa_is_node(entry))
+   return entry;
+
+   xas->xa_node = xa_to_node(entry);
+   xas_set_offset(xas);
+   }
+}
+EXPORT_SYMBOL_GPL(__xas_prev);
+
+/*
+ * __xas_next() - Find the next entry in the XArray.
+ * @xas: XArray operation state.
+ *
+ *

[PATCH v8 24/63] xarray: Add xas_create_range

2018-03-06 Thread Matthew Wilcox

From: Matthew Wilcox 

This hopefully temporary function is useful for users who have not yet
been converted to multi-index entries.

Signed-off-by: Matthew Wilcox 
---
 include/linux/xarray.h |  2 ++
 lib/xarray.c   | 22 ++
 2 files changed, 24 insertions(+)

diff --git a/include/linux/xarray.h b/include/linux/xarray.h
index c8a0ddc1b3df..387be18d05ba 100644
--- a/include/linux/xarray.h
+++ b/include/linux/xarray.h
@@ -744,6 +744,8 @@ void xas_init_tags(const struct xa_state *);
 bool xas_nomem(struct xa_state *, gfp_t);
 void xas_pause(struct xa_state *);
 
+void xas_create_range(struct xa_state *, unsigned long max);
+
 /**
  * xas_reload() - Refetch an entry from the xarray.
  * @xas: XArray operation state.
diff --git a/lib/xarray.c b/lib/xarray.c
index 7cf195b6e740..1d94ecc2dca3 100644
--- a/lib/xarray.c
+++ b/lib/xarray.c
@@ -612,6 +612,28 @@ void *xas_create(struct xa_state *xas)
 }
 EXPORT_SYMBOL_GPL(xas_create);
 
+/**
+ * xas_create_range() - Ensure that stores to this range will succeed
+ * @xas: XArray operation state.
+ * @max: The highest index to create a slot for.
+ *
+ * Creates all of the slots in the range between the current position of
+ * @xas and @max.  This is for the benefit of users who have not yet been
+ * converted to multi-index entries.
+ *
+ * The implementation is naive.
+ */
+void xas_create_range(struct xa_state *xas, unsigned long max)
+{
+   XA_STATE(tmp, xas->xa, xas->xa_index);
+
+   do {
+   xas_create();
+   xas_set(, tmp.xa_index + XA_CHUNK_SIZE);
+   } while (tmp.xa_index < max);
+}
+EXPORT_SYMBOL_GPL(xas_create_range);
+
 static void store_siblings(struct xa_state *xas, void *entry, void *curr,
int *countp, int *valuesp)
 {
-- 
2.16.1

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH v8 21/63] xarray: Add xa_extract

2018-03-06 Thread Matthew Wilcox

From: Matthew Wilcox 

This function combines the functionality of radix_tree_gang_lookup() and
radix_tree_gang_lookup_tagged().  It extracts entries matching the
specified filter into a normal array.

Signed-off-by: Matthew Wilcox 
---
 include/linux/xarray.h |  2 ++
 lib/xarray.c   | 80 ++
 2 files changed, 82 insertions(+)

diff --git a/include/linux/xarray.h b/include/linux/xarray.h
index cf7966bfdd3e..85dd909586f0 100644
--- a/include/linux/xarray.h
+++ b/include/linux/xarray.h
@@ -227,6 +227,8 @@ void *xa_find(struct xarray *xa, unsigned long *index,
unsigned long max, xa_tag_t) __attribute__((nonnull(2)));
 void *xa_find_after(struct xarray *xa, unsigned long *index,
unsigned long max, xa_tag_t) __attribute__((nonnull(2)));
+unsigned int xa_extract(struct xarray *, void **dst, unsigned long start,
+   unsigned long max, unsigned int n, xa_tag_t);
 
 /**
  * xa_init() - Initialise an empty XArray.
diff --git a/lib/xarray.c b/lib/xarray.c
index 267510e98a57..124bbfec66ae 100644
--- a/lib/xarray.c
+++ b/lib/xarray.c
@@ -1388,6 +1388,86 @@ void *xa_find_after(struct xarray *xa, unsigned long 
*indexp,
 }
 EXPORT_SYMBOL(xa_find_after);
 
+static unsigned int xas_extract_present(struct xa_state *xas, void **dst,
+   unsigned long max, unsigned int n)
+{
+   void *entry;
+   unsigned int i = 0;
+
+   rcu_read_lock();
+   xas_for_each(xas, entry, max) {
+   if (xas_retry(xas, entry))
+   continue;
+   dst[i++] = entry;
+   if (i == n)
+   break;
+   }
+   rcu_read_unlock();
+
+   return i;
+}
+
+static unsigned int xas_extract_tag(struct xa_state *xas, void **dst,
+   unsigned long max, unsigned int n, xa_tag_t tag)
+{
+   void *entry;
+   unsigned int i = 0;
+
+   rcu_read_lock();
+   xas_for_each_tag(xas, entry, max, tag) {
+   if (xas_retry(xas, entry))
+   continue;
+   dst[i++] = entry;
+   if (i == n)
+   break;
+   }
+   rcu_read_unlock();
+
+   return i;
+}
+
+/**
+ * xa_extract() - Copy selected entries from the XArray into a normal array.
+ * @xa: The source XArray to copy from.
+ * @dst: The buffer to copy entries into.
+ * @start: The first index in the XArray eligible to be selected.
+ * @max: The last index in the XArray eligible to be selected.
+ * @n: The maximum number of entries to copy.
+ * @filter: Selection criterion.
+ *
+ * Copies up to @n entries that match @filter from the XArray.  The
+ * copied entries will have indices between @start and @max, inclusive.
+ *
+ * The @filter may be an XArray tag value, in which case entries which are
+ * tagged with that tag will be copied.  It may also be %XA_PRESENT, in
+ * which case non-NULL entries will be copied.
+ *
+ * The entries returned may not represent a snapshot of the XArray at a
+ * moment in time.  For example, if another thread stores to index 5, then
+ * index 10, calling xa_extract() may return the old contents of index 5
+ * and the new contents of index 10.  Indices not modified while this
+ * function is running will not be skipped.
+ *
+ * If you need stronger guarantees, holding the xa_lock across calls to this
+ * function will prevent concurrent modification.
+ *
+ * Context: Any context.  Takes and releases the RCU lock.
+ * Return: The number of entries copied.
+ */
+unsigned int xa_extract(struct xarray *xa, void **dst, unsigned long start,
+   unsigned long max, unsigned int n, xa_tag_t filter)
+{
+   XA_STATE(xas, xa, start);
+
+   if (!n)
+   return 0;
+
+   if ((__force unsigned int)filter < XA_MAX_TAGS)
+   return xas_extract_tag(, dst, max, n, filter);
+   return xas_extract_present(, dst, max, n);
+}
+EXPORT_SYMBOL(xa_extract);
+
 #ifdef XA_DEBUG
 void xa_dump_node(const struct xa_node *node)
 {
-- 
2.16.1

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH v8 20/63] xarray: Add xa_for_each

2018-03-06 Thread Matthew Wilcox

From: Matthew Wilcox 

This iterator allows the user to efficiently walk a range of the array,
executing the loop body once for each entry in that range that matches
the filter.  This commit also includes xa_find() and xa_find_above()
which are helper functions for xa_for_each() but may also be useful in
their own right.

In the xas family of functions, we also have xas_for_each(), xas_find(),
xas_next_entry(), xas_for_each_tag(), xas_find_tag(), xas_next_tag()
and xas_pause().

Signed-off-by: Matthew Wilcox 
---
 include/linux/xarray.h | 173 +
 lib/xarray.c   | 274 +
 tools/testing/radix-tree/test.c|  13 ++
 tools/testing/radix-tree/test.h|   1 +
 tools/testing/radix-tree/xarray-test.c | 122 +++
 5 files changed, 583 insertions(+)

diff --git a/include/linux/xarray.h b/include/linux/xarray.h
index e95ebe2488f9..cf7966bfdd3e 100644
--- a/include/linux/xarray.h
+++ b/include/linux/xarray.h
@@ -223,6 +223,10 @@ void *xa_cmpxchg(struct xarray *, unsigned long index,
 bool xa_get_tag(struct xarray *, unsigned long index, xa_tag_t);
 void xa_set_tag(struct xarray *, unsigned long index, xa_tag_t);
 void xa_clear_tag(struct xarray *, unsigned long index, xa_tag_t);
+void *xa_find(struct xarray *xa, unsigned long *index,
+   unsigned long max, xa_tag_t) __attribute__((nonnull(2)));
+void *xa_find_after(struct xarray *xa, unsigned long *index,
+   unsigned long max, xa_tag_t) __attribute__((nonnull(2)));
 
 /**
  * xa_init() - Initialise an empty XArray.
@@ -279,6 +283,35 @@ static inline bool xa_tagged(const struct xarray *xa, 
xa_tag_t tag)
return xa->xa_flags & XA_FLAGS_TAG(tag);
 }
 
+/**
+ * xa_for_each() - Iterate over a portion of an XArray.
+ * @xa: XArray.
+ * @entry: Entry retrieved from array.
+ * @index: Index of @entry.
+ * @max: Maximum index to retrieve from array.
+ * @filter: Selection criterion.
+ *
+ * Initialise @index to the minimum index you want to retrieve from
+ * the array.  During the iteration, @entry will have the value of the
+ * entry stored in @xa at @index.  The iteration will skip all entries in
+ * the array which do not match @filter.  You may modify @index during the
+ * iteration if you want to skip or reprocess indices.  It is safe to modify
+ * the array during the iteration.  At the end of the iteration, @entry will
+ * be set to NULL and @index will have a value less than or equal to max.
+ *
+ * xa_for_each() is O(n.log(n)) while xas_for_each() is O(n).  You have
+ * to handle your own locking with xas_for_each(), and if you have to unlock
+ * after each iteration, it will also end up being O(n.log(n)).  xa_for_each()
+ * will spin if it hits a retry entry; if you intend to see retry entries,
+ * you should use the xas_for_each() iterator instead.  The xas_for_each()
+ * iterator will expand into more inline code than xa_for_each().
+ *
+ * Context: Any context.  Takes and releases the RCU lock.
+ */
+#define xa_for_each(xa, entry, index, max, filter) \
+   for (entry = xa_find(xa, , max, filter); entry; \
+entry = xa_find_after(xa, , max, filter))
+
 /**
  * xa_insert() - Store this entry in the XArray unless another entry is
  * already present.
@@ -641,6 +674,12 @@ static inline bool xas_valid(const struct xa_state *xas)
return !xas_invalid(xas);
 }
 
+/* True if the pointer is something other than a node */
+static inline bool xas_not_node(struct xa_node *node)
+{
+   return ((unsigned long)node & 3) || !node;
+}
+
 /* True if the node represents head-of-tree, RESTART or BOUNDS */
 static inline bool xas_top(struct xa_node *node)
 {
@@ -685,13 +724,16 @@ static inline bool xas_retry(struct xa_state *xas, const 
void *entry)
 void *xas_load(struct xa_state *);
 void *xas_store(struct xa_state *, void *entry);
 void *xas_create(struct xa_state *);
+void *xas_find(struct xa_state *, unsigned long max);
 
 bool xas_get_tag(const struct xa_state *, xa_tag_t);
 void xas_set_tag(const struct xa_state *, xa_tag_t);
 void xas_clear_tag(const struct xa_state *, xa_tag_t);
+void *xas_find_tag(struct xa_state *, unsigned long max, xa_tag_t);
 void xas_init_tags(const struct xa_state *);
 
 bool xas_nomem(struct xa_state *, gfp_t);
+void xas_pause(struct xa_state *);
 
 /**
  * xas_reload() - Refetch an entry from the xarray.
@@ -764,6 +806,137 @@ static inline void xas_set_update(struct xa_state *xas, 
xa_update_node_t update)
xas->xa_update = update;
 }
 
+/* Skip over any of these entries when iterating */
+static inline bool xa_iter_skip(const void *entry)
+{
+   return unlikely(!entry ||
+   (xa_is_internal(entry) && entry < XA_RETRY_ENTRY));
+}
+
+/**
+ * xas_next_entry() - Advance iterator to next present entry.
+ * @xas: XArray operation state.
+ * @max: Highest index to return.
+ *
+ *

[PATCH v8 22/63] xarray: Add xa_destroy

2018-03-06 Thread Matthew Wilcox

From: Matthew Wilcox 

This function frees all the internal memory allocated to the xarray
and reinitialises it to be empty.

Signed-off-by: Matthew Wilcox 
---
 include/linux/xarray.h |  1 +
 lib/xarray.c   | 28 
 2 files changed, 29 insertions(+)

diff --git a/include/linux/xarray.h b/include/linux/xarray.h
index 85dd909586f0..96773f83ae03 100644
--- a/include/linux/xarray.h
+++ b/include/linux/xarray.h
@@ -229,6 +229,7 @@ void *xa_find_after(struct xarray *xa, unsigned long *index,
unsigned long max, xa_tag_t) __attribute__((nonnull(2)));
 unsigned int xa_extract(struct xarray *, void **dst, unsigned long start,
unsigned long max, unsigned int n, xa_tag_t);
+void xa_destroy(struct xarray *);
 
 /**
  * xa_init() - Initialise an empty XArray.
diff --git a/lib/xarray.c b/lib/xarray.c
index 124bbfec66ae..080ed0fc3feb 100644
--- a/lib/xarray.c
+++ b/lib/xarray.c
@@ -1468,6 +1468,34 @@ unsigned int xa_extract(struct xarray *xa, void **dst, 
unsigned long start,
 }
 EXPORT_SYMBOL(xa_extract);
 
+/**
+ * xa_destroy() - Free all internal data structures.
+ * @xa: XArray.
+ *
+ * After calling this function, the XArray is empty and has freed all memory
+ * allocated for its internal data structures.  You are responsible for
+ * freeing the objects referenced by the XArray.
+ *
+ * Context: Any context.  Takes and releases the xa_lock, interrupt-safe.
+ */
+void xa_destroy(struct xarray *xa)
+{
+   XA_STATE(xas, xa, 0);
+   unsigned long flags;
+   void *entry;
+
+   xas.xa_node = NULL;
+   xas_lock_irqsave(, flags);
+   entry = xa_head_locked(xa);
+   RCU_INIT_POINTER(xa->xa_head, NULL);
+   xas_init_tags();
+   /* lockdep checks we're still holding the lock in xas_free_nodes() */
+   if (xa_is_node(entry))
+   xas_free_nodes(, xa_to_node(entry));
+   xas_unlock_irqrestore(, flags);
+}
+EXPORT_SYMBOL(xa_destroy);
+
 #ifdef XA_DEBUG
 void xa_dump_node(const struct xa_node *node)
 {
-- 
2.16.1

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH v8 29/63] page cache: Convert page deletion to XArray

2018-03-06 Thread Matthew Wilcox

From: Matthew Wilcox 

The code is slightly shorter and simpler.

Signed-off-by: Matthew Wilcox 
---
 mm/filemap.c | 30 ++
 1 file changed, 14 insertions(+), 16 deletions(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index 0e19ea454cba..bdda1beda932 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -111,30 +111,28 @@
  *   ->tasklist_lock(memory_failure, collect_procs_ao)
  */
 
-static void page_cache_tree_delete(struct address_space *mapping,
+static void page_cache_delete(struct address_space *mapping,
   struct page *page, void *shadow)
 {
-   int i, nr;
+   XA_STATE(xas, >i_pages, page->index);
+   unsigned int i, nr;
 
-   /* hugetlb pages are represented by one entry in the radix tree */
+   mapping_set_update(, mapping);
+
+   /* hugetlb pages are represented by a single entry in the xarray */
nr = PageHuge(page) ? 1 : hpage_nr_pages(page);
 
VM_BUG_ON_PAGE(!PageLocked(page), page);
VM_BUG_ON_PAGE(PageTail(page), page);
VM_BUG_ON_PAGE(nr != 1 && shadow, page);
 
-   for (i = 0; i < nr; i++) {
-   struct radix_tree_node *node;
-   void **slot;
-
-   __radix_tree_lookup(>i_pages, page->index + i,
-   , );
-
-   VM_BUG_ON_PAGE(!node && nr != 1, page);
-
-   radix_tree_clear_tags(>i_pages, node, slot);
-   __radix_tree_replace(>i_pages, node, slot, shadow,
-   workingset_lookup_update(mapping));
+   i = nr;
+repeat:
+   xas_store(, shadow);
+   xas_init_tags();
+   if (--i) {
+   xas_next();
+   goto repeat;
}
 
page->mapping = NULL;
@@ -234,7 +232,7 @@ void __delete_from_page_cache(struct page *page, void 
*shadow)
trace_mm_filemap_delete_from_page_cache(page);
 
unaccount_page_cache_page(mapping, page);
-   page_cache_tree_delete(mapping, page, shadow);
+   page_cache_delete(mapping, page, shadow);
 }
 
 static void page_cache_free_page(struct address_space *mapping,
-- 
2.16.1

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH v8 26/63] page cache: Rearrange address_space

2018-03-06 Thread Matthew Wilcox

From: Matthew Wilcox 

Change i_pages from a radix_tree_root to an xarray, convert the
documentation into kernel-doc format and change the order of the elements
to pack them better on 64-bit systems.

Signed-off-by: Matthew Wilcox 
---
 include/linux/fs.h | 46 +++---
 1 file changed, 31 insertions(+), 15 deletions(-)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index 75f1f66aec35..785100c2b835 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -389,24 +389,40 @@ int pagecache_write_end(struct file *, struct 
address_space *mapping,
loff_t pos, unsigned len, unsigned copied,
struct page *page, void *fsdata);
 
+/**
+ * struct address_space - Contents of a cacheable, mappable object.
+ * @host: Owner, either the inode or the block_device.
+ * @i_pages: Cached pages.
+ * @gfp_mask: Memory allocation flags to use for allocating pages.
+ * @i_mmap_writable: Number of VM_SHARED mappings.
+ * @i_mmap: Tree of private and shared mappings.
+ * @i_mmap_rwsem: Protects @i_mmap and @i_mmap_writable.
+ * @nrpages: Number of page entries, protected by the i_pages lock.
+ * @nrexceptional: Shadow or DAX entries, protected by the i_pages lock.
+ * @writeback_index: Writeback starts here.
+ * @a_ops: Methods.
+ * @flags: Error bits and flags (AS_*).
+ * @wb_err: The most recent error which has occurred.
+ * @private_lock: For use by the owner of the address_space.
+ * @private_list: For use by the owner of the address_space.
+ * @private_data: For use by the owner of the address_space.
+ */
 struct address_space {
-   struct inode*host;  /* owner: inode, block_device */
-   struct radix_tree_root  i_pages;/* cached pages */
-   atomic_ti_mmap_writable;/* count VM_SHARED mappings */
-   struct rb_root_cached   i_mmap; /* tree of private and shared 
mappings */
-   struct rw_semaphore i_mmap_rwsem;   /* protect tree, count, list */
-   /* Protected by the i_pages lock */
-   unsigned long   nrpages;/* number of total pages */
-   /* number of shadow or DAX exceptional entries */
+   struct inode*host;
+   struct xarray   i_pages;
+   gfp_t   gfp_mask;
+   atomic_ti_mmap_writable;
+   struct rb_root_cached   i_mmap;
+   struct rw_semaphore i_mmap_rwsem;
+   unsigned long   nrpages;
unsigned long   nrexceptional;
-   pgoff_t writeback_index;/* writeback starts here */
-   const struct address_space_operations *a_ops;   /* methods */
-   unsigned long   flags;  /* error bits */
-   spinlock_t  private_lock;   /* for use by the address_space 
*/
-   gfp_t   gfp_mask;   /* implicit gfp mask for 
allocations */
-   struct list_headprivate_list;   /* for use by the address_space 
*/
-   void*private_data;  /* ditto */
+   pgoff_t writeback_index;
+   const struct address_space_operations *a_ops;
+   unsigned long   flags;
errseq_twb_err;
+   spinlock_t  private_lock;
+   struct list_headprivate_list;
+   void*private_data;
 } __attribute__((aligned(sizeof(long __randomize_layout;
/*
 * On most architectures that alignment is already the case; but
-- 
2.16.1

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH v8 25/63] xarray: Add MAINTAINERS entry

2018-03-06 Thread Matthew Wilcox

From: Matthew Wilcox 

Add myself as XArray and IDR maintainer.

Signed-off-by: Matthew Wilcox 
---
 MAINTAINERS | 12 
 1 file changed, 12 insertions(+)

diff --git a/MAINTAINERS b/MAINTAINERS
index 6d78237066ab..08613d97a74d 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -15293,6 +15293,18 @@ T: git 
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git x86/vdso
 S: Maintained
 F: arch/x86/entry/vdso/
 
+XARRAY
+M: Matthew Wilcox 
+M: Matthew Wilcox 
+L: linux-fsde...@vger.kernel.org
+S: Supported
+F: Documentation/core-api/xarray.rst
+F: lib/idr.c
+F: lib/xarray.c
+F: include/linux/idr.h
+F: include/linux/xarray.h
+F: tools/testing/radix-tree
+
 XC2028/3028 TUNER DRIVER
 M: Mauro Carvalho Chehab 
 M: Mauro Carvalho Chehab 
-- 
2.16.1

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH v8 27/63] page cache: Convert hole search to XArray

2018-03-06 Thread Matthew Wilcox

From: Matthew Wilcox 

The page cache offers the ability to search for a miss in the previous or
next N locations.  Rather than teach the XArray about the page cache's
definition of a miss, use xas_prev() and xas_next() to search the page
array.  This should be more efficient as it does not have to start the
lookup from the top for each index.

Signed-off-by: Matthew Wilcox 
---
 fs/nfs/blocklayout/blocklayout.c |   2 +-
 include/linux/pagemap.h  |   4 +-
 mm/filemap.c | 110 ++-
 mm/readahead.c   |   4 +-
 4 files changed, 55 insertions(+), 65 deletions(-)

diff --git a/fs/nfs/blocklayout/blocklayout.c b/fs/nfs/blocklayout/blocklayout.c
index 7cb5c38c19e4..961901685007 100644
--- a/fs/nfs/blocklayout/blocklayout.c
+++ b/fs/nfs/blocklayout/blocklayout.c
@@ -895,7 +895,7 @@ static u64 pnfs_num_cont_bytes(struct inode *inode, pgoff_t 
idx)
end = DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE);
if (end != inode->i_mapping->nrpages) {
rcu_read_lock();
-   end = page_cache_next_hole(mapping, idx + 1, ULONG_MAX);
+   end = page_cache_next_gap(mapping, idx + 1, ULONG_MAX);
rcu_read_unlock();
}
 
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index b1bd2186e6d2..2f5d2d3ebaac 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -241,9 +241,9 @@ static inline gfp_t readahead_gfp_mask(struct address_space 
*x)
 
 typedef int filler_t(void *, struct page *);
 
-pgoff_t page_cache_next_hole(struct address_space *mapping,
+pgoff_t page_cache_next_gap(struct address_space *mapping,
 pgoff_t index, unsigned long max_scan);
-pgoff_t page_cache_prev_hole(struct address_space *mapping,
+pgoff_t page_cache_prev_gap(struct address_space *mapping,
 pgoff_t index, unsigned long max_scan);
 
 #define FGP_ACCESSED   0x0001
diff --git a/mm/filemap.c b/mm/filemap.c
index f2251183a977..efe227940784 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1326,86 +1326,76 @@ int __lock_page_or_retry(struct page *page, struct 
mm_struct *mm,
 }
 
 /**
- * page_cache_next_hole - find the next hole (not-present entry)
- * @mapping: mapping
- * @index: index
- * @max_scan: maximum range to search
- *
- * Search the set [index, min(index+max_scan-1, MAX_INDEX)] for the
- * lowest indexed hole.
- *
- * Returns: the index of the hole if found, otherwise returns an index
- * outside of the set specified (in which case 'return - index >=
- * max_scan' will be true). In rare cases of index wrap-around, 0 will
- * be returned.
- *
- * page_cache_next_hole may be called under rcu_read_lock. However,
- * like radix_tree_gang_lookup, this will not atomically search a
- * snapshot of the tree at a single point in time. For example, if a
- * hole is created at index 5, then subsequently a hole is created at
- * index 10, page_cache_next_hole covering both indexes may return 10
- * if called under rcu_read_lock.
+ * page_cache_next_gap() - Find the next gap in the page cache.
+ * @mapping: Mapping.
+ * @index: Index.
+ * @max_scan: Maximum range to search.
+ *
+ * Search the range [index, min(index + max_scan - 1, ULONG_MAX)] for the
+ * gap with the lowest index.
+ *
+ * This function may be called under the rcu_read_lock.  However, this will
+ * not atomically search a snapshot of the cache at a single point in time.
+ * For example, if a gap is created at index 5, then subsequently a gap is
+ * created at index 10, page_cache_next_gap covering both indices may
+ * return 10 if called under the rcu_read_lock.
+ *
+ * Return: The index of the gap if found, otherwise an index outside the
+ * range specified (in which case 'return - index >= max_scan' will be true).
+ * In the rare case of index wrap-around, 0 will be returned.
  */
-pgoff_t page_cache_next_hole(struct address_space *mapping,
+pgoff_t page_cache_next_gap(struct address_space *mapping,
 pgoff_t index, unsigned long max_scan)
 {
-   unsigned long i;
+   XA_STATE(xas, >i_pages, index);
 
-   for (i = 0; i < max_scan; i++) {
-   struct page *page;
-
-   page = radix_tree_lookup(>i_pages, index);
-   if (!page || xa_is_value(page))
+   while (max_scan--) {
+   void *entry = xas_next();
+   if (!entry || xa_is_value(entry))
break;
-   index++;
-   if (index == 0)
+   if (xas.xa_index == 0)
break;
}
 
-   return index;
+   return xas.xa_index;
 }
-EXPORT_SYMBOL(page_cache_next_hole);
+EXPORT_SYMBOL(page_cache_next_gap);
 
 /**
- * page_cache_prev_hole - find the prev hole (not-present entry)
- * @mapping: mapping
- * @index: index
- * @max_scan: maximum range to search
- *
- * Search backwards in the

[PATCH v8 30/63] page cache: Convert page cache lookups to XArray

2018-03-06 Thread Matthew Wilcox

From: Matthew Wilcox 

Introduce page_cache_pin() to factor out the common logic between the
various lookup routines:

find_get_entry
find_get_entries
find_get_pages_range
find_get_pages_contig
find_get_pages_range_tag
find_get_entries_tag
filemap_map_pages

By using the xa_state to control the iteration, we can remove most of
the gotos and just use the normal break/continue loop control flow.

Also convert the regression1 read-side to XArray since that simulates
the functions being modified here.

Signed-off-by: Matthew Wilcox 
---
 include/linux/pagemap.h|   6 +-
 mm/filemap.c   | 380 +
 tools/testing/radix-tree/regression1.c |  68 +++---
 3 files changed, 129 insertions(+), 325 deletions(-)

diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 2f5d2d3ebaac..442977811b59 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -363,17 +363,17 @@ static inline unsigned find_get_pages(struct 
address_space *mapping,
 unsigned find_get_pages_contig(struct address_space *mapping, pgoff_t start,
   unsigned int nr_pages, struct page **pages);
 unsigned find_get_pages_range_tag(struct address_space *mapping, pgoff_t 
*index,
-   pgoff_t end, int tag, unsigned int nr_pages,
+   pgoff_t end, xa_tag_t tag, unsigned int nr_pages,
struct page **pages);
 static inline unsigned find_get_pages_tag(struct address_space *mapping,
-   pgoff_t *index, int tag, unsigned int nr_pages,
+   pgoff_t *index, xa_tag_t tag, unsigned int nr_pages,
struct page **pages)
 {
return find_get_pages_range_tag(mapping, index, (pgoff_t)-1, tag,
nr_pages, pages);
 }
 unsigned find_get_entries_tag(struct address_space *mapping, pgoff_t start,
-   int tag, unsigned int nr_entries,
+   xa_tag_t tag, unsigned int nr_entries,
struct page **entries, pgoff_t *indices);
 
 struct page *grab_cache_page_write_begin(struct address_space *mapping,
diff --git a/mm/filemap.c b/mm/filemap.c
index bdda1beda932..5a6c7c874d45 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1374,6 +1374,32 @@ pgoff_t page_cache_prev_gap(struct address_space 
*mapping,
 }
 EXPORT_SYMBOL(page_cache_prev_gap);
 
+/*
+ * page_cache_pin() - Try to pin a page in the page cache.
+ * @xas: The XArray operation state.
+ * @pagep: The page which has been previously found at this location.
+ *
+ * On success, the page has an elevated refcount, but is not locked.
+ * This implements the lockless pagecache protocol as described in
+ * include/linux/pagemap.h; see page_cache_get_speculative().
+ *
+ * Return: True if the page is still in the cache.
+ */
+static bool page_cache_pin(struct xa_state *xas, struct page *page)
+{
+   struct page *head = compound_head(page);
+   bool got = page_cache_get_speculative(head);
+
+   if (likely(got && (xas_reload(xas) == page) &&
+   (compound_head(page) == head)))
+   return true;
+
+   if (got)
+   put_page(head);
+   xas_reset(xas);
+   return false;
+}
+
 /**
  * find_get_entry - find and get a page cache entry
  * @mapping: the address_space to search
@@ -1389,51 +1415,21 @@ EXPORT_SYMBOL(page_cache_prev_gap);
  */
 struct page *find_get_entry(struct address_space *mapping, pgoff_t offset)
 {
-   void **pagep;
-   struct page *head, *page;
+   XA_STATE(xas, >i_pages, offset);
+   struct page *page;
 
rcu_read_lock();
-repeat:
-   page = NULL;
-   pagep = radix_tree_lookup_slot(>i_pages, offset);
-   if (pagep) {
-   page = radix_tree_deref_slot(pagep);
-   if (unlikely(!page))
-   goto out;
-   if (radix_tree_exception(page)) {
-   if (radix_tree_deref_retry(page))
-   goto repeat;
-   /*
-* A shadow entry of a recently evicted page,
-* or a swap entry from shmem/tmpfs.  Return
-* it without attempting to raise page count.
-*/
-   goto out;
-   }
-
-   head = compound_head(page);
-   if (!page_cache_get_speculative(head))
-   goto repeat;
-
-   /* The page was split under us? */
-   if (compound_head(page) != head) {
-   put_page(head);
-   goto repeat;
-   }
+   do {
+   page = xas_load();
+   if (xas_retry(, page))
+   continue;
+   if (!page || xa_is_value(page))
+

[PATCH v8 28/63] page cache: Add and replace pages using the XArray

2018-03-06 Thread Matthew Wilcox

From: Matthew Wilcox 

Use the XArray APIs to add and replace pages in the page cache.  This
removes two uses of the radix tree preload API and is significantly
shorter code.

Signed-off-by: Matthew Wilcox 
---
 include/linux/swap.h |   8 ++-
 mm/filemap.c | 143 ++-
 2 files changed, 67 insertions(+), 84 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 1985940af479..a0ebb5deea2d 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -300,8 +300,12 @@ void *workingset_eviction(struct address_space *mapping, 
struct page *page);
 bool workingset_refault(void *shadow);
 void workingset_activation(struct page *page);
 
-/* Do not use directly, use workingset_lookup_update */
-void workingset_update_node(struct radix_tree_node *node);
+/* Only track the nodes of mappings with shadow entries */
+void workingset_update_node(struct xa_node *node);
+#define mapping_set_update(xas, mapping) do {  \
+   if (!dax_mapping(mapping) && !shmem_mapping(mapping))   \
+   xas_set_update(xas, workingset_update_node);\
+} while (0)
 
 /* Returns workingset_update_node() if the mapping has shadow entries. */
 #define workingset_lookup_update(mapping)  \
diff --git a/mm/filemap.c b/mm/filemap.c
index efe227940784..0e19ea454cba 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -111,35 +111,6 @@
  *   ->tasklist_lock(memory_failure, collect_procs_ao)
  */
 
-static int page_cache_tree_insert(struct address_space *mapping,
- struct page *page, void **shadowp)
-{
-   struct radix_tree_node *node;
-   void **slot;
-   int error;
-
-   error = __radix_tree_create(>i_pages, page->index, 0,
-   , );
-   if (error)
-   return error;
-   if (*slot) {
-   void *p;
-
-   p = radix_tree_deref_slot_protected(slot,
-   >i_pages.xa_lock);
-   if (!xa_is_value(p))
-   return -EEXIST;
-
-   mapping->nrexceptional--;
-   if (shadowp)
-   *shadowp = p;
-   }
-   __radix_tree_replace(>i_pages, node, slot, page,
-workingset_lookup_update(mapping));
-   mapping->nrpages++;
-   return 0;
-}
-
 static void page_cache_tree_delete(struct address_space *mapping,
   struct page *page, void *shadow)
 {
@@ -775,51 +746,44 @@ EXPORT_SYMBOL(file_write_and_wait_range);
  * locked.  This function does not add the new page to the LRU, the
  * caller must do that.
  *
- * The remove + add is atomic.  The only way this function can fail is
- * memory allocation failure.
+ * The remove + add is atomic.  This function cannot fail.
  */
 int replace_page_cache_page(struct page *old, struct page *new, gfp_t gfp_mask)
 {
-   int error;
+   struct address_space *mapping = old->mapping;
+   void (*freepage)(struct page *) = mapping->a_ops->freepage;
+   pgoff_t offset = old->index;
+   XA_STATE(xas, >i_pages, offset);
+   unsigned long flags;
 
VM_BUG_ON_PAGE(!PageLocked(old), old);
VM_BUG_ON_PAGE(!PageLocked(new), new);
VM_BUG_ON_PAGE(new->mapping, new);
 
-   error = radix_tree_preload(gfp_mask & ~__GFP_HIGHMEM);
-   if (!error) {
-   struct address_space *mapping = old->mapping;
-   void (*freepage)(struct page *);
-   unsigned long flags;
-
-   pgoff_t offset = old->index;
-   freepage = mapping->a_ops->freepage;
-
-   get_page(new);
-   new->mapping = mapping;
-   new->index = offset;
+   get_page(new);
+   new->mapping = mapping;
+   new->index = offset;
 
-   xa_lock_irqsave(>i_pages, flags);
-   __delete_from_page_cache(old, NULL);
-   error = page_cache_tree_insert(mapping, new, NULL);
-   BUG_ON(error);
+   xas_lock_irqsave(, flags);
+   xas_store(, new);
 
-   /*
-* hugetlb pages do not participate in page cache accounting.
-*/
-   if (!PageHuge(new))
-   __inc_node_page_state(new, NR_FILE_PAGES);
-   if (PageSwapBacked(new))
-   __inc_node_page_state(new, NR_SHMEM);
-   xa_unlock_irqrestore(>i_pages, flags);
-   mem_cgroup_migrate(old, new);
-   radix_tree_preload_end();
-   if (freepage)
-   freepage(old);
-   put_page(old);
-   }
+   old->mapping = NULL;
+   /* hugetlb pages do not participate in page cache accounting. */
+   if (!PageHuge(old))
+   __dec_node_page_state(new,

[PATCH v8 32/63] page cache: Remove stray radix comment

2018-03-06 Thread Matthew Wilcox

From: Matthew Wilcox 

Signed-off-by: Matthew Wilcox 
---
 mm/filemap.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index 0635e9cdbc06..86c83014c909 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2579,7 +2579,7 @@ static struct page *do_read_cache_page(struct 
address_space *mapping,
put_page(page);
if (err == -EEXIST)
goto repeat;
-   /* Presumably ENOMEM for radix tree node */
+   /* Presumably ENOMEM for xarray node */
return ERR_PTR(err);
}
 
-- 
2.16.1

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH v8 31/63] page cache: Convert delete_batch to XArray

2018-03-06 Thread Matthew Wilcox

From: Matthew Wilcox 

Rename the function from page_cache_tree_delete_batch to just
page_cache_delete_batch.

Signed-off-by: Matthew Wilcox 
---
 mm/filemap.c | 28 +---
 1 file changed, 13 insertions(+), 15 deletions(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index 5a6c7c874d45..0635e9cdbc06 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -275,7 +275,7 @@ void delete_from_page_cache(struct page *page)
 EXPORT_SYMBOL(delete_from_page_cache);
 
 /*
- * page_cache_tree_delete_batch - delete several pages from page cache
+ * page_cache_delete_batch - delete several pages from page cache
  * @mapping: the mapping to which pages belong
  * @pvec: pagevec with pages to delete
  *
@@ -288,23 +288,18 @@ EXPORT_SYMBOL(delete_from_page_cache);
  *
  * The function expects the i_pages lock to be held.
  */
-static void
-page_cache_tree_delete_batch(struct address_space *mapping,
+static void page_cache_delete_batch(struct address_space *mapping,
 struct pagevec *pvec)
 {
-   struct radix_tree_iter iter;
-   void **slot;
+   XA_STATE(xas, >i_pages, pvec->pages[0]->index);
int total_pages = 0;
int i = 0, tail_pages = 0;
struct page *page;
-   pgoff_t start;
 
-   start = pvec->pages[0]->index;
-   radix_tree_for_each_slot(slot, >i_pages, , start) {
+   mapping_set_update(, mapping);
+   xas_for_each(, page, ULONG_MAX) {
if (i >= pagevec_count(pvec) && !tail_pages)
break;
-   page = radix_tree_deref_slot_protected(slot,
-  
>i_pages.xa_lock);
if (xa_is_value(page))
continue;
if (!tail_pages) {
@@ -313,8 +308,11 @@ page_cache_tree_delete_batch(struct address_space *mapping,
 * have our pages locked so they are protected from
 * being removed.
 */
-   if (page != pvec->pages[i])
+   if (page != pvec->pages[i]) {
+   VM_BUG_ON_PAGE(page->index >
+   pvec->pages[i]->index, page);
continue;
+   }
WARN_ON_ONCE(!PageLocked(page));
if (PageTransHuge(page) && !PageHuge(page))
tail_pages = HPAGE_PMD_NR - 1;
@@ -325,11 +323,11 @@ page_cache_tree_delete_batch(struct address_space 
*mapping,
 */
i++;
} else {
+   VM_BUG_ON_PAGE(page->index + HPAGE_PMD_NR - tail_pages
+   != pvec->pages[i]->index, page);
tail_pages--;
}
-   radix_tree_clear_tags(>i_pages, iter.node, slot);
-   __radix_tree_replace(>i_pages, iter.node, slot, NULL,
-   workingset_lookup_update(mapping));
+   xas_store(, NULL);
total_pages++;
}
mapping->nrpages -= total_pages;
@@ -350,7 +348,7 @@ void delete_from_page_cache_batch(struct address_space 
*mapping,
 
unaccount_page_cache_page(mapping, pvec->pages[i]);
}
-   page_cache_tree_delete_batch(mapping, pvec);
+   page_cache_delete_batch(mapping, pvec);
xa_unlock_irqrestore(>i_pages, flags);
 
for (i = 0; i < pagevec_count(pvec); i++)
-- 
2.16.1

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH v8 33/63] page cache: Convert filemap_range_has_page to XArray

2018-03-06 Thread Matthew Wilcox

From: Matthew Wilcox 

Instead of calling find_get_pages_range() and putting any reference,
use xas_find() to iterate over any entries in the range, skipping the
shadow/swap entries.

Signed-off-by: Matthew Wilcox 
---
 mm/filemap.c | 26 ++
 1 file changed, 18 insertions(+), 8 deletions(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index 86c83014c909..9bc417913269 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -458,20 +458,30 @@ EXPORT_SYMBOL(filemap_flush);
 bool filemap_range_has_page(struct address_space *mapping,
   loff_t start_byte, loff_t end_byte)
 {
-   pgoff_t index = start_byte >> PAGE_SHIFT;
-   pgoff_t end = end_byte >> PAGE_SHIFT;
struct page *page;
+   XA_STATE(xas, >i_pages, start_byte >> PAGE_SHIFT);
+   pgoff_t max = end_byte >> PAGE_SHIFT;
 
if (end_byte < start_byte)
return false;
 
-   if (mapping->nrpages == 0)
-   return false;
+   rcu_read_lock();
+   do {
+   page = xas_find(, max);
+   if (xas_retry(, page))
+   continue;
+   /* Shadow entries don't count */
+   if (xa_is_value(page))
+   continue;
+   /*
+* We don't need to try to pin this page; we're about to
+* release the RCU lock anyway.  It is enough to know that
+* there was a page here recently.
+*/
+   } while (0);
+   rcu_read_unlock();
 
-   if (!find_get_pages_range(mapping, , end, 1, ))
-   return false;
-   put_page(page);
-   return true;
+   return page != NULL;
 }
 EXPORT_SYMBOL(filemap_range_has_page);
 
-- 
2.16.1

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH v8 38/63] mm: Convert delete_from_swap_cache to XArray

2018-03-06 Thread Matthew Wilcox

From: Matthew Wilcox 

Both callers of __delete_from_swap_cache have the swp_entry_t already,
so pass that in to make constructing the XA_STATE easier.

Signed-off-by: Matthew Wilcox 
---
 include/linux/swap.h |  5 +++--
 mm/swap_state.c  | 24 ++--
 mm/vmscan.c  |  2 +-
 3 files changed, 14 insertions(+), 17 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index dab96af23d96..0b6a47a46c55 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -404,7 +404,7 @@ extern void show_swap_cache_info(void);
 extern int add_to_swap(struct page *page);
 extern int add_to_swap_cache(struct page *, swp_entry_t, gfp_t);
 extern int __add_to_swap_cache(struct page *page, swp_entry_t entry);
-extern void __delete_from_swap_cache(struct page *);
+extern void __delete_from_swap_cache(struct page *, swp_entry_t entry);
 extern void delete_from_swap_cache(struct page *);
 extern void free_page_and_swap_cache(struct page *);
 extern void free_pages_and_swap_cache(struct page **, int);
@@ -564,7 +564,8 @@ static inline int add_to_swap_cache(struct page *page, 
swp_entry_t entry,
return -1;
 }
 
-static inline void __delete_from_swap_cache(struct page *page)
+static inline void __delete_from_swap_cache(struct page *page,
+   swp_entry_t entry)
 {
 }
 
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 329736723d36..21218f6e438b 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -154,23 +154,22 @@ int add_to_swap_cache(struct page *page, swp_entry_t 
entry, gfp_t gfp)
  * This must be called only on pages that have
  * been verified to be in the swap cache.
  */
-void __delete_from_swap_cache(struct page *page)
+void __delete_from_swap_cache(struct page *page, swp_entry_t entry)
 {
-   struct address_space *address_space;
+   struct address_space *address_space = swap_address_space(entry);
int i, nr = hpage_nr_pages(page);
-   swp_entry_t entry;
-   pgoff_t idx;
+   pgoff_t idx = swp_offset(entry);
+   XA_STATE(xas, _space->i_pages, idx);
 
VM_BUG_ON_PAGE(!PageLocked(page), page);
VM_BUG_ON_PAGE(!PageSwapCache(page), page);
VM_BUG_ON_PAGE(PageWriteback(page), page);
 
-   entry.val = page_private(page);
-   address_space = swap_address_space(entry);
-   idx = swp_offset(entry);
for (i = 0; i < nr; i++) {
-   radix_tree_delete(_space->i_pages, idx + i);
+   void *entry = xas_store(, NULL);
+   VM_BUG_ON_PAGE(entry != page + i, entry);
set_page_private(page + i, 0);
+   xas_next();
}
ClearPageSwapCache(page);
address_space->nrpages -= nr;
@@ -246,14 +245,11 @@ int add_to_swap(struct page *page)
  */
 void delete_from_swap_cache(struct page *page)
 {
-   swp_entry_t entry;
-   struct address_space *address_space;
-
-   entry.val = page_private(page);
+   swp_entry_t entry = { .val = page_private(page) };
+   struct address_space *address_space = swap_address_space(entry);
 
-   address_space = swap_address_space(entry);
xa_lock_irq(_space->i_pages);
-   __delete_from_swap_cache(page);
+   __delete_from_swap_cache(page, entry);
xa_unlock_irq(_space->i_pages);
 
put_swap_page(page, entry);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 22bd3720b318..728d7e57cf11 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -689,7 +689,7 @@ static int __remove_mapping(struct address_space *mapping, 
struct page *page,
if (PageSwapCache(page)) {
swp_entry_t swap = { .val = page_private(page) };
mem_cgroup_swapout(page, swap);
-   __delete_from_swap_cache(page);
+   __delete_from_swap_cache(page, swap);
xa_unlock_irqrestore(>i_pages, flags);
put_swap_page(page, swap);
} else {
-- 
2.16.1

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH v8 37/63] mm: Convert add_to_swap_cache to XArray

2018-03-06 Thread Matthew Wilcox

From: Matthew Wilcox 

Combine __add_to_swap_cache and add_to_swap_cache into one function
since there is no more need to preload.

Signed-off-by: Matthew Wilcox 
---
 mm/swap_state.c | 93 ++---
 1 file changed, 29 insertions(+), 64 deletions(-)

diff --git a/mm/swap_state.c b/mm/swap_state.c
index b2ba3b1d4727..329736723d36 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -107,14 +107,15 @@ void show_swap_cache_info(void)
 }
 
 /*
- * __add_to_swap_cache resembles add_to_page_cache_locked on swapper_space,
+ * add_to_swap_cache resembles add_to_page_cache_locked on swapper_space,
  * but sets SwapCache flag and private instead of mapping and index.
  */
-int __add_to_swap_cache(struct page *page, swp_entry_t entry)
+int add_to_swap_cache(struct page *page, swp_entry_t entry, gfp_t gfp)
 {
-   int error, i, nr = hpage_nr_pages(page);
-   struct address_space *address_space;
+   struct address_space *address_space = swap_address_space(entry);
pgoff_t idx = swp_offset(entry);
+   XA_STATE(xas, _space->i_pages, idx);
+   unsigned long i, nr = 1UL << compound_order(page);
 
VM_BUG_ON_PAGE(!PageLocked(page), page);
VM_BUG_ON_PAGE(PageSwapCache(page), page);
@@ -123,50 +124,30 @@ int __add_to_swap_cache(struct page *page, swp_entry_t 
entry)
page_ref_add(page, nr);
SetPageSwapCache(page);
 
-   address_space = swap_address_space(entry);
-   xa_lock_irq(_space->i_pages);
-   for (i = 0; i < nr; i++) {
-   set_page_private(page + i, entry.val + i);
-   error = radix_tree_insert(_space->i_pages,
- idx + i, page + i);
-   if (unlikely(error))
-   break;
-   }
-   if (likely(!error)) {
+   do {
+   xas_lock_irq();
+   xas_create_range(, idx + nr - 1);
+   if (xas_error())
+   goto unlock;
+   for (i = 0; i < nr; i++) {
+   VM_BUG_ON_PAGE(xas.xa_index != idx + i, page);
+   set_page_private(page + i, entry.val + i);
+   xas_store(, page + i);
+   xas_next();
+   }
address_space->nrpages += nr;
__mod_node_page_state(page_pgdat(page), NR_FILE_PAGES, nr);
ADD_CACHE_INFO(add_total, nr);
-   } else {
-   /*
-* Only the context which have set SWAP_HAS_CACHE flag
-* would call add_to_swap_cache().
-* So add_to_swap_cache() doesn't returns -EEXIST.
-*/
-   VM_BUG_ON(error == -EEXIST);
-   set_page_private(page + i, 0UL);
-   while (i--) {
-   radix_tree_delete(_space->i_pages, idx + i);
-   set_page_private(page + i, 0UL);
-   }
-   ClearPageSwapCache(page);
-   page_ref_sub(page, nr);
-   }
-   xa_unlock_irq(_space->i_pages);
+unlock:
+   xas_unlock_irq();
+   } while (xas_nomem(, gfp));
 
-   return error;
-}
-
-
-int add_to_swap_cache(struct page *page, swp_entry_t entry, gfp_t gfp_mask)
-{
-   int error;
+   if (!xas_error())
+   return 0;
 
-   error = radix_tree_maybe_preload_order(gfp_mask, compound_order(page));
-   if (!error) {
-   error = __add_to_swap_cache(page, entry);
-   radix_tree_preload_end();
-   }
-   return error;
+   ClearPageSwapCache(page);
+   page_ref_sub(page, nr);
+   return xas_error();
 }
 
 /*
@@ -220,7 +201,7 @@ int add_to_swap(struct page *page)
goto fail;
 
/*
-* Radix-tree node allocations from PF_MEMALLOC contexts could
+* XArray node allocations from PF_MEMALLOC contexts could
 * completely exhaust the page allocator. __GFP_NOMEMALLOC
 * stops emergency reserves from being allocated.
 *
@@ -232,7 +213,6 @@ int add_to_swap(struct page *page)
 */
err = add_to_swap_cache(page, entry,
__GFP_HIGH|__GFP_NOMEMALLOC|__GFP_NOWARN);
-   /* -ENOMEM radix-tree allocation failure */
if (err)
/*
 * add_to_swap_cache() doesn't return -EEXIST, so we can safely
@@ -421,19 +401,11 @@ struct page *__read_swap_cache_async(swp_entry_t entry, 
gfp_t gfp_mask,
break;  /* Out of memory */
}
 
-   /*
-* call radix_tree_preload() while we can wait.
-*/
-   err = radix_tree_maybe_preload(gfp_mask & GFP_KERNEL);
-   if (err)
-   break;
-
/*
 * Swap entry may have been freed since our caller observed it.
 */

[PATCH v8 34/63] mm: Convert page-writeback to XArray

2018-03-06 Thread Matthew Wilcox

From: Matthew Wilcox 

Includes moving mapping_tagged() to fs.h as a static inline, and
changing it to return bool.

Signed-off-by: Matthew Wilcox 
---
 include/linux/fs.h  | 17 +--
 mm/page-writeback.c | 63 +++--
 2 files changed, 32 insertions(+), 48 deletions(-)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index 785100c2b835..4bd801b5adc8 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -469,15 +469,18 @@ struct block_device {
struct mutexbd_fsfreeze_mutex;
 } __randomize_layout;
 
+/* XArray tags, for tagging dirty and writeback pages in the pagecache. */
+#define PAGECACHE_TAG_DIRTYXA_TAG_0
+#define PAGECACHE_TAG_WRITEBACKXA_TAG_1
+#define PAGECACHE_TAG_TOWRITE  XA_TAG_2
+
 /*
- * Radix-tree tags, for tagging dirty and writeback pages within the pagecache
- * radix trees
+ * Returns true if any of the pages in the mapping are marked with the tag.
  */
-#define PAGECACHE_TAG_DIRTY0
-#define PAGECACHE_TAG_WRITEBACK1
-#define PAGECACHE_TAG_TOWRITE  2
-
-int mapping_tagged(struct address_space *mapping, int tag);
+static inline bool mapping_tagged(struct address_space *mapping, xa_tag_t tag)
+{
+   return xa_tagged(>i_pages, tag);
+}
 
 static inline void i_mmap_lock_write(struct address_space *mapping)
 {
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 5c1a3279e63f..195ccd0b30c8 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -2098,34 +2098,25 @@ void __init page_writeback_init(void)
  * dirty pages in the file (thus it is important for this function to be quick
  * so that it can tag pages faster than a dirtying process can create them).
  */
-/*
- * We tag pages in batches of WRITEBACK_TAG_BATCH to reduce the i_pages lock
- * latency.
- */
 void tag_pages_for_writeback(struct address_space *mapping,
 pgoff_t start, pgoff_t end)
 {
-#define WRITEBACK_TAG_BATCH 4096
-   unsigned long tagged = 0;
-   struct radix_tree_iter iter;
-   void **slot;
+   XA_STATE(xas, >i_pages, start);
+   unsigned int tagged = 0;
+   void *page;
 
-   xa_lock_irq(>i_pages);
-   radix_tree_for_each_tagged(slot, >i_pages, , start,
-   PAGECACHE_TAG_DIRTY) {
-   if (iter.index > end)
-   break;
-   radix_tree_iter_tag_set(>i_pages, ,
-   PAGECACHE_TAG_TOWRITE);
-   tagged++;
-   if ((tagged % WRITEBACK_TAG_BATCH) != 0)
+   xas_lock_irq();
+   xas_for_each_tag(, page, end, PAGECACHE_TAG_DIRTY) {
+   xas_set_tag(, PAGECACHE_TAG_TOWRITE);
+   if (++tagged % XA_CHECK_SCHED)
continue;
-   slot = radix_tree_iter_resume(slot, );
-   xa_unlock_irq(>i_pages);
+
+   xas_pause();
+   xas_unlock_irq();
cond_resched();
-   xa_lock_irq(>i_pages);
+   xas_lock_irq();
}
-   xa_unlock_irq(>i_pages);
+   xas_unlock_irq();
 }
 EXPORT_SYMBOL(tag_pages_for_writeback);
 
@@ -2165,7 +2156,7 @@ int write_cache_pages(struct address_space *mapping,
pgoff_t done_index;
int cycled;
int range_whole = 0;
-   int tag;
+   xa_tag_t tag;
 
pagevec_init();
if (wbc->range_cyclic) {
@@ -2446,7 +2437,7 @@ void account_page_cleaned(struct page *page, struct 
address_space *mapping,
 
 /*
  * For address_spaces which do not use buffers.  Just tag the page as dirty in
- * its radix tree.
+ * the xarray.
  *
  * This is also used when a single buffer is being dirtied: we want to set the
  * page dirty in that case, but not all the buffers.  This is a "bottom-up"
@@ -2472,7 +2463,7 @@ int __set_page_dirty_nobuffers(struct page *page)
BUG_ON(page_mapping(page) != mapping);
WARN_ON_ONCE(!PagePrivate(page) && !PageUptodate(page));
account_page_dirtied(page, mapping);
-   radix_tree_tag_set(>i_pages, page_index(page),
+   __xa_set_tag(>i_pages, page_index(page),
   PAGECACHE_TAG_DIRTY);
xa_unlock_irqrestore(>i_pages, flags);
unlock_page_memcg(page);
@@ -2635,13 +2626,13 @@ EXPORT_SYMBOL(__cancel_dirty_page);
  * Returns true if the page was previously dirty.
  *
  * This is for preparing to put the page under writeout.  We leave the page
- * tagged as dirty in the radix tree so that a concurrent write-for-sync
+ * tagged as dirty in the xarray so that a concurrent write-for-sync
  * can discover it via a PAGECACHE_TAG_DIRTY walk.  The ->writepage
  * implementation will run either set_page_writeback() or set_page_dirty(),
- * at which stage we bring the page's dirty flag and radix-tree dirty tag
+ * at which

[PATCH v8 36/63] mm: Convert truncate to XArray

2018-03-06 Thread Matthew Wilcox

From: Matthew Wilcox 

This is essentially xa_cmpxchg() with the locking handled above us,
and it doesn't have to handle replacing a NULL entry.

Signed-off-by: Matthew Wilcox 
---
 mm/truncate.c | 15 ++-
 1 file changed, 6 insertions(+), 9 deletions(-)

diff --git a/mm/truncate.c b/mm/truncate.c
index ed778555c9f3..45d68e90b703 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -33,15 +33,12 @@
 static inline void __clear_shadow_entry(struct address_space *mapping,
pgoff_t index, void *entry)
 {
-   struct radix_tree_node *node;
-   void **slot;
+   XA_STATE(xas, >i_pages, index);
 
-   if (!__radix_tree_lookup(>i_pages, index, , ))
+   xas_set_update(, workingset_update_node);
+   if (xas_load() != entry)
return;
-   if (*slot != entry)
-   return;
-   __radix_tree_replace(>i_pages, node, slot, NULL,
-workingset_update_node);
+   xas_store(, NULL);
mapping->nrexceptional--;
 }
 
@@ -738,10 +735,10 @@ int invalidate_inode_pages2_range(struct address_space 
*mapping,
index++;
}
/*
-* For DAX we invalidate page tables after invalidating radix tree.  We
+* For DAX we invalidate page tables after invalidating page cache.  We
 * could invalidate page tables while invalidating each entry however
 * that would be expensive. And doing range unmapping before doesn't
-* work as we have no cheap way to find whether radix tree entry didn't
+* work as we have no cheap way to find whether page cache entry didn't
 * get remapped later.
 */
if (dax_mapping(mapping)) {
-- 
2.16.1

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH v8 35/63] mm: Convert workingset to XArray

2018-03-06 Thread Matthew Wilcox

From: Matthew Wilcox 

We construct a fake XA_STATE and use it to delete the node with xa_store()
rather than adding a special function for this unique use case.

Signed-off-by: Matthew Wilcox 
---
 include/linux/swap.h |  9 -
 mm/workingset.c  | 51 ++-
 2 files changed, 22 insertions(+), 38 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index a0ebb5deea2d..dab96af23d96 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -307,15 +307,6 @@ void workingset_update_node(struct xa_node *node);
xas_set_update(xas, workingset_update_node);\
 } while (0)
 
-/* Returns workingset_update_node() if the mapping has shadow entries. */
-#define workingset_lookup_update(mapping)  \
-({ \
-   radix_tree_update_node_t __helper = workingset_update_node; \
-   if (dax_mapping(mapping) || shmem_mapping(mapping)) \
-   __helper = NULL;\
-   __helper;   \
-})
-
 /* linux/mm/page_alloc.c */
 extern unsigned long totalram_pages;
 extern unsigned long totalreserve_pages;
diff --git a/mm/workingset.c b/mm/workingset.c
index bad4e58881cd..564e97bd5934 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -148,7 +148,7 @@
  * and activations is maintained (node->inactive_age).
  *
  * On eviction, a snapshot of this counter (along with some bits to
- * identify the node) is stored in the now empty page cache radix tree
+ * identify the node) is stored in the now empty page cache
  * slot of the evicted page.  This is called a shadow entry.
  *
  * On cache misses for which there are shadow entries, an eligible
@@ -162,7 +162,7 @@
 
 /*
  * Eviction timestamps need to be able to cover the full range of
- * actionable refaults. However, bits are tight in the radix tree
+ * actionable refaults. However, bits are tight in the xarray
  * entry, and after storing the identifier for the lruvec there might
  * not be enough left to represent every single actionable refault. In
  * that case, we have to sacrifice granularity for distance, and group
@@ -338,7 +338,7 @@ void workingset_activation(struct page *page)
 
 static struct list_lru shadow_nodes;
 
-void workingset_update_node(struct radix_tree_node *node)
+void workingset_update_node(struct xa_node *node)
 {
/*
 * Track non-empty nodes that contain only shadow entries;
@@ -370,7 +370,7 @@ static unsigned long count_shadow_nodes(struct shrinker 
*shrinker,
local_irq_enable();
 
/*
-* Approximate a reasonable limit for the radix tree nodes
+* Approximate a reasonable limit for the nodes
 * containing shadow entries. We don't need to keep more
 * shadow entries than possible pages on the active list,
 * since refault distances bigger than that are dismissed.
@@ -385,11 +385,11 @@ static unsigned long count_shadow_nodes(struct shrinker 
*shrinker,
 * worst-case density of 1/8th. Below that, not all eligible
 * refaults can be detected anymore.
 *
-* On 64-bit with 7 radix_tree_nodes per page and 64 slots
+* On 64-bit with 7 xa_nodes per page and 64 slots
 * each, this will reclaim shadow entries when they consume
 * ~1.8% of available memory:
 *
-* PAGE_SIZE / radix_tree_nodes / node_entries * 8 / PAGE_SIZE
+* PAGE_SIZE / xa_nodes / node_entries * 8 / PAGE_SIZE
 */
if (sc->memcg) {
cache = mem_cgroup_node_nr_lru_pages(sc->memcg, sc->nid,
@@ -398,7 +398,7 @@ static unsigned long count_shadow_nodes(struct shrinker 
*shrinker,
cache = node_page_state(NODE_DATA(sc->nid), NR_ACTIVE_FILE) +
node_page_state(NODE_DATA(sc->nid), NR_INACTIVE_FILE);
}
-   max_nodes = cache >> (RADIX_TREE_MAP_SHIFT - 3);
+   max_nodes = cache >> (XA_CHUNK_SHIFT - 3);
 
if (nodes <= max_nodes)
return 0;
@@ -408,11 +408,11 @@ static unsigned long count_shadow_nodes(struct shrinker 
*shrinker,
 static enum lru_status shadow_lru_isolate(struct list_head *item,
  struct list_lru_one *lru,
  spinlock_t *lru_lock,
- void *arg)
+ void *arg) __must_hold(lru_lock)
 {
+   XA_STATE(xas, NULL, 0);
struct address_space *mapping;
-   struct radix_tree_node *node;
-   unsigned int i;
+   struct xa_node *node;
int ret;
 
/*
@@ -420,7 +420,7 @@ static enum lru_status shadow_lru_isolate(struct list_head 
*item,
 * the shadow node LRU under the i_pages lock and the
 *

[PATCH v8 40/63] mm: Convert page migration to XArray

2018-03-06 Thread Matthew Wilcox

From: Matthew Wilcox 

Signed-off-by: Matthew Wilcox 
---
 mm/migrate.c | 41 -
 1 file changed, 16 insertions(+), 25 deletions(-)

diff --git a/mm/migrate.c b/mm/migrate.c
index 740b71857898..9a15d27768a0 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -322,7 +322,7 @@ void __migration_entry_wait(struct mm_struct *mm, pte_t 
*ptep,
page = migration_entry_to_page(entry);
 
/*
-* Once radix-tree replacement of page migration started, page_count
+* Once page cache replacement of page migration started, page_count
 * *must* be zero. And, we don't want to call wait_on_page_locked()
 * against a page without get_page().
 * So, we use get_page_unless_zero(), here. Even failed, page fault
@@ -437,10 +437,10 @@ int migrate_page_move_mapping(struct address_space 
*mapping,
struct buffer_head *head, enum migrate_mode mode,
int extra_count)
 {
+   XA_STATE(xas, >i_pages, page_index(page));
struct zone *oldzone, *newzone;
int dirty;
int expected_count = 1 + extra_count;
-   void **pslot;
 
/*
 * Device public or private pages have an extra refcount as they are
@@ -466,21 +466,16 @@ int migrate_page_move_mapping(struct address_space 
*mapping,
oldzone = page_zone(page);
newzone = page_zone(newpage);
 
-   xa_lock_irq(>i_pages);
-
-   pslot = radix_tree_lookup_slot(>i_pages,
-   page_index(page));
+   xas_lock_irq();
 
expected_count += 1 + page_has_private(page);
-   if (page_count(page) != expected_count ||
-   radix_tree_deref_slot_protected(pslot,
-   >i_pages.xa_lock) != page) {
-   xa_unlock_irq(>i_pages);
+   if (page_count(page) != expected_count || xas_load() != page) {
+   xas_unlock_irq();
return -EAGAIN;
}
 
if (!page_ref_freeze(page, expected_count)) {
-   xa_unlock_irq(>i_pages);
+   xas_unlock_irq();
return -EAGAIN;
}
 
@@ -494,7 +489,7 @@ int migrate_page_move_mapping(struct address_space *mapping,
if (mode == MIGRATE_ASYNC && head &&
!buffer_migrate_lock_buffers(head, mode)) {
page_ref_unfreeze(page, expected_count);
-   xa_unlock_irq(>i_pages);
+   xas_unlock_irq();
return -EAGAIN;
}
 
@@ -522,7 +517,7 @@ int migrate_page_move_mapping(struct address_space *mapping,
SetPageDirty(newpage);
}
 
-   radix_tree_replace_slot(>i_pages, pslot, newpage);
+   xas_store(, newpage);
 
/*
 * Drop cache reference from old page by unfreezing
@@ -531,7 +526,7 @@ int migrate_page_move_mapping(struct address_space *mapping,
 */
page_ref_unfreeze(page, expected_count - 1);
 
-   xa_unlock(>i_pages);
+   xas_unlock();
/* Leave irq disabled to prevent preemption while updating stats */
 
/*
@@ -571,22 +566,18 @@ EXPORT_SYMBOL(migrate_page_move_mapping);
 int migrate_huge_page_move_mapping(struct address_space *mapping,
   struct page *newpage, struct page *page)
 {
+   XA_STATE(xas, >i_pages, page_index(page));
int expected_count;
-   void **pslot;
-
-   xa_lock_irq(>i_pages);
-
-   pslot = radix_tree_lookup_slot(>i_pages, page_index(page));
 
+   xas_lock_irq();
expected_count = 2 + page_has_private(page);
-   if (page_count(page) != expected_count ||
-   radix_tree_deref_slot_protected(pslot, 
>i_pages.xa_lock) != page) {
-   xa_unlock_irq(>i_pages);
+   if (page_count(page) != expected_count || xas_load() != page) {
+   xas_unlock_irq();
return -EAGAIN;
}
 
if (!page_ref_freeze(page, expected_count)) {
-   xa_unlock_irq(>i_pages);
+   xas_unlock_irq();
return -EAGAIN;
}
 
@@ -595,11 +586,11 @@ int migrate_huge_page_move_mapping(struct address_space 
*mapping,
 
get_page(newpage);
 
-   radix_tree_replace_slot(>i_pages, pslot, newpage);
+   xas_store(, newpage);
 
page_ref_unfreeze(page, expected_count - 1);
 
-   xa_unlock_irq(>i_pages);
+   xas_unlock_irq();
 
return MIGRATEPAGE_SUCCESS;
 }
-- 
2.16.1

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH v8 13/63] xarray: Add definition of struct xarray

2018-03-06 Thread Matthew Wilcox

From: Matthew Wilcox 

This is a direct replacement for struct radix_tree_root.  Some of the
struct members have changed name; convert those, and use a #define so
that radix_tree users continue to work without change.

Signed-off-by: Matthew Wilcox 
---
 include/linux/radix-tree.h   | 33 --
 include/linux/xarray.h   | 61 ++
 lib/Makefile |  2 +-
 lib/idr.c|  4 +-
 lib/radix-tree.c | 75 
 lib/xarray.c | 44 +++
 tools/include/linux/spinlock.h   |  3 +-
 tools/testing/radix-tree/.gitignore  |  1 +
 tools/testing/radix-tree/Makefile|  8 +++-
 tools/testing/radix-tree/linux/bug.h |  1 +
 tools/testing/radix-tree/linux/kconfig.h |  1 +
 tools/testing/radix-tree/linux/xarray.h  |  2 +
 tools/testing/radix-tree/multiorder.c|  6 +--
 tools/testing/radix-tree/test.c  |  6 +--
 14 files changed, 173 insertions(+), 74 deletions(-)
 create mode 100644 lib/xarray.c
 create mode 100644 tools/testing/radix-tree/linux/kconfig.h
 create mode 100644 tools/testing/radix-tree/linux/xarray.h

diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h
index 87f35fe00e55..c8a33e9e9a3c 100644
--- a/include/linux/radix-tree.h
+++ b/include/linux/radix-tree.h
@@ -30,6 +30,9 @@
 #include 
 #include 
 
+/* Keep unconverted code working */
+#define radix_tree_rootxarray
+
 /*
  * The bottom two bits of the slot determine how the remaining bits in the
  * slot are interpreted:
@@ -59,10 +62,7 @@ static inline bool radix_tree_is_internal_node(void *ptr)
 
 #define RADIX_TREE_MAX_TAGS 3
 
-#ifndef RADIX_TREE_MAP_SHIFT
-#define RADIX_TREE_MAP_SHIFT   (CONFIG_BASE_SMALL ? 4 : 6)
-#endif
-
+#define RADIX_TREE_MAP_SHIFT   XA_CHUNK_SHIFT
 #define RADIX_TREE_MAP_SIZE(1UL << RADIX_TREE_MAP_SHIFT)
 #define RADIX_TREE_MAP_MASK(RADIX_TREE_MAP_SIZE-1)
 
@@ -95,36 +95,21 @@ struct radix_tree_node {
unsigned long   tags[RADIX_TREE_MAX_TAGS][RADIX_TREE_TAG_LONGS];
 };
 
-/* The IDR tag is stored in the low bits of the GFP flags */
+/* The IDR tag is stored in the low bits of xa_flags */
 #define ROOT_IS_IDR((__force gfp_t)4)
-/* The top bits of gfp_mask are used to store the root tags */
+/* The top bits of xa_flags are used to store the root tags */
 #define ROOT_TAG_SHIFT (__GFP_BITS_SHIFT)
 
-struct radix_tree_root {
-   spinlock_t  xa_lock;
-   gfp_t   gfp_mask;
-   struct radix_tree_node  __rcu *rnode;
-};
-
-#define RADIX_TREE_INIT(name, mask){   \
-   .xa_lock = __SPIN_LOCK_UNLOCKED(name.xa_lock),  \
-   .gfp_mask = (mask), \
-   .rnode = NULL,  \
-}
+#define RADIX_TREE_INIT(name, mask)XARRAY_INIT_FLAGS(name, mask)
 
 #define RADIX_TREE(name, mask) \
struct radix_tree_root name = RADIX_TREE_INIT(name, mask)
 
-#define INIT_RADIX_TREE(root, mask)\
-do {   \
-   spin_lock_init(&(root)->xa_lock);   \
-   (root)->gfp_mask = (mask);  \
-   (root)->rnode = NULL;   \
-} while (0)
+#define INIT_RADIX_TREE(root, mask) xa_init_flags(root, mask)
 
 static inline bool radix_tree_empty(const struct radix_tree_root *root)
 {
-   return root->rnode == NULL;
+   return root->xa_head == NULL;
 }
 
 /**
diff --git a/include/linux/xarray.h b/include/linux/xarray.h
index 283beb5aac58..9b05b907062b 100644
--- a/include/linux/xarray.h
+++ b/include/linux/xarray.h
@@ -10,6 +10,8 @@
  */
 
 #include 
+#include 
+#include 
 #include 
 #include 
 
@@ -105,6 +107,65 @@ static inline bool xa_is_internal(const void *entry)
return ((unsigned long)entry & 3) == 2;
 }
 
+/**
+ * struct xarray - The anchor of the XArray.
+ * @xa_lock: Lock that protects the contents of the XArray.
+ *
+ * To use the xarray, define it statically or embed it in your data structure.
+ * It is a very small data structure, so it does not usually make sense to
+ * allocate it separately and keep a pointer to it in your data structure.
+ *
+ * You may use the xa_lock to protect your own data structures as well.
+ */
+/*
+ * If all of the entries in the array are NULL, @xa_head is a NULL pointer.
+ * If the only non-NULL entry in the array is at index 0, @xa_head is that
+ * entry.  If any other entry in the array is non-NULL, @xa_head points
+ * to an @xa_node.
+ */
+struct xarray {
+   spinlock_t  xa_lock;
+/* private: The rest of the data structure is not to be used directly. */
+   gfp_t   xa_flags;
+

[PATCH v8 39/63] mm: Convert __do_page_cache_readahead to XArray

2018-03-06 Thread Matthew Wilcox

From: Matthew Wilcox 

This one is trivial.

Signed-off-by: Matthew Wilcox 
---
 mm/readahead.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/mm/readahead.c b/mm/readahead.c
index 3ff9763b0461..5f528d649d5e 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -174,9 +174,7 @@ int __do_page_cache_readahead(struct address_space 
*mapping, struct file *filp,
if (page_offset > end_index)
break;
 
-   rcu_read_lock();
-   page = radix_tree_lookup(>i_pages, page_offset);
-   rcu_read_unlock();
+   page = xa_load(>i_pages, page_offset);
if (page && !xa_is_value(page))
continue;
 
-- 
2.16.1

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH v8 41/63] mm: Convert huge_memory to XArray

2018-03-06 Thread Matthew Wilcox

From: Matthew Wilcox 

Quite a straightforward conversion.

Signed-off-by: Matthew Wilcox 
---
 mm/huge_memory.c | 17 +++--
 1 file changed, 7 insertions(+), 10 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 89737c0e0d34..354b7f768d0f 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2442,13 +2442,13 @@ static void __split_huge_page(struct page *page, struct 
list_head *list,
ClearPageCompound(head);
/* See comment in __split_huge_page_tail() */
if (PageAnon(head)) {
-   /* Additional pin to radix tree of swap cache */
+   /* Additional pin to swap cache */
if (PageSwapCache(head))
page_ref_add(head, 2);
else
page_ref_inc(head);
} else {
-   /* Additional pin to radix tree */
+   /* Additional pin to page cache */
page_ref_add(head, 2);
xa_unlock(>mapping->i_pages);
}
@@ -2560,7 +2560,7 @@ bool can_split_huge_page(struct page *page, int 
*pextra_pins)
 {
int extra_pins;
 
-   /* Additional pins from radix tree */
+   /* Additional pins from page cache */
if (PageAnon(page))
extra_pins = PageSwapCache(page) ? HPAGE_PMD_NR : 0;
else
@@ -2656,17 +2656,14 @@ int split_huge_page_to_list(struct page *page, struct 
list_head *list)
spin_lock_irqsave(zone_lru_lock(page_zone(head)), flags);
 
if (mapping) {
-   void **pslot;
+   XA_STATE(xas, >i_pages, page_index(head));
 
-   xa_lock(>i_pages);
-   pslot = radix_tree_lookup_slot(>i_pages,
-   page_index(head));
/*
-* Check if the head page is present in radix tree.
+* Check if the head page is present in page cache.
 * We assume all tail are present too, if head is there.
 */
-   if (radix_tree_deref_slot_protected(pslot,
-   >i_pages.xa_lock) != head)
+   xa_lock(>i_pages);
+   if (xas_load() != head)
goto fail;
}
 
-- 
2.16.1

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH v8 42/63] mm: Convert collapse_shmem to XArray

2018-03-06 Thread Matthew Wilcox

From: Matthew Wilcox 

I found another victim of the radix tree being hard to use.  Because
there was no call to radix_tree_preload(), khugepaged was allocating
radix_tree_nodes using GFP_ATOMIC.

I also converted a local_irq_save()/restore() pair to
disable()/enable().

Signed-off-by: Matthew Wilcox 
---
 mm/khugepaged.c | 158 +++-
 1 file changed, 65 insertions(+), 93 deletions(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 69545692155f..3685c8e2b3dc 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1282,17 +1282,17 @@ static void retract_page_tables(struct address_space 
*mapping, pgoff_t pgoff)
  *
  * Basic scheme is simple, details are more complex:
  *  - allocate and freeze a new huge page;
- *  - scan over radix tree replacing old pages the new one
+ *  - scan page cache replacing old pages with the new one
  *+ swap in pages if necessary;
  *+ fill in gaps;
- *+ keep old pages around in case if rollback is required;
- *  - if replacing succeed:
+ *+ keep old pages around in case rollback is required;
+ *  - if replacing succeeds:
  *+ copy data over;
  *+ free old pages;
  *+ unfreeze huge page;
  *  - if replacing failed;
  *+ put all pages back and unfreeze them;
- *+ restore gaps in the radix-tree;
+ *+ restore gaps in the page cache;
  *+ free huge page;
  */
 static void collapse_shmem(struct mm_struct *mm,
@@ -1300,12 +1300,11 @@ static void collapse_shmem(struct mm_struct *mm,
struct page **hpage, int node)
 {
gfp_t gfp;
-   struct page *page, *new_page, *tmp;
+   struct page *new_page;
struct mem_cgroup *memcg;
pgoff_t index, end = start + HPAGE_PMD_NR;
LIST_HEAD(pagelist);
-   struct radix_tree_iter iter;
-   void **slot;
+   XA_STATE(xas, >i_pages, start);
int nr_none = 0, result = SCAN_SUCCEED;
 
VM_BUG_ON(start & (HPAGE_PMD_NR - 1));
@@ -1330,48 +1329,48 @@ static void collapse_shmem(struct mm_struct *mm,
__SetPageLocked(new_page);
BUG_ON(!page_ref_freeze(new_page, 1));
 
-
/*
-* At this point the new_page is 'frozen' (page_count() is zero), locked
-* and not up-to-date. It's safe to insert it into radix tree, because
-* nobody would be able to map it or use it in other way until we
-* unfreeze it.
+* At this point the new_page is 'frozen' (page_count() is zero),
+* locked and not up-to-date. It's safe to insert it into the page
+* cache, because nobody would be able to map it or use it in other
+* way until we unfreeze it.
 */
 
-   index = start;
-   xa_lock_irq(>i_pages);
-   radix_tree_for_each_slot(slot, >i_pages, , start) {
-   int n = min(iter.index, end) - index;
-
-   /*
-* Handle holes in the radix tree: charge it from shmem and
-* insert relevant subpage of new_page into the radix-tree.
-*/
-   if (n && !shmem_charge(mapping->host, n)) {
-   result = SCAN_FAIL;
+   /* This will be less messy when we use multi-index entries */
+   do {
+   xas_lock_irq();
+   xas_create_range(, end - 1);
+   if (!xas_error())
break;
-   }
-   nr_none += n;
-   for (; index < min(iter.index, end); index++) {
-   radix_tree_insert(>i_pages, index,
-   new_page + (index % HPAGE_PMD_NR));
-   }
+   xas_unlock_irq();
+   if (!xas_nomem(, GFP_KERNEL))
+   goto out;
+   } while (1);
 
-   /* We are done. */
-   if (index >= end)
-   break;
+   for (index = start; index < end; index++) {
+   struct page *page = xas_next();
+
+   VM_BUG_ON(index != xas.xa_index);
+   if (!page) {
+   if (!shmem_charge(mapping->host, 1)) {
+   result = SCAN_FAIL;
+   break;
+   }
+   xas_store(, new_page + (index % HPAGE_PMD_NR));
+   nr_none++;
+   continue;
+   }
 
-   page = radix_tree_deref_slot_protected(slot,
-   >i_pages.xa_lock);
if (xa_is_value(page) || !PageUptodate(page)) {
-   xa_unlock_irq(>i_pages);
+   xas_unlock_irq();
/* swap in or instantiate fallocated page */
if (shmem_getpage(mapping->host, index, ,
SGP_NOHUGE)) {
result = SCAN_FAIL;
-

[PATCH v8 43/63] mm: Convert khugepaged_scan_shmem to XArray

2018-03-06 Thread Matthew Wilcox

From: Matthew Wilcox 

Slightly shorter and easier to read code.

Signed-off-by: Matthew Wilcox 
---
 mm/khugepaged.c | 17 +
 1 file changed, 5 insertions(+), 12 deletions(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 3685c8e2b3dc..39e260a0639c 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1533,8 +1533,7 @@ static void khugepaged_scan_shmem(struct mm_struct *mm,
pgoff_t start, struct page **hpage)
 {
struct page *page = NULL;
-   struct radix_tree_iter iter;
-   void **slot;
+   XA_STATE(xas, >i_pages, start);
int present, swap;
int node = NUMA_NO_NODE;
int result = SCAN_SUCCEED;
@@ -1543,17 +1542,11 @@ static void khugepaged_scan_shmem(struct mm_struct *mm,
swap = 0;
memset(khugepaged_node_load, 0, sizeof(khugepaged_node_load));
rcu_read_lock();
-   radix_tree_for_each_slot(slot, >i_pages, , start) {
-   if (iter.index >= start + HPAGE_PMD_NR)
-   break;
-
-   page = radix_tree_deref_slot(slot);
-   if (radix_tree_deref_retry(page)) {
-   slot = radix_tree_iter_retry();
+   xas_for_each(, page, start + HPAGE_PMD_NR - 1) {
+   if (xas_retry(, page))
continue;
-   }
 
-   if (radix_tree_exception(page)) {
+   if (xa_is_value(page)) {
if (++swap > khugepaged_max_ptes_swap) {
result = SCAN_EXCEED_SWAP_PTE;
break;
@@ -1592,7 +1585,7 @@ static void khugepaged_scan_shmem(struct mm_struct *mm,
present++;
 
if (need_resched()) {
-   slot = radix_tree_iter_resume(slot, );
+   xas_pause();
cond_resched_rcu();
}
}
-- 
2.16.1

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH v8 44/63] pagevec: Use xa_tag_t

2018-03-06 Thread Matthew Wilcox

From: Matthew Wilcox 

Removes sparse warnings.

Signed-off-by: Matthew Wilcox 
---
 fs/btrfs/extent_io.c| 4 ++--
 fs/ext4/inode.c | 2 +-
 fs/f2fs/data.c  | 2 +-
 fs/gfs2/aops.c  | 2 +-
 include/linux/pagevec.h | 8 +---
 mm/swap.c   | 4 ++--
 6 files changed, 12 insertions(+), 10 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 4e22edd04457..bcc24ee5a2c9 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -3789,7 +3789,7 @@ int btree_write_cache_pages(struct address_space *mapping,
pgoff_t index;
pgoff_t end;/* Inclusive */
int scanned = 0;
-   int tag;
+   xa_tag_t tag;
 
pagevec_init();
if (wbc->range_cyclic) {
@@ -3914,7 +3914,7 @@ static int extent_write_cache_pages(struct address_space 
*mapping,
pgoff_t done_index;
int range_whole = 0;
int scanned = 0;
-   int tag;
+   xa_tag_t tag;
 
/*
 * We have to hold onto the inode so that ordered extents can do their
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 2a47c2f715bb..71cb9d7fd9c2 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -2615,7 +2615,7 @@ static int mpage_prepare_extent_to_map(struct 
mpage_da_data *mpd)
long left = mpd->wbc->nr_to_write;
pgoff_t index = mpd->first_page;
pgoff_t end = mpd->last_page;
-   int tag;
+   xa_tag_t tag;
int i, err = 0;
int blkbits = mpd->inode->i_blkbits;
ext4_lblk_t lblk;
diff --git a/fs/f2fs/data.c b/fs/f2fs/data.c
index 412e9f650dac..dfbccf884d4f 100644
--- a/fs/f2fs/data.c
+++ b/fs/f2fs/data.c
@@ -1848,7 +1848,7 @@ static int f2fs_write_cache_pages(struct address_space 
*mapping,
pgoff_t last_idx = ULONG_MAX;
int cycled;
int range_whole = 0;
-   int tag;
+   xa_tag_t tag;
 
pagevec_init();
 
diff --git a/fs/gfs2/aops.c b/fs/gfs2/aops.c
index f58716567972..8376d1358379 100644
--- a/fs/gfs2/aops.c
+++ b/fs/gfs2/aops.c
@@ -371,7 +371,7 @@ static int gfs2_write_cache_jdata(struct address_space 
*mapping,
pgoff_t done_index;
int cycled;
int range_whole = 0;
-   int tag;
+   xa_tag_t tag;
 
pagevec_init();
if (wbc->range_cyclic) {
diff --git a/include/linux/pagevec.h b/include/linux/pagevec.h
index 6dc456ac6136..955bd6425903 100644
--- a/include/linux/pagevec.h
+++ b/include/linux/pagevec.h
@@ -9,6 +9,8 @@
 #ifndef _LINUX_PAGEVEC_H
 #define _LINUX_PAGEVEC_H
 
+#include 
+
 /* 15 pointers + header align the pagevec structure to a power of two */
 #define PAGEVEC_SIZE   15
 
@@ -40,12 +42,12 @@ static inline unsigned pagevec_lookup(struct pagevec *pvec,
 
 unsigned pagevec_lookup_range_tag(struct pagevec *pvec,
struct address_space *mapping, pgoff_t *index, pgoff_t end,
-   int tag);
+   xa_tag_t tag);
 unsigned pagevec_lookup_range_nr_tag(struct pagevec *pvec,
struct address_space *mapping, pgoff_t *index, pgoff_t end,
-   int tag, unsigned max_pages);
+   xa_tag_t tag, unsigned max_pages);
 static inline unsigned pagevec_lookup_tag(struct pagevec *pvec,
-   struct address_space *mapping, pgoff_t *index, int tag)
+   struct address_space *mapping, pgoff_t *index, xa_tag_t tag)
 {
return pagevec_lookup_range_tag(pvec, mapping, index, (pgoff_t)-1, tag);
 }
diff --git a/mm/swap.c b/mm/swap.c
index 65d3ecbd5958..b9072d77bb82 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -1002,7 +1002,7 @@ EXPORT_SYMBOL(pagevec_lookup_range);
 
 unsigned pagevec_lookup_range_tag(struct pagevec *pvec,
struct address_space *mapping, pgoff_t *index, pgoff_t end,
-   int tag)
+   xa_tag_t tag)
 {
pvec->nr = find_get_pages_range_tag(mapping, index, end, tag,
PAGEVEC_SIZE, pvec->pages);
@@ -1012,7 +1012,7 @@ EXPORT_SYMBOL(pagevec_lookup_range_tag);
 
 unsigned pagevec_lookup_range_nr_tag(struct pagevec *pvec,
struct address_space *mapping, pgoff_t *index, pgoff_t end,
-   int tag, unsigned max_pages)
+   xa_tag_t tag, unsigned max_pages)
 {
pvec->nr = find_get_pages_range_tag(mapping, index, end, tag,
min_t(unsigned int, max_pages, PAGEVEC_SIZE), pvec->pages);
-- 
2.16.1

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH v8 47/63] shmem: Convert find_swap_entry to XArray

2018-03-06 Thread Matthew Wilcox

From: Matthew Wilcox 

This is a 1:1 conversion.

Signed-off-by: Matthew Wilcox 
---
 mm/shmem.c | 23 +++
 1 file changed, 11 insertions(+), 12 deletions(-)

diff --git a/mm/shmem.c b/mm/shmem.c
index 0af8a439dfad..49f42dc9e1dc 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1076,28 +1076,27 @@ static void shmem_evict_inode(struct inode *inode)
clear_inode(inode);
 }
 
-static unsigned long find_swap_entry(struct radix_tree_root *root, void *item)
+static unsigned long find_swap_entry(struct xarray *xa, void *item)
 {
-   struct radix_tree_iter iter;
-   void **slot;
-   unsigned long found = -1;
+   XA_STATE(xas, xa, 0);
unsigned int checked = 0;
+   void *entry;
 
rcu_read_lock();
-   radix_tree_for_each_slot(slot, root, , 0) {
-   if (*slot == item) {
-   found = iter.index;
+   xas_for_each(, entry, ULONG_MAX) {
+   if (xas_retry(, entry))
+   continue;
+   if (entry == item)
break;
-   }
checked++;
-   if ((checked % 4096) != 0)
+   if ((checked % XA_CHECK_SCHED) != 0)
continue;
-   slot = radix_tree_iter_resume(slot, );
+   xas_pause();
cond_resched_rcu();
}
-
rcu_read_unlock();
-   return found;
+
+   return xas_invalid() ? -1 : xas.xa_index;
 }
 
 /*
-- 
2.16.1

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH v8 53/63] memfd: Convert shmem_wait_for_pins to XArray

2018-03-06 Thread Matthew Wilcox

From: Matthew Wilcox 

As with shmem_tag_pins(), hold the lock around the entire loop instead
of acquiring & dropping it for each entry we're going to untag.

Signed-off-by: Matthew Wilcox 
---
 mm/memfd.c | 61 +
 1 file changed, 25 insertions(+), 36 deletions(-)

diff --git a/mm/memfd.c b/mm/memfd.c
index 3b299d72df78..0e0835e63af2 100644
--- a/mm/memfd.c
+++ b/mm/memfd.c
@@ -64,9 +64,7 @@ static void shmem_tag_pins(struct address_space *mapping)
  */
 static int shmem_wait_for_pins(struct address_space *mapping)
 {
-   struct radix_tree_iter iter;
-   void __rcu **slot;
-   pgoff_t start;
+   XA_STATE(xas, >i_pages, 0);
struct page *page;
int error, scan;
 
@@ -74,7 +72,9 @@ static int shmem_wait_for_pins(struct address_space *mapping)
 
error = 0;
for (scan = 0; scan <= LAST_SCAN; scan++) {
-   if (!radix_tree_tagged(>i_pages, SHMEM_TAG_PINNED))
+   unsigned int tagged = 0;
+
+   if (!xas_tagged(, SHMEM_TAG_PINNED))
break;
 
if (!scan)
@@ -82,45 +82,34 @@ static int shmem_wait_for_pins(struct address_space 
*mapping)
else if (schedule_timeout_killable((HZ << scan) / 200))
scan = LAST_SCAN;
 
-   start = 0;
-   rcu_read_lock();
-   radix_tree_for_each_tagged(slot, >i_pages, ,
-  start, SHMEM_TAG_PINNED) {
-
-   page = radix_tree_deref_slot(slot);
-   if (radix_tree_exception(page)) {
-   if (radix_tree_deref_retry(page)) {
-   slot = radix_tree_iter_retry();
-   continue;
-   }
-
-   page = NULL;
-   }
-
-   if (page &&
-   page_count(page) - page_mapcount(page) != 1) {
-   if (scan < LAST_SCAN)
-   goto continue_resched;
-
+   xas_set(, 0);
+   xas_lock_irq();
+   xas_for_each_tag(, page, ULONG_MAX, SHMEM_TAG_PINNED) {
+   bool clear = true;
+   if (xa_is_value(page))
+   continue;
+   if (page_count(page) - page_mapcount(page) != 1) {
/*
 * On the last scan, we clean up all those tags
 * we inserted; but make a note that we still
 * found pages pinned.
 */
-   error = -EBUSY;
-   }
-
-   xa_lock_irq(>i_pages);
-   radix_tree_tag_clear(>i_pages,
-iter.index, SHMEM_TAG_PINNED);
-   xa_unlock_irq(>i_pages);
-continue_resched:
-   if (need_resched()) {
-   slot = radix_tree_iter_resume(slot, );
-   cond_resched_rcu();
+   if (scan == LAST_SCAN)
+   error = -EBUSY;
+   else
+   clear = false;
}
+   if (clear)
+   xas_clear_tag(, SHMEM_TAG_PINNED);
+   if (++tagged % XA_CHECK_SCHED)
+   continue;
+
+   xas_pause();
+   xas_unlock_irq();
+   cond_resched();
+   xas_lock_irq();
}
-   rcu_read_unlock();
+   xas_unlock_irq();
}
 
return error;
-- 
2.16.1

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH v8 52/63] memfd: Convert shmem_tag_pins to XArray

2018-03-06 Thread Matthew Wilcox

From: Matthew Wilcox 

Simplify the locking by taking the spinlock while we walk the tree on
the assumption that many acquires and releases of the lock will be
worse than holding the lock for a (potentially) long time.

We could replicate the same locking behaviour with the xarray, but would
have to be careful that the xa_node wasn't RCU-freed under us before we
took the lock.

Signed-off-by: Matthew Wilcox 
---
 mm/memfd.c | 43 ++-
 1 file changed, 18 insertions(+), 25 deletions(-)

diff --git a/mm/memfd.c b/mm/memfd.c
index 4cf7401cb09c..3b299d72df78 100644
--- a/mm/memfd.c
+++ b/mm/memfd.c
@@ -21,7 +21,7 @@
 #include 
 
 /*
- * We need a tag: a new tag would expand every radix_tree_node by 8 bytes,
+ * We need a tag: a new tag would expand every xa_node by 8 bytes,
  * so reuse a tag which we firmly believe is never set or cleared on shmem.
  */
 #define SHMEM_TAG_PINNEDPAGECACHE_TAG_TOWRITE
@@ -29,35 +29,28 @@
 
 static void shmem_tag_pins(struct address_space *mapping)
 {
-   struct radix_tree_iter iter;
-   void __rcu **slot;
-   pgoff_t start;
+   XA_STATE(xas, >i_pages, 0);
struct page *page;
+   unsigned int tagged = 0;
 
lru_add_drain();
-   start = 0;
-   rcu_read_lock();
-
-   radix_tree_for_each_slot(slot, >i_pages, , start) {
-   page = radix_tree_deref_slot(slot);
-   if (!page || radix_tree_exception(page)) {
-   if (radix_tree_deref_retry(page)) {
-   slot = radix_tree_iter_retry();
-   continue;
-   }
-   } else if (page_count(page) - page_mapcount(page) > 1) {
-   xa_lock_irq(>i_pages);
-   radix_tree_tag_set(>i_pages, iter.index,
-  SHMEM_TAG_PINNED);
-   xa_unlock_irq(>i_pages);
-   }
 
-   if (need_resched()) {
-   slot = radix_tree_iter_resume(slot, );
-   cond_resched_rcu();
-   }
+   xas_lock_irq();
+   xas_for_each(, page, ULONG_MAX) {
+   if (xa_is_value(page))
+   continue;
+   if (page_count(page) - page_mapcount(page) > 1)
+   xas_set_tag(, SHMEM_TAG_PINNED);
+
+   if (++tagged % XA_CHECK_SCHED)
+   continue;
+
+   xas_pause();
+   xas_unlock_irq();
+   cond_resched();
+   xas_lock_irq();
}
-   rcu_read_unlock();
+   xas_unlock_irq();
 }
 
 /*
-- 
2.16.1

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH v8 50/63] shmem: Convert shmem_free_swap to XArray

2018-03-06 Thread Matthew Wilcox

From: Matthew Wilcox 

This is a perfect use for xa_cmpxchg().  Note the use of 0 for GFP
flags; we won't be allocating memory.

Signed-off-by: Matthew Wilcox 
---
 mm/shmem.c | 7 ++-
 1 file changed, 2 insertions(+), 5 deletions(-)

diff --git a/mm/shmem.c b/mm/shmem.c
index a0a354a87f3b..cfbffb4b47a2 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -635,16 +635,13 @@ static void shmem_delete_from_page_cache(struct page 
*page, void *radswap)
 }
 
 /*
- * Remove swap entry from radix tree, free the swap and its page cache.
+ * Remove swap entry from page cache, free the swap and its page cache.
  */
 static int shmem_free_swap(struct address_space *mapping,
   pgoff_t index, void *radswap)
 {
-   void *old;
+   void *old = xa_cmpxchg(>i_pages, index, radswap, NULL, 0);
 
-   xa_lock_irq(>i_pages);
-   old = radix_tree_delete_item(>i_pages, index, radswap);
-   xa_unlock_irq(>i_pages);
if (old != radswap)
return -ENOENT;
free_swap_and_cache(radix_to_swp_entry(radswap));
-- 
2.16.1

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH v8 49/63] shmem: Convert shmem_alloc_hugepage to XArray

2018-03-06 Thread Matthew Wilcox

From: Matthew Wilcox 

xa_find() is a slightly easier API to use than
radix_tree_gang_lookup_slot() because it contains its own RCU locking.

Signed-off-by: Matthew Wilcox 
---
 mm/shmem.c | 14 --
 1 file changed, 4 insertions(+), 10 deletions(-)

diff --git a/mm/shmem.c b/mm/shmem.c
index fb06fb3e644a..a0a354a87f3b 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1416,23 +1416,17 @@ static struct page *shmem_alloc_hugepage(gfp_t gfp,
struct shmem_inode_info *info, pgoff_t index)
 {
struct vm_area_struct pvma;
-   struct inode *inode = >vfs_inode;
-   struct address_space *mapping = inode->i_mapping;
-   pgoff_t idx, hindex;
-   void __rcu **results;
+   struct address_space *mapping = info->vfs_inode.i_mapping;
+   pgoff_t hindex;
struct page *page;
 
if (!IS_ENABLED(CONFIG_TRANSPARENT_HUGE_PAGECACHE))
return NULL;
 
hindex = round_down(index, HPAGE_PMD_NR);
-   rcu_read_lock();
-   if (radix_tree_gang_lookup_slot(>i_pages, , ,
-   hindex, 1) && idx < hindex + HPAGE_PMD_NR) {
-   rcu_read_unlock();
+   if (xa_find(>i_pages, , hindex + HPAGE_PMD_NR - 1,
+   XA_PRESENT))
return NULL;
-   }
-   rcu_read_unlock();
 
shmem_pseudo_vma_init(, info, hindex);
page = alloc_pages_vma(gfp | __GFP_COMP | __GFP_NORETRY | __GFP_NOWARN,
-- 
2.16.1

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH v8 51/63] shmem: Convert shmem_partial_swap_usage to XArray

2018-03-06 Thread Matthew Wilcox

From: Matthew Wilcox 

Simpler code because the xarray takes care of things like the limit and
dereferencing the slot.

Signed-off-by: Matthew Wilcox 
---
 mm/shmem.c | 18 +++---
 1 file changed, 3 insertions(+), 15 deletions(-)

diff --git a/mm/shmem.c b/mm/shmem.c
index cfbffb4b47a2..707430003ec7 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -658,29 +658,17 @@ static int shmem_free_swap(struct address_space *mapping,
 unsigned long shmem_partial_swap_usage(struct address_space *mapping,
pgoff_t start, pgoff_t end)
 {
-   struct radix_tree_iter iter;
-   void **slot;
+   XA_STATE(xas, >i_pages, start);
struct page *page;
unsigned long swapped = 0;
 
rcu_read_lock();
-
-   radix_tree_for_each_slot(slot, >i_pages, , start) {
-   if (iter.index >= end)
-   break;
-
-   page = radix_tree_deref_slot(slot);
-
-   if (radix_tree_deref_retry(page)) {
-   slot = radix_tree_iter_retry();
-   continue;
-   }
-
+   xas_for_each(, page, end - 1) {
if (xa_is_value(page))
swapped++;
 
if (need_resched()) {
-   slot = radix_tree_iter_resume(slot, );
+   xas_pause();
cond_resched_rcu();
}
}
-- 
2.16.1

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH v8 54/63] shmem: Comment fixups

2018-03-06 Thread Matthew Wilcox

From: Matthew Wilcox 

Remove the last mentions of radix tree from various comments.

Signed-off-by: Matthew Wilcox 
---
 mm/shmem.c | 12 ++--
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/mm/shmem.c b/mm/shmem.c
index 707430003ec7..6b044cb6c8b5 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -743,7 +743,7 @@ void shmem_unlock_mapping(struct address_space *mapping)
 }
 
 /*
- * Remove range of pages and swap entries from radix tree, and free them.
+ * Remove range of pages and swap entries from page cache, and free them.
  * If !unfalloc, truncate or punch hole; if unfalloc, undo failed fallocate.
  */
 static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
@@ -1118,10 +1118,10 @@ static int shmem_unuse_inode(struct shmem_inode_info 
*info,
 * We needed to drop mutex to make that restrictive page
 * allocation, but the inode might have been freed while we
 * dropped it: although a racing shmem_evict_inode() cannot
-* complete without emptying the radix_tree, our page lock
+* complete without emptying the page cache, our page lock
 * on this swapcache page is not enough to prevent that -
 * free_swap_and_cache() of our swap entry will only
-* trylock_page(), removing swap from radix_tree whatever.
+* trylock_page(), removing swap from page cache whatever.
 *
 * We must not proceed to shmem_add_to_page_cache() if the
 * inode has been freed, but of course we cannot rely on
@@ -1187,7 +1187,7 @@ int shmem_unuse(swp_entry_t swap, struct page *page)
false);
if (error)
goto out;
-   /* No radix_tree_preload: swap entry keeps a place for page in tree */
+   /* No memory allocation: swap entry occupies the slot for the page */
error = -EAGAIN;
 
mutex_lock(_swaplist_mutex);
@@ -1866,7 +1866,7 @@ alloc_nohuge: page = 
shmem_alloc_and_acct_page(gfp, inode,
spin_unlock_irq(>lock);
goto repeat;
}
-   if (error == -EEXIST)   /* from above or from radix_tree_insert */
+   if (error == -EEXIST)
goto repeat;
return error;
 }
@@ -2478,7 +2478,7 @@ static ssize_t shmem_file_read_iter(struct kiocb *iocb, 
struct iov_iter *to)
 }
 
 /*
- * llseek SEEK_DATA or SEEK_HOLE through the radix_tree.
+ * llseek SEEK_DATA or SEEK_HOLE through the page cache.
  */
 static pgoff_t shmem_seek_hole_data(struct address_space *mapping,
pgoff_t index, pgoff_t end, int whence)
-- 
2.16.1

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH v8 59/63] f2fs: Convert to XArray

2018-03-06 Thread Matthew Wilcox

From: Matthew Wilcox 

This is a straightforward conversion.

Signed-off-by: Matthew Wilcox 
---
 fs/f2fs/data.c   |  3 +--
 fs/f2fs/dir.c|  5 +
 fs/f2fs/inline.c |  6 +-
 fs/f2fs/node.c   | 10 ++
 4 files changed, 5 insertions(+), 19 deletions(-)

diff --git a/fs/f2fs/data.c b/fs/f2fs/data.c
index dfbccf884d4f..8deac207fbc3 100644
--- a/fs/f2fs/data.c
+++ b/fs/f2fs/data.c
@@ -2384,8 +2384,7 @@ void f2fs_set_page_dirty_nobuffers(struct page *page)
xa_lock_irqsave(>i_pages, flags);
WARN_ON_ONCE(!PageUptodate(page));
account_page_dirtied(page, mapping);
-   radix_tree_tag_set(>i_pages,
-   page_index(page), PAGECACHE_TAG_DIRTY);
+   __xa_set_tag(>i_pages, page_index(page), PAGECACHE_TAG_DIRTY);
xa_unlock_irqrestore(>i_pages, flags);
unlock_page_memcg(page);
 
diff --git a/fs/f2fs/dir.c b/fs/f2fs/dir.c
index a0aa4dd5a7d4..bf7f33f97fdf 100644
--- a/fs/f2fs/dir.c
+++ b/fs/f2fs/dir.c
@@ -708,7 +708,6 @@ void f2fs_delete_entry(struct f2fs_dir_entry *dentry, 
struct page *page,
unsigned int bit_pos;
int slots = GET_DENTRY_SLOTS(le16_to_cpu(dentry->name_len));
struct address_space *mapping = page_mapping(page);
-   unsigned long flags;
int i;
 
f2fs_update_time(F2FS_I_SB(dir), REQ_TIME);
@@ -741,10 +740,8 @@ void f2fs_delete_entry(struct f2fs_dir_entry *dentry, 
struct page *page,
 
if (bit_pos == NR_DENTRY_IN_BLOCK &&
!truncate_hole(dir, page->index, page->index + 1)) {
-   xa_lock_irqsave(>i_pages, flags);
-   radix_tree_tag_clear(>i_pages, page_index(page),
+   xa_clear_tag(>i_pages, page_index(page),
 PAGECACHE_TAG_DIRTY);
-   xa_unlock_irqrestore(>i_pages, flags);
 
clear_page_dirty_for_io(page);
ClearPagePrivate(page);
diff --git a/fs/f2fs/inline.c b/fs/f2fs/inline.c
index eb82891e4ab6..de06efb53cef 100644
--- a/fs/f2fs/inline.c
+++ b/fs/f2fs/inline.c
@@ -204,7 +204,6 @@ int f2fs_write_inline_data(struct inode *inode, struct page 
*page)
void *src_addr, *dst_addr;
struct dnode_of_data dn;
struct address_space *mapping = page_mapping(page);
-   unsigned long flags;
int err;
 
set_new_dnode(, inode, NULL, NULL, 0);
@@ -226,10 +225,7 @@ int f2fs_write_inline_data(struct inode *inode, struct 
page *page)
kunmap_atomic(src_addr);
set_page_dirty(dn.inode_page);
 
-   xa_lock_irqsave(>i_pages, flags);
-   radix_tree_tag_clear(>i_pages, page_index(page),
-PAGECACHE_TAG_DIRTY);
-   xa_unlock_irqrestore(>i_pages, flags);
+   xa_clear_tag(>i_pages, page_index(page), PAGECACHE_TAG_DIRTY);
 
set_inode_flag(inode, FI_APPEND_WRITE);
set_inode_flag(inode, FI_DATA_EXIST);
diff --git a/fs/f2fs/node.c b/fs/f2fs/node.c
index ce912c33b11e..57b444ba988b 100644
--- a/fs/f2fs/node.c
+++ b/fs/f2fs/node.c
@@ -88,14 +88,10 @@ bool available_free_memory(struct f2fs_sb_info *sbi, int 
type)
 static void clear_node_page_dirty(struct page *page)
 {
struct address_space *mapping = page->mapping;
-   unsigned int long flags;
 
if (PageDirty(page)) {
-   xa_lock_irqsave(>i_pages, flags);
-   radix_tree_tag_clear(>i_pages,
-   page_index(page),
+   xa_clear_tag(>i_pages, page_index(page),
PAGECACHE_TAG_DIRTY);
-   xa_unlock_irqrestore(>i_pages, flags);
 
clear_page_dirty_for_io(page);
dec_page_count(F2FS_M_SB(mapping), F2FS_DIRTY_NODES);
@@ -1139,9 +1135,7 @@ void ra_node_page(struct f2fs_sb_info *sbi, nid_t nid)
return;
f2fs_bug_on(sbi, check_nid_range(sbi, nid));
 
-   rcu_read_lock();
-   apage = radix_tree_lookup(_MAPPING(sbi)->i_pages, nid);
-   rcu_read_unlock();
+   apage = xa_load(_MAPPING(sbi)->i_pages, nid);
if (apage)
return;
 
-- 
2.16.1

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH v8 58/63] nilfs2: Convert to XArray

2018-03-06 Thread Matthew Wilcox

From: Matthew Wilcox 

I'm not 100% convinced that the rewrite of nilfs_copy_back_pages is
correct, but it will at least have different bugs from the current
version.

Signed-off-by: Matthew Wilcox 
---
 fs/nilfs2/btnode.c | 37 +++-
 fs/nilfs2/page.c   | 72 +++---
 2 files changed, 56 insertions(+), 53 deletions(-)

diff --git a/fs/nilfs2/btnode.c b/fs/nilfs2/btnode.c
index dec98cab729d..68797603bc08 100644
--- a/fs/nilfs2/btnode.c
+++ b/fs/nilfs2/btnode.c
@@ -177,42 +177,36 @@ int nilfs_btnode_prepare_change_key(struct address_space 
*btnc,
ctxt->newbh = NULL;
 
if (inode->i_blkbits == PAGE_SHIFT) {
-   lock_page(obh->b_page);
-   /*
-* We cannot call radix_tree_preload for the kernels older
-* than 2.6.23, because it is not exported for modules.
-*/
+   void *entry;
+   struct page *opage = obh->b_page;
+   lock_page(opage);
 retry:
-   err = radix_tree_preload(GFP_NOFS & ~__GFP_HIGHMEM);
-   if (err)
-   goto failed_unlock;
/* BUG_ON(oldkey != obh->b_page->index); */
-   if (unlikely(oldkey != obh->b_page->index))
-   NILFS_PAGE_BUG(obh->b_page,
+   if (unlikely(oldkey != opage->index))
+   NILFS_PAGE_BUG(opage,
   "invalid oldkey %lld (newkey=%lld)",
   (unsigned long long)oldkey,
   (unsigned long long)newkey);
 
-   xa_lock_irq(>i_pages);
-   err = radix_tree_insert(>i_pages, newkey, obh->b_page);
-   xa_unlock_irq(>i_pages);
+   entry = xa_cmpxchg(>i_pages, newkey, NULL, opage, 
GFP_NOFS);
/*
 * Note: page->index will not change to newkey until
 * nilfs_btnode_commit_change_key() will be called.
 * To protect the page in intermediate state, the page lock
 * is held.
 */
-   radix_tree_preload_end();
-   if (!err)
+   if (!entry)
return 0;
-   else if (err != -EEXIST)
+   if (xa_is_err(entry)) {
+   err = xa_err(entry);
goto failed_unlock;
+   }
 
err = invalidate_inode_pages2_range(btnc, newkey, newkey);
if (!err)
goto retry;
/* fallback to copy mode */
-   unlock_page(obh->b_page);
+   unlock_page(opage);
}
 
nbh = nilfs_btnode_create_block(btnc, newkey);
@@ -252,9 +246,8 @@ void nilfs_btnode_commit_change_key(struct address_space 
*btnc,
mark_buffer_dirty(obh);
 
xa_lock_irq(>i_pages);
-   radix_tree_delete(>i_pages, oldkey);
-   radix_tree_tag_set(>i_pages, newkey,
-  PAGECACHE_TAG_DIRTY);
+   __xa_erase(>i_pages, oldkey);
+   __xa_set_tag(>i_pages, newkey, PAGECACHE_TAG_DIRTY);
xa_unlock_irq(>i_pages);
 
opage->index = obh->b_blocknr = newkey;
@@ -283,9 +276,7 @@ void nilfs_btnode_abort_change_key(struct address_space 
*btnc,
return;
 
if (nbh == NULL) {  /* blocksize == pagesize */
-   xa_lock_irq(>i_pages);
-   radix_tree_delete(>i_pages, newkey);
-   xa_unlock_irq(>i_pages);
+   xa_erase(>i_pages, newkey);
unlock_page(ctxt->bh->b_page);
} else
brelse(nbh);
diff --git a/fs/nilfs2/page.c b/fs/nilfs2/page.c
index 4cb850a6f1c2..a3995406d5d3 100644
--- a/fs/nilfs2/page.c
+++ b/fs/nilfs2/page.c
@@ -304,10 +304,10 @@ int nilfs_copy_dirty_pages(struct address_space *dmap,
 void nilfs_copy_back_pages(struct address_space *dmap,
   struct address_space *smap)
 {
+   XA_STATE(xas, >i_pages, 0);
struct pagevec pvec;
unsigned int i, n;
pgoff_t index = 0;
-   int err;
 
pagevec_init();
 repeat:
@@ -317,43 +317,56 @@ void nilfs_copy_back_pages(struct address_space *dmap,
 
for (i = 0; i < pagevec_count(); i++) {
struct page *page = pvec.pages[i], *dpage;
-   pgoff_t offset = page->index;
+   xas_set(, page->index);
 
lock_page(page);
-   dpage = find_lock_page(dmap, offset);
+   do {
+   xas_lock_irq();
+   dpage = xas_create();
+   if (!xas_error())
+   break;
+   xas_unlock_irq();
+   if (!xas_nomem(, GFP_NOFS)) {
+

[PATCH v8 55/63] btrfs: Convert page cache to XArray

2018-03-06 Thread Matthew Wilcox

From: Matthew Wilcox 

Signed-off-by: Matthew Wilcox 
---
 fs/btrfs/compression.c | 4 +---
 fs/btrfs/extent_io.c   | 8 +++-
 2 files changed, 4 insertions(+), 8 deletions(-)

diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
index ad330af89eef..c2286f436571 100644
--- a/fs/btrfs/compression.c
+++ b/fs/btrfs/compression.c
@@ -457,9 +457,7 @@ static noinline int add_ra_bio_pages(struct inode *inode,
if (pg_index > end_index)
break;
 
-   rcu_read_lock();
-   page = radix_tree_lookup(>i_pages, pg_index);
-   rcu_read_unlock();
+   page = xa_load(>i_pages, pg_index);
if (page && !xa_is_value(page)) {
misses++;
if (misses > 4)
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index bcc24ee5a2c9..edc472b41037 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -5170,11 +5170,9 @@ void clear_extent_buffer_dirty(struct extent_buffer *eb)
 
clear_page_dirty_for_io(page);
xa_lock_irq(>mapping->i_pages);
-   if (!PageDirty(page)) {
-   radix_tree_tag_clear(>mapping->i_pages,
-   page_index(page),
-   PAGECACHE_TAG_DIRTY);
-   }
+   if (!PageDirty(page))
+   __xa_clear_tag(>mapping->i_pages,
+   page_index(page), PAGECACHE_TAG_DIRTY);
xa_unlock_irq(>mapping->i_pages);
ClearPageError(page);
unlock_page(page);
-- 
2.16.1

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH v8 57/63] fs: Convert writeback to XArray

2018-03-06 Thread Matthew Wilcox

From: Matthew Wilcox 

A couple of short loops.

Signed-off-by: Matthew Wilcox 
---
 fs/fs-writeback.c | 25 +
 1 file changed, 9 insertions(+), 16 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 091577edc497..98e5e08274a2 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -339,9 +339,9 @@ static void inode_switch_wbs_work_fn(struct work_struct 
*work)
struct address_space *mapping = inode->i_mapping;
struct bdi_writeback *old_wb = inode->i_wb;
struct bdi_writeback *new_wb = isw->new_wb;
-   struct radix_tree_iter iter;
+   XA_STATE(xas, >i_pages, 0);
+   struct page *page;
bool switched = false;
-   void **slot;
 
/*
 * By the time control reaches here, RCU grace period has passed
@@ -375,25 +375,18 @@ static void inode_switch_wbs_work_fn(struct work_struct 
*work)
 * to possibly dirty pages while PAGECACHE_TAG_WRITEBACK points to
 * pages actually under writeback.
 */
-   radix_tree_for_each_tagged(slot, >i_pages, , 0,
-  PAGECACHE_TAG_DIRTY) {
-   struct page *page = radix_tree_deref_slot_protected(slot,
-   >i_pages.xa_lock);
-   if (likely(page) && PageDirty(page)) {
+   xas_for_each_tag(, page, ULONG_MAX, PAGECACHE_TAG_DIRTY) {
+   if (PageDirty(page)) {
dec_wb_stat(old_wb, WB_RECLAIMABLE);
inc_wb_stat(new_wb, WB_RECLAIMABLE);
}
}
 
-   radix_tree_for_each_tagged(slot, >i_pages, , 0,
-  PAGECACHE_TAG_WRITEBACK) {
-   struct page *page = radix_tree_deref_slot_protected(slot,
-   >i_pages.xa_lock);
-   if (likely(page)) {
-   WARN_ON_ONCE(!PageWriteback(page));
-   dec_wb_stat(old_wb, WB_WRITEBACK);
-   inc_wb_stat(new_wb, WB_WRITEBACK);
-   }
+   xas_set(, 0);
+   xas_for_each_tag(, page, ULONG_MAX, PAGECACHE_TAG_WRITEBACK) {
+   WARN_ON_ONCE(!PageWriteback(page));
+   dec_wb_stat(old_wb, WB_WRITEBACK);
+   inc_wb_stat(new_wb, WB_WRITEBACK);
}
 
wb_get(new_wb);
-- 
2.16.1

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH v8 56/63] fs: Convert buffer to XArray

2018-03-06 Thread Matthew Wilcox

From: Matthew Wilcox 

Mostly comment fixes, but one use of __xa_set_tag.

Signed-off-by: Matthew Wilcox 
---
 fs/buffer.c | 14 +++---
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index 3ee82c056d85..70af8fbc64cf 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -585,7 +585,7 @@ void mark_buffer_dirty_inode(struct buffer_head *bh, struct 
inode *inode)
 EXPORT_SYMBOL(mark_buffer_dirty_inode);
 
 /*
- * Mark the page dirty, and set it dirty in the radix tree, and mark the inode
+ * Mark the page dirty, and set it dirty in the page cache, and mark the inode
  * dirty.
  *
  * If warn is true, then emit a warning if the page is not uptodate and has
@@ -602,8 +602,8 @@ void __set_page_dirty(struct page *page, struct 
address_space *mapping,
if (page->mapping) {/* Race with truncate? */
WARN_ON_ONCE(warn && !PageUptodate(page));
account_page_dirtied(page, mapping);
-   radix_tree_tag_set(>i_pages,
-   page_index(page), PAGECACHE_TAG_DIRTY);
+   __xa_set_tag(>i_pages, page_index(page),
+   PAGECACHE_TAG_DIRTY);
}
xa_unlock_irqrestore(>i_pages, flags);
 }
@@ -1066,7 +1066,7 @@ __getblk_slow(struct block_device *bdev, sector_t block,
  * The relationship between dirty buffers and dirty pages:
  *
  * Whenever a page has any dirty buffers, the page's dirty bit is set, and
- * the page is tagged dirty in its radix tree.
+ * the page is tagged dirty in the page cache.
  *
  * At all times, the dirtiness of the buffers represents the dirtiness of
  * subsections of the page.  If the page has buffers, the page dirty bit is
@@ -1089,9 +1089,9 @@ __getblk_slow(struct block_device *bdev, sector_t block,
  * mark_buffer_dirty - mark a buffer_head as needing writeout
  * @bh: the buffer_head to mark dirty
  *
- * mark_buffer_dirty() will set the dirty bit against the buffer, then set its
- * backing page dirty, then tag the page as dirty in its address_space's radix
- * tree and then attach the address_space's inode to its superblock's dirty
+ * mark_buffer_dirty() will set the dirty bit against the buffer, then set
+ * its backing page dirty, then tag the page as dirty in the page cache
+ * and then attach the address_space's inode to its superblock's dirty
  * inode list.
  *
  * mark_buffer_dirty() is atomic.  It takes bh->b_page->mapping->private_lock,
-- 
2.16.1

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH v8 61/63] dax: Convert to XArray

2018-03-06 Thread Matthew Wilcox

From: Matthew Wilcox 

The DAX code (by its nature) is deeply interwoven with the radix tree
infrastructure, doing operations directly on the radix tree slots.
Convert the whole file to use XArray concepts; mostly passing around
xa_state instead of address_space, index or slot.

Signed-off-by: Matthew Wilcox 
---
 fs/dax.c | 369 +--
 1 file changed, 143 insertions(+), 226 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index ae70bebdb835..d14c2d931377 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -45,6 +45,7 @@
 /* The 'colour' (ie low bits) within a PMD of a page offset.  */
 #define PG_PMD_COLOUR  ((PMD_SIZE >> PAGE_SHIFT) - 1)
 #define PG_PMD_NR  (PMD_SIZE >> PAGE_SHIFT)
+#define PMD_ORDER  (PMD_SHIFT - PAGE_SHIFT)
 
 static wait_queue_head_t wait_table[DAX_WAIT_TABLE_ENTRIES];
 
@@ -74,21 +75,26 @@ fs_initcall(init_dax_wait_table);
 #define DAX_ZERO_PAGE  (1UL << 2)
 #define DAX_EMPTY  (1UL << 3)
 
-static unsigned long dax_radix_sector(void *entry)
+static bool xa_is_dax_locked(void *entry)
+{
+   return xa_to_value(entry) & DAX_ENTRY_LOCK;
+}
+
+static unsigned long xa_to_dax_sector(void *entry)
 {
return xa_to_value(entry) >> DAX_SHIFT;
 }
 
-static void *dax_radix_locked_entry(sector_t sector, unsigned long flags)
+static void *xa_mk_dax_locked(sector_t sector, unsigned long flags)
 {
return xa_mk_value(flags | ((unsigned long)sector << DAX_SHIFT) |
DAX_ENTRY_LOCK);
 }
 
-static unsigned int dax_radix_order(void *entry)
+static unsigned int dax_entry_order(void *entry)
 {
if (xa_to_value(entry) & DAX_PMD)
-   return PMD_SHIFT - PAGE_SHIFT;
+   return PMD_ORDER;
return 0;
 }
 
@@ -113,10 +119,10 @@ static int dax_is_empty_entry(void *entry)
 }
 
 /*
- * DAX radix tree locking
+ * DAX page cache entry locking
  */
 struct exceptional_entry_key {
-   struct address_space *mapping;
+   struct xarray *xa;
pgoff_t entry_start;
 };
 
@@ -125,9 +131,10 @@ struct wait_exceptional_entry_queue {
struct exceptional_entry_key key;
 };
 
-static wait_queue_head_t *dax_entry_waitqueue(struct address_space *mapping,
-   pgoff_t index, void *entry, struct exceptional_entry_key *key)
+static wait_queue_head_t *dax_entry_waitqueue(struct xa_state *xas,
+   void *entry, struct exceptional_entry_key *key)
 {
+   unsigned long index = xas->xa_index;
unsigned long hash;
 
/*
@@ -138,10 +145,10 @@ static wait_queue_head_t *dax_entry_waitqueue(struct 
address_space *mapping,
if (dax_is_pmd_entry(entry))
index &= ~PG_PMD_COLOUR;
 
-   key->mapping = mapping;
+   key->xa = xas->xa;
key->entry_start = index;
 
-   hash = hash_long((unsigned long)mapping ^ index, DAX_WAIT_TABLE_BITS);
+   hash = hash_long((unsigned long)xas->xa ^ index, DAX_WAIT_TABLE_BITS);
return wait_table + hash;
 }
 
@@ -152,7 +159,7 @@ static int wake_exceptional_entry_func(wait_queue_entry_t 
*wait, unsigned int mo
struct wait_exceptional_entry_queue *ewait =
container_of(wait, struct wait_exceptional_entry_queue, wait);
 
-   if (key->mapping != ewait->key.mapping ||
+   if (key->xa != ewait->key.xa ||
key->entry_start != ewait->key.entry_start)
return 0;
return autoremove_wake_function(wait, mode, sync, NULL);
@@ -163,13 +170,12 @@ static int wake_exceptional_entry_func(wait_queue_entry_t 
*wait, unsigned int mo
  * The important information it's conveying is whether the entry at
  * this index used to be a PMD entry.
  */
-static void dax_wake_mapping_entry_waiter(struct address_space *mapping,
-   pgoff_t index, void *entry, bool wake_all)
+static void dax_wake_entry(struct xa_state *xas, void *entry, bool wake_all)
 {
struct exceptional_entry_key key;
wait_queue_head_t *wq;
 
-   wq = dax_entry_waitqueue(mapping, index, entry, );
+   wq = dax_entry_waitqueue(xas, entry, );
 
/*
 * Checking for locked entry and prepare_to_wait_exclusive() happens
@@ -182,53 +188,27 @@ static void dax_wake_mapping_entry_waiter(struct 
address_space *mapping,
 }
 
 /*
- * Check whether the given slot is locked.  Must be called with the i_pages
- * lock held.
- */
-static inline int slot_locked(struct address_space *mapping, void **slot)
-{
-   unsigned long entry = xa_to_value(
-   radix_tree_deref_slot_protected(slot, 
>i_pages.xa_lock));
-   return entry & DAX_ENTRY_LOCK;
-}
-
-/*
- * Mark the given slot as locked.  Must be called with the i_pages lock held.
+ * Mark the given entry as locked.  Must be called with the i_pages lock held.
  */
-static inline void *lock_slot(struct address_space *mapping, void **slot)
+static inline void *lock_entry(struct xa_state *xas)
 {
-   unsigned long v = xa_to_value(
-

[PATCH v8 60/63] lustre: Convert to XArray

2018-03-06 Thread Matthew Wilcox

From: Matthew Wilcox 

Signed-off-by: Matthew Wilcox 
---
 drivers/staging/lustre/lustre/llite/glimpse.c   | 12 +---
 drivers/staging/lustre/lustre/mdc/mdc_request.c | 16 
 2 files changed, 13 insertions(+), 15 deletions(-)

diff --git a/drivers/staging/lustre/lustre/llite/glimpse.c 
b/drivers/staging/lustre/lustre/llite/glimpse.c
index 3075358f3f08..014035be5ac7 100644
--- a/drivers/staging/lustre/lustre/llite/glimpse.c
+++ b/drivers/staging/lustre/lustre/llite/glimpse.c
@@ -57,7 +57,7 @@ static const struct cl_lock_descr whole_file = {
 };
 
 /*
- * Check whether file has possible unwriten pages.
+ * Check whether file has possible unwritten pages.
  *
  * \retval 1file is mmap-ed or has dirty pages
  *  0otherwise
@@ -66,16 +66,14 @@ blkcnt_t dirty_cnt(struct inode *inode)
 {
blkcnt_t cnt = 0;
struct vvp_object *vob = cl_inode2vvp(inode);
-   void  *results[1];
 
-   if (inode->i_mapping)
-   cnt += radix_tree_gang_lookup_tag(>i_mapping->i_pages,
- results, 0, 1,
- PAGECACHE_TAG_DIRTY);
+   if (inode->i_mapping && xa_tagged(>i_mapping->i_pages,
+   PAGECACHE_TAG_DIRTY))
+   cnt = 1;
if (cnt == 0 && atomic_read(>vob_mmap_cnt) > 0)
cnt = 1;
 
-   return (cnt > 0) ? 1 : 0;
+   return cnt;
 }
 
 int cl_glimpse_lock(const struct lu_env *env, struct cl_io *io,
diff --git a/drivers/staging/lustre/lustre/mdc/mdc_request.c 
b/drivers/staging/lustre/lustre/mdc/mdc_request.c
index 6950cb21638e..dbda8a9e351d 100644
--- a/drivers/staging/lustre/lustre/mdc/mdc_request.c
+++ b/drivers/staging/lustre/lustre/mdc/mdc_request.c
@@ -931,17 +931,18 @@ static struct page *mdc_page_locate(struct address_space 
*mapping, __u64 *hash,
 * hash _smaller_ than one we are looking for.
 */
unsigned long offset = hash_x_index(*hash, hash64);
+   XA_STATE(xas, >i_pages, offset);
struct page *page;
-   int found;
 
-   xa_lock_irq(>i_pages);
-   found = radix_tree_gang_lookup(>i_pages,
-  (void **), offset, 1);
-   if (found > 0 && !xa_is_value(page)) {
+   xas_lock_irq();
+   page = xas_find(, ULONG_MAX);
+   if (xa_is_value(page))
+   page = NULL;
+   if (page) {
struct lu_dirpage *dp;
 
get_page(page);
-   xa_unlock_irq(>i_pages);
+   xas_unlock_irq();
/*
 * In contrast to find_lock_page() we are sure that directory
 * page cannot be truncated (while DLM lock is held) and,
@@ -989,8 +990,7 @@ static struct page *mdc_page_locate(struct address_space 
*mapping, __u64 *hash,
page = ERR_PTR(-EIO);
}
} else {
-   xa_unlock_irq(>i_pages);
-   page = NULL;
+   xas_unlock_irq();
}
return page;
 }
-- 
2.16.1

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH v8 63/63] radix tree: Remove unused functions

2018-03-06 Thread Matthew Wilcox

From: Matthew Wilcox 

The following functions are (now) unused:
 - __radix_tree_delete_node
 - radix_tree_gang_lookup_slot
 - radix_tree_join
 - radix_tree_maybe_preload_order
 - radix_tree_split
 - radix_tree_split_preload

Signed-off-by: Matthew Wilcox 
---
 include/linux/radix-tree.h |  16 +--
 lib/radix-tree.c   | 294 +
 2 files changed, 4 insertions(+), 306 deletions(-)

diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h
index f64beb9ba175..eb2ae901f2ec 100644
--- a/include/linux/radix-tree.h
+++ b/include/linux/radix-tree.h
@@ -147,12 +147,11 @@ static inline unsigned int iter_shift(const struct 
radix_tree_iter *iter)
  * radix_tree_lookup_slot
  * radix_tree_tag_get
  * radix_tree_gang_lookup
- * radix_tree_gang_lookup_slot
  * radix_tree_gang_lookup_tag
  * radix_tree_gang_lookup_tag_slot
  * radix_tree_tagged
  *
- * The first 8 functions are able to be called locklessly, using RCU. The
+ * The first 7 functions are able to be called locklessly, using RCU. The
  * caller must ensure calls to these functions are made within rcu_read_lock()
  * regions. Other readers (lock-free or otherwise) and modifications may be
  * running concurrently.
@@ -254,9 +253,6 @@ void radix_tree_iter_replace(struct radix_tree_root *,
const struct radix_tree_iter *, void __rcu **slot, void *entry);
 void radix_tree_replace_slot(struct radix_tree_root *,
 void __rcu **slot, void *entry);
-void __radix_tree_delete_node(struct radix_tree_root *,
- struct radix_tree_node *,
- radix_tree_update_node_t update_node);
 void radix_tree_iter_delete(struct radix_tree_root *,
struct radix_tree_iter *iter, void __rcu **slot);
 void *radix_tree_delete_item(struct radix_tree_root *, unsigned long, void *);
@@ -266,12 +262,8 @@ void radix_tree_clear_tags(struct radix_tree_root *, 
struct radix_tree_node *,
 unsigned int radix_tree_gang_lookup(const struct radix_tree_root *,
void **results, unsigned long first_index,
unsigned int max_items);
-unsigned int radix_tree_gang_lookup_slot(const struct radix_tree_root *,
-   void __rcu ***results, unsigned long *indices,
-   unsigned long first_index, unsigned int max_items);
 int radix_tree_preload(gfp_t gfp_mask);
 int radix_tree_maybe_preload(gfp_t gfp_mask);
-int radix_tree_maybe_preload_order(gfp_t gfp_mask, int order);
 void radix_tree_init(void);
 void *radix_tree_tag_set(struct radix_tree_root *,
unsigned long index, unsigned int tag);
@@ -296,12 +288,6 @@ static inline void radix_tree_preload_end(void)
preempt_enable();
 }
 
-int radix_tree_split_preload(unsigned old_order, unsigned new_order, gfp_t);
-int radix_tree_split(struct radix_tree_root *, unsigned long index,
-   unsigned new_order);
-int radix_tree_join(struct radix_tree_root *, unsigned long index,
-   unsigned new_order, void *);
-
 void __rcu **idr_get_free(struct radix_tree_root *root,
  struct radix_tree_iter *iter, gfp_t gfp,
  unsigned long max);
diff --git a/lib/radix-tree.c b/lib/radix-tree.c
index 71697bd25140..20858120ac0b 100644
--- a/lib/radix-tree.c
+++ b/lib/radix-tree.c
@@ -41,9 +41,6 @@
 #include 
 
 
-/* Number of nodes in fully populated tree of given height */
-static unsigned long height_to_maxnodes[RADIX_TREE_MAX_PATH + 1] __read_mostly;
-
 /*
  * Radix tree node cache.
  */
@@ -463,73 +460,6 @@ int radix_tree_maybe_preload(gfp_t gfp_mask)
 }
 EXPORT_SYMBOL(radix_tree_maybe_preload);
 
-#ifdef CONFIG_RADIX_TREE_MULTIORDER
-/*
- * Preload with enough objects to ensure that we can split a single entry
- * of order @old_order into many entries of size @new_order
- */
-int radix_tree_split_preload(unsigned int old_order, unsigned int new_order,
-   gfp_t gfp_mask)
-{
-   unsigned top = 1 << (old_order % RADIX_TREE_MAP_SHIFT);
-   unsigned layers = (old_order / RADIX_TREE_MAP_SHIFT) -
-   (new_order / RADIX_TREE_MAP_SHIFT);
-   unsigned nr = 0;
-
-   WARN_ON_ONCE(!gfpflags_allow_blocking(gfp_mask));
-   BUG_ON(new_order >= old_order);
-
-   while (layers--)
-   nr = nr * RADIX_TREE_MAP_SIZE + 1;
-   return __radix_tree_preload(gfp_mask, top * nr);
-}
-#endif
-
-/*
- * The same as function above, but preload number of nodes required to insert
- * (1 << order) continuous naturally-aligned elements.
- */
-int radix_tree_maybe_preload_order(gfp_t gfp_mask, int order)
-{
-   unsigned long nr_subtrees;
-   int nr_nodes, subtree_height;
-
-   /* Preloading doesn't help anything with this gfp mask, skip it */
-   if

[PATCH v8 62/63] page cache: Finish XArray conversion

2018-03-06 Thread Matthew Wilcox

From: Matthew Wilcox 

With no more radix tree API users left, we can drop the GFP flags
and use xa_init() instead of INIT_RADIX_TREE().

Signed-off-by: Matthew Wilcox 
---
 fs/inode.c  | 2 +-
 mm/swap_state.c | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/inode.c b/fs/inode.c
index 14d938e2c9a2..6dcbb1b47fb7 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -349,7 +349,7 @@ EXPORT_SYMBOL(inc_nlink);
 void address_space_init_once(struct address_space *mapping)
 {
memset(mapping, 0, sizeof(*mapping));
-   INIT_RADIX_TREE(>i_pages, GFP_ATOMIC | __GFP_ACCOUNT);
+   xa_init_flags(>i_pages, XA_FLAGS_LOCK_IRQ);
init_rwsem(>i_mmap_rwsem);
INIT_LIST_HEAD(>private_list);
spin_lock_init(>private_lock);
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 21218f6e438b..ebf837d13d2d 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -595,7 +595,7 @@ int init_swap_address_space(unsigned int type, unsigned 
long nr_pages)
return -ENOMEM;
for (i = 0; i < nr; i++) {
space = spaces + i;
-   INIT_RADIX_TREE(>i_pages, GFP_ATOMIC|__GFP_NOWARN);
+   xa_init_flags(>i_pages, XA_FLAGS_LOCK_IRQ);
atomic_set(>i_mmap_writable, 0);
space->a_ops = _aops;
/* swap cache doesn't use writeback related tags */
-- 
2.16.1

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH v8 46/63] shmem: Convert shmem_confirm_swap to XArray

2018-03-06 Thread Matthew Wilcox

From: Matthew Wilcox 

xa_load has its own RCU locking, so we can eliminate it here.

Signed-off-by: Matthew Wilcox 
---
 mm/shmem.c | 7 +--
 1 file changed, 1 insertion(+), 6 deletions(-)

diff --git a/mm/shmem.c b/mm/shmem.c
index 5813808965cd..0af8a439dfad 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -348,12 +348,7 @@ static int shmem_xa_replace(struct address_space *mapping,
 static bool shmem_confirm_swap(struct address_space *mapping,
   pgoff_t index, swp_entry_t swap)
 {
-   void *item;
-
-   rcu_read_lock();
-   item = radix_tree_lookup(>i_pages, index);
-   rcu_read_unlock();
-   return item == swp_to_radix_entry(swap);
+   return xa_load(>i_pages, index) == swp_to_radix_entry(swap);
 }
 
 /*
-- 
2.16.1

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] Btrfs: scrub: batch rebuild for raid56

2018-03-06 Thread Liu Bo

On Tue, Mar 06, 2018 at 11:47:47AM +0100, David Sterba wrote:
> On Fri, Mar 02, 2018 at 04:10:37PM -0700, Liu Bo wrote:
> > In case of raid56, writes and rebuilds always take BTRFS_STRIPE_LEN(64K)
> > as unit, however, scrub_extent() sets blocksize as unit, so rebuild
> > process may be triggered on every block on a same stripe.
> > 
> > A typical example would be that when we're replacing a disappeared disk,
> > all reads on the disks get -EIO, every block (size is 4K if blocksize is
> > 4K) would go thru these,
> > 
> > scrub_handle_errored_block
> >   scrub_recheck_block # re-read pages one by one
> >   scrub_recheck_block # rebuild by calling raid56_parity_recover()
> > page by page
> > 
> > Although with raid56 stripe cache most of reads during rebuild can be
> > avoided, the parity recover calculation(xor or raid6 algorithms) needs to
> > be done $(BTRFS_STRIPE_LEN / blocksize) times.
> > 
> > This makes it less stupid by doing raid56 scrub/replace on stripe length.
> 
> missing s-o-b
>

I'm surprised that checkpatch.pl didn't complain.

> > ---
> >  fs/btrfs/scrub.c | 78 
> > +++-
> >  1 file changed, 60 insertions(+), 18 deletions(-)
> > 
> > diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
> > index 9882513..e3203a1 100644
> > --- a/fs/btrfs/scrub.c
> > +++ b/fs/btrfs/scrub.c
> > @@ -1718,6 +1718,44 @@ static int scrub_submit_raid56_bio_wait(struct 
> > btrfs_fs_info *fs_info,
> > return blk_status_to_errno(bio->bi_status);
> >  }
> >  
> > +static void scrub_recheck_block_on_raid56(struct btrfs_fs_info *fs_info,
> > + struct scrub_block *sblock)
> > +{
> > +   struct scrub_page *first_page = sblock->pagev[0];
> > +   struct bio *bio = btrfs_io_bio_alloc(BIO_MAX_PAGES);
> 
> nontrivial initializations (variable to variable) are better put into
> the statement section.
>

OK.

> > +   int page_num;
> > +
> > +   /* All pages in sblock belongs to the same stripe on the same device. */
> > +   ASSERT(first_page->dev);
> > +   if (first_page->dev->bdev == NULL)
> > +   goto out;
> > +
> > +   bio_set_dev(bio, first_page->dev->bdev);
> > +
> > +   for (page_num = 0; page_num < sblock->page_count; page_num++) {
> > +   struct scrub_page *page = sblock->pagev[page_num];
> > +
> > +   WARN_ON(!page->page);
> > +   bio_add_page(bio, page->page, PAGE_SIZE, 0);
> > +   }
> > +
> > +   if (scrub_submit_raid56_bio_wait(fs_info, bio, first_page)) {
> > +   bio_put(bio);
> > +   goto out;
> > +   }
> > +
> > +   bio_put(bio);
> > +
> > +   scrub_recheck_block_checksum(sblock);
> > +
> > +   return;
> > +out:
> > +   for (page_num = 0; page_num < sblock->page_count; page_num++)
> > +   sblock->pagev[page_num]->io_error = 1;
> > +
> > +   sblock->no_io_error_seen = 0;
> > +}
> > +
> >  /*
> >   * this function will check the on disk data for checksum errors, header
> >   * errors and read I/O errors. If any I/O errors happen, the exact pages
> > @@ -1733,6 +1771,10 @@ static void scrub_recheck_block(struct btrfs_fs_info 
> > *fs_info,
> >  
> > sblock->no_io_error_seen = 1;
> >  
> > +   /* short cut for raid56 */
> > +   if (!retry_failed_mirror && scrub_is_page_on_raid56(sblock->pagev[0]))
> > +   return scrub_recheck_block_on_raid56(fs_info, sblock);
> > +
> > for (page_num = 0; page_num < sblock->page_count; page_num++) {
> > struct bio *bio;
> > struct scrub_page *page = sblock->pagev[page_num];
> > @@ -1748,19 +1790,12 @@ static void scrub_recheck_block(struct 
> > btrfs_fs_info *fs_info,
> > bio_set_dev(bio, page->dev->bdev);
> >  
> > bio_add_page(bio, page->page, PAGE_SIZE, 0);
> > -   if (!retry_failed_mirror && scrub_is_page_on_raid56(page)) {
> > -   if (scrub_submit_raid56_bio_wait(fs_info, bio, page)) {
> > -   page->io_error = 1;
> > -   sblock->no_io_error_seen = 0;
> > -   }
> > -   } else {
> > -   bio->bi_iter.bi_sector = page->physical >> 9;
> > -   bio_set_op_attrs(bio, REQ_OP_READ, 0);
> > +   bio->bi_iter.bi_sector = page->physical >> 9;
> > +   bio_set_op_attrs(bio, REQ_OP_READ, 0);
> 
> https://elixir.bootlin.com/linux/latest/source/include/linux/blk_types.h#L270
> 
> bio_set_op_attrs should not be used

OK.

Thanks,

-liubo
> 
> >  
> > -   if (btrfsic_submit_bio_wait(bio)) {
> > -   page->io_error = 1;
> > -   sblock->no_io_error_seen = 0;
> > -   }
> > +   if (btrfsic_submit_bio_wait(bio)) {
> > +   page->io_error = 1;
> > +   sblock->no_io_error_seen = 0;
> > }
> >  
> > bio_put(bio);
> > @@ -2728,7 +2763,8 @@ static int scrub_find_csum(struct scrub_ctx *sctx, 
> > u64

[PATCH v8 48/63] shmem: Convert shmem_add_to_page_cache to XArray

2018-03-06 Thread Matthew Wilcox

From: Matthew Wilcox 

This removes the last caller of radix_tree_maybe_preload_order().
Simpler code, unless we run out of memory for new xa_nodes partway through
inserting entries into the xarray.  Hopefully we can support multi-index
entries in the page cache soon and all the awful code goes away.

Signed-off-by: Matthew Wilcox 
---
 mm/shmem.c | 87 --
 1 file changed, 39 insertions(+), 48 deletions(-)

diff --git a/mm/shmem.c b/mm/shmem.c
index 49f42dc9e1dc..fb06fb3e644a 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -558,9 +558,10 @@ static unsigned long shmem_unused_huge_shrink(struct 
shmem_sb_info *sbinfo,
  */
 static int shmem_add_to_page_cache(struct page *page,
   struct address_space *mapping,
-  pgoff_t index, void *expected)
+  pgoff_t index, void *expected, gfp_t gfp)
 {
-   int error, nr = hpage_nr_pages(page);
+   XA_STATE(xas, >i_pages, index);
+   unsigned long i, nr = 1UL << compound_order(page);
 
VM_BUG_ON_PAGE(PageTail(page), page);
VM_BUG_ON_PAGE(index != round_down(index, nr), page);
@@ -569,49 +570,47 @@ static int shmem_add_to_page_cache(struct page *page,
VM_BUG_ON(expected && PageTransHuge(page));
 
page_ref_add(page, nr);
-   page->mapping = mapping;
page->index = index;
+   page->mapping = mapping;
 
-   xa_lock_irq(>i_pages);
-   if (PageTransHuge(page)) {
-   void __rcu **results;
-   pgoff_t idx;
-   int i;
-
-   error = 0;
-   if (radix_tree_gang_lookup_slot(>i_pages,
-   , , index, 1) &&
-   idx < index + HPAGE_PMD_NR) {
-   error = -EEXIST;
+   do {
+   xas_lock_irq();
+   xas_create_range(, index + nr - 1);
+   if (xas_error())
+   goto unlock;
+   for (i = 0; i < nr; i++) {
+   void *entry = xas_load();
+   if (entry != expected)
+   xas_set_err(, -ENOENT);
+   if (xas_error())
+   goto undo;
+   xas_store(, page + i);
+   xas_next();
}
-
-   if (!error) {
-   for (i = 0; i < HPAGE_PMD_NR; i++) {
-   error = radix_tree_insert(>i_pages,
-   index + i, page + i);
-   VM_BUG_ON(error);
-   }
+   if (PageTransHuge(page)) {
count_vm_event(THP_FILE_ALLOC);
+   __inc_node_page_state(page, NR_SHMEM_THPS);
}
-   } else if (!expected) {
-   error = radix_tree_insert(>i_pages, index, page);
-   } else {
-   error = shmem_xa_replace(mapping, index, expected, page);
-   }
-
-   if (!error) {
mapping->nrpages += nr;
-   if (PageTransHuge(page))
-   __inc_node_page_state(page, NR_SHMEM_THPS);
__mod_node_page_state(page_pgdat(page), NR_FILE_PAGES, nr);
__mod_node_page_state(page_pgdat(page), NR_SHMEM, nr);
-   xa_unlock_irq(>i_pages);
-   } else {
+   goto unlock;
+undo:
+   while (i-- > 0) {
+   xas_store(, NULL);
+   xas_prev();
+   }
+unlock:
+   xas_unlock_irq();
+   } while (xas_nomem(, gfp));
+
+   if (xas_error()) {
page->mapping = NULL;
-   xa_unlock_irq(>i_pages);
page_ref_sub(page, nr);
+   return xas_error();
}
-   return error;
+
+   return 0;
 }
 
 /*
@@ -1159,7 +1158,7 @@ static int shmem_unuse_inode(struct shmem_inode_info 
*info,
 */
if (!error)
error = shmem_add_to_page_cache(*pagep, mapping, index,
-   radswap);
+   radswap, gfp);
if (error != -ENOMEM) {
/*
 * Truncation and eviction use free_swap_and_cache(), which
@@ -1680,7 +1679,7 @@ static int shmem_getpage_gfp(struct inode *inode, pgoff_t 
index,
false);
if (!error) {
error = shmem_add_to_page_cache(page, mapping, index,
-   swp_to_radix_entry(swap));
+   swp_to_radix_entry(swap), gfp);
/*
 * We already confirmed swap under page lock, and make
 * no memory

[PATCH v8 14/63] xarray: Define struct xa_node

2018-03-06 Thread Matthew Wilcox

From: Matthew Wilcox 

This is a direct replacement for struct radix_tree_node.  A couple of
struct members have changed name, so convert those.  Use a #define so
that radix tree users continue to work without change.

Signed-off-by: Matthew Wilcox 
---
 include/linux/radix-tree.h| 29 +++--
 include/linux/xarray.h| 24 ++
 lib/radix-tree.c  | 48 +--
 mm/workingset.c   | 16 ++--
 tools/testing/radix-tree/multiorder.c | 30 +++---
 5 files changed, 74 insertions(+), 73 deletions(-)

diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h
index c8a33e9e9a3c..f64beb9ba175 100644
--- a/include/linux/radix-tree.h
+++ b/include/linux/radix-tree.h
@@ -32,6 +32,7 @@
 
 /* Keep unconverted code working */
 #define radix_tree_rootxarray
+#define radix_tree_nodexa_node
 
 /*
  * The bottom two bits of the slot determine how the remaining bits in the
@@ -60,41 +61,17 @@ static inline bool radix_tree_is_internal_node(void *ptr)
 
 /*** radix-tree API starts here ***/
 
-#define RADIX_TREE_MAX_TAGS 3
-
 #define RADIX_TREE_MAP_SHIFT   XA_CHUNK_SHIFT
 #define RADIX_TREE_MAP_SIZE(1UL << RADIX_TREE_MAP_SHIFT)
 #define RADIX_TREE_MAP_MASK(RADIX_TREE_MAP_SIZE-1)
 
-#define RADIX_TREE_TAG_LONGS   \
-   ((RADIX_TREE_MAP_SIZE + BITS_PER_LONG - 1) / BITS_PER_LONG)
+#define RADIX_TREE_MAX_TAGSXA_MAX_TAGS
+#define RADIX_TREE_TAG_LONGS   XA_TAG_LONGS
 
 #define RADIX_TREE_INDEX_BITS  (8 /* CHAR_BIT */ * sizeof(unsigned long))
 #define RADIX_TREE_MAX_PATH (DIV_ROUND_UP(RADIX_TREE_INDEX_BITS, \
  RADIX_TREE_MAP_SHIFT))
 
-/*
- * @count is the count of every non-NULL element in the ->slots array
- * whether that is a data entry, a retry entry, a user pointer,
- * a sibling entry or a pointer to the next level of the tree.
- * @exceptional is the count of every element in ->slots which is
- * either a data entry or a sibling entry for data.
- */
-struct radix_tree_node {
-   unsigned char   shift;  /* Bits remaining in each slot */
-   unsigned char   offset; /* Slot offset in parent */
-   unsigned char   count;  /* Total entry count */
-   unsigned char   exceptional;/* Exceptional entry count */
-   struct radix_tree_node *parent; /* Used when ascending tree */
-   struct radix_tree_root *root;   /* The tree we belong to */
-   union {
-   struct list_head private_list;  /* For tree user */
-   struct rcu_head rcu_head;   /* Used when freeing node */
-   };
-   void __rcu  *slots[RADIX_TREE_MAP_SIZE];
-   unsigned long   tags[RADIX_TREE_MAX_TAGS][RADIX_TREE_TAG_LONGS];
-};
-
 /* The IDR tag is stored in the low bits of xa_flags */
 #define ROOT_IS_IDR((__force gfp_t)4)
 /* The top bits of xa_flags are used to store the root tags */
diff --git a/include/linux/xarray.h b/include/linux/xarray.h
index 9b05b907062b..b51f354dfbf0 100644
--- a/include/linux/xarray.h
+++ b/include/linux/xarray.h
@@ -195,6 +195,30 @@ static inline void xa_init(struct xarray *xa)
 #endif
 #define XA_CHUNK_SIZE  (1UL << XA_CHUNK_SHIFT)
 #define XA_CHUNK_MASK  (XA_CHUNK_SIZE - 1)
+#define XA_MAX_TAGS3
+#define XA_TAG_LONGS   DIV_ROUND_UP(XA_CHUNK_SIZE, BITS_PER_LONG)
+
+/*
+ * @count is the count of every non-NULL element in the ->slots array
+ * whether that is a value entry, a retry entry, a user pointer,
+ * a sibling entry or a pointer to the next level of the tree.
+ * @nr_values is the count of every element in ->slots which is
+ * either a value entry or a sibling entry to a value entry.
+ */
+struct xa_node {
+   unsigned char   shift;  /* Bits remaining in each slot */
+   unsigned char   offset; /* Slot offset in parent */
+   unsigned char   count;  /* Total entry count */
+   unsigned char   nr_values;  /* Value entry count */
+   struct xa_node __rcu *parent;   /* NULL at top of tree */
+   struct xarray   *array; /* The array we belong to */
+   union {
+   struct list_head private_list;  /* For tree user */
+   struct rcu_head rcu_head;   /* Used when freeing node */
+   };
+   void __rcu  *slots[XA_CHUNK_SIZE];
+   unsigned long   tags[XA_MAX_TAGS][XA_TAG_LONGS];
+};
 
 /* Private */
 static inline bool xa_is_node(const void *entry)
diff --git a/lib/radix-tree.c b/lib/radix-tree.c
index c9ae6e6579f8..e98de16b1648 100644
--- a/lib/radix-tree.c
+++ b/lib/radix-tree.c
@@ -260,11 +260,11 @@ static void dump_node(struct radix_tree_node *node, 
unsigned long index)
 {
unsigned long i;
 
-   pr_debug("radix node: %p offset %d indices %lu-%lu parent %p tags %lx 
%lx %lx shift %d count

[PATCH v8 45/63] shmem: Convert replace to XArray

2018-03-06 Thread Matthew Wilcox

From: Matthew Wilcox 

shmem_radix_tree_replace() is renamed to shmem_xa_replace() and
converted to use the XArray API.

Signed-off-by: Matthew Wilcox 
---
 mm/shmem.c | 22 --
 1 file changed, 8 insertions(+), 14 deletions(-)

diff --git a/mm/shmem.c b/mm/shmem.c
index ac53cae5d3a7..5813808965cd 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -321,24 +321,20 @@ void shmem_uncharge(struct inode *inode, long pages)
 }
 
 /*
- * Replace item expected in radix tree by a new item, while holding tree lock.
+ * Replace item expected in xarray by a new item, while holding xa_lock.
  */
-static int shmem_radix_tree_replace(struct address_space *mapping,
+static int shmem_xa_replace(struct address_space *mapping,
pgoff_t index, void *expected, void *replacement)
 {
-   struct radix_tree_node *node;
-   void **pslot;
+   XA_STATE(xas, >i_pages, index);
void *item;
 
VM_BUG_ON(!expected);
VM_BUG_ON(!replacement);
-   item = __radix_tree_lookup(>i_pages, index, , );
-   if (!item)
-   return -ENOENT;
+   item = xas_load();
if (item != expected)
return -ENOENT;
-   __radix_tree_replace(>i_pages, node, pslot,
-replacement, NULL);
+   xas_store(, replacement);
return 0;
 }
 
@@ -605,8 +601,7 @@ static int shmem_add_to_page_cache(struct page *page,
} else if (!expected) {
error = radix_tree_insert(>i_pages, index, page);
} else {
-   error = shmem_radix_tree_replace(mapping, index, expected,
-page);
+   error = shmem_xa_replace(mapping, index, expected, page);
}
 
if (!error) {
@@ -635,7 +630,7 @@ static void shmem_delete_from_page_cache(struct page *page, 
void *radswap)
VM_BUG_ON_PAGE(PageCompound(page), page);
 
xa_lock_irq(>i_pages);
-   error = shmem_radix_tree_replace(mapping, page->index, page, radswap);
+   error = shmem_xa_replace(mapping, page->index, page, radswap);
page->mapping = NULL;
mapping->nrpages--;
__dec_node_page_state(page, NR_FILE_PAGES);
@@ -1553,8 +1548,7 @@ static int shmem_replace_page(struct page **pagep, gfp_t 
gfp,
 * a nice clean interface for us to replace oldpage by newpage there.
 */
xa_lock_irq(_mapping->i_pages);
-   error = shmem_radix_tree_replace(swap_mapping, swap_index, oldpage,
-  newpage);
+   error = shmem_xa_replace(swap_mapping, swap_index, oldpage, newpage);
if (!error) {
__inc_node_page_state(newpage, NR_FILE_PAGES);
__dec_node_page_state(oldpage, NR_FILE_PAGES);
-- 
2.16.1

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH v8 01/63] mac80211_hwsim: Use DEFINE_IDA

2018-03-06 Thread Matthew Wilcox

From: Matthew Wilcox 

This is preferred to opencoding an IDA_INIT.

Signed-off-by: Matthew Wilcox 
---
 drivers/net/wireless/mac80211_hwsim.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/wireless/mac80211_hwsim.c 
b/drivers/net/wireless/mac80211_hwsim.c
index 7b6c3640a94f..8bffd6ebc03e 100644
--- a/drivers/net/wireless/mac80211_hwsim.c
+++ b/drivers/net/wireless/mac80211_hwsim.c
@@ -253,7 +253,7 @@ static inline void hwsim_clear_chanctx_magic(struct 
ieee80211_chanctx_conf *c)
 
 static unsigned int hwsim_net_id;
 
-static struct ida hwsim_netgroup_ida = IDA_INIT;
+static DEFINE_IDA(hwsim_netgroup_ida);
 
 struct hwsim_net {
int netgroup;
-- 
2.16.1

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: How to change/fix 'Received UUID'

2018-03-06 Thread Hans van Kranenburg

On 05/03/2018 20:47, Marc MERLIN wrote:
> On Mon, Mar 05, 2018 at 10:38:16PM +0300, Andrei Borzenkov wrote:
>>> If I absolutely know that the data is the same on both sides, how do I
>>> either
>>> 1) force back in a 'Received UUID' value on the destination
>>
>> I suppose the most simple is to write small program that does it using
>> BTRFS_IOC_SET_RECEIVED_SUBVOL.
> 
> Understdood.
> Given that I have not worked with the code at all, what is the best 
> tool in btrfs progs, to add this to?
> 
> btrfstune?
> btrfs propery set?
> other?
> 
> David, is this something you'd be willing to add support for?
> (to be honest, it'll be quicker for someone who knows the code to add than
> for me, but if no one has the time, I'l see if I can have a shot at it)

If you want something right now that works, so you can continue doing
your backups, python-btrfs also has the ioctl, since v9, together with
an example of using it:

https://github.com/knorrie/python-btrfs/commit/1ace623f95300ecf581b1182780fd6432a46b24d

P.S. even when coding it in C, the documentation in that commit message
might be useful. :)

fwiw,
Hans
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 0/2] btrfs: Add two new unprivileged ioctls to allow normal users to call "sub list/show" etc.

2018-03-06 Thread David Sterba

On Tue, Mar 06, 2018 at 05:29:34PM +0900, Misono, Tomohiro wrote:
> This adds two new unprivileged ioctls:
> 
> 1st patch: version of tree_search ioctl which only searches/returns subvolume 
> related item.
> 2nd patch: user version of ino_lookup ioctl which also performs permission 
> check.
> 
> They will be used to implement user version of "subvolume list/show" etc in 
> user tools.

The unprivileged separate listing ioctls are highly requested so I'm
looking forward to the feedback round.

The usecase coverage should be same what the current 'btrfs subvol list'
does, except the complex filtering.

The merging target for that shall be 4.18 which should give us enough
time to discuss and review the usecase or the ioctl capabilities and
data structure formats.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 0/8] btrfs-progs: qgroups usability [corrected]

2018-03-06 Thread Jeffrey Mahoney

On 3/6/18 7:10 AM, Qu Wenruo wrote:
> 
> 
> On 2018年03月03日 02:46, je...@suse.com wrote:
>> From: Jeff Mahoney 
>>
>> Hi all -
>>
>> The following series addresses some usability issues with the qgroups UI.
>>
>> 1) Adds -W option so we can wait on a rescan completing without starting one.
>> 2) Adds qgroup information to 'btrfs subvolume show'
>> 3) Adds a -P option to show pathnames for first-level qgroups (or member
>>of nested qgroups with -v)
>> 4) Allows exporting the qgroup table in JSON format for use by external
>>programs/scripts.
> 
> Going to review the patchset in the following days, but I'm pretty
> curious about this feature.
> 
> Is there any plan to implement similar json interface for other tools?
> Or just qgroup only yet?

Dave and I talked about this off-list yesterday.  I had asked if perhaps
we'd want things like "btrfs subvolume list" and "btrfs subvolume show",
among other commands, to also offer JSON output.  We agreed that the
answer is "yes" and that we should use something like a global option
like "--format=json" to do that, e.g. "btrfs --format=json qgroup show."
   I have patches partially worked up to implement that.  The idea is
that commands would define in their flags which output formats they
accept.  If the user requests an unsupported format, they would receive
an error with usage, listing the accepted formats.  Each command is
responsible for outputting in a given format, but that doesn't mean we
couldn't standardize on a common library for most commands.  Dave and I
discussed using libsmartcols for output since it supports tabular and
JSON output.  For qgroups show, where we want to provide different data
structures for level 0 qgroups vs nested qgroups, it's unsuitable.  For
simple tables like 'subvolume show' or 'subvolume list' it could
probably work well.

One question I had was whether errors should be reported in the
requested format.  I'm inclined to say no and that's what my code does:
errors are still reported in plaintext with a nonzero error code.

-Jeff

>> Jeff Mahoney (8):
>>   btrfs-progs: quota: Add -W option to rescan to wait without starting
>> rescan
>>   btrfs-progs: qgroups: fix misleading index check
>>   btrfs-progs: constify pathnames passed as arguments
>>   btrfs-progs: qgroups: add pathname to show output
>>   btrfs-progs: qgroups: introduce and use info and limit structures
>>   btrfs-progs: qgroups: introduce btrfs_qgroup_query
>>   btrfs-progs: subvolume: add quota info to btrfs sub show
>>   btrfs-progs: qgroups: export qgroups usage information as JSON
>>
>>  Documentation/btrfs-qgroup.asciidoc |   8 +
>>  Documentation/btrfs-quota.asciidoc  |  10 +-
>>  Makefile.inc.in |   4 +-
>>  chunk-recover.c |   4 +-
>>  cmds-device.c   |   2 +-
>>  cmds-fi-usage.c |   6 +-
>>  cmds-qgroup.c   |  49 +++-
>>  cmds-quota.c|  21 +-
>>  cmds-rescue.c   |   4 +-
>>  cmds-subvolume.c|  46 
>>  configure.ac|   6 +
>>  kerncompat.h|   1 +
>>  qgroup.c| 526 
>> ++--
>>  qgroup.h|  22 +-
>>  send-utils.c|   4 +-
>>  utils.c |  22 +-
>>  utils.h |   2 +
>>  17 files changed, 621 insertions(+), 116 deletions(-)
>>
> 

signature.asc
Description: OpenPGP digital signature

Re: [PATCH 3/5] btrfs: Embed sector size check into BTRFS_MAX_INLINE_DATA_SIZE()

2018-03-06 Thread David Sterba

On Fri, Mar 02, 2018 at 01:22:52PM +0800, Qu Wenruo wrote:
> We have extra sector size check in cow_file_range_inline(), but doesn't
> implement it in BTRFS_MAX_INLINE_DATA_SIZE().
> 
> The biggest reason is that btrfs_symlink() also uses this macro to check
> name length.

I'm reading what the standard says about symlink and restrictions,
http://pubs.opengroup.org/onlinepubs/009695399/functions/symlink.html

the maximum length of the symlink contents should not exceed
SYMLINK_MAX, 'getconf SYMLINK_MAX /path/to/btrfs' says undefined and I
can't find any restrictions in manual pages. At least the target should
not exceed PATH_MAX in case is's an absolute path, the rest is up to the
path name resolution to decide if a relative symlink plus the base
directory exceeds PATH_MAX.

IOW, the symlink callback could extend the condition to check against
PATH_MAX, that is independent on the page size and sectorsize.

A larger symlink target is not a bug per se, just that it will never
pass through VFS once it would be used for file access.

> In fact such behavior makes max_inline calculation quite confusing, and
> cause unexpected large extent for symbol link.
> 
> Here we embed sector size check into BTRFS_MAX_INLINE_DATA_SIZE() so
> that it will never exceed sector size.

I don't think it's wise to mix sectorsize and nodesize, here the macro
BTRFS_MAX_INLINE_DATA_SIZE is related only to b-tree nodes.

> The downside is, for symbol link, we will reduce max symbol link length
> from 16K- to 4095, but it won't affect current system using that long
> name, but only prevent later creation.

The page size limit will be hit so >4096 symlink targets can be created
only on systems with big pages.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 3/3] fstests: btrfs: Add test case to check v1 space cache corruption

2018-03-06 Thread Filipe Manana

On Tue, Mar 6, 2018 at 12:07 PM, Qu Wenruo  wrote:
>
>
> On 2018年03月06日 18:12, Filipe Manana wrote:
>> On Tue, Mar 6, 2018 at 8:15 AM, Qu Wenruo  wrote:
>>> There are some btrfs corruption report in mail list for a while,
>>
>> There have been for years (well, since ever) many reports of different
>> types of corruptions.
>> Which kind of corruption are you referring to?
>>
>>> although such corruption is pretty rare and almost impossible to
>>> reproduce, with dm-log-writes we found it's highly related to v1 space
>>> cache.
>>>
>>> Unlike journal based filesystems, btrfs completely rely on metadata CoW
>>> to protect itself from power loss.
>>> Which needs extent allocator to avoid allocate extents on existing
>>> extent range.
>>> Btrfs also uses free space cache to speed up such allocation.
>>>
>>> However there is a problem, v1 space cache is not protected by data CoW,
>>> and can be corrupted during power loss.
>>> So btrfs do extra check on free space cache, verifying its own in-file csum,
>>> generation and free space recorded in cache and extent tree.
>>>
>>> The problem is, under heavy concurrency, v1 space cache can be corrupted
>>> even under normal operations without power loss.
>>
>> How?
>>
>>> And we believe corrupted space cache can break btrfs metadata CoW and
>>> leads to the rare corruption in next power loss.
>>
>> Which kind of corruption?
>>
>>>
>>> The most obvious symptom will be difference in free space:
>>>
>>> This will be caught by kernel, but such check is quite weak, and if
>>> the net free space change is 0 in one transaction, the corrupted
>>> cache can be loaded by kernel.
>>
>> How can that happen? The only case I'm aware of, explained below,
>> always leads to a difference (space cache has less free space then
>> what we actually have if we check the extent tree).
>>
>>>
>>> In this case, btrfs check would report things like :
>>> --
>>> block group 298844160 has wrong amount of free space
>>> failed to load free space cache for block group 298844160
>>> --
>>
>> This is normal, but not very common, due to tiny races that exists
>> between committing a transaction (writing the free space caches) and
>> running dellaloc for example (since reserving an extent while running
>> dealloc doesn't joing/start a transaction).
>
> Well, at least I didn't find any place to release space unprotected by
> trans handler, so free space of cache can only be less or equal to real
> free space.

Well, that what I said before. For that particular race I mentioned,
which is the only one I can remember of always existing, the only
inconsistency that can happen is that the cache has less free extents
than what you can find by scanning the extent tree. If the
inconsistency was not detected at cache loading time, we would only
leak extents, but never double allocate them.

>
> So in that case, corrupted cache will never pass the free space check so
> it will never be loaded.
>
> Another dead end unfortunately.
>
> Thanks,
> Qu
>>
>>>
>>> But considering the test case are using notreelog, btrfs won't do
>>> sub-transaction commit which doesn't increase generation, each
>>> transaction should be consistent, and nothing should be reported at all.
>>>
>>> Further more, we can even found corrupted file extents like:
>>> --
>>> root 5 inode 261 errors 100, file extent discount
>>> Found file extent holes:
>>> start: 962560, len: 32768
>>> ERROR: errors found in fs roots
>>
>> Why do you think that's a corruption? Does it cause data loss or any
>> user visible issue?
>>
>> Having file extent holes not inserted happens when mixing buffered and
>> direct IO writes to a file (and fsstress does that), for example:
>>
>> create file
>> buffered write at offset 0, length 64K
>> direct IO write at offset at offset 64K, length 4K
>> transaction commit
>> power loss
>> after this we got a missing 64K hole extent at offset 0 (at
>> btrfs_file_write_iter we only add hole extents if the start offset is
>> greater then the current i_size)
>>
>> But this does not cause any problem for the user or the fs itself, and
>> it's supposed to be like that in the NO_HOLES mode which one day
>> (probably) will be the default mode.
>>
>>> --
>>>
>>> Signed-off-by: Qu Wenruo 
>>> ---
>>>  common/dmlogwrites  |  72 +++
>>>  tests/btrfs/159 | 141 
>>> 
>>>  tests/btrfs/159.out |   2 +
>>>  tests/btrfs/group   |   1 +
>>>  4 files changed, 216 insertions(+)
>>>  create mode 100755 tests/btrfs/159
>>>  create mode 100644 tests/btrfs/159.out
>>>
>>> diff --git a/common/dmlogwrites b/common/dmlogwrites
>>> index 467b872e..54e7e242 100644
>>> --- a/common/dmlogwrites
>>> +++ b/common/dmlogwrites
>>> @@ -126,3 +126,75 @@ _log_writes_cleanup()
>>> $UDEV_SETTLE_PROG >/dev/null 2>&1
>>> _log_writes_remove
>>>  }
>>> +
>>> +# Convert log writes mark to entry number

Re: [PATCH 3/3] fstests: btrfs: Add test case to check v1 space cache corruption

2018-03-06 Thread Filipe Manana

On Tue, Mar 6, 2018 at 10:53 AM, Qu Wenruo  wrote:
>
>
> On 2018年03月06日 18:12, Filipe Manana wrote:
>> On Tue, Mar 6, 2018 at 8:15 AM, Qu Wenruo  wrote:
>>> There are some btrfs corruption report in mail list for a while,
>>
>> There have been for years (well, since ever) many reports of different
>> types of corruptions.
>> Which kind of corruption are you referring to?
>
> Transid error.

You mean parent transid mismatches in tree blocks?
Please always be explicit about problems are you mentioning. There can
be many "transid" problems.


>
>>
>>> although such corruption is pretty rare and almost impossible to
>>> reproduce, with dm-log-writes we found it's highly related to v1 space
>>> cache.
>>>
>>> Unlike journal based filesystems, btrfs completely rely on metadata CoW
>>> to protect itself from power loss.
>>> Which needs extent allocator to avoid allocate extents on existing
>>> extent range.
>>> Btrfs also uses free space cache to speed up such allocation.
>>>
>>> However there is a problem, v1 space cache is not protected by data CoW,
>>> and can be corrupted during power loss.
>>> So btrfs do extra check on free space cache, verifying its own in-file csum,
>>> generation and free space recorded in cache and extent tree.
>>>
>>> The problem is, under heavy concurrency, v1 space cache can be corrupted
>>> even under normal operations without power loss.
>>
>> How?
>
> At FUA writes, we can get v1 space cache who can pass checksum and
> generation check, but has difference in free space.
>
>>
>>> And we believe corrupted space cache can break btrfs metadata CoW and
>>> leads to the rare corruption in next power loss.
>>
>> Which kind of corruption?
>
> Transid related error.
>
>>
>>>
>>> The most obvious symptom will be difference in free space:
>>>
>>> This will be caught by kernel, but such check is quite weak, and if
>>> the net free space change is 0 in one transaction, the corrupted
>>> cache can be loaded by kernel.
>>
>> How can that happen? The only case I'm aware of, explained below,
>> always leads to a difference (space cache has less free space then
>> what we actually have if we check the extent tree).
>>
>>>
>>> In this case, btrfs check would report things like :
>>> --
>>> block group 298844160 has wrong amount of free space
>>> failed to load free space cache for block group 298844160
>>> --
>>
>> This is normal, but not very common, due to tiny races that exists
>> between committing a transaction (writing the free space caches) and
>> running dellaloc for example (since reserving an extent while running
>> dealloc doesn't joing/start a transaction).
>
> This race explains a lot.
>
> But could that cause corrupted cache to be loaded after power loss, and
> break metadata CoW?

No, unless there's a bug in the procedure of detecting inconsistent caches.

>
> At least for the time point when FUA happens, the free space cache can
> pass both csum and generation, we only have free space difference to
> prevent it to be loaded.
>
>>
>>>
>>> But considering the test case are using notreelog, btrfs won't do
>>> sub-transaction commit which doesn't increase generation, each
>>> transaction should be consistent, and nothing should be reported at all.
>>>
>>> Further more, we can even found corrupted file extents like:
>>> --
>>> root 5 inode 261 errors 100, file extent discount
>>> Found file extent holes:
>>> start: 962560, len: 32768
>>> ERROR: errors found in fs roots
>>
>> Why do you think that's a corruption? Does it cause data loss or any
>> user visible issue?
>
> It breaks the rule that we shouldn't have the hole in file extents.

Right. My question/remark was that, besides the warning emitted by
fsck, this does not cause any harm to users/applications nor the
filesytem. That is, no data loss, metadata corruption or anything that
prevents the user from reading all previously written data nor
anything preventing future IO to any range of the file.
So this is just a minor annoyance and far from a serious bug.

>
> IIRC Nikolay is trying to use inode_lock_shared() to solve this race.

I don't see what's the relation, even because this is not caused by
race conditions.

>
>>
>> Having file extent holes not inserted happens when mixing buffered and
>> direct IO writes to a file (and fsstress does that), for example:
>>
>> create file
>> buffered write at offset 0, length 64K
>> direct IO write at offset at offset 64K, length 4K
>> transaction commit
>> power loss
>> after this we got a missing 64K hole extent at offset 0 (at
>> btrfs_file_write_iter we only add hole extents if the start offset is
>> greater then the current i_size)
>>
>> But this does not cause any problem for the user or the fs itself, and
>> it's supposed to be like that in the NO_HOLES mode which one day
>> (probably) will be the default mode.
>
> At least before that happens, we should follow the current schema of
> file extents.
>
> If we just ignore

Re: [PATCH 0/8] btrfs-progs: qgroups usability [corrected]

2018-03-06 Thread Qu Wenruo



On 2018年03月03日 02:46, je...@suse.com wrote:
> From: Jeff Mahoney 
> 
> Hi all -
> 
> The following series addresses some usability issues with the qgroups UI.
> 
> 1) Adds -W option so we can wait on a rescan completing without starting one.
> 2) Adds qgroup information to 'btrfs subvolume show'
> 3) Adds a -P option to show pathnames for first-level qgroups (or member
>of nested qgroups with -v)
> 4) Allows exporting the qgroup table in JSON format for use by external
>programs/scripts.

Going to review the patchset in the following days, but I'm pretty
curious about this feature.

Is there any plan to implement similar json interface for other tools?
Or just qgroup only yet?

Thanks,
Qu

> 
> -Jeff
> 
> Jeff Mahoney (8):
>   btrfs-progs: quota: Add -W option to rescan to wait without starting
> rescan
>   btrfs-progs: qgroups: fix misleading index check
>   btrfs-progs: constify pathnames passed as arguments
>   btrfs-progs: qgroups: add pathname to show output
>   btrfs-progs: qgroups: introduce and use info and limit structures
>   btrfs-progs: qgroups: introduce btrfs_qgroup_query
>   btrfs-progs: subvolume: add quota info to btrfs sub show
>   btrfs-progs: qgroups: export qgroups usage information as JSON
> 
>  Documentation/btrfs-qgroup.asciidoc |   8 +
>  Documentation/btrfs-quota.asciidoc  |  10 +-
>  Makefile.inc.in |   4 +-
>  chunk-recover.c |   4 +-
>  cmds-device.c   |   2 +-
>  cmds-fi-usage.c |   6 +-
>  cmds-qgroup.c   |  49 +++-
>  cmds-quota.c|  21 +-
>  cmds-rescue.c   |   4 +-
>  cmds-subvolume.c|  46 
>  configure.ac|   6 +
>  kerncompat.h|   1 +
>  qgroup.c| 526 
> ++--
>  qgroup.h|  22 +-
>  send-utils.c|   4 +-
>  utils.c |  22 +-
>  utils.h |   2 +
>  17 files changed, 621 insertions(+), 116 deletions(-)
> 



signature.asc
Description: OpenPGP digital signature

Re: [PATCH 2/5] btrfs: Always limit inline extent size by uncompressed size

2018-03-06 Thread Qu Wenruo



On 2018年03月06日 19:58, David Sterba wrote:
> On Fri, Mar 02, 2018 at 07:40:14PM +0800, Qu Wenruo wrote:
>> On 2018年03月02日 19:00, Filipe Manana wrote:
>>> On Fri, Mar 2, 2018 at 10:54 AM, Qu Wenruo  wrote:
 On 2018年03月02日 18:46, Filipe Manana wrote:
> On Fri, Mar 2, 2018 at 5:22 AM, Qu Wenruo  wrote:
>> Normally when specifying max_inline, we should normally limit it by
>> uncompressed extent size, as it's the only thing user can control.
>
> Why does it matter that users can control it? Will they write less (or
> more) data to files because stuff won't get inlined?
> Why do they care about stuff getting inlined or not? That's an
> implementation detail of btrfs to speed up access to file data and
> save some space.

 Then why we still have max_inline mount option?
>>>
>>> My comment was about deciding based on which size to make the decision
>>> (compressed vs uncompressed).
>>
>> The same thing, we have given user options to trigger the behavior, then
>> we should give them *predictable* option to modify the behavior.
>>
>> Not something confusing like current max_inline.
>>
>> Either we give user max_inline and max_inline_compressed, or both follow
>> max_inline.
> 
> I agree with Filipe and don't see a reason to limit the inlining
> capabilities. Max inline exists to set the maximum file size that will
> be considered for inlining, the rest is implementation detail.
> 
> In a similar way the compression can decide if the data will be
> compressed, though the user has specified the mount option.
> 
> Adding another option (max_inline_compressed) does not necessarily
> improve the situation. This requires the user to understand the values
> it can have and how it interacts with other options.
> 
> I've been thinking about this patchset for a few days and still don't
> think there's a problem we need to fix. We can certainly improve the
> reporting, so the mount option value will be adjusted to the exact
> inline limit and then printed to syslog. Additionally we can export the
> value as sysfs file, and udate documentation.

Fine, I will discard this patch and just enhance the prompt.

Thanks,
Qu

> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 



signature.asc
Description: OpenPGP digital signature

Re: [PATCH 3/3] fstests: btrfs: Add test case to check v1 space cache corruption

2018-03-06 Thread Qu Wenruo



On 2018年03月06日 18:12, Filipe Manana wrote:
> On Tue, Mar 6, 2018 at 8:15 AM, Qu Wenruo  wrote:
>> There are some btrfs corruption report in mail list for a while,
> 
> There have been for years (well, since ever) many reports of different
> types of corruptions.
> Which kind of corruption are you referring to?
> 
>> although such corruption is pretty rare and almost impossible to
>> reproduce, with dm-log-writes we found it's highly related to v1 space
>> cache.
>>
>> Unlike journal based filesystems, btrfs completely rely on metadata CoW
>> to protect itself from power loss.
>> Which needs extent allocator to avoid allocate extents on existing
>> extent range.
>> Btrfs also uses free space cache to speed up such allocation.
>>
>> However there is a problem, v1 space cache is not protected by data CoW,
>> and can be corrupted during power loss.
>> So btrfs do extra check on free space cache, verifying its own in-file csum,
>> generation and free space recorded in cache and extent tree.
>>
>> The problem is, under heavy concurrency, v1 space cache can be corrupted
>> even under normal operations without power loss.
> 
> How?
> 
>> And we believe corrupted space cache can break btrfs metadata CoW and
>> leads to the rare corruption in next power loss.
> 
> Which kind of corruption?
> 
>>
>> The most obvious symptom will be difference in free space:
>>
>> This will be caught by kernel, but such check is quite weak, and if
>> the net free space change is 0 in one transaction, the corrupted
>> cache can be loaded by kernel.
> 
> How can that happen? The only case I'm aware of, explained below,
> always leads to a difference (space cache has less free space then
> what we actually have if we check the extent tree).
> 
>>
>> In this case, btrfs check would report things like :
>> --
>> block group 298844160 has wrong amount of free space
>> failed to load free space cache for block group 298844160
>> --
> 
> This is normal, but not very common, due to tiny races that exists
> between committing a transaction (writing the free space caches) and
> running dellaloc for example (since reserving an extent while running
> dealloc doesn't joing/start a transaction).

Well, at least I didn't find any place to release space unprotected by
trans handler, so free space of cache can only be less or equal to real
free space.

So in that case, corrupted cache will never pass the free space check so
it will never be loaded.

Another dead end unfortunately.

Thanks,
Qu
> 
>>
>> But considering the test case are using notreelog, btrfs won't do
>> sub-transaction commit which doesn't increase generation, each
>> transaction should be consistent, and nothing should be reported at all.
>>
>> Further more, we can even found corrupted file extents like:
>> --
>> root 5 inode 261 errors 100, file extent discount
>> Found file extent holes:
>> start: 962560, len: 32768
>> ERROR: errors found in fs roots
> 
> Why do you think that's a corruption? Does it cause data loss or any
> user visible issue?
> 
> Having file extent holes not inserted happens when mixing buffered and
> direct IO writes to a file (and fsstress does that), for example:
> 
> create file
> buffered write at offset 0, length 64K
> direct IO write at offset at offset 64K, length 4K
> transaction commit
> power loss
> after this we got a missing 64K hole extent at offset 0 (at
> btrfs_file_write_iter we only add hole extents if the start offset is
> greater then the current i_size)
> 
> But this does not cause any problem for the user or the fs itself, and
> it's supposed to be like that in the NO_HOLES mode which one day
> (probably) will be the default mode.
> 
>> --
>>
>> Signed-off-by: Qu Wenruo 
>> ---
>>  common/dmlogwrites  |  72 +++
>>  tests/btrfs/159 | 141 
>> 
>>  tests/btrfs/159.out |   2 +
>>  tests/btrfs/group   |   1 +
>>  4 files changed, 216 insertions(+)
>>  create mode 100755 tests/btrfs/159
>>  create mode 100644 tests/btrfs/159.out
>>
>> diff --git a/common/dmlogwrites b/common/dmlogwrites
>> index 467b872e..54e7e242 100644
>> --- a/common/dmlogwrites
>> +++ b/common/dmlogwrites
>> @@ -126,3 +126,75 @@ _log_writes_cleanup()
>> $UDEV_SETTLE_PROG >/dev/null 2>&1
>> _log_writes_remove
>>  }
>> +
>> +# Convert log writes mark to entry number
>> +# Result entry number is output to stdout, could be empty if not found
>> +_log_writes_mark_to_entry_number()
>> +{
>> +   local _mark=$1
>> +   local ret
>> +
>> +   [ -z "$_mark" ] && _fail \
>> +   "mark must be given for _log_writes_mark_to_entry_number"
>> +
>> +   ret=$($here/src/log-writes/replay-log --find --log $LOGWRITES_DEV \
>> +   --end-mark $_mark 2> /dev/null)
>> +   [ -z "$ret" ] && return
>> +   ret=$(echo "$ret" | cut -f1 -d\@)
>> +   echo "mark $_mark has entry number $ret" >> $seqres.full
>> +

Re: [PATCH 2/5] btrfs: Always limit inline extent size by uncompressed size

2018-03-06 Thread David Sterba

On Fri, Mar 02, 2018 at 07:40:14PM +0800, Qu Wenruo wrote:
> On 2018年03月02日 19:00, Filipe Manana wrote:
> > On Fri, Mar 2, 2018 at 10:54 AM, Qu Wenruo  wrote:
> >> On 2018年03月02日 18:46, Filipe Manana wrote:
> >>> On Fri, Mar 2, 2018 at 5:22 AM, Qu Wenruo  wrote:
>  Normally when specifying max_inline, we should normally limit it by
>  uncompressed extent size, as it's the only thing user can control.
> >>>
> >>> Why does it matter that users can control it? Will they write less (or
> >>> more) data to files because stuff won't get inlined?
> >>> Why do they care about stuff getting inlined or not? That's an
> >>> implementation detail of btrfs to speed up access to file data and
> >>> save some space.
> >>
> >> Then why we still have max_inline mount option?
> > 
> > My comment was about deciding based on which size to make the decision
> > (compressed vs uncompressed).
> 
> The same thing, we have given user options to trigger the behavior, then
> we should give them *predictable* option to modify the behavior.
> 
> Not something confusing like current max_inline.
> 
> Either we give user max_inline and max_inline_compressed, or both follow
> max_inline.

I agree with Filipe and don't see a reason to limit the inlining
capabilities. Max inline exists to set the maximum file size that will
be considered for inlining, the rest is implementation detail.

In a similar way the compression can decide if the data will be
compressed, though the user has specified the mount option.

Adding another option (max_inline_compressed) does not necessarily
improve the situation. This requires the user to understand the values
it can have and how it interacts with other options.

I've been thinking about this patchset for a few days and still don't
think there's a problem we need to fix. We can certainly improve the
reporting, so the mount option value will be adjusted to the exact
inline limit and then printed to syslog. Additionally we can export the
value as sysfs file, and udate documentation.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] Btrfs: dev-replace: make sure target is identical to source when raid56 rebuild fails

2018-03-06 Thread David Sterba

On Fri, Mar 02, 2018 at 04:10:41PM -0700, Liu Bo wrote:
> In the last step of scrub_handle_error_block, we try to combine good
> copies on all possible mirrors, this works fine for raid1 and raid10,
> but not for raid56 as it's doing parity rebuild.
> 
> If parity rebuild doesn't get back with correct data which matches its
> checksum, in case of replace we'd rather write what is stored in the
> source device than the data calculuated from parity.
> 
> Signed-off-by: Liu Bo 

Added to next, thanks.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] Btrfs: raid56: remove redundant async_missing_raid56

2018-03-06 Thread David Sterba

On Fri, Mar 02, 2018 at 04:10:39PM -0700, Liu Bo wrote:
> async_missing_raid56() is identical to async_read_rebuild().
> 
> Signed-off-by: Liu Bo 

Reviewed-by: David Sterba 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] Btrfs: replace: cache rbio when rebuild data on missing device

2018-03-06 Thread David Sterba

On Fri, Mar 02, 2018 at 04:10:38PM -0700, Liu Bo wrote:
> Rebuild on missing device is as same as recover, after it's done, rbio
> has data which is consistent with on-disk data, so it can be cached to
> avoid further reads.

Please add a comment that describes why the READ and REBUILD can be
merged together, it's not obvious from the code.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 3/3] fstests: btrfs: Add test case to check v1 space cache corruption

2018-03-06 Thread Nikolay Borisov

On  6.03.2018 12:53, Qu Wenruo wrote:
> 
> 
[snip]

> It breaks the rule that we shouldn't have the hole in file extents.
> 
> IIRC Nikolay is trying to use inode_lock_shared() to solve this race.
> 

Unfortunately the inode_lock_shared approach is a no go since Filipe
objected to it quite adamantly. After discussion that happened in that
thread I can say I'm almost convinced that the pair of READDIO_LOCK and
setsize *do* provide necessary consistency for the DIO case. However at
the moment there is a memory barrier missing.

But I think the DIO case is unrelated to the issue you are discussing
here, no ?

[snip]
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 3/3] fstests: btrfs: Add test case to check v1 space cache corruption

2018-03-06 Thread Qu Wenruo



On 2018年03月06日 18:12, Filipe Manana wrote:
> On Tue, Mar 6, 2018 at 8:15 AM, Qu Wenruo  wrote:
>> There are some btrfs corruption report in mail list for a while,
> 
> There have been for years (well, since ever) many reports of different
> types of corruptions.
> Which kind of corruption are you referring to?

Transid error.

> 
>> although such corruption is pretty rare and almost impossible to
>> reproduce, with dm-log-writes we found it's highly related to v1 space
>> cache.
>>
>> Unlike journal based filesystems, btrfs completely rely on metadata CoW
>> to protect itself from power loss.
>> Which needs extent allocator to avoid allocate extents on existing
>> extent range.
>> Btrfs also uses free space cache to speed up such allocation.
>>
>> However there is a problem, v1 space cache is not protected by data CoW,
>> and can be corrupted during power loss.
>> So btrfs do extra check on free space cache, verifying its own in-file csum,
>> generation and free space recorded in cache and extent tree.
>>
>> The problem is, under heavy concurrency, v1 space cache can be corrupted
>> even under normal operations without power loss.
> 
> How?

At FUA writes, we can get v1 space cache who can pass checksum and
generation check, but has difference in free space.

> 
>> And we believe corrupted space cache can break btrfs metadata CoW and
>> leads to the rare corruption in next power loss.
> 
> Which kind of corruption?

Transid related error.

> 
>>
>> The most obvious symptom will be difference in free space:
>>
>> This will be caught by kernel, but such check is quite weak, and if
>> the net free space change is 0 in one transaction, the corrupted
>> cache can be loaded by kernel.
> 
> How can that happen? The only case I'm aware of, explained below,
> always leads to a difference (space cache has less free space then
> what we actually have if we check the extent tree).
> 
>>
>> In this case, btrfs check would report things like :
>> --
>> block group 298844160 has wrong amount of free space
>> failed to load free space cache for block group 298844160
>> --
> 
> This is normal, but not very common, due to tiny races that exists
> between committing a transaction (writing the free space caches) and
> running dellaloc for example (since reserving an extent while running
> dealloc doesn't joing/start a transaction).

This race explains a lot.

But could that cause corrupted cache to be loaded after power loss, and
break metadata CoW?

At least for the time point when FUA happens, the free space cache can
pass both csum and generation, we only have free space difference to
prevent it to be loaded.

> 
>>
>> But considering the test case are using notreelog, btrfs won't do
>> sub-transaction commit which doesn't increase generation, each
>> transaction should be consistent, and nothing should be reported at all.
>>
>> Further more, we can even found corrupted file extents like:
>> --
>> root 5 inode 261 errors 100, file extent discount
>> Found file extent holes:
>> start: 962560, len: 32768
>> ERROR: errors found in fs roots
> 
> Why do you think that's a corruption? Does it cause data loss or any
> user visible issue?

It breaks the rule that we shouldn't have the hole in file extents.

IIRC Nikolay is trying to use inode_lock_shared() to solve this race.

> 
> Having file extent holes not inserted happens when mixing buffered and
> direct IO writes to a file (and fsstress does that), for example:
> 
> create file
> buffered write at offset 0, length 64K
> direct IO write at offset at offset 64K, length 4K
> transaction commit
> power loss
> after this we got a missing 64K hole extent at offset 0 (at
> btrfs_file_write_iter we only add hole extents if the start offset is
> greater then the current i_size)
> 
> But this does not cause any problem for the user or the fs itself, and
> it's supposed to be like that in the NO_HOLES mode which one day
> (probably) will be the default mode.

At least before that happens, we should follow the current schema of
file extents.

If we just ignore problems that won't cause data loss and keeps them,
there will never be a good on-disk format schema.

Thanks,
Qu

> 
>> --
>>
>> Signed-off-by: Qu Wenruo 
>> ---
>>  common/dmlogwrites  |  72 +++
>>  tests/btrfs/159 | 141 
>> 
>>  tests/btrfs/159.out |   2 +
>>  tests/btrfs/group   |   1 +
>>  4 files changed, 216 insertions(+)
>>  create mode 100755 tests/btrfs/159
>>  create mode 100644 tests/btrfs/159.out
>>
>> diff --git a/common/dmlogwrites b/common/dmlogwrites
>> index 467b872e..54e7e242 100644
>> --- a/common/dmlogwrites
>> +++ b/common/dmlogwrites
>> @@ -126,3 +126,75 @@ _log_writes_cleanup()
>> $UDEV_SETTLE_PROG >/dev/null 2>&1
>> _log_writes_remove
>>  }
>> +
>> +# Convert log writes mark to entry number
>> +# Result entry number is output to stdout, could be empty if

Re: [PATCH] Btrfs: scrub: batch rebuild for raid56

2018-03-06 Thread David Sterba

On Fri, Mar 02, 2018 at 04:10:37PM -0700, Liu Bo wrote:
> In case of raid56, writes and rebuilds always take BTRFS_STRIPE_LEN(64K)
> as unit, however, scrub_extent() sets blocksize as unit, so rebuild
> process may be triggered on every block on a same stripe.
> 
> A typical example would be that when we're replacing a disappeared disk,
> all reads on the disks get -EIO, every block (size is 4K if blocksize is
> 4K) would go thru these,
> 
> scrub_handle_errored_block
>   scrub_recheck_block # re-read pages one by one
>   scrub_recheck_block # rebuild by calling raid56_parity_recover()
> page by page
> 
> Although with raid56 stripe cache most of reads during rebuild can be
> avoided, the parity recover calculation(xor or raid6 algorithms) needs to
> be done $(BTRFS_STRIPE_LEN / blocksize) times.
> 
> This makes it less stupid by doing raid56 scrub/replace on stripe length.

missing s-o-b

> ---
>  fs/btrfs/scrub.c | 78 
> +++-
>  1 file changed, 60 insertions(+), 18 deletions(-)
> 
> diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
> index 9882513..e3203a1 100644
> --- a/fs/btrfs/scrub.c
> +++ b/fs/btrfs/scrub.c
> @@ -1718,6 +1718,44 @@ static int scrub_submit_raid56_bio_wait(struct 
> btrfs_fs_info *fs_info,
>   return blk_status_to_errno(bio->bi_status);
>  }
>  
> +static void scrub_recheck_block_on_raid56(struct btrfs_fs_info *fs_info,
> +   struct scrub_block *sblock)
> +{
> + struct scrub_page *first_page = sblock->pagev[0];
> + struct bio *bio = btrfs_io_bio_alloc(BIO_MAX_PAGES);

nontrivial initializations (variable to variable) are better put into
the statement section.

> + int page_num;
> +
> + /* All pages in sblock belongs to the same stripe on the same device. */
> + ASSERT(first_page->dev);
> + if (first_page->dev->bdev == NULL)
> + goto out;
> +
> + bio_set_dev(bio, first_page->dev->bdev);
> +
> + for (page_num = 0; page_num < sblock->page_count; page_num++) {
> + struct scrub_page *page = sblock->pagev[page_num];
> +
> + WARN_ON(!page->page);
> + bio_add_page(bio, page->page, PAGE_SIZE, 0);
> + }
> +
> + if (scrub_submit_raid56_bio_wait(fs_info, bio, first_page)) {
> + bio_put(bio);
> + goto out;
> + }
> +
> + bio_put(bio);
> +
> + scrub_recheck_block_checksum(sblock);
> +
> + return;
> +out:
> + for (page_num = 0; page_num < sblock->page_count; page_num++)
> + sblock->pagev[page_num]->io_error = 1;
> +
> + sblock->no_io_error_seen = 0;
> +}
> +
>  /*
>   * this function will check the on disk data for checksum errors, header
>   * errors and read I/O errors. If any I/O errors happen, the exact pages
> @@ -1733,6 +1771,10 @@ static void scrub_recheck_block(struct btrfs_fs_info 
> *fs_info,
>  
>   sblock->no_io_error_seen = 1;
>  
> + /* short cut for raid56 */
> + if (!retry_failed_mirror && scrub_is_page_on_raid56(sblock->pagev[0]))
> + return scrub_recheck_block_on_raid56(fs_info, sblock);
> +
>   for (page_num = 0; page_num < sblock->page_count; page_num++) {
>   struct bio *bio;
>   struct scrub_page *page = sblock->pagev[page_num];
> @@ -1748,19 +1790,12 @@ static void scrub_recheck_block(struct btrfs_fs_info 
> *fs_info,
>   bio_set_dev(bio, page->dev->bdev);
>  
>   bio_add_page(bio, page->page, PAGE_SIZE, 0);
> - if (!retry_failed_mirror && scrub_is_page_on_raid56(page)) {
> - if (scrub_submit_raid56_bio_wait(fs_info, bio, page)) {
> - page->io_error = 1;
> - sblock->no_io_error_seen = 0;
> - }
> - } else {
> - bio->bi_iter.bi_sector = page->physical >> 9;
> - bio_set_op_attrs(bio, REQ_OP_READ, 0);
> + bio->bi_iter.bi_sector = page->physical >> 9;
> + bio_set_op_attrs(bio, REQ_OP_READ, 0);

https://elixir.bootlin.com/linux/latest/source/include/linux/blk_types.h#L270

bio_set_op_attrs should not be used

>  
> - if (btrfsic_submit_bio_wait(bio)) {
> - page->io_error = 1;
> - sblock->no_io_error_seen = 0;
> - }
> + if (btrfsic_submit_bio_wait(bio)) {
> + page->io_error = 1;
> + sblock->no_io_error_seen = 0;
>   }
>  
>   bio_put(bio);
> @@ -2728,7 +2763,8 @@ static int scrub_find_csum(struct scrub_ctx *sctx, u64 
> logical, u8 *csum)
>  }
>  
>  /* scrub extent tries to collect up to 64 kB for each bio */
> -static int scrub_extent(struct scrub_ctx *sctx, u64 logical, u64 len,
> +static int scrub_extent(struct scrub_ctx *sctx, struct map_lookup *map,
> + u64 logical, u64 len,
>

Re: [PATCH] Btrfs: scrub: remove unnecessary variable set

2018-03-06 Thread David Sterba

On Fri, Mar 02, 2018 at 04:10:40PM -0700, Liu Bo wrote:
> Variable "success" is only checked when !sctx->is_dev_replace.

Though it's right, the code becomes less obvious at least to me that
it's not missing something. There are several conditions and branches,
one more explicit variable setting will not kill the performance but
helps to understand the code flow.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: ERROR: unsupported checksum algorithm 35355

2018-03-06 Thread David Sterba

On Tue, Mar 06, 2018 at 04:45:40PM +0800, Qu Wenruo wrote:
> Here is the fixed superblock.
> 
> csum type and incompat flags get fixed.
> 
> I'm not sure if they are the only problems, but I strongly recommend to
> run btrfs check before mount.

I haven't found any other obviously corrupted items. The only
overwritten data are at offset 0xc0 (192) and it's 6 bytes. Not an ascii
pattern nor a looks-like-a-pointer value.

Value on offset 192 looks like a valid block pointer 0x176d2000, but the
csum_type looks quite random and not related to anything I'd suspect.

The overwrite could have happened any time after the last successful
mount, the incompat flags or csum type are not updated on each write
(unlike th tree root pointers or levels, or checksum).

This does not look like a direct memory bitflip, but rather a mysterious
mermory corruption, that could be a bitflip in a pointer that got
redirected to the fs_info::super_copy structure, or something else. A
'check' would tell more.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 3/3] fstests: btrfs: Add test case to check v1 space cache corruption

2018-03-06 Thread Filipe Manana

On Tue, Mar 6, 2018 at 8:15 AM, Qu Wenruo  wrote:
> There are some btrfs corruption report in mail list for a while,

There have been for years (well, since ever) many reports of different
types of corruptions.
Which kind of corruption are you referring to?

> although such corruption is pretty rare and almost impossible to
> reproduce, with dm-log-writes we found it's highly related to v1 space
> cache.
>
> Unlike journal based filesystems, btrfs completely rely on metadata CoW
> to protect itself from power loss.
> Which needs extent allocator to avoid allocate extents on existing
> extent range.
> Btrfs also uses free space cache to speed up such allocation.
>
> However there is a problem, v1 space cache is not protected by data CoW,
> and can be corrupted during power loss.
> So btrfs do extra check on free space cache, verifying its own in-file csum,
> generation and free space recorded in cache and extent tree.
>
> The problem is, under heavy concurrency, v1 space cache can be corrupted
> even under normal operations without power loss.

How?

> And we believe corrupted space cache can break btrfs metadata CoW and
> leads to the rare corruption in next power loss.

Which kind of corruption?

>
> The most obvious symptom will be difference in free space:
>
> This will be caught by kernel, but such check is quite weak, and if
> the net free space change is 0 in one transaction, the corrupted
> cache can be loaded by kernel.

How can that happen? The only case I'm aware of, explained below,
always leads to a difference (space cache has less free space then
what we actually have if we check the extent tree).

>
> In this case, btrfs check would report things like :
> --
> block group 298844160 has wrong amount of free space
> failed to load free space cache for block group 298844160
> --

This is normal, but not very common, due to tiny races that exists
between committing a transaction (writing the free space caches) and
running dellaloc for example (since reserving an extent while running
dealloc doesn't joing/start a transaction).

>
> But considering the test case are using notreelog, btrfs won't do
> sub-transaction commit which doesn't increase generation, each
> transaction should be consistent, and nothing should be reported at all.
>
> Further more, we can even found corrupted file extents like:
> --
> root 5 inode 261 errors 100, file extent discount
> Found file extent holes:
> start: 962560, len: 32768
> ERROR: errors found in fs roots

Why do you think that's a corruption? Does it cause data loss or any
user visible issue?

Having file extent holes not inserted happens when mixing buffered and
direct IO writes to a file (and fsstress does that), for example:

create file
buffered write at offset 0, length 64K
direct IO write at offset at offset 64K, length 4K
transaction commit
power loss
after this we got a missing 64K hole extent at offset 0 (at
btrfs_file_write_iter we only add hole extents if the start offset is
greater then the current i_size)

But this does not cause any problem for the user or the fs itself, and
it's supposed to be like that in the NO_HOLES mode which one day
(probably) will be the default mode.

> --
>
> Signed-off-by: Qu Wenruo 
> ---
>  common/dmlogwrites  |  72 +++
>  tests/btrfs/159 | 141 
> 
>  tests/btrfs/159.out |   2 +
>  tests/btrfs/group   |   1 +
>  4 files changed, 216 insertions(+)
>  create mode 100755 tests/btrfs/159
>  create mode 100644 tests/btrfs/159.out
>
> diff --git a/common/dmlogwrites b/common/dmlogwrites
> index 467b872e..54e7e242 100644
> --- a/common/dmlogwrites
> +++ b/common/dmlogwrites
> @@ -126,3 +126,75 @@ _log_writes_cleanup()
> $UDEV_SETTLE_PROG >/dev/null 2>&1
> _log_writes_remove
>  }
> +
> +# Convert log writes mark to entry number
> +# Result entry number is output to stdout, could be empty if not found
> +_log_writes_mark_to_entry_number()
> +{
> +   local _mark=$1
> +   local ret
> +
> +   [ -z "$_mark" ] && _fail \
> +   "mark must be given for _log_writes_mark_to_entry_number"
> +
> +   ret=$($here/src/log-writes/replay-log --find --log $LOGWRITES_DEV \
> +   --end-mark $_mark 2> /dev/null)
> +   [ -z "$ret" ] && return
> +   ret=$(echo "$ret" | cut -f1 -d\@)
> +   echo "mark $_mark has entry number $ret" >> $seqres.full
> +   echo "$ret"
> +}
> +
> +# Find next fua write entry number
> +# Result entry number is output to stdout, could be empty if not found
> +_log_writes_find_next_fua()
> +{
> +   local _start_entry=$1
> +   local ret
> +
> +   if [ -z "$_start_entry" ]; then
> +   ret=$($here/src/log-writes/replay-log --find --log 
> $LOGWRITES_DEV \
> +   --next-fua 2> /dev/null)
> +   else
> +   ret=$($here/src/log-writes/replay-log --find --log 
>

Re: [PATCH 3/3] fstests: btrfs: Add test case to check v1 space cache corruption

2018-03-06 Thread Qu Wenruo



On 2018年03月06日 17:03, Amir Goldstein wrote:
> On Tue, Mar 6, 2018 at 10:15 AM, Qu Wenruo  wrote:
>> There are some btrfs corruption report in mail list for a while,
>> although such corruption is pretty rare and almost impossible to
>> reproduce, with dm-log-writes we found it's highly related to v1 space
>> cache.
>>
>> Unlike journal based filesystems, btrfs completely rely on metadata CoW
>> to protect itself from power loss.
>> Which needs extent allocator to avoid allocate extents on existing
>> extent range.
>> Btrfs also uses free space cache to speed up such allocation.
>>
>> However there is a problem, v1 space cache is not protected by data CoW,
>> and can be corrupted during power loss.
>> So btrfs do extra check on free space cache, verifying its own in-file csum,
>> generation and free space recorded in cache and extent tree.
>>
>> The problem is, under heavy concurrency, v1 space cache can be corrupted
>> even under normal operations without power loss.
>> And we believe corrupted space cache can break btrfs metadata CoW and
>> leads to the rare corruption in next power loss.
>>
>> The most obvious symptom will be difference in free space:
>>
>> This will be caught by kernel, but such check is quite weak, and if
>> the net free space change is 0 in one transaction, the corrupted
>> cache can be loaded by kernel.
>>
>> In this case, btrfs check would report things like :
>> --
>> block group 298844160 has wrong amount of free space
>> failed to load free space cache for block group 298844160
>> --
>>
>> But considering the test case are using notreelog, btrfs won't do
>> sub-transaction commit which doesn't increase generation, each
>> transaction should be consistent, and nothing should be reported at all.
>>
>> Further more, we can even found corrupted file extents like:
>> --
>> root 5 inode 261 errors 100, file extent discount
>> Found file extent holes:
>> start: 962560, len: 32768
>> ERROR: errors found in fs roots
>> --
>>
> 
> So what is the expectation from this test on upstream btrfs?
> Probable failure? reliable failure?

Reliable failure, as the root reason is not fully exposed yet.

> Are there random seeds to fsstress that can make the test fail reliably?

Since concurrency is involved, I don't think seed would help much.

> Or does failure also depend on IO timing and other uncontrolled parameters?

Currently the concurrency would be the main factor.

>> +#
>> +#---
>> +# Copyright (c) SUSE.  All Rights Reserved.
> 
> 2018>
>> +#
[snip]
>> +
>> +_log_writes_replay_log_entry_range $prev >> $seqres.full 2>&1
>> +while [ ! -z "$cur" ]; do
>> +   _log_writes_replay_log_entry_range $cur $prev
>> +   # Catch the btrfs check output into temp file, as we need to
>> +   # grep the output to find the cache corruption
>> +   $BTRFS_UTIL_PROG check --check-data-csum $SCRATCH_DEV &> $tmp.fsck
> 
> So by making this a btrfs specific test you avoid the need to mount/umount and
> revert to $prev. Right?

Yes. Especially notreelog mount option disables journal-like behavior.

> 
> Please spell out the missing pieces for making a generic variant
> to this test, so if anyone wants to pick it up they have a good starting 
> point.
> Or maybe you still intend to post a generic test as well?

I'm still working on the generic test, but the priority is the btrfs
corruption fixing.

For the missing pieces, we need dm-snapshot to make journal based
filesystems to replay their log without polluting the original device.

Despite that, current code should illustrate the framework.

> 
>> +
>> +   # Cache passed generation,csum and free space check but corrupted
>> +   # will be reported as error
>> +   if [ $? -ne 0 ]; then
>> +   cat $tmp.fsck >> $seqres.full
>> +   _fail "btrfs check found corruption"
>> +   fi
>> +
>> +   # Mount option has ruled out any possible factors affect space cache
>> +   # And we are at the FUA writes, no generation related problem should
>> +   # happen anyway
>> +   if grep -q -e 'failed to load free space cache' $tmp.fsck; then
>> +   cat $tmp.fsck >> $seqres.full
>> +   _fail "btrfs check found invalid space cache"
>> +   fi
>> +
>> +   prev=$cur
>> +   cur=$(_log_writes_find_next_fua $prev)
>> +   [ -z $cur ] && break
>> +
>> +   # Same as above
>> +   cur=$(($cur + 1))
>> +done
>> +
>> +echo "Silence is golden"
>> +
>> +# success, all done
>> +status=0
>> +exit
>> diff --git a/tests/btrfs/159.out b/tests/btrfs/159.out
>> new file mode 100644
>> index ..e569e60c
>> --- /dev/null
>> +++ b/tests/btrfs/159.out
>> @@ -0,0 +1,2 @@
>> +QA output created by 159
>> +Silence is golden
>> diff --git a/tests/btrfs/group b/tests/btrfs/group
>> index 8007e07e..bc83db94 100644
>> --- a/tests/btrfs/group
>> +++ b/tests/btrfs/group
>> @@ -161,3 +161,4 @@

Re: [PATCH 3/3] fstests: btrfs: Add test case to check v1 space cache corruption

2018-03-06 Thread Amir Goldstein

On Tue, Mar 6, 2018 at 10:15 AM, Qu Wenruo  wrote:
> There are some btrfs corruption report in mail list for a while,
> although such corruption is pretty rare and almost impossible to
> reproduce, with dm-log-writes we found it's highly related to v1 space
> cache.
>
> Unlike journal based filesystems, btrfs completely rely on metadata CoW
> to protect itself from power loss.
> Which needs extent allocator to avoid allocate extents on existing
> extent range.
> Btrfs also uses free space cache to speed up such allocation.
>
> However there is a problem, v1 space cache is not protected by data CoW,
> and can be corrupted during power loss.
> So btrfs do extra check on free space cache, verifying its own in-file csum,
> generation and free space recorded in cache and extent tree.
>
> The problem is, under heavy concurrency, v1 space cache can be corrupted
> even under normal operations without power loss.
> And we believe corrupted space cache can break btrfs metadata CoW and
> leads to the rare corruption in next power loss.
>
> The most obvious symptom will be difference in free space:
>
> This will be caught by kernel, but such check is quite weak, and if
> the net free space change is 0 in one transaction, the corrupted
> cache can be loaded by kernel.
>
> In this case, btrfs check would report things like :
> --
> block group 298844160 has wrong amount of free space
> failed to load free space cache for block group 298844160
> --
>
> But considering the test case are using notreelog, btrfs won't do
> sub-transaction commit which doesn't increase generation, each
> transaction should be consistent, and nothing should be reported at all.
>
> Further more, we can even found corrupted file extents like:
> --
> root 5 inode 261 errors 100, file extent discount
> Found file extent holes:
> start: 962560, len: 32768
> ERROR: errors found in fs roots
> --
>

So what is the expectation from this test on upstream btrfs?
Probable failure? reliable failure?
Are there random seeds to fsstress that can make the test fail reliably?
Or does failure also depend on IO timing and other uncontrolled parameters?

> Signed-off-by: Qu Wenruo 
> ---
>  common/dmlogwrites  |  72 +++
>  tests/btrfs/159 | 141 
> 
>  tests/btrfs/159.out |   2 +
>  tests/btrfs/group   |   1 +
>  4 files changed, 216 insertions(+)
>  create mode 100755 tests/btrfs/159
>  create mode 100644 tests/btrfs/159.out
>
> diff --git a/common/dmlogwrites b/common/dmlogwrites
> index 467b872e..54e7e242 100644
> --- a/common/dmlogwrites
> +++ b/common/dmlogwrites
> @@ -126,3 +126,75 @@ _log_writes_cleanup()
> $UDEV_SETTLE_PROG >/dev/null 2>&1
> _log_writes_remove
>  }
> +
> +# Convert log writes mark to entry number
> +# Result entry number is output to stdout, could be empty if not found
> +_log_writes_mark_to_entry_number()
> +{
> +   local _mark=$1
> +   local ret
> +
> +   [ -z "$_mark" ] && _fail \
> +   "mark must be given for _log_writes_mark_to_entry_number"
> +
> +   ret=$($here/src/log-writes/replay-log --find --log $LOGWRITES_DEV \
> +   --end-mark $_mark 2> /dev/null)
> +   [ -z "$ret" ] && return
> +   ret=$(echo "$ret" | cut -f1 -d\@)
> +   echo "mark $_mark has entry number $ret" >> $seqres.full
> +   echo "$ret"
> +}
> +
> +# Find next fua write entry number
> +# Result entry number is output to stdout, could be empty if not found
> +_log_writes_find_next_fua()
> +{
> +   local _start_entry=$1
> +   local ret
> +
> +   if [ -z "$_start_entry" ]; then
> +   ret=$($here/src/log-writes/replay-log --find --log 
> $LOGWRITES_DEV \
> +   --next-fua 2> /dev/null)
> +   else
> +   ret=$($here/src/log-writes/replay-log --find --log 
> $LOGWRITES_DEV \
> +   --next-fua --start-entry $_start_entry 2> /dev/null)
> +   fi
> +   [ -z "$ret" ] && return
> +
> +   ret=$(echo "$ret" | cut -f1 -d\@)
> +   echo "next fua is entry number $ret" >> $seqres.full
> +   echo "$ret"
> +}
> +
> +# Replay log range to specified entry
> +# $1:  End entry. The last entry will *NOT* be replayed
> +# $2:  Start entry. If not specified, start from the first entry.
> +_log_writes_replay_log_entry_range()
> +{
> +   local _end=$1
> +   local _start=$2
> +
> +   [ -z "$_end" ] && _fail \
> +   "end entry must be specified for _log_writes_replay_log_entry_range"
> +
> +   if [[ "$_start" && "$_start" -gt "$_end" ]]; then
> +   _fail \
> +   "wrong parameter order for 
> _log_writes_replay_log_entry_range:start=$_start end=$_end"
> +   fi
> +
> +   # Original replay-log won't replay the last entry. So increase entry
> +   # number here to ensure the end entry to be replayed
> +   if [ -z "$_start" ]; then
> +

1 2 >

1 - 100 of 117 matches

Mail list logo