Re: [PATCH 0/9 PULL REQUEST] Qgroup fixes for 4.11

2017-03-06 Thread Qu Wenruo

Any response?

These patches have been sitting here for at least two kernel releases.
They are all bug fixes, for bugs that have already been reported.

I don't see any reason why they should be delayed for so long.

Thanks,
Qu

At 02/27/2017 03:10 PM, Qu Wenruo wrote:

Pull request can be fetched from my github:
https://github.com/adam900710/linux.git qgroup_fixes_for_4.11

The base is 6288d6eabc7505f42dda34a2c2962f91914be3a4.
Author: Liu Bo 
Date:   Tue Feb 21 12:12:58 2017 -0800

Btrfs: use the correct type when creating cow dio extent

These patches have been on the mailing list for almost 3 months.
The tracepoint patch, for example, was last updated on 2016/12/09.

With this patchset, btrfs now passes all xfstests qgroup tests.

This pull request should be good for 4.11, as these are all bug fixes
that have already been delayed several times.

I don't know whether this patchset will be delayed yet again if someone
wants to clean up something else and causes rebase conflicts that hold
these fixes back. But I must say it is very frustrating to see bug fixes
dropped again and again due to new cleanups and a lack of reviewers.

Setbacks aside, this pull request includes:
1) Fix for the inode_cache mount option
   Although hardly anyone uses the inode_cache mount option, it has
   been corrupting qgroup numbers for a long time.
   It breaks not only the qgroup tests that use golden output, but
   also the new test cases that use btrfsck to verify qgroups.

2) Fix for the btrfs/104 kernel warning
   This is caused by enabling quota while dirty buffers have not yet
   been written to disk.

   Fixed by checking the EXTENT_QGROUP_RESERVED flag rather than just
   decreasing qgroup->reserved.

3) Fix qgroup underflow caused by freeing ranges not reserved by the caller
   Mainly exposed by Chandan on PPC64.

   It's possible for a buffered write to be blocked while reserving
   metadata, and in the error path we then release the reserved space
   for both data and metadata.

   However, the data range may already have been reserved again by
   another process writing the same page.

   In that case the data reserved space can be freed twice, once by the
   error path and once by the normal write routine, causing underflow.

   Fixed by checking whether the data range was reserved by ourselves,
   and only freeing it if it was.

Update since 2016/12/09:
  Rebased to latest for-linux-4.11.

  Add missing reviewed-by and tested-by tags.

  Add more comments to btrfs_qgroup_reserve_data() about error handling.

  Add more comments to qgroup_free_reserved_data() about possible later
  enhancements (comment changes only, no functional change).

Qu Wenruo (9):
  btrfs: qgroup: Add trace point for qgroup reserved space
  btrfs: qgroup: Re-arrange tracepoint timing to co-operate with
reserved space tracepoint
  btrfs: qgroup: Fix qgroup corruption caused by inode_cache mount
option
  btrfs: qgroup: Add quick exit for non-fs extents
  btrfs: qgroup: Cleanup btrfs_qgroup_prepare_account_extents function
  btrfs: qgroup: Return actually freed bytes for qgroup release or free
data
  btrfs: qgroup: Fix qgroup reserved space underflow caused by buffered
write and quota enable
  btrfs: qgroup: Introduce extent changeset for qgroup reserve functions
  btrfs: qgroup: Fix qgroup reserved space underflow by only freeing
reserved ranges

 fs/btrfs/ctree.h |  12 ++-
 fs/btrfs/extent-tree.c   |  31 +++---
 fs/btrfs/extent_io.h |  14 ++-
 fs/btrfs/file.c  |  46 +
 fs/btrfs/inode-map.c |   6 +-
 fs/btrfs/inode.c |  64 +
 fs/btrfs/ioctl.c |  11 ++-
 fs/btrfs/qgroup.c| 221 ---
 fs/btrfs/qgroup.h|  14 +--
 fs/btrfs/relocation.c|  13 ++-
 fs/btrfs/transaction.c   |  21 ++--
 include/trace/events/btrfs.h |  43 +
 12 files changed, 358 insertions(+), 138 deletions(-)




--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/8 v2] Non-blocking AIO

2017-03-06 Thread Avi Kivity

On 03/06/2017 10:25 AM, Jan Kara wrote:

On Sun 05-03-17 16:56:21, Avi Kivity wrote:

The goal of the patch series is to return -EAGAIN/-EWOULDBLOCK if
any of these conditions are met. This way userspace can push most
of the write()s to the kernel to the best of its ability to complete
and if it returns -EAGAIN, can defer it to another thread.


Is it not possible to push the iocb to a workqueue?  This will allow
existing userspace to work with the new functionality, unchanged. Any
userspace implementation would have to do the same thing, so it's not like
we're saving anything by pushing it there.

That is not easy because until IO is fully submitted, you need some parts
of the context of the process which submits the IO (e.g. memory mappings,
but possibly also other credentials). So you would need to somehow transfer
this information to the workqueue.




It's at least possible to pass the mm_struct to the workqueue, and I 
imagine other process attributes.  But I appreciate the difficulty.


It would be quite annoying to have to keep a large number of worker 
threads active, just in case aio is not working.  Modern NVMes have 
fairly deep queues, and at the worst case, you'll need one thread for 
each I/O to keep everything busy.



Re: [PATCH 0/8 v2] Non-blocking AIO

2017-03-06 Thread Jan Kara
On Sun 05-03-17 16:56:21, Avi Kivity wrote:
> >The goal of the patch series is to return -EAGAIN/-EWOULDBLOCK if
> >any of these conditions are met. This way userspace can push most
> >of the write()s to the kernel to the best of its ability to complete
> >and if it returns -EAGAIN, can defer it to another thread.
> >
> 
> Is it not possible to push the iocb to a workqueue?  This will allow
> existing userspace to work with the new functionality, unchanged. Any
> userspace implementation would have to do the same thing, so it's not like
> we're saving anything by pushing it there.

That is not easy because until IO is fully submitted, you need some parts
of the context of the process which submits the IO (e.g. memory mappings,
but possibly also other credentials). So you would need to somehow transfer
this information to the workqueue.

Honza
-- 
Jan Kara 
SUSE Labs, CR


[PATCH v2 0/6] Chunk level degradable check

2017-03-06 Thread Qu Wenruo
Btrfs currently uses num_tolerated_disk_barrier_failures as a global
check for the number of tolerated missing devices.

Although the one-size-fits-all solution is quite safe, it is too strict
when data and metadata have different replication levels.

For example, with single data and RAID1 metadata on 2 disks, any
missing device makes the fs impossible to mount degraded.

But in fact, sometimes all the single chunks may reside on the
remaining device, and in that case we should allow a read-write
degraded mount.

Such case can be easily reproduced using the following script:
 # mkfs.btrfs -f -m raid1 -d single /dev/sdb /dev/sdc
 # wipefs -a /dev/sdc
 # mount /dev/sdb /mnt -o degraded,rw

If one checks /dev/sdb with btrfs-debug-tree, the data chunk turns out
to live only on sdb, so the degraded mount should in fact be allowed.

This patchset introduces a new per-chunk degradable check for btrfs,
allowing the above case to succeed, and it's quite small anyway.

It also enhances the kernel error message for missing devices, so one
can at least tell what made the mount fail, instead of the meaningless
"failed to read system chunk/chunk tree -5".

v2:
  Updated after almost 2 years.
  Added the last patch to enhance the kernel output, so the user can
  tell that missing devices are preventing btrfs from mounting.

Qu Wenruo (6):
  btrfs: Introduce a function to check if all chunks are OK for degraded
rw mount
  btrfs: Do chunk level rw degrade check at mount time
  btrfs: Do chunk level degradation check for remount
  btrfs: Allow barrier_all_devices to do chunk level device check
  btrfs: Cleanup num_tolerated_disk_barrier_failures
  btrfs: Enhance missing device kernel message

 fs/btrfs/ctree.h   |  2 --
 fs/btrfs/disk-io.c | 80 ++--
 fs/btrfs/disk-io.h |  2 --
 fs/btrfs/super.c   |  5 ++-
 fs/btrfs/volumes.c | 97 --
 fs/btrfs/volumes.h |  7 
 6 files changed, 92 insertions(+), 101 deletions(-)

-- 
2.12.0





[PATCH v2 5/6] btrfs: Cleanup num_tolerated_disk_barrier_failures

2017-03-06 Thread Qu Wenruo
Now that we use the per-chunk degradable check, the global
num_tolerated_disk_barrier_failures is of no use.

So clean it up.

Signed-off-by: Qu Wenruo 
---
 fs/btrfs/ctree.h   |  2 --
 fs/btrfs/disk-io.c | 54 --
 fs/btrfs/disk-io.h |  2 --
 fs/btrfs/volumes.c | 17 -
 4 files changed, 75 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index f03c2f285eb1..068c1be83266 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1061,8 +1061,6 @@ struct btrfs_fs_info {
/* next backup root to be overwritten */
int backup_root_index;
 
-   int num_tolerated_disk_barrier_failures;
-
/* device replace state */
struct btrfs_dev_replace dev_replace;
 
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index f596bd130524..f86eaf63e5f5 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -3639,60 +3639,6 @@ int btrfs_get_num_tolerated_disk_barrier_failures(u64 flags)
return min_tolerated;
 }
 
-int btrfs_calc_num_tolerated_disk_barrier_failures(
-   struct btrfs_fs_info *fs_info)
-{
-   struct btrfs_ioctl_space_info space;
-   struct btrfs_space_info *sinfo;
-   u64 types[] = {BTRFS_BLOCK_GROUP_DATA,
-  BTRFS_BLOCK_GROUP_SYSTEM,
-  BTRFS_BLOCK_GROUP_METADATA,
-  BTRFS_BLOCK_GROUP_DATA | BTRFS_BLOCK_GROUP_METADATA};
-   int i;
-   int c;
-   int num_tolerated_disk_barrier_failures =
-   (int)fs_info->fs_devices->num_devices;
-
-   for (i = 0; i < ARRAY_SIZE(types); i++) {
-   struct btrfs_space_info *tmp;
-
-   sinfo = NULL;
-   rcu_read_lock();
-   list_for_each_entry_rcu(tmp, &fs_info->space_info, list) {
-   if (tmp->flags == types[i]) {
-   sinfo = tmp;
-   break;
-   }
-   }
-   rcu_read_unlock();
-
-   if (!sinfo)
-   continue;
-
-   down_read(&sinfo->groups_sem);
-   for (c = 0; c < BTRFS_NR_RAID_TYPES; c++) {
-   u64 flags;
-
-   if (list_empty(&sinfo->block_groups[c]))
-   continue;
-
-   btrfs_get_block_group_info(&sinfo->block_groups[c],
-  &space);
-   if (space.total_bytes == 0 || space.used_bytes == 0)
-   continue;
-   flags = space.flags;
-
-   num_tolerated_disk_barrier_failures = min(
-   num_tolerated_disk_barrier_failures,
-   btrfs_get_num_tolerated_disk_barrier_failures(
-   flags));
-   }
-   up_read(&sinfo->groups_sem);
-   }
-
-   return num_tolerated_disk_barrier_failures;
-}
-
 int write_all_supers(struct btrfs_fs_info *fs_info, int max_mirrors)
 {
struct list_head *head;
diff --git a/fs/btrfs/disk-io.h b/fs/btrfs/disk-io.h
index 2e0ec29bfd69..4522d2f11909 100644
--- a/fs/btrfs/disk-io.h
+++ b/fs/btrfs/disk-io.h
@@ -142,8 +142,6 @@ struct btrfs_root *btrfs_create_tree(struct btrfs_trans_handle *trans,
 int btree_lock_page_hook(struct page *page, void *data,
void (*flush_fn)(void *));
 int btrfs_get_num_tolerated_disk_barrier_failures(u64 flags);
-int btrfs_calc_num_tolerated_disk_barrier_failures(
-   struct btrfs_fs_info *fs_info);
 int __init btrfs_end_io_wq_init(void);
 void btrfs_end_io_wq_exit(void);
 
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 729cbd0d2b60..72c78f096755 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -1973,9 +1973,6 @@ int btrfs_rm_device(struct btrfs_fs_info *fs_info, const char *device_path,
free_fs_devices(cur_devices);
}
 
-   fs_info->num_tolerated_disk_barrier_failures =
-   btrfs_calc_num_tolerated_disk_barrier_failures(fs_info);
-
 out:
mutex_unlock(&uuid_mutex);
return ret;
@@ -2474,8 +2471,6 @@ int btrfs_init_new_device(struct btrfs_fs_info *fs_info, const char *device_path
   "sysfs: failed to create fsid for sprout");
}
 
-   fs_info->num_tolerated_disk_barrier_failures =
-   btrfs_calc_num_tolerated_disk_barrier_failures(fs_info);
ret = btrfs_commit_transaction(trans);
 
if (seeding_dev) {
@@ -3858,13 +3853,6 @@ int btrfs_balance(struct btrfs_balance_control *bctl,
   bctl->meta.target, bctl->data.target);
}
 
-   if (bctl->sys.flags & BTRFS_BALANCE_ARGS_CONVERT) {
-   fs_info->num_tolerated_disk_barrier_failures = min(
-   btrfs_calc_num_tolerated_disk_barrier_failures(fs_info),
-  

[PATCH v2 6/6] btrfs: Enhance missing device kernel message

2017-03-06 Thread Qu Wenruo
For a missing device, btrfs currently just refuses to mount, with an
almost meaningless kernel message like:

 BTRFS info (device vdb6): disk space caching is enabled
 BTRFS info (device vdb6): has skinny extents
 BTRFS error (device vdb6): failed to read the system array: -5
 BTRFS error (device vdb6): open_ctree failed

This patch adds extra missing-device output, so the result becomes:

 BTRFS info (device vdb6): disk space caching is enabled
 BTRFS info (device vdb6): has skinny extents
 BTRFS warning (device vdb6): devid 2 uuid 80470722-cad2-4b90-b7c3-fee294552f1b is missing
 BTRFS error (device vdb6): failed to read the system array: -5
 BTRFS error (device vdb6): open_ctree failed

Signed-off-by: Qu Wenruo 
---
 fs/btrfs/volumes.c | 25 ++---
 fs/btrfs/volumes.h |  2 ++
 2 files changed, 20 insertions(+), 7 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 72c78f096755..b33e96495934 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -6442,6 +6442,7 @@ static int read_one_chunk(struct btrfs_fs_info *fs_info, struct btrfs_key *key,
if (!map->stripes[i].dev &&
!btrfs_test_opt(fs_info, DEGRADED)) {
free_extent_map(em);
+   btrfs_report_missing_device(fs_info, devid, uuid);
return -EIO;
}
if (!map->stripes[i].dev) {
@@ -6452,8 +6453,7 @@ static int read_one_chunk(struct btrfs_fs_info *fs_info, struct btrfs_key *key,
free_extent_map(em);
return -EIO;
}
-   btrfs_warn(fs_info, "devid %llu uuid %pU is missing",
-  devid, uuid);
+   btrfs_report_missing_device(fs_info, devid, uuid);
}
map->stripes[i].dev->in_fs_metadata = 1;
}
@@ -6570,17 +6570,21 @@ static int read_one_dev(struct btrfs_fs_info *fs_info,
 
device = btrfs_find_device(fs_info, devid, dev_uuid, fs_uuid);
if (!device) {
-   if (!btrfs_test_opt(fs_info, DEGRADED))
+   if (!btrfs_test_opt(fs_info, DEGRADED)) {
+   btrfs_report_missing_device(fs_info, devid, dev_uuid);
return -EIO;
+   }
 
device = add_missing_dev(fs_devices, devid, dev_uuid);
if (!device)
return -ENOMEM;
-   btrfs_warn(fs_info, "devid %llu uuid %pU missing",
-   devid, dev_uuid);
+   btrfs_report_missing_device(fs_info, devid, dev_uuid);
} else {
-   if (!device->bdev && !btrfs_test_opt(fs_info, DEGRADED))
-   return -EIO;
+   if (!device->bdev) {
+   btrfs_report_missing_device(fs_info, devid, dev_uuid);
+   if (!btrfs_test_opt(fs_info, DEGRADED))
+   return -EIO;
+   }
 
if(!device->bdev && !device->missing) {
/*
@@ -6591,6 +6595,7 @@ static int read_one_dev(struct btrfs_fs_info *fs_info,
 */
device->fs_devices->missing_devices++;
device->missing = 1;
+   btrfs_report_missing_device(fs_info, devid, dev_uuid);
}
 
/* Move the device to its own fs_devices */
@@ -6748,6 +6753,12 @@ int btrfs_read_sys_array(struct btrfs_fs_info *fs_info)
return -EIO;
 }
 
+void btrfs_report_missing_device(struct btrfs_fs_info *fs_info, u64 devid,
+u8 *uuid)
+{
+   btrfs_warn_rl(fs_info, "devid %llu uuid %pU is missing", devid, uuid);
+}
+
 /*
  * Check if all chunks in the fs is OK for read-write degraded mount
  *
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index 112fccacdabc..4f981b75958d 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -543,4 +543,6 @@ void btrfs_set_fs_info_ptr(struct btrfs_fs_info *fs_info);
 void btrfs_reset_fs_info_ptr(struct btrfs_fs_info *fs_info);
 
 bool btrfs_check_rw_degradable(struct btrfs_fs_info *fs_info);
+void btrfs_report_missing_device(struct btrfs_fs_info *fs_info, u64 devid,
+u8 *uuid);
 #endif
-- 
2.12.0





[PATCH v2 1/6] btrfs: Introduce a function to check if all chunks are OK for degraded rw mount

2017-03-06 Thread Qu Wenruo
Introduce a new function, btrfs_check_rw_degradable(), to check whether
all chunks in the fs are OK for a degraded rw mount.

It provides a new basis for accurate mount/remount and even runtime
degraded-mount checks, replacing the old one-size-fits-all method.

Signed-off-by: Qu Wenruo 
---
 fs/btrfs/volumes.c | 53 +
 fs/btrfs/volumes.h |  1 +
 2 files changed, 54 insertions(+)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 7c8c7bbee197..dd9dd94d7043 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -6765,6 +6765,59 @@ int btrfs_read_sys_array(struct btrfs_fs_info *fs_info)
return -EIO;
 }
 
+/*
+ * Check if all chunks in the fs is OK for read-write degraded mount
+ *
+ * Return true if the fs is OK to be mounted degraded read-write
+ * Return false if the fs is not OK to be mounted degraded
+ */
+bool btrfs_check_rw_degradable(struct btrfs_fs_info *fs_info)
+{
+   struct btrfs_mapping_tree *map_tree = &fs_info->mapping_tree;
+   struct extent_map *em;
+   u64 next_start = 0;
+   bool ret = true;
+
+   read_lock(&map_tree->map_tree.lock);
+   em = lookup_extent_mapping(&map_tree->map_tree, 0, (u64)-1);
+   /* No chunk at all? Return false anyway */
+   if (!em) {
+   ret = false;
+   goto out;
+   }
+   while (em) {
+   struct map_lookup *map;
+   int missing = 0;
+   int max_tolerated;
+   int i;
+
+   map = (struct map_lookup *) em->bdev;
+   max_tolerated =
+   btrfs_get_num_tolerated_disk_barrier_failures(
+   map->type);
+   for (i = 0; i < map->num_stripes; i++) {
+   if (map->stripes[i].dev->missing)
+   missing++;
+   }
+   if (missing > max_tolerated) {
+   ret = false;
+   btrfs_warn(fs_info,
+   "chunk %llu missing %d devices, max tolerance is %d for writeble mount",
+  em->start, missing, max_tolerated);
+   free_extent_map(em);
+   goto out;
+   }
+   next_start = extent_map_end(em);
+   free_extent_map(em);
+
+   em = lookup_extent_mapping(&map_tree->map_tree, next_start,
+  (u64)(-1) - next_start);
+   }
+out:
+   read_unlock(&map_tree->map_tree.lock);
+   return ret;
+}
+
 int btrfs_read_chunk_tree(struct btrfs_fs_info *fs_info)
 {
struct btrfs_root *root = fs_info->chunk_root;
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index 59be81206dd7..db1b5ef479cf 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -538,4 +538,5 @@ struct list_head *btrfs_get_fs_uuids(void);
 void btrfs_set_fs_info_ptr(struct btrfs_fs_info *fs_info);
 void btrfs_reset_fs_info_ptr(struct btrfs_fs_info *fs_info);
 
+bool btrfs_check_rw_degradable(struct btrfs_fs_info *fs_info);
 #endif
-- 
2.12.0





[PATCH v2 2/6] btrfs: Do chunk level rw degrade check at mount time

2017-03-06 Thread Qu Wenruo
Now use btrfs_check_rw_degradable() to do the mount-time degradation
check.

With this patch, we can now mount in the following case:
 # mkfs.btrfs -f -m raid1 -d single /dev/sdb /dev/sdc
 # wipefs -a /dev/sdc
 # mount /dev/sdb /mnt/btrfs -o degraded
 Since the single data chunk is only on sdb, it's OK to mount
 degraded; missing one device is fine for RAID1.

But mount still fails, as expected, in the following case:
 # mkfs.btrfs -f -m raid1 -d single /dev/sdb /dev/sdc
 # wipefs -a /dev/sdb
 # mount /dev/sdc /mnt/btrfs -o degraded
 Since the data chunk was only on sdb, it's not OK to mount it
 degraded.

Reported-by: Zhao Lei 
Reported-by: Anand Jain 
Signed-off-by: Qu Wenruo 
---
 fs/btrfs/disk-io.c | 11 +++
 1 file changed, 3 insertions(+), 8 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 73fdc6bdaea9..c26b8a0b121c 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -3057,15 +3057,10 @@ int open_ctree(struct super_block *sb,
btrfs_err(fs_info, "failed to read block groups: %d", ret);
goto fail_sysfs;
}
-   fs_info->num_tolerated_disk_barrier_failures =
-   btrfs_calc_num_tolerated_disk_barrier_failures(fs_info);
-   if (fs_info->fs_devices->missing_devices >
-fs_info->num_tolerated_disk_barrier_failures &&
-   !(sb->s_flags & MS_RDONLY)) {
+
+   if (!(sb->s_flags & MS_RDONLY) && !btrfs_check_rw_degradable(fs_info)) {
btrfs_warn(fs_info,
-"missing devices (%llu) exceeds the limit (%d), writeable mount is not allowed",
-   fs_info->fs_devices->missing_devices,
-   fs_info->num_tolerated_disk_barrier_failures);
+   "writeable mount is not allowed due to too many missing devices");
goto fail_sysfs;
}
 
-- 
2.12.0





[PATCH v2 3/6] btrfs: Do chunk level degradation check for remount

2017-03-06 Thread Qu Wenruo
Just as for the mount-time check, use btrfs_check_rw_degradable() to
check whether we are OK to be remounted rw.

Signed-off-by: Qu Wenruo 
---
 fs/btrfs/super.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index da687dc79cce..1f5772501c92 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -1784,9 +1784,8 @@ static int btrfs_remount(struct super_block *sb, int *flags, char *data)
goto restore;
}
 
-   if (fs_info->fs_devices->missing_devices >
-fs_info->num_tolerated_disk_barrier_failures &&
-   !(*flags & MS_RDONLY)) {
+   if (!(*flags & MS_RDONLY) &&
+   !btrfs_check_rw_degradable(fs_info)) {
btrfs_warn(fs_info,
"too many missing devices, writeable remount is not allowed");
ret = -EACCES;
-- 
2.12.0





[PATCH v2 4/6] btrfs: Allow barrier_all_devices to do chunk level device check

2017-03-06 Thread Qu Wenruo
The last user of num_tolerated_disk_barrier_failures is
barrier_all_devices(), but it can easily be converted to the new
per-chunk degradable check framework.

btrfs_device now has two extra members, recording the flush send/wait
errors, set at write_dev_flush() time.
With these two new members, btrfs_check_rw_degradable() can check
whether the fs is still OK at the time it is committed to disk.

Signed-off-by: Qu Wenruo 
---
 fs/btrfs/disk-io.c | 15 +++
 fs/btrfs/volumes.c |  4 +++-
 fs/btrfs/volumes.h |  4 
 3 files changed, 14 insertions(+), 9 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index c26b8a0b121c..f596bd130524 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -3569,17 +3569,17 @@ static int barrier_all_devices(struct btrfs_fs_info *info)
 {
struct list_head *head;
struct btrfs_device *dev;
-   int errors_send = 0;
-   int errors_wait = 0;
int ret;
 
/* send down all the barriers */
head = &info->fs_devices->devices;
list_for_each_entry_rcu(dev, head, dev_list) {
+   dev->err_wait = false;
+   dev->err_send = false;
if (dev->missing)
continue;
if (!dev->bdev) {
-   errors_send++;
+   dev->err_send = true;
continue;
}
if (!dev->in_fs_metadata || !dev->writeable)
@@ -3587,7 +3587,7 @@ static int barrier_all_devices(struct btrfs_fs_info *info)
 
ret = write_dev_flush(dev, 0);
if (ret)
-   errors_send++;
+   dev->err_send = true;
}
 
/* wait for all the barriers */
@@ -3595,7 +3595,7 @@ static int barrier_all_devices(struct btrfs_fs_info *info)
if (dev->missing)
continue;
if (!dev->bdev) {
-   errors_wait++;
+   dev->err_wait = true;
continue;
}
if (!dev->in_fs_metadata || !dev->writeable)
@@ -3603,10 +3603,9 @@ static int barrier_all_devices(struct btrfs_fs_info *info)
 
ret = write_dev_flush(dev, 1);
if (ret)
-   errors_wait++;
+   dev->err_wait = true;
}
-   if (errors_send > info->num_tolerated_disk_barrier_failures ||
-   errors_wait > info->num_tolerated_disk_barrier_failures)
+   if (!btrfs_check_rw_degradable(info))
return -EIO;
return 0;
 }
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index dd9dd94d7043..729cbd0d2b60 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -6796,7 +6796,9 @@ bool btrfs_check_rw_degradable(struct btrfs_fs_info *fs_info)
btrfs_get_num_tolerated_disk_barrier_failures(
map->type);
for (i = 0; i < map->num_stripes; i++) {
-   if (map->stripes[i].dev->missing)
+   if (map->stripes[i].dev->missing ||
+   map->stripes[i].dev->err_wait ||
+   map->stripes[i].dev->err_send)
missing++;
}
if (missing > max_tolerated) {
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index db1b5ef479cf..112fccacdabc 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -75,6 +75,10 @@ struct btrfs_device {
int can_discard;
int is_tgtdev_for_dev_replace;
 
+   /* If this devices fails to send/wait dev flush */
+   bool err_send;
+   bool err_wait;
+
 #ifdef __BTRFS_NEED_DEVICE_DATA_ORDERED
seqcount_t data_seqcount;
 #endif
-- 
2.12.0





Re: [4.7.2] btrfs_run_delayed_refs:2963: errno=-17 Object already exists

2017-03-06 Thread Marc Joliet
On Montag, 6. März 2017 00:53:40 CET Marc Joliet wrote:
> On Mittwoch, 1. März 2017 19:14:07 CET you wrote:
> >  In any
> > 
> > case, I started btrfs-check on the device itself.
> 
> *Sigh*, I had to restart it, because I forgot to redirect to a file and
> quite frankly wasn't expecting this flood of output, but here's a summary
> of the output after about 2 days:
> 
[snip old output]

OK, it finished last night.  Here's the summary again:

% wc -l btrfs_check_output_20170303.log 
3028222 btrfs_check_output_20170303.log
% grep -v "backref lost" btrfs_check_output_20170303.log | grep -v "check \
(leaf\|node\) failed" | grep -v "lost its parent" | grep -v "referencer count" 
checking extents
ERROR: block group[3879328546816 1073741824] used 1072840704 but extent items used 1129164800
ERROR: block group[4163870130176 1073741824] used 1072259072 but extent items used 0
ERROR: block group[4223999672320 1073741824] used 1073664000 but extent items used 1074188288
ERROR: block group[4278760505344 1073741824] used 1073377280 but extent items used 1073901568
ERROR: block group[4406535782400 1073741824] used 1073627136 but extent items used 0
ERROR: extent [3830028140544 4096] referencer bytenr mismatch, wanted: 3830028140544, have: 3826183847936
ERROR: errors found in extent allocation tree or chunk allocation
checking free space cache
checking fs roots
Checking filesystem on /dev/sdb2
UUID: f97b3cda-15e8-418b-bb9b-235391ef2a38
found 892572778496 bytes used err is -5
total csum bytes: 860790216
total tree bytes: 36906336256
total fs tree bytes: 34551476224
total extent tree bytes: 1230610432
btree space waste bytes: 7446885892
file data blocks allocated: 16359581663232
 referenced 2358137831424

> That's right, slowly approaching 1.5 million lines of btrfs-check output!
> That's *way* more than I ran it the last time this error happened a few
> weeks ago.

As can be seen above, that ballooned to over 3 million lines.  Since the 
output is 4.2 MB, even after XZ compression, I put it up on my Dropbox, just 
in case it's interesting to anybody:

https://www.dropbox.com/s/h6ftqpygfr4vsks/btrfs_check_output_20170303.log.xz?dl=0

Since this is my backup drive, and the second time within a month that it had 
problems like this, *and* I've got both the btrfs-image dump and btrfs-check 
output, I'm going to go ahead and reformat, so that my three computers are 
finally backed up again.

Oh, and for what it's worth, I did test against a 4.8 kernel, and pretty much 
immediately got the "forced RO" error, just like with 4.9.  I didn't try 
anything older (or newer).

As a last step, I'll probably collect all information I have and post it to 
bugzilla when I have a chance, since others might hit it, too (Kai did before 
me, after all).

Greetings
-- 
Marc Joliet
--
"People who think they know everything really annoy those of us who know we
don't" - Bjarne Stroustrup




Re: [PATCH 7/8] Revert "ext4: avoid deadlocks in the writeback path by using sb_getblk_gfp"

2017-03-06 Thread Michal Hocko
On Tue 17-01-17 08:54:50, Michal Hocko wrote:
> On Mon 16-01-17 22:01:18, Theodore Ts'o wrote:
> > On Fri, Jan 06, 2017 at 03:11:06PM +0100, Michal Hocko wrote:
> > > From: Michal Hocko 
> > > 
> > > This reverts commit c45653c341f5c8a0ce19c8f0ad4678640849cb86 because
> > > sb_getblk_gfp is not really needed as
> > > sb_getblk
> > >   __getblk_gfp
> > > __getblk_slow
> > >   grow_buffers
> > > grow_dev_page
> > > gfp_mask = mapping_gfp_constraint(inode->i_mapping, ~__GFP_FS) | gfp
> > > 
> > > so __GFP_FS is cleared unconditionally and therefore the above commit
> > > didn't have any real effect in fact.
> > > 
> > > This patch should not introduce any functional change. The main point
> > > of this change is to reduce explicit GFP_NOFS usage inside ext4 code to
> > > make the review of the remaining usage easier.
> > > 
> > > Signed-off-by: Michal Hocko 
> > > Reviewed-by: Jan Kara 
> > 
> > If I'm not mistaken, this patch is not dependent on any of the other
> > patches in this series (and the other patches are not dependent on
> > this one).  Hence, I could take this patch via the ext4 tree, correct?
> 
> Yes, that is correct

Hi Ted,
this doesn't seem to be in any of the branches [1]. I plan to resend the
whole scope-nofs series; should I add this to the pile, or are you going
to route it via your tree?

[1] git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4.git
-- 
Michal Hocko
SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


lockdep splat on latest master

2017-03-06 Thread Nikolay Borisov
Hello, 

Running generic/208 on latest linux master I got the following splat: 

[ 3540.719007] ==
[ 3540.719007] [ INFO: possible circular locking dependency detected ]
[ 3540.719007] 4.11.0-rc1-nbor #147 Tainted: GW  
[ 3540.719007] ---
[ 3540.719007] aio-dio-invalid/1451 is trying to acquire lock:
[ 3540.719007]  (&mm->mmap_sem){++}, at: [] 
get_user_pages_unlocked+0x4f/0x1a0
[ 3540.719007]
[ 3540.719007] but task is already holding lock:
[ 3540.719007]  (&ei->dio_sem){.+}, at: [] 
btrfs_direct_IO+0x324/0x390
[ 3540.719007]
[ 3540.719007] which lock already depends on the new lock.
[ 3540.719007]
[ 3540.719007]
[ 3540.719007] the existing dependency chain (in reverse order) is:
[ 3540.719007]
[ 3540.719007] -> #4 (&ei->dio_sem){.+}:
[ 3540.719007]lock_acquire+0xc5/0x220
[ 3540.719007]down_write+0x44/0x80
[ 3540.719007]btrfs_log_changed_extents+0x7c/0x660
[ 3540.719007]btrfs_log_inode+0xb78/0xf50
[ 3540.719007]btrfs_log_inode_parent+0x2a9/0xa70
[ 3540.719007]btrfs_log_dentry_safe+0x74/0xa0
[ 3540.719007]btrfs_sync_file+0x321/0x4d0
[ 3540.719007]vfs_fsync_range+0x46/0xc0
[ 3540.719007]vfs_fsync+0x1c/0x20
[ 3540.719007]do_fsync+0x38/0x60
[ 3540.719007]SyS_fdatasync+0x13/0x20
[ 3540.719007]entry_SYSCALL_64_fastpath+0x23/0xc6
[ 3540.719007]
[ 3540.719007] -> #3 (&ei->log_mutex){+.+...}:
[ 3540.719007]lock_acquire+0xc5/0x220
[ 3540.719007]__mutex_lock+0x7c/0x960
[ 3540.719007]mutex_lock_nested+0x1b/0x20
[ 3540.719007]btrfs_record_unlink_dir+0x3e/0xb0
[ 3540.719007]btrfs_unlink+0x72/0xf0
[ 3540.719007]vfs_unlink+0xbe/0x1b0
[ 3540.719007]do_unlinkat+0x244/0x280
[ 3540.719007]SyS_unlinkat+0x1d/0x30
[ 3540.719007]entry_SYSCALL_64_fastpath+0x23/0xc6
[ 3540.719007]
[ 3540.719007] -> #2 (sb_internal#2){.+}:
[ 3540.719007]lock_acquire+0xc5/0x220
[ 3540.719007]down_write+0x44/0x80
[ 3540.719007]percpu_down_write+0x25/0x120
[ 3540.719007]freeze_super+0xbf/0x1a0
[ 3540.719007]do_vfs_ioctl+0x598/0x770
[ 3540.719007]SyS_ioctl+0x4c/0x90
[ 3540.719007]entry_SYSCALL_64_fastpath+0x23/0xc6
[ 3540.719007]
[ 3540.719007] -> #1 (sb_pagefaults){..}:
[ 3540.719007]lock_acquire+0xc5/0x220
[ 3540.719007]__sb_start_write+0x119/0x1d0
[ 3540.719007]btrfs_page_mkwrite+0x51/0x420
[ 3540.719007]do_page_mkwrite+0x38/0xb0
[ 3540.719007]__handle_mm_fault+0x6b5/0xef0
[ 3540.719007]handle_mm_fault+0x175/0x300
[ 3540.719007]__do_page_fault+0x1e0/0x4d0
[ 3540.719007]trace_do_page_fault+0xaa/0x270
[ 3540.719007]do_async_page_fault+0x19/0x70
[ 3540.719007]async_page_fault+0x28/0x30
[ 3540.719007]
[ 3540.719007] -> #0 (&mm->mmap_sem){++}:
[ 3540.719007]__lock_acquire+0x16f1/0x17c0
[ 3540.719007]lock_acquire+0xc5/0x220
[ 3540.719007]down_read+0x47/0x70
[ 3540.719007]get_user_pages_unlocked+0x4f/0x1a0
[ 3540.719007]get_user_pages_fast+0x81/0x170
[ 3540.719007]iov_iter_get_pages+0xc1/0x300
[ 3540.719007]__blockdev_direct_IO+0x14f8/0x34e0
[ 3540.719007]btrfs_direct_IO+0x1e8/0x390
[ 3540.719007]generic_file_direct_write+0xb5/0x160
[ 3540.719007]btrfs_file_write_iter+0x26d/0x500
[ 3540.719007]aio_write+0xdb/0x190
[ 3540.719007]do_io_submit+0x5aa/0x830
[ 3540.719007]SyS_io_submit+0x10/0x20
[ 3540.719007]entry_SYSCALL_64_fastpath+0x23/0xc6
[ 3540.719007]
[ 3540.719007] other info that might help us debug this:
[ 3540.719007]
[ 3540.719007] Chain exists of:
[ 3540.719007]   &mm->mmap_sem --> &ei->log_mutex --> &ei->dio_sem
[ 3540.719007]
[ 3540.719007]  Possible unsafe locking scenario:
[ 3540.719007]
[ 3540.719007]CPU0CPU1
[ 3540.719007]
[ 3540.719007]   lock(&ei->dio_sem);
[ 3540.719007]lock(&ei->log_mutex);
[ 3540.719007]lock(&ei->dio_sem);
[ 3540.719007]   lock(&mm->mmap_sem);
[ 3540.719007]
[ 3540.719007]  *** DEADLOCK ***
[ 3540.719007]
[ 3540.719007] 2 locks held by aio-dio-invalid/1451:
[ 3540.719007]  #0:  (sb_writers#11){.+}, at: [] 
aio_write+0x168/0x190
[ 3540.719007]  #1:  (&ei->dio_sem){.+}, at: [] 
btrfs_direct_IO+0x324/0x390
[ 3540.719007]
[ 3540.719007] stack backtrace:
[ 3540.719007] CPU: 3 PID: 1451 Comm: aio-dio-invalid Tainted: GW   
4.11.0-rc1-nbor #147
[ 3540.719007] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
Ubuntu-1.8.2-1ubuntu1 04/01/2014
[ 3540.719007] Call Trace:
[ 3540.719007]  dump_stack+0x85/0xc9
[ 3540.719007]  print_circular_bug+0x2ac/0x2ba
[ 3540.719007]  __lock_acquire+0x16f1/0x17c0
[ 3540.719007]  ? __lock_acquire+

Re: raid1 degraded mount still produce single chunks, writeable mount not allowed

2017-03-06 Thread Austin S. Hemmelgarn

On 2017-03-03 15:10, Kai Krakow wrote:

Am Fri, 3 Mar 2017 07:19:06 -0500
schrieb "Austin S. Hemmelgarn" :


On 2017-03-03 00:56, Kai Krakow wrote:

Am Thu, 2 Mar 2017 11:37:53 +0100
schrieb Adam Borowski :


On Wed, Mar 01, 2017 at 05:30:37PM -0700, Chris Murphy wrote:

 [...]


Well, there's Qu's patch at:
https://www.spinics.net/lists/linux-btrfs/msg47283.html
but it doesn't apply cleanly nor is easy to rebase to current
kernels.

 [...]


Well, yeah.  The current check is naive and wrong.  It does have a
purpose, just fails in this, very common, case.


I guess the reasoning behind this is: creating any more chunks on
this drive will make raid1 chunks with only one copy, and adding
another drive later will not re-create the missing copies without
user interaction. Is that true?

If yes, this may leave you with a mixed case of a raid1 drive with
some chunks not mirrored and some mirrored. When the other drive
goes missing later, you are losing data or even the whole
filesystem, although you were left with the (wrong) impression of
having a mirrored drive setup...

Is this how it works?

If yes, a real patch would also need to re-create the missing copies
after adding a new drive.


The problem is that that would use some serious disk bandwidth
without user intervention.  The way from userspace to fix this is to
scrub the FS.  It would essentially be the same from kernel space,
which means that if you had a multi-TB FS and this happened, you'd be
running at below capacity in terms of bandwidth for quite some time.
If this were to be implemented, it would have to be keyed off of the
per-chunk degraded check (so that _only_ the chunks that need it get
touched), and there would need to be a switch to disable it.


Well, I'd expect that a replaced drive would involve reduced bandwidth
for a while. Every traditional RAID does this. The key solution there
is that you can limit bandwidth and/or define priorities (BG rebuild
rate).

Btrfs, OTOH, could be a lot smarter, only rebuilding the chunks that
are affected. The kernel can already do I/O priorities, and some sort
of bandwidth limiting should also be possible. I think I/O throttling
is already implemented in the kernel somewhere (at least as of 4.10)
and also in btrfs. So the basics are there.
I/O prioritization in Linux is crap right now.  Only one scheduler 
properly supports it, and that scheduler is deprecated, not to mention 
that it didn't work reliably to begin with.  There is a bandwidth 
limiting mechanism in place, but that's for userspace stuff, not kernel 
stuff (which is why scrub is such an issue, the actual I/O is done by 
the kernel, not userspace).


In a RAID setup, performance should never have priority over redundancy
by default.

If performance is an important factor, I suggest working with SSD
writeback caches. This is already possible with different kernel
techniques like dm-cache or bcache. Proper hardware controllers also
support this in hardware. It's cheap to have a mirrored SSD
writeback cache of 1TB or so if your setup already contains an array
of multiple terabytes. Such a setup has huge performance benefits in
setups we deploy (though not btrfs related).

Also, adding/replacing a drive is usually not a totally unplanned
event. Except for hot spares, a missing drive will be replaced at the
time you arrive on-site. If performance is a factor, this can be done
at the same time as manually starting the process. So why should it
not be done automatically?
You're already going to be involved because you can't (from a practical 
perspective) automate the physical device replacement, so all that 
making it automatic does is make things more convenient.  In general, if 
you're concerned enough to be using a RAID array, you probably shouldn't 
be trading convenience for data safety, and as of right now, BTRFS isn't 
mature enough that it could be said to be consistently safe to automate 
almost anything.


There are plenty of other reasons for it to not be automatic though, the 
biggest being that it will waste bandwidth (and therefore time) if you 
plan to convert profiles after adding the device.  That said, it would 
be nice to have a switch for the add command to automatically re-balance 
the array.



[PATCH 5/7] xfs: use memalloc_nofs_{save,restore} instead of memalloc_noio*

2017-03-06 Thread Michal Hocko
From: Michal Hocko 

kmem_zalloc_large and _xfs_buf_map_pages use the memalloc_noio_{save,restore}
API to prevent reclaim from recursing into the fs, because vmalloc can
invoke unconditional GFP_KERNEL allocations and these functions might be
called from NOFS contexts. memalloc_noio_save enforces a GFP_NOIO
context, which is even more restrictive than GFP_NOFS, and that seems
unnecessary. Let's use memalloc_nofs_{save,restore} instead, as it should
provide exactly what we need here - an implicit GFP_NOFS context.

Changes since v1
- s@memalloc_noio_restore@memalloc_nofs_restore@ in _xfs_buf_map_pages
  as per Brian Foster

Acked-by: Vlastimil Babka 
Reviewed-by: Brian Foster 
Reviewed-by: Darrick J. Wong 
Signed-off-by: Michal Hocko 
---
 fs/xfs/kmem.c| 12 ++--
 fs/xfs/xfs_buf.c |  8 
 2 files changed, 10 insertions(+), 10 deletions(-)

diff --git a/fs/xfs/kmem.c b/fs/xfs/kmem.c
index e14da724a0b5..6b7b04468aa8 100644
--- a/fs/xfs/kmem.c
+++ b/fs/xfs/kmem.c
@@ -66,7 +66,7 @@ kmem_alloc(size_t size, xfs_km_flags_t flags)
 void *
 kmem_zalloc_large(size_t size, xfs_km_flags_t flags)
 {
-   unsigned noio_flag = 0;
+   unsigned nofs_flag = 0;
void*ptr;
gfp_t   lflags;
 
@@ -78,17 +78,17 @@ kmem_zalloc_large(size_t size, xfs_km_flags_t flags)
 * __vmalloc() will allocate data pages and auxillary structures (e.g.
 * pagetables) with GFP_KERNEL, yet we may be under GFP_NOFS context
 * here. Hence we need to tell memory reclaim that we are in such a
-* context via PF_MEMALLOC_NOIO to prevent memory reclaim re-entering
+* context via PF_MEMALLOC_NOFS to prevent memory reclaim re-entering
 * the filesystem here and potentially deadlocking.
 */
-   if ((current->flags & PF_MEMALLOC_NOFS) || (flags & KM_NOFS))
-   noio_flag = memalloc_noio_save();
+   if (flags & KM_NOFS)
+   nofs_flag = memalloc_nofs_save();
 
lflags = kmem_flags_convert(flags);
ptr = __vmalloc(size, lflags | __GFP_HIGHMEM | __GFP_ZERO, PAGE_KERNEL);
 
-   if ((current->flags & PF_MEMALLOC_NOFS) || (flags & KM_NOFS))
-   memalloc_noio_restore(noio_flag);
+   if (flags & KM_NOFS)
+   memalloc_nofs_restore(nofs_flag);
 
return ptr;
 }
diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
index b6208728ba39..ca09061369cb 100644
--- a/fs/xfs/xfs_buf.c
+++ b/fs/xfs/xfs_buf.c
@@ -443,17 +443,17 @@ _xfs_buf_map_pages(
bp->b_addr = NULL;
} else {
int retried = 0;
-   unsigned noio_flag;
+   unsigned nofs_flag;
 
/*
 * vm_map_ram() will allocate auxillary structures (e.g.
 * pagetables) with GFP_KERNEL, yet we are likely to be under
 * GFP_NOFS context here. Hence we need to tell memory reclaim
-* that we are in such a context via PF_MEMALLOC_NOIO to prevent
+* that we are in such a context via PF_MEMALLOC_NOFS to prevent
 * memory reclaim re-entering the filesystem here and
 * potentially deadlocking.
 */
-   noio_flag = memalloc_noio_save();
+   nofs_flag = memalloc_nofs_save();
do {
bp->b_addr = vm_map_ram(bp->b_pages, bp->b_page_count,
-1, PAGE_KERNEL);
@@ -461,7 +461,7 @@ _xfs_buf_map_pages(
break;
vm_unmap_aliases();
} while (retried++ <= 1);
-   memalloc_noio_restore(noio_flag);
+   memalloc_nofs_restore(nofs_flag);
 
if (!bp->b_addr)
return -ENOMEM;
-- 
2.11.0



[PATCH 4/7] mm: introduce memalloc_nofs_{save,restore} API

2017-03-06 Thread Michal Hocko
From: Michal Hocko 

GFP_NOFS context is currently used for the following five reasons:
- to prevent deadlocks when a lock held by the allocation
  context would be needed during memory reclaim
- to prevent stack overflows during reclaim, because
  the allocation is performed from a deep context already
- to prevent lockups when the allocation context depends on
  other reclaimers to make forward progress indirectly
- just in case, because this would be safe from the fs POV
- to silence lockdep false positives

Unfortunately, overuse of this allocation context brings some problems
for the MM. Memory reclaim is much weaker (especially during heavy FS
metadata workloads), and the OOM killer cannot be invoked because the MM
layer doesn't have enough information about how much memory is freeable
by the FS layer.

In many cases it is far from clear why the weaker context is even used
and so it might be used unnecessarily. We would like to get rid of
those as much as possible. One way to do that is to use the flag in
scopes rather than isolated cases. Such a scope is declared when really
necessary, tracked per task and all the allocation requests from within
the context will simply inherit the GFP_NOFS semantic.

Not only is this easier to understand and maintain, because there are
far fewer problematic contexts than specific allocation requests, it
also helps code paths where the FS layer interacts with other layers
(e.g. crypto, security modules, MM etc.) and there is no easy way to
convey the allocation context between the layers.

Introduce memalloc_nofs_{save,restore} API to control the scope
of the GFP_NOFS allocation context. This basically copies the
memalloc_noio_{save,restore} API we have for the other restricted
allocation context, GFP_NOIO. The PF_MEMALLOC_NOFS flag already exists
and is just an alias for PF_FSTRANS, which was xfs-specific until
recently. There are no PF_FSTRANS users anymore, so let's just drop it.

PF_MEMALLOC_NOFS is now checked in the MM layer and drops __GFP_FS
implicitly same as PF_MEMALLOC_NOIO drops __GFP_IO. memalloc_noio_flags
is renamed to current_gfp_context because it now cares about both
PF_MEMALLOC_NOFS and PF_MEMALLOC_NOIO contexts. Xfs code paths preserve
their semantic. kmem_flags_convert() doesn't need to evaluate the flag
anymore.

This patch shouldn't introduce any functional changes.

Let's hope that filesystems will drop direct GFP_NOFS (resp. ~__GFP_FS)
usage as much as possible and only use properly documented
memalloc_nofs_{save,restore} checkpoints where they are appropriate.

Acked-by: Vlastimil Babka 
Signed-off-by: Michal Hocko 
---
 fs/xfs/kmem.h|  2 +-
 include/linux/gfp.h  |  8 
 include/linux/sched.h|  8 +++-
 include/linux/sched/mm.h | 26 +++---
 kernel/locking/lockdep.c |  6 +++---
 mm/page_alloc.c  | 10 ++
 mm/vmscan.c  |  6 +++---
 7 files changed, 47 insertions(+), 19 deletions(-)

diff --git a/fs/xfs/kmem.h b/fs/xfs/kmem.h
index d973dbfc2bfa..ae08cfd9552a 100644
--- a/fs/xfs/kmem.h
+++ b/fs/xfs/kmem.h
@@ -50,7 +50,7 @@ kmem_flags_convert(xfs_km_flags_t flags)
lflags = GFP_ATOMIC | __GFP_NOWARN;
} else {
lflags = GFP_KERNEL | __GFP_NOWARN;
-   if ((current->flags & PF_MEMALLOC_NOFS) || (flags & KM_NOFS))
+   if (flags & KM_NOFS)
lflags &= ~__GFP_FS;
}
 
diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 978232a3b4ae..2bfcfd33e476 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -210,8 +210,16 @@ struct vm_area_struct;
  *
  * GFP_NOIO will use direct reclaim to discard clean pages or slab pages
  *   that do not require the starting of any physical IO.
+ *   Please try to avoid using this flag directly and instead use
+ *   memalloc_noio_{save,restore} to mark the whole scope which cannot
+ *   perform any IO with a short explanation why. All allocation requests
+ *   will inherit GFP_NOIO implicitly.
  *
  * GFP_NOFS will use direct reclaim but will not use any filesystem interfaces.
+ *   Please try to avoid using this flag directly and instead use
+ *   memalloc_nofs_{save,restore} to mark the whole scope which 
cannot/shouldn't
+ *   recurse into the FS layer with a short explanation why. All allocation
+ *   requests will inherit GFP_NOFS implicitly.
  *
  * GFP_USER is for userspace allocations that also need to be directly
  *   accessibly by the kernel or hardware. It is typically used by hardware
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 4528f7c9789f..9c3ee2281a56 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1211,9 +1211,9 @@ extern struct pid *cad_pid;
 #define PF_USED_ASYNC  0x4000  /* Used async_schedule*(), used 
by module init */
 #define PF_NOFREEZE0x8000  /* This thread should not be 
f

Re: raid1 degraded mount still produce single chunks, writeable mount not allowed

2017-03-06 Thread Austin S. Hemmelgarn

On 2017-03-05 14:13, Peter Grandi wrote:

What makes me think that "unmirrored" 'raid1' profile chunks
are "not a thing" is that it is impossible to remove
explicitly a member device from a 'raid1' profile volume:
first one has to 'convert' to 'single', and then the 'remove'
copies back to the remaining devices the 'single' chunks that
are on the explicitly 'remove'd device. Which to me seems
absurd.



It is, there should be a way to do this as a single operation.
[ ... ] The reason this is currently the case though is a
simple one, 'btrfs device delete' is just a special instance
of balance [ ... ]  does no profile conversion, but having
that as an option would actually be _very_ useful from a data
safety perspective.


That seems to me an even more "confused" opinion: because
removing a device to make it "missing" and removing it
permanently should be very different operations.

Consider the common case of a 3-member volume with a 'raid1'
target profile: if the sysadm thinks that a drive should be
replaced, the goal is to take it out *without* converting every
chunk to 'single', because with 2-out-of-3 devices half of the
chunks will still be fully mirrored.

Also, removing the device to be replaced should really not be
the same thing as balancing the chunks, if there is space, to be
'raid1' across remaining drives, because that's a completely
different operation.
There is a command specifically for replacing devices.  It operates very 
differently from the add+delete or delete+add sequences.  Instead of 
balancing, it's more similar to LVM's pvmove command.  It redirects all 
new writes that would go to the old device to the new one, then copies 
all the data from the old to the new (while properly recreating damaged 
chunks).  It uses way less bandwidth than add+delete, runs faster, and 
is in general much safer because it moves less data around.  If you're 
just replacing devices, you should be using this, not the add and delete 
commands, which are more for reshaping arrays than repairing them.


Additionally, if you _have_ to use add and remove to replace a device, 
if possible, you should add the new device then delete the old one, not 
the other way around, as that avoids most of the issues other than the 
high load on the filesystem from the balance operation.



Going further in my speculation, I suspect that at the core of
the Btrfs multidevice design there is a persistent "confusion"
(to use en euphemism) between volumes having a profile, and
merely chunks have a profile.



There generally is.  The profile is entirely a property of the
chunks (each chunk literally has a bit of metadata that says
what profile it is), not the volume.  There's some metadata in
the volume somewhere that says what profile to use for new
chunks of each type (I think),


That's the "target" profile for the volume.


but that doesn't dictate what chunk profiles there are on the
volume. [ ... ]


But as that's the case then the current Btrfs logic for
determining whether a volume is degraded or not is quite
"confused" indeed.
Entirely agreed.  Currently, it checks the target profile, when it 
should be checking per-chunk.


Because suppose there is again the simple case of a 3-device
volume, where all existing chunks have 'raid1' profile and the
volume's target profile is also 'raid1' and one device has gone
offline: the volume cannot be said to be "degraded", unless a
full examination of all chunks is made. Because it can well
happen that in fact *none* of the chunks was mirrored to that
device, for example, however unlikely. And viceversa. Even with
3 devices some chunks may be temporarily "unmirrored" (even if
for brief times hopefully).

The average case is that half of the chunks will be fully
mirrored across the two remaining devices and half will be
"unmirrored".

Now consider re-adding the third device: at that point the
volume has got back all 3 devices, so it is not "degraded", but
50% of the chunks in the volume will still be "unmirrored", even
if eventually they will be mirrored on the newly added device.

Note: possibilities get even more interesting with a 4-device
volume with 'raid1' profile chunks, and similar case involving
other profiles than 'raid1'.

Therefore the current Btrfs logic for deciding whether a volume
is "degraded" seems simply "confused" to me, because whether
there are missing devices and some chunks are "unmirrored" is
not quite the same thing.

The same applies to the current logic that in a 2-device volume
with a device missing new chunks are created as "single" profile
instead of as "unmirrored" 'raid1' profile: another example of
"confusion" between number of devices and chunk profile.

Note: the best that can be said is that a volume has both a
"target chunk profile" (one per data, metadata, system chunks)
and a target number of member devices, and that a volume with a
number of devices below the target *might* be degraded, and that
whether a volume is in fact degraded is not 

[PATCH 0/7 v5] scope GFP_NOFS api

2017-03-06 Thread Michal Hocko
Hi,
I have posted the previous version here [1]. There are no real changes
in the implementation since then. I've just added "lockdep: teach
lockdep about memalloc_noio_save" from Nikolay, which is a lockdep
bugfix developed independently, but "mm: introduce
memalloc_nofs_{save,restore} API" depends on it, so I added it here.
Then I rebased the series on top of 4.11-rc1, which contains the
sched.h split-up and thus required adding the sched/mm.h include.

There didn't seem to be any real objections, so I think we should go
ahead and finally merge this - ideally in this release cycle, as it
doesn't really introduce any functional changes. Those were separated
out and will be posted later. The risk of regressions should be really
small because we do not remove any real GFP_NOFS users yet.

Diffstat says
 fs/jbd2/journal.c |  8 
 fs/jbd2/transaction.c | 12 
 fs/xfs/kmem.c | 12 ++--
 fs/xfs/kmem.h |  2 +-
 fs/xfs/libxfs/xfs_btree.c |  2 +-
 fs/xfs/xfs_aops.c |  6 +++---
 fs/xfs/xfs_buf.c  |  8 
 fs/xfs/xfs_trans.c| 12 ++--
 include/linux/gfp.h   | 18 +-
 include/linux/jbd2.h  |  2 ++
 include/linux/sched.h |  6 +++---
 include/linux/sched/mm.h  | 26 +++---
 kernel/locking/lockdep.c  | 11 +--
 lib/radix-tree.c  |  2 ++
 mm/page_alloc.c   | 10 ++
 mm/vmscan.c   |  6 +++---
 16 files changed, 106 insertions(+), 37 deletions(-)

Shortlog:
Michal Hocko (6):
  lockdep: allow to disable reclaim lockup detection
  xfs: abstract PF_FSTRANS to PF_MEMALLOC_NOFS
  mm: introduce memalloc_nofs_{save,restore} API
  xfs: use memalloc_nofs_{save,restore} instead of memalloc_noio*
  jbd2: mark the transaction context with the scope GFP_NOFS context
  jbd2: make the whole kjournald2 kthread NOFS safe

Nikolay Borisov (1):
  lockdep: teach lockdep about memalloc_noio_save


[1] http://lkml.kernel.org/r/20170206140718.16222-1-mho...@kernel.org
[2] http://lkml.kernel.org/r/20170117030118.727jqyamjhojz...@thunk.org


[PATCH 1/7] lockdep: teach lockdep about memalloc_noio_save

2017-03-06 Thread Michal Hocko
From: Nikolay Borisov 

Commit 21caf2fc1931 ("mm: teach mm by current context info to not do I/O
during memory allocation") added the memalloc_noio_(save|restore) functions
to enable people to modify the MM behavior by disabling I/O during memory
allocation. This was further extended in commit 934f3072c17c ("mm: clear
__GFP_FS when PF_MEMALLOC_NOIO is set"). The memalloc_noio_* functions
prevent allocation paths from recursing back into the filesystem without
explicitly changing the flags at every allocation site. However, lockdep
hasn't been keeping up with the changes and entirely misses handling the
memalloc_noio adjustments. Instead, it is left to the callers of
__lockdep_trace_alloc to call the function after they have masked off
the respective GFP flags, which can lead to false positives:

[  644.173373] =
[  644.174012] [ INFO: inconsistent lock state ]
[  644.174012] 4.10.0-nbor #134 Not tainted
[  644.174012] -
[  644.174012] inconsistent {IN-RECLAIM_FS-W} -> {RECLAIM_FS-ON-W} usage.
[  644.174012] fsstress/3365 [HC0[0]:SC0[0]:HE1:SE1] takes:
[  644.174012]  (&xfs_nondir_ilock_class){?.}, at: [] 
xfs_ilock+0x141/0x230
[  644.174012] {IN-RECLAIM_FS-W} state was registered at:
[  644.174012]   __lock_acquire+0x62a/0x17c0
[  644.174012]   lock_acquire+0xc5/0x220
[  644.174012]   down_write_nested+0x4f/0x90
[  644.174012]   xfs_ilock+0x141/0x230
[  644.174012]   xfs_reclaim_inode+0x12a/0x320
[  644.174012]   xfs_reclaim_inodes_ag+0x2c8/0x4e0
[  644.174012]   xfs_reclaim_inodes_nr+0x33/0x40
[  644.174012]   xfs_fs_free_cached_objects+0x19/0x20
[  644.174012]   super_cache_scan+0x191/0x1a0
[  644.174012]   shrink_slab+0x26f/0x5f0
[  644.174012]   shrink_node+0xf9/0x2f0
[  644.174012]   kswapd+0x356/0x920
[  644.174012]   kthread+0x10c/0x140
[  644.174012]   ret_from_fork+0x31/0x40
[  644.174012] irq event stamp: 173777
[  644.174012] hardirqs last  enabled at (173777): [] 
__local_bh_enable_ip+0x70/0xc0
[  644.174012] hardirqs last disabled at (173775): [] 
__local_bh_enable_ip+0x37/0xc0
[  644.174012] softirqs last  enabled at (173776): [] 
_xfs_buf_find+0x67a/0xb70
[  644.174012] softirqs last disabled at (173774): [] 
_xfs_buf_find+0x5db/0xb70
[  644.174012]
[  644.174012] other info that might help us debug this:
[  644.174012]  Possible unsafe locking scenario:
[  644.174012]
[  644.174012]CPU0
[  644.174012]
[  644.174012]   lock(&xfs_nondir_ilock_class);
[  644.174012]   
[  644.174012] lock(&xfs_nondir_ilock_class);
[  644.174012]
[  644.174012]  *** DEADLOCK ***
[  644.174012]
[  644.174012] 4 locks held by fsstress/3365:
[  644.174012]  #0:  (sb_writers#10){++}, at: [] 
mnt_want_write+0x24/0x50
[  644.174012]  #1:  (&sb->s_type->i_mutex_key#12){++}, at: 
[] vfs_setxattr+0x6f/0xb0
[  644.174012]  #2:  (sb_internal#2){++}, at: [] 
xfs_trans_alloc+0xfc/0x140
[  644.174012]  #3:  (&xfs_nondir_ilock_class){?.}, at: 
[] xfs_ilock+0x141/0x230
[  644.174012]
[  644.174012] stack backtrace:
[  644.174012] CPU: 0 PID: 3365 Comm: fsstress Not tainted 4.10.0-nbor #134
[  644.174012] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
Ubuntu-1.8.2-1ubuntu1 04/01/2014
[  644.174012] Call Trace:
[  644.174012]  dump_stack+0x85/0xc9
[  644.174012]  print_usage_bug.part.37+0x284/0x293
[  644.174012]  ? print_shortest_lock_dependencies+0x1b0/0x1b0
[  644.174012]  mark_lock+0x27e/0x660
[  644.174012]  mark_held_locks+0x66/0x90
[  644.174012]  lockdep_trace_alloc+0x6f/0xd0
[  644.174012]  kmem_cache_alloc_node_trace+0x3a/0x2c0
[  644.174012]  ? vm_map_ram+0x2a1/0x510
[  644.174012]  vm_map_ram+0x2a1/0x510
[  644.174012]  ? vm_map_ram+0x46/0x510
[  644.174012]  _xfs_buf_map_pages+0x77/0x140
[  644.174012]  xfs_buf_get_map+0x185/0x2a0
[  644.174012]  xfs_attr_rmtval_set+0x233/0x430
[  644.174012]  xfs_attr_leaf_addname+0x2d2/0x500
[  644.174012]  xfs_attr_set+0x214/0x420
[  644.174012]  xfs_xattr_set+0x59/0xb0
[  644.174012]  __vfs_setxattr+0x76/0xa0
[  644.174012]  __vfs_setxattr_noperm+0x5e/0xf0
[  644.174012]  vfs_setxattr+0xae/0xb0
[  644.174012]  ? __might_fault+0x43/0xa0
[  644.174012]  setxattr+0x15e/0x1a0
[  644.174012]  ? __lock_is_held+0x53/0x90
[  644.174012]  ? rcu_read_lock_sched_held+0x93/0xa0
[  644.174012]  ? rcu_sync_lockdep_assert+0x2f/0x60
[  644.174012]  ? __sb_start_write+0x130/0x1d0
[  644.174012]  ? mnt_want_write+0x24/0x50
[  644.174012]  path_setxattr+0x8f/0xc0
[  644.174012]  SyS_lsetxattr+0x11/0x20
[  644.174012]  entry_SYSCALL_64_fastpath+0x23/0xc6

Let's fix this by making lockdep itself mask off the respective GFP
flags.

Fixes: 934f3072c17c ("mm: clear __GFP_FS when PF_MEMALLOC_NOIO is set")
Acked-by: Michal Hocko 
Acked-by: Peter Zijlstra (Intel) 
Signed-off-by: Nikolay Borisov 
Signed-off-by: Michal Hocko 
---
 kernel/locking/lockdep.c | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c
index 12e38

[PATCH 3/7] xfs: abstract PF_FSTRANS to PF_MEMALLOC_NOFS

2017-03-06 Thread Michal Hocko
From: Michal Hocko 

xfs defined PF_FSTRANS to declare a scope GFP_NOFS semantic quite
some time ago. We would like to make this concept more generic and use
it for other filesystems as well. Let's start by giving the flag a
more generic name, PF_MEMALLOC_NOFS, which is in line with the existing
PF_MEMALLOC_NOIO already used for the same purpose for GFP_NOIO
contexts. Replace all PF_FSTRANS usage in the xfs code as a first
step, before we introduce a full API for it, as xfs uses the flag
directly anyway.

This patch doesn't introduce any functional change.

Acked-by: Vlastimil Babka 
Reviewed-by: Darrick J. Wong 
Reviewed-by: Brian Foster 
Signed-off-by: Michal Hocko 
---
 fs/xfs/kmem.c |  4 ++--
 fs/xfs/kmem.h |  2 +-
 fs/xfs/libxfs/xfs_btree.c |  2 +-
 fs/xfs/xfs_aops.c |  6 +++---
 fs/xfs/xfs_trans.c| 12 ++--
 include/linux/sched.h |  2 ++
 6 files changed, 15 insertions(+), 13 deletions(-)

diff --git a/fs/xfs/kmem.c b/fs/xfs/kmem.c
index 2dfdc62f795e..e14da724a0b5 100644
--- a/fs/xfs/kmem.c
+++ b/fs/xfs/kmem.c
@@ -81,13 +81,13 @@ kmem_zalloc_large(size_t size, xfs_km_flags_t flags)
 * context via PF_MEMALLOC_NOIO to prevent memory reclaim re-entering
 * the filesystem here and potentially deadlocking.
 */
-   if ((current->flags & PF_FSTRANS) || (flags & KM_NOFS))
+   if ((current->flags & PF_MEMALLOC_NOFS) || (flags & KM_NOFS))
noio_flag = memalloc_noio_save();
 
lflags = kmem_flags_convert(flags);
ptr = __vmalloc(size, lflags | __GFP_HIGHMEM | __GFP_ZERO, PAGE_KERNEL);
 
-   if ((current->flags & PF_FSTRANS) || (flags & KM_NOFS))
+   if ((current->flags & PF_MEMALLOC_NOFS) || (flags & KM_NOFS))
memalloc_noio_restore(noio_flag);
 
return ptr;
diff --git a/fs/xfs/kmem.h b/fs/xfs/kmem.h
index 689f746224e7..d973dbfc2bfa 100644
--- a/fs/xfs/kmem.h
+++ b/fs/xfs/kmem.h
@@ -50,7 +50,7 @@ kmem_flags_convert(xfs_km_flags_t flags)
lflags = GFP_ATOMIC | __GFP_NOWARN;
} else {
lflags = GFP_KERNEL | __GFP_NOWARN;
-   if ((current->flags & PF_FSTRANS) || (flags & KM_NOFS))
+   if ((current->flags & PF_MEMALLOC_NOFS) || (flags & KM_NOFS))
lflags &= ~__GFP_FS;
}
 
diff --git a/fs/xfs/libxfs/xfs_btree.c b/fs/xfs/libxfs/xfs_btree.c
index c3decedc9455..3059a3ec7ecb 100644
--- a/fs/xfs/libxfs/xfs_btree.c
+++ b/fs/xfs/libxfs/xfs_btree.c
@@ -2886,7 +2886,7 @@ xfs_btree_split_worker(
struct xfs_btree_split_args *args = container_of(work,
struct xfs_btree_split_args, 
work);
unsigned long   pflags;
-   unsigned long   new_pflags = PF_FSTRANS;
+   unsigned long   new_pflags = PF_MEMALLOC_NOFS;
 
/*
 * we are in a transaction context here, but may also be doing work
diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index bf65a9ea8642..330c6019120e 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -189,7 +189,7 @@ xfs_setfilesize_trans_alloc(
 * We hand off the transaction to the completion thread now, so
 * clear the flag here.
 */
-   current_restore_flags_nested(&tp->t_pflags, PF_FSTRANS);
+   current_restore_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
return 0;
 }
 
@@ -252,7 +252,7 @@ xfs_setfilesize_ioend(
 * thus we need to mark ourselves as being in a transaction manually.
 * Similarly for freeze protection.
 */
-   current_set_flags_nested(&tp->t_pflags, PF_FSTRANS);
+   current_set_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
__sb_writers_acquired(VFS_I(ip)->i_sb, SB_FREEZE_FS);
 
/* we abort the update if there was an IO error */
@@ -1021,7 +1021,7 @@ xfs_do_writepage(
 * Given that we do not allow direct reclaim to call us, we should
 * never be called while in a filesystem transaction.
 */
-   if (WARN_ON_ONCE(current->flags & PF_FSTRANS))
+   if (WARN_ON_ONCE(current->flags & PF_MEMALLOC_NOFS))
goto redirty;
 
/*
diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c
index 70f42ea86dfb..f5969c8274fc 100644
--- a/fs/xfs/xfs_trans.c
+++ b/fs/xfs/xfs_trans.c
@@ -134,7 +134,7 @@ xfs_trans_reserve(
boolrsvd = (tp->t_flags & XFS_TRANS_RESERVE) != 0;
 
/* Mark this thread as being in a transaction */
-   current_set_flags_nested(&tp->t_pflags, PF_FSTRANS);
+   current_set_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
 
/*
 * Attempt to reserve the needed disk blocks by decrementing
@@ -144,7 +144,7 @@ xfs_trans_reserve(
if (blocks > 0) {
error = xfs_mod_fdblocks(tp->t_mountp, -((int64_t)blocks), 
rsvd);
if (error != 0) {
-   current_restore_flags_nested(&tp->t_pflags, PF_FSTRANS);
+   current_restore_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);

[PATCH 6/7] jbd2: mark the transaction context with the scope GFP_NOFS context

2017-03-06 Thread Michal Hocko
From: Michal Hocko 

Now that we have the memalloc_nofs_{save,restore} API we can mark the whole
transaction context as implicitly GFP_NOFS. All allocations will
automatically inherit GFP_NOFS this way. This means that we do not have
to mark any of those requests with GFP_NOFS, and moreover all the
ext4_kv[mz]alloc(GFP_NOFS) calls are also safe now because even the hardcoded
GFP_KERNEL allocations deep inside vmalloc will be NOFS now.

Reviewed-by: Jan Kara 
Signed-off-by: Michal Hocko 
---
 fs/jbd2/transaction.c | 12 
 include/linux/jbd2.h  |  2 ++
 2 files changed, 14 insertions(+)

diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
index 5e659ee08d6a..d8f09f34285f 100644
--- a/fs/jbd2/transaction.c
+++ b/fs/jbd2/transaction.c
@@ -29,6 +29,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 
@@ -388,6 +389,11 @@ static int start_this_handle(journal_t *journal, handle_t *handle,
 
rwsem_acquire_read(&journal->j_trans_commit_map, 0, 0, _THIS_IP_);
jbd2_journal_free_transaction(new_transaction);
+   /*
+* Make sure that no allocations done while the transaction is
+* open are going to recurse back to the fs layer.
+*/
+   handle->saved_alloc_context = memalloc_nofs_save();
return 0;
 }
 
@@ -466,6 +472,7 @@ handle_t *jbd2__journal_start(journal_t *journal, int nblocks, int rsv_blocks,
trace_jbd2_handle_start(journal->j_fs_dev->bd_dev,
handle->h_transaction->t_tid, type,
line_no, nblocks);
+
return handle;
 }
 EXPORT_SYMBOL(jbd2__journal_start);
@@ -1760,6 +1767,11 @@ int jbd2_journal_stop(handle_t *handle)
if (handle->h_rsv_handle)
jbd2_journal_free_reserved(handle->h_rsv_handle);
 free_and_exit:
+   /*
+* The scope of the GFP_NOFS context is over here and so we can
+* restore the original alloc context.
+*/
+   memalloc_nofs_restore(handle->saved_alloc_context);
jbd2_free_handle(handle);
return err;
 }
diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
index dfaa1f4dcb0c..606b6bce3a5b 100644
--- a/include/linux/jbd2.h
+++ b/include/linux/jbd2.h
@@ -491,6 +491,8 @@ struct jbd2_journal_handle
 
unsigned long   h_start_jiffies;
unsigned inth_requested_credits;
+
+   unsigned intsaved_alloc_context;
 };
 
 
-- 
2.11.0

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 7/7] jbd2: make the whole kjournald2 kthread NOFS safe

2017-03-06 Thread Michal Hocko
From: Michal Hocko 

kjournald2 is central to transaction commit processing. As such, any
potential allocation from this kernel thread has to be GFP_NOFS. Make
sure to mark the whole kernel thread GFP_NOFS via memalloc_nofs_save().

Suggested-by: Jan Kara 
Reviewed-by: Jan Kara 
Signed-off-by: Michal Hocko 
---
 fs/jbd2/journal.c | 8 
 1 file changed, 8 insertions(+)

diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c
index a1a359bfcc9c..78433ce1db40 100644
--- a/fs/jbd2/journal.c
+++ b/fs/jbd2/journal.c
@@ -43,6 +43,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #define CREATE_TRACE_POINTS
 #include 
@@ -206,6 +207,13 @@ static int kjournald2(void *arg)
wake_up(&journal->j_wait_done_commit);
 
/*
+* Make sure that no allocations from this kernel thread will ever recurse
+* to the fs layer because we are responsible for the transaction commit
+* and any fs involvement might get stuck waiting for the transaction commit.
+*/
+   memalloc_nofs_save();
+
+   /*
 * And now, wait forever for commit wakeup events.
 */
write_lock(&journal->j_state_lock);
-- 
2.11.0



[PATCH 2/7] lockdep: allow to disable reclaim lockup detection

2017-03-06 Thread Michal Hocko
From: Michal Hocko 

The current implementation of reclaim lockup detection can lead to
false positives, and those do happen and usually lead to tweaking the
code to silence lockdep with GFP_NOFS even though the context can use
__GFP_FS just fine. See
http://lkml.kernel.org/r/20160512080321.GA18496@dastard as an example.

=
[ INFO: inconsistent lock state ]
4.5.0-rc2+ #4 Tainted: G   O
-
inconsistent {RECLAIM_FS-ON-R} -> {IN-RECLAIM_FS-W} usage.
kswapd0/543 [HC0[0]:SC0[0]:HE1:SE1] takes:

(&xfs_nondir_ilock_class){-+}, at: [] 
xfs_ilock+0x177/0x200 [xfs]

{RECLAIM_FS-ON-R} state was registered at:
  [] mark_held_locks+0x79/0xa0
  [] lockdep_trace_alloc+0xb3/0x100
  [] kmem_cache_alloc+0x33/0x230
  [] kmem_zone_alloc+0x81/0x120 [xfs]
  [] xfs_refcountbt_init_cursor+0x3e/0xa0 [xfs]
  [] __xfs_refcount_find_shared+0x75/0x580 [xfs]
  [] xfs_refcount_find_shared+0x84/0xb0 [xfs]
  [] xfs_getbmap+0x608/0x8c0 [xfs]
  [] xfs_vn_fiemap+0xab/0xc0 [xfs]
  [] do_vfs_ioctl+0x498/0x670
  [] SyS_ioctl+0x79/0x90
  [] entry_SYSCALL_64_fastpath+0x12/0x6f

   CPU0
   
  lock(&xfs_nondir_ilock_class);
  
lock(&xfs_nondir_ilock_class);

 *** DEADLOCK ***

3 locks held by kswapd0/543:

stack backtrace:
CPU: 0 PID: 543 Comm: kswapd0 Tainted: G   O4.5.0-rc2+ #4

Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006

 82a34f10 88003aa078d0 813a14f9 88003d8551c0
 88003aa07920 8110ec65  0001
 8801 000b 0008 88003d855aa0
Call Trace:
 [] dump_stack+0x4b/0x72
 [] print_usage_bug+0x215/0x240
 [] mark_lock+0x1f5/0x660
 [] ? print_shortest_lock_dependencies+0x1a0/0x1a0
 [] __lock_acquire+0xa80/0x1e50
 [] ? kmem_cache_alloc+0x15e/0x230
 [] ? kmem_zone_alloc+0x81/0x120 [xfs]
 [] lock_acquire+0xd8/0x1e0
 [] ? xfs_ilock+0x177/0x200 [xfs]
 [] ? xfs_reflink_cancel_cow_range+0x150/0x300 [xfs]
 [] down_write_nested+0x5e/0xc0
 [] ? xfs_ilock+0x177/0x200 [xfs]
 [] xfs_ilock+0x177/0x200 [xfs]
 [] xfs_reflink_cancel_cow_range+0x150/0x300 [xfs]
 [] xfs_fs_evict_inode+0xdc/0x1e0 [xfs]
 [] evict+0xc5/0x190
 [] dispose_list+0x39/0x60
 [] prune_icache_sb+0x4b/0x60
 [] super_cache_scan+0x14f/0x1a0
 [] shrink_slab.part.63.constprop.79+0x1e9/0x4e0
 [] shrink_zone+0x15e/0x170
 [] kswapd+0x4f1/0xa80
 [] ? zone_reclaim+0x230/0x230
 [] kthread+0xf2/0x110
 [] ? kthread_create_on_node+0x220/0x220
 [] ret_from_fork+0x3f/0x70
 [] ? kthread_create_on_node+0x220/0x220

To quote Dave:
"
Ignoring whether reflink should be doing anything or not, that's a
"xfs_refcountbt_init_cursor() gets called both outside and inside
transactions" lockdep false positive case. The problem here is
lockdep has seen this allocation from within a transaction, hence a
GFP_NOFS allocation, and now it's seeing it in a GFP_KERNEL context.
Also note that we have an active reference to this inode.

So, because the reclaim annotations overload the interrupt level
detections and it's seen the inode ilock been taken in reclaim
("interrupt") context, this triggers a reclaim context warning where
it thinks it is unsafe to do this allocation in GFP_KERNEL context
holding the inode ilock...
"

This sounds like a fundamental problem of the reclaim lock detection.
It is really impossible to annotate such a special use case IMHO unless
the reclaim lockup detection is reworked completely. Until then it
is much better to provide an "I know what I am doing" flag and mark
the problematic places. This would prevent abuse of the GFP_NOFS flag,
which has a runtime effect even on configurations that have lockdep
disabled.

Introduce __GFP_NOLOCKDEP flag which tells the lockdep gfp tracking to
skip the current allocation request.

While we are at it, also make sure that the radix tree doesn't
accidentally override tags stored in the upper part of the gfp_mask.

Suggested-by: Peter Zijlstra 
Acked-by: Peter Zijlstra (Intel) 
Acked-by: Vlastimil Babka 
Signed-off-by: Michal Hocko 
---
 include/linux/gfp.h  | 10 +-
 kernel/locking/lockdep.c |  4 
 lib/radix-tree.c |  2 ++
 3 files changed, 15 insertions(+), 1 deletion(-)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index db373b9d3223..978232a3b4ae 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -40,6 +40,11 @@ struct vm_area_struct;
 #define ___GFP_DIRECT_RECLAIM  0x40u
 #define ___GFP_WRITE   0x80u
 #define ___GFP_KSWAPD_RECLAIM  0x100u
+#ifdef CONFIG_LOCKDEP
+#define ___GFP_NOLOCKDEP   0x400u
+#else
+#define ___GFP_NOLOCKDEP   0
+#endif
 /* If the above are modified, __GFP_BITS_SHIFT may need updating */
 
 /*
@@ -179,8 +184,11 @@ struct vm_area_struct;
 #define __GFP_NOTRACK  ((__force gfp_t)___GFP_NOTRACK)
 #define __GFP_NOTRACK_FALSE_POSITIVE (__GFP_NOTRACK)
 
+/* Disable lockdep for GFP context tracking */
+#define __GFP_NOLOCKDEP ((__force gfp_t)___GFP_NOLOCKDEP)

Re: [PATCH 0/8 v2] Non-blocking AIO

2017-03-06 Thread Jens Axboe
On 03/06/2017 01:25 AM, Jan Kara wrote:
> On Sun 05-03-17 16:56:21, Avi Kivity wrote:
>>> The goal of the patch series is to return -EAGAIN/-EWOULDBLOCK if
>>> any of these conditions are met. This way userspace can push most
>>> of the write()s to the kernel to the best of its ability to complete
>>> and if it returns -EAGAIN, can defer it to another thread.
>>>
>>
>> Is it not possible to push the iocb to a workqueue?  This will allow
>> existing userspace to work with the new functionality, unchanged. Any
>> userspace implementation would have to do the same thing, so it's not like
>> we're saving anything by pushing it there.
> 
> That is not easy because until IO is fully submitted, you need some parts
> of the context of the process which submits the IO (e.g. memory mappings,
> but possibly also other credentials). So you would need to somehow transfer
> this information to the workqueue.

Outside of technical challenges, the API also needs to return EAGAIN or
start blocking at some point. We can't expose a direct connection to
queue work like that, and let any user potentially create millions of
pending work items (and IOs). That's why the current API is safe, even
though it does suck that it blocks seemingly randomly for users.

-- 
Jens Axboe



Re: [PATCH 0/8 v2] Non-blocking AIO

2017-03-06 Thread Avi Kivity



On 03/06/2017 05:19 PM, Jens Axboe wrote:

On 03/06/2017 01:25 AM, Jan Kara wrote:

On Sun 05-03-17 16:56:21, Avi Kivity wrote:

The goal of the patch series is to return -EAGAIN/-EWOULDBLOCK if
any of these conditions are met. This way userspace can push most
of the write()s to the kernel to the best of its ability to complete
and if it returns -EAGAIN, can defer it to another thread.


Is it not possible to push the iocb to a workqueue?  This will allow
existing userspace to work with the new functionality, unchanged. Any
userspace implementation would have to do the same thing, so it's not like
we're saving anything by pushing it there.

That is not easy because until IO is fully submitted, you need some parts
of the context of the process which submits the IO (e.g. memory mappings,
but possibly also other credentials). So you would need to somehow transfer
this information to the workqueue.

Outside of technical challenges, the API also needs to return EAGAIN or
start blocking at some point. We can't expose a direct connection to
queue work like that, and let any user potentially create millions of
pending work items (and IOs).


You wouldn't expect more concurrent events than the maxevents parameter 
that was supplied to io_setup syscall; it should have reserved any 
resources needed.



  That's why the current API is safe, even
though it does suck that it block seemingly randomly for users.




Re: [PATCH 0/8 v2] Non-blocking AIO

2017-03-06 Thread Jens Axboe
On 03/06/2017 08:29 AM, Avi Kivity wrote:
> 
> 
> On 03/06/2017 05:19 PM, Jens Axboe wrote:
>> On 03/06/2017 01:25 AM, Jan Kara wrote:
>>> On Sun 05-03-17 16:56:21, Avi Kivity wrote:
> The goal of the patch series is to return -EAGAIN/-EWOULDBLOCK if
> any of these conditions are met. This way userspace can push most
> of the write()s to the kernel to the best of its ability to complete
> and if it returns -EAGAIN, can defer it to another thread.
>
 Is it not possible to push the iocb to a workqueue?  This will allow
 existing userspace to work with the new functionality, unchanged. Any
 userspace implementation would have to do the same thing, so it's not like
 we're saving anything by pushing it there.
>>> That is not easy because until IO is fully submitted, you need some parts
>>> of the context of the process which submits the IO (e.g. memory mappings,
>>> but possibly also other credentials). So you would need to somehow transfer
>>> this information to the workqueue.
>> Outside of technical challenges, the API also needs to return EAGAIN or
>> start blocking at some point. We can't expose a direct connection to
>> queue work like that, and let any user potentially create millions of
>> pending work items (and IOs).
> 
> You wouldn't expect more concurrent events than the maxevents parameter 
> that was supplied to io_setup syscall; it should have reserved any 
> resources needed.

Doesn't matter what limit you apply, my point still stands - at some
point you have to return EAGAIN, or block. Returning EAGAIN without
the caller having flagged support for that change of behavior would
be problematic.

And for this to really work, aio would need some serious help in
how it applies limits. It looks like a hot mess.

-- 
Jens Axboe



Re: [PATCH 0/8 v2] Non-blocking AIO

2017-03-06 Thread Avi Kivity

On 03/06/2017 05:38 PM, Jens Axboe wrote:

On 03/06/2017 08:29 AM, Avi Kivity wrote:


On 03/06/2017 05:19 PM, Jens Axboe wrote:

On 03/06/2017 01:25 AM, Jan Kara wrote:

On Sun 05-03-17 16:56:21, Avi Kivity wrote:

The goal of the patch series is to return -EAGAIN/-EWOULDBLOCK if
any of these conditions are met. This way userspace can push most
of the write()s to the kernel to the best of its ability to complete
and if it returns -EAGAIN, can defer it to another thread.


Is it not possible to push the iocb to a workqueue?  This will allow
existing userspace to work with the new functionality, unchanged. Any
userspace implementation would have to do the same thing, so it's not like
we're saving anything by pushing it there.

That is not easy because until IO is fully submitted, you need some parts
of the context of the process which submits the IO (e.g. memory mappings,
but possibly also other credentials). So you would need to somehow transfer
this information to the workqueue.

Outside of technical challenges, the API also needs to return EAGAIN or
start blocking at some point. We can't expose a direct connection to
queue work like that, and let any user potentially create millions of
pending work items (and IOs).

You wouldn't expect more concurrent events than the maxevents parameter
that was supplied to io_setup syscall; it should have reserved any
resources needed.

Doesn't matter what limit you apply, my point still stands - at some
point you have to return EAGAIN, or block. Returning EAGAIN without
the caller having flagged support for that change of behavior would
be problematic.


Doesn't it already return EAGAIN (or some other error) if you exceed 
maxevents?



And for this to really work, aio would need some serious help in
how it applies limits. It looks like a hot mess.


For sure.  I think it would be a shame to create more user-facing 
complexity.




Re: [PATCH 0/8 v2] Non-blocking AIO

2017-03-06 Thread Jens Axboe
On 03/06/2017 08:59 AM, Avi Kivity wrote:
> On 03/06/2017 05:38 PM, Jens Axboe wrote:
>> On 03/06/2017 08:29 AM, Avi Kivity wrote:
>>>
>>> On 03/06/2017 05:19 PM, Jens Axboe wrote:
 On 03/06/2017 01:25 AM, Jan Kara wrote:
> On Sun 05-03-17 16:56:21, Avi Kivity wrote:
>>> The goal of the patch series is to return -EAGAIN/-EWOULDBLOCK if
>>> any of these conditions are met. This way userspace can push most
>>> of the write()s to the kernel to the best of its ability to complete
>>> and if it returns -EAGAIN, can defer it to another thread.
>>>
>> Is it not possible to push the iocb to a workqueue?  This will allow
>> existing userspace to work with the new functionality, unchanged. Any
>> userspace implementation would have to do the same thing, so it's not 
>> like
>> we're saving anything by pushing it there.
> That is not easy because until IO is fully submitted, you need some parts
> of the context of the process which submits the IO (e.g. memory mappings,
> but possibly also other credentials). So you would need to somehow 
> transfer
> this information to the workqueue.
 Outside of technical challenges, the API also needs to return EAGAIN or
 start blocking at some point. We can't expose a direct connection to
 queue work like that, and let any user potentially create millions of
 pending work items (and IOs).
>>> You wouldn't expect more concurrent events than the maxevents parameter
>>> that was supplied to io_setup syscall; it should have reserved any
>>> resources needed.
>> Doesn't matter what limit you apply, my point still stands - at some
>> point you have to return EAGAIN, or block. Returning EAGAIN without
>> the caller having flagged support for that change of behavior would
>> be problematic.
> 
> Doesn't it already return EAGAIN (or some other error) if you exceed 
> maxevents?

It's a setup thing. We check these limits when someone creates an IO
context, and carve out the specified entries from our global pool. Then
we free those "resources" when the io context is freed.

Right now I can set up an IO context with 1000 entries on it, yet that
number has NO bearing on when io_submit() would potentially block or
return EAGAIN.

We can have a huge gap on the intent signaled by io context setup, and
the reality imposed by what actually happens on the IO submission side.

-- 
Jens Axboe



Re: [PATCH 0/9 PULL REQUEST] Qgroup fixes for 4.11

2017-03-06 Thread David Sterba
On Mon, Mar 06, 2017 at 04:08:41PM +0800, Qu Wenruo wrote:
> Any response?
> 
> These patches are already here for at least 2 kernel releases.
> And are all bug fixes, and fix bugs that are already reported.
> 
> I don't see any reason why it should be delayed for so long.

The reason is not enough review of the whole series. The delay is
unfortunate, as is the fact that almost zero people are capable & willing
to review it, despite your efforts to keep the series up to date with the
current code.


Re: [PATCH 0/8 v2] Non-blocking AIO

2017-03-06 Thread Jens Axboe
On 03/06/2017 09:59 AM, Avi Kivity wrote:
> 
> 
> On 03/06/2017 06:08 PM, Jens Axboe wrote:
>> On 03/06/2017 08:59 AM, Avi Kivity wrote:
>>> On 03/06/2017 05:38 PM, Jens Axboe wrote:
 On 03/06/2017 08:29 AM, Avi Kivity wrote:
> On 03/06/2017 05:19 PM, Jens Axboe wrote:
>> On 03/06/2017 01:25 AM, Jan Kara wrote:
>>> On Sun 05-03-17 16:56:21, Avi Kivity wrote:
> The goal of the patch series is to return -EAGAIN/-EWOULDBLOCK if
> any of these conditions are met. This way userspace can push most
> of the write()s to the kernel to the best of its ability to complete
> and if it returns -EAGAIN, can defer it to another thread.
>
 Is it not possible to push the iocb to a workqueue?  This will allow
 existing userspace to work with the new functionality, unchanged. Any
 userspace implementation would have to do the same thing, so it's not 
 like
 we're saving anything by pushing it there.
>>> That is not easy because until IO is fully submitted, you need some 
>>> parts
>>> of the context of the process which submits the IO (e.g. memory 
>>> mappings,
>>> but possibly also other credentials). So you would need to somehow 
>>> transfer
>>> this information to the workqueue.
>> Outside of technical challenges, the API also needs to return EAGAIN or
>> start blocking at some point. We can't expose a direct connection to
>> queue work like that, and let any user potentially create millions of
>> pending work items (and IOs).
> You wouldn't expect more concurrent events than the maxevents parameter
> that was supplied to io_setup syscall; it should have reserved any
> resources needed.
 Doesn't matter what limit you apply, my point still stands - at some
 point you have to return EAGAIN, or block. Returning EAGAIN without
 the caller having flagged support for that change of behavior would
 be problematic.
>>> Doesn't it already return EAGAIN (or some other error) if you exceed
>>> maxevents?
>> It's a setup thing. We check these limits when someone creates an IO
>> context, and carve out the specified entries from our global pool. Then
>> we free those "resources" when the io context is freed.
>>
>> Right now I can setup an IO context with 1000 entries on it, yet that
>> number has NO bearing on when io_submit() would potentially block or
>> return EAGAIN.
>>
>> We can have a huge gap on the intent signaled by io context setup, and
>> the reality imposed by what actually happens on the IO submission side.
> 
> Isn't that a bug?  Shouldn't that 1001st incomplete io_submit() return 
> EAGAIN?
> 
> Just tested it, and maxevents is not respected for this:
> 
> io_setup(1, [0x7fc64537f000])   = 0
> io_submit(0x7fc64537f000, 10, [{pread, fildes=3, buf=0x1eb4000, 
> nbytes=4096, offset=0}, {pread, fildes=3, buf=0x1eb4000, nbytes=4096, 
> offset=0}, {pread, fildes=3, buf=0x1eb4000, nbytes=4096, offset=0}, 
> {pread, fildes=3, buf=0x1eb4000, nbytes=4096, offset=0}, {pread, 
> fildes=3, buf=0x1eb4000, nbytes=4096, offset=0}, {pread, fildes=3, 
> buf=0x1eb4000, nbytes=4096, offset=0}, {pread, fildes=3, buf=0x1eb4000, 
> nbytes=4096, offset=0}, {pread, fildes=3, buf=0x1eb4000, nbytes=4096, 
> offset=0}, {pread, fildes=3, buf=0x1eb4000, nbytes=4096, offset=0}, 
> {pread, fildes=3, buf=0x1eb4000, nbytes=4096, offset=0}]) = 10
> 
> which is unexpected, to me.

ioctx_alloc()
{
[...]

/*  
 * We keep track of the number of available ringbuffer slots, to prevent
 * overflow (reqs_available), and we also use percpu counters for this. 
 *  
 * So since up to half the slots might be on other cpu's percpu counters
 * and unavailable, double nr_events so userspace sees what they
 * expected: additionally, we move req_batch slots to/from percpu   
 * counters at a time, so make sure that isn't 0:   
 */ 
nr_events = max(nr_events, num_possible_cpus() * 4);
nr_events *= 2;
}


-- 
Jens Axboe



Re: [PATCH 0/8 v2] Non-blocking AIO

2017-03-06 Thread Avi Kivity



On 03/06/2017 06:08 PM, Jens Axboe wrote:

On 03/06/2017 08:59 AM, Avi Kivity wrote:

On 03/06/2017 05:38 PM, Jens Axboe wrote:

On 03/06/2017 08:29 AM, Avi Kivity wrote:

On 03/06/2017 05:19 PM, Jens Axboe wrote:

On 03/06/2017 01:25 AM, Jan Kara wrote:

On Sun 05-03-17 16:56:21, Avi Kivity wrote:

The goal of the patch series is to return -EAGAIN/-EWOULDBLOCK if
any of these conditions are met. This way userspace can push most
of the write()s to the kernel to the best of its ability to complete
and if it returns -EAGAIN, can defer it to another thread.


Is it not possible to push the iocb to a workqueue?  This will allow
existing userspace to work with the new functionality, unchanged. Any
userspace implementation would have to do the same thing, so it's not like
we're saving anything by pushing it there.

That is not easy because until IO is fully submitted, you need some parts
of the context of the process which submits the IO (e.g. memory mappings,
but possibly also other credentials). So you would need to somehow transfer
this information to the workqueue.

Outside of technical challenges, the API also needs to return EAGAIN or
start blocking at some point. We can't expose a direct connection to
queue work like that, and let any user potentially create millions of
pending work items (and IOs).

You wouldn't expect more concurrent events than the maxevents parameter
that was supplied to io_setup syscall; it should have reserved any
resources needed.

Doesn't matter what limit you apply, my point still stands - at some
point you have to return EAGAIN, or block. Returning EAGAIN without
the caller having flagged support for that change of behavior would
be problematic.

Doesn't it already return EAGAIN (or some other error) if you exceed
maxevents?

It's a setup thing. We check these limits when someone creates an IO
context, and carve out the specified entries form our global pool. Then
we free those "resources" when the io context is freed.

Right now I can setup an IO context with 1000 entries on it, yet that
number has NO bearing on when io_submit() would potentially block or
return EAGAIN.

We can have a huge gap on the intent signaled by io context setup, and
the reality imposed by what actually happens on the IO submission side.


Isn't that a bug?  Shouldn't that 1001st incomplete io_submit() return 
EAGAIN?


Just tested it, and maxevents is not respected for this:

io_setup(1, [0x7fc64537f000])   = 0
io_submit(0x7fc64537f000, 10, [{pread, fildes=3, buf=0x1eb4000, 
nbytes=4096, offset=0}, {pread, fildes=3, buf=0x1eb4000, nbytes=4096, 
offset=0}, {pread, fildes=3, buf=0x1eb4000, nbytes=4096, offset=0}, 
{pread, fildes=3, buf=0x1eb4000, nbytes=4096, offset=0}, {pread, 
fildes=3, buf=0x1eb4000, nbytes=4096, offset=0}, {pread, fildes=3, 
buf=0x1eb4000, nbytes=4096, offset=0}, {pread, fildes=3, buf=0x1eb4000, 
nbytes=4096, offset=0}, {pread, fildes=3, buf=0x1eb4000, nbytes=4096, 
offset=0}, {pread, fildes=3, buf=0x1eb4000, nbytes=4096, offset=0}, 
{pread, fildes=3, buf=0x1eb4000, nbytes=4096, offset=0}]) = 10


which is unexpected, to me.





Re: [PATCH] btrfs: No need to check !(flags & MS_RDONLY) twice

2017-03-06 Thread Omar Sandoval
On Sat, Mar 04, 2017 at 12:32:50PM -0600, Goldwyn Rodrigues wrote:
> From: Goldwyn Rodrigues 
> 
> Code cleanup.
> The code block is for !(*flags & MS_RDONLY). We don't need
> to check it again.

Reviewed-by: Omar Sandoval 

> Signed-off-by: Goldwyn Rodrigues 
> ---
>  fs/btrfs/super.c | 3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
> 
> diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
> index da687dc..e9ae93e 100644
> --- a/fs/btrfs/super.c
> +++ b/fs/btrfs/super.c
> @@ -1785,8 +1785,7 @@ static int btrfs_remount(struct super_block *sb, int 
> *flags, char *data)
>   }
>  
>   if (fs_info->fs_devices->missing_devices >
> -  fs_info->num_tolerated_disk_barrier_failures &&
> - !(*flags & MS_RDONLY)) {
> +  fs_info->num_tolerated_disk_barrier_failures) {
>   btrfs_warn(fs_info,
>   "too many missing devices, writeable remount is 
> not allowed");
>   ret = -EACCES;
> -- 
> 2.10.2
> 


Re: [PATCH 0/9 PULL REQUEST] Qgroup fixes for 4.11

2017-03-06 Thread Chris Mason

On 03/06/2017 11:44 AM, David Sterba wrote:

On Mon, Mar 06, 2017 at 04:08:41PM +0800, Qu Wenruo wrote:

Any response?

These patches are already here for at least 2 kernel releases.
And are all bug fixes, and fix bugs that are already reported.

I don't see any reason why it should be delayed for so long.


The reason is not enough review of the whole series. The delay is
unfortunate, as is the fact that almost zero people are capable & willing
to review it, despite your efforts to keep the series up to date with the
current code.



I've been hesitant to take more qgroup fix series because past 
sets have ended up causing problems down the line.  I'm queuing this one 
for v4.12 while I collect tests and reviews, though; it's definitely not 
fair for you to have to keep rebasing it without suggestions on what to fix.


Thanks!

-chris


Re: [PATCH 0/8 v2] Non-blocking AIO

2017-03-06 Thread Avi Kivity



On 03/06/2017 07:06 PM, Jens Axboe wrote:

On 03/06/2017 09:59 AM, Avi Kivity wrote:


On 03/06/2017 06:08 PM, Jens Axboe wrote:

On 03/06/2017 08:59 AM, Avi Kivity wrote:

On 03/06/2017 05:38 PM, Jens Axboe wrote:

On 03/06/2017 08:29 AM, Avi Kivity wrote:

On 03/06/2017 05:19 PM, Jens Axboe wrote:

On 03/06/2017 01:25 AM, Jan Kara wrote:

On Sun 05-03-17 16:56:21, Avi Kivity wrote:

The goal of the patch series is to return -EAGAIN/-EWOULDBLOCK if
any of these conditions are met. This way userspace can push most
of the write()s to the kernel to the best of its ability to complete
and if it returns -EAGAIN, can defer it to another thread.


Is it not possible to push the iocb to a workqueue?  This will allow
existing userspace to work with the new functionality, unchanged. Any
userspace implementation would have to do the same thing, so it's not like
we're saving anything by pushing it there.

That is not easy because until IO is fully submitted, you need some parts
of the context of the process which submits the IO (e.g. memory mappings,
but possibly also other credentials). So you would need to somehow transfer
this information to the workqueue.

Outside of technical challenges, the API also needs to return EAGAIN or
start blocking at some point. We can't expose a direct connection to
queue work like that, and let any user potentially create millions of
pending work items (and IOs).

You wouldn't expect more concurrent events than the maxevents parameter
that was supplied to io_setup syscall; it should have reserved any
resources needed.

Doesn't matter what limit you apply, my point still stands - at some
point you have to return EAGAIN, or block. Returning EAGAIN without
the caller having flagged support for that change of behavior would
be problematic.

Doesn't it already return EAGAIN (or some other error) if you exceed
maxevents?

It's a setup thing. We check these limits when someone creates an IO
context, and carve out the specified entries from our global pool. Then
we free those "resources" when the io context is freed.

Right now I can setup an IO context with 1000 entries on it, yet that
number has NO bearing on when io_submit() would potentially block or
return EAGAIN.

We can have a huge gap on the intent signaled by io context setup, and
the reality imposed by what actually happens on the IO submission side.

Isn't that a bug?  Shouldn't that 1001st incomplete io_submit() return
EAGAIN?

Just tested it, and maxevents is not respected for this:

io_setup(1, [0x7fc64537f000])   = 0
io_submit(0x7fc64537f000, 10, [{pread, fildes=3, buf=0x1eb4000,
nbytes=4096, offset=0}, {pread, fildes=3, buf=0x1eb4000, nbytes=4096,
offset=0}, {pread, fildes=3, buf=0x1eb4000, nbytes=4096, offset=0},
{pread, fildes=3, buf=0x1eb4000, nbytes=4096, offset=0}, {pread,
fildes=3, buf=0x1eb4000, nbytes=4096, offset=0}, {pread, fildes=3,
buf=0x1eb4000, nbytes=4096, offset=0}, {pread, fildes=3, buf=0x1eb4000,
nbytes=4096, offset=0}, {pread, fildes=3, buf=0x1eb4000, nbytes=4096,
offset=0}, {pread, fildes=3, buf=0x1eb4000, nbytes=4096, offset=0},
{pread, fildes=3, buf=0x1eb4000, nbytes=4096, offset=0}]) = 10

which is unexpected, to me.

ioctx_alloc()
{
 [...]

 /*
  * We keep track of the number of available ringbuffer slots, to prevent
  * overflow (reqs_available), and we also use percpu counters for this.
  *
  * So since up to half the slots might be on other cpu's percpu counters
  * and unavailable, double nr_events so userspace sees what they
  * expected: additionally, we move req_batch slots to/from percpu
  * counters at a time, so make sure that isn't 0:
  */
 nr_events = max(nr_events, num_possible_cpus() * 4);
 nr_events *= 2;
}


On a 4-lcore desktop:

io_setup(1, [0x7fc210041000])   = 0
io_submit(0x7fc210041000, 1, [big array]) = 126
io_submit(0x7fc210041000, 1, [big array]) = -1 EAGAIN (Resource temporarily unavailable)


so, the user should already expect EAGAIN from io_submit() due to 
resource limits.  I'm sure the check could be tightened so that if we do 
have to use a workqueue, we respect the user's limit rather than some 
inflated number.


--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/8 v2] Non-blocking AIO

2017-03-06 Thread Avi Kivity

On 03/06/2017 08:27 PM, Jens Axboe wrote:

On 03/06/2017 11:17 AM, Avi Kivity wrote:


On 03/06/2017 07:06 PM, Jens Axboe wrote:

On 03/06/2017 09:59 AM, Avi Kivity wrote:

On 03/06/2017 06:08 PM, Jens Axboe wrote:

On 03/06/2017 08:59 AM, Avi Kivity wrote:

On 03/06/2017 05:38 PM, Jens Axboe wrote:

On 03/06/2017 08:29 AM, Avi Kivity wrote:

On 03/06/2017 05:19 PM, Jens Axboe wrote:

On 03/06/2017 01:25 AM, Jan Kara wrote:

On Sun 05-03-17 16:56:21, Avi Kivity wrote:

The goal of the patch series is to return -EAGAIN/-EWOULDBLOCK if
any of these conditions are met. This way userspace can push most
of the write()s to the kernel, which completes them to the best of its
ability, and if one returns -EAGAIN, userspace can defer it to another thread.


Is it not possible to push the iocb to a workqueue?  This will allow
existing userspace to work with the new functionality, unchanged. Any
userspace implementation would have to do the same thing, so it's not like
we're saving anything by pushing it there.

That is not easy because until IO is fully submitted, you need some parts
of the context of the process which submits the IO (e.g. memory mappings,
but possibly also other credentials). So you would need to somehow transfer
this information to the workqueue.

Outside of technical challenges, the API also needs to return EAGAIN or
start blocking at some point. We can't expose a direct connection to
queue work like that, and let any user potentially create millions of
pending work items (and IOs).

You wouldn't expect more concurrent events than the maxevents parameter
that was supplied to io_setup syscall; it should have reserved any
resources needed.

Doesn't matter what limit you apply, my point still stands - at some
point you have to return EAGAIN, or block. Returning EAGAIN without
the caller having flagged support for that change of behavior would
be problematic.

Doesn't it already return EAGAIN (or some other error) if you exceed
maxevents?

It's a setup thing. We check these limits when someone creates an IO
context, and carve out the specified entries form our global pool. Then
we free those "resources" when the io context is freed.

Right now I can setup an IO context with 1000 entries on it, yet that
number has NO bearing on when io_submit() would potentially block or
return EAGAIN.

We can have a huge gap on the intent signaled by io context setup, and
the reality imposed by what actually happens on the IO submission side.

Isn't that a bug?  Shouldn't that 1001st incomplete io_submit() return
EAGAIN?

Just tested it, and maxevents is not respected for this:

io_setup(1, [0x7fc64537f000])   = 0
io_submit(0x7fc64537f000, 10, [{pread, fildes=3, buf=0x1eb4000,
nbytes=4096, offset=0}, {pread, fildes=3, buf=0x1eb4000, nbytes=4096,
offset=0}, {pread, fildes=3, buf=0x1eb4000, nbytes=4096, offset=0},
{pread, fildes=3, buf=0x1eb4000, nbytes=4096, offset=0}, {pread,
fildes=3, buf=0x1eb4000, nbytes=4096, offset=0}, {pread, fildes=3,
buf=0x1eb4000, nbytes=4096, offset=0}, {pread, fildes=3, buf=0x1eb4000,
nbytes=4096, offset=0}, {pread, fildes=3, buf=0x1eb4000, nbytes=4096,
offset=0}, {pread, fildes=3, buf=0x1eb4000, nbytes=4096, offset=0},
{pread, fildes=3, buf=0x1eb4000, nbytes=4096, offset=0}]) = 10

which is unexpected, to me.

ioctx_alloc()
{
  [...]

  /*
   * We keep track of the number of available ringbuffer slots, to prevent
   * overflow (reqs_available), and we also use percpu counters for this.
   *
   * So since up to half the slots might be on other cpu's percpu counters
   * and unavailable, double nr_events so userspace sees what they
   * expected: additionally, we move req_batch slots to/from percpu
   * counters at a time, so make sure that isn't 0:
   */
  nr_events = max(nr_events, num_possible_cpus() * 4);
  nr_events *= 2;
}

On a 4-lcore desktop:

io_setup(1, [0x7fc210041000])   = 0
io_submit(0x7fc210041000, 1, [big array]) = 126
io_submit(0x7fc210041000, 1, [big array]) = -1 EAGAIN (Resource temporarily unavailable)

so, the user should already expect EAGAIN from io_submit() due to
resource limits.  I'm sure the check could be tightened so that if we do
have to use a workqueue, we respect the user's limit rather than some
inflated number.

This is why I previously said that the 1000 requests you potentially
ask for when setting up your IO context has NOTHING to do with when you
will run into EAGAIN. Yes, returning EAGAIN if the app exceeds the
limit that it itself has set is existing behavior and it certainly makes
sense. And it's an easily handled condition, since the app can just
back off and wait/reap completion events.


Every time I used aio, I considered maxevents to be the maximum number 
of in-flight requests for that queue, and observed this limit 
religiously.  It's possible others don't.



But if we allow EAGAIN to bubble up from block request submission, then
that's a chan

Re: [PATCH 0/8 v2] Non-blocking AIO

2017-03-06 Thread Jens Axboe
On 03/06/2017 11:17 AM, Avi Kivity wrote:
> 
> 
> On 03/06/2017 07:06 PM, Jens Axboe wrote:
>> On 03/06/2017 09:59 AM, Avi Kivity wrote:
>>>
>>> On 03/06/2017 06:08 PM, Jens Axboe wrote:
 On 03/06/2017 08:59 AM, Avi Kivity wrote:
> On 03/06/2017 05:38 PM, Jens Axboe wrote:
>> On 03/06/2017 08:29 AM, Avi Kivity wrote:
>>> On 03/06/2017 05:19 PM, Jens Axboe wrote:
 On 03/06/2017 01:25 AM, Jan Kara wrote:
> On Sun 05-03-17 16:56:21, Avi Kivity wrote:
>>> The goal of the patch series is to return -EAGAIN/-EWOULDBLOCK if
>>> any of these conditions are met. This way userspace can push most
>>> of the write()s to the kernel, which completes them to the best of its
>>> ability, and if one returns -EAGAIN, userspace can defer it to another thread.
>>>
>> Is it not possible to push the iocb to a workqueue?  This will allow
>> existing userspace to work with the new functionality, unchanged. Any
>> userspace implementation would have to do the same thing, so it's 
>> not like
>> we're saving anything by pushing it there.
> That is not easy because until IO is fully submitted, you need some 
> parts
> of the context of the process which submits the IO (e.g. memory 
> mappings,
> but possibly also other credentials). So you would need to somehow 
> transfer
> this information to the workqueue.
 Outside of technical challenges, the API also needs to return EAGAIN or
 start blocking at some point. We can't expose a direct connection to
 queue work like that, and let any user potentially create millions of
 pending work items (and IOs).
>>> You wouldn't expect more concurrent events than the maxevents parameter
>>> that was supplied to io_setup syscall; it should have reserved any
>>> resources needed.
>> Doesn't matter what limit you apply, my point still stands - at some
>> point you have to return EAGAIN, or block. Returning EAGAIN without
>> the caller having flagged support for that change of behavior would
>> be problematic.
> Doesn't it already return EAGAIN (or some other error) if you exceed
> maxevents?
 It's a setup thing. We check these limits when someone creates an IO
 context, and carve out the specified entries from our global pool. Then
 we free those "resources" when the io context is freed.

 Right now I can setup an IO context with 1000 entries on it, yet that
 number has NO bearing on when io_submit() would potentially block or
 return EAGAIN.

 We can have a huge gap on the intent signaled by io context setup, and
 the reality imposed by what actually happens on the IO submission side.
>>> Isn't that a bug?  Shouldn't that 1001st incomplete io_submit() return
>>> EAGAIN?
>>>
>>> Just tested it, and maxevents is not respected for this:
>>>
>>> io_setup(1, [0x7fc64537f000])   = 0
>>> io_submit(0x7fc64537f000, 10, [{pread, fildes=3, buf=0x1eb4000,
>>> nbytes=4096, offset=0}, {pread, fildes=3, buf=0x1eb4000, nbytes=4096,
>>> offset=0}, {pread, fildes=3, buf=0x1eb4000, nbytes=4096, offset=0},
>>> {pread, fildes=3, buf=0x1eb4000, nbytes=4096, offset=0}, {pread,
>>> fildes=3, buf=0x1eb4000, nbytes=4096, offset=0}, {pread, fildes=3,
>>> buf=0x1eb4000, nbytes=4096, offset=0}, {pread, fildes=3, buf=0x1eb4000,
>>> nbytes=4096, offset=0}, {pread, fildes=3, buf=0x1eb4000, nbytes=4096,
>>> offset=0}, {pread, fildes=3, buf=0x1eb4000, nbytes=4096, offset=0},
>>> {pread, fildes=3, buf=0x1eb4000, nbytes=4096, offset=0}]) = 10
>>>
>>> which is unexpected, to me.
>> ioctx_alloc()
>> {
>>  [...]
>>
>>  /*
>>   * We keep track of the number of available ringbuffer slots, to prevent
>>   * overflow (reqs_available), and we also use percpu counters for this.
>>   *
>>   * So since up to half the slots might be on other cpu's percpu counters
>>   * and unavailable, double nr_events so userspace sees what they
>>   * expected: additionally, we move req_batch slots to/from percpu
>>   * counters at a time, so make sure that isn't 0:
>>   */
>>  nr_events = max(nr_events, num_possible_cpus() * 4);
>>  nr_events *= 2;
>> }
> 
> On a 4-lcore desktop:
> 
> io_setup(1, [0x7fc210041000])   = 0
> io_submit(0x7fc210041000, 1, [big array]) = 126
> io_submit(0x7fc210041000, 1, [big array]) = -1 EAGAIN (Resource temporarily unavailable)
> 
> so, the user should already expect EAGAIN from io_submit() due to 
> resource limits.  I'm sure the check could be tightened so that if we do 
> have to use a workqueue, we respect the user's limit rather than some 
> inflated number.

This is why I previously said that the 1000 requests you potentially
ask for when setting up your IO context has NOTHING to do with when you
will run into EAGAIN. Yes, returning EAGAIN

Re: [PATCH v2 0/6] Chunk level degradable check

2017-03-06 Thread Dmitrii Tcvetkov
On Mon, 6 Mar 2017 16:58:49 +0800
Qu Wenruo  wrote:

> Btrfs currently uses num_tolerated_disk_barrier_failures to do a global
> check for tolerated missing devices.
> 
> Although the one-size-fits-all solution is quite safe, it's too strict
> if data and metadata have different duplication levels.
> 
> For example, if one uses single data and RAID1 metadata on 2 disks, any
> missing device will make the fs impossible to mount degraded.
> 
> But in fact, sometimes all single chunks may reside on the remaining
> device, and in that case we should allow a rw degraded mount.
> 
> Such case can be easily reproduced using the following script:
>  # mkfs.btrfs -f -m raid1 -d single /dev/sdb /dev/sdc
>  # wipefs -f /dev/sdc
>  # mount /dev/sdb -o degraded,rw
> 
> If using btrfs-debug-tree to check /dev/sdb, one should find that the
> data chunk is only in sdb, so in fact it should allow degraded mount.
> 
> This patchset introduces a new per-chunk degradable check for
> btrfs, allowing the above case to succeed, and it's quite small anyway.
> 
> It also enhances the kernel error messages for missing devices, so at
> least the kernel can report what made the mount fail, rather than the
> meaningless "failed to read system chunk/chunk tree -5".

Hello,

Tested the patchset for raid1 and raid10. It successfully allows a
degraded mount with single chunks on filesystems missing one drive.

Feel free to add Tested-By: Dmitrii Tcvetkov 


Re: [PATCH] btrfs: Change s_flags instead of returning -EBUSY

2017-03-06 Thread Liu Bo
On Sat, Mar 04, 2017 at 12:33:22PM -0600, Goldwyn Rodrigues wrote:
> From: Goldwyn Rodrigues 
> 
> The problem occurs when mounting multiple subvolumes rw in parallel
> while the root filesystem is marked read-only, such as during a boot
> sequence using systemd. Not all subvolumes will be mounted because some
> of them will return -EBUSY.
> 
> Here is a sample execution sequence.
> Root filesystem is mounted read-only so s->s_flags are set to MS_RDONLY
> flags is the parameter passed and s_flags is the one recorded in sb.
> btrfs_mount is called via vfs_kern_mount().
> 
> Mount Thread 1Mount Thread 2
> 
> btrfs_mount(flags & ~MS_RDONLY)
>  check (flags ^ s_flags) & MS_RDONLY
>  returns -EBUSY   btrfs_mount(flags & ~MS_RDONLY)
>   check (flags ^ s_flags) & MS_RDONLY
>returns -EBUSY
> btrfs_mount(flags | MS_RDONLY)
> btrfs_remount(flags & ~MS_RDONLY)
> s->s_flags &= ~MS_RDONLY
>   btrfs_mount(flags | MS_RDONLY)
>   check (flags ^ s_flags) & MS_RDONLY)
>returns -EBUSY
>mount FAILS
> 
> The -EBUSY was originally introduced in:
> 4b82d6e ("Btrfs: Add mount into directory support")
> as a copy of (then) get_sb_dev().
> 
> Later commit 0723a04 ("btrfs: allow mounting btrfs subvolumes
> with different ro/rw options") added the option of allowing
> subvolumes in rw/ro modes.
> 
> This fix, instead of toggling the flags in remount, clears MS_RDONLY
> from s_flags when we see a conflict between s_flags and the passed
> flags, and lets the mount continue as it is. This allows the first mount
> attempt to succeed, and we can get rid of the re-kern_mount() and
> remount sequence altogether.

Reviewed-by: Liu Bo 

Thanks,

-liubo

> 
> Signed-off-by: Goldwyn Rodrigues 
> ---
>  fs/btrfs/super.c | 36 ++--
>  1 file changed, 10 insertions(+), 26 deletions(-)
> 
> diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
> index e9ae93e..978b4a6 100644
> --- a/fs/btrfs/super.c
> +++ b/fs/btrfs/super.c
> @@ -67,8 +67,6 @@
>  static const struct super_operations btrfs_super_ops;
>  static struct file_system_type btrfs_fs_type;
>  
> -static int btrfs_remount(struct super_block *sb, int *flags, char *data);
> -
>  const char *btrfs_decode_error(int errno)
>  {
>   char *errstr = "unknown";
> @@ -1379,28 +1377,6 @@ static struct dentry *mount_subvol(const char 
> *subvol_name, u64 subvol_objectid,
>   }
>  
>   mnt = vfs_kern_mount(&btrfs_fs_type, flags, device_name, newargs);
> - if (PTR_ERR_OR_ZERO(mnt) == -EBUSY) {
> - if (flags & MS_RDONLY) {
> - mnt = vfs_kern_mount(&btrfs_fs_type, flags & ~MS_RDONLY,
> -  device_name, newargs);
> - } else {
> - mnt = vfs_kern_mount(&btrfs_fs_type, flags | MS_RDONLY,
> -  device_name, newargs);
> - if (IS_ERR(mnt)) {
> - root = ERR_CAST(mnt);
> - mnt = NULL;
> - goto out;
> - }
> -
> - down_write(&mnt->mnt_sb->s_umount);
> - ret = btrfs_remount(mnt->mnt_sb, &flags, NULL);
> - up_write(&mnt->mnt_sb->s_umount);
> - if (ret < 0) {
> - root = ERR_PTR(ret);
> - goto out;
> - }
> - }
> - }
>   if (IS_ERR(mnt)) {
>   root = ERR_CAST(mnt);
>   mnt = NULL;
> @@ -1606,8 +1582,16 @@ static struct dentry *btrfs_mount(struct 
> file_system_type *fs_type, int flags,
>   if (s->s_root) {
>   btrfs_close_devices(fs_devices);
>   free_fs_info(fs_info);
> - if ((flags ^ s->s_flags) & MS_RDONLY)
> - error = -EBUSY;
> + /* If s_flags is MS_RDONLY and flags is rw (~MS_RDONLY)
> +  * we need to reset s_flags so that sb can be writable
> +  * since we can be called from mount_subvol().
> +  * The vfsmount manages to preserve the ro/rw flags
> +  * of the root/original mount.
> +  * In the vice-versa case, s_flags already does not
> +  * have MS_RDONLY set, so don't bother.
> +  */
> + if ((s->s_flags & MS_RDONLY) && !(flags & MS_RDONLY))
> + s->s_flags &= ~MS_RDONLY;
>   } else {
>   snprintf(s->s_id, sizeof(s->s_id), "%pg", bdev);
>   btrfs_sb(s)->bdev_holder = fs_type;
> -- 
> 2.10.2
> 

Re: [PATCH] btrfs: Change s_flags instead of returning -EBUSY

2017-03-06 Thread Omar Sandoval
On Sat, Mar 04, 2017 at 12:33:22PM -0600, Goldwyn Rodrigues wrote:
> From: Goldwyn Rodrigues 
> 
> The problem occurs when mounting multiple subvolumes rw in parallel
> while the root filesystem is marked read-only, such as during a boot
> sequence using systemd. Not all subvolumes will be mounted because some
> of them will return -EBUSY.
> 
> Here is a sample execution sequence.
> Root filesystem is mounted read-only so s->s_flags are set to MS_RDONLY
> flags is the parameter passed and s_flags is the one recorded in sb.
> btrfs_mount is called via vfs_kern_mount().
> 
> Mount Thread 1Mount Thread 2
> 
> btrfs_mount(flags & ~MS_RDONLY)
>  check (flags ^ s_flags) & MS_RDONLY
>  returns -EBUSY   btrfs_mount(flags & ~MS_RDONLY)
>   check (flags ^ s_flags) & MS_RDONLY
>returns -EBUSY
> btrfs_mount(flags | MS_RDONLY)
> btrfs_remount(flags & ~MS_RDONLY)
> s->s_flags &= ~MS_RDONLY
>   btrfs_mount(flags | MS_RDONLY)
>   check (flags ^ s_flags) & MS_RDONLY)
>returns -EBUSY
>mount FAILS
> 
> The -EBUSY was originally introduced in:
> 4b82d6e ("Btrfs: Add mount into directory support")
> as a copy of (then) get_sb_dev().
> 
> Later commit 0723a04 ("btrfs: allow mounting btrfs subvolumes
> with different ro/rw options") added the option of allowing
> subvolumes in rw/ro modes.
> 
> This fix, instead of toggling the flags in remount, clears MS_RDONLY
> from s_flags when we see a conflict between s_flags and the passed
> flags, and lets the mount continue as it is. This allows the first mount
> attempt to succeed, and we can get rid of the re-kern_mount() and
> remount sequence altogether.
> 
> Signed-off-by: Goldwyn Rodrigues 
> ---
>  fs/btrfs/super.c | 36 ++--
>  1 file changed, 10 insertions(+), 26 deletions(-)
> 
> diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
> index e9ae93e..978b4a6 100644
> --- a/fs/btrfs/super.c
> +++ b/fs/btrfs/super.c
> @@ -67,8 +67,6 @@
>  static const struct super_operations btrfs_super_ops;
>  static struct file_system_type btrfs_fs_type;
>  
> -static int btrfs_remount(struct super_block *sb, int *flags, char *data);
> -
>  const char *btrfs_decode_error(int errno)
>  {
>   char *errstr = "unknown";
> @@ -1379,28 +1377,6 @@ static struct dentry *mount_subvol(const char 
> *subvol_name, u64 subvol_objectid,
>   }
>  
>   mnt = vfs_kern_mount(&btrfs_fs_type, flags, device_name, newargs);
> - if (PTR_ERR_OR_ZERO(mnt) == -EBUSY) {
> - if (flags & MS_RDONLY) {
> - mnt = vfs_kern_mount(&btrfs_fs_type, flags & ~MS_RDONLY,
> -  device_name, newargs);
> - } else {
> - mnt = vfs_kern_mount(&btrfs_fs_type, flags | MS_RDONLY,
> -  device_name, newargs);
> - if (IS_ERR(mnt)) {
> - root = ERR_CAST(mnt);
> - mnt = NULL;
> - goto out;
> - }
> -
> - down_write(&mnt->mnt_sb->s_umount);
> - ret = btrfs_remount(mnt->mnt_sb, &flags, NULL);
> - up_write(&mnt->mnt_sb->s_umount);
> - if (ret < 0) {
> - root = ERR_PTR(ret);
> - goto out;
> - }
> - }
> - }
>   if (IS_ERR(mnt)) {
>   root = ERR_CAST(mnt);
>   mnt = NULL;
> @@ -1606,8 +1582,16 @@ static struct dentry *btrfs_mount(struct 
> file_system_type *fs_type, int flags,
>   if (s->s_root) {
>   btrfs_close_devices(fs_devices);
>   free_fs_info(fs_info);
> - if ((flags ^ s->s_flags) & MS_RDONLY)
> - error = -EBUSY;
> + /* If s_flags is MS_RDONLY and flags is rw (~MS_RDONLY)

Blech, keep that gross comment style under net/.

> +  * we need to reset s_flags so that sb can be writable
> +  * since we can be called from mount_subvol().
> +  * The vfsmount manages to preserve the ro/rw flags
> +  * of the root/original mount.
> +  * In the vice-versa case, s_flags already does not
> +  * have MS_RDONLY set, so don't bother.
> +  */
> + if ((s->s_flags & MS_RDONLY) && !(flags & MS_RDONLY))
> + s->s_flags &= ~MS_RDONLY;

Hm, this doesn't seem right. btrfs_remount() does a bunch of other
important checks (e.g., BTRFS_FS_STATE_ERROR) that you're throwing away
by doing it this way.

>   } else {
>   snprintf(s->s_id, sizeof(s->s_id), "%pg", bdev);
>   btrfs_sb(s)->bdev_holder = fs_

Re: [PATCH 2/7] Btrfs: separate DISCARD from __btrfs_map_block

2017-03-06 Thread Liu Bo
On Mon, Feb 20, 2017 at 11:54:31AM +0800, Qu Wenruo wrote:
> 
> 
> At 02/18/2017 09:28 AM, Liu Bo wrote:
> > Since DISCARD is not as important as an operation like write, we don't
> > copy it to target device during replace, and it makes __btrfs_map_block
> > less complex.
> 
> Makes sense to me.
> 
> > 
> > Signed-off-by: Liu Bo 
> > ---
> >  fs/btrfs/volumes.c | 306 
> > +
> >  1 file changed, 192 insertions(+), 114 deletions(-)
> > 
> > diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> > index c52b0fe..96228f3 100644
> > --- a/fs/btrfs/volumes.c
> > +++ b/fs/btrfs/volumes.c
> > @@ -5294,6 +5294,175 @@ void btrfs_put_bbio(struct btrfs_bio *bbio)
> > kfree(bbio);
> >  }
> > 
> > +/* can REQ_OP_DISCARD be sent with other REQ like REQ_OP_WRITE? */
> > +/*
> > + * Please note that, discard won't be sent to target device of device
> > + * replace.
> > + */
> > +static int __btrfs_map_block_for_discard(struct btrfs_fs_info *fs_info,
> > +u64 logical, u64 length,
> > +struct btrfs_bio **bbio_ret)
> > +{
> > +   struct extent_map_tree *em_tree = &fs_info->mapping_tree.map_tree;
> > +   struct extent_map *em;
> > +   struct map_lookup *map;
> > +   struct btrfs_bio *bbio;
> > +   u64 offset;
> > +   u64 stripe_nr;
> > +   u64 stripe_nr_end;
> > +   u64 stripe_end_offset;
> > +   u64 stripe_cnt;
> > +   u64 stripe_len;
> > +   u64 stripe_offset;
> > +   u64 num_stripes;
> > +   u32 stripe_index;
> > +   u32 factor = 0;
> > +   u32 sub_stripes = 0;
> > +   u64 stripes_per_dev = 0;
> > +   u32 remaining_stripes = 0;
> > +   u32 last_stripe = 0;
> > +   int ret = 0;
> > +   int i;
> > +
> > +   /* discard always return a bbio */
> > +   ASSERT(bbio_ret);
> > +
> > +   read_lock(&em_tree->lock);
> > +   em = lookup_extent_mapping(em_tree, logical, length);
> > +   read_unlock(&em_tree->lock);
> 
> It seems that get_chunk_map() in previous patch can replace such searching
> and error message.
>

Yeah, I forgot to update with it.

> > +
> > +   if (!em) {
> > +   btrfs_crit(fs_info, "unable to find logical %llu len %llu",
> > +   logical, length);
> > +   return -EINVAL;
> > +   }
> > +
> > +   if (em->start > logical || em->start + em->len < logical) {
> > +   btrfs_crit(fs_info,
> > +  "found a bad mapping, wanted %Lu, found %Lu-%Lu",
> > +  logical, em->start, em->start + em->len);
> > +   free_extent_map(em);
> > +   return -EINVAL;
> > +   }
> > +
> > +   map = em->map_lookup;
> > +   /* we don't discard raid56 yet */
> > +   if (map->type & BTRFS_BLOCK_GROUP_RAID56_MASK) {
> > +   ret = -EOPNOTSUPP;
> > +   goto out;
> > +   }
> > +
> > +   offset = logical - em->start;
> > +   length = min_t(u64, em->len - offset, length);
> > +
> > +   stripe_len = map->stripe_len;
> > +   /*
> > +* stripe_nr counts the total number of stripes we have to stride
> > +* to get to this block
> > +*/
> > +   stripe_nr = div64_u64(offset, stripe_len);
> > +   stripe_offset = stripe_nr * stripe_len;
> > +   ASSERT(offset >= stripe_offset);
> 
> What about a DIV_ROUND_DOWN helper?
> Surprisingly we only have DIV_ROUND_UP, but no DIV_ROUND_DOWN.
> 
> And if we're only going to support 64K stripe len, then round_down() is good
> for current usage.
> 
> > +
> > +   /* stripe_offset is the offset of this block in its stripe */
> > +   stripe_offset = offset - stripe_offset;
> 
> This is a little confusing.
> What about using another variable called @stripe_start instead of using the
> same variable @stripe_offset to temporarily store the stripe start bytenr?
> 
> I prefer to do it in one run without reusing the @stripe_offset variable to
> avoid confusion.

Right, I was trying to keep the check of (offset >= stripe_offset), but it's not
necessary.

> 
> > +
> > +   stripe_nr_end = ALIGN(offset + length, map->stripe_len);
> 
> round_up() causes less confusion.
> 
> And IIRC, ALIGN/round_up can only handle power-of-2 alignments; this
> implies stripe_len must be a power of 2, which is OK for now.
> If using ALIGN here, we can also use round_down() for the previous
> stripe_nr.
>

Good point.

Thanks,

-liubo


Re: assertion failed: page_ops & PAGE_LOCK

2017-03-06 Thread Liu Bo
On Sun, Mar 05, 2017 at 11:59:17AM -0500, Dave Jones wrote:
> After commenting out the assertion that Liu Bo pointed out was bogus,
> my trinity runs last a little longer. This is a new one, I think.

I hit this once, haven't got enough info to check what went wrong, but I'm
working on it.

Thanks,

-liubo

> 
> assertion failed: page_ops & PAGE_LOCK, file: fs/btrfs/extent_io.c, line: 1716
> [ cut here ]
> kernel BUG at fs/btrfs/ctree.h:3423!
> invalid opcode:  [#1] PREEMPT SMP DEBUG_PAGEALLOC
> CPU: 1 PID: 32625 Comm: trinity-c40 Not tainted 4.10.0-think+ #3 
> task: 8804c6b95440 task.stack: c9ba8000
> RIP: 0010:assfail.constprop.68+0x1c/0x1e
> RSP: 0018:c9bab6c0 EFLAGS: 00010282
> RAX: 004e RBX: 00a3 RCX: 0001
> RDX:  RSI: 81ee593a RDI: 
> RBP: c9bab6c0 R08:  R09: 0001
> R10: 0001 R11:  R12: c9bab790
> R13:  R14: 01f8 R15: ea000b2026c0
> FS:  7f102a0e4b40() GS:880507a0() knlGS:
> CS:  0010 DS:  ES:  CR0: 80050033
> CR2: 7f1029f21000 CR3: 0004c6bde000 CR4: 001406e0
> DR0: 7f695d948000 DR1:  DR2: 
> DR3:  DR6: 0ff0 DR7: 0600
> Call Trace:
>  __process_pages_contig+0x4b4/0x4d0
>  __unlock_for_delalloc.isra.43+0x32/0x50
>  find_lock_delalloc_range+0x15e/0x210
>  writepage_delalloc.isra.51+0x91/0x1a0
>  __extent_writepage+0x10d/0x470
>  extent_write_cache_pages.constprop.57+0x2d3/0x430
>  extent_writepages+0x5d/0x90
>  ? btrfs_releasepage+0x20/0x20
>  btrfs_writepages+0x28/0x30
>  do_writepages+0x21/0x30
>  __filemap_fdatawrite_range+0xc6/0x100
>  filemap_fdatawrite_range+0x13/0x20
>  btrfs_fdatawrite_range+0x20/0x50
>  start_ordered_ops+0x19/0x30
>  btrfs_sync_file+0x99/0x540
>  ? debug_smp_processor_id+0x17/0x20
>  ? get_lock_stats+0x19/0x50
>  ? debug_smp_processor_id+0x17/0x20
>  ? get_lock_stats+0x19/0x50
>  vfs_fsync_range+0x4b/0xb0
>  btrfs_file_write_iter+0x412/0x570
>  ? rcu_sync_lockdep_assert+0x2f/0x60
>  __do_readv_writev+0x2ea/0x380
>  do_readv_writev+0x7c/0xc0
>  ? debug_smp_processor_id+0x17/0x20
>  ? get_lock_stats+0x19/0x50
>  ? __context_tracking_exit.part.5+0x82/0x1e0
>  vfs_writev+0x3f/0x50
>  do_pwritev+0xb5/0xd0
>  SyS_pwritev2+0x17/0x30
>  do_syscall_64+0x66/0x1d0
>  entry_SYSCALL64_slow_path+0x25/0x25
> RIP: 0033:0x7f1029a0e0f9
> RSP: 002b:7ffe59e72c48 EFLAGS: 0246
> [CONT START]  ORIG_RAX: 0148
> RAX: ffda RBX: 0148 RCX: 7f1029a0e0f9
> RDX: 00f2 RSI: 562b5ef7de50 RDI: 0185
> RBP: 7f1029fc5000 R08: 08002180528a R09: 0007
> R10: baba R11: 0246 R12: 0002
> R13: 7f1029fc5048 R14: 7f102a0e4ad8 R15: 7f1029fc5000
> 


[PATCH] Btrfs: update comments in cache_save_setup

2017-03-06 Thread Liu Bo
We also don't bother to flush the free space cache when using the free
space tree.

Cc: David Sterba 
Signed-off-by: Liu Bo 
---
 fs/btrfs/extent-tree.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index be54776..ff45060 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -3443,7 +3443,8 @@ static int cache_save_setup(struct 
btrfs_block_group_cache *block_group,
/*
 * don't bother trying to write stuff out _if_
 * a) we're not cached,
-* b) we're with nospace_cache mount option.
+* b) we're with nospace_cache mount option,
+* c) we're with v2 space_cache (FREE_SPACE_TREE).
 */
dcs = BTRFS_DC_WRITTEN;
spin_unlock(&block_group->lock);
-- 
2.5.5



[PATCH v2] Btrfs: fix unexpected file hole after disk errors

2017-03-06 Thread Liu Bo
Btrfs creates hole extents to cover any unwritten section right before
doing buffer writes after commit 3ac0d7b96a26 ("btrfs: Change the expanding
write sequence to fix snapshot related bug.").

However, that compares the start position of the buffered write against
the current EOF, so hole extents are created only if (EOF < start).

If the EOF is at the middle of the buffered write, no hole extents will be
created and a file hole without a hole extent is left in this file.

This bug was revealed by generic/019 in fstests.  'fsstress' in this test
may create the above situation and the test then fails all requests
including writes, so the buffered write which is supposed to cover the
hole (without the hole extent) couldn't make it on disk.  Running fsck
against such btrfs ends up with detecting file extent holes.

Things could be more serious, some stale data would be exposed to
userspace if files with this kind of hole are truncated to a position of
the hole, because the on-disk inode size is beyond the last extent in the
file.

This fixes the bug by comparing the end position against the EOF.

Cc: David Sterba 
Cc: Qu Wenruo 
Signed-off-by: Liu Bo 
---
v2: update comments to be precise.

 fs/btrfs/file.c | 10 ++
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 520cb72..dcf0286 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1865,11 +1865,13 @@ static ssize_t btrfs_file_write_iter(struct kiocb *iocb,
pos = iocb->ki_pos;
count = iov_iter_count(from);
start_pos = round_down(pos, fs_info->sectorsize);
+   end_pos = round_up(pos + count, fs_info->sectorsize);
oldsize = i_size_read(inode);
-   if (start_pos > oldsize) {
-   /* Expand hole size to cover write data, preventing empty gap */
-   end_pos = round_up(pos + count,
-  fs_info->sectorsize);
+   if (end_pos > oldsize) {
+   /*
+* Expand hole size to cover write data in order to prevent an
+* empty gap in case of a write failure.
+*/
err = btrfs_cont_expand(inode, oldsize, end_pos);
if (err) {
inode_unlock(inode);
-- 
2.5.5

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/2] Btrfs: remove start_pos

2017-03-06 Thread Liu Bo
On Wed, Mar 01, 2017 at 04:48:20PM +0800, Qu Wenruo wrote:
> 
> 
> At 03/01/2017 09:04 AM, Liu Bo wrote:
> > @pos, not aligned @start_pos, should be used to check whether the eof page
> > needs to be marked as readonly, thus @start_pos can be removed.
> > 
> > Signed-off-by: Liu Bo 
> > ---
> >  fs/btrfs/file.c | 7 +--
> >  1 file changed, 1 insertion(+), 6 deletions(-)
> > 
> > diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
> > index 0be837b..ef88e6d 100644
> > --- a/fs/btrfs/file.c
> > +++ b/fs/btrfs/file.c
> > @@ -1814,7 +1814,6 @@ static ssize_t btrfs_file_write_iter(struct kiocb 
> > *iocb,
> > struct inode *inode = file_inode(file);
> > struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
> > struct btrfs_root *root = BTRFS_I(inode)->root;
> > -   u64 start_pos;
> > u64 end_pos;
> > ssize_t num_written = 0;
> > bool sync = (file->f_flags & O_DSYNC) || IS_SYNC(file->f_mapping->host);
> > @@ -1822,7 +1821,6 @@ static ssize_t btrfs_file_write_iter(struct kiocb 
> > *iocb,
> > loff_t pos;
> > size_t count;
> > loff_t oldsize;
> > -   int clean_page = 0;
> > 
> > inode_lock(inode);
> > err = generic_write_checks(iocb, from);
> > @@ -1860,7 +1858,6 @@ static ssize_t btrfs_file_write_iter(struct kiocb 
> > *iocb,
> > 
> > pos = iocb->ki_pos;
> > count = iov_iter_count(from);
> > -   start_pos = round_down(pos, fs_info->sectorsize);
> > end_pos = round_up(pos + count, fs_info->sectorsize);
> > oldsize = i_size_read(inode);
> > if (end_pos > oldsize) {
> > @@ -1870,8 +1867,6 @@ static ssize_t btrfs_file_write_iter(struct kiocb 
> > *iocb,
> > inode_unlock(inode);
> > goto out;
> > }
> > -   if (start_pos > round_up(oldsize, fs_info->sectorsize))
> > -   clean_page = 1;
> > }
> > 
> > if (sync)
> > @@ -1883,7 +1878,7 @@ static ssize_t btrfs_file_write_iter(struct kiocb 
> > *iocb,
> > num_written = __btrfs_buffered_write(file, from, pos);
> > if (num_written > 0)
> > iocb->ki_pos = pos + num_written;
> > -   if (clean_page)
> > +   if (oldsize < pos)
> > pagecache_isize_extended(inode, oldsize,
> > i_size_read(inode));
> 
> Not familiar with page cache, so I can be totally wrong here.
> 
> But what will happen if @oldsize and @pos are in the same page?
> 
> For example:
> Page startPage start + 4K
> | ||  |
>   old size pos
> 
> Do we still need to call pagecache_iszie_extented() since we will dirty that
> page anyway?

Yes, isize has changed. If blocksize < pagesize, it's still possible that
the next write access to the new isize doesn't own a block, since no
page_mkwrite() has been called to allocate it; a following writepage() may
then fail silently from userspace's view (unless userspace runs fsync and
checks its return value).

Thanks,

-liubo


Re: [PATCH 4/7] mm: introduce memalloc_nofs_{save,restore} API

2017-03-06 Thread Andrew Morton
On Mon,  6 Mar 2017 14:14:05 +0100 Michal Hocko  wrote:

> From: Michal Hocko 
> 
> GFP_NOFS context is used for the following 5 reasons currently
>   - to prevent from deadlocks when the lock held by the allocation
> context would be needed during the memory reclaim
>   - to prevent from stack overflows during the reclaim because
> the allocation is performed from a deep context already
>   - to prevent lockups when the allocation context depends on
> other reclaimers to make a forward progress indirectly
>   - just in case because this would be safe from the fs POV
>   - silence lockdep false positives
> 
> Unfortunately, overuse of this allocation context brings some problems
> to the MM: memory reclaim is much weaker (especially during heavy FS
> metadata workloads), and the OOM killer cannot be invoked because the MM
> layer doesn't have enough information about how much memory is freeable
> by the FS layer.
> 
> In many cases it is far from clear why the weaker context is even used
> and so it might be used unnecessarily. We would like to get rid of
> those as much as possible. One way to do that is to use the flag in
> scopes rather than isolated cases. Such a scope is declared when really
> necessary, tracked per task and all the allocation requests from within
> the context will simply inherit the GFP_NOFS semantic.
> 
> Not only is this easier to understand and maintain, because there are
> far fewer problematic contexts than specific allocation requests, it
> also helps code paths where the FS layer interacts with other layers
> (e.g. crypto, security modules, MM etc...) and there is no easy way to
> convey the allocation context between the layers.
> 
> Introduce memalloc_nofs_{save,restore} API to control the scope
> of GFP_NOFS allocation context. This is basically copying
> memalloc_noio_{save,restore} API we have for other restricted allocation
> context GFP_NOIO. The PF_MEMALLOC_NOFS flag already exists and it is
> just an alias for PF_FSTRANS which has been xfs specific until recently.
> There are no PF_FSTRANS users anymore, so let's just drop it.
> 
> PF_MEMALLOC_NOFS is now checked in the MM layer and drops __GFP_FS
> implicitly same as PF_MEMALLOC_NOIO drops __GFP_IO. memalloc_noio_flags
> is renamed to current_gfp_context because it now cares about both
> PF_MEMALLOC_NOFS and PF_MEMALLOC_NOIO contexts. Xfs code paths preserve
> their semantic. kmem_flags_convert() doesn't need to evaluate the flag
> anymore.
> 
> This patch shouldn't introduce any functional changes.
> 
> Let's hope that filesystems will drop direct GFP_NOFS (resp. ~__GFP_FS)
> usage as much as possible and only use properly documented
> memalloc_nofs_{save,restore} checkpoints where they are appropriate.
> 
> 
>
> --- a/include/linux/gfp.h
> +++ b/include/linux/gfp.h
> @@ -210,8 +210,16 @@ struct vm_area_struct;
>   *
>   * GFP_NOIO will use direct reclaim to discard clean pages or slab pages
>   *   that do not require the starting of any physical IO.
> + *   Please try to avoid using this flag directly and instead use
> + *   memalloc_noio_{save,restore} to mark the whole scope which cannot
> + *   perform any IO with a short explanation why. All allocation requests
> + *   will inherit GFP_NOIO implicitly.
>   *
>   * GFP_NOFS will use direct reclaim but will not use any filesystem 
> interfaces.
> + *   Please try to avoid using this flag directly and instead use
> + *   memalloc_nofs_{save,restore} to mark the whole scope which 
> cannot/shouldn't
> + *   recurse into the FS layer with a short explanation why. All allocation
> + *   requests will inherit GFP_NOFS implicitly.

I wonder if these are worth a checkpatch rule.



Re: [PATCH v6 1/2] btrfs: Fix metadata underflow caused by btrfs_reloc_clone_csum error

2017-03-06 Thread Filipe Manana
On Mon, Mar 6, 2017 at 2:55 AM, Qu Wenruo  wrote:
> [BUG]
> When btrfs_reloc_clone_csum() reports error, it can underflow metadata
> and leads to kernel assertion on outstanding extents in
> run_delalloc_nocow() and cow_file_range().
>
>  BTRFS info (device vdb5): relocating block group 12582912 flags data
>  BTRFS info (device vdb5): found 1 extents
>  assertion failed: inode->outstanding_extents >= num_extents, file: 
> fs/btrfs//extent-tree.c, line: 5858
>
> Currently, due to another bug blocking ordered extents, this bug is only
> reproducible under a certain block group layout and with error injection.
>
> a) Create one data block group with one 4K extent in it.
>This avoids the bug that hangs btrfs due to an ordered extent which
>never finishes.
> b) Make btrfs_reloc_clone_csum() always fail
> c) Relocate that block group
>
> [CAUSE]
> run_delalloc_nocow() and cow_file_range() handles error from
> btrfs_reloc_clone_csum() wrongly:
>
> (The ascii chart shows a more generic case of this bug, not just the
> one mentioned above)
> 
> |<--------------------- delalloc range ---------------------------->|
> | OE 1 | OE 2 | ... | OE n |
> |<--------------------- cleanup range ----------------------------->|
> |<------------- btrfs_finish_ordered_io() range ------------------->|
>
> So the error handler, which calls extent_clear_unlock_delalloc() with the
> EXTENT_DELALLOC and EXTENT_DO_ACCOUNTING bits, and btrfs_finish_ordered_io()
> will both cover OE n and free its metadata, causing a metadata underflow.
>
> [FIX]
> The fix is to ensure that, after calling btrfs_add_ordered_extent(), we
> only call the error handler after increasing the iteration offset, so
> that the cleanup range won't cover any created ordered extent.
>
> |<--------------------- delalloc range ---------------------------->|
> | OE 1 | OE 2 | ... | OE n |
> |<- btrfs_finish_ordered_io() range ->|<------ cleanup range ------>|
>
> Signed-off-by: Qu Wenruo 

Reviewed-by: Filipe Manana 

thanks

> ---
> changelog:
> v6:
>   New, split from v5 patch, as this is a separate bug.
> ---
>  fs/btrfs/inode.c | 51 +++
>  1 file changed, 39 insertions(+), 12 deletions(-)
>
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index b2bc07aad1ae..1d83d504f2e5 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -998,15 +998,24 @@ static noinline int cow_file_range(struct inode *inode,
> BTRFS_DATA_RELOC_TREE_OBJECTID) {
> ret = btrfs_reloc_clone_csums(inode, start,
>   cur_alloc_size);
> +   /*
> +* Only drop cache here, and process as normal.
> +*
> +* We must not allow extent_clear_unlock_delalloc()
> +* at out_unlock label to free meta of this ordered
> +* extent, as its meta should be freed by
> +* btrfs_finish_ordered_io().
> +*
> +* So we must continue until @start is increased to
> +* skip current ordered extent.
> +*/
> if (ret)
> -   goto out_drop_extent_cache;
> +   btrfs_drop_extent_cache(BTRFS_I(inode), start,
> +   start + ram_size - 1, 0);
> }
>
> btrfs_dec_block_group_reservations(fs_info, ins.objectid);
>
> -   if (disk_num_bytes < cur_alloc_size)
> -   break;
> -
> /* we're not doing compressed IO, don't unlock the first
>  * page (which the caller expects to stay locked), don't
>  * clear any dirty bits and don't set any writeback bits
> @@ -1022,10 +1031,21 @@ static noinline int cow_file_range(struct inode 
> *inode,
>  delalloc_end, locked_page,
>  EXTENT_LOCKED | EXTENT_DELALLOC,
>  op);
> -   disk_num_bytes -= cur_alloc_size;
> +   if (disk_num_bytes > cur_alloc_size)
> +   disk_num_bytes = 0;
> +   else
> +   disk_num_bytes -= cur_alloc_size;
> num_bytes -= cur_alloc_size;
> alloc_hint = ins.objectid + ins.offset;
> start += cur_alloc_size;
> +
> +   /*
> +* btrfs_reloc_clone_csums() error, since start is increased
> +* extent_clear_unlock_delalloc() at out_unlock label won't
> +* free metadata of current ordered extent, we're OK to exit.
> +*/
> +   if (ret)
> +   goto out_unlock;
> }
>  out:
> 

Re: [PATCH v6 2/2] btrfs: Handle delalloc error correctly to avoid ordered extent hang

2017-03-06 Thread Filipe Manana
On Mon, Mar 6, 2017 at 2:55 AM, Qu Wenruo  wrote:
> [BUG]
> If run_delalloc_range() returns an error and some ordered extents have
> already been created, btrfs will hang with the following backtrace:
>
> Call Trace:
>  __schedule+0x2d4/0xae0
>  schedule+0x3d/0x90
>  btrfs_start_ordered_extent+0x160/0x200 [btrfs]
>  ? wake_atomic_t_function+0x60/0x60
>  btrfs_run_ordered_extent_work+0x25/0x40 [btrfs]
>  btrfs_scrubparity_helper+0x1c1/0x620 [btrfs]
>  btrfs_flush_delalloc_helper+0xe/0x10 [btrfs]
>  process_one_work+0x2af/0x720
>  ? process_one_work+0x22b/0x720
>  worker_thread+0x4b/0x4f0
>  kthread+0x10f/0x150
>  ? process_one_work+0x720/0x720
>  ? kthread_create_on_node+0x40/0x40
>  ret_from_fork+0x2e/0x40
>
> [CAUSE]
>
> |<--------------------- delalloc range ---------------------------->|
> | OE 1 | OE 2 | ... | OE n |
> |<>|                                     |<----- cleanup range ---->|
>  ||
>  \_=> First page handled by end_extent_writepage() in __extent_writepage()
>
> The problem is caused by the error handler of run_delalloc_range(), which
> doesn't handle any created ordered extents, leaving them waiting on
> btrfs_finish_ordered_io() to finish.
>
> However, after run_delalloc_range() returns an error, __extent_writepage()
> won't submit a bio, so btrfs_writepage_end_io_hook() won't be triggered
> except for the first page, and btrfs_finish_ordered_io() won't be
> triggered for the created ordered extents either.
>
> So OE 2~n will hang forever, and if OE 1 is larger than one page, it
> will also hang.
>
> [FIX]
> Introduce btrfs_cleanup_ordered_extents() function to cleanup created
> ordered extents and finish them manually.
>
> The function is based on the existing
> btrfs_endio_direct_write_update_ordered() function, modified to act just
> like btrfs_writepage_end_io_hook() but handling a specified range rather
> than a single page.
>
> After fix, delalloc error will be handled like:
>
> |<--------------------- delalloc range ---------------------------->|
> | OE 1 | OE 2 | ... | OE n |
> |<>|<----------------  ----------------->|<-- old error handler --->|
>  ||                 ||
>  ||                 \_=> Cleaned up by cleanup_ordered_extents()
>  \_=> First page handled by end_extent_writepage() in __extent_writepage()
>
> Signed-off-by: Qu Wenruo 

Reviewed-by: Filipe Manana 

thanks

> Signed-off-by: Filipe Manana 
> ---
> changelog:
> v2:
>   Add BTRFS_ORDERED_SKIP_METADATA flag to avoid double reducing
>   outstanding extents, which is already done by
>   extent_clear_unlock_delalloc() with EXTENT_DO_ACCOUNT control bit
> v3:
>   Skip first page to avoid underflow ordered->bytes_left.
>   Fix range passed in cow_file_range() which doesn't cover the whole
>   extent.
>   Expand extent_clear_unlock_delalloc() range to allow them to handle
>   metadata release.
> v4:
>   Don't use extra bit to skip metadata freeing for ordered extent,
>   but only handle btrfs_reloc_clone_csums() error just before processing
>   next extent.
>   This makes error handle much easier for run_delalloc_nocow().
> v5:
>   Various grammar and comment fixes suggested by Filipe Manana
>   Enhanced commit message to focus on the generic error handler bug,
>   pointed out by Filipe Manana
>   Adding missing wq/func user in __endio_write_update_ordered(), pointed
>   out by Filipe Manana.
>   Enhanced commit message with ascii art to explain the bug more easily.
>   Fix a bug which can lead to corrupted extent allocation, exposed by
>   Filipe Manana.
> v6:
>   Split the metadata underflow fix to another patch.
>   Use better comment and btrfs_cleanup_ordered_extent() implementation
>   from Filipe Manana.
> ---
>  fs/btrfs/inode.c | 62 
> ++--
>  1 file changed, 47 insertions(+), 15 deletions(-)
>
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index 1d83d504f2e5..9b03eb9b27d0 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -115,6 +115,30 @@ static struct extent_map *create_io_em(struct inode 
> *inode, u64 start, u64 len,
>u64 ram_bytes, int compress_type,
>int type);
>
> +static void __endio_write_update_ordered(struct inode *inode,
> +u64 offset, u64 bytes,
> +bool uptodate);
> +
> +/*
> + * Clean up all submitted ordered extents in the specified range to handle
> + * errors from the fill_delalloc() callback.
> + *
> + * NOTE: caller must ensure that when an error happens, it can not call
> + * extent_clear_unlock_delalloc() to clear both the bits EXTENT_DO_ACCOUNTING
> + * and EXTENT_DELALLOC simultaneously, because that causes the reserved 
> metadata
> + * to be released, which we want to happen only when finishing the ordered
> + * extent (btrfs_finish_ordered_io()). Also note that the caller of the
> + * fill_delalloc() callback already does proper cleanup for the first page of
> + * the range, that is, it invokes the callback writepage_end_

Re: [PATCH] Btrfs: fix file corruption after cloning inline extents

2017-03-06 Thread Filipe Manana
On Fri, Mar 3, 2017 at 6:48 PM, Liu Bo  wrote:
> On Fri, Mar 03, 2017 at 03:36:36PM +, Filipe Manana wrote:
>> On Fri, Mar 3, 2017 at 12:36 AM, Liu Bo  wrote:
>> > On Thu, Mar 02, 2017 at 02:18:21PM -0800, Liu Bo wrote:
>> >> On Tue, Jul 14, 2015 at 04:34:48PM +0100, fdman...@kernel.org wrote:
>> >> > From: Filipe Manana 
>> >> >
>> >> > Using the clone ioctl (or extent_same ioctl, which calls the same extent
>> >> > cloning function as well) we end up allowing copy an inline extent from
>> >> > the source file into a non-zero offset of the destination file. This is
>> >> > something not expected and that the btrfs code is not prepared to deal
>> >> > with - all inline extents must be at a file offset equals to 0.
>> >> >
>> >>
>> >> Somehow I failed to reproduce the BUG_ON with this case.
>> >>
>> >> > For example, the following excerpt of a test case for fstests triggers
>> >> > a crash/BUG_ON() on a write operation after an inline extent is cloned
>> >> > into a non-zero offset:
>> >> >
>> >> >   _scratch_mkfs >>$seqres.full 2>&1
>> >> >   _scratch_mount
>> >> >
>> >> >   # Create our test files. File foo has the same 2K of data at offset 4K
>> >> >   # as file bar has at its offset 0.
>> >> >   $XFS_IO_PROG -f -s -c "pwrite -S 0xaa 0 4K" \
>> >> >   -c "pwrite -S 0xbb 4k 2K" \
>> >> >   -c "pwrite -S 0xcc 8K 4K" \
>> >> >   $SCRATCH_MNT/foo | _filter_xfs_io
>> >> >
>> >> >   # File bar consists of a single inline extent (2K size).
>> >> >   $XFS_IO_PROG -f -s -c "pwrite -S 0xbb 0 2K" \
>> >> >  $SCRATCH_MNT/bar | _filter_xfs_io
>> >> >
>> >> >   # Now call the clone ioctl to clone the extent of file bar into file
>> >> >   # foo at its offset 4K. This made file foo have an inline extent at
>> >> >   # offset 4K, something which the btrfs code can not deal with in 
>> >> > future
>> >> >   # IO operations because all inline extents are supposed to start at an
>> >> >   # offset of 0, resulting in all sorts of chaos.
>> >> >   # So here we validate that clone ioctl returns an EOPNOTSUPP, which is
>> >> >   # what it returns for other cases dealing with inlined extents.
>> >> >   $CLONER_PROG -s 0 -d $((4 * 1024)) -l $((2 * 1024)) \
>> >> >   $SCRATCH_MNT/bar $SCRATCH_MNT/foo
>> >> >
>> >> >   # Because of the inline extent at offset 4K, the following write made
>> >> >   # the kernel crash with a BUG_ON().
>> >> >   $XFS_IO_PROG -c "pwrite -S 0xdd 6K 2K" $SCRATCH_MNT/foo | 
>> >> > _filter_xfs_io
>> >> >
>> >>
>> >> On 4.10, after allowing to clone an inline extent to dst file's offset 
>> >> greater
>> >> than zero, I followed the test case manually and got these
>> >>
>> >> [root@localhost trinity]# /home/btrfs-progs/btrfs-debugfs -f 
>> >> /mnt/btrfs/foo
>> >> (257 0): ram 4096 disk 12648448 disk_size 4096
>> >> (257 4096): ram 2048 disk 0 disk_size 2048 -- inline
>> >> (257 8192): ram 4096 disk 12656640 disk_size 4096
>> >> file: /mnt/btrfs/foo extents 3 disk size 10240 logical size 12288 ratio 
>> >> 1.20
>> >>
>> >> [root@localhost trinity]# xfs_io -f -c "pwrite 6k 2k" /mnt/btrfs/foo
>> >> wrote 2048/2048 bytes at offset 6144
>> >> 2 KiB, 1 ops; 0. sec (12.520 MiB/sec and 6410.2564 ops/sec)
>> >>
>> >> [root@localhost trinity]# sync
>> >> [root@localhost trinity]# /home/btrfs-progs/btrfs-debugfs -f 
>> >> /mnt/btrfs/foo
>> >> (257 0): ram 4096 disk 12648448 disk_size 4096
>> >> (257 4096): ram 4096 disk 12582912 disk_size 4096
>> >> (257 8192): ram 4096 disk 12656640 disk_size 4096
>> >> file: /mnt/btrfs/foo extents 3 disk size 12288 logical size 12288 ratio 
>> >> 1.00
>> >>
>> >>
>> >> Looks like we now are able to cope with these inline extents?
>> >
>> > I went back to test against v4.1 and v4.5, and it turns out that we got
>> > the below BUG_ON() in MM and -EIO when writing to the inline extent.
>> > When writing to the page that covers the inline extent, the page is
>> > first read to make it uptodate for writing; in readpage(), for an inline
>> > extent, btrfs_get_extent() always searches the fs tree to read the
>> > inline data out to the page and then tries to insert an em, and -EEXIST
>> > would be returned if an existing one is there.
>> >
>> > However, after commit 8dff9c853410 ("Btrfs: deal with duplciates during
>> > extent_map insertion in btrfs_get_extent"), we have that fixed, so now
>> > we can read/write inline extents even when they've been mixed with other
>> > regular extents.
>> >
>> > But... I'm not 100% sure whether such files (mixing inline with regular
>> > extents) would have any problems other than read/write. Let me know if
>> > you can think of a corruption due to that.
>>
>> Without thinking too much and after doing some quick tests that passed
>> successfully, I'm not seeing where things can go wrong.
>> However it's odd to have a mix of inline and non-inline extents, since
>> the only cases where we create inline extents is for zero offsets and
>> their size is smaller than pag

[PATCH 1/2] Btrfs: remove unused delalloc_end parameter

2017-03-06 Thread fdmanana
From: Filipe Manana 

The delalloc_end parameter of extent_clear_unlock_delalloc() is never
used and only makes the code of all callers longer and more complex.
Just remove it.

Signed-off-by: Filipe Manana 
---
 fs/btrfs/extent_io.c |  2 +-
 fs/btrfs/extent_io.h |  2 +-
 fs/btrfs/inode.c | 32 +---
 3 files changed, 15 insertions(+), 21 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 28e8192..2cce08b 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -1760,7 +1760,7 @@ static int __process_pages_contig(struct address_space 
*mapping,
 }
 
 void extent_clear_unlock_delalloc(struct inode *inode, u64 start, u64 end,
-u64 delalloc_end, struct page *locked_page,
+struct page *locked_page,
 unsigned clear_bits,
 unsigned long page_ops)
 {
diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h
index 3e4fad4..b12f4db 100644
--- a/fs/btrfs/extent_io.h
+++ b/fs/btrfs/extent_io.h
@@ -452,7 +452,7 @@ int map_private_extent_buffer(struct extent_buffer *eb, 
unsigned long offset,
 void extent_range_clear_dirty_for_io(struct inode *inode, u64 start, u64 end);
 void extent_range_redirty_for_io(struct inode *inode, u64 start, u64 end);
 void extent_clear_unlock_delalloc(struct inode *inode, u64 start, u64 end,
-u64 delalloc_end, struct page *locked_page,
+struct page *locked_page,
 unsigned bits_to_clear,
 unsigned long page_ops);
 struct bio *
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index b2bc07a..07bee7d 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -106,7 +106,7 @@ static int btrfs_truncate(struct inode *inode);
 static int btrfs_finish_ordered_io(struct btrfs_ordered_extent 
*ordered_extent);
 static noinline int cow_file_range(struct inode *inode,
   struct page *locked_page,
-  u64 start, u64 end, u64 delalloc_end,
+  u64 start, u64 end,
   int *page_started, unsigned long *nr_written,
   int unlock, struct btrfs_dedupe_hash *hash);
 static struct extent_map *create_io_em(struct inode *inode, u64 start, u64 len,
@@ -558,7 +558,7 @@ static noinline void compress_file_range(struct inode 
*inode,
 * we don't need to create any more async work items.
 * Unlock and free up our temp pages.
 */
-   extent_clear_unlock_delalloc(inode, start, end, end,
+   extent_clear_unlock_delalloc(inode, start, end,
 NULL, clear_flags,
 PAGE_UNLOCK |
 PAGE_CLEAR_DIRTY |
@@ -713,8 +713,6 @@ static noinline void submit_compressed_extents(struct inode 
*inode,
 async_extent->start,
 async_extent->start +
 async_extent->ram_size - 1,
-async_extent->start +
-async_extent->ram_size - 1,
 &page_started, &nr_written, 0,
 NULL);
 
@@ -810,8 +808,6 @@ static noinline void submit_compressed_extents(struct inode 
*inode,
extent_clear_unlock_delalloc(inode, async_extent->start,
async_extent->start +
async_extent->ram_size - 1,
-   async_extent->start +
-   async_extent->ram_size - 1,
NULL, EXTENT_LOCKED | EXTENT_DELALLOC,
PAGE_UNLOCK | PAGE_CLEAR_DIRTY |
PAGE_SET_WRITEBACK);
@@ -831,7 +827,7 @@ static noinline void submit_compressed_extents(struct inode 
*inode,
tree->ops->writepage_end_io_hook(p, start, end,
 NULL, 0);
p->mapping = NULL;
-   extent_clear_unlock_delalloc(inode, start, end, end,
+   extent_clear_unlock_delalloc(inode, start, end,
 NULL, 0,
 PAGE_END_WRITEBACK |
 PAGE_SET_ERROR);
@@ -849,8 +845,6 @@ static noinline void submit_compressed_extents(struct inode 
*inode,
extent_clear_unlock_delalloc(inode, async_extent->start,
  

[PATCH 2/2] Btrfs: fix invalid attempt to free reserved space on failure to cow range

2017-03-06 Thread fdmanana
From: Filipe Manana 

When attempting to COW a file range (we are starting writeback and doing
COW), if we manage to reserve an extent for the range we will write into
but fail after reserving it and before creating the respective ordered
extent, we end up in an error path where we attempt to decrement the
data space's reserved_bytes counter after we already did it while
reserving the extent, leading to a warning/trace like the following:

[  847.621524] [ cut here ]
[  847.625441] WARNING: CPU: 5 PID: 4905 at fs/btrfs/extent-tree.c:4316 
btrfs_free_reserved_data_space_noquota+0x60/0x9f [btrfs]
[  847.633704] Modules linked in: btrfs crc32c_generic xor raid6_pq 
acpi_cpufreq i2c_piix4 ppdev psmouse tpm_tis serio_raw pcspkr parport_pc 
tpm_tis_core i2c_core sg
[  847.644616] CPU: 5 PID: 4905 Comm: xfs_io Not tainted 
4.10.0-rc8-btrfs-next-37+ #2
[  847.648601] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
rel-1.9.1-0-gb3ef39f-prebuilt.qemu-project.org 04/01/2014
[  847.648601] Call Trace:
[  847.648601]  dump_stack+0x67/0x90
[  847.648601]  __warn+0xc2/0xdd
[  847.648601]  warn_slowpath_null+0x1d/0x1f
[  847.648601]  btrfs_free_reserved_data_space_noquota+0x60/0x9f [btrfs]
[  847.648601]  btrfs_clear_bit_hook+0x140/0x258 [btrfs]
[  847.648601]  clear_state_bit+0x87/0x128 [btrfs]
[  847.648601]  __clear_extent_bit+0x222/0x2b7 [btrfs]
[  847.648601]  clear_extent_bit+0x17/0x19 [btrfs]
[  847.648601]  extent_clear_unlock_delalloc+0x3b/0x6b [btrfs]
[  847.648601]  cow_file_range.isra.39+0x387/0x39a [btrfs]
[  847.648601]  run_delalloc_nocow+0x4d7/0x70e [btrfs]
[  847.648601]  ? arch_local_irq_save+0x9/0xc
[  847.648601]  run_delalloc_range+0xa7/0x2b5 [btrfs]
[  847.648601]  writepage_delalloc.isra.31+0xb9/0x15c [btrfs]
[  847.648601]  __extent_writepage+0x249/0x2e8 [btrfs]
[  847.648601]  extent_write_cache_pages.constprop.33+0x28b/0x36c [btrfs]
[  847.648601]  ? arch_local_irq_save+0x9/0xc
[  847.648601]  ? mark_lock+0x24/0x201
[  847.648601]  extent_writepages+0x4b/0x5c [btrfs]
[  847.648601]  ? btrfs_writepage_start_hook+0xed/0xed [btrfs]
[  847.648601]  btrfs_writepages+0x28/0x2a [btrfs]
[  847.648601]  do_writepages+0x23/0x2c
[  847.648601]  __filemap_fdatawrite_range+0x5a/0x61
[  847.648601]  filemap_fdatawrite_range+0x13/0x15
[  847.648601]  btrfs_fdatawrite_range+0x20/0x46 [btrfs]
[  847.648601]  start_ordered_ops+0x19/0x23 [btrfs]
[  847.648601]  btrfs_sync_file+0x136/0x42c [btrfs]
[  847.648601]  vfs_fsync_range+0x8c/0x9e
[  847.648601]  vfs_fsync+0x1c/0x1e
[  847.648601]  do_fsync+0x31/0x4a
[  847.648601]  SyS_fsync+0x10/0x14
[  847.648601]  entry_SYSCALL_64_fastpath+0x18/0xad
[  847.648601] RIP: 0033:0x7f5b05200800
[  847.648601] RSP: 002b:7ffe204f71c8 EFLAGS: 0246 ORIG_RAX: 
004a
[  847.648601] RAX: ffda RBX: 8109637b RCX: 7f5b05200800
[  847.648601] RDX: 008bd0a0 RSI: 008bd2e0 RDI: 0003
[  847.648601] RBP: c90001d67f98 R08:  R09: 001f
[  847.648601] R10: 01f6 R11: 0246 R12: 0046
[  847.648601] R13: c90001d67f78 R14: 7f5b054be740 R15: 7f5b054be740
[  847.648601]  ? trace_hardirqs_off_caller+0x3f/0xaa
[  847.685787] ---[ end trace 2a4a3e15382508e8 ]---

So fix this by not attempting to decrement the data space info's reserved
bytes counter if we already reserved the extent and an error happened before
creating the ordered extent. We are already correctly freeing the reserved
extent if an error happens, so there's no additional measure needed.

Signed-off-by: Filipe Manana 
---
 fs/btrfs/extent_io.h |  4 +++-
 fs/btrfs/inode.c | 45 ++---
 2 files changed, 33 insertions(+), 16 deletions(-)

diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h
index b12f4db..ba7b410 100644
--- a/fs/btrfs/extent_io.h
+++ b/fs/btrfs/extent_io.h
@@ -21,8 +21,10 @@
 #define EXTENT_NORESERVE   (1U << 15)
 #define EXTENT_QGROUP_RESERVED (1U << 16)
 #define EXTENT_CLEAR_DATA_RESV (1U << 17)
+#define EXTENT_SPACE_FREED (1U << 18)
 #define EXTENT_IOBITS  (EXTENT_LOCKED | EXTENT_WRITEBACK)
-#define EXTENT_CTLBITS (EXTENT_DO_ACCOUNTING | EXTENT_FIRST_DELALLOC)
+#define EXTENT_CTLBITS (EXTENT_DO_ACCOUNTING | EXTENT_FIRST_DELALLOC |\
+EXTENT_SPACE_FREED)
 
 /*
  * flags for bio submission. The high bits indicate the compression
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 07bee7d..0d09ef2 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -912,10 +912,13 @@ static noinline int cow_file_range(struct inode *inode,
u64 num_bytes;
unsigned long ram_size;
u64 disk_num_bytes;
-   u64 cur_alloc_size;
+   u64 cur_alloc_size = 0;
u64 blocksize = fs_info->sectorsize;
struct btrfs_key ins;
struct extent_map *em;
+   unsigned clear_bits;
+   unsigned long page_ops;
+  

Re: [PATCH v2] Btrfs: fix unexpected file hole after disk errors

2017-03-06 Thread Qu Wenruo



At 03/07/2017 04:23 AM, Liu Bo wrote:

Btrfs creates hole extents to cover any unwritten section right before
doing buffered writes, after commit 3ac0d7b96a26 ("btrfs: Change the
expanding write sequence to fix snapshot related bug.").

However, it takes the start position of the buffered write to compare
against the current EOF, so hole extents are created only if (EOF <
start).

If the EOF is in the middle of the buffered write, no hole extent is
created and a file hole without a backing hole extent is left in the file.

This bug was revealed by generic/019 in fstests.  'fsstress' in this test
may create the above situation, and the test then fails all requests,
including writes, so the buffered write which is supposed to cover the
hole (without the hole extent) never makes it to disk.  Running fsck
against such a btrfs filesystem ends up detecting file extent holes.

Things could be more serious: stale data would be exposed to userspace
if a file with this kind of hole is truncated to a position inside the
hole, because the on-disk inode size is beyond the last extent in the
file.

This fixes the bug by comparing the end position against the EOF.

Cc: David Sterba 
Cc: Qu Wenruo 
Signed-off-by: Liu Bo 


Looks good.

Reviewed-by: Qu Wenruo 

Thanks,
Qu

---
v2: update comments to be precise.

 fs/btrfs/file.c | 10 ++
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 520cb72..dcf0286 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1865,11 +1865,13 @@ static ssize_t btrfs_file_write_iter(struct kiocb *iocb,
pos = iocb->ki_pos;
count = iov_iter_count(from);
start_pos = round_down(pos, fs_info->sectorsize);
+   end_pos = round_up(pos + count, fs_info->sectorsize);
oldsize = i_size_read(inode);
-   if (start_pos > oldsize) {
-   /* Expand hole size to cover write data, preventing empty gap */
-   end_pos = round_up(pos + count,
-  fs_info->sectorsize);
+   if (end_pos > oldsize) {
+   /*
+* Expand hole size to cover write data in order to prevent an
+* empty gap in case of a write failure.
+*/
err = btrfs_cont_expand(inode, oldsize, end_pos);
if (err) {
inode_unlock(inode);






Re: [PATCH v2 0/6] Chunk level degradable check

2017-03-06 Thread Adam Borowski
On Mon, Mar 06, 2017 at 04:58:49PM +0800, Qu Wenruo wrote:
> Btrfs currently uses num_tolerated_disk_barrier_failures to do global
> check for tolerated missing device.
> 
> Although the one-size-fits-all solution is quite safe, it's too strict if
> data and metadata have different duplication levels.
> 
> For example, if one use Single data and RAID1 metadata for 2 disks, it
> means any missing device will make the fs unable to be degraded mounted.
> 
> But in fact, some times all single chunks may be in the existing device
> and in that case, we should allow it to be rw degraded mounted.
[...]
> This patchset will introduce a new per-chunk degradable check for btrfs,
> allow above case to succeed, and it's quite small anyway.

I've tested it quite extensively.

As Dmitrii already tested the common case of a raid1/raid10 degraded mount,
I concentrated mostly on cases where the answer is negative.  For example:
raid1 (A,B).  Pull out A.  Add C, start resync but interrupt it halfway. 
Pull out B.  Obviously, C doesn't have every chunk, but this doesn't come
from a naive count; Qu's patch handles this case correctly.

So far, so good.

Not so for -draid5 -mraid1, unfortunately:

[/mnt/btr2/scratch]# btrfs fi us /mnt/vol1
WARNING: RAID56 detected, not implemented
Overall:
Device size:  12.00GiB
Device allocated:  2.02GiB
Device unallocated:9.98GiB
Device missing:  0.00B
Used:  7.12MiB
Free (estimated):0.00B  (min: 8.00EiB)
Data ratio:   0.00
Metadata ratio:   2.00
Global reserve:   16.00MiB  (used: 0.00B)

Data,RAID5: Size:2.02GiB, Used:1.21GiB
   /dev/loop0  1.01GiB
   /dev/loop1  1.01GiB
   /dev/loop2  1.01GiB

Metadata,RAID1: Size:1.00GiB, Used:3.55MiB
   /dev/loop0  1.00GiB
   /dev/loop2  1.00GiB

System,RAID1: Size:8.00MiB, Used:16.00KiB
   /dev/loop1  8.00MiB
   /dev/loop2  8.00MiB

Unallocated:
   /dev/loop0  1.99GiB
   /dev/loop1  2.98GiB
   /dev/loop2  1.98GiB
[/mnt/btr2/scratch]# umount /mnt/vol1
[/mnt/btr2/scratch]# losetup -D 
   ✔
[/mnt/btr2/scratch]# losetup -f rb
[/mnt/btr2/scratch]# losetup -f rc
[/mnt/btr2/scratch]# mount -noatime,degraded /dev/loop0 /mnt/vol1
[/mnt/btr2/scratch]# btrfs fi us /mnt/vol1
WARNING: RAID56 detected, not implemented
Overall:
Device size:  12.00GiB
Device allocated:  2.02GiB
Device unallocated:9.98GiB
Device missing:  0.00B
Used:  7.12MiB
Free (estimated):0.00B  (min: 8.00EiB)
Data ratio:   0.00
Metadata ratio:   2.00
Global reserve:   16.00MiB  (used: 0.00B)

Data,RAID5: Size:2.02GiB, Used:1.21GiB
   /dev/loop0  1.01GiB
   /dev/loop0  1.01GiB
   /dev/loop1  1.01GiB

Metadata,RAID1: Size:1.00GiB, Used:3.55MiB
   /dev/loop0  1.00GiB
   /dev/loop1  1.00GiB

System,RAID1: Size:8.00MiB, Used:16.00KiB
   /dev/loop0  8.00MiB
   /dev/loop1  8.00MiB

Unallocated:
   /dev/loop0  1.99GiB
   /dev/loop0  2.98GiB
   /dev/loop1  1.98GiB

Write something, mount degraded again.  Massive data corruption, both on
plain reads and on scrub, unrecoverable.


Obviously, this problem is somewhere with RAID5 rather than this patch set,
but the safety check can't be removed before that is fixed.


-- 
⢀⣴⠾⠻⢶⣦⠀ Meow!
⣾⠁⢠⠒⠀⣿⡁
⢿⡄⠘⠷⠚⠋⠀ Collisions shmolisions, let's see them find a collision or second
⠈⠳⣄ preimage for double rot13!


Re: [PATCH v2 0/6] Chunk level degradable check

2017-03-06 Thread Qu Wenruo



At 03/07/2017 08:36 AM, Adam Borowski wrote:

On Mon, Mar 06, 2017 at 04:58:49PM +0800, Qu Wenruo wrote:

Btrfs currently uses num_tolerated_disk_barrier_failures to do a global
check for tolerated missing devices.

Although the one-size-fits-all solution is quite safe, it's too strict if
data and metadata have different duplication levels.

For example, if one uses single data and RAID1 metadata on 2 disks, any
missing device makes the fs unable to be mounted degraded.

But in fact, sometimes all single chunks may reside on the remaining
device, and in that case we should allow rw degraded mount.

[...]

This patchset will introduce a new per-chunk degradable check for btrfs,
allowing the above case to succeed, and it's quite small anyway.


I've tested it quite extensively.

As Dmitrii already tested the common case of a raid1/raid10 degraded mount,
I concentrated mostly on cases where the answer is negative.  For example:
raid1 (A,B).  Pull out A.  Add C, start resync but interrupt it halfway.
Pull out B.  Obviously, C doesn't have every chunk, but this doesn't come
from a naive count; Qu's patch handles this case correctly.

So far, so good.


Thanks for your full test.



Not so for -draid5 -mraid1, unfortunately:


Unfortunately, for raid5 there are still unfixed bugs.
In fact, some raid5/6 bugs are already fixed, but still not merged yet.

For example, the following patch will fix the RAID5/6 parity corruption.

https://patchwork.kernel.org/patch/9553581/



[/mnt/btr2/scratch]# btrfs fi us /mnt/vol1
WARNING: RAID56 detected, not implemented
Overall:
Device size:  12.00GiB
Device allocated:  2.02GiB
Device unallocated:9.98GiB
Device missing:  0.00B
Used:  7.12MiB
Free (estimated):0.00B  (min: 8.00EiB)
Data ratio:   0.00
Metadata ratio:   2.00
Global reserve:   16.00MiB  (used: 0.00B)

Data,RAID5: Size:2.02GiB, Used:1.21GiB
   /dev/loop0  1.01GiB
   /dev/loop1  1.01GiB
   /dev/loop2  1.01GiB

Metadata,RAID1: Size:1.00GiB, Used:3.55MiB
   /dev/loop0  1.00GiB
   /dev/loop2  1.00GiB

System,RAID1: Size:8.00MiB, Used:16.00KiB
   /dev/loop1  8.00MiB
   /dev/loop2  8.00MiB

Unallocated:
   /dev/loop0  1.99GiB
   /dev/loop1  2.98GiB
   /dev/loop2  1.98GiB
[/mnt/btr2/scratch]# umount /mnt/vol1
[/mnt/btr2/scratch]# losetup -D 
   ✔
[/mnt/btr2/scratch]# losetup -f rb
[/mnt/btr2/scratch]# losetup -f rc


So you're pulling out the first device.
In theory, that should be completely OK for RAID5,
and the degradable check agrees.


[/mnt/btr2/scratch]# mount -noatime,degraded /dev/loop0 /mnt/vol1
[/mnt/btr2/scratch]# btrfs fi us /mnt/vol1
WARNING: RAID56 detected, not implemented
Overall:
Device size:  12.00GiB
Device allocated:  2.02GiB
Device unallocated:9.98GiB
Device missing:  0.00B
Used:  7.12MiB
Free (estimated):0.00B  (min: 8.00EiB)
Data ratio:   0.00
Metadata ratio:   2.00
Global reserve:   16.00MiB  (used: 0.00B)

Data,RAID5: Size:2.02GiB, Used:1.21GiB
   /dev/loop0  1.01GiB
   /dev/loop0  1.01GiB
   /dev/loop1  1.01GiB


/dev/loop0 shows up twice here; one entry should be detected as missing.

So it looks like a btrfs-progs bug, which will be much easier to fix than 
a kernel one.




Metadata,RAID1: Size:1.00GiB, Used:3.55MiB
   /dev/loop0  1.00GiB
   /dev/loop1  1.00GiB

System,RAID1: Size:8.00MiB, Used:16.00KiB
   /dev/loop0  8.00MiB
   /dev/loop1  8.00MiB

Unallocated:
   /dev/loop0  1.99GiB
   /dev/loop0  2.98GiB
   /dev/loop1  1.98GiB

Write something, mount degraded again.  Massive data corruption, both on
plain reads and on scrub, unrecoverable.


Yep, same thing here.
And you may be surprised that even a 2-device RAID5, which is effectively 
the same as RAID1 (parity equals data), can still hit the problem.


So RAID5/6 definitely has problems in degraded mode.
I'd prefer to focus on normal-mode RAID5/6 bug fixes first, until we 
have solved all normal-mode bugs with enough test cases covering them.




Obviously, this problem is somewhere with RAID5 rather than this patch set,
but the safety check can't be removed before that is fixed.


Do we have a *safety check* in the original behavior?

At least as of v4.11-rc1, btrfs still allows us to mount raid5/6 degraded,
so this patchset behaves just like the old code.

I'm completely fine with adding a new patch to prohibit raid5/6 degraded 
mount, but that would be a separate enhancement.


Thanks,
Qu



Re: assertion failed: page_ops & PAGE_LOCK

2017-03-06 Thread Liu Bo
On Sun, Mar 05, 2017 at 11:59:17AM -0500, Dave Jones wrote:
> After commenting out the assertion that Liu bo pointed out was bogus,
> my trinity runs last a little longer.. This is a new one I think..
>

Could you please try this patch?

Thanks,

-liubo

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 28e8192..8df7974 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -1714,7 +1714,8 @@ static int __process_pages_contig(struct address_space 
*mapping,
 * can we find nothing at @index.
 */
ASSERT(page_ops & PAGE_LOCK);
-   return ret;
+   err = -EAGAIN;
+   goto out;
}
 
for (i = 0; i < ret; i++) {



> assertion failed: page_ops & PAGE_LOCK, file: fs/btrfs/extent_io.c, line: 1716
> [ cut here ]
> kernel BUG at fs/btrfs/ctree.h:3423!
> invalid opcode:  [#1] PREEMPT SMP DEBUG_PAGEALLOC
> CPU: 1 PID: 32625 Comm: trinity-c40 Not tainted 4.10.0-think+ #3 
> task: 8804c6b95440 task.stack: c9ba8000
> RIP: 0010:assfail.constprop.68+0x1c/0x1e
> RSP: 0018:c9bab6c0 EFLAGS: 00010282
> RAX: 004e RBX: 00a3 RCX: 0001
> RDX:  RSI: 81ee593a RDI: 
> RBP: c9bab6c0 R08:  R09: 0001
> R10: 0001 R11:  R12: c9bab790
> R13:  R14: 01f8 R15: ea000b2026c0
> FS:  7f102a0e4b40() GS:880507a0() knlGS:
> CS:  0010 DS:  ES:  CR0: 80050033
> CR2: 7f1029f21000 CR3: 0004c6bde000 CR4: 001406e0
> DR0: 7f695d948000 DR1:  DR2: 
> DR3:  DR6: 0ff0 DR7: 0600
> Call Trace:
>  __process_pages_contig+0x4b4/0x4d0
>  __unlock_for_delalloc.isra.43+0x32/0x50
>  find_lock_delalloc_range+0x15e/0x210
>  writepage_delalloc.isra.51+0x91/0x1a0
>  __extent_writepage+0x10d/0x470
>  extent_write_cache_pages.constprop.57+0x2d3/0x430
>  extent_writepages+0x5d/0x90
>  ? btrfs_releasepage+0x20/0x20
>  btrfs_writepages+0x28/0x30
>  do_writepages+0x21/0x30
>  __filemap_fdatawrite_range+0xc6/0x100
>  filemap_fdatawrite_range+0x13/0x20
>  btrfs_fdatawrite_range+0x20/0x50
>  start_ordered_ops+0x19/0x30
>  btrfs_sync_file+0x99/0x540
>  ? debug_smp_processor_id+0x17/0x20
>  ? get_lock_stats+0x19/0x50
>  ? debug_smp_processor_id+0x17/0x20
>  ? get_lock_stats+0x19/0x50
>  vfs_fsync_range+0x4b/0xb0
>  btrfs_file_write_iter+0x412/0x570
>  ? rcu_sync_lockdep_assert+0x2f/0x60
>  __do_readv_writev+0x2ea/0x380
>  do_readv_writev+0x7c/0xc0
>  ? debug_smp_processor_id+0x17/0x20
>  ? get_lock_stats+0x19/0x50
>  ? __context_tracking_exit.part.5+0x82/0x1e0
>  vfs_writev+0x3f/0x50
>  do_pwritev+0xb5/0xd0
>  SyS_pwritev2+0x17/0x30
>  do_syscall_64+0x66/0x1d0
>  entry_SYSCALL64_slow_path+0x25/0x25
> RIP: 0033:0x7f1029a0e0f9
> RSP: 002b:7ffe59e72c48 EFLAGS: 0246
> [CONT START]  ORIG_RAX: 0148
> RAX: ffda RBX: 0148 RCX: 7f1029a0e0f9
> RDX: 00f2 RSI: 562b5ef7de50 RDI: 0185
> RBP: 7f1029fc5000 R08: 08002180528a R09: 0007
> R10: baba R11: 0246 R12: 0002
> R13: 7f1029fc5048 R14: 7f102a0e4ad8 R15: 7f1029fc5000
> 


Re: Btrfs progs pre-release 4.10-rc1

2017-03-06 Thread Tsutomu Itoh
Hi David,

On 2017/03/02 23:59, David Sterba wrote:
> Hi,
> 
> a pre-release has been tagged. There are patches that have queued so far, but
> I haven't gone through everything that's in the mailinglist. The 4.10 release
> ETA is next week so I'll try to process the backlog and merge what would seem
> applicable.
> 
> Changes:
>   * send: dump output fixes: missing newlies
>   * check: several fixes for the lowmem mode, improved error reporting
>   * build
> * removed some library deps for binaries that not use them
> * ctags, cscope
> * split Makefile to the autotool generated part and the rest, not needed
>   to autogen.sh after adding a file
>   * shared code: sync easy parts with kernel sources
>   * other
> * lots of cleanups
> * source file reorganization: convert, mkfs, utils
> * lots of spelling fixes in docs, other updates
> * more tests
> 
> ETA for 4.10 is in +7 days (2017-03-08).
> 
> Tarballs: https://www.kernel.org/pub/linux/kernel/people/kdave/btrfs-progs/
> Git: git://git.kernel.org/pub/scm/linux/kernel/git/kdave/btrfs-progs.git
> 
> Shortlog:
> 
> Austin S. Hemmelgarn (1):
>   btrfs-progs: better document btrfs receive security
> 
> Benedikt Morbach (1):
>   btrfs-progs: send-dump: add missing newlines
> 
> David Sterba (102):

>   btrfs-progs: rework option parser to use getopt for global options

I think btrfs/{008,016,019,...} of xfstests fail due to the
above patch.

btrfs/008 1s ... [failed, exit status 1] - output mismatch (see 
/xfstests/results//btrfs/008.out.bad)
--- tests/btrfs/008.out 2015-08-04 16:09:38.0 +0900
+++ /xfstests/results//btrfs/008.out.bad2017-03-07 09:00:50.581906234 
+0900
@@ -1,2 +1,3 @@
 QA output created by 008
-Silence is golden
+send failed
+(see /xfstests/results//btrfs/008.full for details)
...
(Run 'diff -u tests/btrfs/008.out /xfstests/results//btrfs/008.out.bad'  to 
see the entire diff)

Thanks,
Tsutomu

>   btrfs-progs: introduce global config
>   btrfs-progs: find-root: rename usage helper
>   btrfs-progs: move help defines to own header
>   btrfs-progs: move help implemetnation to own file
>   btrfs-progs: move some common definitions to own header
>   btrfs-progs: move mkfs definitions to own header
>   btrfs-progs: move convert definitions to own header
>   btrfs-progs: mkfs: move common api implementation to own file
>   btrfs-progs: convert: move common api implementation to own file
>   btrfs-progs: move fs features declarations to own header from utils
>   btrfs-progs: move fs features implementation to own file
>   btrfs-progs: convert: move definitions for interal conversion API to 
> own file
>   btrfs-progs: convert: move ext2 definitions out of main
>   btrfs-progs: convert: move ext2 conversion out of main.c
>   btrfs-progs: convert: move implementation for interal conversion API to 
> own file
>   btrfs-progs: build: list convert build objects in a variable
>   btrfs-progs: build: list mkfs.btrfs build objects in a variable
>   btrfs-progs: build: split LIBS
>   btrfs-progs: build: reorder target dependencies
>   btrfs-progs: build: replace target names with automatic variable
>   btrfs-progs: build: use target deps on commandline via automatic 
> variable
>   btrfs-progs: build: remove directory-specific include paths
>   btrfs-progs: mkfs: make list of source fs more visible
>   btrfs-progs: convert: use wider types types for inode counts for 
> progress reports
>   btrfs-progs: convert: update some forward declarations
>   btrfs-progs: build: add rule for ctags
>   btrfs-progs: build: split makefile to generated and stable parts
>   btrfs-progs: build: add rule for building cscope index
>   btrfs-progs: convert: move struct initialization to the init function
>   btrfs-progs: convert: use fixed lenght array for source fs name
>   btrfs-progs: convert: use on-stack buffer for subvol name dir
>   btrfs-progs: convert: remove unused includes
>   btrfs-progs: convert: better error handling in ext2_read_used_space
>   btrfs-progs: convert: use helper for special inode number check
>   btrfs-progs: convert: use bit field for convert flags
>   btrfs-progs: build: add stub makefile to convert
>   btrfs-progs: build: build library by default
>   btrfs-progs: kerncompat: print trace from ASSERT, if enabled
>   btrfs-progs: move more mkfs declarations to the common header
>   btrfs-progs: move mkfs helper implementation out of utils
>   btrfs-progs: convert: rename ext2 function to create a symlink
>   btrfs-progs: convert: move internal bg size definition
>   btrfs-progs: build: drop deprecated utility from test dependencies
>   btrfs-progs: build: use MAKEOPTS where missing
>   btrfs-progs: build: remove unused variables from docs makefile
>   btrfs-progs: m

Re: [PATCH 0/5] raid56: variant bug fixes

2017-03-06 Thread Qu Wenruo

So raid56 bug fixes are now in the same situation as the qgroup fixes?

No reviewer, so no merge?

I understand we need enough reviewers, however there are never enough 
reviewers for *minor* features, like qgroup or raid56.


Such a situation will just starve these features: bugs mean fewer 
testers and users, and fewer users lead to even fewer developers, 
causing a downward spiral.


Thanks,
Qu

At 02/03/2017 04:20 PM, Qu Wenruo wrote:

This patchset can be fetched from my github repo:
https://github.com/adam900710/linux.git raid56_fixes

It's based on v4.10-rc6, and none of the patches has been modified since
its first appearance on the mailing list.

The patchset fixes the following bugs:

1) False alert or wrong csum error number when scrubbing RAID5/6
   The bug itself won't cause any damage to the fs; it's purely a race.

2) Corrupted data stripe rebuild corrupts P/Q
   So scrub turns one error into another, not really fixing anything.

3) Use-after-free caused by cancelling dev-replace
   This is quite a deadly bug, since cancelling dev-replace can easily
   cause a kernel panic, and thanks to raid bio stealing, the race
   window is quite large.

   Can be triggered by btrfs/069.

With all the fixes applied, no scrub/replace-related regressions can be
detected.

Qu Wenruo (5):
  btrfs: scrub: Introduce full stripe lock for RAID56
  btrfs: scrub: Fix RAID56 recovery race condition
  btrfs: raid56: Use correct stolen pages to calculate P/Q
  btrfs: raid56: Don't keep rbio for later steal
  btrfs: replace: Use ref counts to avoid destroying target device when
canceled

 fs/btrfs/ctree.h   |   4 ++
 fs/btrfs/dev-replace.c |   7 +-
 fs/btrfs/extent-tree.c |   3 +
 fs/btrfs/raid56.c  |  80 +++--
 fs/btrfs/scrub.c   | 192 +
 fs/btrfs/volumes.c |  36 +-
 fs/btrfs/volumes.h |  10 +++
 7 files changed, 309 insertions(+), 23 deletions(-)






[PATCH] Btrfs: fix regression in lock_delalloc_pages

2017-03-06 Thread Liu Bo
The bug is a regression after commit
(da2c7009f6ca "btrfs: teach __process_pages_contig about PAGE_LOCK operation")
and commit
(76c0021db8fd "Btrfs: use helper to simplify lock/unlock pages").

If the dirty pages under writeback get partially truncated before we
lock them, we can't find all the pages mapping to the delalloc range.
The buggy code didn't return an error, so it kept going, found that the
delalloc range had been truncated, went on to unlock the dirty pages,
and then the ASSERT caught the error and showed

-
assertion failed: page_ops & PAGE_LOCK, file: fs/btrfs/extent_io.c, line: 1716
-

This fixes the bug by returning the proper -EAGAIN.

Cc: David Sterba 
Reported-by: Dave Jones 
Signed-off-by: Liu Bo 
---
 fs/btrfs/extent_io.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 28e8192..8df7974 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -1714,7 +1714,8 @@ static int __process_pages_contig(struct address_space 
*mapping,
 * can we find nothing at @index.
 */
ASSERT(page_ops & PAGE_LOCK);
-   return ret;
+   err = -EAGAIN;
+   goto out;
}
 
for (i = 0; i < ret; i++) {
-- 
2.5.5



Re: [PATCH v2 0/6] Chunk level degradable check

2017-03-06 Thread Adam Borowski
On Tue, Mar 07, 2017 at 09:35:56AM +0800, Qu Wenruo wrote:
> At 03/07/2017 08:36 AM, Adam Borowski wrote:
> > Not so for -draid5 -mraid1, unfortunately:
> 
> Unfortunately, for raid5 there are still unfixed bugs.
> In fact, some raid5/6 bugs are already fixed, but still not merged yet.
> 
> > [/mnt/btr2/scratch]# btrfs fi us /mnt/vol1
> > Data,RAID5: Size:2.02GiB, Used:1.21GiB
> >/dev/loop0  1.01GiB
> >/dev/loop1  1.01GiB
> >/dev/loop2  1.01GiB

> > [/mnt/btr2/scratch]# umount /mnt/vol1
> > [/mnt/btr2/scratch]# losetup -D
> > [/mnt/btr2/scratch]# losetup -f rb
> > [/mnt/btr2/scratch]# losetup -f rc
> 
> So you're pulling out first device.
> In theory, it should be completely OK for RAID5.
> And the degradable check follows it.
> 
> > [/mnt/btr2/scratch]# mount -noatime,degraded /dev/loop0 /mnt/vol1
> > [/mnt/btr2/scratch]# btrfs fi us /mnt/vol1
> > Data,RAID5: Size:2.02GiB, Used:1.21GiB
> >/dev/loop0  1.01GiB
> >/dev/loop0  1.01GiB
> >/dev/loop1  1.01GiB
> 
> /dev/loop0 shows up twice here; one entry should be detected as missing.
> 
> So it looks like a btrfs-progs bug, which will be much easier to fix than
> a kernel one.

Alas, it's not merely a display bug, mounting is enough.

> > Write something, mount degraded again.  Massive data corruption, both on
> > plain reads and on scrub, unrecoverable.
> 
> Yep, same thing here.
> And you may be surprised that even a 2-device RAID5, which is effectively
> the same as RAID1 (parity equals data), can still hit the problem.
> 
> So RAID5/6 definitely has problems in degraded mode.
> I'd prefer to focus on normal-mode RAID5/6 bug fixes first, until we have
> solved all normal-mode bugs with enough test cases covering them.

Actually, it turns out even the _first_ mount goes bad, without writing a
single data byte.  So it's not related to our single-chunks bug.

> > Obviously, this problem is somewhere with RAID5 rather than this patch set,
> > but the safety check can't be removed before that is fixed.
> 
> Do we have a *safety check* in the original behavior?
> 
> At least as of v4.11-rc1, btrfs still allows us to mount raid5/6 degraded,
> so this patchset behaves just like the old code.

Right.  Thus, there's no regression.

As it's a strict improvement over previous state (ie, fixes raid1 issues),
Tested-by: Adam Borowski  (if you don't mind spamming
commits with too many tags).

> I'm completely fine with adding a new patch to prohibit raid5/6 degraded
> mount, but that would be a separate enhancement.

Yeah.  I guess it's more in the "don't use RAID5, there be dragons" land.


Thanks for these patches, they fix the #1 problem people have with RAID1.


[Apologies for that "✔" crap on some lines, my exit code on prompt thingy
is very paste-unfriendly; I keep forgetting it so often that I'd better get
rid of it...]

-- 
⢀⣴⠾⠻⢶⣦⠀ Meow!
⣾⠁⢠⠒⠀⣿⡁
⢿⡄⠘⠷⠚⠋⠀ Collisions shmolisions, let's see them find a collision or second
⠈⠳⣄ preimage for double rot13!


Re: [PATCH 1/2] Btrfs: fix unexpected file hole after disk errors

2017-03-06 Thread Liu Bo
On Wed, Mar 01, 2017 at 10:44:53AM +0800, Qu Wenruo wrote:
> 
> 
> At 03/01/2017 09:04 AM, Liu Bo wrote:
> > Btrfs creates hole extents to cover any unwritten section right before
> > doing buffer writes after commit 3ac0d7b96a26 ("btrfs: Change the expanding
> > write sequence to fix snapshot related bug.").
> > 
> > However, that takes the start position of the buffered write to compare
> > against the current EOF, hole extents would be created only if (EOF <
> > start).
> > 
> > If the EOF is at the middle of the buffered write, no hole extents will be
> > created and a file hole without a hole extent is left in this file.
> > 
> > This bug was revealed by generic/019 in fstests.  'fsstress' in this test
> > may create the above situation and the test then fails all requests
> > including writes, so the buffer write which is supposed to cover the
> > hole (without the hole extent) couldn't make it on disk.  Running fsck
> > against such btrfs ends up with detecting file extent holes.
> > 
> > Things could be more serious, some stale data would be exposed to
> > userspace if files with this kind of hole are truncated to a position of
> > the hole, because the on-disk inode size is beyond the last extent in the
> > file.
> > 
> > This fixes the bug by comparing the end position against the EOF.
> > 
> > Signed-off-by: Liu Bo 
> 
> Patch looks good to me.
> Reviewed-by: Qu Wenruo 
> 
> > ---
> >  fs/btrfs/file.c | 5 ++---
> >  1 file changed, 2 insertions(+), 3 deletions(-)
> > 
> > diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
> > index b5c5da2..0be837b 100644
> > --- a/fs/btrfs/file.c
> > +++ b/fs/btrfs/file.c
> > @@ -1861,11 +1861,10 @@ static ssize_t btrfs_file_write_iter(struct kiocb 
> > *iocb,
> > pos = iocb->ki_pos;
> > count = iov_iter_count(from);
> > start_pos = round_down(pos, fs_info->sectorsize);
> > +   end_pos = round_up(pos + count, fs_info->sectorsize);
> > oldsize = i_size_read(inode);
> > -   if (start_pos > oldsize) {
> > +   if (end_pos > oldsize) {
> > /* Expand hole size to cover write data, preventing empty gap */
> 
> The comment still makes sense here, but it would be better to explain why
> we insert the hole to cover the whole write range (in case the write fails).
>

Sounds good, I'll update.

Thanks,

-liubo

> Thanks,
> Qu
> 
> > -   end_pos = round_up(pos + count,
> > -  fs_info->sectorsize);
> > err = btrfs_cont_expand(inode, oldsize, end_pos);
> > if (err) {
> > inode_unlock(inode);
> > 
> 
> 


Re: [PATCH v2 6/6] btrfs: Enhance missing device kernel message

2017-03-06 Thread Anand Jain



On 03/06/2017 04:58 PM, Qu Wenruo wrote:

For a missing device, btrfs will just refuse to mount, with an almost
meaningless kernel message like:

 BTRFS info (device vdb6): disk space caching is enabled
 BTRFS info (device vdb6): has skinny extents
 BTRFS error (device vdb6): failed to read the system array: -5
 BTRFS error (device vdb6): open_ctree failed

This patch adds extra missing-device output, making the result:

 BTRFS info (device vdb6): disk space caching is enabled
 BTRFS info (device vdb6): has skinny extents
 BTRFS warning (device vdb6): devid 2 uuid 80470722-cad2-4b90-b7c3-fee294552f1b 
is missing
 BTRFS error (device vdb6): failed to read the system array: -5
 BTRFS error (device vdb6): open_ctree failed



Reviewed-by: Anand Jain 

Thanks, Anand



Signed-off-by: Qu Wenruo 
---
 fs/btrfs/volumes.c | 25 ++---
 fs/btrfs/volumes.h |  2 ++
 2 files changed, 20 insertions(+), 7 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 72c78f096755..b33e96495934 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -6442,6 +6442,7 @@ static int read_one_chunk(struct btrfs_fs_info *fs_info, 
struct btrfs_key *key,
if (!map->stripes[i].dev &&
!btrfs_test_opt(fs_info, DEGRADED)) {
free_extent_map(em);
+   btrfs_report_missing_device(fs_info, devid, uuid);
return -EIO;
}
if (!map->stripes[i].dev) {
@@ -6452,8 +6453,7 @@ static int read_one_chunk(struct btrfs_fs_info *fs_info, 
struct btrfs_key *key,
free_extent_map(em);
return -EIO;
}
-   btrfs_warn(fs_info, "devid %llu uuid %pU is missing",
-  devid, uuid);
+   btrfs_report_missing_device(fs_info, devid, uuid);
}
map->stripes[i].dev->in_fs_metadata = 1;
}
@@ -6570,17 +6570,21 @@ static int read_one_dev(struct btrfs_fs_info *fs_info,

device = btrfs_find_device(fs_info, devid, dev_uuid, fs_uuid);
if (!device) {
-   if (!btrfs_test_opt(fs_info, DEGRADED))
+   if (!btrfs_test_opt(fs_info, DEGRADED)) {
+   btrfs_report_missing_device(fs_info, devid, dev_uuid);
return -EIO;
+   }

device = add_missing_dev(fs_devices, devid, dev_uuid);
if (!device)
return -ENOMEM;
-   btrfs_warn(fs_info, "devid %llu uuid %pU missing",
-   devid, dev_uuid);
+   btrfs_report_missing_device(fs_info, devid, dev_uuid);
} else {
-   if (!device->bdev && !btrfs_test_opt(fs_info, DEGRADED))
-   return -EIO;
+   if (!device->bdev) {
+   btrfs_report_missing_device(fs_info, devid, dev_uuid);
+   if (!btrfs_test_opt(fs_info, DEGRADED))
+   return -EIO;
+   }

if(!device->bdev && !device->missing) {
/*
@@ -6591,6 +6595,7 @@ static int read_one_dev(struct btrfs_fs_info *fs_info,
 */
device->fs_devices->missing_devices++;
device->missing = 1;
+   btrfs_report_missing_device(fs_info, devid, dev_uuid);
}

/* Move the device to its own fs_devices */
@@ -6748,6 +6753,12 @@ int btrfs_read_sys_array(struct btrfs_fs_info *fs_info)
return -EIO;
 }

+void btrfs_report_missing_device(struct btrfs_fs_info *fs_info, u64 devid,
+u8 *uuid)
+{
+   btrfs_warn_rl(fs_info, "devid %llu uuid %pU is missing", devid, uuid);
+}
+
 /*
  * Check if all chunks in the fs is OK for read-write degraded mount
  *
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index 112fccacdabc..4f981b75958d 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -543,4 +543,6 @@ void btrfs_set_fs_info_ptr(struct btrfs_fs_info *fs_info);
 void btrfs_reset_fs_info_ptr(struct btrfs_fs_info *fs_info);

 bool btrfs_check_rw_degradable(struct btrfs_fs_info *fs_info);
+void btrfs_report_missing_device(struct btrfs_fs_info *fs_info, u64 devid,
+u8 *uuid);
 #endif




Re: [PATCH v2 4/6] btrfs: Allow barrier_all_devices to do chunk level device check

2017-03-06 Thread Qu Wenruo



At 03/07/2017 12:48 PM, Anand Jain wrote:



On 03/06/2017 04:58 PM, Qu Wenruo wrote:

The last user of num_tolerated_disk_barrier_failures is
barrier_all_devices().
But it can easily be changed to use the new per-chunk degradable check
framework.

Now btrfs_device will have two extra members, representing send/wait
error, set at write_dev_flush() time.
With these 2 new members, btrfs_check_rw_degradable() can check if the
fs is still OK when the fs is committed to disk.


 This logic isn't reentrant; it was earlier. How about using a
 stack variable instead?

Thanks, Anand


Thanks for the review.

However, I didn't quite get the point about reentrancy and the stack
variable here.


1) About reentrancy
In the previous version, the err_* bits were also kept in the btrfs_device
structure; only the timing of resetting these bits changes.


So either way, it's not reentrant in theory.

But that isn't a problem, as we have other protection when calling
write_all_supers(), the only caller of barrier_all_devices().


So could you give me an example of why we need it to be reentrant?

2) About using a stack variable
Did you mean building an array on the stack to record which devices fail
to send/wait, and using that array as the check condition instead of
btrfs_device->err_* and btrfs_device->missing?


The only problem is that it sounds more complex than needed.
Besides the err_* flags, we also need to check already-missing devices,
so I prefer the easy way: just check btrfs_device->err_* and
btrfs_device->missing.


Any simple example to explain your suggestion here?

Thanks,
Qu





Signed-off-by: Qu Wenruo 
---
 fs/btrfs/disk-io.c | 15 +++
 fs/btrfs/volumes.c |  4 +++-
 fs/btrfs/volumes.h |  4 
 3 files changed, 14 insertions(+), 9 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index c26b8a0b121c..f596bd130524 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -3569,17 +3569,17 @@ static int barrier_all_devices(struct
btrfs_fs_info *info)
 {
 struct list_head *head;
 struct btrfs_device *dev;
-int errors_send = 0;
-int errors_wait = 0;
 int ret;

 /* send down all the barriers */
 head = &info->fs_devices->devices;
 list_for_each_entry_rcu(dev, head, dev_list) {
+dev->err_wait = false;
+dev->err_send = false;
 if (dev->missing)
 continue;
 if (!dev->bdev) {
-errors_send++;
+dev->err_send = true;
 continue;
 }
 if (!dev->in_fs_metadata || !dev->writeable)
@@ -3587,7 +3587,7 @@ static int barrier_all_devices(struct
btrfs_fs_info *info)

 ret = write_dev_flush(dev, 0);
 if (ret)
-errors_send++;
+dev->err_send = true;
 }

 /* wait for all the barriers */
@@ -3595,7 +3595,7 @@ static int barrier_all_devices(struct
btrfs_fs_info *info)
 if (dev->missing)
 continue;
 if (!dev->bdev) {
-errors_wait++;
+dev->err_wait = true;
 continue;
 }
 if (!dev->in_fs_metadata || !dev->writeable)
@@ -3603,10 +3603,9 @@ static int barrier_all_devices(struct
btrfs_fs_info *info)

 ret = write_dev_flush(dev, 1);
 if (ret)
-errors_wait++;
+dev->err_wait = true;
 }
-if (errors_send > info->num_tolerated_disk_barrier_failures ||
-errors_wait > info->num_tolerated_disk_barrier_failures)
+if (!btrfs_check_rw_degradable(info))
 return -EIO;
 return 0;
 }
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index dd9dd94d7043..729cbd0d2b60 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -6796,7 +6796,9 @@ bool btrfs_check_rw_degradable(struct
btrfs_fs_info *fs_info)
 btrfs_get_num_tolerated_disk_barrier_failures(
 map->type);
 for (i = 0; i < map->num_stripes; i++) {
-if (map->stripes[i].dev->missing)
+if (map->stripes[i].dev->missing ||
+map->stripes[i].dev->err_wait ||
+map->stripes[i].dev->err_send)
 missing++;
 }
 if (missing > max_tolerated) {
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index db1b5ef479cf..112fccacdabc 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -75,6 +75,10 @@ struct btrfs_device {
 int can_discard;
 int is_tgtdev_for_dev_replace;

+/* If this device fails to send/wait dev flush */
+bool err_send;
+bool err_wait;





 #ifdef __BTRFS_NEED_DEVICE_DATA_ORDERED
 seqcount_t data_seqcount;
 #endif







--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Btrfs: fix regression in lock_delalloc_pages

2017-03-06 Thread Qu Wenruo



At 03/07/2017 10:20 AM, Liu Bo wrote:

The bug is a regression after commit
(da2c7009f6ca "btrfs: teach __process_pages_contig about PAGE_LOCK operation")
and commit
(76c0021db8fd "Btrfs: use helper to simplify lock/unlock pages").

So if the dirty pages under writeback got partially truncated before we
locked them, we could not find all the pages mapping to the delalloc
range. The buggy code did not return an error, so it kept going, found
that the delalloc range had been truncated, went on to unlock the dirty
pages, and then the ASSERT caught the error and showed

-
assertion failed: page_ops & PAGE_LOCK, file: fs/btrfs/extent_io.c, line: 1716
-


We also triggered this bug when running xfstests btrfs/070, although the 
probability is somewhat low.

However, since it's a matter of probability, any fsstress run may trigger it.


Thanks,
Qu



This fixes the bug by returning the proper -EAGAIN.

Cc: David Sterba 
Reported-by: Dave Jones 
Signed-off-by: Liu Bo 
---
 fs/btrfs/extent_io.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 28e8192..8df7974 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -1714,7 +1714,8 @@ static int __process_pages_contig(struct address_space 
*mapping,
 * can we find nothing at @index.
 */
ASSERT(page_ops & PAGE_LOCK);
-   return ret;
+   err = -EAGAIN;
+   goto out;
}

for (i = 0; i < ret; i++) {






[PATCH] btrfs: fix a bogus warning when converting only data or metadata

2017-03-06 Thread Adam Borowski
If your filesystem has, e.g., data:raid0 metadata:raid1, and you run "btrfs
balance -dconvert=raid1", the meta.target field will be uninitialized.
That's otherwise ok, as it's unused except for this warning.

Thus, let's use the existing set of raid levels for the comparison.

As a side effect, non-convert balances will now nag about data>metadata.

Signed-off-by: Adam Borowski 
---
To reproduce:
dd if=/dev/zero bs=1048576 count=1 seek=4095 of=ra
dd if=/dev/zero bs=1048576 count=1 seek=4095 of=rb
mkfs.btrfs ra rb # defaults to -draid0 -mraid1
losetup -f ra
losetup -f rb
mount /dev/loop0 /mnt/vol1
btrfs balance start -dconvert=raid1 /mnt/vol1

 fs/btrfs/volumes.c | 5 +
 1 file changed, 5 insertions(+)
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 3645af2749f8..c016db81ba43 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -3846,6 +3846,11 @@ int btrfs_balance(struct btrfs_balance_control *bctl,
}
} while (read_seqretry(&fs_info->profiles_lock, seq));
 
+   /* if we're not converting, the target field is uninitialized */
+   if (!(bctl->meta.flags & BTRFS_BALANCE_ARGS_CONVERT))
+   bctl->meta.target = fs_info->avail_metadata_alloc_bits;
+   if (!(bctl->data.flags & BTRFS_BALANCE_ARGS_CONVERT))
+   bctl->data.target = fs_info->avail_data_alloc_bits;
if (btrfs_get_num_tolerated_disk_barrier_failures(bctl->meta.target) <

btrfs_get_num_tolerated_disk_barrier_failures(bctl->data.target)) {
btrfs_warn(fs_info,
-- 
2.11.0



Re: [PATCH v2 4/6] btrfs: Allow barrier_all_devices to do chunk level device check

2017-03-06 Thread Qu Wenruo



At 03/07/2017 02:55 PM, Anand Jain wrote:






1) About reentrancy
In the previous version, the err_* bits were still put into the btrfs_device
structure; only the timing of resetting these bits changed.

So either way, it's not reentrant in theory.

But that isn't a problem, as we have other protection when calling
write_all_supers(), the only caller of barrier_all_devices().

So could you give me an example of why we need to make it reentrant?


 It's updating the device struct; I would avoid such a change
 for the reasons of this patch (I noticed that in v1 as well).
 Further, btrfs does not handle online intermittent device failure;
 unless the complete story is taken care of, I would avoid such a change.

 Theoretically this patch is buggy: btrfs_check_rw_degradable() has
 more consumers than just barrier_all_devices(), but dev->err_wait and
 dev->err_send are managed only by barrier_all_devices(). The bad news
 is that dev->err_wait and dev->err_send affect the "missing" count
 coming out of btrfs_check_rw_degradable(), which is wrong for threads
 other than barrier_all_devices(). Further, the only way dev->err_wait
 and dev->err_send get reset is by the next call to
 barrier_all_devices(). And if a lock is the answer, that would make it
 messy and complicated. I won't do that.

 There is a simple solution, as below:


2) About using stack variable?


 Pass err_send and err_wait to btrfs_check_rw_degradable() through
 arguments, so the degradable check is computed for barrier_all_devices().
 In this way the changes are kept local and thread-specific.


In this way, we need to build an expandable structure to record the 
devid <-> err_send/wait mapping.


A simple array is not suitable here, as the starting devid can be either 1 
or 0, depending on whether we're replacing.
Furthermore, devids are not always sequential; we can have holes in devid 
allocation.


Although this approach will indeed have less impact on the btrfs_device 
structure.


I'll update the patchset and try this method; I just hope it won't 
introduce too much new code.


Thanks for the advice,
Qu



Thanks, Anand



Did you mean building an array on the stack to record which devices fail to
send/wait, and using that array as the check condition instead of
btrfs_device->err_* and btrfs_device->missing?

The only problem is that it sounds more complex than needed.
Besides the err_* bits, we also need to check already-missing devices, so I
prefer the easy way: just checking btrfs_device->err_* and
btrfs_device->missing.

Could you give a simple example to explain your suggestion here?

Thanks,
Qu





Signed-off-by: Qu Wenruo 
---
 fs/btrfs/disk-io.c | 15 +++
 fs/btrfs/volumes.c |  4 +++-
 fs/btrfs/volumes.h |  4 
 3 files changed, 14 insertions(+), 9 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index c26b8a0b121c..f596bd130524 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -3569,17 +3569,17 @@ static int barrier_all_devices(struct
btrfs_fs_info *info)
 {
 struct list_head *head;
 struct btrfs_device *dev;
-int errors_send = 0;
-int errors_wait = 0;
 int ret;

 /* send down all the barriers */
 head = &info->fs_devices->devices;
 list_for_each_entry_rcu(dev, head, dev_list) {
+dev->err_wait = false;
+dev->err_send = false;
 if (dev->missing)
 continue;
 if (!dev->bdev) {
-errors_send++;
+dev->err_send = true;
 continue;
 }
 if (!dev->in_fs_metadata || !dev->writeable)
@@ -3587,7 +3587,7 @@ static int barrier_all_devices(struct
btrfs_fs_info *info)

 ret = write_dev_flush(dev, 0);
 if (ret)
-errors_send++;
+dev->err_send = true;
 }

 /* wait for all the barriers */
@@ -3595,7 +3595,7 @@ static int barrier_all_devices(struct
btrfs_fs_info *info)
 if (dev->missing)
 continue;
 if (!dev->bdev) {
-errors_wait++;
+dev->err_wait = true;
 continue;
 }
 if (!dev->in_fs_metadata || !dev->writeable)
@@ -3603,10 +3603,9 @@ static int barrier_all_devices(struct
btrfs_fs_info *info)

 ret = write_dev_flush(dev, 1);
 if (ret)
-errors_wait++;
+dev->err_wait = true;
 }
-if (errors_send > info->num_tolerated_disk_barrier_failures ||
-errors_wait > info->num_tolerated_disk_barrier_failures)
+if (!btrfs_check_rw_degradable(info))
 return -EIO;
 return 0;
 }
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index dd9dd94d7043..729cbd0d2b60 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -6796,7 +6796,9 @@ bool btrfs_check_rw_degradable(struct
btrfs_fs_info *fs_info)
 btrfs_get_num_tolerated_disk_barrier_failures(
 map->type);
 for (i = 0; i < map->num_stripes; i++) {
-if (map->stripes[i].dev->missing)
+if (map->stripes[i].dev->missing ||
+  

Re: [PATCH 00/17] fs, btrfs refcount conversions

2017-03-06 Thread Qu Wenruo



At 03/06/2017 05:43 PM, Reshetova, Elena wrote:



At 03/03/2017 04:55 PM, Elena Reshetova wrote:

Now that the new refcount_t type and API are finally merged
(see include/linux/refcount.h), the following
patches convert various refcounters in the btrfs filesystem from atomic_t
to refcount_t. By doing this we prevent intentional or accidental
underflows or overflows that can lead to use-after-free vulnerabilities.

The below patches are fully independent and can be cherry-picked separately.
Since we convert all kernel subsystems in the same fashion, resulting
in about 300 patches, we have to group them for sending at least in some
fashion to be manageable. Please excuse the long cc list.

These patches have been tested with xfstests by running btrfs-related tests.
btrfs debug was enabled, which also warns on refcount errors. No output related
to refcount errors was produced. However, the following errors were seen during
the run:
 * tests btrfs/078, btrfs/114, btrfs/115: no errors anywhere in dmesg, but the
 process hangs. They all seem to be around qgroup; sometimes an error such as
 "qgroup scan failed -4" is visible before it blocks, but not always.


How reproducible is the hang?


Always in my environment, but I would not go far into investigating why it 
happens if it works for you.
My test environment is far from ideal: I am testing in a VM with a rather old 
userspace and a couple of additional changes in,
so there are many things that can potentially go wrong. Anyway, the strace for 
078 is in the attachment.


Thanks for the strace.

However, no "-f" was passed to strace, so it doesn't contain much useful info 
(child processes are not traced).



If the patches pass all tests on your side, could you please take them in and 
propagate further?
I will continue with other kernel subsystems.


The patchset itself looks like a common cleanup, but I did encounter 
several cases (almost all scrub tests) causing kernel warnings due to 
underflow.


So I'm afraid the patchset will not be merged until we fix all the 
underflows.


But thanks for the patchset; it helps us expose a lot of problems.

Thanks,
Qu



Best Regards,
Elena.




I also see the -EINTR output, but that seems to be designed for
btrfs/11[45].

btrfs/078 is unrelated to qgroup, and all these three test pass in my
test environment, which is v4.11-rc1 with your patches applied.

I ran these 3 tests in a row with default and space_cache=v2 mount
options, and 5 times for each mount option, no hang at all.

It would help a lot if more info could be provided, from the blocked process
backtrace to the test mount options to the base commit.

Thanks,
Qu


 * test btrfs/104 dmesg has additional error output:
 BTRFS warning (device vdc): qgroup 258 reserved space underflow, have: 0,
 to free: 4096
 I tried looking at the code to see what causes the failure, but could not figure
 it out. It doesn't seem to be related to any refcount changes, at least IMO.

The above test failures are hard for me to understand and interpret, but
they don't seem to relate to the refcount conversions.

Elena Reshetova (17):
  fs, btrfs: convert btrfs_bio.refs from atomic_t to refcount_t
  fs, btrfs: convert btrfs_transaction.use_count from atomic_t to
refcount_t
  fs, btrfs: convert extent_map.refs from atomic_t to refcount_t
  fs, btrfs: convert btrfs_ordered_extent.refs from atomic_t to
refcount_t
  fs, btrfs: convert btrfs_caching_control.count from atomic_t to
refcount_t
  fs, btrfs: convert btrfs_delayed_ref_node.refs from atomic_t to
refcount_t
  fs, btrfs: convert btrfs_delayed_node.refs from atomic_t to refcount_t
  fs, btrfs: convert btrfs_delayed_item.refs from atomic_t to refcount_t
  fs, btrfs: convert btrfs_root.refs from atomic_t to refcount_t
  fs, btrfs: convert extent_state.refs from atomic_t to refcount_t
  fs, btrfs: convert compressed_bio.pending_bios from atomic_t to
refcount_t
  fs, btrfs: convert scrub_recover.refs from atomic_t to refcount_t
  fs, btrfs: convert scrub_page.refs from atomic_t to refcount_t
  fs, btrfs: convert scrub_block.refs from atomic_t to refcount_t
  fs, btrfs: convert scrub_parity.refs from atomic_t to refcount_t
  fs, btrfs: convert scrub_ctx.refs from atomic_t to refcount_t
  fs, btrfs: convert btrfs_raid_bio.refs from atomic_t to refcount_t

 fs/btrfs/backref.c   |  2 +-
 fs/btrfs/compression.c   | 18 -
 fs/btrfs/ctree.h |  5 +++--
 fs/btrfs/delayed-inode.c | 46 ++--
 fs/btrfs/delayed-inode.h |  5 +++--
 fs/btrfs/delayed-ref.c   |  8 
 fs/btrfs/delayed-ref.h   |  8 +---
 fs/btrfs/disk-io.c   |  6 +++---
 fs/btrfs/disk-io.h   |  4 ++--
 fs/btrfs/extent-tree.c   | 20 +--
 fs/btrfs/extent_io.c | 18 -
 fs/btrfs/extent_io.h |  3 ++-
 fs/btrfs/extent_map.c| 10 +-
 fs/btrfs/extent_map.h|  3 ++-
 fs/btrfs/ordered-data.c  | 20 +--
 fs/btrfs/ordered-data.h  |  2 +-
 fs/btrfs/r

RE: [PATCH 00/17] fs, btrfs refcount conversions

2017-03-06 Thread Reshetova, Elena
> At 03/06/2017 05:43 PM, Reshetova, Elena wrote:
> >
> >> At 03/03/2017 04:55 PM, Elena Reshetova wrote:
> >>> Now when new refcount_t type and API are finally merged
> >>> (see include/linux/refcount.h), the following
> >>> patches convert various refcounters in the btrfs filesystem from atomic_t
> >>> to refcount_t. By doing this we prevent intentional or accidental
> >>> underflows or overflows that can led to use-after-free vulnerabilities.
> >>>
> >>> The below patches are fully independent and can be cherry-picked 
> >>> separately.
> >>> Since we convert all kernel subsystems in the same fashion, resulting
> >>> in about 300 patches, we have to group them for sending at least in some
> >>> fashion to be manageable. Please excuse the long cc list.
> >>>
> >>> These patches have been tested with xfstests by running btrfs-related 
> >>> tests.
> >>> btrfs debug was enabled, warns on refcount errors, too. No output related 
> >>> to
> >>> refcount errors produced. However, the following errors were during the 
> >>> run:
> >>>  * tests btrfs/078, btrfs/114, btrfs/115, no errors anywhere in dmesg, but
> >>>  process hangs. They all seem to be around qgroup, sometimes error visible
> >>>  such as qgroup scan failed -4 before it blocks, but not always.
> >>
> >> How reproducible of the hang?
> >
> > Always in  my environment, but I would not much go into investigating why it
> happens, if it works for you.
> > My test environment is far from ideal: I am testing in VM with rather old
> userspace and couple of additional changes in,
> > so there are many things that can potentially go wrong. Anyway the strace 
> > for
> 078 is in the attachment.
> 
> Thanks for the strace.
> 
> However no "-f" is passed to strace, so it doesn't contain much useful info.
> 
> >
> > If the patches pass all tests on your side, could you please take them in 
> > and
> propagate further?
> > I will continue with other kernel subsystems.
> 
> The patchset itself looks like a common cleanup, while I did encounter
> several cases (almost all scrub tests) causing kernel warning due to
> underflow.

Oh, could you please send me the warning outputs? I can hopefully analyze and 
fix them. 

Best Regards,
Elena.

> 
> So I'm afraid the patchset will not be merged until we fix all the
> underflows.
> 
> But thanks for the patchset, it helps us to expose a lot of problem.
> 
> Thanks,
> Qu
> 
> >
> > Best Regards,
> > Elena.
> >
> >
> >>
> >> I also see the -EINTR output, but that seems to be designed for
> >> btrfs/11[45].
> >>
> >> btrfs/078 is unrelated to qgroup, and all these three test pass in my
> >> test environment, which is v4.11-rc1 with your patches applied.
> >>
> >> I ran these 3 tests in a row with default and space_cache=v2 mount
> >> options, and 5 times for each mount option, no hang at all.
> >>
> >> It would help much if more info can be provided, from blocked process
> >> backtrace to test mount option to base commit.
> >>
> >> Thanks,
> >> Qu
> >>
> >>>  * test btrfs/104 dmesg has additional error output:
> >>>  BTRFS warning (device vdc): qgroup 258 reserved space underflow, have: 0,
> >>>  to free: 4096
> >>>  I tried looking at the code on what causes the failure, but could not 
> >>> figure
> >>>  it out. It doesn't seem to be related to any refcount changes at least 
> >>> IMO.
> >>>
> >>> The above test failures are hard for me to understand and interpreted, but
> >>> they don't seem to relate to refcount conversions.
> >>>
> >>> Elena Reshetova (17):
> >>>   fs, btrfs: convert btrfs_bio.refs from atomic_t to refcount_t
> >>>   fs, btrfs: convert btrfs_transaction.use_count from atomic_t to
> >>> refcount_t
> >>>   fs, btrfs: convert extent_map.refs from atomic_t to refcount_t
> >>>   fs, btrfs: convert btrfs_ordered_extent.refs from atomic_t to
> >>> refcount_t
> >>>   fs, btrfs: convert btrfs_caching_control.count from atomic_t to
> >>> refcount_t
> >>>   fs, btrfs: convert btrfs_delayed_ref_node.refs from atomic_t to
> >>> refcount_t
> >>>   fs, btrfs: convert btrfs_delayed_node.refs from atomic_t to refcount_t
> >>>   fs, btrfs: convert btrfs_delayed_item.refs from atomic_t to refcount_t
> >>>   fs, btrfs: convert btrfs_root.refs from atomic_t to refcount_t
> >>>   fs, btrfs: convert extent_state.refs from atomic_t to refcount_t
> >>>   fs, btrfs: convert compressed_bio.pending_bios from atomic_t to
> >>> refcount_t
> >>>   fs, btrfs: convert scrub_recover.refs from atomic_t to refcount_t
> >>>   fs, btrfs: convert scrub_page.refs from atomic_t to refcount_t
> >>>   fs, btrfs: convert scrub_block.refs from atomic_t to refcount_t
> >>>   fs, btrfs: convert scrub_parity.refs from atomic_t to refcount_t
> >>>   fs, btrfs: convert scrub_ctx.refs from atomic_t to refcount_t
> >>>   fs, btrfs: convert btrfs_raid_bio.refs from atomic_t to refcount_t
> >>>
> >>>  fs/btrfs/backref.c   |  2 +-
> >>>  fs/btrfs/compression.c   | 18 -
> >>>  fs/btrfs/ctree.h |