[PATCH v2] btrfs: handle failures of set_extent_bits in add_excluded_extent

2018-05-22 Thread Gu Jinxiang
set_extent_bits may fail, return the result in add_excluded_extent.

Signed-off-by: Gu Jinxiang 
Changelog:
v2-v1:
 1.remove goto to make the function run linearly.
 2.change commit description not pointing out the failure detail,
   since set_extent_bits's failure type may be added.
---
 fs/btrfs/extent-tree.c | 10 +++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 75cfb80d2551..65ef3456fa62 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -215,11 +215,15 @@ static int add_excluded_extent(struct btrfs_fs_info 
*fs_info,
   u64 start, u64 num_bytes)
 {
u64 end = start + num_bytes - 1;
-   set_extent_bits(_info->freed_extents[0],
+   int ret = 0;
+
+   ret = set_extent_bits(_info->freed_extents[0],
start, end, EXTENT_UPTODATE);
-   set_extent_bits(_info->freed_extents[1],
+   if (ret)
+   return ret;
+   ret = set_extent_bits(_info->freed_extents[1],
start, end, EXTENT_UPTODATE);
-   return 0;
+   return ret;
 }
 
 static void free_excluded_extents(struct btrfs_fs_info *fs_info,
-- 
1.9.1



--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v3] btrfs: drop uuid_mutex in btrfs_free_extra_devids()

2018-05-22 Thread Anand Jain
btrfs_free_extra_devids() is called only in the mount context which
traverses through the fs_devices::devices and frees the orphan devices
devices in the given %fs_devices if any. As the search for the orphan
device is limited to fs_devices::devices so we don't need the global
uuid_mutex.

There can't be any mount-point based ioctl threads in this context as
the mount thread is not yet returned. But there can be the btrfs-control
based scan ioctls thread which calls device_list_add().

Here in the mount thread the fs_devices::opened is incremented way before
btrfs_free_extra_devids() is called and in the scan context the fs_devices
which are already opened neither be freed or alloc-able at
device_list_add().

But lets say you change the device-path and call the scan again, then scan
would update the new device path and this operation could race against the
btrfs_free_extra_devids() thread, which might be in the process of
free-ing the same device. So synchronize it by using the
device_list_mutex.

This scenario is a very corner case, and practically the scan and mount
are anyway serialized by the usage so unless the race is instrumented its
very difficult to achieve.

Signed-off-by: Anand Jain 
---
Currently device_list_add() is very lean on its device_list_mutex usage,
a cleanup and fix is wip. Given the practicality of the above race
condition this patch is good to merge.

 fs/btrfs/volumes.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index b6757b53c297..a9c1f4f7ebd0 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -925,7 +925,7 @@ void btrfs_free_extra_devids(struct btrfs_fs_devices 
*fs_devices, int step)
struct btrfs_device *device, *next;
struct btrfs_device *latest_dev = NULL;
 
-   mutex_lock(_mutex);
+   mutex_lock(_devices->device_list_mutex);
 again:
/* This is the initialized path, it is safe to release the devices. */
list_for_each_entry_safe(device, next, _devices->devices, dev_list) {
@@ -979,8 +979,7 @@ void btrfs_free_extra_devids(struct btrfs_fs_devices 
*fs_devices, int step)
}
 
fs_devices->latest_bdev = latest_dev->bdev;
-
-   mutex_unlock(_mutex);
+   mutex_unlock(_devices->device_list_mutex);
 }
 
 static void free_device_rcu(struct rcu_head *head)
-- 
2.7.0

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 0/6] btrfs_search_slot cleanups

2018-05-22 Thread Su Yue


On 05/22/2018 08:35 PM, Nikolay Borisov wrote:
> 
> 
> On 22.05.2018 15:02, David Sterba wrote:
>> On Tue, May 22, 2018 at 07:05:14PM +0800, Su Yue wrote:
>>> Hi Liu and David,
>>> During my local xfstests on kdave/for-next, btrfs/139 failed and
>>> btrfs BUG_ON due to qgroup rescan.
>>> The bisect result is commit 560215eb3f32("Merge branch
>>> 'ext/liubo/search-cleanups-wip' into for-next-next-v4.18-20180521")
>>> which seems merged this patchset.
>>> The dmesg file is attached.
>>>
>>> On 05/18/2018 11:00 AM, Liu Bo wrote:
 Here are a collection of patches I did for btrfs_search_slot().

 v2: more explicit commit log for each patch.

 Liu Bo (6):
   Btrfs: remove superfluous free_extent_buffer
   Btrfs: use more straightforward extent_buffer_uptodate
   Btrfs: move get root of btrfs_search_slot to a helper
   Btrfs: remove unused check of skip_locking
   Btrfs: grab write lock directly if write_lock_level is the max level
   Btrfs: remove always true check in unlock_up

  fs/btrfs/ctree.c | 121 
 +--
  1 file changed, 73 insertions(+), 48 deletions(-)

>>>
>>>
>>
>>> [   46.129166] BTRFS info (device vdb): disk space caching is enabled
>>> [   46.130545] BTRFS info (device vdb): has skinny extents
>>> [   46.171386] mount (2798) used greatest stack depth: 12920 bytes left
>>> [   46.508170] BTRFS: device fsid 83a117c7-a9ea-4bf5-b42f-7092078610d5 
>>> devid 1 transid 5 /dev/vdc
>>> [   46.562428] BTRFS info (device vdc): disk space caching is enabled
>>> [   46.563690] BTRFS info (device vdc): has skinny extents
>>> [   46.564563] BTRFS info (device vdc): flagging fs with big metadata 
>>> feature
>>> [   46.587441] BTRFS info (device vdc): checking UUID tree
>>> [   46.766765] BTRFS info (device vdb): disk space caching is enabled
>>> [   46.768197] BTRFS info (device vdb): has skinny extents
>>> [   46.875534] run fstests btrfs/139 at 2018-05-22 18:40:36
>>> [   47.559411] BTRFS: device fsid 065f3825-057e-451f-8722-0d94d4a3f36f 
>>> devid 1 transid 5 /dev/vdc
>>> [   47.612001] BTRFS info (device vdc): disk space caching is enabled
>>> [   47.613254] BTRFS info (device vdc): has skinny extents
>>> [   47.614147] BTRFS info (device vdc): flagging fs with big metadata 
>>> feature
>>> [   47.632377] BTRFS info (device vdc): checking UUID tree
>>> [   47.681656] btrfs (3176) used greatest stack depth: 12632 bytes left
>>> [   47.691156] [ cut here ]
>>> [   47.692084] kernel BUG at fs/btrfs/locking.c:286!
>>
>> I saw the crash too but did not investigate the root cause. So I'll
>> remove the branch from for-next until it's fixed. Thanks for the report.
> 
> I think the problem stems from Qu's patch, which sets search_commit_root
> =1 but doesn't set skip_locking, as a result we don't lock the tree when
> we obtain a reference to the root node, yet later when traversing the
> tree due to skip_locking not being set we try to lock it, and this
> causes btrfs_assert_tree_locked to triggers. Can you test whether the
> following diff solves the issues:
> 
> 
> diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
> index bc19a7d11c98..23fadb640c59 100644
> --- a/fs/btrfs/qgroup.c
> +++ b/fs/btrfs/qgroup.c
> @@ -2702,6 +2702,7 @@ static void btrfs_qgroup_rescan_worker(struct
> btrfs_work *work)
>  * should be recorded by qgroup
>  */
> path->search_commit_root = 1;
> +   path->skip_locking = 1;
> 
> err = 0;
> while (!err && !btrfs_fs_closing(fs_info)) {
> 
> 
> If it does, this only means we need to make skip_locking = 1 being
> conditional on search_commit_root being set and this situation should be
> handled in btrfs_search_slot.

After patching the change, btrfs/139 passes without BUG_ON.

Thanks,
Su

> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> 




pEpkey.asc
Description: application/pgp-keys


Re: [PATCH] Btrfs: fix memory and mount leak in btrfs_ioctl_rm_dev_v2()

2018-05-22 Thread Su Yue


On 05/23/2018 06:44 AM, Omar Sandoval wrote:
> From: Omar Sandoval 
> 
> If we have invalid flags set, when we error out we must drop our writer
> counter and free the buffer we allocated for the arguments. This bug is
> trivially reproduced with the following program:
> 
>   #include 
>   #include 
>   #include 
>   #include 
>   #include 
>   #include 
>   #include 
>   #include 
>   #include 
>   #include 
> 
>   int main(int argc, char **argv)
>   {
>   struct btrfs_ioctl_vol_args_v2 vol_args = {
>   .flags = UINT64_MAX,
>   };
>   int ret;
>   int fd;
> 
>   if (argc != 2) {
>   fprintf(stderr, "usage: %s PATH\n", argv[0]);
>   return EXIT_FAILURE;
>   }
> 
>   fd = open(argv[1], O_WRONLY);
>   if (fd == -1) {
>   perror("open");
>   return EXIT_FAILURE;
>   }
> 
>   ret = ioctl(fd, BTRFS_IOC_RM_DEV_V2, _args);
>   if (ret == -1)
>   perror("ioctl");
> 
>   close(fd);
>   return EXIT_SUCCESS;
>   }
> 
> When unmounting the filesystem, we'll hit the
> WARN_ON(mnt_get_writers(mnt)) in cleanup_mnt().
> 
> Fixes: 6b526ed70cf1 ("btrfs: introduce device delete by devid")
> Signed-off-by: Omar Sandoval 

Reviewed-by: Su Yue 

> ---
> Sigh, I keep stepping in these ioctl bugs just trying to get my swap
> file series ready. This one is based on top of my "Btrfs: fix partly
> checksummed file races" series, although it should apply to for-next
> without it +/- some moved lines.
> 
>  fs/btrfs/ioctl.c | 6 --
>  1 file changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
> index c2263bf4d6f5..9af3be96099f 100644
> --- a/fs/btrfs/ioctl.c
> +++ b/fs/btrfs/ioctl.c
> @@ -3086,8 +3086,10 @@ static long btrfs_ioctl_rm_dev_v2(struct file *file, 
> void __user *arg)
>   }
>  
>   /* Check for compatibility reject unknown flags */
> - if (vol_args->flags & ~BTRFS_VOL_ARG_V2_FLAGS_SUPPORTED)
> - return -EOPNOTSUPP;
> + if (vol_args->flags & ~BTRFS_VOL_ARG_V2_FLAGS_SUPPORTED) {
> + ret = -EOPNOTSUPP;
> + goto out;
> + }
>  
>   if (test_and_set_bit(BTRFS_FS_EXCL_OP, _info->flags)) {
>   ret = BTRFS_ERROR_DEV_EXCL_RUN_IN_PROGRESS;
> 




pEpkey.asc
Description: application/pgp-keys


Re: [PATCH v2 3/4] btrfs: lzo: Add header length check to avoid slab out of bounds access

2018-05-22 Thread Qu Wenruo


On 2018年05月22日 23:06, David Sterba wrote:
> On Mon, May 21, 2018 at 01:19:26PM +0800, Qu Wenruo wrote:
>> James Harvey reported that some corrupted compressed extent data can
>> lead to various kernel memory corruption.
>>
>> Such corrupted extent data belongs to inode with NODATASUM flags, thus
>> data csum won't help us detecting such bug.
>>
>> If lucky enough, kasan could catch it like:
>> ==
>> BUG: KASAN: slab-out-of-bounds in lzo_decompress_bio+0x384/0x7a0 [btrfs]
>> Write of size 4096 at addr 8800606cb0f8 by task kworker/u16:0/2338
>>
>> CPU: 3 PID: 2338 Comm: kworker/u16:0 Tainted: G   O  
>> 4.17.0-rc5-custom+ #50
>> Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
>> Workqueue: btrfs-endio btrfs_endio_helper [btrfs]
>> Call Trace:
>>  dump_stack+0xc2/0x16b
>>  print_address_description+0x6a/0x270
>>  kasan_report+0x260/0x380
>>  memcpy+0x34/0x50
>>  lzo_decompress_bio+0x384/0x7a0 [btrfs]
>>  end_compressed_bio_read+0x99f/0x10b0 [btrfs]
>>  bio_endio+0x32e/0x640
>>  normal_work_helper+0x15a/0xea0 [btrfs]
>>  process_one_work+0x7e3/0x1470
>>  worker_thread+0x1b0/0x1170
>>  kthread+0x2db/0x390
>>  ret_from_fork+0x22/0x40
>> ...
>> ==
>>
>> The offending compressed data has the following info:
>>
>> Header:  length 32768(Looks completely valid)
>> Segment 0 Header:length 3472882419   (Obvious out of bounds)
>>
>> Then when handling segment 0, since it's over the current page, we need
>> the compressed data to workspace, then such large size would trigger
>> out-of-bounds memory access, screwing up the whole kernel.
>>
>> Fix it by adding extra checks on header and segment headers to ensure we
>> won't access out-of-bounds, and even checks the decompressed data won't
>> be out-of-bounds.
> 
> Good, feel free to add more if you find them. The BUG_ON in
> lzo_decompress can be replaced by return -EUCLEAN and the total size can
> be compared with the segment size if they match too.
> 
>> Reported-by: James Harvey 
>> Signed-off-by: Qu Wenruo 
>> ---
>>  fs/btrfs/lzo.c | 35 ++-
>>  1 file changed, 34 insertions(+), 1 deletion(-)
>>
>> diff --git a/fs/btrfs/lzo.c b/fs/btrfs/lzo.c
>> index d0c6789ff78f..4c75dcba3f04 100644
>> --- a/fs/btrfs/lzo.c
>> +++ b/fs/btrfs/lzo.c
>> @@ -281,6 +281,7 @@ static int lzo_decompress_bio(struct list_head *ws, 
>> struct compressed_bio *cb)
>>  unsigned long working_bytes;
>>  size_t in_len;
>>  size_t out_len;
>> +size_t max_segment_len = lzo1x_worst_compress(PAGE_SIZE);
> 
> The value is a compile-time constant, it does not need to be stored in a
> variable.

Just to save some long lines, as it's used several times.

> 
>>  unsigned long in_offset;
>>  unsigned long in_page_bytes_left;
>>  unsigned long tot_in;
>> @@ -294,6 +295,18 @@ static int lzo_decompress_bio(struct list_head *ws, 
>> struct compressed_bio *cb)
>>  
>>  data_in = kmap(pages_in[0]);
>>  tot_len = read_compress_length(data_in);
>> +/*
>> + * Compressed data header check.
>> + *
>> + * The real compressed size can't exceed extent length, and all pages
>> + * should be used (a full pending page is not possible).
>> + * If this happens it means the compressed extent is corrupted.
>> + */
>> +if (tot_len > min_t(size_t, BTRFS_MAX_COMPRESSED, srclen) ||
>> +tot_len < srclen - PAGE_SIZE) {
> 
> All such conditions can be put into unlikely() as this is an error
> handling shortcut.

I'm OK to put it into unlikely().

I'll update this in next version.

Thanks,
Qu

> 
>> +ret = -EUCLEAN;
>> +goto done;
>> +}
>>  
>>  tot_in = LZO_LEN;
>>  in_offset = LZO_LEN;
>> @@ -308,6 +321,17 @@ static int lzo_decompress_bio(struct list_head *ws, 
>> struct compressed_bio *cb)
>>  in_offset += LZO_LEN;
>>  tot_in += LZO_LEN;
>>  
>> +/*
>> + * Segment header check.
>> + *
>> + * The segment length must not exceed max lzo compression
>> + * size, nor the total compressed size
>> + */
>> +if (in_len > max_segment_len || tot_in + in_len > tot_len) {
>> +ret = -EUCLEAN;
>> +goto done;
>> +}
>> +
>>  tot_in += in_len;
>>  working_bytes = in_len;
>>  may_late_unmap = need_unmap = false;
>> @@ -358,7 +382,7 @@ static int lzo_decompress_bio(struct list_head *ws, 
>> struct compressed_bio *cb)
>>  }
>>  }
>>  
>> -out_len = lzo1x_worst_compress(PAGE_SIZE);
>> +out_len = max_segment_len;
>>  ret = lzo1x_decompress_safe(buf, in_len, workspace->buf,
>>

[PATCH v3 1/5] xfstests: create swap group

2018-05-22 Thread Omar Sandoval
From: Omar Sandoval 

I'm going to add a bunch of tests for swap files, so create a group for
them and add the existing tests.

Signed-off-by: Omar Sandoval 
---
 tests/generic/group | 4 ++--
 tests/xfs/group | 2 +-
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/tests/generic/group b/tests/generic/group
index 4fc3e457..111e42e7 100644
--- a/tests/generic/group
+++ b/tests/generic/group
@@ -358,8 +358,8 @@
 353 auto quick clone
 354 auto
 355 auto quick
-356 auto quick clone
-357 auto quick clone
+356 auto quick clone swap
+357 auto quick clone swap
 358 auto quick clone
 359 auto quick clone
 360 auto quick metadata
diff --git a/tests/xfs/group b/tests/xfs/group
index 51326d95..e5fd1c6d 100644
--- a/tests/xfs/group
+++ b/tests/xfs/group
@@ -416,7 +416,7 @@
 416 dangerous_fuzzers dangerous_scrub dangerous_repair
 417 dangerous_fuzzers dangerous_scrub dangerous_online_repair
 418 dangerous_fuzzers dangerous_scrub dangerous_repair
-419 auto quick
+419 auto quick swap
 420 auto quick clone dedupe
 421 auto quick clone dedupe
 422 dangerous_scrub dangerous_online_repair
-- 
2.17.0

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v3 5/5] generic: test invalid swap file activation

2018-05-22 Thread Omar Sandoval
From: Omar Sandoval 

Swap files cannot have holes, and they must at least two pages.
swapon(8) and mkswap(8) have stricter restrictions, so add versions of
those commands without any restrictions.

Signed-off-by: Omar Sandoval 
---
 .gitignore|  2 ++
 src/Makefile  |  2 +-
 src/mkswap.c  | 83 +++
 src/swapon.c  | 24 +
 tests/generic/494 | 77 +++
 tests/generic/494.out |  5 +++
 tests/generic/group   |  1 +
 7 files changed, 193 insertions(+), 1 deletion(-)
 create mode 100644 src/mkswap.c
 create mode 100644 src/swapon.c
 create mode 100755 tests/generic/494
 create mode 100644 tests/generic/494.out

diff --git a/.gitignore b/.gitignore
index 53029e24..efc73a7c 100644
--- a/.gitignore
+++ b/.gitignore
@@ -92,6 +92,7 @@
 /src/lstat64
 /src/makeextents
 /src/metaperf
+/src/mkswap
 /src/mmapcat
 /src/multi_open_unlink
 /src/nametest
@@ -111,6 +112,7 @@
 /src/seek_sanity_test
 /src/stale_handle
 /src/stat_test
+/src/swapon
 /src/t_access_root
 /src/t_dir_offset
 /src/t_dir_offset2
diff --git a/src/Makefile b/src/Makefile
index c42d3bb1..01fe99ef 100644
--- a/src/Makefile
+++ b/src/Makefile
@@ -26,7 +26,7 @@ LINUX_TARGETS = xfsctl bstat t_mtab getdevicesize 
preallo_rw_pattern_reader \
renameat2 t_getcwd e4compact test-nextquota punch-alternating \
attr-list-by-handle-cursor-test listxattr dio-interleaved t_dir_type \
dio-invalidate-cache stat_test t_encrypted_d_revalidate \
-   attr_replace_test
+   attr_replace_test swapon mkswap
 
 SUBDIRS = log-writes perf
 
diff --git a/src/mkswap.c b/src/mkswap.c
new file mode 100644
index ..d0bce2bd
--- /dev/null
+++ b/src/mkswap.c
@@ -0,0 +1,83 @@
+/* mkswap(8) without any sanity checks */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+struct swap_header {
+   charbootbits[1024];
+   uint32_tversion;
+   uint32_tlast_page;
+   uint32_tnr_badpages;
+   unsigned char   sws_uuid[16];
+   unsigned char   sws_volume[16];
+   uint32_tpadding[117];
+   uint32_tbadpages[1];
+};
+
+int main(int argc, char **argv)
+{
+   struct swap_header *hdr;
+   FILE *file;
+   struct stat st;
+   long page_size;
+   int ret;
+
+   if (argc != 2) {
+   fprintf(stderr, "usage: %s PATH\n", argv[0]);
+   return EXIT_FAILURE;
+   }
+
+   page_size = sysconf(_SC_PAGESIZE);
+   if (page_size == -1) {
+   perror("sysconf");
+   return EXIT_FAILURE;
+   }
+
+   hdr = calloc(1, page_size);
+   if (!hdr) {
+   perror("calloc");
+   return EXIT_FAILURE;
+   }
+
+   file = fopen(argv[1], "r+");
+   if (!file) {
+   perror("fopen");
+   free(hdr);
+   return EXIT_FAILURE;
+   }
+
+   ret = fstat(fileno(file), );
+   if (ret) {
+   perror("fstat");
+   free(hdr);
+   fclose(file);
+   return EXIT_FAILURE;
+   }
+
+   hdr->version = 1;
+   hdr->last_page = st.st_size / page_size - 1;
+   memset(>sws_uuid, 0x99, sizeof(hdr->sws_uuid));
+   memcpy((char *)hdr + page_size - 10, "SWAPSPACE2", 10);
+
+   if (fwrite(hdr, page_size, 1, file) != 1) {
+   perror("fwrite");
+   free(hdr);
+   fclose(file);
+   return EXIT_FAILURE;
+   }
+
+   if (fclose(file) == EOF) {
+   perror("fwrite");
+   free(hdr);
+   return EXIT_FAILURE;
+   }
+
+   free(hdr);
+
+   return EXIT_SUCCESS;
+}
diff --git a/src/swapon.c b/src/swapon.c
new file mode 100644
index ..0cb7108a
--- /dev/null
+++ b/src/swapon.c
@@ -0,0 +1,24 @@
+/* swapon(8) without any sanity checks; simply calls swapon(2) directly. */
+
+#include 
+#include 
+#include 
+#include 
+
+int main(int argc, char **argv)
+{
+   int ret;
+
+   if (argc != 2) {
+   fprintf(stderr, "usage: %s PATH\n", argv[0]);
+   return EXIT_FAILURE;
+   }
+
+   ret = swapon(argv[1], 0);
+   if (ret) {
+   perror("swapon");
+   return EXIT_FAILURE;
+   }
+
+   return EXIT_SUCCESS;
+}
diff --git a/tests/generic/494 b/tests/generic/494
new file mode 100755
index ..28468033
--- /dev/null
+++ b/tests/generic/494
@@ -0,0 +1,77 @@
+#! /bin/bash
+# FS QA Test 494
+#
+# Test invalid swap files.
+#
+#---
+# Copyright (c) 2018 Facebook.  All Rights Reserved.
+#
+# This program is free software; you can redistribute it and/or
+# modify it under the terms of the GNU General Public License as
+# published by the Free Software Foundation.
+#
+# This program is distributed in the hope 

[PATCH v3 0/5] xfstests: generic swap file tests

2018-05-22 Thread Omar Sandoval
From: Omar Sandoval 

Changes since v2:

- Remove file before formatting in _format_swapfile
- Comment on chattr in _format_swapfile
- Remove unnecessary label
- Fix typo in commit message of patch 3
- Redirect chattr +C output in patch 5 to fix XFS failures

Omar Sandoval (5):
  xfstests: create swap group
  generic: enable swapfile tests on Btrfs
  generic: add test for dedupe on an active swapfile
  generic: add test for truncate/fpunch of an active swapfile
  generic: test invalid swap file activation

 .gitignore|  2 ++
 common/rc | 17 +++--
 src/Makefile  |  2 +-
 src/mkswap.c  | 83 +++
 src/swapon.c  | 24 +
 tests/generic/356 |  7 ++--
 tests/generic/357 |  6 ++--
 tests/generic/492 | 76 +++
 tests/generic/492.out |  7 
 tests/generic/493 | 73 +
 tests/generic/493.out |  8 +
 tests/generic/494 | 77 +++
 tests/generic/494.out |  5 +++
 tests/generic/group   |  7 ++--
 tests/xfs/group   |  2 +-
 15 files changed, 383 insertions(+), 13 deletions(-)
 create mode 100644 src/mkswap.c
 create mode 100644 src/swapon.c
 create mode 100755 tests/generic/492
 create mode 100644 tests/generic/492.out
 create mode 100755 tests/generic/493
 create mode 100644 tests/generic/493.out
 create mode 100755 tests/generic/494
 create mode 100644 tests/generic/494.out

-- 
2.17.0

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v3 3/5] generic: add test for dedupe on an active swapfile

2018-05-22 Thread Omar Sandoval
From: Omar Sandoval 

Similar to the reflink test generic/356, but makes sure we can't dedupe
an active swapfile.

Signed-off-by: Omar Sandoval 
---
 tests/generic/492 | 76 +++
 tests/generic/492.out |  7 
 tests/generic/group   |  1 +
 3 files changed, 84 insertions(+)
 create mode 100755 tests/generic/492
 create mode 100644 tests/generic/492.out

diff --git a/tests/generic/492 b/tests/generic/492
new file mode 100755
index ..54a2553d
--- /dev/null
+++ b/tests/generic/492
@@ -0,0 +1,76 @@
+#! /bin/bash
+# FS QA Test 492
+#
+# Check that we can't dedupe a swapfile.
+#
+#---
+# Copyright (c) 2018 Facebook.  All Rights Reserved.
+#
+# This program is free software; you can redistribute it and/or
+# modify it under the terms of the GNU General Public License as
+# published by the Free Software Foundation.
+#
+# This program is distributed in the hope that it would be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program; if not, write the Free Software Foundation,
+# Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
+#---
+#
+
+seq=`basename $0`
+seqres=$RESULT_DIR/$seq
+echo "QA output created by $seq"
+
+here=`pwd`
+tmp=/tmp/$$
+status=1   # failure is the default!
+trap "_cleanup; exit \$status" 0 1 2 3 15
+
+_cleanup()
+{
+   cd /
+   rm -f $tmp.*
+}
+
+. ./common/rc
+. ./common/filter
+. ./common/reflink
+
+rm -f $seqres.full
+
+_supported_fs generic
+_supported_os Linux
+_require_scratch_swapfile
+_require_scratch_dedupe
+
+echo "Format and mount"
+_scratch_mkfs > $seqres.full 2>&1
+_scratch_mount >> $seqres.full 2>&1
+
+testdir="$SCRATCH_MNT/test-$seq"
+mkdir "$testdir"
+
+blocks=160
+blksz=65536
+
+echo "Initialize file"
+_format_swapfile "$testdir/file1" $((blocks * blksz))
+swapon "$testdir/file1"
+
+touch "$testdir/file2"
+$CHATTR_PROG +C "$testdir/file2" >/dev/null 2>&1
+
+echo "Try to dedupe"
+cp "$testdir/file1" "$testdir/file2"
+_dedupe_range "$testdir/file1" 0 "$testdir/file2" 0 $((blocks * blksz))
+_dedupe_range "$testdir/file2" 0 "$testdir/file1" 0 $((blocks * blksz))
+
+echo "Tear it down"
+swapoff "$testdir/file1"
+
+status=0
+exit
diff --git a/tests/generic/492.out b/tests/generic/492.out
new file mode 100644
index ..e1f3cc69
--- /dev/null
+++ b/tests/generic/492.out
@@ -0,0 +1,7 @@
+QA output created by 492
+Format and mount
+Initialize file
+Try to dedupe
+XFS_IOC_FILE_EXTENT_SAME: Text file busy
+XFS_IOC_FILE_EXTENT_SAME: Text file busy
+Tear it down
diff --git a/tests/generic/group b/tests/generic/group
index 111e42e7..dc5adf04 100644
--- a/tests/generic/group
+++ b/tests/generic/group
@@ -494,3 +494,4 @@
 489 auto quick attr
 490 auto quick rw
 491 auto quick freeze mount
+492 auto quick swap
-- 
2.17.0

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v3 4/5] generic: add test for truncate/fpunch of an active swapfile

2018-05-22 Thread Omar Sandoval
From: Omar Sandoval 

These should not be allowed.

Signed-off-by: Omar Sandoval 
---
 tests/generic/493 | 73 +++
 tests/generic/493.out |  8 +
 tests/generic/group   |  1 +
 3 files changed, 82 insertions(+)
 create mode 100755 tests/generic/493
 create mode 100644 tests/generic/493.out

diff --git a/tests/generic/493 b/tests/generic/493
new file mode 100755
index ..eba62dcb
--- /dev/null
+++ b/tests/generic/493
@@ -0,0 +1,73 @@
+#! /bin/bash
+# FS QA Test 493
+#
+# Test truncation/hole punching of an active swapfile.
+#
+#---
+# Copyright (c) 2018 Facebook.  All Rights Reserved.
+#
+# This program is free software; you can redistribute it and/or
+# modify it under the terms of the GNU General Public License as
+# published by the Free Software Foundation.
+#
+# This program is distributed in the hope that it would be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program; if not, write the Free Software Foundation,
+# Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
+#---
+#
+
+seq=`basename $0`
+seqres=$RESULT_DIR/$seq
+echo "QA output created by $seq"
+
+here=`pwd`
+tmp=/tmp/$$
+status=1   # failure is the default!
+trap "_cleanup; exit \$status" 0 1 2 3 15
+
+_cleanup()
+{
+   cd /
+   rm -f $tmp.*
+}
+
+. ./common/rc
+. ./common/filter
+
+rm -f $seqres.full
+
+_supported_fs generic
+_supported_os Linux
+_require_scratch_swapfile
+_require_xfs_io_command "fpunch"
+
+echo "Format and mount"
+_scratch_mkfs > $seqres.full 2>&1
+_scratch_mount >> $seqres.full 2>&1
+
+testdir="$SCRATCH_MNT/test-$seq"
+mkdir "$testdir"
+
+blocks=160
+blksz=65536
+
+echo "Initialize file"
+_format_swapfile "$testdir/file1" $((blocks * blksz))
+swapon "$testdir/file1"
+
+echo "Try to truncate"
+$XFS_IO_PROG -c "truncate $blksz" "$testdir/file1"
+
+echo "Try to punch hole"
+$XFS_IO_PROG -c "fpunch $blksz $((2 * blksz))" "$testdir/file1"
+
+echo "Tear it down"
+swapoff "$testdir/file1"
+
+status=0
+exit
diff --git a/tests/generic/493.out b/tests/generic/493.out
new file mode 100644
index ..929a76b6
--- /dev/null
+++ b/tests/generic/493.out
@@ -0,0 +1,8 @@
+QA output created by 493
+Format and mount
+Initialize file
+Try to truncate
+ftruncate: Text file busy
+Try to punch hole
+fallocate: Text file busy
+Tear it down
diff --git a/tests/generic/group b/tests/generic/group
index dc5adf04..9778930a 100644
--- a/tests/generic/group
+++ b/tests/generic/group
@@ -495,3 +495,4 @@
 490 auto quick rw
 491 auto quick freeze mount
 492 auto quick swap
+493 auto quick swap
-- 
2.17.0

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v3 2/5] generic: enable swapfile tests on Btrfs

2018-05-22 Thread Omar Sandoval
From: Omar Sandoval 

Commit 8c96cfbfe530 ("generic/35[67]: disable swapfile tests on Btrfs")
disabled the swapfile tests on Btrfs because it did not support
swapfiles at the time. Now that we're adding support, we want these
tests to run, but they don't. _require_scratch_swapfile always fails for
Btrfs because swapfiles on Btrfs must be set to nocow. After fixing
that, generic/356 and generic/357 fail for the same reason. After fixing
_that_, both tests still fail because we don't allow reflinking a
non-checksummed extent (which nocow implies) to a checksummed extent.
Add a helper for formatting a swap file which does the chattr, and
chattr the second file, which gets these tests running on kernels
supporting Btrfs swapfiles.

Signed-off-by: Omar Sandoval 
---
 common/rc | 17 ++---
 tests/generic/356 |  7 ---
 tests/generic/357 |  6 +++---
 3 files changed, 21 insertions(+), 9 deletions(-)

diff --git a/common/rc b/common/rc
index ffe53236..0ef3d3fe 100644
--- a/common/rc
+++ b/common/rc
@@ -,6 +,19 @@ _require_odirect()
rm -f $testfile 2>&1 > /dev/null
 }
 
+_format_swapfile() {
+   local fname="$1"
+   local sz="$2"
+
+   rm -f "$fname"
+   touch "$fname"
+   chmod 0600 "$fname"
+   # Swap files must be nocow on Btrfs.
+   $CHATTR_PROG +C "$fname" > /dev/null 2>&1
+   _pwrite_byte 0x61 0 "$sz" "$fname" >> $seqres.full
+   mkswap "$fname" >> $seqres.full
+}
+
 # Check that the filesystem supports swapfiles
 _require_scratch_swapfile()
 {
@@ -2231,10 +2244,8 @@ _require_scratch_swapfile()
_scratch_mount
 
# Minimum size for mkswap is 10 pages
-   local size=$(($(get_page_size) * 10))
+   _format_swapfile "$SCRATCH_MNT/swap" $(($(get_page_size) * 10))
 
-   _pwrite_byte 0x61 0 "$size" "$SCRATCH_MNT/swap" >/dev/null 2>&1
-   mkswap "$SCRATCH_MNT/swap" >/dev/null 2>&1
if ! swapon "$SCRATCH_MNT/swap" >/dev/null 2>&1; then
_scratch_unmount
_notrun "swapfiles are not supported"
diff --git a/tests/generic/356 b/tests/generic/356
index 51eeb652..b4a38f84 100755
--- a/tests/generic/356
+++ b/tests/generic/356
@@ -59,11 +59,12 @@ blocks=160
 blksz=65536
 
 echo "Initialize file"
-echo >> $seqres.full
-_pwrite_byte 0x61 0 $((blocks * blksz)) $testdir/file1 >> $seqres.full
-mkswap -U 27376b42-ff65-42ca-919f-6c9b62292a5c $testdir/file1 >> $seqres.full
+_format_swapfile "$testdir/file1" $((blocks * blksz))
 swapon $testdir/file1
 
+touch "$testdir/file2"
+$CHATTR_PROG +C "$testdir/file2" >/dev/null 2>&1
+
 echo "Try to reflink"
 _cp_reflink $testdir/file1 $testdir/file2 2>&1 | _filter_scratch
 
diff --git a/tests/generic/357 b/tests/generic/357
index 0dd0c10f..9a83a283 100755
--- a/tests/generic/357
+++ b/tests/generic/357
@@ -59,9 +59,9 @@ blocks=160
 blksz=65536
 
 echo "Initialize file"
-echo >> $seqres.full
-_pwrite_byte 0x61 0 $((blocks * blksz)) $testdir/file1 >> $seqres.full
-mkswap -U 27376b42-ff65-42ca-919f-6c9b62292a5c $testdir/file1 >> $seqres.full
+_format_swapfile "$testdir/file1" $((blocks * blksz))
+touch "$testdir/file2"
+$CHATTR_PROG +C "$testdir/file2" >/dev/null 2>&1
 _cp_reflink $testdir/file1 $testdir/file2 2>&1 | _filter_scratch
 
 echo "Try to swapon"
-- 
2.17.0

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 2/5] generic: enable swapfile tests on Btrfs

2018-05-22 Thread Omar Sandoval
On Tue, May 22, 2018 at 02:41:45PM +0800, Eryu Guan wrote:
> On Wed, May 16, 2018 at 01:38:46PM -0700, Omar Sandoval wrote:
> > From: Omar Sandoval 
> > 
> > Commit 8c96cfbfe530 ("generic/35[67]: disable swapfile tests on Btrfs")
> > disabled the swapfile tests on Btrfs because it did not support
> > swapfiles at the time. Now that we're adding support, we want these
> 
> So currently btrfs has no swapfile support yet, and tests still _notrun
> with current v4.17-rc5 kernel? Just want to make sure that's expected.

Yes, that's correct.

> > tests to run, but they don't. _require_scratch_swapfile always fails for
> > Btrfs because swapfiles on Btrfs must be set to nocow. After fixing
> > that, generic/356 and generic/357 fail for the same reason. After fixing
> > _that_, both tests still fail because we don't allow reflinking a
> > non-checksummed extent (which nocow implies) to a checksummed extent.
> > Add a helper for formatting a swap file which does the chattr, and
> > chattr the second file, which gets these tests running on kernels
> > supporting Btrfs swapfiles.
> > 
> > Signed-off-by: Omar Sandoval 
> > ---
> >  common/rc | 15 ---
> >  tests/generic/356 |  7 ---
> >  tests/generic/357 |  6 +++---
> >  3 files changed, 19 insertions(+), 9 deletions(-)
> > 
> > diff --git a/common/rc b/common/rc
> > index ffe53236..814b8b5c 100644
> > --- a/common/rc
> > +++ b/common/rc
> > @@ -,6 +,17 @@ _require_odirect()
> > rm -f $testfile 2>&1 > /dev/null
> >  }
> >  
> > +_format_swapfile() {
> > +   local fname="$1"
> > +   local sz="$2"
> > +
> > +   touch "$fname"
> 
> Better to remove $fname first, just in case it's an existing file with
> some blocks allocated, as chattr(1) says
> 
> "Note: For btrfs, the 'C' flag should be set on new or empty files.  If
> it is set on a file which already has data blocks, it is undefined when
> the blocks assigned to the file will be fully stable"

Good point.

> > +   chmod 0600 "$fname"
> > +   $CHATTR_PROG +C "$fname" > /dev/null 2>&1
> 
> It'd be good to have some comments on this chattr +C.

Will do.

> > +   _pwrite_byte 0x61 0 "$sz" "$fname" >> $seqres.full
> > +   mkswap -U 27376b42-ff65-42ca-919f-6c9b62292a5c "$fname" >> $seqres.full
> 
> Is the label really needed?

Nope, doesn't look like it. Thanks!
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] Btrfs: fix memory and mount leak in btrfs_ioctl_rm_dev_v2()

2018-05-22 Thread Omar Sandoval
From: Omar Sandoval 

If we have invalid flags set, when we error out we must drop our writer
counter and free the buffer we allocated for the arguments. This bug is
trivially reproduced with the following program:

#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 

int main(int argc, char **argv)
{
struct btrfs_ioctl_vol_args_v2 vol_args = {
.flags = UINT64_MAX,
};
int ret;
int fd;

if (argc != 2) {
fprintf(stderr, "usage: %s PATH\n", argv[0]);
return EXIT_FAILURE;
}

fd = open(argv[1], O_WRONLY);
if (fd == -1) {
perror("open");
return EXIT_FAILURE;
}

ret = ioctl(fd, BTRFS_IOC_RM_DEV_V2, _args);
if (ret == -1)
perror("ioctl");

close(fd);
return EXIT_SUCCESS;
}

When unmounting the filesystem, we'll hit the
WARN_ON(mnt_get_writers(mnt)) in cleanup_mnt().

Fixes: 6b526ed70cf1 ("btrfs: introduce device delete by devid")
Signed-off-by: Omar Sandoval 
---
Sigh, I keep stepping in these ioctl bugs just trying to get my swap
file series ready. This one is based on top of my "Btrfs: fix partly
checksummed file races" series, although it should apply to for-next
without it +/- some moved lines.

 fs/btrfs/ioctl.c | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index c2263bf4d6f5..9af3be96099f 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -3086,8 +3086,10 @@ static long btrfs_ioctl_rm_dev_v2(struct file *file, 
void __user *arg)
}
 
/* Check for compatibility reject unknown flags */
-   if (vol_args->flags & ~BTRFS_VOL_ARG_V2_FLAGS_SUPPORTED)
-   return -EOPNOTSUPP;
+   if (vol_args->flags & ~BTRFS_VOL_ARG_V2_FLAGS_SUPPORTED) {
+   ret = -EOPNOTSUPP;
+   goto out;
+   }
 
if (test_and_set_bit(BTRFS_FS_EXCL_OP, _info->flags)) {
ret = BTRFS_ERROR_DEV_EXCL_RUN_IN_PROGRESS;
-- 
2.17.0

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 2/2] Btrfs: fix dedupe vs chattr NODATASUM race

2018-05-22 Thread Omar Sandoval
From: Omar Sandoval 

In btrfs_extent_same(), we must check the NODATASUM flag while the
inodes are locked. Otherwise, it's possible that btrfs_ioctl_setflags()
will change the flags after we check. This was correct until a recent
change.

Fixes: 0a38effca37c ("Btrfs: split btrfs_extent_same")
Signed-off-by: Omar Sandoval 
---
 fs/btrfs/ioctl.c | 13 +++--
 1 file changed, 7 insertions(+), 6 deletions(-)

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 784e267aad32..c2263bf4d6f5 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -3586,12 +3586,6 @@ static int btrfs_extent_same(struct inode *src, u64 
loff, u64 olen,
if (olen == 0)
return 0;
 
-   /* don't make the dst file partly checksummed */
-   if ((BTRFS_I(src)->flags & BTRFS_INODE_NODATASUM) !=
-   (BTRFS_I(dst)->flags & BTRFS_INODE_NODATASUM)) {
-   return -EINVAL;
-   }
-
tail_len = olen % BTRFS_MAX_DEDUPE_LEN;
chunk_count = div_u64(olen, BTRFS_MAX_DEDUPE_LEN);
if (chunk_count == 0)
@@ -3624,6 +3618,13 @@ static int btrfs_extent_same(struct inode *src, u64 
loff, u64 olen,
else
btrfs_double_inode_lock(src, dst);
 
+   /* don't make the dst file partly checksummed */
+   if ((BTRFS_I(src)->flags & BTRFS_INODE_NODATASUM) !=
+   (BTRFS_I(dst)->flags & BTRFS_INODE_NODATASUM)) {
+   ret = -EINVAL;
+   goto out;
+   }
+
for (i = 0; i < chunk_count; i++) {
ret = btrfs_extent_same_range(src, loff, BTRFS_MAX_DEDUPE_LEN,
  dst, dst_loff, );
-- 
2.17.0

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 1/2] Btrfs: fix clone vs chattr NODATASUM race

2018-05-22 Thread Omar Sandoval
From: Omar Sandoval 

In btrfs_clone_files(), we must check the NODATASUM flag while the
inodes are locked. Otherwise, it's possible that btrfs_ioctl_setflags()
will change the flags after we check and we can end up with a party
checksummed file.

Fixes: 0e7b824c4ef9 ("Btrfs: don't make a file partly checksummed through file 
clone")
Signed-off-by: Omar Sandoval 
---
 fs/btrfs/ioctl.c | 12 +++-
 1 file changed, 7 insertions(+), 5 deletions(-)

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index cf0d3bc6f625..784e267aad32 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -4280,11 +4280,6 @@ static noinline int btrfs_clone_files(struct file *file, 
struct file *file_src,
src->i_sb != inode->i_sb)
return -EXDEV;
 
-   /* don't make the dst file partly checksummed */
-   if ((BTRFS_I(src)->flags & BTRFS_INODE_NODATASUM) !=
-   (BTRFS_I(inode)->flags & BTRFS_INODE_NODATASUM))
-   return -EINVAL;
-
if (S_ISDIR(src->i_mode) || S_ISDIR(inode->i_mode))
return -EISDIR;
 
@@ -4294,6 +4289,13 @@ static noinline int btrfs_clone_files(struct file *file, 
struct file *file_src,
inode_lock(src);
}
 
+   /* don't make the dst file partly checksummed */
+   if ((BTRFS_I(src)->flags & BTRFS_INODE_NODATASUM) !=
+   (BTRFS_I(inode)->flags & BTRFS_INODE_NODATASUM)) {
+   ret = -EINVAL;
+   goto out_unlock;
+   }
+
/* determine range to clone */
ret = -EINVAL;
if (off + len > src->i_size || off + len < off)
-- 
2.17.0

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 0/2] Btrfs: fix partly checksummed file races

2018-05-22 Thread Omar Sandoval
From: Omar Sandoval 

Clone and dedupe both have checks to make sure that we're not mixing
checksummed and not checksummed extents in a single file. However, both
checks are racy; we need to have the inodes locked or else the flags can
change after we check. The clone check has always been wrong. The dedupe
check was previously correct but is broken in kdave/for-next.

Based on kdave/for-next. Note that there's a Fixes: tag in there
referencing a commit in the for-next branch, so that would have to be
updated if the commit gets rebased. These patches are also available at
https://github.com/osandov/linux/tree/btrfs-nodatasum-race.

Thanks!

Omar Sandoval (2):
  Btrfs: fix clone vs chattr NODATASUM race
  Btrfs: fix dedupe vs chattr NODATASUM race

 fs/btrfs/ioctl.c | 25 ++---
 1 file changed, 14 insertions(+), 11 deletions(-)

-- 
2.17.0

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


csum failed root raveled during balance

2018-05-22 Thread ein
Hello devs,

I tested BTRFS in production for about a month:

21:08:17 up 34 days,  2:21,  3 users,  load average: 0.06, 0.02, 0.00

Without power blackout, hardware failure, SSD's SMART is flawless etc.
The tests ended with:

root@node0:~# dmesg | grep BTRFS | grep warn
185:980:[2927472.393557] BTRFS warning (device dm-0): csum failed root
-9 ino 312 off 608284672 csum 0x7d03a376 expected csum 0x3163a9b7 mirror 1
186:981:[2927472.394158] BTRFS warning (device dm-0): csum failed root
-9 ino 312 off 608284672 csum 0x7da1b152 expected csum 0x3163a9b7 mirror 1
191:986:[2928224.169814] BTRFS warning (device dm-0): csum failed root
-9 ino 314 off 608284672 csum 0x7d03a376 expected csum 0x3163a9b7 mirror 1
192:987:[2928224.171433] BTRFS warning (device dm-0): csum failed root
-9 ino 314 off 608284672 csum 0x7da1b152 expected csum 0x3163a9b7 mirror 1
206:1001:[2928298.039516] BTRFS warning (device dm-0): csum failed root
-9 ino 319 off 608284672 csum 0x7d03a376 expected csum 0x3163a9b7 mirror 1
207:1002:[2928298.043103] BTRFS warning (device dm-0): csum failed root
-9 ino 319 off 608284672 csum 0x7d03a376 expected csum 0x3163a9b7 mirror 1
208:1004:[2932213.513424] BTRFS warning (device dm-0): csum failed root
5 ino 219962 off 4564959232 csum 0xc616afb4 expected csum 0x5425e489
mirror 1
209:1005:[2932235.666368] BTRFS warning (device dm-0): csum failed root
5 ino 219962 off 16989835264 csum 0xd63ed5da expected csum 0x7429caa1
mirror 1
210:1072:[2936767.229277] BTRFS warning (device dm-0): csum failed root
5 ino 219915 off 82318458880 csum 0x83614341 expected csum 0x0b8706f8
mirror 1
211:1073:[2936767.276229] BTRFS warning (device dm-0): csum failed root
5 ino 219915 off 82318458880 csum 0x83614341 expected csum 0x0b8706f8
mirror 1

Above has been revealed during below command and quite high IO usage by
few VMs (Linux on top Ext4 with firebird database, lots of random
read/writes, two others with Windows 2016 and Windows Update in the
background):

btrfs balance start -dusage=85 ./

Surprisingly without error counter increasing (how this is even possible?!):

root@node0:/var/lib/libvirt/images# backup_dir="/var/lib/libvirt"; btrfs
fi df $backup_dir; btrfs filesystem usage $backup_dir; btrfs dev stats
$backup_dir
Data, single: total=203.00GiB, used=188.28GiB
System, single: total=32.00MiB, used=48.00KiB
Metadata, single: total=3.00GiB, used=856.27MiB
GlobalReserve, single: total=512.00MiB, used=0.00B
Overall:
Device size: 300.00GiB
Device allocated:206.03GiB
Device unallocated:   93.97GiB
Device missing:  0.00B
Used:189.12GiB
Free (estimated):108.68GiB  (min: 108.68GiB)
Data ratio:   1.00
Metadata ratio:   1.00
Global reserve:  512.00MiB  (used: 0.00B)

Data,single: Size:203.00GiB, Used:188.28GiB
   /dev/mapper/raid10-images 203.00GiB

Metadata,single: Size:3.00GiB, Used:856.27MiB
   /dev/mapper/raid10-images   3.00GiB

System,single: Size:32.00MiB, Used:48.00KiB
   /dev/mapper/raid10-images  32.00MiB

Unallocated:
   /dev/mapper/raid10-images  93.97GiB
[/dev/mapper/raid10-images].write_io_errs0
[/dev/mapper/raid10-images].read_io_errs 0
[/dev/mapper/raid10-images].flush_io_errs0
[/dev/mapper/raid10-images].corruption_errs  0

Few words about the setup Hardware layout:

/dev/sda Model=Samsung SSD 850 PRO 256GB, FwRev=EXM02B6Q,
SerialNo=S251NX0H822367V
/dev/sdb Model=Samsung SSD 850 PRO 256GB, FwRev=EXM02B6Q,
SerialNo=S251NX0H822370A
/dev/sdc Model=INTEL SSDSC2BP240G4, FwRev=L2010420,
SerialNo=BTJR6141002D240AGN
/dev/sdd Model=INTEL SSDSC2BP240G4, FwRev=L2010420,
SerialNo=BTJR6063000F240AGN

Linux Software RAID layout (mdraid):

md1 : active raid10 sdc1[2] sdb2[1] sdd1[3] sda2[0]
  468596736 blocks super 1.2 512K chunks 2 near-copies [4/4] []
  bitmap: 0/4 pages [0KB], 65536KB chunk

md0 : active raid1 sda1[0] sdb1[1]
  15616000 blocks super 1.2 [2/2] [UU]

Logical layout:

NAMEMAJ:MIN RM   SIZE RO TYPE   MOUNTPOINT
sda   8:00 238.5G  0 disk  
├─sda18:10  14.9G  0 part  
│ └─md0   9:00  14.9G  0 raid1 
│   └─raid1-rootfs  253:20  14.9G  0 lvm/
└─sda28:20 223.6G  0 part  
  └─md1   9:10 446.9G  0 raid10
├─raid10-images 253:00   300G  0 lvm/var/lib/libvirt
└─raid10-swapfs 253:10 4G  0 lvm   
sdb   8:16   0 238.5G  0 disk  
├─sdb18:17   0  14.9G  0 part  
│ └─md0   9:00  14.9G  0 raid1 
│   └─raid1-rootfs  253:20  14.9G  0 lvm/
└─sdb28:18   0 223.6G  0 part  
  └─md1   9:10 446.9G  0 raid10
├─raid10-images 253:00   300G  0 lvm/var/lib/libvirt
└─raid10-swapfs 253:10 4G  0 lvm   
sdc   8:32   0 223.6G  0 disk  
└─sdc18:33  

Re: [PATCH 00/13] convert block layer to bioset_init()/mempool_init()

2018-05-22 Thread Mike Snitzer
On Tue, May 22 2018 at  2:41am -0400,
Christoph Hellwig  wrote:

> On Mon, May 21, 2018 at 07:38:55PM -0400, Kent Overstreet wrote:
> > On Mon, May 21, 2018 at 02:24:32PM -0400, Mike Snitzer wrote:
> > > Every single data structure change in this series should be reviewed for
> > > unforeseen alignment consequences.  Jens seemed to say that is
> > > worthwhile.  Not sure if he'll do it or we divide it up.  If we divide
> > > it up a temp topic branch should be published for others to inspect.
> > > 
> > > Could be alignment hasn't been a historic concern for a bunch of the
> > > data structures changed in this series.. if so then all we can do is fix
> > > up any obvious potential for false sharing.
> > 
> > Honestly, I almost never worry about alignment... the very few times I do 
> > care,
> > I use __cacheline_aligned_in_smp.
> 
> And that generally is the right stratgey.  If Mike really doesn't want
> a change we can just open code the kmalloc for the bio set there, the
> important point is that we should not keep the old API around for no
> good reason.

Again, I never said I didn't want these changes.  I just thought it
worthwhile to mention the potential for alignment quirks arising.

Reality is false sharing is quite uncommon.  So my concern was/is pretty
niche and unlikely to be applicable.

That said, I did take the opportunity to look at all the DM structures
that were modified.  The bio_set and mempool_t structs are so large that
they span multiple cachelines as is.  So I focused on eliminating
unnecessary spanning of cachelines (by non-bio_set and non-mempool_t
members) and eliminated most holes in DM structs.  This is the result:
http://people.redhat.com/msnitzer/dm-4.18-struct-reorg/

Compare before.txt vs after_kent.txt vs after_mike.txt

Nothing staggering or special.  Just something I'll add once Kent's
latest changes land.

But, I also eliminated 2 4-byte holes in struct bio_set, Jens please
feel free to pick this up (if you think it useful):

From: Mike Snitzer 
Subject: [PATCH] block: eliminate 2 4-byte holes in bio_set structure

Signed-off-by: Mike Snitzer 
---
 include/linux/bio.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/linux/bio.h b/include/linux/bio.h
index 5e472fc..e6fc692 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -735,7 +735,6 @@ static inline void bio_inc_remaining(struct bio *bio)
 
 struct bio_set {
struct kmem_cache *bio_slab;
-   unsigned int front_pad;
 
mempool_t bio_pool;
mempool_t bvec_pool;
@@ -743,6 +742,7 @@ struct bio_set {
mempool_t bio_integrity_pool;
mempool_t bvec_integrity_pool;
 #endif
+   unsigned int front_pad;
 
/*
 * Deadlock avoidance for stacking block drivers: see comments in

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH v2 00/12] get rid of GFP_ZONE_TABLE/BAD

2018-05-22 Thread Michal Hocko
On Mon 21-05-18 23:20:21, Huaisheng Ye wrote:
> From: Huaisheng Ye 
> 
> Replace GFP_ZONE_TABLE and GFP_ZONE_BAD with encoded zone number.
> 
> Delete ___GFP_DMA, ___GFP_HIGHMEM and ___GFP_DMA32 from GFP bitmasks,
> the bottom three bits of GFP mask is reserved for storing encoded
> zone number.
> 
> The encoding method is XOR. Get zone number from enum zone_type,
> then encode the number with ZONE_NORMAL by XOR operation.
> The goal is to make sure ZONE_NORMAL can be encoded to zero. So,
> the compatibility can be guaranteed, such as GFP_KERNEL and GFP_ATOMIC
> can be used as before.
> 
> Reserve __GFP_MOVABLE in bit 3, so that it can continue to be used as
> a flag. Same as before, __GFP_MOVABLE respresents movable migrate type
> for ZONE_DMA, ZONE_DMA32, and ZONE_NORMAL. But when it is enabled with
> __GFP_HIGHMEM, ZONE_MOVABLE shall be returned instead of ZONE_HIGHMEM.
> __GFP_ZONE_MOVABLE is created to realize it.
> 
> With this patch, just enabling __GFP_MOVABLE and __GFP_HIGHMEM is not
> enough to get ZONE_MOVABLE from gfp_zone. All callers should use
> GFP_HIGHUSER_MOVABLE or __GFP_ZONE_MOVABLE directly to achieve that.
> 
> Decode zone number directly from bottom three bits of flags in gfp_zone.
> The theory of encoding and decoding is,
> A ^ B ^ B = A

So why is this any better than the current code. Sure I am not a great
fan of GFP_ZONE_TABLE because of how it is incomprehensible but this
doesn't look too much better, yet we are losing a check for incompatible
gfp flags. The diffstat looks really sound but then you just look and
see that the large part is the comment that at least explained the gfp
zone modifiers somehow and the debugging code. So what is the selling
point?

> Changes since v1,
> 
> v2: Add __GFP_ZONE_MOVABLE and modify GFP_HIGHUSER_MOVABLE to help
> callers to get ZONE_MOVABLE. Add __GFP_ZONE_MASK to mask lowest 3
> bits of GFP bitmasks.
> Modify some callers' gfp flag to update usage of address zone
> modifiers.
> Modify inline function gfp_zone to get better performance according
> to Matthew's suggestion.
> 
> Link: https://marc.info/?l=linux-mm=152596791931266=2
> 
> Huaisheng Ye (12):
>   include/linux/gfp.h: get rid of GFP_ZONE_TABLE/BAD
>   arch/x86/kernel/amd_gart_64: update usage of address zone modifiers
>   arch/x86/kernel/pci-calgary_64: update usage of address zone modifiers
>   drivers/iommu/amd_iommu: update usage of address zone modifiers
>   include/linux/dma-mapping: update usage of address zone modifiers
>   drivers/xen/swiotlb-xen: update usage of address zone modifiers
>   fs/btrfs/extent_io: update usage of address zone modifiers
>   drivers/block/zram/zram_drv: update usage of address zone modifiers
>   mm/vmpressure: update usage of address zone modifiers
>   mm/zsmalloc: update usage of address zone modifiers
>   include/linux/highmem: update usage of movableflags
>   arch/x86/include/asm/page.h: update usage of movableflags
> 
>  arch/x86/include/asm/page.h  |  3 +-
>  arch/x86/kernel/amd_gart_64.c|  2 +-
>  arch/x86/kernel/pci-calgary_64.c |  2 +-
>  drivers/block/zram/zram_drv.c|  6 +--
>  drivers/iommu/amd_iommu.c|  2 +-
>  drivers/xen/swiotlb-xen.c|  2 +-
>  fs/btrfs/extent_io.c |  2 +-
>  include/linux/dma-mapping.h  |  2 +-
>  include/linux/gfp.h  | 98 
> +---
>  include/linux/highmem.h  |  4 +-
>  mm/vmpressure.c  |  2 +-
>  mm/zsmalloc.c|  4 +-
>  12 files changed, 26 insertions(+), 103 deletions(-)
> 
> -- 
> 1.8.3.1
> 

-- 
Michal Hocko
SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Btrfs: implement unlocked buffered write

2018-05-22 Thread Chris Mason

On 22 May 2018, at 14:08, Christoph Hellwig wrote:


On Wed, May 16, 2018 at 11:52:37AM +0800, robbieko wrote:

From: Robbie Ko 

This idea is from direct io. By this patch, we can make the buffered
write parallel, and improve the performance and latency. But because 
we
can not update isize without i_mutex, the unlocked buffered write 
just

can be done in front of the EOF.

We needn't worry about the race between buffered write and truncate,
because the truncate need wait until all the buffered write end.

And we also needn't worry about the race between dio write and punch 
hole,

because we have extent lock to protect our operation.

I ran fio to test the performance of this feature.


And what protects two writes from interleaving their results now?


page locks...ish, we at least won't have results interleaved in a single 
page.  For btrfs it'll actually be multiple pages since we try to do 
more than one at a time.


I haven't verified all the assumptions around truncate and fallocate and 
friends expecting the dio special locking to be inside i_size.  In 
general this makes me a little uncomfortable.


But we're not avoiding the inode lock completely, we're just dropping it 
for the expensive parts of writing to the file.  A quick guess about 
what the expensive parts are:


1) balance_dirty_pages()
2) btrfs_btree_balance_dirty()
3) metadata reservations/enospc waiting.

Can I bribe you to benchmark how much each of those things is impacting 
the iops/latency benefits?


-chris
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Btrfs: implement unlocked buffered write

2018-05-22 Thread Christoph Hellwig
On Wed, May 16, 2018 at 11:52:37AM +0800, robbieko wrote:
> From: Robbie Ko 
> 
> This idea is from direct io. By this patch, we can make the buffered
> write parallel, and improve the performance and latency. But because we
> can not update isize without i_mutex, the unlocked buffered write just
> can be done in front of the EOF.
> 
> We needn't worry about the race between buffered write and truncate,
> because the truncate need wait until all the buffered write end.
> 
> And we also needn't worry about the race between dio write and punch hole,
> because we have extent lock to protect our operation.
> 
> I ran fio to test the performance of this feature.
> 
> == Hardware ==
> CPU: Intel® Xeon® D-1531
> SSD: Intel S3710 200G
> Volume : RAID 5 , SSD * 6
> 
> == config file ==
> [global]
> group_reporting
> time_based
> thread=1
> norandommap
> ioengine=libaio
> bs=4k
> iodepth=32
> size=16G
> runtime=180
> numjobs=8
> rw=randwrite
> 
> [file1]
> filename=/mnt/btrfs/nocow/testfile
> 
> == result (iops) ==
> lock = 68470
> unlocked = 94242
> 
> == result (clat) ==
> lock
>  lat (usec): min=184, max=1209.9K, avg=3738.35, stdev=20869.49
>  clat percentiles (usec):
>   |  1.00th=[  322],  5.00th=[  330], 10.00th=[  334], 20.00th=[  346],
>   | 30.00th=[  370], 40.00th=[  386], 50.00th=[  406], 60.00th=[  446],
>   | 70.00th=[  516], 80.00th=[  612], 90.00th=[  828], 95.00th=[10432],
>   | 99.00th=[84480], 99.50th=[117248], 99.90th=[226304], 99.95th=[333824],
>   | 99.99th=[692224]
> 
> unlocked
>  lat (usec): min=10, max=218208, avg=2691.44, stdev=5380.82
>  clat percentiles (usec):
>   |  1.00th=[  302],  5.00th=[  390], 10.00th=[  442], 20.00th=[  502],
>   | 30.00th=[  548], 40.00th=[  596], 50.00th=[  652], 60.00th=[  724],
>   | 70.00th=[  916], 80.00th=[ 5024], 90.00th=[ 5664], 95.00th=[10048],
>   | 99.00th=[29568], 99.50th=[39168], 99.90th=[54016], 99.95th=[59648],
>   | 99.99th=[78336]
> 
> Signed-off-by: Robbie Ko 
> ---
>  fs/btrfs/file.c | 17 +
>  1 file changed, 17 insertions(+)
> 
> diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
> index 41ab907..8eac540 100644
> --- a/fs/btrfs/file.c
> +++ b/fs/btrfs/file.c
> @@ -1600,6 +1600,7 @@ static noinline ssize_t __btrfs_buffered_write(struct 
> file *file,
>   int ret = 0;
>   bool only_release_metadata = false;
>   bool force_page_uptodate = false;
> + bool relock = false;
>  
>   nrptrs = min(DIV_ROUND_UP(iov_iter_count(i), PAGE_SIZE),
>   PAGE_SIZE / (sizeof(struct page *)));
> @@ -1609,6 +1610,18 @@ static noinline ssize_t __btrfs_buffered_write(struct 
> file *file,
>   if (!pages)
>   return -ENOMEM;
>  
> + inode_dio_begin(inode);
> +
> + /*
> +  * If the write is beyond the EOF, we need update
> +  * the isize, but it is protected by i_mutex. So we can
> +  * not unlock the i_mutex at this case.
> +  */
> + if (pos + iov_iter_count(i) <= i_size_read(inode)) {
> + inode_unlock(inode);

And what protects two writes from interleaving their results now?
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/2] btrfs: kill btrfs_write_inode

2018-05-22 Thread Omar Sandoval
On Tue, May 22, 2018 at 01:47:22PM -0400, Josef Bacik wrote:
> From: Josef Bacik 
> 
> We don't actually need this.  It used to be in place for O_SYNC writes,
> but we've used the normal fsync() path for that for years now.

Specifically, I believe 148f948ba877 ("vfs: Introduce new helpers for
syncing after writing to O_SYNC file or IS_SYNC inode") is where we
stopped using this for O_SYNC.

> The
> other case we hit this is through sync(), which will commit the
> transaction anyway.  All this does is make us commit the transaction a
> bunch for no reason, and it could deadlock with delayed iput's.

Reviewed-by: Omar Sandoval 

> Signed-off-by: Josef Bacik 
> ---
>  fs/btrfs/inode.c | 26 --
>  fs/btrfs/super.c |  1 -
>  2 files changed, 27 deletions(-)
> 
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index 92b629212c81..5d47206604ca 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -6263,32 +6263,6 @@ static int btrfs_real_readdir(struct file *file, 
> struct dir_context *ctx)
>   return ret;
>  }
>  
> -int btrfs_write_inode(struct inode *inode, struct writeback_control *wbc)
> -{
> - struct btrfs_root *root = BTRFS_I(inode)->root;
> - struct btrfs_trans_handle *trans;
> - int ret = 0;
> - bool nolock = false;
> -
> - if (test_bit(BTRFS_INODE_DUMMY, _I(inode)->runtime_flags))
> - return 0;
> -
> - if (btrfs_fs_closing(root->fs_info) &&
> - btrfs_is_free_space_inode(BTRFS_I(inode)))
> - nolock = true;
> -
> - if (wbc->sync_mode == WB_SYNC_ALL) {
> - if (nolock)
> - trans = btrfs_join_transaction_nolock(root);
> - else
> - trans = btrfs_join_transaction(root);
> - if (IS_ERR(trans))
> - return PTR_ERR(trans);
> - ret = btrfs_commit_transaction(trans);
> - }
> - return ret;
> -}
> -
>  /*
>   * This is somewhat expensive, updating the tree every time the
>   * inode changes.  But, it is most likely to find the inode in cache.
> diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
> index 4a584a7ad371..0a4fb69e0ddf 100644
> --- a/fs/btrfs/super.c
> +++ b/fs/btrfs/super.c
> @@ -2331,7 +2331,6 @@ static const struct super_operations btrfs_super_ops = {
>   .sync_fs= btrfs_sync_fs,
>   .show_options   = btrfs_show_options,
>   .show_devname   = btrfs_show_devname,
> - .write_inode= btrfs_write_inode,
>   .alloc_inode= btrfs_alloc_inode,
>   .destroy_inode  = btrfs_destroy_inode,
>   .statfs = btrfs_statfs,
> -- 
> 2.14.3
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2] Btrfs: fix error handling in btrfs_truncate()

2018-05-22 Thread Omar Sandoval
On Tue, May 22, 2018 at 10:41:11AM -0700, Omar Sandoval wrote:
> On Tue, May 22, 2018 at 10:37:14AM -0700, Omar Sandoval wrote:
> > On Tue, May 22, 2018 at 07:17:48PM +0200, David Sterba wrote:
> > > On Tue, May 22, 2018 at 09:47:58AM -0700, Omar Sandoval wrote:
> > > > From: Omar Sandoval 
> > > > 
> > > > Jun Wu at Facebook reported that an internal service was seeing a return
> > > > value of 1 from ftruncate() on Btrfs in some cases.
> > > 
> > > Do you have a reproducer? To estimate how likely is to hit the problem
> > > in practice.

Okay last one, I promise, we just need the extent items to be on disk:

#include 
#include 
#include 
#include 

int main(void) {
char buf[256] = { 0 };
int ret;
int fd;

fd = open("test", O_CREAT | O_WRONLY | O_TRUNC, 0666);
if (fd == -1) {
perror("open");
return EXIT_FAILURE;
}

if (write(fd, buf, sizeof(buf)) != sizeof(buf)) {
perror("write");
close(fd);
return EXIT_FAILURE;
}

if (fsync(fd) == -1) {
perror("fsync");
close(fd);
return EXIT_FAILURE;
}

ret = ftruncate(fd, 128);
if (ret) {
printf("ftruncate() returned %d\n", ret);
close(fd);
return EXIT_FAILURE;
}

close(fd);

return EXIT_SUCCESS;
}

Basically, any time we truncate a compressed, inline file, as long as
its extents are already on disk, we get the erroneous return value.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 2/2] btrfs: always wait on ordered extents at fsync time

2018-05-22 Thread Josef Bacik
From: Josef Bacik 

There's a priority inversion that exists currently with btrfs fsync.  In
some cases we will collect outstanding ordered extents onto a list and
only wait on them at the very last second.  However this "very last
second" falls inside of a transaction handle, so if we are in a lower
priority cgroup we can end up holding the transaction open for longer
than needed, so if a high priority cgroup is also trying to fsync()
it'll see latency.

Fix this by getting rid of all of the logged extents magic and simply
wait on ordered extent before we star the tree log stuff.  This code has
changed a lot since I first wrote it and really isn't the performance
win it was originally because of the things we had to do around getting
the right checksums.  Killing all of this makes our lives easier and
gets rid of the priority inversion.

Signed-off-by: Josef Bacik 
---
 fs/btrfs/file.c  |  56 ++-
 fs/btrfs/ordered-data.c  | 123 
 fs/btrfs/ordered-data.h  |  20 +-
 fs/btrfs/tree-log.c  | 166 ---
 include/trace/events/btrfs.h |   1 -
 5 files changed, 19 insertions(+), 347 deletions(-)

diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 5772f0cbedef..2b1c36612384 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -2069,53 +2069,12 @@ int btrfs_sync_file(struct file *file, loff_t start, 
loff_t end, int datasync)
atomic_inc(>log_batch);
full_sync = test_bit(BTRFS_INODE_NEEDS_FULL_SYNC,
 _I(inode)->runtime_flags);
+
/*
-* We might have have had more pages made dirty after calling
-* start_ordered_ops and before acquiring the inode's i_mutex.
+* We have to do this here to avoid the priority inversion of waiting on
+* IO of a lower priority task while holding a transaciton open.
 */
-   if (full_sync) {
-   /*
-* For a full sync, we need to make sure any ordered operations
-* start and finish before we start logging the inode, so that
-* all extents are persisted and the respective file extent
-* items are in the fs/subvol btree.
-*/
-   ret = btrfs_wait_ordered_range(inode, start, len);
-   } else {
-   /*
-* Start any new ordered operations before starting to log the
-* inode. We will wait for them to finish in btrfs_sync_log().
-*
-* Right before acquiring the inode's mutex, we might have new
-* writes dirtying pages, which won't immediately start the
-* respective ordered operations - that is done through the
-* fill_delalloc callbacks invoked from the writepage and
-* writepages address space operations. So make sure we start
-* all ordered operations before starting to log our inode. Not
-* doing this means that while logging the inode, writeback
-* could start and invoke writepage/writepages, which would call
-* the fill_delalloc callbacks (cow_file_range,
-* submit_compressed_extents). These callbacks add first an
-* extent map to the modified list of extents and then create
-* the respective ordered operation, which means in
-* tree-log.c:btrfs_log_inode() we might capture all existing
-* ordered operations (with btrfs_get_logged_extents()) before
-* the fill_delalloc callback adds its ordered operation, and by
-* the time we visit the modified list of extent maps (with
-* btrfs_log_changed_extents()), we see and process the extent
-* map they created. We then use the extent map to construct a
-* file extent item for logging without waiting for the
-* respective ordered operation to finish - this file extent
-* item points to a disk location that might not have yet been
-* written to, containing random data - so after a crash a log
-* replay will make our inode have file extent items that point
-* to disk locations containing invalid data, as we returned
-* success to userspace without waiting for the respective
-* ordered operation to finish, because it wasn't captured by
-* btrfs_get_logged_extents().
-*/
-   ret = start_ordered_ops(inode, start, end);
-   }
+   ret = btrfs_wait_ordered_range(inode, start, len);
if (ret) {
inode_unlock(inode);
goto out;
@@ -2240,13 +2199,6 @@ int btrfs_sync_file(struct file *file, loff_t start, 
loff_t end, int datasync)
goto 

[PATCH 1/2] btrfs: kill btrfs_write_inode

2018-05-22 Thread Josef Bacik
From: Josef Bacik 

We don't actually need this.  It used to be in place for O_SYNC writes,
but we've used the normal fsync() path for that for years now.  The
other case we hit this is through sync(), which will commit the
transaction anyway.  All this does is make us commit the transaction a
bunch for no reason, and it could deadlock with delayed iput's.

Signed-off-by: Josef Bacik 
---
 fs/btrfs/inode.c | 26 --
 fs/btrfs/super.c |  1 -
 2 files changed, 27 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 92b629212c81..5d47206604ca 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -6263,32 +6263,6 @@ static int btrfs_real_readdir(struct file *file, struct 
dir_context *ctx)
return ret;
 }
 
-int btrfs_write_inode(struct inode *inode, struct writeback_control *wbc)
-{
-   struct btrfs_root *root = BTRFS_I(inode)->root;
-   struct btrfs_trans_handle *trans;
-   int ret = 0;
-   bool nolock = false;
-
-   if (test_bit(BTRFS_INODE_DUMMY, _I(inode)->runtime_flags))
-   return 0;
-
-   if (btrfs_fs_closing(root->fs_info) &&
-   btrfs_is_free_space_inode(BTRFS_I(inode)))
-   nolock = true;
-
-   if (wbc->sync_mode == WB_SYNC_ALL) {
-   if (nolock)
-   trans = btrfs_join_transaction_nolock(root);
-   else
-   trans = btrfs_join_transaction(root);
-   if (IS_ERR(trans))
-   return PTR_ERR(trans);
-   ret = btrfs_commit_transaction(trans);
-   }
-   return ret;
-}
-
 /*
  * This is somewhat expensive, updating the tree every time the
  * inode changes.  But, it is most likely to find the inode in cache.
diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index 4a584a7ad371..0a4fb69e0ddf 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -2331,7 +2331,6 @@ static const struct super_operations btrfs_super_ops = {
.sync_fs= btrfs_sync_fs,
.show_options   = btrfs_show_options,
.show_devname   = btrfs_show_devname,
-   .write_inode= btrfs_write_inode,
.alloc_inode= btrfs_alloc_inode,
.destroy_inode  = btrfs_destroy_inode,
.statfs = btrfs_statfs,
-- 
2.14.3

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2] Btrfs: fix error handling in btrfs_truncate()

2018-05-22 Thread Omar Sandoval
On Tue, May 22, 2018 at 10:37:14AM -0700, Omar Sandoval wrote:
> On Tue, May 22, 2018 at 07:17:48PM +0200, David Sterba wrote:
> > On Tue, May 22, 2018 at 09:47:58AM -0700, Omar Sandoval wrote:
> > > From: Omar Sandoval 
> > > 
> > > Jun Wu at Facebook reported that an internal service was seeing a return
> > > value of 1 from ftruncate() on Btrfs in some cases.
> > 
> > Do you have a reproducer? To estimate how likely is to hit the problem
> > in practice.

Here's an even easier one: truncating a compressed, inline file twice:

#include 
#include 
#include 
#include 

int main() {
char buf[256] = { 0 };
int ret;
int fd;

fd = open("test", O_CREAT | O_WRONLY | O_TRUNC, 0666);
if (fd == -1) {
perror("open");
return EXIT_FAILURE;
}
if (write(fd, buf, sizeof(buf)) != sizeof(buf)) {
perror("write");
close(fd);
return EXIT_FAILURE;
}
close(fd);

fd = open("test", O_WRONLY, 0666);
if (fd == -1) {
perror("open");
return EXIT_FAILURE;
}
ret = ftruncate(fd, 128);
if (ret) {
printf("first ftruncate() returned %d\n", ret);
close(fd);
return EXIT_FAILURE;
}
close(fd);

fd = open("test", O_WRONLY, 0666);
if (fd == -1) {
perror("open");
return EXIT_FAILURE;
}
ret = ftruncate(fd, 64);
if (ret) {
printf("second ftruncate() returned %d\n", ret);
close(fd);
return EXIT_FAILURE;
}
close(fd);

return EXIT_SUCCESS;
}

The output is

second ftruncate() returned 1
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2] Btrfs: fix error handling in btrfs_truncate()

2018-05-22 Thread Omar Sandoval
On Tue, May 22, 2018 at 07:17:48PM +0200, David Sterba wrote:
> On Tue, May 22, 2018 at 09:47:58AM -0700, Omar Sandoval wrote:
> > From: Omar Sandoval 
> > 
> > Jun Wu at Facebook reported that an internal service was seeing a return
> > value of 1 from ftruncate() on Btrfs in some cases.
> 
> Do you have a reproducer? To estimate how likely is to hit the problem
> in practice.

This reproduces it every time when mounted with compress-force=zstd:

#include 
#include 
#include 
#include 
#include 

int main() {
int i;

for (i = 0; i < 2; i++) {
char buf[256] = { 0 };
int fd, ret;
char *p;

fd = open("test", O_CREAT | O_WRONLY | O_TRUNC, 0666);
if (fd == -1) {
perror("open");
return EXIT_FAILURE;
}
if (write(fd, buf, sizeof(buf)) != sizeof(buf)) {
perror("write");
close(fd);
return EXIT_FAILURE;
}
close(fd);

fd = open("test", O_RDONLY, 0666);
if (fd == -1) {
perror("open");
return EXIT_FAILURE;
}
p = mmap(NULL, 256, PROT_READ, MAP_SHARED, fd, 0);
if (p == MAP_FAILED) {
perror("mmap");
close(fd);
return EXIT_FAILURE;
}
if (p[0] != 0)
return 1;
close(fd);

fd = open("test", O_WRONLY, 0666);
if (fd == -1) {
perror("open");
return EXIT_FAILURE;
}
ret = ftruncate(fd, 128);
if (ret) {
printf("ftruncate() returned %d\n", ret);
close(fd);
return EXIT_FAILURE;
}
close(fd);
}

return EXIT_SUCCESS;
}

This happens any time NEED_TRUNCATE_BLOCK is returned. The file has to
be inline and compressed, and there's some other condition that I
haven't figured out yet which the mmap() is there for.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Btrfs: implement unlocked buffered write

2018-05-22 Thread David Sterba
On Wed, May 16, 2018 at 11:52:37AM +0800, robbieko wrote:
> From: Robbie Ko 
> 
> This idea is from direct io. By this patch, we can make the buffered
> write parallel, and improve the performance and latency. But because we
> can not update isize without i_mutex, the unlocked buffered write just
> can be done in front of the EOF.
> 
> We needn't worry about the race between buffered write and truncate,
> because the truncate need wait until all the buffered write end.
> 
> And we also needn't worry about the race between dio write and punch hole,
> because we have extent lock to protect our operation.
> 
> I ran fio to test the performance of this feature.
> 
> == Hardware ==
> CPU: Intel® Xeon® D-1531
> SSD: Intel S3710 200G
> Volume : RAID 5 , SSD * 6
> 
> == config file ==
> [global]
> group_reporting
> time_based
> thread=1
> norandommap
> ioengine=libaio
> bs=4k
> iodepth=32
> size=16G
> runtime=180
> numjobs=8
> rw=randwrite
> 
> [file1]
> filename=/mnt/btrfs/nocow/testfile
> 
> == result (iops) ==
> lock = 68470
> unlocked = 94242

This looks impressive.

> == result (clat) ==
> lock
>  lat (usec): min=184, max=1209.9K, avg=3738.35, stdev=20869.49
>  clat percentiles (usec):
>   |  1.00th=[  322],  5.00th=[  330], 10.00th=[  334], 20.00th=[  346],
>   | 30.00th=[  370], 40.00th=[  386], 50.00th=[  406], 60.00th=[  446],
>   | 70.00th=[  516], 80.00th=[  612], 90.00th=[  828], 95.00th=[10432],
>   | 99.00th=[84480], 99.50th=[117248], 99.90th=[226304], 99.95th=[333824],
>   | 99.99th=[692224]
> 
> unlocked
>  lat (usec): min=10, max=218208, avg=2691.44, stdev=5380.82
>  clat percentiles (usec):
>   |  1.00th=[  302],  5.00th=[  390], 10.00th=[  442], 20.00th=[  502],
>   | 30.00th=[  548], 40.00th=[  596], 50.00th=[  652], 60.00th=[  724],
>   | 70.00th=[  916], 80.00th=[ 5024], 90.00th=[ 5664], 95.00th=[10048],
>   | 99.00th=[29568], 99.50th=[39168], 99.90th=[54016], 99.95th=[59648],
>   | 99.99th=[78336]

The latencies are better, nice.

> Signed-off-by: Robbie Ko 
> ---
>  fs/btrfs/file.c | 17 +
>  1 file changed, 17 insertions(+)
> 
> diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
> index 41ab907..8eac540 100644
> --- a/fs/btrfs/file.c
> +++ b/fs/btrfs/file.c
> @@ -1600,6 +1600,7 @@ static noinline ssize_t __btrfs_buffered_write(struct 
> file *file,
>   int ret = 0;
>   bool only_release_metadata = false;
>   bool force_page_uptodate = false;
> + bool relock = false;
>  
>   nrptrs = min(DIV_ROUND_UP(iov_iter_count(i), PAGE_SIZE),
>   PAGE_SIZE / (sizeof(struct page *)));
> @@ -1609,6 +1610,18 @@ static noinline ssize_t __btrfs_buffered_write(struct 
> file *file,
>   if (!pages)
>   return -ENOMEM;
>  
> + inode_dio_begin(inode);
> +
> + /*
> +  * If the write is beyond the EOF, we need update
> +  * the isize, but it is protected by i_mutex. So we can
> +  * not unlock the i_mutex at this case.
> +  */
> + if (pos + iov_iter_count(i) <= i_size_read(inode)) {
> + inode_unlock(inode);
> + relock = true;
> + }

And this looks scary :)

The reasons why this should be safe given in the changelog sound good to
me, but I haven't done a closer review. The patch is ok for addition to
for-next so we'll see if it breaks something, I'd rather be a bit
conservative about merging so no promises that it's going to land in
4.18. More reviews are welcome.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Btrfs: implement unlocked buffered write

2018-05-22 Thread Omar Sandoval
On Wed, May 16, 2018 at 11:52:37AM +0800, robbieko wrote:
> From: Robbie Ko 
> 
> This idea is from direct io. By this patch, we can make the buffered
> write parallel, and improve the performance and latency. But because we
> can not update isize without i_mutex, the unlocked buffered write just
> can be done in front of the EOF.
> 
> We needn't worry about the race between buffered write and truncate,
> because the truncate need wait until all the buffered write end.

I'm not convinced that this race isn't an issue. Consider this:

__btrfs_buffered_write()btrfs_setsize()
inode_dio_begin()
inode_unlock()
inode_lock()
truncate_setsize() /* Updates i_size */
truncate_pagecache() /* Locks and unlocks pages 
*/
inode_dio_wait()
prepare_pages() /* Locks pages */
btrfs_dirty_pages() /* Updates i_size without i_rwsem! */

I think moving inode_dio_wait() to before truncate_setsize() in
btrfs_setsize() would close this race, but I haven't thought about it
long enough to determine whether that still works for the original
reason the inode_dio_wait() was added.

> And we also needn't worry about the race between dio write and punch hole,
> because we have extent lock to protect our operation.
> 
> I ran fio to test the performance of this feature.
> 
> == Hardware ==
> CPU: Intel® Xeon® D-1531
> SSD: Intel S3710 200G
> Volume : RAID 5 , SSD * 6
> 
> == config file ==
> [global]
> group_reporting
> time_based
> thread=1
> norandommap
> ioengine=libaio
> bs=4k
> iodepth=32
> size=16G
> runtime=180
> numjobs=8
> rw=randwrite
> 
> [file1]
> filename=/mnt/btrfs/nocow/testfile
> 
> == result (iops) ==
> lock = 68470
> unlocked = 94242
> 
> == result (clat) ==
> lock
>  lat (usec): min=184, max=1209.9K, avg=3738.35, stdev=20869.49
>  clat percentiles (usec):
>   |  1.00th=[  322],  5.00th=[  330], 10.00th=[  334], 20.00th=[  346],
>   | 30.00th=[  370], 40.00th=[  386], 50.00th=[  406], 60.00th=[  446],
>   | 70.00th=[  516], 80.00th=[  612], 90.00th=[  828], 95.00th=[10432],
>   | 99.00th=[84480], 99.50th=[117248], 99.90th=[226304], 99.95th=[333824],
>   | 99.99th=[692224]
> 
> unlocked
>  lat (usec): min=10, max=218208, avg=2691.44, stdev=5380.82
>  clat percentiles (usec):
>   |  1.00th=[  302],  5.00th=[  390], 10.00th=[  442], 20.00th=[  502],
>   | 30.00th=[  548], 40.00th=[  596], 50.00th=[  652], 60.00th=[  724],
>   | 70.00th=[  916], 80.00th=[ 5024], 90.00th=[ 5664], 95.00th=[10048],
>   | 99.00th=[29568], 99.50th=[39168], 99.90th=[54016], 99.95th=[59648],
>   | 99.99th=[78336]
> 
> Signed-off-by: Robbie Ko 
> ---
>  fs/btrfs/file.c | 17 +
>  1 file changed, 17 insertions(+)
> 
> diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
> index 41ab907..8eac540 100644
> --- a/fs/btrfs/file.c
> +++ b/fs/btrfs/file.c
> @@ -1600,6 +1600,7 @@ static noinline ssize_t __btrfs_buffered_write(struct 
> file *file,
>   int ret = 0;
>   bool only_release_metadata = false;
>   bool force_page_uptodate = false;
> + bool relock = false;
>  
>   nrptrs = min(DIV_ROUND_UP(iov_iter_count(i), PAGE_SIZE),
>   PAGE_SIZE / (sizeof(struct page *)));
> @@ -1609,6 +1610,18 @@ static noinline ssize_t __btrfs_buffered_write(struct 
> file *file,
>   if (!pages)
>   return -ENOMEM;
>  
> + inode_dio_begin(inode);

This would need a comment as to why we're using inode_dio_begin() for
buffered I/O.

> + /*
> +  * If the write is beyond the EOF, we need update
> +  * the isize, but it is protected by i_mutex. So we can
> +  * not unlock the i_mutex at this case.
> +  */
> + if (pos + iov_iter_count(i) <= i_size_read(inode)) {
> + inode_unlock(inode);
> + relock = true;
> + }
> +
>   while (iov_iter_count(i) > 0) {
>   size_t offset = pos & (PAGE_SIZE - 1);
>   size_t sector_offset;
> @@ -1808,6 +1821,10 @@ static noinline ssize_t __btrfs_buffered_write(struct 
> file *file,
>   }
>   }
>  
> + inode_dio_end(inode);
> + if (relock)
> + inode_lock(inode);
> +
>   extent_changeset_free(data_reserved);
>   return num_written ? num_written : ret;
>  }
> -- 
> 1.9.1
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2] Btrfs: fix error handling in btrfs_truncate()

2018-05-22 Thread David Sterba
On Tue, May 22, 2018 at 09:47:58AM -0700, Omar Sandoval wrote:
> From: Omar Sandoval 
> 
> Jun Wu at Facebook reported that an internal service was seeing a return
> value of 1 from ftruncate() on Btrfs in some cases.

Do you have a reproducer? To estimate how likely is to hit the problem
in practice.

> This is coming from
> the NEED_TRUNCATE_BLOCK return value from btrfs_truncate_inode_items().
> 
> btrfs_truncate() uses two variables for error handling, ret and err.
> When btrfs_truncate_inode_items() returns non-zero, we set err to the
> return value. However, NEED_TRUNCATE_BLOCK is not an error. Make sure we
> only set err if ret is an error (i.e., negative).
> 
> Fixes: ddfae63cc8e0 ("btrfs: move btrfs_truncate_block out of trans handle")
> Reported-by: Jun Wu 
> Cc: sta...@vger.kernel.org
> Signed-off-by: Omar Sandoval 
> ---
> This version makes the minimal fix which should be good for v4.17 and
> stable. I'll submit a cleanup separately.
> 
>  fs/btrfs/inode.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index d241285a0d2a..f276da70f659 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -9117,7 +9117,8 @@ static int btrfs_truncate(struct inode *inode, bool 
> skip_writeback)
>BTRFS_EXTENT_DATA_KEY);
>   trans->block_rsv = _info->trans_block_rsv;
>   if (ret != -ENOSPC && ret != -EAGAIN) {
> - err = ret;
> + if (ret < 0)
> + err = ret;
>   break;
>   }
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2] btrfs: Skip some btrfs_cross_ref_exist() check in nocow path

2018-05-22 Thread David Sterba
On Thu, May 17, 2018 at 02:58:29PM +0800, Ethan Lien wrote:
> In nocow path, we check if the extent is snapshotted in
> btrfs_cross_ref_exist(). We can do the similar check earlier and avoid
> unnecessary search into extent tree.
> 
> A fio test on a Intel D-1531, 16GB RAM, SSD RAID-5 machine as follows:
> 
> [global]
> group_reporting
> time_based
> thread=1
> ioengine=libaio
> bs=4k
> iodepth=32
> size=64G
> runtime=180
> numjobs=8
> rw=randwrite
> 
> [file1]
> filename=/mnt/nocow/testfile
> 
> IOPS result:   unpatched patched
> 
> 1 fio round: 4667046958
> snapshot
> 2 fio round: 5182654498
> 3 fio round: 5976761289
> 
> After snapshot, the first fio get about 5% performance gain. As we
> continually write to the same file, all writes will resume to nocow mode
> and eventually we have no performance gain.
> 
> Signed-off-by: Ethan Lien 
> ---
> 
> V2:
>  Add comment and performance test.

Thanks, I maybe edit the comments further as do not feel like I
understand why the shortcut can be safely taken just from reading it,
but the code looks ok otherwise. I'll add the patch to for-next.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] Btrfs: clean up error handling in btrfs_truncate()

2018-05-22 Thread Omar Sandoval
From: Omar Sandoval 

btrfs_truncate() uses two variables for error handling, ret and err (if
this sounds familiar, it's because btrfs_truncate_inode_items() did
something similar). This is error prone, as was made evident by "Btrfs:
fix error handling in btrfs_truncate()". We only have err because we
don't want to mask an error if we call btrfs_update_inode() and
btrfs_end_transaction(), so let's make that its own scoped return
variable and use ret everywhere else.

Reviewed-by: Nikolay Borisov 
Signed-off-by: Omar Sandoval 
---
This is the same as my v1 "Btrfs: fix error handling in
btrfs_truncate()", but rebased on top of kdave/for-next + v2 of "Btrfs:
fix error handling in btrfs_truncate()".

 fs/btrfs/inode.c | 33 ++---
 1 file changed, 14 insertions(+), 19 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 220b609d711d..25ca1c14ecb4 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -9019,8 +9019,7 @@ static int btrfs_truncate(struct inode *inode, bool 
skip_writeback)
struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
struct btrfs_root *root = BTRFS_I(inode)->root;
struct btrfs_block_rsv *rsv;
-   int ret = 0;
-   int err = 0;
+   int ret;
struct btrfs_trans_handle *trans;
u64 mask = fs_info->sectorsize - 1;
u64 min_size = btrfs_calc_trunc_metadata_size(fs_info, 1);
@@ -9072,7 +9071,7 @@ static int btrfs_truncate(struct inode *inode, bool 
skip_writeback)
 */
trans = btrfs_start_transaction(root, 2);
if (IS_ERR(trans)) {
-   err = PTR_ERR(trans);
+   ret = PTR_ERR(trans);
goto out;
}
 
@@ -9096,24 +9095,19 @@ static int btrfs_truncate(struct inode *inode, bool 
skip_writeback)
 inode->i_size,
 BTRFS_EXTENT_DATA_KEY);
trans->block_rsv = _info->trans_block_rsv;
-   if (ret != -ENOSPC && ret != -EAGAIN) {
-   if (ret < 0)
-   err = ret;
+   if (ret != -ENOSPC && ret != -EAGAIN)
break;
-   }
 
ret = btrfs_update_inode(trans, root, inode);
-   if (ret) {
-   err = ret;
+   if (ret)
break;
-   }
 
btrfs_end_transaction(trans);
btrfs_btree_balance_dirty(fs_info);
 
trans = btrfs_start_transaction(root, 2);
if (IS_ERR(trans)) {
-   ret = err = PTR_ERR(trans);
+   ret = PTR_ERR(trans);
trans = NULL;
break;
}
@@ -9147,21 +9141,22 @@ static int btrfs_truncate(struct inode *inode, bool 
skip_writeback)
}
 
if (trans) {
+   int ret2;
+
trans->block_rsv = _info->trans_block_rsv;
-   ret = btrfs_update_inode(trans, root, inode);
-   if (ret && !err)
-   err = ret;
+   ret2 = btrfs_update_inode(trans, root, inode);
+   if (ret2 && !ret)
+   ret = ret2;
 
-   ret = btrfs_end_transaction(trans);
+   ret2 = btrfs_end_transaction(trans);
+   if (ret2 && !ret)
+   ret = ret2;
btrfs_btree_balance_dirty(fs_info);
}
 out:
btrfs_free_block_rsv(fs_info, rsv);
 
-   if (ret && !err)
-   err = ret;
-
-   return err;
+   return ret;
 }
 
 /*
-- 
2.17.0

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2] Btrfs: fix error handling in btrfs_truncate()

2018-05-22 Thread Omar Sandoval
From: Omar Sandoval 

Jun Wu at Facebook reported that an internal service was seeing a return
value of 1 from ftruncate() on Btrfs in some cases. This is coming from
the NEED_TRUNCATE_BLOCK return value from btrfs_truncate_inode_items().

btrfs_truncate() uses two variables for error handling, ret and err.
When btrfs_truncate_inode_items() returns non-zero, we set err to the
return value. However, NEED_TRUNCATE_BLOCK is not an error. Make sure we
only set err if ret is an error (i.e., negative).

Fixes: ddfae63cc8e0 ("btrfs: move btrfs_truncate_block out of trans handle")
Reported-by: Jun Wu 
Cc: sta...@vger.kernel.org
Signed-off-by: Omar Sandoval 
---
This version makes the minimal fix which should be good for v4.17 and
stable. I'll submit a cleanup separately.

 fs/btrfs/inode.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index d241285a0d2a..f276da70f659 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -9117,7 +9117,8 @@ static int btrfs_truncate(struct inode *inode, bool 
skip_writeback)
 BTRFS_EXTENT_DATA_KEY);
trans->block_rsv = _info->trans_block_rsv;
if (ret != -ENOSPC && ret != -EAGAIN) {
-   err = ret;
+   if (ret < 0)
+   err = ret;
break;
}
 
-- 
2.17.0

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] btrfs: balance dirty metadata pages in btrfs_finish_ordered_io

2018-05-22 Thread David Sterba
On Fri, Apr 27, 2018 at 03:05:24PM +0800, Ethan Lien wrote:
> We should balance dirty metadata pages at the end of
> btrfs_finish_ordered_io, since a small, unmergeable random write can
> potentially produce dirty metadata which is multiple times larger than
> the data itself. For example, a small, unmergeable 4KiB write may
> produce:
> 
> 16KiB dirty leaf (and possibly 16KiB dirty node) in subvolume tree
> 16KiB dirty leaf (and possibly 16KiB dirty node) in checksum tree
> 16KiB dirty leaf (and possibly 16KiB dirty node) in extent tree
> 
> Although we do call balance dirty pages in write side, but in the
> buffered write path, most metadata are dirtied only after we reach the
> dirty background limit (which by far onlys counts dirty data pages) and
> wakeup the flusher thread. If there are many small, unmergeable random
> writes spread in a large btree, we'll find a burst of dirty pages
> exceeds the dirty_bytes limit after we wakeup the flusher thread - which
> is not what we expect. In our machine, it caused out-of-memory problem
> since a page cannot be dropped if it is marked dirty.
> 
> Someone may worry about we may sleep in btrfs_btree_balance_dirty(), but
> since we do btrfs_finish_ordered_io in a separate worker, it will not
> stop the flusher consuming dirty pages. Also, we use different worker for
> metadata writeback endio, sleep in btrfs_finish_ordered_io help us
> throttle the size of dirty metadata pages.

The described scenario sounds interesting. Do you have some scripted
steps to reproduce it?

btrfs_btree_balance_dirty detects congestion and can skip the balancing
eventually, so the case when it actually does something and waits is the
point where things can go bad. From the last paragraph, it is clear that
you have considered that, that's good.

Have you also considered calling __btrfs_btree_balance_dirty with
flush_delayed=0 ? This would avoid the waiting and I'm not sure if it's
really required here to get out of the situation.

Anyway, I'll add the patch to for-next for more testing.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/2] btrfs: propagate failures of __exclude_logged_extent to upper caller

2018-05-22 Thread David Sterba
On Tue, May 22, 2018 at 05:46:51PM +0800, Gu Jinxiang wrote:
> Function btrfs_exclude_logged_extents may call __exclude_logged_extent
> which may fail.
> Propagate the failures of __exclude_logged_extent to upper caller.
> 
> Signed-off-by: Gu Jinxiang 

Reviewed-by: David Sterba 

The whole group of extent exclusion functions needs to be audited for
proper error handling.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/2] btrfs: handle failures of set_extent_bits in add_excluded_extent

2018-05-22 Thread David Sterba
On Tue, May 22, 2018 at 05:46:50PM +0800, Gu Jinxiang wrote:
> set_extent_bits may return 0/-EEXIST, so return the result in
> add_excluded_extent.

This is misleading, set_extent_bits can return anything that gets
propagated from the callees, which is 0 and -EEXIST for now but will be
also -ENOMEM eventually. And some callers expect values >= 0 (eg. in
btrfs_get_io_failure_record) , but I haven't examined if this can really
happen.

> Signed-off-by: Gu Jinxiang 
> ---
>  fs/btrfs/extent-tree.c | 11 ---
>  1 file changed, 8 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index 75cfb80d2551..2e85e99b5e6f 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -215,11 +215,16 @@ static int add_excluded_extent(struct btrfs_fs_info 
> *fs_info,
>  u64 start, u64 num_bytes)
>  {
>   u64 end = start + num_bytes - 1;
> - set_extent_bits(_info->freed_extents[0],
> + int ret = 0;
> +
> + ret = set_extent_bits(_info->freed_extents[0],
>   start, end, EXTENT_UPTODATE);
> - set_extent_bits(_info->freed_extents[1],
> + if (ret)
> + goto out;
> + ret = set_extent_bits(_info->freed_extents[1],
>   start, end, EXTENT_UPTODATE);
> - return 0;
> +out:

The function is short and fairly linear so you don't need to add the
label, 'return ret' would be ok.

> + return ret;
>  }
>  
>  static void free_excluded_extents(struct btrfs_fs_info *fs_info,
> -- 
> 1.9.1
> 
> 
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/4] Cleanup alloc_reserved_tree_block signature

2018-05-22 Thread David Sterba
On Mon, May 21, 2018 at 12:27:19PM +0300, Nikolay Borisov wrote:
> At the moment alloc_reserved_tree_block takes 8 friggin arguments. The irony 
> is that all of those are really derived from 3 structures. This series ends 
> up 
> reducing the arguments to 3. As a result some code, private to 
> alloc_reserved_tree_block
> is moved from run_delayed_tree_ref. Patch 1 removes the fs_info argument, 
> Patch 2 replaces all arguments derived from btrfs_delayed_ref_node with the 
> nodes itself, Patch 3 replaces the last 2 arguments with 
> btrfs_delayed_extent_op
> and finally, Patch 4 simplifies the code a bit by removing one extra check.
> 
> All in all this doesn't provide any functional changes but makes the code 
> easier to follow. Also it has survived 2 xfstest runs. 
> 
> Nikolay Borisov (4):
>   btrfs: Remove fs_info argument from alloc_reserved_tree_block
>   btrfs: Simplify alloc_reserved_tree_block interface
>   btrfs: Pass btrfs_delayed_extent_op to alloc_reserved_tree_block
>   btrfs: Streamline shared ref check in alloc_reserved_tree_block

Passed fstests and added to misc-next, thanks.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 4/4] btrfs: Streamline shared ref check in alloc_reserved_tree_block

2018-05-22 Thread David Sterba
On Mon, May 21, 2018 at 12:27:23PM +0300, Nikolay Borisov wrote:
> Instead of setting "parent" to ref->parent only when dealing with
> a shared ref and subsequently performing another check to see
> if (parent > 0), check the "node->type" directly and act accordingly.
> This makes the code more streamline. No functional changes.
> 
> Signed-off-by: Nikolay Borisov 

Reviewed-by: David Sterba 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 3/4] btrfs: Pass btrfs_delayed_extent_op to alloc_reserved_tree_block

2018-05-22 Thread David Sterba
On Mon, May 21, 2018 at 12:27:22PM +0300, Nikolay Borisov wrote:
> Instead of taking only specific member of this structure, which results
> in 2 extra arguments, just take the delayed_extent_op struct and
> reference the arguments inside the functions. No functional changes.
> 
> Signed-off-by: Nikolay Borisov 

Reviewed-by: David Sterba 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/4] btrfs: Simplify alloc_reserved_tree_block interface

2018-05-22 Thread David Sterba
On Mon, May 21, 2018 at 12:27:21PM +0300, Nikolay Borisov wrote:
> This function currently takes 7 parameters, most of which are
> proxies for values from btrfs_delayed_ref_node struct which is not
> passed. This patch simplifies the interface of the function by
> simply passing said delayed ref node struct to the function. This
> enables us to:
> 
> 1. Move locals variables and init code related to them from
> run_delayed_tree_ref which should only be used inside
> alloc_reserved_tree_block, such as skinny_metadata and the btrfs_key,
> representing the extent being inserted. This removes the need for the
> "ins" argument. Instead, it's replaced by a local var with a more
> verbose name - extent_key.
> 
> 2. Now that we have a reference to the node in alloc_reserved_tree_block
> the delayed_tree_ref struct can be referenced inside the function and
> this enable removing the "ref->level", "parent" and "ref_root"
> arguments.
> 
> Signed-off-by: Nikolay Borisov 

Reviewed-by: David Sterba 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/4] btrfs: Remove fs_info argument from alloc_reserved_tree_block

2018-05-22 Thread David Sterba
On Mon, May 21, 2018 at 12:27:20PM +0300, Nikolay Borisov wrote:
> This function already takes a transaction handle which contains a
> reference to the fs_info. So use this and remove the extra argument.
> 
> Signed-off-by: Nikolay Borisov 
Reviewed-by: David Sterba 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 3/4] btrfs: lzo: Add header length check to avoid slab out of bounds access

2018-05-22 Thread David Sterba
On Mon, May 21, 2018 at 01:19:26PM +0800, Qu Wenruo wrote:
> James Harvey reported that some corrupted compressed extent data can
> lead to various kernel memory corruption.
> 
> Such corrupted extent data belongs to inode with NODATASUM flags, thus
> data csum won't help us detecting such bug.
> 
> If lucky enough, kasan could catch it like:
> ==
> BUG: KASAN: slab-out-of-bounds in lzo_decompress_bio+0x384/0x7a0 [btrfs]
> Write of size 4096 at addr 8800606cb0f8 by task kworker/u16:0/2338
> 
> CPU: 3 PID: 2338 Comm: kworker/u16:0 Tainted: G   O  
> 4.17.0-rc5-custom+ #50
> Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
> Workqueue: btrfs-endio btrfs_endio_helper [btrfs]
> Call Trace:
>  dump_stack+0xc2/0x16b
>  print_address_description+0x6a/0x270
>  kasan_report+0x260/0x380
>  memcpy+0x34/0x50
>  lzo_decompress_bio+0x384/0x7a0 [btrfs]
>  end_compressed_bio_read+0x99f/0x10b0 [btrfs]
>  bio_endio+0x32e/0x640
>  normal_work_helper+0x15a/0xea0 [btrfs]
>  process_one_work+0x7e3/0x1470
>  worker_thread+0x1b0/0x1170
>  kthread+0x2db/0x390
>  ret_from_fork+0x22/0x40
> ...
> ==
> 
> The offending compressed data has the following info:
> 
> Header:   length 32768(Looks completely valid)
> Segment 0 Header: length 3472882419   (Obvious out of bounds)
> 
> Then when handling segment 0, since it's over the current page, we need
> the compressed data to workspace, then such large size would trigger
> out-of-bounds memory access, screwing up the whole kernel.
> 
> Fix it by adding extra checks on header and segment headers to ensure we
> won't access out-of-bounds, and even checks the decompressed data won't
> be out-of-bounds.

Good, feel free to add more if you find them. The BUG_ON in
lzo_decompress can be replaced by return -EUCLEAN and the total size can
be compared with the segment size if they match too.

> Reported-by: James Harvey 
> Signed-off-by: Qu Wenruo 
> ---
>  fs/btrfs/lzo.c | 35 ++-
>  1 file changed, 34 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/btrfs/lzo.c b/fs/btrfs/lzo.c
> index d0c6789ff78f..4c75dcba3f04 100644
> --- a/fs/btrfs/lzo.c
> +++ b/fs/btrfs/lzo.c
> @@ -281,6 +281,7 @@ static int lzo_decompress_bio(struct list_head *ws, 
> struct compressed_bio *cb)
>   unsigned long working_bytes;
>   size_t in_len;
>   size_t out_len;
> + size_t max_segment_len = lzo1x_worst_compress(PAGE_SIZE);

The value is a compile-time constant, it does not need to be stored in a
variable.

>   unsigned long in_offset;
>   unsigned long in_page_bytes_left;
>   unsigned long tot_in;
> @@ -294,6 +295,18 @@ static int lzo_decompress_bio(struct list_head *ws, 
> struct compressed_bio *cb)
>  
>   data_in = kmap(pages_in[0]);
>   tot_len = read_compress_length(data_in);
> + /*
> +  * Compressed data header check.
> +  *
> +  * The real compressed size can't exceed extent length, and all pages
> +  * should be used (a full pending page is not possible).
> +  * If this happens it means the compressed extent is corrupted.
> +  */
> + if (tot_len > min_t(size_t, BTRFS_MAX_COMPRESSED, srclen) ||
> + tot_len < srclen - PAGE_SIZE) {

All such conditions can be put into unlikely() as this is an error
handling shortcut.

> + ret = -EUCLEAN;
> + goto done;
> + }
>  
>   tot_in = LZO_LEN;
>   in_offset = LZO_LEN;
> @@ -308,6 +321,17 @@ static int lzo_decompress_bio(struct list_head *ws, 
> struct compressed_bio *cb)
>   in_offset += LZO_LEN;
>   tot_in += LZO_LEN;
>  
> + /*
> +  * Segment header check.
> +  *
> +  * The segment length must not exceed max lzo compression
> +  * size, nor the total compressed size
> +  */
> + if (in_len > max_segment_len || tot_in + in_len > tot_len) {
> + ret = -EUCLEAN;
> + goto done;
> + }
> +
>   tot_in += in_len;
>   working_bytes = in_len;
>   may_late_unmap = need_unmap = false;
> @@ -358,7 +382,7 @@ static int lzo_decompress_bio(struct list_head *ws, 
> struct compressed_bio *cb)
>   }
>   }
>  
> - out_len = lzo1x_worst_compress(PAGE_SIZE);
> + out_len = max_segment_len;
>   ret = lzo1x_decompress_safe(buf, in_len, workspace->buf,
>   _len);
>   if (need_unmap)
> @@ -368,6 +392,15 @@ static int lzo_decompress_bio(struct list_head *ws, 
> struct compressed_bio *cb)
>   ret = -EIO;
>   break;
>   }
> + /*
> +  

Re: [PATCH 3/4] btrfs: lzo: Add header length check to avoid slab out of bounds access

2018-05-22 Thread David Sterba
On Thu, May 17, 2018 at 11:14:46AM +0300, Nikolay Borisov wrote:
> srclen  comes from the async_extent struct, which in turns is
> initialized in compress_file_range with the value of "total_compressed",
> and the value there is actually initialized by
> btrfs_compress_pages->lzo_compress_pages (that code makes me wanna sing
> "You spin me right round, baby Right round like a record, baby").

I have a dusted patch that reworks the whole loop to something more
understandable, I can post it with some other pending compression
updates.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 2/4] btrfs: lzo: Add comment about the how btrfs records its lzo compressed data

2018-05-22 Thread David Sterba
On Mon, May 21, 2018 at 01:19:25PM +0800, Qu Wenruo wrote:
> Although it's not that complex, but such comment could still save
> several minutes for newer reader/reviewer.
> 
> Signed-off-by: Qu Wenruo 
> ---
>  fs/btrfs/lzo.c | 23 +++
>  1 file changed, 23 insertions(+)
> 
> diff --git a/fs/btrfs/lzo.c b/fs/btrfs/lzo.c
> index 0667ea07f766..d0c6789ff78f 100644
> --- a/fs/btrfs/lzo.c
> +++ b/fs/btrfs/lzo.c
> @@ -17,6 +17,29 @@
>  
>  #define LZO_LEN  4
>  
> +/*
> + * Btrfs LZO compression format
> + *
> + * Regular LZO compressed data extent consists of:
> + * 1.  Header
> + * Fixed size. LZO_LEN (4) bytes long, LE32.
> + * Records the total size (*includes* the header) of real compressed 
> data.
> + *
> + * 2.  Segment(s)
> + * Variable size. Includes one segment header, and then data payload.
> + * One regular LZO compressed extent can have one or more segments.
> + *
> + * 2.1 Segment header
> + * Fixed size. LZO_LEN (4) bytes long, LE32.
> + * Records the total size of the segment (*excludes* the header).
> + *
> + * 2.2 Data Payload
> + * Variable size. Size up limit should be 
> lzo1x_worst_compress(PAGE_SIZE).
> + *
> + * While for inlined LZO compressed data extent, it doesn't have Header, just
> + * *ONE* Segment.

It does have header that has the same value as the 1st segment header so
it's skipped.

There's also one catch in the format that if there's a less then 4 bytes
left to the page size, the new segment starts on the new page.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Btrfs: fix error handling in btrfs_truncate()

2018-05-22 Thread David Sterba
On Tue, May 22, 2018 at 03:11:04PM +0300, Nikolay Borisov wrote:
> 
> 
> On 22.05.2018 14:53, David Sterba wrote:
> > On Fri, May 18, 2018 at 02:43:02PM -0700, Omar Sandoval wrote:
> >> From: Omar Sandoval 
> >>
> >> Jun Wu at Facebook reported that an internal service was seeing a return
> >> value of 1 from ftruncate() on Btrfs when compression is enabled. This
> >> is coming from the NEED_TRUNCATE_BLOCK return value from
> >> btrfs_truncate_inode_items().
> >>
> >> btrfs_truncate() uses two variables for error handling, ret and err (if
> >> this sounds familiar, it's because btrfs_truncate_inode_items() does
> >> something similar).
> > 
> > This err/ret pattern is widely used in btrfs code and IIRC mostly
> > consistently, where err is the global return and ret is local.
> 
> And as a matter of fact it's a bad pattern, precisely because of the
> type of errors this and one of the patches in the orphan items fix.
> Everytime I hear "global state" I cringe, since it just opens avenues
> for bugs.

If we do the single point of return in the function, then the single
global (function-scope) ret value makes sense.

Arguably one can do the same sort of mistakes with the ret/ret2 pattern,
we still need to correctly propagate the local to the function-scope, so
switching is not strictly neccessary and not an all-cure.

> >> When btrfs_truncate_inode_items() returns non-zero,
> >> we set err to the return value, but we don't reset it to zero in the
> >> successful NEED_TRUNCATE_BLOCK case. We only have err because we don't
> >> want to mask an error if we call btrfs_update_inode() and
> >> btrfs_end_transaction(), so let's make that its own scoped return
> >> variable and use ret everywhere else.
> > 
> > That's an alternative pattern when 'ret' is the global return and ret2
> > are locals. Either way, it would be good to unify that.
> 
> FWIW I prefer the style put forth in this patch, since it's easier to
> reason about ret2 because it's defined in a narrower context (the single
> if (trans)) which helps spot problems more easily.

Yeah, consistency POV, keeping the meaning of 'ret' same everywhere
would be better, and we can add any number of ret2, ret3 if we have more
nested contexts.

I did a quick and dirty grep where 'err' is used as the function-scope
and it's in low-tens, so I think it's doable.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v3 2/3] btrfs: balance: add args info during start and resume

2018-05-22 Thread David Sterba
On Mon, May 21, 2018 at 02:37:44PM +0800, Anand Jain wrote:
> Balance arg info is an important information to be reviewed for the
> system audit. So this patch adds them to the kernel log.
> 
> Example:
> 
> -> btrfs bal start -dprofiles='raid1|single',convert=raid5 
> -mprofiles='raid1|single',convert=raid5 /btrfs
>  kernel: BTRFS info (device sdb): balance: start 
> 

Re: [PATCH v2 0/6] btrfs_search_slot cleanups

2018-05-22 Thread Nikolay Borisov


On 22.05.2018 15:02, David Sterba wrote:
> On Tue, May 22, 2018 at 07:05:14PM +0800, Su Yue wrote:
>> Hi Liu and David,
>>  During my local xfstests on kdave/for-next, btrfs/139 failed and
>> btrfs BUG_ON due to qgroup rescan.
>>  The bisect result is commit 560215eb3f32("Merge branch
>> 'ext/liubo/search-cleanups-wip' into for-next-next-v4.18-20180521")
>> which seems merged this patchset.
>>  The dmesg file is attached.
>>
>> On 05/18/2018 11:00 AM, Liu Bo wrote:
>>> Here are a collection of patches I did for btrfs_search_slot().
>>>
>>> v2: more explicit commit log for each patch.
>>>
>>> Liu Bo (6):
>>>   Btrfs: remove superfluous free_extent_buffer
>>>   Btrfs: use more straightforward extent_buffer_uptodate
>>>   Btrfs: move get root of btrfs_search_slot to a helper
>>>   Btrfs: remove unused check of skip_locking
>>>   Btrfs: grab write lock directly if write_lock_level is the max level
>>>   Btrfs: remove always true check in unlock_up
>>>
>>>  fs/btrfs/ctree.c | 121 
>>> +--
>>>  1 file changed, 73 insertions(+), 48 deletions(-)
>>>
>>
>>
> 
>> [   46.129166] BTRFS info (device vdb): disk space caching is enabled
>> [   46.130545] BTRFS info (device vdb): has skinny extents
>> [   46.171386] mount (2798) used greatest stack depth: 12920 bytes left
>> [   46.508170] BTRFS: device fsid 83a117c7-a9ea-4bf5-b42f-7092078610d5 devid 
>> 1 transid 5 /dev/vdc
>> [   46.562428] BTRFS info (device vdc): disk space caching is enabled
>> [   46.563690] BTRFS info (device vdc): has skinny extents
>> [   46.564563] BTRFS info (device vdc): flagging fs with big metadata feature
>> [   46.587441] BTRFS info (device vdc): checking UUID tree
>> [   46.766765] BTRFS info (device vdb): disk space caching is enabled
>> [   46.768197] BTRFS info (device vdb): has skinny extents
>> [   46.875534] run fstests btrfs/139 at 2018-05-22 18:40:36
>> [   47.559411] BTRFS: device fsid 065f3825-057e-451f-8722-0d94d4a3f36f devid 
>> 1 transid 5 /dev/vdc
>> [   47.612001] BTRFS info (device vdc): disk space caching is enabled
>> [   47.613254] BTRFS info (device vdc): has skinny extents
>> [   47.614147] BTRFS info (device vdc): flagging fs with big metadata feature
>> [   47.632377] BTRFS info (device vdc): checking UUID tree
>> [   47.681656] btrfs (3176) used greatest stack depth: 12632 bytes left
>> [   47.691156] [ cut here ]
>> [   47.692084] kernel BUG at fs/btrfs/locking.c:286!
> 
> I saw the crash too but did not investigate the root cause. So I'll
> remove the branch from for-next until it's fixed. Thanks for the report.

I think the problem stems from Qu's patch, which sets search_commit_root
=1 but doesn't set skip_locking, as a result we don't lock the tree when
we obtain a reference to the root node, yet later when traversing the
tree due to skip_locking not being set we try to lock it, and this
causes btrfs_assert_tree_locked to triggers. Can you test whether the
following diff solves the issues:


diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
index bc19a7d11c98..23fadb640c59 100644
--- a/fs/btrfs/qgroup.c
+++ b/fs/btrfs/qgroup.c
@@ -2702,6 +2702,7 @@ static void btrfs_qgroup_rescan_worker(struct
btrfs_work *work)
 * should be recorded by qgroup
 */
path->search_commit_root = 1;
+   path->skip_locking = 1;

err = 0;
while (!err && !btrfs_fs_closing(fs_info)) {


If it does, this only means we need to make skip_locking = 1 being
conditional on search_commit_root being set and this situation should be
handled in btrfs_search_slot.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v3 1/3] btrfs: add helper function describe_block_group()

2018-05-22 Thread David Sterba
On Mon, May 21, 2018 at 02:37:43PM +0800, Anand Jain wrote:
> Improve on describe_relocation() add a common helper function to describe
> the block groups.
> 
> Signed-off-by: Anand Jain 
> ---
> v3: Born.
> 
>  fs/btrfs/relocation.c | 30 +++---
>  fs/btrfs/volumes.c| 44 
>  fs/btrfs/volumes.h|  1 +
>  3 files changed, 48 insertions(+), 27 deletions(-)
> 
> diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
> index 879b76fa881a..b9b32b0e4ba4 100644
> --- a/fs/btrfs/relocation.c
> +++ b/fs/btrfs/relocation.c
> @@ -4320,37 +4320,13 @@ static struct reloc_control *alloc_reloc_control(void)
>  static void describe_relocation(struct btrfs_fs_info *fs_info,
>   struct btrfs_block_group_cache *block_group)
>  {
> - char buf[128];  /* prefixed by a '|' that'll be dropped */
> - u64 flags = block_group->flags;
> + char buf[128];
>  
> - /* Shouldn't happen */
> - if (!flags) {
> - strcpy(buf, "|NONE");
> - } else {
> - char *bp = buf;
> -
> -#define DESCRIBE_FLAG(f, d) \
> - if (flags & BTRFS_BLOCK_GROUP_##f) { \
> - bp += snprintf(bp, buf - bp + sizeof(buf), "|%s", d); \
> - flags &= ~BTRFS_BLOCK_GROUP_##f; \
> - }
> - DESCRIBE_FLAG(DATA, "data");
> - DESCRIBE_FLAG(SYSTEM,   "system");
> - DESCRIBE_FLAG(METADATA, "metadata");
> - DESCRIBE_FLAG(RAID0,"raid0");
> - DESCRIBE_FLAG(RAID1,"raid1");
> - DESCRIBE_FLAG(DUP,  "dup");
> - DESCRIBE_FLAG(RAID10,   "raid10");
> - DESCRIBE_FLAG(RAID5,"raid5");
> - DESCRIBE_FLAG(RAID6,"raid6");
> - if (flags)
> - snprintf(bp, buf - bp + sizeof(buf), "|0x%llx", flags);
> -#undef DESCRIBE_FLAG
> - }
> + describe_block_groups(block_group->flags, buf, sizeof(buf));
>  
>   btrfs_info(fs_info,
>  "relocating block group %llu flags %s",
> -block_group->key.objectid, buf + 1);
> +block_group->key.objectid, buf);
>  }
>  
>  /*
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index 43bd65395106..02b1d14f510d 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -126,6 +126,50 @@ const char *get_raid_name(enum btrfs_raid_types type)
>   return btrfs_raid_array[type].raid_name;
>  }
>  
> +u32 describe_block_groups(u64 bg_flags, char *describe, u32 sz)
> +{
> + int i;
> + u32 ret;
> + char *dp = describe;
> + u64 flags = bg_flags;
> +
> + if (!flags)
> + return snprintf(dp, sz, "%s", "NONE");
> +
> +#define DESCRIBE_FLAG(f, d) \
> + do { \
> + if (flags & BTRFS_BLOCK_GROUP_##f) { \
> + dp += snprintf(dp, describe - dp + sz, "%s|", d); \
> + flags &= ~BTRFS_BLOCK_GROUP_##f; \
> + } \
> + } while (0)
> + DESCRIBE_FLAG(DATA, "data");
> + DESCRIBE_FLAG(SYSTEM,   "system");
> + DESCRIBE_FLAG(METADATA, "metadata");
> +#undef DESCRIBE_FLAG

Adding the convenience macro for 3 uses and then opencoding it 2 times
does not make much sense to me. Why don't you change that to use the 1st
argument as the exact flag to mask out and the 2nd argument as the
string to print?

> +
> + if (flags & BTRFS_AVAIL_ALLOC_BIT_SINGLE) {
> + dp += snprintf(dp, describe - dp + sz,
> +"%s|", "single");
> + flags &= ~BTRFS_AVAIL_ALLOC_BIT_SINGLE;
> + }
> +
> + for (i = 0; i < BTRFS_NR_RAID_TYPES; i++) {

And this would work inside the loop too.

> + if (flags & btrfs_raid_array[i].bg_flag) {
> + dp += snprintf(dp, describe - dp + sz,
> +"%s|", btrfs_raid_array[i].raid_name);
> + flags &= ~btrfs_raid_array[i].bg_flag;
> + }
> + }
> +
> + if (flags)
> + dp += snprintf(dp, describe - dp + sz, "0x%llx|", flags);
> +
> + ret = dp - describe - 1;
> + describe[ret] = '\0'; /* remove last | */
> + return ret;
> +}
> +
>  static int init_first_rw_device(struct btrfs_trans_handle *trans,
>   struct btrfs_fs_info *fs_info);
>  static int btrfs_relocate_sys_chunks(struct btrfs_fs_info *fs_info);
> diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
> index 5139ec8daf4c..52ac258daa17 100644
> --- a/fs/btrfs/volumes.h
> +++ b/fs/btrfs/volumes.h
> @@ -433,6 +433,7 @@ int btrfs_init_new_device(struct btrfs_fs_info *fs_info, 
> const char *path);
>  int btrfs_balance(struct btrfs_fs_info *fs_info,
> struct btrfs_balance_control *bctl,
> struct btrfs_ioctl_balance_args *bargs);
> +u32 describe_block_groups(u64 flags, char *describe, u32 sz);

Please prefix an exported 

Re: [PATCH v2 0/6] btrfs_search_slot cleanups

2018-05-22 Thread David Sterba
On Tue, May 22, 2018 at 07:05:14PM +0800, Su Yue wrote:
> Hi Liu and David,
>   During my local xfstests on kdave/for-next, btrfs/139 failed and
> btrfs BUG_ON due to qgroup rescan.
>   The bisect result is commit 560215eb3f32("Merge branch
> 'ext/liubo/search-cleanups-wip' into for-next-next-v4.18-20180521")
> which seems merged this patchset.
>   The dmesg file is attached.
>
> On 05/18/2018 11:00 AM, Liu Bo wrote:
> > Here are a collection of patches I did for btrfs_search_slot().
> > 
> > v2: more explicit commit log for each patch.
> > 
> > Liu Bo (6):
> >   Btrfs: remove superfluous free_extent_buffer
> >   Btrfs: use more straightforward extent_buffer_uptodate
> >   Btrfs: move get root of btrfs_search_slot to a helper
> >   Btrfs: remove unused check of skip_locking
> >   Btrfs: grab write lock directly if write_lock_level is the max level
> >   Btrfs: remove always true check in unlock_up
> > 
> >  fs/btrfs/ctree.c | 121 
> > +--
> >  1 file changed, 73 insertions(+), 48 deletions(-)
> > 
> 
> 

> [   46.129166] BTRFS info (device vdb): disk space caching is enabled
> [   46.130545] BTRFS info (device vdb): has skinny extents
> [   46.171386] mount (2798) used greatest stack depth: 12920 bytes left
> [   46.508170] BTRFS: device fsid 83a117c7-a9ea-4bf5-b42f-7092078610d5 devid 
> 1 transid 5 /dev/vdc
> [   46.562428] BTRFS info (device vdc): disk space caching is enabled
> [   46.563690] BTRFS info (device vdc): has skinny extents
> [   46.564563] BTRFS info (device vdc): flagging fs with big metadata feature
> [   46.587441] BTRFS info (device vdc): checking UUID tree
> [   46.766765] BTRFS info (device vdb): disk space caching is enabled
> [   46.768197] BTRFS info (device vdb): has skinny extents
> [   46.875534] run fstests btrfs/139 at 2018-05-22 18:40:36
> [   47.559411] BTRFS: device fsid 065f3825-057e-451f-8722-0d94d4a3f36f devid 
> 1 transid 5 /dev/vdc
> [   47.612001] BTRFS info (device vdc): disk space caching is enabled
> [   47.613254] BTRFS info (device vdc): has skinny extents
> [   47.614147] BTRFS info (device vdc): flagging fs with big metadata feature
> [   47.632377] BTRFS info (device vdc): checking UUID tree
> [   47.681656] btrfs (3176) used greatest stack depth: 12632 bytes left
> [   47.691156] [ cut here ]
> [   47.692084] kernel BUG at fs/btrfs/locking.c:286!

I saw the crash too but did not investigate the root cause. So I'll
remove the branch from for-next until it's fixed. Thanks for the report.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2] btrfs: Use btrfs_mark_bg_unused() to replace open code

2018-05-22 Thread David Sterba
On Tue, May 22, 2018 at 04:43:47PM +0800, Qu Wenruo wrote:
> Introduce a small helper, btrfs_mark_bg_unused(), to accquire needed
> locks and add a block group to unused_bgs list.

The helper is nice but hides that there's a reference taken on the 'bg'.
This would be good to add at least to the function comment or somehow
squeeze it into the function name itself, like
btrfs_get_and_mark_bg_unused.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Btrfs: fix error handling in btrfs_truncate()

2018-05-22 Thread Nikolay Borisov


On 22.05.2018 14:53, David Sterba wrote:
> On Fri, May 18, 2018 at 02:43:02PM -0700, Omar Sandoval wrote:
>> From: Omar Sandoval 
>>
>> Jun Wu at Facebook reported that an internal service was seeing a return
>> value of 1 from ftruncate() on Btrfs when compression is enabled. This
>> is coming from the NEED_TRUNCATE_BLOCK return value from
>> btrfs_truncate_inode_items().
>>
>> btrfs_truncate() uses two variables for error handling, ret and err (if
>> this sounds familiar, it's because btrfs_truncate_inode_items() does
>> something similar).
> 
> This err/ret pattern is widely used in btrfs code and IIRC mostly
> consistently, where err is the global return and ret is local.

And as a matter of fact it's a bad pattern, precisely because of the
type of errors this and one of the patches in the orphan items fix.
Everytime I hear "global state" I cringe, since it just opens avenues
for bugs.

> 
>> When btrfs_truncate_inode_items() returns non-zero,
>> we set err to the return value, but we don't reset it to zero in the
>> successful NEED_TRUNCATE_BLOCK case. We only have err because we don't
>> want to mask an error if we call btrfs_update_inode() and
>> btrfs_end_transaction(), so let's make that its own scoped return
>> variable and use ret everywhere else.
> 
> That's an alternative pattern when 'ret' is the global return and ret2
> are locals. Either way, it would be good to unify that.

FWIW I prefer the style put forth in this patch, since it's easier to
reason about ret2 because it's defined in a narrower context (the single
if (trans)) which helps spot problems more easily.
> 
>> Fixes: ddfae63cc8e0 ("btrfs: move btrfs_truncate_block out of trans handle")
>> Reported-by: Jun Wu 
>> Signed-off-by: Omar Sandoval 
>> ---
>> This is based on Linus' master rather than my orphan ENOSPC fixes
>> because I think we want to get this in for v4.17 and stable, and rebase
>> my fixes on top of this.
> 
> For a 4.17-rc the patch would need to be smaller. You describe the
> actual bug as a missing reset of the err, so if it's just that, we can
> consider that for 4.17, but the patch as it is changes the error value
> propagatin in not so trivial way.
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Btrfs: fix error handling in btrfs_truncate()

2018-05-22 Thread David Sterba
On Fri, May 18, 2018 at 02:43:02PM -0700, Omar Sandoval wrote:
> From: Omar Sandoval 
> 
> Jun Wu at Facebook reported that an internal service was seeing a return
> value of 1 from ftruncate() on Btrfs when compression is enabled. This
> is coming from the NEED_TRUNCATE_BLOCK return value from
> btrfs_truncate_inode_items().
> 
> btrfs_truncate() uses two variables for error handling, ret and err (if
> this sounds familiar, it's because btrfs_truncate_inode_items() does
> something similar).

This err/ret pattern is widely used in btrfs code and IIRC mostly
consistently, where err is the global return and ret is local.

> When btrfs_truncate_inode_items() returns non-zero,
> we set err to the return value, but we don't reset it to zero in the
> successful NEED_TRUNCATE_BLOCK case. We only have err because we don't
> want to mask an error if we call btrfs_update_inode() and
> btrfs_end_transaction(), so let's make that its own scoped return
> variable and use ret everywhere else.

That's an alternative pattern when 'ret' is the global return and ret2
are locals. Either way, it would be good to unify that.

> Fixes: ddfae63cc8e0 ("btrfs: move btrfs_truncate_block out of trans handle")
> Reported-by: Jun Wu 
> Signed-off-by: Omar Sandoval 
> ---
> This is based on Linus' master rather than my orphan ENOSPC fixes
> because I think we want to get this in for v4.17 and stable, and rebase
> my fixes on top of this.

For a 4.17-rc the patch would need to be smaller. You describe the
actual bug as a missing reset of the err, so if it's just that, we can
consider that for 4.17, but the patch as it is changes the error value
propagatin in not so trivial way.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [External] Re: [RFC PATCH v2 10/12] mm/zsmalloc: update usage of address zone modifiers

2018-05-22 Thread Huaisheng HS1 Ye
From: owner-linux...@kvack.org On Behalf Of Matthew Wilcox
> 
> On Mon, May 21, 2018 at 11:20:31PM +0800, Huaisheng Ye wrote:
> > @@ -343,7 +343,7 @@ static void destroy_cache(struct zs_pool *pool)
> >  static unsigned long cache_alloc_handle(struct zs_pool *pool, gfp_t gfp)
> >  {
> > return (unsigned long)kmem_cache_alloc(pool->handle_cachep,
> > -   gfp & ~(__GFP_HIGHMEM|__GFP_MOVABLE));
> > +   gfp & ~__GFP_ZONE_MOVABLE);
> >  }
> 
> This should be & ~GFP_ZONEMASK
> 
> Actually, we should probably have a function to clear those bits rather
> than have every driver manipulating the gfp mask like this.  Maybe
> 
> #define gfp_normal(gfp)   ((gfp) & ~GFP_ZONEMASK)

Good idea!

> 
>   return (unsigned long)kmem_cache_alloc(pool->handle_cachep,
> - gfp & ~(__GFP_HIGHMEM|__GFP_MOVABLE));
> + gfp_normal(gfp));


Sincerely,
Huaisheng Ye
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH v2 10/12] mm/zsmalloc: update usage of address zone modifiers

2018-05-22 Thread Matthew Wilcox
On Mon, May 21, 2018 at 11:20:31PM +0800, Huaisheng Ye wrote:
> @@ -343,7 +343,7 @@ static void destroy_cache(struct zs_pool *pool)
>  static unsigned long cache_alloc_handle(struct zs_pool *pool, gfp_t gfp)
>  {
>   return (unsigned long)kmem_cache_alloc(pool->handle_cachep,
> - gfp & ~(__GFP_HIGHMEM|__GFP_MOVABLE));
> + gfp & ~__GFP_ZONE_MOVABLE);
>  }

This should be & ~GFP_ZONEMASK

Actually, we should probably have a function to clear those bits rather
than have every driver manipulating the gfp mask like this.  Maybe

#define gfp_normal(gfp) ((gfp) & ~GFP_ZONEMASK)

return (unsigned long)kmem_cache_alloc(pool->handle_cachep,
-   gfp & ~(__GFP_HIGHMEM|__GFP_MOVABLE));
+   gfp_normal(gfp));

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] btrfs: Fix to use original error code of btrfs_read_fs_root_no_name()

2018-05-22 Thread David Sterba
On Mon, May 21, 2018 at 01:57:27PM +0900, Misono Tomohiro wrote:
> btrfs_read_fs_root_no_name() may return ERR_PTR(-ENOENT) or
> ERR_PTR(-ENOMEM) and therefore search_ioctl() and
> btrfs_search_path_in_tree() should use PTR_ERR() instead of -ENOENT,
> which all other callers of btrfs_read_fs_root_no_name() does.
> 
> Signed-off-by: Tomohiro Misono 
> ---
>  fs/btrfs/ioctl.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
> index 8f7c1b9a2db7..0f90e68ca512 100644
> --- a/fs/btrfs/ioctl.c
> +++ b/fs/btrfs/ioctl.c
> @@ -2155,7 +2155,7 @@ static noinline int search_ioctl(struct inode *inode,
>   root = btrfs_read_fs_root_no_name(info, );
>   if (IS_ERR(root)) {
>   btrfs_free_path(path);
> - return -ENOENT;
> + return PTR_ERR(root);
>   }
>   }
>  
> @@ -2290,7 +2290,7 @@ static noinline int btrfs_search_path_in_tree(struct 
> btrfs_fs_info *info,
>   root = btrfs_read_fs_root_no_name(info, );
>   if (IS_ERR(root)) {
>   btrfs_err(info, "could not find root %llu", tree_id);

We should probably drop this message too, as there are now 2 errors and
"not found" after ENOMEM does not make sense.

Otherwise ok

Reviewed-by: David Sterba 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] btrfs: remove unnecessary if judge

2018-05-22 Thread David Sterba
On Mon, May 21, 2018 at 05:32:14PM +0800, Gu Jinxiang wrote:
> Since add_excluded_extent always returns 0,
> no need to judge ret.

The whole calltree starting in this function needs to be examined, if
there's a function that can retrun value (like set_extent_bits) or if
there's a BUG_ON that needs to be turned to proper error handling and
pushed upwards eventually.

When dealing with deep call sequences, it's fine and actually
recommended to do one step at a time. See eg. 57599c7e7722daf5 .

Do all necessary error handling in the function (undo changes, free
resources, etc) and return a value. The caller that does not handle the
return value at all will get the BUG_ON and will be converted the same
way in next patch.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 0/6] btrfs_search_slot cleanups

2018-05-22 Thread Su Yue
Hi Liu and David,
During my local xfstests on kdave/for-next, btrfs/139 failed and
btrfs BUG_ON due to qgroup rescan.
The bisect result is commit 560215eb3f32("Merge branch
'ext/liubo/search-cleanups-wip' into for-next-next-v4.18-20180521")
which seems merged this patchset.
The dmesg file is attached.

Thanks,
Su



On 05/18/2018 11:00 AM, Liu Bo wrote:
> Here are a collection of patches I did for btrfs_search_slot().
> 
> v2: more explicit commit log for each patch.
> 
> Liu Bo (6):
>   Btrfs: remove superfluous free_extent_buffer
>   Btrfs: use more straightforward extent_buffer_uptodate
>   Btrfs: move get root of btrfs_search_slot to a helper
>   Btrfs: remove unused check of skip_locking
>   Btrfs: grab write lock directly if write_lock_level is the max level
>   Btrfs: remove always true check in unlock_up
> 
>  fs/btrfs/ctree.c | 121 
> +--
>  1 file changed, 73 insertions(+), 48 deletions(-)
> 


[   46.129166] BTRFS info (device vdb): disk space caching is enabled
[   46.130545] BTRFS info (device vdb): has skinny extents
[   46.171386] mount (2798) used greatest stack depth: 12920 bytes left
[   46.508170] BTRFS: device fsid 83a117c7-a9ea-4bf5-b42f-7092078610d5 devid 1 
transid 5 /dev/vdc
[   46.562428] BTRFS info (device vdc): disk space caching is enabled
[   46.563690] BTRFS info (device vdc): has skinny extents
[   46.564563] BTRFS info (device vdc): flagging fs with big metadata feature
[   46.587441] BTRFS info (device vdc): checking UUID tree
[   46.766765] BTRFS info (device vdb): disk space caching is enabled
[   46.768197] BTRFS info (device vdb): has skinny extents
[   46.875534] run fstests btrfs/139 at 2018-05-22 18:40:36
[   47.559411] BTRFS: device fsid 065f3825-057e-451f-8722-0d94d4a3f36f devid 1 
transid 5 /dev/vdc
[   47.612001] BTRFS info (device vdc): disk space caching is enabled
[   47.613254] BTRFS info (device vdc): has skinny extents
[   47.614147] BTRFS info (device vdc): flagging fs with big metadata feature
[   47.632377] BTRFS info (device vdc): checking UUID tree
[   47.681656] btrfs (3176) used greatest stack depth: 12632 bytes left
[   47.691156] [ cut here ]
[   47.692084] kernel BUG at fs/btrfs/locking.c:286!
[   47.692929] invalid opcode:  [#1] SMP KASAN PTI
[   47.694128] Modules linked in: btrfs xor zstd_decompress zstd_compress 
xxhash raid6_pq efivarfs xfs
[   47.695593] CPU: 3 PID: 1060 Comm: kworker/u8:5 Not tainted 4.17.0-rc6+ #20
[   47.696708] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 
02/06/2015
[   47.697985] Workqueue: btrfs-qgroup-rescan btrfs_qgroup_rescan_helper [btrfs]
[   47.699167] RIP: 0010:btrfs_assert_tree_read_locked+0xe4/0x100 [btrfs]
[   47.700243] RSP: 0018:88014d50ed68 EFLAGS: 00010246
[   47.701094] RAX: dc00 RBX: 110029aa1dad RCX: a0845ac9
[   47.702254] RDX: 110029aa1db1 RSI: 0004 RDI: 88014fe29d18
[   47.703413] RBP: 110029aa1db1 R08: ed0029fc53a4 R09: ed0029fc53a3
[   47.704669] R10: ed0029fc53a3 R11: 88014fe29d1b R12: 
[   47.705905] R13: dc00 R14: 880152dd10c0 R15: dc00
[   47.708169] FS:  () GS:88015b20() 
knlGS:
[   47.709522] CS:  0010 DS:  ES:  CR0: 80050033
[   47.714771] CR2: 5630aef46298 CR3: 04216006 CR4: 003606e0
[   47.717890] DR0:  DR1:  DR2: 
[   47.720258] DR3:  DR6: fffe0ff0 DR7: 0400
[   47.722128] Call Trace:
[   47.723306]  ? btrfs_ioctl+0x7d20/0x7d20 [btrfs]
[   47.725323]  btrfs_set_lock_blocking_rw+0x2d1/0x420 [btrfs]
[   47.726989]  ? btrfs_assert_tree_locked+0x100/0x100 [btrfs]
[   47.728657]  ? __mutex_lock+0xdd8/0x1750
[   47.729962]  ? find_held_lock+0x3a/0x1c0
[   47.731316]  ? btrfs_qgroup_rescan_worker+0x381/0x11d0 [btrfs]
[   47.732884]  ? mutex_trylock+0x240/0x240
[   47.734257]  ? wait_for_completion+0x6a0/0x6a0
[   47.735643]  btrfs_set_path_blocking+0x79/0xe0 [btrfs]
[   47.737306]  btrfs_clear_path_blocking+0x60/0x160 [btrfs]
[   47.738823]  btrfs_search_slot+0x81e/0x23c0 [btrfs]
[   47.740298]  ? btrfs_record_root_in_trans+0xf2/0x140 [btrfs]
[   47.741835]  ? split_leaf+0x13c0/0x13c0 [btrfs]
[   47.743272]  ? btrfs_commit_transaction+0x2a00/0x2a00 [btrfs]
[   47.744821]  ? init_object+0x4e/0x80
[   47.746122]  ? btrfs_qgroup_rescan_worker+0xac/0x11d0 [btrfs]
[   47.747727]  ? __lock_is_held+0xb4/0x140
[   47.749061]  ? btrfs_qgroup_rescan_worker+0xac/0x11d0 [btrfs]
[   47.750654]  btrfs_search_slot_for_read+0x105/0x2e0 [btrfs]
[   47.752218]  btrfs_qgroup_rescan_worker+0x3bb/0x11d0 [btrfs]
[   47.754506]  ? sched_clock_cpu+0x18/0x180
[   47.755819]  ? add_qgroup_rb+0x4a0/0x4a0 [btrfs]
[   47.757335]  ? lock_acquire+0x470/0x470
[   47.758594]  ? do_raw_spin_unlock+0xa6/0x2f0
[   47.759870]  ? 

Re: [PATCH] Btrfs: allow empty subvol= again

2018-05-22 Thread David Sterba
> > This is silly but IMO it's a regression. If you agree, this should
> > probably go to 4.17 + stable. I'll submit an xfstests test shortly.
> 
> Update, we fixed the userspace bug now so I'd be fine if we dropped this
> patch. I guess there's still a possibility that there are other users
> that hit this, so I'll leave it up to you, Dave, whether it's important
> enough to care.

The documentation does not say anything about the empty value of the
path, it's subvol=path, so I'm not sure if empty string is normally
considered a valid path.

I'll add the patch and update documentation, as it's a small issue, but
not really an urgent fix for 4.17. It also has a simple fix, to use / or
"." .
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/2] btrfs: propagate failures of __exclude_logged_extent to upper caller

2018-05-22 Thread Nikolay Borisov


On 22.05.2018 12:46, Gu Jinxiang wrote:
> Function btrfs_exclude_logged_extents may call __exclude_logged_extent
> which may fail.
> Propagate the failures of __exclude_logged_extent to upper caller.
> 
> Signed-off-by: Gu Jinxiang 

Reviewed-by: Nikolay Borisov 

> ---
>  fs/btrfs/extent-tree.c | 9 +++--
>  1 file changed, 7 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index 2e85e99b5e6f..28fd71579141 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -6468,6 +6468,7 @@ int btrfs_exclude_logged_extents(struct btrfs_fs_info 
> *fs_info,
>   struct btrfs_key key;
>   int found_type;
>   int i;
> + int ret = 0;
>  
>   if (!btrfs_fs_incompat(fs_info, MIXED_GROUPS))
>   return 0;
> @@ -6484,10 +6485,14 @@ int btrfs_exclude_logged_extents(struct btrfs_fs_info 
> *fs_info,
>   continue;
>   key.objectid = btrfs_file_extent_disk_bytenr(eb, item);
>   key.offset = btrfs_file_extent_disk_num_bytes(eb, item);
> - __exclude_logged_extent(fs_info, key.objectid, key.offset);
> + ret = __exclude_logged_extent(fs_info, key.objectid,
> + key.offset);
> + if (ret)
> + goto out;
>   }
>  
> - return 0;
> +out:
> + return ret;
>  }
>  
>  static void
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/2] btrfs: handle failures of set_extent_bits in add_excluded_extent

2018-05-22 Thread Nikolay Borisov


On 22.05.2018 12:46, Gu Jinxiang wrote:
> set_extent_bits may return 0/-EEXIST, so return the result in
> add_excluded_extent.
> 
> Signed-off-by: Gu Jinxiang 

Reviewed-by: Nikolay Borisov 

> ---
>  fs/btrfs/extent-tree.c | 11 ---
>  1 file changed, 8 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index 75cfb80d2551..2e85e99b5e6f 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -215,11 +215,16 @@ static int add_excluded_extent(struct btrfs_fs_info 
> *fs_info,
>  u64 start, u64 num_bytes)
>  {
>   u64 end = start + num_bytes - 1;
> - set_extent_bits(_info->freed_extents[0],
> + int ret = 0;
> +
> + ret = set_extent_bits(_info->freed_extents[0],
>   start, end, EXTENT_UPTODATE);
> - set_extent_bits(_info->freed_extents[1],
> + if (ret)
> + goto out;
> + ret = set_extent_bits(_info->freed_extents[1],
>   start, end, EXTENT_UPTODATE);
> - return 0;
> +out:
> + return ret;
>  }
>  
>  static void free_excluded_extents(struct btrfs_fs_info *fs_info,
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2] btrfs: drop uuid_mutex in btrfs_free_extra_devids()

2018-05-22 Thread David Sterba
On Tue, May 22, 2018 at 05:29:29PM +0800, Anand Jain wrote:
> btrfs_free_extra_devids() frees the orphan fsid::devid with in
> btrfs_fs_devices::devices, so we dont need uuid_mutex, instead hold the
> btrfs_fs_devices::device_list_mutex.

This still lacks the reason why, the text merely repeats what the code
change does.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH v2 00/12] get rid of GFP_ZONE_TABLE/BAD

2018-05-22 Thread Huaisheng HS1 Ye
From: owner-linux...@kvack.org On Behalf Of Christoph Hellwig
> This seems to be missing patch 1 and generally be in somewhat odd format.
> Can you try to resend it with git-send-email and against current Linus'
> tree?
> 
Sure, I will rebase them to current mainline ASAP.

> Also I'd suggest you do cleanups like adding and using __GFP_ZONE_MASK
> at the beginning of the series before doing any real changes.

Ok, thanks for your suggestion.

Sincerely,
Huaisheng Ye
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [External] Re: [RFC PATCH v2 02/12] arch/x86/kernel/amd_gart_64: update usage of address zone modifiers

2018-05-22 Thread Huaisheng HS1 Ye
From: owner-linux...@kvack.org On Behalf Of Christoph Hellwig
> 
> This code doesn't exist in current mainline.  What kernel version
> is your patch against?
> 
> On Mon, May 21, 2018 at 11:20:23PM +0800, Huaisheng Ye wrote:
> > From: Huaisheng Ye 
> >
> > Use __GFP_ZONE_MASK to replace (__GFP_DMA | __GFP_HIGHMEM | __GFP_DMA32).
> >
> > ___GFP_DMA, ___GFP_HIGHMEM and ___GFP_DMA32 have been deleted from GFP
> > bitmasks, the bottom three bits of GFP mask is reserved for storing
> > encoded zone number.
> > __GFP_DMA, __GFP_HIGHMEM and __GFP_DMA32 should not be operated by OR.
> 
> If they have already been deleted the identifier should not exist
> anymore, so either your patch has issues, or at least the description.

Dear Christoph,

The kernel version of my patches against is Linux 4.16, the most of
modifications come from include/Linux/gfp.h. I think they should be
pushed to Linux-mm, so I follow the requirement of maintainers to make
patches based on mmotm/master.

I just checked the current mainline, yes,
(__GFP_DMA | __GFP_HIGHMEM | __GFP_DMA32) has been deleted, I can
rebase my patches to mainline, and resend them to mail list.

Sincerely,
Huaisheng Ye
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 01/12] block: convert bounce, q->bio_split to bioset_init()/mempool_init()

2018-05-22 Thread Christoph Hellwig
Looks good,

Reviewed-by: Christoph Hellwig 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 08/12] target: convert to bioset_init()/mempool_init()

2018-05-22 Thread Christoph Hellwig
Looks good,

Reviewed-by: Christoph Hellwig 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 09/12] fs: convert block_dev.c to bioset_init()

2018-05-22 Thread Christoph Hellwig
Looks good,

Reviewed-by: Christoph Hellwig 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 11/12] xfs: convert to bioset_init()/mempool_init()

2018-05-22 Thread Christoph Hellwig
This creates a minor conflict with my iomap series, but that should
be easy to merge if needed.

Otherwise looks good:

Reviewed-by: Christoph Hellwig 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 12/12] block: Drop bioset_create()

2018-05-22 Thread Christoph Hellwig
Looks good,

Reviewed-by: Christoph Hellwig 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 04/12] lightnvm: convert to bioset_init()/mempool_init()

2018-05-22 Thread Javier Gonzalez
> On 21 May 2018, at 00.25, Kent Overstreet  wrote:
> 
> Signed-off-by: Kent Overstreet 
> ---
> drivers/lightnvm/pblk-core.c | 30 ++---
> drivers/lightnvm/pblk-init.c | 72 
> drivers/lightnvm/pblk-read.c |  4 +-
> drivers/lightnvm/pblk-recovery.c |  2 +-
> drivers/lightnvm/pblk-write.c|  8 ++--
> drivers/lightnvm/pblk.h  | 14 +++
> 6 files changed, 65 insertions(+), 65 deletions(-)
> 
> diff --git a/drivers/lightnvm/pblk-core.c b/drivers/lightnvm/pblk-core.c
> index 94d5d97c9d..934341b104 100644
> --- a/drivers/lightnvm/pblk-core.c
> +++ b/drivers/lightnvm/pblk-core.c
> @@ -40,7 +40,7 @@ static void pblk_line_mark_bb(struct work_struct *work)
>   }
> 
>   kfree(ppa);
> - mempool_free(line_ws, pblk->gen_ws_pool);
> + mempool_free(line_ws, >gen_ws_pool);
> }
> 
> static void pblk_mark_bb(struct pblk *pblk, struct pblk_line *line,
> @@ -102,7 +102,7 @@ static void pblk_end_io_erase(struct nvm_rq *rqd)
>   struct pblk *pblk = rqd->private;
> 
>   __pblk_end_io_erase(pblk, rqd);
> - mempool_free(rqd, pblk->e_rq_pool);
> + mempool_free(rqd, >e_rq_pool);
> }
> 
> /*
> @@ -237,15 +237,15 @@ struct nvm_rq *pblk_alloc_rqd(struct pblk *pblk, int 
> type)
>   switch (type) {
>   case PBLK_WRITE:
>   case PBLK_WRITE_INT:
> - pool = pblk->w_rq_pool;
> + pool = >w_rq_pool;
>   rq_size = pblk_w_rq_size;
>   break;
>   case PBLK_READ:
> - pool = pblk->r_rq_pool;
> + pool = >r_rq_pool;
>   rq_size = pblk_g_rq_size;
>   break;
>   default:
> - pool = pblk->e_rq_pool;
> + pool = >e_rq_pool;
>   rq_size = pblk_g_rq_size;
>   }
> 
> @@ -265,13 +265,13 @@ void pblk_free_rqd(struct pblk *pblk, struct nvm_rq 
> *rqd, int type)
>   case PBLK_WRITE:
>   kfree(((struct pblk_c_ctx *)nvm_rq_to_pdu(rqd))->lun_bitmap);
>   case PBLK_WRITE_INT:
> - pool = pblk->w_rq_pool;
> + pool = >w_rq_pool;
>   break;
>   case PBLK_READ:
> - pool = pblk->r_rq_pool;
> + pool = >r_rq_pool;
>   break;
>   case PBLK_ERASE:
> - pool = pblk->e_rq_pool;
> + pool = >e_rq_pool;
>   break;
>   default:
>   pr_err("pblk: trying to free unknown rqd type\n");
> @@ -292,7 +292,7 @@ void pblk_bio_free_pages(struct pblk *pblk, struct bio 
> *bio, int off,
> 
>   for (i = off; i < nr_pages + off; i++) {
>   bv = bio->bi_io_vec[i];
> - mempool_free(bv.bv_page, pblk->page_bio_pool);
> + mempool_free(bv.bv_page, >page_bio_pool);
>   }
> }
> 
> @@ -304,12 +304,12 @@ int pblk_bio_add_pages(struct pblk *pblk, struct bio 
> *bio, gfp_t flags,
>   int i, ret;
> 
>   for (i = 0; i < nr_pages; i++) {
> - page = mempool_alloc(pblk->page_bio_pool, flags);
> + page = mempool_alloc(>page_bio_pool, flags);
> 
>   ret = bio_add_pc_page(q, bio, page, PBLK_EXPOSED_PAGE_SIZE, 0);
>   if (ret != PBLK_EXPOSED_PAGE_SIZE) {
>   pr_err("pblk: could not add page to bio\n");
> - mempool_free(page, pblk->page_bio_pool);
> + mempool_free(page, >page_bio_pool);
>   goto err;
>   }
>   }
> @@ -1593,7 +1593,7 @@ static void pblk_line_put_ws(struct work_struct *work)
>   struct pblk_line *line = line_put_ws->line;
> 
>   __pblk_line_put(pblk, line);
> - mempool_free(line_put_ws, pblk->gen_ws_pool);
> + mempool_free(line_put_ws, >gen_ws_pool);
> }
> 
> void pblk_line_put(struct kref *ref)
> @@ -1610,7 +1610,7 @@ void pblk_line_put_wq(struct kref *ref)
>   struct pblk *pblk = line->pblk;
>   struct pblk_line_ws *line_put_ws;
> 
> - line_put_ws = mempool_alloc(pblk->gen_ws_pool, GFP_ATOMIC);
> + line_put_ws = mempool_alloc(>gen_ws_pool, GFP_ATOMIC);
>   if (!line_put_ws)
>   return;
> 
> @@ -1752,7 +1752,7 @@ void pblk_line_close_ws(struct work_struct *work)
>   struct pblk_line *line = line_ws->line;
> 
>   pblk_line_close(pblk, line);
> - mempool_free(line_ws, pblk->gen_ws_pool);
> + mempool_free(line_ws, >gen_ws_pool);
> }
> 
> void pblk_gen_run_ws(struct pblk *pblk, struct pblk_line *line, void *priv,
> @@ -1761,7 +1761,7 @@ void pblk_gen_run_ws(struct pblk *pblk, struct 
> pblk_line *line, void *priv,
> {
>   struct pblk_line_ws *line_ws;
> 
> - line_ws = mempool_alloc(pblk->gen_ws_pool, gfp_mask);
> + line_ws = mempool_alloc(>gen_ws_pool, gfp_mask);
> 
>   line_ws->pblk = pblk;
>   line_ws->line = line;
> diff --git a/drivers/lightnvm/pblk-init.c b/drivers/lightnvm/pblk-init.c
> index 91a5bc2556..9a984abd3d 100644
> --- a/drivers/lightnvm/pblk-init.c
> +++ 

Re: [RFC PATCH 1/8] btrfs: use iocb for __btrfs_buffered_write

2018-05-22 Thread David Sterba
On Tue, May 22, 2018 at 03:40:18PM +0900, Misono Tomohiro wrote:
> > @@ -1815,7 +1816,6 @@ static ssize_t __btrfs_direct_write(struct kiocb 
> > *iocb, struct iov_iter *from)
> >  {
> > struct file *file = iocb->ki_filp;
> > struct inode *inode = file_inode(file);
> > -   loff_t pos = iocb->ki_pos;
> > ssize_t written;
> > ssize_t written_buffered;
> > loff_t endbyte;
> > @@ -1826,8 +1826,8 @@ static ssize_t __btrfs_direct_write(struct kiocb 
> > *iocb, struct iov_iter *from)
> > if (written < 0 || !iov_iter_count(from))
> > return written;
> >  
> > -   pos += written;
> > -   written_buffered = __btrfs_buffered_write(file, from, pos);
> 
> > +   iocb->ki_pos += written;
> 
> Hi,
> 
> I found btrfs/026 fails on current misc-next branch and
> git bisect points this commit.

Thanks, that's appreciated.

> I noticed generic_file_direct_write() already updates iocb->ki_pos, and 
> therefore
> above "iocb->ki_pos += written" is not needed.
> 
> > +   written_buffered = __btrfs_buffered_write(iocb, from);
> > if (written_buffered < 0) {
> > err = written_buffered;
> > goto out;
> > @@ -1836,16 +1836,16 @@ static ssize_t __btrfs_direct_write(struct kiocb 
> > *iocb, struct iov_iter *from)
> >  * Ensure all data is persisted. We want the next direct IO read to be
> >  * able to read what was just written.
> >  */
> > -   endbyte = pos + written_buffered - 1;
> > -   err = btrfs_fdatawrite_range(inode, pos, endbyte);
> > +   endbyte = iocb->ki_pos + written_buffered - 1;
> > +   err = btrfs_fdatawrite_range(inode, iocb->ki_pos, endbyte);
> > if (err)
> > goto out;
> > -   err = filemap_fdatawait_range(inode->i_mapping, pos, endbyte);
> > +   err = filemap_fdatawait_range(inode->i_mapping, iocb->ki_pos, endbyte);
> > if (err)
> > goto out;
> > +   iocb->ki_pos += written_buffered;
> > written += written_buffered;
> > -   iocb->ki_pos = pos + written_buffered;
> > -   invalidate_mapping_pages(file->f_mapping, pos >> PAGE_SHIFT,
> > +   invalidate_mapping_pages(file->f_mapping, iocb->ki_pos >> PAGE_SHIFT,
> >  endbyte >> PAGE_SHIFT);
> 
> Also, this invalidate_mapping_pages() should be done before updating 
> iocb->ki_pos
> to invalidate buffered written area.

This sounds like the patch needs more review, I treated it as more like
a cleanup so I'll drop it from misc-next and revisit once the whole
iomap patchset will be sent again.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 2/2] btrfs: propagate failures of __exclude_logged_extent to upper caller

2018-05-22 Thread Gu Jinxiang
Function btrfs_exclude_logged_extents may call __exclude_logged_extent
which may fail.
Propagate the failures of __exclude_logged_extent to upper caller.

Signed-off-by: Gu Jinxiang 
---
 fs/btrfs/extent-tree.c | 9 +++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 2e85e99b5e6f..28fd71579141 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -6468,6 +6468,7 @@ int btrfs_exclude_logged_extents(struct btrfs_fs_info 
*fs_info,
struct btrfs_key key;
int found_type;
int i;
+   int ret = 0;
 
if (!btrfs_fs_incompat(fs_info, MIXED_GROUPS))
return 0;
@@ -6484,10 +6485,14 @@ int btrfs_exclude_logged_extents(struct btrfs_fs_info 
*fs_info,
continue;
key.objectid = btrfs_file_extent_disk_bytenr(eb, item);
key.offset = btrfs_file_extent_disk_num_bytes(eb, item);
-   __exclude_logged_extent(fs_info, key.objectid, key.offset);
+   ret = __exclude_logged_extent(fs_info, key.objectid,
+   key.offset);
+   if (ret)
+   goto out;
}
 
-   return 0;
+out:
+   return ret;
 }
 
 static void
-- 
1.9.1



--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 1/2] btrfs: handle failures of set_extent_bits in add_excluded_extent

2018-05-22 Thread Gu Jinxiang
set_extent_bits may return 0/-EEXIST, so return the result in
add_excluded_extent.

Signed-off-by: Gu Jinxiang 
---
 fs/btrfs/extent-tree.c | 11 ---
 1 file changed, 8 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 75cfb80d2551..2e85e99b5e6f 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -215,11 +215,16 @@ static int add_excluded_extent(struct btrfs_fs_info 
*fs_info,
   u64 start, u64 num_bytes)
 {
u64 end = start + num_bytes - 1;
-   set_extent_bits(_info->freed_extents[0],
+   int ret = 0;
+
+   ret = set_extent_bits(_info->freed_extents[0],
start, end, EXTENT_UPTODATE);
-   set_extent_bits(_info->freed_extents[1],
+   if (ret)
+   goto out;
+   ret = set_extent_bits(_info->freed_extents[1],
start, end, EXTENT_UPTODATE);
-   return 0;
+out:
+   return ret;
 }
 
 static void free_excluded_extents(struct btrfs_fs_info *fs_info,
-- 
1.9.1



--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH v2 00/12] get rid of GFP_ZONE_TABLE/BAD

2018-05-22 Thread Christoph Hellwig
This seems to be missing patch 1 and generally be in somewhat odd format.
Can you try to resend it with git-send-email and against current Linus'
tree?

Also I'd suggest you do cleanups like adding and using __GFP_ZONE_MASK
at the beginning of the series before doing any real changes.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH v2 02/12] arch/x86/kernel/amd_gart_64: update usage of address zone modifiers

2018-05-22 Thread Christoph Hellwig
This code doesn't exist in current mainline.  What kernel version
is your patch against?

On Mon, May 21, 2018 at 11:20:23PM +0800, Huaisheng Ye wrote:
> From: Huaisheng Ye 
> 
> Use __GFP_ZONE_MASK to replace (__GFP_DMA | __GFP_HIGHMEM | __GFP_DMA32).
> 
> ___GFP_DMA, ___GFP_HIGHMEM and ___GFP_DMA32 have been deleted from GFP
> bitmasks, the bottom three bits of GFP mask is reserved for storing
> encoded zone number.
> __GFP_DMA, __GFP_HIGHMEM and __GFP_DMA32 should not be operated by OR.

If they have already been deleted the identifier should not exist
anymore, so either your patch has issues, or at least the description.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2] btrfs: drop uuid_mutex in close_fs_devices()

2018-05-22 Thread Anand Jain
close_fs_devices() closes devices of a given fsid, and it is limited
to all the devices of a fsid, so we don't have to hold the global
uuid_mutex, instead we need the device_list_mutex as the device state is
being changed.

Signed-off-by: Anand Jain 
---
v1->v2: Add comment why we don't need uuid_mutex when touching seed.

 fs/btrfs/volumes.c | 14 +-
 1 file changed, 9 insertions(+), 5 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index b1e5ac419ca8..f4afd2e78e92 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -1025,7 +1025,6 @@ static void btrfs_prepare_close_one_device(struct 
btrfs_device *device)
device->uuid);
BUG_ON(IS_ERR(new_device)); /* -ENOMEM */
 
-   /* Safe because we are under uuid_mutex */
if (device->name) {
name = rcu_string_strdup(device->name->str, GFP_NOFS);
BUG_ON(!name); /* -ENOMEM */
@@ -1043,10 +1042,12 @@ static int close_fs_devices(struct btrfs_fs_devices 
*fs_devices)
 
INIT_LIST_HEAD(_put);
 
-   if (--fs_devices->opened > 0)
+   mutex_lock(_devices->device_list_mutex);
+   if (--fs_devices->opened > 0) {
+   mutex_unlock(_devices->device_list_mutex);
return 0;
+   }
 
-   mutex_lock(_devices->device_list_mutex);
list_for_each_entry_safe(device, tmp, _devices->devices, dev_list) {
btrfs_prepare_close_one_device(device);
list_add(>dev_list, _put);
@@ -1080,14 +1081,17 @@ int btrfs_close_devices(struct btrfs_fs_devices 
*fs_devices)
struct btrfs_fs_devices *seed_devices = NULL;
int ret;
 
-   mutex_lock(_mutex);
ret = close_fs_devices(fs_devices);
if (!fs_devices->opened) {
seed_devices = fs_devices->seed;
fs_devices->seed = NULL;
}
-   mutex_unlock(_mutex);
 
+   /*
+* As fs_devices::seed is a cloned (and then opned) local copy,
+* of the actual seed fs_uuids::fs_devices. So we don't need the
+* uuid_mutex.
+*/
while (seed_devices) {
fs_devices = seed_devices;
seed_devices = fs_devices->seed;
-- 
2.7.0

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2] btrfs: drop uuid_mutex in btrfs_free_extra_devids()

2018-05-22 Thread Anand Jain
btrfs_free_extra_devids() frees the orphan fsid::devid with in
btrfs_fs_devices::devices, so we dont need uuid_mutex, instead hold the
btrfs_fs_devices::device_list_mutex.

Signed-off-by: Anand Jain 
---
v1->v2: replace uuid_mutex with device_list_mutex instead of delete.
change log updated.

 fs/btrfs/volumes.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 2b101dc49d91..b1e5ac419ca8 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -925,7 +925,7 @@ void btrfs_free_extra_devids(struct btrfs_fs_devices 
*fs_devices, int step)
struct btrfs_device *device, *next;
struct btrfs_device *latest_dev = NULL;
 
-   mutex_lock(_mutex);
+   mutex_lock(_devices->device_list_mutex);
 again:
/* This is the initialized path, it is safe to release the devices. */
list_for_each_entry_safe(device, next, _devices->devices, dev_list) {
@@ -979,8 +979,7 @@ void btrfs_free_extra_devids(struct btrfs_fs_devices 
*fs_devices, int step)
}
 
fs_devices->latest_bdev = latest_dev->bdev;
-
-   mutex_unlock(_mutex);
+   mutex_unlock(_devices->device_list_mutex);
 }
 
 static void free_device_rcu(struct rcu_head *head)
-- 
2.7.0

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/15] Review uuid_mutex usage

2018-05-22 Thread Anand Jain



On 05/16/2018 12:33 AM, David Sterba wrote:

On Thu, Apr 12, 2018 at 10:29:23AM +0800, Anand Jain wrote:

uuid_mutex lock is not a per-fs lock but a global lock. The main aim of
this patch-set is to critically review the usage of this lock, and delete
the unnecessary once. By doing this we improve the concurrency of
device operations across multiple btrfs filesystems is in the system.

patch 1: Was sent before, I am including it here, as its about uuid_mutex.

patch 2-9: Are cleanup and or preparatory patches.

patch 10-14: Drops the uuid_mutex and makes sure there is enough lock,
as discussed in the patch change log.

patch 15: A generic cleanup patch around functions in the same context.

These patches are on top of
   https://github.com/kdave/btrfs-devel.git remove-volume-mutex
And it will be a good idea to go along with the kill-volume-mutex patches.

This is tested with xfstests and there are no _new_ regression. And I am
trying to understand the old regressions, and notice that they are
inconsistent.

Anand Jain (15):
   btrfs: optimize move uuid_mutex closer to the critical section
   btrfs: rename struct btrfs_fs_devices::list
   btrfs: cleanup __btrfs_open_devices() drop head pointer
   btrfs: rename __btrfs_close_devices to close_fs_devices
   btrfs: rename __btrfs_open_devices to open_fs_devices
   btrfs: cleanup find_device() drop list_head pointer
   btrfs: cleanup btrfs_rm_device() promote fs_devices pointer
   btrfs: cleanup btrfs_rm_device() use cur_devices
   btrfs: uuid_mutex in read_chunk_tree, add a comment
   btrfs: drop uuid_mutex in btrfs_free_extra_devids()
   btrfs: drop uuid_mutex in btrfs_open_devices()
   btrfs: drop uuid_mutex in close_fs_devices()
   btrfs: drop uuid_mutex in btrfs_dev_replace_finishing()
   btrfs: drop uuid_mutex in btrfs_destroy_dev_replace_tgtdev()
   btrfs: cleanup btrfs_destroy_dev_replace_tgtdev() localize
 btrfs_fs_devices


Patches 10 and 12 haven't been merged, the rest is now in misc-next.
Testing hasn't revealed any problems related to the uuid/device locks
but as said before we don't have stress tests.


 Our test cases are sequential. We need same device related test cases
 in a concurrent manner.

 Just for experiment, I ran two instances of xfstests concurrently (on a
 separate set of test and scratch devices), they ran fine with these
 patches, not a perfect solution though. We need to think of better
 ways.

 The main challenge with testing concurrency / racing is how to
 synchronize racing threads.

Thanks, Anand



--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 12/15] btrfs: drop uuid_mutex in close_fs_devices()

2018-05-22 Thread Anand Jain



On 05/16/2018 12:30 AM, David Sterba wrote:

On Thu, Apr 12, 2018 at 10:29:35AM +0800, Anand Jain wrote:

close_fs_devices() closes devices of a given fsid, and it is limited
to all the devices of a fsid, so we don't have to hold the global
uuid_mutex, instead we need the device_list_mutex as the device state is
being changed.

Signed-off-by: Anand Jain 
---
  fs/btrfs/volumes.c | 9 -
  1 file changed, 4 insertions(+), 5 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index dfebf8f29916..4c29214e0c18 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -995,7 +995,6 @@ static void btrfs_prepare_close_one_device(struct 
btrfs_device *device)
device->uuid);
BUG_ON(IS_ERR(new_device)); /* -ENOMEM */
  
-	/* Safe because we are under uuid_mutex */

if (device->name) {
name = rcu_string_strdup(device->name->str, GFP_NOFS);
BUG_ON(!name); /* -ENOMEM */
@@ -1013,10 +1012,12 @@ static int close_fs_devices(struct btrfs_fs_devices 
*fs_devices)
  
  	INIT_LIST_HEAD(_put);
  
-	if (--fs_devices->opened > 0)

+   mutex_lock(_devices->device_list_mutex);
+   if (--fs_devices->opened > 0) {
+   mutex_unlock(_devices->device_list_mutex);
return 0;
+   }
  
-	mutex_lock(_devices->device_list_mutex);

list_for_each_entry_safe(device, tmp, _devices->devices, dev_list) {
btrfs_prepare_close_one_device(device);
list_add(>dev_list, _put);
@@ -1050,13 +1051,11 @@ int btrfs_close_devices(struct btrfs_fs_devices 
*fs_devices)
struct btrfs_fs_devices *seed_devices = NULL;
int ret;
  
-	mutex_lock(_mutex);

ret = close_fs_devices(fs_devices);
if (!fs_devices->opened) {
seed_devices = fs_devices->seed;
fs_devices->seed = NULL;


This still touches ->seed, and also reads ->opened, some locking is
required here or it has to be clearly explained why it's not.


 The ->seed here is %fs_devices local so we don't need the
 global uuid_mutex. We cloned it in btrfs_prepare_sprout().

 Will add comments and send v2.

Thanks, Anand


}
-   mutex_unlock(_mutex);
  
  	while (seed_devices) {

fs_devices = seed_devices;
--
2.7.0

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 10/15] btrfs: drop uuid_mutex in btrfs_free_extra_devids()

2018-05-22 Thread Anand Jain



On 05/16/2018 12:26 AM, David Sterba wrote:

On Thu, Apr 12, 2018 at 10:29:33AM +0800, Anand Jain wrote:

btrfs_free_extra_devids() frees the orphan fsid::devid but its search is
limited to btrfs_fs_devices::devices, so we dont need uuid_mutex.


 From that it's not clear why there's no locking at all now:


@@ -897,7 +897,6 @@ void btrfs_free_extra_devids(struct btrfs_fs_devices 
*fs_devices, int step)
struct btrfs_device *device, *next;
struct btrfs_device *latest_dev = NULL;
  
-	mutex_lock(_mutex);


ie. why is this not replaced by the device_list_lock. Please resend this
patch and explain in the changelog.


 My apologies for the delay.

 btrfs_free_extra_devids() needs device_list_mutex, will send a v2 for
 this.

 Thanks.


  again:
/* This is the initialized path, it is safe to release the devices. */
list_for_each_entry_safe(device, next, _devices->devices, dev_list) {
@@ -951,8 +950,6 @@ void btrfs_free_extra_devids(struct btrfs_fs_devices 
*fs_devices, int step)
}
  
  	fs_devices->latest_bdev = latest_dev->bdev;

-
-   mutex_unlock(_mutex);
  }
  
  static void free_device_rcu(struct rcu_head *head)

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2] btrfs: Use btrfs_mark_bg_unused() to replace open code

2018-05-22 Thread Nikolay Borisov


On 22.05.2018 11:43, Qu Wenruo wrote:
> Introduce a small helper, btrfs_mark_bg_unused(), to accquire needed
> locks and add a block group to unused_bgs list.
> 
> No functional modification, and only 3 callers are involved.>
> Signed-off-by: Qu Wenruo 

Reviewed-by: Nikolay Borisov 

> ---
> This patch should provide the basis for later block group auto-removal
> to get more info (mostly transid) to determine should one block group
> being removed in current trans.
> 
> changelog:
> v2:
>   Add ASSERT() for btrfs_read_block_groups(), as in that call site a
>   block group should not be added to other list (new_bgs or
>   deleted_bgs).
>   Rename the function to btrfs_mark_bg_unused().
>   Both suggested by Nikolay.
> ---
>  fs/btrfs/ctree.h   |  1 +
>  fs/btrfs/extent-tree.c | 36 +---
>  fs/btrfs/scrub.c   |  9 +
>  3 files changed, 19 insertions(+), 27 deletions(-)
> 
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index bbb358143ded..4a34bc443fe4 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -2827,6 +2827,7 @@ void check_system_chunk(struct btrfs_trans_handle 
> *trans,
>   struct btrfs_fs_info *fs_info, const u64 type);
>  u64 add_new_free_space(struct btrfs_block_group_cache *block_group,
>  u64 start, u64 end);
> +void btrfs_mark_bg_unused(struct btrfs_block_group_cache *bg);
>  
>  /* ctree.c */
>  int btrfs_bin_search(struct extent_buffer *eb, const struct btrfs_key *key,
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index ccf2690f7ca1..33f046103073 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -6312,16 +6312,8 @@ static int update_block_group(struct 
> btrfs_trans_handle *trans,
>* dirty list to avoid races between cleaner kthread and space
>* cache writeout.
>*/
> - if (!alloc && old_val == 0) {
> - spin_lock(>unused_bgs_lock);
> - if (list_empty(>bg_list)) {
> - btrfs_get_block_group(cache);
> - trace_btrfs_add_unused_block_group(cache);
> - list_add_tail(>bg_list,
> -   >unused_bgs);
> - }
> - spin_unlock(>unused_bgs_lock);
> - }
> + if (!alloc && old_val == 0)
> + btrfs_mark_bg_unused(cache);
>  
>   btrfs_put_block_group(cache);
>   total -= num_bytes;
> @@ -10144,15 +10136,8 @@ int btrfs_read_block_groups(struct btrfs_fs_info 
> *info)
>   if (btrfs_chunk_readonly(info, cache->key.objectid)) {
>   inc_block_group_ro(cache, 1);
>   } else if (btrfs_block_group_used(>item) == 0) {
> - spin_lock(>unused_bgs_lock);
> - /* Should always be true but just in case. */
> - if (list_empty(>bg_list)) {
> - btrfs_get_block_group(cache);
> - trace_btrfs_add_unused_block_group(cache);
> - list_add_tail(>bg_list,
> -   >unused_bgs);
> - }
> - spin_unlock(>unused_bgs_lock);
> + ASSERT(list_empty(>bg_list));
> + btrfs_mark_bg_unused(cache);
>   }
>   }
>  
> @@ -11071,3 +11056,16 @@ void btrfs_wait_for_snapshot_creation(struct 
> btrfs_root *root)
>  !atomic_read(>will_be_snapshotted));
>   }
>  }
> +
> +void btrfs_mark_bg_unused(struct btrfs_block_group_cache *bg)
> +{
> + struct btrfs_fs_info *fs_info = bg->fs_info;
> +
> + spin_lock(_info->unused_bgs_lock);
> + if (list_empty(>bg_list)) {
> + btrfs_get_block_group(bg);
> + trace_btrfs_add_unused_block_group(bg);
> + list_add_tail(>bg_list, _info->unused_bgs);
> + }
> + spin_unlock(_info->unused_bgs_lock);
> +}
> diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
> index a59005862010..40086b47a65f 100644
> --- a/fs/btrfs/scrub.c
> +++ b/fs/btrfs/scrub.c
> @@ -3981,14 +3981,7 @@ int scrub_enumerate_chunks(struct scrub_ctx *sctx,
>   if (!cache->removed && !cache->ro && cache->reserved == 0 &&
>   btrfs_block_group_used(>item) == 0) {
>   spin_unlock(>lock);
> - spin_lock(_info->unused_bgs_lock);
> - if (list_empty(>bg_list)) {
> - btrfs_get_block_group(cache);
> - trace_btrfs_add_unused_block_group(cache);
> - list_add_tail(>bg_list,
> -   _info->unused_bgs);
> - }
> - 

[PATCH v2] btrfs: Use btrfs_mark_bg_unused() to replace open code

2018-05-22 Thread Qu Wenruo
Introduce a small helper, btrfs_mark_bg_unused(), to accquire needed
locks and add a block group to unused_bgs list.

No functional modification, and only 3 callers are involved.

Signed-off-by: Qu Wenruo 
---
This patch should provide the basis for later block group auto-removal
to get more info (mostly transid) to determine should one block group
being removed in current trans.

changelog:
v2:
  Add ASSERT() for btrfs_read_block_groups(), as in that call site a
  block group should not be added to other list (new_bgs or
  deleted_bgs).
  Rename the function to btrfs_mark_bg_unused().
  Both suggested by Nikolay.
---
 fs/btrfs/ctree.h   |  1 +
 fs/btrfs/extent-tree.c | 36 +---
 fs/btrfs/scrub.c   |  9 +
 3 files changed, 19 insertions(+), 27 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index bbb358143ded..4a34bc443fe4 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -2827,6 +2827,7 @@ void check_system_chunk(struct btrfs_trans_handle *trans,
struct btrfs_fs_info *fs_info, const u64 type);
 u64 add_new_free_space(struct btrfs_block_group_cache *block_group,
   u64 start, u64 end);
+void btrfs_mark_bg_unused(struct btrfs_block_group_cache *bg);
 
 /* ctree.c */
 int btrfs_bin_search(struct extent_buffer *eb, const struct btrfs_key *key,
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index ccf2690f7ca1..33f046103073 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -6312,16 +6312,8 @@ static int update_block_group(struct btrfs_trans_handle 
*trans,
 * dirty list to avoid races between cleaner kthread and space
 * cache writeout.
 */
-   if (!alloc && old_val == 0) {
-   spin_lock(>unused_bgs_lock);
-   if (list_empty(>bg_list)) {
-   btrfs_get_block_group(cache);
-   trace_btrfs_add_unused_block_group(cache);
-   list_add_tail(>bg_list,
- >unused_bgs);
-   }
-   spin_unlock(>unused_bgs_lock);
-   }
+   if (!alloc && old_val == 0)
+   btrfs_mark_bg_unused(cache);
 
btrfs_put_block_group(cache);
total -= num_bytes;
@@ -10144,15 +10136,8 @@ int btrfs_read_block_groups(struct btrfs_fs_info *info)
if (btrfs_chunk_readonly(info, cache->key.objectid)) {
inc_block_group_ro(cache, 1);
} else if (btrfs_block_group_used(>item) == 0) {
-   spin_lock(>unused_bgs_lock);
-   /* Should always be true but just in case. */
-   if (list_empty(>bg_list)) {
-   btrfs_get_block_group(cache);
-   trace_btrfs_add_unused_block_group(cache);
-   list_add_tail(>bg_list,
- >unused_bgs);
-   }
-   spin_unlock(>unused_bgs_lock);
+   ASSERT(list_empty(>bg_list));
+   btrfs_mark_bg_unused(cache);
}
}
 
@@ -11071,3 +11056,16 @@ void btrfs_wait_for_snapshot_creation(struct 
btrfs_root *root)
   !atomic_read(>will_be_snapshotted));
}
 }
+
+void btrfs_mark_bg_unused(struct btrfs_block_group_cache *bg)
+{
+   struct btrfs_fs_info *fs_info = bg->fs_info;
+
+   spin_lock(_info->unused_bgs_lock);
+   if (list_empty(>bg_list)) {
+   btrfs_get_block_group(bg);
+   trace_btrfs_add_unused_block_group(bg);
+   list_add_tail(>bg_list, _info->unused_bgs);
+   }
+   spin_unlock(_info->unused_bgs_lock);
+}
diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index a59005862010..40086b47a65f 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -3981,14 +3981,7 @@ int scrub_enumerate_chunks(struct scrub_ctx *sctx,
if (!cache->removed && !cache->ro && cache->reserved == 0 &&
btrfs_block_group_used(>item) == 0) {
spin_unlock(>lock);
-   spin_lock(_info->unused_bgs_lock);
-   if (list_empty(>bg_list)) {
-   btrfs_get_block_group(cache);
-   trace_btrfs_add_unused_block_group(cache);
-   list_add_tail(>bg_list,
- _info->unused_bgs);
-   }
-   spin_unlock(_info->unused_bgs_lock);
+   btrfs_mark_bg_unused(cache);
} else {
spin_unlock(>lock);
}
-- 
2.17.0

--
To unsubscribe from this list: send the 

Re: [PATCH] btrfs: handle failures of set_extent_bits in add_excluded_extent

2018-05-22 Thread Nikolay Borisov


On 22.05.2018 06:51, Gu Jinxiang wrote:
> set_extent_bits may return 0/-EEXIST, so return the result in
> add_excluded_extent. And handle the failures in upper callers.
> 
> Caller of add_excluded_extent and failure process currently:
> exclude_super_stripes
>   <- btrfs_make_block_group  //handles the failure
>   <- btrfs_read_block_groups //handles the failure

> __exclude_logged_extent
>   <- btrfs_exclude_logged_extents
>   <- btrfs_alloc_logged_file_extent //propagate failure to upper
>   caller
> Add logic of propagate return value of __exclude_logged_extent to
> btrfs_exclude_logged_extents.
> 
> Signed-off-by: Gu Jinxiang 
> ---
>  fs/btrfs/extent-tree.c | 20 +++-
>  1 file changed, 15 insertions(+), 5 deletions(-)
> 
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index 75cfb80d2551..28fd71579141 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -215,11 +215,16 @@ static int add_excluded_extent(struct btrfs_fs_info 
> *fs_info,
>  u64 start, u64 num_bytes)
>  {
>   u64 end = start + num_bytes - 1;
> - set_extent_bits(_info->freed_extents[0],
> + int ret = 0;
> +
> + ret = set_extent_bits(_info->freed_extents[0],
>   start, end, EXTENT_UPTODATE);
> - set_extent_bits(_info->freed_extents[1],
> + if (ret)
> + goto out;
> + ret = set_extent_bits(_info->freed_extents[1],
>   start, end, EXTENT_UPTODATE);
> - return 0;
> +out:
> + return ret;
>  }
>  
>  static void free_excluded_extents(struct btrfs_fs_info *fs_info,
> @@ -6463,6 +6468,7 @@ int btrfs_exclude_logged_extents(struct btrfs_fs_info 
> *fs_info,
>   struct btrfs_key key;
>   int found_type;
>   int i;
> + int ret = 0;
>  
>   if (!btrfs_fs_incompat(fs_info, MIXED_GROUPS))
>   return 0;
> @@ -6479,10 +6485,14 @@ int btrfs_exclude_logged_extents(struct btrfs_fs_info 
> *fs_info,
>   continue;
>   key.objectid = btrfs_file_extent_disk_bytenr(eb, item);
>   key.offset = btrfs_file_extent_disk_num_bytes(eb, item);
> - __exclude_logged_extent(fs_info, key.objectid, key.offset);
> + ret = __exclude_logged_extent(fs_info, key.objectid,
> + key.offset);
> + if (ret)
> + goto out;
>   }
>  
> - return 0;
> +out:
> + return ret;
>  }

The 2nd and 3rd hunks need to go in a separate patch.
__exclude_logged_extent can return an error value for any number of
reasons (btrfs_lookup_block_group failing, btrfs_remove_free_space
failing or add_excluded_extents also failing). SO what you are actually
fixing here is __exclude_logged_extent return value not being handled in
btrfs_exclude_logged_extent which is a separate issue from
add_excluded_extent always returning 0. So put this in a separate patch.

>  
>  static void
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] btrfs: Use btrfs_add_unused_bgs() to replace open code

2018-05-22 Thread Qu Wenruo


On 2018年05月22日 15:49, Nikolay Borisov wrote:
> 
> 
> On 22.05.2018 10:45, Qu Wenruo wrote:
>>
>>
>> On 2018年05月22日 15:37, Nikolay Borisov wrote:
>>>
>>>
>>> On 22.05.2018 10:29, Qu Wenruo wrote:
 Introduce a small helper, btrfs_add_unused_bgs(), to accquire needed
>>>
>>> This function name sounds a bit awkard, mainly because you use the
>>> plural form. How about btrfs_mark_bg_unused() ? The name seems more
>>> unambiguous.
>>
>> Sounds much better.
>>
>>>
 locks and add a block group to unused_bgs list.

 No functional modification, and only 3 callers are involved.

 Signed-off-by: Qu Wenruo 
 ---
 This patch should provide the basis for later block group auto-removal
 to get more info (mostly transid) to determine should one block group
 being removed in current trans.
 ---
  fs/btrfs/ctree.h   |  1 +
  fs/btrfs/extent-tree.c | 35 ---
  fs/btrfs/scrub.c   |  9 +
  3 files changed, 18 insertions(+), 27 deletions(-)

 diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
 index bbb358143ded..701a52034ec6 100644
 --- a/fs/btrfs/ctree.h
 +++ b/fs/btrfs/ctree.h
 @@ -2827,6 +2827,7 @@ void check_system_chunk(struct btrfs_trans_handle 
 *trans,
struct btrfs_fs_info *fs_info, const u64 type);
  u64 add_new_free_space(struct btrfs_block_group_cache *block_group,
   u64 start, u64 end);
 +void btrfs_add_unused_bgs(struct btrfs_block_group_cache *bg);
  
  /* ctree.c */
  int btrfs_bin_search(struct extent_buffer *eb, const struct btrfs_key 
 *key,
 diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
 index ccf2690f7ca1..484c9d11e5b6 100644
 --- a/fs/btrfs/extent-tree.c
 +++ b/fs/btrfs/extent-tree.c
 @@ -6312,16 +6312,8 @@ static int update_block_group(struct 
 btrfs_trans_handle *trans,
 * dirty list to avoid races between cleaner kthread and space
 * cache writeout.
 */
 -  if (!alloc && old_val == 0) {
 -  spin_lock(>unused_bgs_lock);
 -  if (list_empty(>bg_list)) {
 -  btrfs_get_block_group(cache);
 -  trace_btrfs_add_unused_block_group(cache);
 -  list_add_tail(>bg_list,
 ->unused_bgs);
 -  }
 -  spin_unlock(>unused_bgs_lock);
 -  }
 +  if (!alloc && old_val == 0)
 +  btrfs_add_unused_bgs(cache);
  
btrfs_put_block_group(cache);
total -= num_bytes;
 @@ -10144,15 +10136,7 @@ int btrfs_read_block_groups(struct btrfs_fs_info 
 *info)
if (btrfs_chunk_readonly(info, cache->key.objectid)) {
inc_block_group_ro(cache, 1);
} else if (btrfs_block_group_used(>item) == 0) {
 -  spin_lock(>unused_bgs_lock);
 -  /* Should always be true but just in case. */
 -  if (list_empty(>bg_list)) {
 -  btrfs_get_block_group(cache);
 -  trace_btrfs_add_unused_block_group(cache);
 -  list_add_tail(>bg_list,
 ->unused_bgs);
 -  }
 -  spin_unlock(>unused_bgs_lock);
 +  btrfs_add_unused_bgs(cache);
}
}
  
 @@ -11071,3 +11055,16 @@ void btrfs_wait_for_snapshot_creation(struct 
 btrfs_root *root)
   !atomic_read(>will_be_snapshotted));
}
  }
 +
 +void btrfs_add_unused_bgs(struct btrfs_block_group_cache *bg)
 +{
 +  struct btrfs_fs_info *fs_info = bg->fs_info;
 +
 +  spin_lock(_info->unused_bgs_lock);
 +  if (list_empty(>bg_list)) {
>>>
>>> Given the comment in btrfs_read_block_groups:
>>>
>>> /* Should always be true but just in case. */
>>>
>>> How about you make it ASSERT(list_empty(>bg_list));
>>>
>>> /* code to add the bg */
>>>
>>> So right now either :
>>>
>>> a) The comment is bogus and it is indeed required to check if this bg
>>> has already been marked unused.
>>>
>>> or
>>>
>>> b) The comment is correct and it's in fact a bug to try and mark a bg as
>>> unused twice.
>>
>> Not exactly.
>>
>> 1) bg_list is kind of abused.
>>Not only fs_info->unused_bgs, but also transaction->deleted_bgs, and
>>even transaction->new_bgs could use bg_cache->bg_list.
>>So it's not only used to detect unused bgs.
>>And it's possible some bg get moved to deleted_bgs list.
> 
> I haven't looked at the code but if this is indeed the case then doesn't
> it make sense to try and fix this abuse, otherwise don't we risk
> processing a bg in the 

Re: [PATCH] btrfs: Use btrfs_add_unused_bgs() to replace open code

2018-05-22 Thread Nikolay Borisov


On 22.05.2018 10:45, Qu Wenruo wrote:
> 
> 
> On 2018年05月22日 15:37, Nikolay Borisov wrote:
>>
>>
>> On 22.05.2018 10:29, Qu Wenruo wrote:
>>> Introduce a small helper, btrfs_add_unused_bgs(), to accquire needed
>>
>> This function name sounds a bit awkard, mainly because you use the
>> plural form. How about btrfs_mark_bg_unused() ? The name seems more
>> unambiguous.
> 
> Sounds much better.
> 
>>
>>> locks and add a block group to unused_bgs list.
>>>
>>> No functional modification, and only 3 callers are involved.
>>>
>>> Signed-off-by: Qu Wenruo 
>>> ---
>>> This patch should provide the basis for later block group auto-removal
>>> to get more info (mostly transid) to determine should one block group
>>> being removed in current trans.
>>> ---
>>>  fs/btrfs/ctree.h   |  1 +
>>>  fs/btrfs/extent-tree.c | 35 ---
>>>  fs/btrfs/scrub.c   |  9 +
>>>  3 files changed, 18 insertions(+), 27 deletions(-)
>>>
>>> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
>>> index bbb358143ded..701a52034ec6 100644
>>> --- a/fs/btrfs/ctree.h
>>> +++ b/fs/btrfs/ctree.h
>>> @@ -2827,6 +2827,7 @@ void check_system_chunk(struct btrfs_trans_handle 
>>> *trans,
>>> struct btrfs_fs_info *fs_info, const u64 type);
>>>  u64 add_new_free_space(struct btrfs_block_group_cache *block_group,
>>>u64 start, u64 end);
>>> +void btrfs_add_unused_bgs(struct btrfs_block_group_cache *bg);
>>>  
>>>  /* ctree.c */
>>>  int btrfs_bin_search(struct extent_buffer *eb, const struct btrfs_key *key,
>>> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
>>> index ccf2690f7ca1..484c9d11e5b6 100644
>>> --- a/fs/btrfs/extent-tree.c
>>> +++ b/fs/btrfs/extent-tree.c
>>> @@ -6312,16 +6312,8 @@ static int update_block_group(struct 
>>> btrfs_trans_handle *trans,
>>>  * dirty list to avoid races between cleaner kthread and space
>>>  * cache writeout.
>>>  */
>>> -   if (!alloc && old_val == 0) {
>>> -   spin_lock(>unused_bgs_lock);
>>> -   if (list_empty(>bg_list)) {
>>> -   btrfs_get_block_group(cache);
>>> -   trace_btrfs_add_unused_block_group(cache);
>>> -   list_add_tail(>bg_list,
>>> - >unused_bgs);
>>> -   }
>>> -   spin_unlock(>unused_bgs_lock);
>>> -   }
>>> +   if (!alloc && old_val == 0)
>>> +   btrfs_add_unused_bgs(cache);
>>>  
>>> btrfs_put_block_group(cache);
>>> total -= num_bytes;
>>> @@ -10144,15 +10136,7 @@ int btrfs_read_block_groups(struct btrfs_fs_info 
>>> *info)
>>> if (btrfs_chunk_readonly(info, cache->key.objectid)) {
>>> inc_block_group_ro(cache, 1);
>>> } else if (btrfs_block_group_used(>item) == 0) {
>>> -   spin_lock(>unused_bgs_lock);
>>> -   /* Should always be true but just in case. */
>>> -   if (list_empty(>bg_list)) {
>>> -   btrfs_get_block_group(cache);
>>> -   trace_btrfs_add_unused_block_group(cache);
>>> -   list_add_tail(>bg_list,
>>> - >unused_bgs);
>>> -   }
>>> -   spin_unlock(>unused_bgs_lock);
>>> +   btrfs_add_unused_bgs(cache);
>>> }
>>> }
>>>  
>>> @@ -11071,3 +11055,16 @@ void btrfs_wait_for_snapshot_creation(struct 
>>> btrfs_root *root)
>>>!atomic_read(>will_be_snapshotted));
>>> }
>>>  }
>>> +
>>> +void btrfs_add_unused_bgs(struct btrfs_block_group_cache *bg)
>>> +{
>>> +   struct btrfs_fs_info *fs_info = bg->fs_info;
>>> +
>>> +   spin_lock(_info->unused_bgs_lock);
>>> +   if (list_empty(>bg_list)) {
>>
>> Given the comment in btrfs_read_block_groups:
>>
>> /* Should always be true but just in case. */
>>
>> How about you make it ASSERT(list_empty(>bg_list));
>>
>> /* code to add the bg */
>>
>> So right now either :
>>
>> a) The comment is bogus and it is indeed required to check if this bg
>> has already been marked unused.
>>
>> or
>>
>> b) The comment is correct and it's in fact a bug to try and mark a bg as
>> unused twice.
> 
> Not exactly.
> 
> 1) bg_list is kind of abused.
>Not only fs_info->unused_bgs, but also transaction->deleted_bgs, and
>even transaction->new_bgs could use bg_cache->bg_list.
>So it's not only used to detect unused bgs.
>And it's possible some bg get moved to deleted_bgs list.

I haven't looked at the code but if this is indeed the case then doesn't
it make sense to try and fix this abuse, otherwise don't we risk
processing a bg in the wrong context? In other words, shouldn't bgs have
1 list member for every list they could be part of?I guess a single list
member would have made 

Re: [PATCH] btrfs: Use btrfs_add_unused_bgs() to replace open code

2018-05-22 Thread Qu Wenruo


On 2018年05月22日 15:37, Nikolay Borisov wrote:
> 
> 
> On 22.05.2018 10:29, Qu Wenruo wrote:
>> Introduce a small helper, btrfs_add_unused_bgs(), to accquire needed
> 
> This function name sounds a bit awkard, mainly because you use the
> plural form. How about btrfs_mark_bg_unused() ? The name seems more
> unambiguous.

Sounds much better.

> 
>> locks and add a block group to unused_bgs list.
>>
>> No functional modification, and only 3 callers are involved.
>>
>> Signed-off-by: Qu Wenruo 
>> ---
>> This patch should provide the basis for later block group auto-removal
>> to get more info (mostly transid) to determine should one block group
>> being removed in current trans.
>> ---
>>  fs/btrfs/ctree.h   |  1 +
>>  fs/btrfs/extent-tree.c | 35 ---
>>  fs/btrfs/scrub.c   |  9 +
>>  3 files changed, 18 insertions(+), 27 deletions(-)
>>
>> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
>> index bbb358143ded..701a52034ec6 100644
>> --- a/fs/btrfs/ctree.h
>> +++ b/fs/btrfs/ctree.h
>> @@ -2827,6 +2827,7 @@ void check_system_chunk(struct btrfs_trans_handle 
>> *trans,
>>  struct btrfs_fs_info *fs_info, const u64 type);
>>  u64 add_new_free_space(struct btrfs_block_group_cache *block_group,
>> u64 start, u64 end);
>> +void btrfs_add_unused_bgs(struct btrfs_block_group_cache *bg);
>>  
>>  /* ctree.c */
>>  int btrfs_bin_search(struct extent_buffer *eb, const struct btrfs_key *key,
>> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
>> index ccf2690f7ca1..484c9d11e5b6 100644
>> --- a/fs/btrfs/extent-tree.c
>> +++ b/fs/btrfs/extent-tree.c
>> @@ -6312,16 +6312,8 @@ static int update_block_group(struct 
>> btrfs_trans_handle *trans,
>>   * dirty list to avoid races between cleaner kthread and space
>>   * cache writeout.
>>   */
>> -if (!alloc && old_val == 0) {
>> -spin_lock(>unused_bgs_lock);
>> -if (list_empty(>bg_list)) {
>> -btrfs_get_block_group(cache);
>> -trace_btrfs_add_unused_block_group(cache);
>> -list_add_tail(>bg_list,
>> -  >unused_bgs);
>> -}
>> -spin_unlock(>unused_bgs_lock);
>> -}
>> +if (!alloc && old_val == 0)
>> +btrfs_add_unused_bgs(cache);
>>  
>>  btrfs_put_block_group(cache);
>>  total -= num_bytes;
>> @@ -10144,15 +10136,7 @@ int btrfs_read_block_groups(struct btrfs_fs_info 
>> *info)
>>  if (btrfs_chunk_readonly(info, cache->key.objectid)) {
>>  inc_block_group_ro(cache, 1);
>>  } else if (btrfs_block_group_used(>item) == 0) {
>> -spin_lock(>unused_bgs_lock);
>> -/* Should always be true but just in case. */
>> -if (list_empty(>bg_list)) {
>> -btrfs_get_block_group(cache);
>> -trace_btrfs_add_unused_block_group(cache);
>> -list_add_tail(>bg_list,
>> -  >unused_bgs);
>> -}
>> -spin_unlock(>unused_bgs_lock);
>> +btrfs_add_unused_bgs(cache);
>>  }
>>  }
>>  
>> @@ -11071,3 +11055,16 @@ void btrfs_wait_for_snapshot_creation(struct 
>> btrfs_root *root)
>> !atomic_read(>will_be_snapshotted));
>>  }
>>  }
>> +
>> +void btrfs_add_unused_bgs(struct btrfs_block_group_cache *bg)
>> +{
>> +struct btrfs_fs_info *fs_info = bg->fs_info;
>> +
>> +spin_lock(_info->unused_bgs_lock);
>> +if (list_empty(>bg_list)) {
> 
> Given the comment in btrfs_read_block_groups:
> 
> /* Should always be true but just in case. */
> 
> How about you make it ASSERT(list_empty(>bg_list));
> 
> /* code to add the bg */
> 
> So right now either :
> 
> a) The comment is bogus and it is indeed required to check if this bg
> has already been marked unused.
> 
> or
> 
> b) The comment is correct and it's in fact a bug to try and mark a bg as
> unused twice.

Not exactly.

1) bg_list is kind of abused.
   Not only fs_info->unused_bgs, but also transaction->deleted_bgs, and
   even transaction->new_bgs could use bg_cache->bg_list.
   So it's not only used to detect unused bgs.
   And it's possible some bg get moved to deleted_bgs list.

2) That is comment only works for caller in btrfs_read_block_groups().
   As at that timing, there is no race at all since we're still mounting
   the fs.
   But may not work for other callers.

Thus I just kept the code while removed the comment, since in the
extracted function, it may no longer be the case.
(And my focus is later auto-removal generation check, so I just left
code as is)

Thanks,
Qu

> 
>> +

Re: [PATCH] btrfs: Use btrfs_add_unused_bgs() to replace open code

2018-05-22 Thread Nikolay Borisov


On 22.05.2018 10:29, Qu Wenruo wrote:
> Introduce a small helper, btrfs_add_unused_bgs(), to accquire needed

This function name sounds a bit awkard, mainly because you use the
plural form. How about btrfs_mark_bg_unused() ? The name seems more
unambiguous.

> locks and add a block group to unused_bgs list.
> 
> No functional modification, and only 3 callers are involved.
> 
> Signed-off-by: Qu Wenruo 
> ---
> This patch should provide the basis for later block group auto-removal
> to get more info (mostly transid) to determine should one block group
> being removed in current trans.
> ---
>  fs/btrfs/ctree.h   |  1 +
>  fs/btrfs/extent-tree.c | 35 ---
>  fs/btrfs/scrub.c   |  9 +
>  3 files changed, 18 insertions(+), 27 deletions(-)
> 
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index bbb358143ded..701a52034ec6 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -2827,6 +2827,7 @@ void check_system_chunk(struct btrfs_trans_handle 
> *trans,
>   struct btrfs_fs_info *fs_info, const u64 type);
>  u64 add_new_free_space(struct btrfs_block_group_cache *block_group,
>  u64 start, u64 end);
> +void btrfs_add_unused_bgs(struct btrfs_block_group_cache *bg);
>  
>  /* ctree.c */
>  int btrfs_bin_search(struct extent_buffer *eb, const struct btrfs_key *key,
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index ccf2690f7ca1..484c9d11e5b6 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -6312,16 +6312,8 @@ static int update_block_group(struct 
> btrfs_trans_handle *trans,
>* dirty list to avoid races between cleaner kthread and space
>* cache writeout.
>*/
> - if (!alloc && old_val == 0) {
> - spin_lock(>unused_bgs_lock);
> - if (list_empty(>bg_list)) {
> - btrfs_get_block_group(cache);
> - trace_btrfs_add_unused_block_group(cache);
> - list_add_tail(>bg_list,
> -   >unused_bgs);
> - }
> - spin_unlock(>unused_bgs_lock);
> - }
> + if (!alloc && old_val == 0)
> + btrfs_add_unused_bgs(cache);
>  
>   btrfs_put_block_group(cache);
>   total -= num_bytes;
> @@ -10144,15 +10136,7 @@ int btrfs_read_block_groups(struct btrfs_fs_info 
> *info)
>   if (btrfs_chunk_readonly(info, cache->key.objectid)) {
>   inc_block_group_ro(cache, 1);
>   } else if (btrfs_block_group_used(>item) == 0) {
> - spin_lock(>unused_bgs_lock);
> - /* Should always be true but just in case. */
> - if (list_empty(>bg_list)) {
> - btrfs_get_block_group(cache);
> - trace_btrfs_add_unused_block_group(cache);
> - list_add_tail(>bg_list,
> -   >unused_bgs);
> - }
> - spin_unlock(>unused_bgs_lock);
> + btrfs_add_unused_bgs(cache);
>   }
>   }
>  
> @@ -11071,3 +11055,16 @@ void btrfs_wait_for_snapshot_creation(struct 
> btrfs_root *root)
>  !atomic_read(>will_be_snapshotted));
>   }
>  }
> +
> +void btrfs_add_unused_bgs(struct btrfs_block_group_cache *bg)
> +{
> + struct btrfs_fs_info *fs_info = bg->fs_info;
> +
> + spin_lock(_info->unused_bgs_lock);
> + if (list_empty(>bg_list)) {

Given the comment in btrfs_read_block_groups:

/* Should always be true but just in case. */

How about you make it ASSERT(list_empty(>bg_list));

/* code to add the bg */

So right now either :

a) The comment is bogus and it is indeed required to check if this bg
has already been marked unused.

or

b) The comment is correct and it's in fact a bug to try and mark a bg as
unused twice.

> + btrfs_get_block_group(bg);
> + trace_btrfs_add_unused_block_group(bg);
> + list_add_tail(>bg_list, _info->unused_bgs);
> + }
> + spin_unlock(_info->unused_bgs_lock);
> +}
> diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
> index a59005862010..1044ab2fc71c 100644
> --- a/fs/btrfs/scrub.c
> +++ b/fs/btrfs/scrub.c
> @@ -3981,14 +3981,7 @@ int scrub_enumerate_chunks(struct scrub_ctx *sctx,
>   if (!cache->removed && !cache->ro && cache->reserved == 0 &&
>   btrfs_block_group_used(>item) == 0) {
>   spin_unlock(>lock);
> - spin_lock(_info->unused_bgs_lock);
> - if (list_empty(>bg_list)) {
> - btrfs_get_block_group(cache);
> - trace_btrfs_add_unused_block_group(cache);
> -   

[PATCH] btrfs: Use btrfs_add_unused_bgs() to replace open code

2018-05-22 Thread Qu Wenruo
Introduce a small helper, btrfs_add_unused_bgs(), to accquire needed
locks and add a block group to unused_bgs list.

No functional modification, and only 3 callers are involved.

Signed-off-by: Qu Wenruo 
---
This patch should provide the basis for later block group auto-removal
to get more info (mostly transid) to determine should one block group
being removed in current trans.
---
 fs/btrfs/ctree.h   |  1 +
 fs/btrfs/extent-tree.c | 35 ---
 fs/btrfs/scrub.c   |  9 +
 3 files changed, 18 insertions(+), 27 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index bbb358143ded..701a52034ec6 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -2827,6 +2827,7 @@ void check_system_chunk(struct btrfs_trans_handle *trans,
struct btrfs_fs_info *fs_info, const u64 type);
 u64 add_new_free_space(struct btrfs_block_group_cache *block_group,
   u64 start, u64 end);
+void btrfs_add_unused_bgs(struct btrfs_block_group_cache *bg);
 
 /* ctree.c */
 int btrfs_bin_search(struct extent_buffer *eb, const struct btrfs_key *key,
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index ccf2690f7ca1..484c9d11e5b6 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -6312,16 +6312,8 @@ static int update_block_group(struct btrfs_trans_handle 
*trans,
 * dirty list to avoid races between cleaner kthread and space
 * cache writeout.
 */
-   if (!alloc && old_val == 0) {
-   spin_lock(>unused_bgs_lock);
-   if (list_empty(>bg_list)) {
-   btrfs_get_block_group(cache);
-   trace_btrfs_add_unused_block_group(cache);
-   list_add_tail(>bg_list,
- >unused_bgs);
-   }
-   spin_unlock(>unused_bgs_lock);
-   }
+   if (!alloc && old_val == 0)
+   btrfs_add_unused_bgs(cache);
 
btrfs_put_block_group(cache);
total -= num_bytes;
@@ -10144,15 +10136,7 @@ int btrfs_read_block_groups(struct btrfs_fs_info *info)
if (btrfs_chunk_readonly(info, cache->key.objectid)) {
inc_block_group_ro(cache, 1);
} else if (btrfs_block_group_used(>item) == 0) {
-   spin_lock(>unused_bgs_lock);
-   /* Should always be true but just in case. */
-   if (list_empty(>bg_list)) {
-   btrfs_get_block_group(cache);
-   trace_btrfs_add_unused_block_group(cache);
-   list_add_tail(>bg_list,
- >unused_bgs);
-   }
-   spin_unlock(>unused_bgs_lock);
+   btrfs_add_unused_bgs(cache);
}
}
 
@@ -11071,3 +11055,16 @@ void btrfs_wait_for_snapshot_creation(struct 
btrfs_root *root)
   !atomic_read(>will_be_snapshotted));
}
 }
+
+void btrfs_add_unused_bgs(struct btrfs_block_group_cache *bg)
+{
+   struct btrfs_fs_info *fs_info = bg->fs_info;
+
+   spin_lock(_info->unused_bgs_lock);
+   if (list_empty(>bg_list)) {
+   btrfs_get_block_group(bg);
+   trace_btrfs_add_unused_block_group(bg);
+   list_add_tail(>bg_list, _info->unused_bgs);
+   }
+   spin_unlock(_info->unused_bgs_lock);
+}
diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index a59005862010..1044ab2fc71c 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -3981,14 +3981,7 @@ int scrub_enumerate_chunks(struct scrub_ctx *sctx,
if (!cache->removed && !cache->ro && cache->reserved == 0 &&
btrfs_block_group_used(>item) == 0) {
spin_unlock(>lock);
-   spin_lock(_info->unused_bgs_lock);
-   if (list_empty(>bg_list)) {
-   btrfs_get_block_group(cache);
-   trace_btrfs_add_unused_block_group(cache);
-   list_add_tail(>bg_list,
- _info->unused_bgs);
-   }
-   spin_unlock(_info->unused_bgs_lock);
+   btrfs_add_unused_bgs(cache);
} else {
spin_unlock(>lock);
}
-- 
2.17.0

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 5/5] generic: test invalid swap file activation

2018-05-22 Thread Eryu Guan
On Fri, May 18, 2018 at 07:37:07AM -0700, Darrick J. Wong wrote:
> On Wed, May 16, 2018 at 01:38:49PM -0700, Omar Sandoval wrote:
> > From: Omar Sandoval 
> > 
> > Swap files cannot have holes, and they must at least two pages.
> > swapon(8) and mkswap(8) have stricter restrictions, so add versions of
> > those commands without any restrictions.
> > 
> > Signed-off-by: Omar Sandoval 
> > ---
> >  .gitignore|  2 ++
> >  src/Makefile  |  2 +-
> >  src/mkswap.c  | 83 +++
> >  src/swapon.c  | 24 +
> >  tests/generic/490 | 77 +++
> >  tests/generic/490.out |  5 +++
> >  tests/generic/group   |  1 +
> >  7 files changed, 193 insertions(+), 1 deletion(-)
> >  create mode 100644 src/mkswap.c
> >  create mode 100644 src/swapon.c
> >  create mode 100755 tests/generic/490
> >  create mode 100644 tests/generic/490.out
> > 
> > diff --git a/.gitignore b/.gitignore
> > index 53029e24..efc73a7c 100644
> > --- a/.gitignore
> > +++ b/.gitignore
> > @@ -92,6 +92,7 @@
> >  /src/lstat64
> >  /src/makeextents
> >  /src/metaperf
> > +/src/mkswap
> >  /src/mmapcat
> >  /src/multi_open_unlink
> >  /src/nametest
> > @@ -111,6 +112,7 @@
> >  /src/seek_sanity_test
> >  /src/stale_handle
> >  /src/stat_test
> > +/src/swapon
> >  /src/t_access_root
> >  /src/t_dir_offset
> >  /src/t_dir_offset2
> > diff --git a/src/Makefile b/src/Makefile
> > index c42d3bb1..01fe99ef 100644
> > --- a/src/Makefile
> > +++ b/src/Makefile
> > @@ -26,7 +26,7 @@ LINUX_TARGETS = xfsctl bstat t_mtab getdevicesize 
> > preallo_rw_pattern_reader \
> > renameat2 t_getcwd e4compact test-nextquota punch-alternating \
> > attr-list-by-handle-cursor-test listxattr dio-interleaved t_dir_type \
> > dio-invalidate-cache stat_test t_encrypted_d_revalidate \
> > -   attr_replace_test
> > +   attr_replace_test swapon mkswap
> >  
> >  SUBDIRS = log-writes perf
> >  
> > diff --git a/src/mkswap.c b/src/mkswap.c
> > new file mode 100644
> > index ..d0bce2bd
> > --- /dev/null
> > +++ b/src/mkswap.c
> > @@ -0,0 +1,83 @@
> > +/* mkswap(8) without any sanity checks */
> > +
> > +#include 
> > +#include 
> > +#include 
> > +#include 
> > +#include 
> > +#include 
> > +#include 
> > +
> > +struct swap_header {
> > +   charbootbits[1024];
> > +   uint32_tversion;
> > +   uint32_tlast_page;
> > +   uint32_tnr_badpages;
> > +   unsigned char   sws_uuid[16];
> > +   unsigned char   sws_volume[16];
> > +   uint32_tpadding[117];
> > +   uint32_tbadpages[1];
> > +};
> > +
> > +int main(int argc, char **argv)
> > +{
> > +   struct swap_header *hdr;
> > +   FILE *file;
> > +   struct stat st;
> > +   long page_size;
> > +   int ret;
> > +
> > +   if (argc != 2) {
> > +   fprintf(stderr, "usage: %s PATH\n", argv[0]);
> > +   return EXIT_FAILURE;
> > +   }
> > +
> > +   page_size = sysconf(_SC_PAGESIZE);
> > +   if (page_size == -1) {
> > +   perror("sysconf");
> > +   return EXIT_FAILURE;
> > +   }
> > +
> > +   hdr = calloc(1, page_size);
> > +   if (!hdr) {
> > +   perror("calloc");
> > +   return EXIT_FAILURE;
> > +   }
> > +
> > +   file = fopen(argv[1], "r+");
> > +   if (!file) {
> > +   perror("fopen");
> > +   free(hdr);
> > +   return EXIT_FAILURE;
> > +   }
> > +
> > +   ret = fstat(fileno(file), );
> > +   if (ret) {
> > +   perror("fstat");
> > +   free(hdr);
> > +   fclose(file);
> > +   return EXIT_FAILURE;
> > +   }
> > +
> > +   hdr->version = 1;
> > +   hdr->last_page = st.st_size / page_size - 1;
> > +   memset(>sws_uuid, 0x99, sizeof(hdr->sws_uuid));
> > +   memcpy((char *)hdr + page_size - 10, "SWAPSPACE2", 10);
> > +
> > +   if (fwrite(hdr, page_size, 1, file) != 1) {
> > +   perror("fwrite");
> > +   free(hdr);
> > +   fclose(file);
> > +   return EXIT_FAILURE;
> > +   }
> > +
> > +   if (fclose(file) == EOF) {
> > +   perror("fwrite");
> > +   free(hdr);
> > +   return EXIT_FAILURE;
> > +   }
> > +
> > +   free(hdr);
> > +
> > +   return EXIT_SUCCESS;
> > +}
> > diff --git a/src/swapon.c b/src/swapon.c
> > new file mode 100644
> > index ..0cb7108a
> > --- /dev/null
> > +++ b/src/swapon.c
> > @@ -0,0 +1,24 @@
> > +/* swapon(8) without any sanity checks; simply calls swapon(2) directly. */
> > +
> > +#include 
> > +#include 
> > +#include 
> > +#include 
> > +
> > +int main(int argc, char **argv)
> > +{
> > +   int ret;
> > +
> > +   if (argc != 2) {
> > +   fprintf(stderr, "usage: %s PATH\n", argv[0]);
> > +   return EXIT_FAILURE;
> > +   }
> > +
> > +   ret = swapon(argv[1], 0);
> > +   if (ret) {
> > +   perror("swapon");
> > +   return EXIT_FAILURE;
> > +   }
> > +
> > +   return EXIT_SUCCESS;
> > +}
> > diff --git a/tests/generic/490 

Re: [PATCH v2 3/5] generic: add test for dedupe on an active swapfile

2018-05-22 Thread Eryu Guan
On Wed, May 16, 2018 at 01:38:47PM -0700, Omar Sandoval wrote:
> From: Omar Sandoval 
> 
> Similar to generic/356 that makes sure we can't reflink an active
  ^^^ dedupe
I'll fix it on commit.

Thanks,
Eryu
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 2/5] generic: enable swapfile tests on Btrfs

2018-05-22 Thread Eryu Guan
On Wed, May 16, 2018 at 01:38:46PM -0700, Omar Sandoval wrote:
> From: Omar Sandoval 
> 
> Commit 8c96cfbfe530 ("generic/35[67]: disable swapfile tests on Btrfs")
> disabled the swapfile tests on Btrfs because it did not support
> swapfiles at the time. Now that we're adding support, we want these

So currently btrfs has no swapfile support yet, and tests still _notrun
with current v4.17-rc5 kernel? Just want to make sure that's expected.

> tests to run, but they don't. _require_scratch_swapfile always fails for
> Btrfs because swapfiles on Btrfs must be set to nocow. After fixing
> that, generic/356 and generic/357 fail for the same reason. After fixing
> _that_, both tests still fail because we don't allow reflinking a
> non-checksummed extent (which nocow implies) to a checksummed extent.
> Add a helper for formatting a swap file which does the chattr, and
> chattr the second file, which gets these tests running on kernels
> supporting Btrfs swapfiles.
> 
> Signed-off-by: Omar Sandoval 
> ---
>  common/rc | 15 ---
>  tests/generic/356 |  7 ---
>  tests/generic/357 |  6 +++---
>  3 files changed, 19 insertions(+), 9 deletions(-)
> 
> diff --git a/common/rc b/common/rc
> index ffe53236..814b8b5c 100644
> --- a/common/rc
> +++ b/common/rc
> @@ -,6 +,17 @@ _require_odirect()
>   rm -f $testfile 2>&1 > /dev/null
>  }
>  
> +_format_swapfile() {
> + local fname="$1"
> + local sz="$2"
> +
> + touch "$fname"

Better to remove $fname first, just in case it's an existing file with
some blocks allocated, as chattr(1) says

"Note: For btrfs, the 'C' flag should be set on new or empty files.  If
it is set on a file which already has data blocks, it is undefined when
the blocks assigned to the file will be fully stable"

> + chmod 0600 "$fname"
> + $CHATTR_PROG +C "$fname" > /dev/null 2>&1

It'd be good to have some comments on this chattr +C.

> + _pwrite_byte 0x61 0 "$sz" "$fname" >> $seqres.full
> + mkswap -U 27376b42-ff65-42ca-919f-6c9b62292a5c "$fname" >> $seqres.full

Is the label really needed?

Thanks,
Eryu

> +}
> +
>  # Check that the filesystem supports swapfiles
>  _require_scratch_swapfile()
>  {
> @@ -2231,10 +2242,8 @@ _require_scratch_swapfile()
>   _scratch_mount
>  
>   # Minimum size for mkswap is 10 pages
> - local size=$(($(get_page_size) * 10))
> + _format_swapfile "$SCRATCH_MNT/swap" $(($(get_page_size) * 10))
>  
> - _pwrite_byte 0x61 0 "$size" "$SCRATCH_MNT/swap" >/dev/null 2>&1
> - mkswap "$SCRATCH_MNT/swap" >/dev/null 2>&1
>   if ! swapon "$SCRATCH_MNT/swap" >/dev/null 2>&1; then
>   _scratch_unmount
>   _notrun "swapfiles are not supported"
> diff --git a/tests/generic/356 b/tests/generic/356
> index 51eeb652..b4a38f84 100755
> --- a/tests/generic/356
> +++ b/tests/generic/356
> @@ -59,11 +59,12 @@ blocks=160
>  blksz=65536
>  
>  echo "Initialize file"
> -echo >> $seqres.full
> -_pwrite_byte 0x61 0 $((blocks * blksz)) $testdir/file1 >> $seqres.full
> -mkswap -U 27376b42-ff65-42ca-919f-6c9b62292a5c $testdir/file1 >> $seqres.full
> +_format_swapfile "$testdir/file1" $((blocks * blksz))
>  swapon $testdir/file1
>  
> +touch "$testdir/file2"
> +$CHATTR_PROG +C "$testdir/file2" >/dev/null 2>&1
> +
>  echo "Try to reflink"
>  _cp_reflink $testdir/file1 $testdir/file2 2>&1 | _filter_scratch
>  
> diff --git a/tests/generic/357 b/tests/generic/357
> index 0dd0c10f..9a83a283 100755
> --- a/tests/generic/357
> +++ b/tests/generic/357
> @@ -59,9 +59,9 @@ blocks=160
>  blksz=65536
>  
>  echo "Initialize file"
> -echo >> $seqres.full
> -_pwrite_byte 0x61 0 $((blocks * blksz)) $testdir/file1 >> $seqres.full
> -mkswap -U 27376b42-ff65-42ca-919f-6c9b62292a5c $testdir/file1 >> $seqres.full
> +_format_swapfile "$testdir/file1" $((blocks * blksz))
> +touch "$testdir/file2"
> +$CHATTR_PROG +C "$testdir/file2" >/dev/null 2>&1
>  _cp_reflink $testdir/file1 $testdir/file2 2>&1 | _filter_scratch
>  
>  echo "Try to swapon"
> -- 
> 2.17.0
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 00/13] convert block layer to bioset_init()/mempool_init()

2018-05-22 Thread Christoph Hellwig
On Mon, May 21, 2018 at 07:38:55PM -0400, Kent Overstreet wrote:
> On Mon, May 21, 2018 at 02:24:32PM -0400, Mike Snitzer wrote:
> > Every single data structure change in this series should be reviewed for
> > unforeseen alignment consequences.  Jens seemed to say that is
> > worthwhile.  Not sure if he'll do it or we divide it up.  If we divide
> > it up a temp topic branch should be published for others to inspect.
> > 
> > Could be alignment hasn't been a historic concern for a bunch of the
> > data structures changed in this series.. if so then all we can do is fix
> > up any obvious potential for false sharing.
> 
> Honestly, I almost never worry about alignment... the very few times I do 
> care,
> I use __cacheline_aligned_in_smp.

And that generally is the right stratgey.  If Mike really doesn't want
a change we can just open code the kmalloc for the bio set there, the
important point is that we should not keep the old API around for no
good reason.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH 1/8] btrfs: use iocb for __btrfs_buffered_write

2018-05-22 Thread Misono Tomohiro
On 2017/11/18 2:44, Goldwyn Rodrigues wrote:
> From: Goldwyn Rodrigues 
> 
> Preparatory patch. It reduces the arguments to __btrfs_buffered_write
> to follow buffered_write() style.
> 
> Signed-off-by: Goldwyn Rodrigues 
> 
> ---
>  fs/btrfs/file.c | 24 
>  1 file changed, 12 insertions(+), 12 deletions(-)
> 
> diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
> index aafcc785f840..9bceb0e61361 100644
> --- a/fs/btrfs/file.c
> +++ b/fs/btrfs/file.c
> @@ -1572,10 +1572,11 @@ static noinline int check_can_nocow(struct 
> btrfs_inode *inode, loff_t pos,
>   return ret;
>  }
>  
> -static noinline ssize_t __btrfs_buffered_write(struct file *file,
> -struct iov_iter *i,
> -loff_t pos)
> +static noinline ssize_t __btrfs_buffered_write(struct kiocb *iocb,
> +struct iov_iter *i)
>  {
> + struct file *file = iocb->ki_filp;
> + loff_t pos = iocb->ki_pos;
>   struct inode *inode = file_inode(file);
>   struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
>   struct btrfs_root *root = BTRFS_I(inode)->root;
> @@ -1815,7 +1816,6 @@ static ssize_t __btrfs_direct_write(struct kiocb *iocb, 
> struct iov_iter *from)
>  {
>   struct file *file = iocb->ki_filp;
>   struct inode *inode = file_inode(file);
> - loff_t pos = iocb->ki_pos;
>   ssize_t written;
>   ssize_t written_buffered;
>   loff_t endbyte;
> @@ -1826,8 +1826,8 @@ static ssize_t __btrfs_direct_write(struct kiocb *iocb, 
> struct iov_iter *from)
>   if (written < 0 || !iov_iter_count(from))
>   return written;
>  
> - pos += written;
> - written_buffered = __btrfs_buffered_write(file, from, pos);

> + iocb->ki_pos += written;

Hi,

I found btrfs/026 fails on current misc-next branch and
git bisect points this commit.

I noticed generic_file_direct_write() already updates iocb->ki_pos, and 
therefore
above "iocb->ki_pos += written" is not needed.

> + written_buffered = __btrfs_buffered_write(iocb, from);
>   if (written_buffered < 0) {
>   err = written_buffered;
>   goto out;
> @@ -1836,16 +1836,16 @@ static ssize_t __btrfs_direct_write(struct kiocb 
> *iocb, struct iov_iter *from)
>* Ensure all data is persisted. We want the next direct IO read to be
>* able to read what was just written.
>*/
> - endbyte = pos + written_buffered - 1;
> - err = btrfs_fdatawrite_range(inode, pos, endbyte);
> + endbyte = iocb->ki_pos + written_buffered - 1;
> + err = btrfs_fdatawrite_range(inode, iocb->ki_pos, endbyte);
>   if (err)
>   goto out;
> - err = filemap_fdatawait_range(inode->i_mapping, pos, endbyte);
> + err = filemap_fdatawait_range(inode->i_mapping, iocb->ki_pos, endbyte);
>   if (err)
>   goto out;
> + iocb->ki_pos += written_buffered;
>   written += written_buffered;
> - iocb->ki_pos = pos + written_buffered;
> - invalidate_mapping_pages(file->f_mapping, pos >> PAGE_SHIFT,
> + invalidate_mapping_pages(file->f_mapping, iocb->ki_pos >> PAGE_SHIFT,
>endbyte >> PAGE_SHIFT);

Also, this invalidate_mapping_pages() should be done before updating 
iocb->ki_pos
to invalidate buffered written area.

Thanks,
Tomohiro Misono

>  out:
>   return written ? written : err;
> @@ -1964,7 +1964,7 @@ static ssize_t btrfs_file_write_iter(struct kiocb *iocb,
>   if (iocb->ki_flags & IOCB_DIRECT) {
>   num_written = __btrfs_direct_write(iocb, from);
>   } else {
> - num_written = __btrfs_buffered_write(file, from, pos);
> + num_written = __btrfs_buffered_write(iocb, from);
>   if (num_written > 0)
>   iocb->ki_pos = pos + num_written;
>   if (clean_page)
> 

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html