[PATCH] fstests: introduce btrfs/volume group

2018-05-28 Thread Anand Jain
The btrfs/volume group represent a set of btrfs test-cases, which
shall intend to verify the relevant btrfs volume operations.

Under this new group all the existing btrfs/replace group would
come under, and also the device operations test cases which does
not have any group as of now. This group is helpful to verify
the btrfs volume related changes.

Run as
  ./check -g btrfs/volume

Signed-off-by: Anand Jain 
---
 tests/btrfs/group | 40 
 1 file changed, 20 insertions(+), 20 deletions(-)

diff --git a/tests/btrfs/group b/tests/btrfs/group
index d3639ba8dbfe..03de3d11cfa1 100644
--- a/tests/btrfs/group
+++ b/tests/btrfs/group
@@ -5,15 +5,15 @@
 #
 001 auto quick subvol snapshot
 002 auto snapshot
-003 auto replace
+003 auto replace volume
 004 auto rw metadata
 005 auto defrag
-006 auto quick
+006 auto quick volume
 007 auto quick rw metadata send
 008 auto quick send
 009 auto quick subvol
 010 auto
-011 auto replace
+011 auto replace volume
 012 auto convert
 013 auto quick balance
 014 auto balance
@@ -22,14 +22,14 @@
 017 auto quick qgroup
 018 auto quick subvol
 019 auto quick send
-020 auto quick replace
+020 auto quick replace volume
 021 auto quick balance defrag
 022 auto qgroup limit
 023 auto
 024 auto quick compress
 025 auto quick send clone
 026 auto quick compress prealloc
-027 auto replace
+027 auto replace volume
 028 auto qgroup balance
 029 auto quick clone
 030 auto quick send
@@ -66,14 +66,14 @@
 061 auto balance scrub
 062 auto balance defrag compress
 063 auto balance remount compress
-064 auto balance replace
-065 auto subvol replace
+064 auto balance replace volume
+065 auto subvol replace volume
 066 auto subvol scrub
 067 auto subvol defrag compress
 068 auto subvol remount compress
-069 auto replace scrub
-070 auto replace defrag compress
-071 auto replace remount compress
+069 auto replace scrub volume
+070 auto replace defrag compress volume
+071 auto replace remount compress volume
 072 auto scrub defrag compress
 073 auto scrub remount compress
 074 auto defrag remount compress
@@ -102,8 +102,8 @@
 097 auto quick send clone
 098 auto quick metadata clone
 099 auto quick qgroup limit
-100 auto replace
-101 auto replace
+100 auto replace volume
+101 auto replace volume
 102 auto quick metadata enospc
 103 auto quick clone compress
 104 auto qgroup
@@ -126,8 +126,8 @@
 121 auto quick snapshot qgroup
 122 auto quick snapshot qgroup
 123 auto quick qgroup
-124 auto replace
-125 auto replace
+124 auto replace volume
+125 auto replace volume
 126 auto quick qgroup limit
 127 auto quick send
 128 auto quick send
@@ -153,17 +153,17 @@
 148 auto quick rw
 149 auto quick send compress
 150 auto quick dangerous
-151 auto quick
+151 auto quick volume
 152 auto quick metadata qgroup send
 153 auto quick qgroup limit
-154 auto quick
+154 auto quick volume
 155 auto quick send
 156 auto quick trim
 157 auto quick raid
 158 auto quick raid scrub
 159 auto quick
 160 auto quick
-161 auto quick
-162 auto quick
-163 auto quick
-164 auto quick
+161 auto quick volume
+162 auto quick volume
+163 auto quick volume
+164 auto quick volume
-- 
2.15.0

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2nd try v4] btrfs: drop uuid_mutex in btrfs_free_extra_devids()

2018-05-28 Thread Anand Jain




On 05/29/2018 06:57 AM, Anand Jain wrote:



On 05/28/2018 11:40 PM, David Sterba wrote:

On Mon, May 28, 2018 at 10:43:29PM +0800, Anand Jain wrote:

btrfs_free_extra_devids() is called only in the mount context which
traverses through the fs_devices::devices and frees the orphan devices
devices in the given %fs_devices if any. As the search for the orphan
device is limited to fs_devices::devices so we don't need the global
uuid_mutex.

There can't be any mount-point based ioctl threads in this context as
the mount thread is not yet returned. But there can be the btrfs-control
based scan ioctls thread which calls device_list_add().

Here in the mount thread the fs_devices::opened is incremented way 
before
btrfs_free_extra_devids() is called and in the scan context the 
fs_devices

which are already opened neither be freed or alloc-able at
device_list_add().

But lets say you change the device-path and call the scan again, then 
scan
would update the new device path and this operation could race 
against the

btrfs_free_extra_devids() thread, which might be in the process of
free-ing the same device. So synchronize it by using the
device_list_mutex.

This scenario is a very corner case, and practically the scan and mount
are anyway serialized by the usage so unless the race is instrumented 
its

very difficult to achieve.

Signed-off-by: Anand Jain 
---
(I didn't see this email in the mailing list, so trying again).
v3->v4: As we traverse through the seed device, fs_device gets 
updated with

the child seed fs_devices, so make sure we use the same fs_devices
pointer for the mutex_unlock as used for the mutex_lock.


Well, now that I see the change, shouldn't we always hold the
device_list_mutex of the fs_devices that's being processed? Ie. each
time it's switched, the previous is unlocked and new one locked.


  No David. That's because we organize seed device under its parent
  fs_devices ((fs_devices::seed)::seed)..so on, and they are a local
  cloned copy of the original seed fs_devices. So parent's
  fs_devices::device_list_mutex lock will suffice.


 On the 2nd thought, though theoretically you are right. But practically
 there isn't any use case which can benefit by using the intricate
 locking as you suggested above.

 I am following the following method:-
 By holding the parent fs_devices (which is also the mounted fs_devices
 lock) it would imply to lock its dependent cloned seed fs_devices, as
 to reach the cloned seed device, first we need to traverse from the
 parent fs_devices.

Thanks, Anand



Thanks, Anand


--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 00/11] Summer argument cleanup

2018-05-28 Thread Lu Fengqi
On Mon, May 28, 2018 at 09:36:39AM +0300, Nikolay Borisov wrote:
>Hello, 
>
>Here is yet another series removing a bunch of extraneous argument. The series 
>constitutes no functional changes. Some of the function actually have a kernel 
>counterpart (btrfs_lookup_extent_info/clean_tree_block). However the former's 
>userspace signature differs and the patch in this series makes it identical to 
>the kernel one. 

This series looks good to me.

Reviewed-by: Lu Fengqi 

-- 
Thanks,
Lu

>
>Nikolay Borisov (11):
>  btrfs-progs: check: Remove root argument from delete_extent_records
>  btrfs-progs: check: Remove root parameter from
>btrfs_fix_block_accounting
>  btrfs-progs: check: Remove root parameter from del_pending_extents
>  btrfs-progs: check: Remove root argument from finish_current_insert
>  btrfs-progs: check: Make update_pinned_extents take btrfs_fs_info
>  btrfs-progs: Remove unused argument from clean_tree_block
>  btrfs-progs: check: Remove unused root argument from
>btrfs_extent_post_op
>  btrfs-progs: Change btrfs_root to btrfs_fs_info argument in
>btrfs_lookup_extent_info
>  btrfs-progs: Remove root argument from btrfs_set_block_flags
>  btrfs-progs: Remove root argument from write_one_cache_group
>  btrfs-progs: Remove fs_info argument from write_ctree_super
>
> check/main.c|  31 +++---
> check/mode-lowmem.c |   6 +--
> ctree.c |  17 
> ctree.h |  13 +++---
> disk-io.c   |   9 ++--
> disk-io.h   |   6 +--
> extent-tree.c   | 121 
> free-space-tree.c   |   2 +-
> transaction.c   |   2 +-
> 9 files changed, 96 insertions(+), 111 deletions(-)
>
>-- 
>2.7.4
>
>--
>To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
>the body of a message to majord...@vger.kernel.org
>More majordomo info at  http://vger.kernel.org/majordomo-info.html
>
>


--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/3] btrfs: Use btrfs_mark_bg_unused() to replace open code

2018-05-28 Thread Nikolay Borisov



On 28.05.2018 12:20, Qu Wenruo wrote:
> Introduce a small helper, btrfs_mark_bg_unused(), to accquire needed
> locks and add a block group to unused_bgs list.
> 
> No functional modification, and only 3 callers are involved.
> 
> Signed-off-by: Qu Wenruo 

Reviewed-by: Nikolay Borisov 

> ---
>  fs/btrfs/ctree.h   |  1 +
>  fs/btrfs/extent-tree.c | 36 +---
>  fs/btrfs/scrub.c   |  9 +
>  3 files changed, 19 insertions(+), 27 deletions(-)
> 
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index 2771cc56a622..b19b673485fd 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -2829,6 +2829,7 @@ void check_system_chunk(struct btrfs_trans_handle 
> *trans,
>   struct btrfs_fs_info *fs_info, const u64 type);
>  u64 add_new_free_space(struct btrfs_block_group_cache *block_group,
>  struct btrfs_fs_info *info, u64 start, u64 end);
> +void btrfs_mark_bg_unused(struct btrfs_block_group_cache *bg);
>  
>  /* ctree.c */
>  int btrfs_bin_search(struct extent_buffer *eb, const struct btrfs_key *key,
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index 8241de81089a..f0d7e19feca7 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -6350,16 +6350,8 @@ static int update_block_group(struct 
> btrfs_trans_handle *trans,
>* dirty list to avoid races between cleaner kthread and space
>* cache writeout.
>*/
> - if (!alloc && old_val == 0) {
> - spin_lock(>unused_bgs_lock);
> - if (list_empty(>bg_list)) {
> - btrfs_get_block_group(cache);
> - trace_btrfs_add_unused_block_group(cache);
> - list_add_tail(>bg_list,
> -   >unused_bgs);
> - }
> - spin_unlock(>unused_bgs_lock);
> - }
> + if (!alloc && old_val == 0)
> + btrfs_mark_bg_unused(cache);
>  
>   btrfs_put_block_group(cache);
>   total -= num_bytes;
> @@ -10201,15 +10193,8 @@ int btrfs_read_block_groups(struct btrfs_fs_info 
> *info)
>   if (btrfs_chunk_readonly(info, cache->key.objectid)) {
>   inc_block_group_ro(cache, 1);
>   } else if (btrfs_block_group_used(>item) == 0) {
> - spin_lock(>unused_bgs_lock);
> - /* Should always be true but just in case. */
> - if (list_empty(>bg_list)) {
> - btrfs_get_block_group(cache);
> - trace_btrfs_add_unused_block_group(cache);
> - list_add_tail(>bg_list,
> -   >unused_bgs);
> - }
> - spin_unlock(>unused_bgs_lock);
> + ASSERT(list_empty(>bg_list));
> + btrfs_mark_bg_unused(cache);
>   }
>   }
>  
> @@ -11133,3 +8,16 @@ void btrfs_wait_for_snapshot_creation(struct 
> btrfs_root *root)
>  !atomic_read(>will_be_snapshotted));
>   }
>  }
> +
> +void btrfs_mark_bg_unused(struct btrfs_block_group_cache *bg)
> +{
> + struct btrfs_fs_info *fs_info = bg->fs_info;
> +
> + spin_lock(_info->unused_bgs_lock);
> + if (list_empty(>bg_list)) {
> + btrfs_get_block_group(bg);
> + trace_btrfs_add_unused_block_group(bg);
> + list_add_tail(>bg_list, _info->unused_bgs);
> + }
> + spin_unlock(_info->unused_bgs_lock);
> +}
> diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
> index a59005862010..40086b47a65f 100644
> --- a/fs/btrfs/scrub.c
> +++ b/fs/btrfs/scrub.c
> @@ -3981,14 +3981,7 @@ int scrub_enumerate_chunks(struct scrub_ctx *sctx,
>   if (!cache->removed && !cache->ro && cache->reserved == 0 &&
>   btrfs_block_group_used(>item) == 0) {
>   spin_unlock(>lock);
> - spin_lock(_info->unused_bgs_lock);
> - if (list_empty(>bg_list)) {
> - btrfs_get_block_group(cache);
> - trace_btrfs_add_unused_block_group(cache);
> - list_add_tail(>bg_list,
> -   _info->unused_bgs);
> - }
> - spin_unlock(_info->unused_bgs_lock);
> + btrfs_mark_bg_unused(cache);
>   } else {
>   spin_unlock(>lock);
>   }
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2nd try v4] btrfs: drop uuid_mutex in btrfs_free_extra_devids()

2018-05-28 Thread Anand Jain




On 05/28/2018 11:40 PM, David Sterba wrote:

On Mon, May 28, 2018 at 10:43:29PM +0800, Anand Jain wrote:

btrfs_free_extra_devids() is called only in the mount context which
traverses through the fs_devices::devices and frees the orphan devices
devices in the given %fs_devices if any. As the search for the orphan
device is limited to fs_devices::devices so we don't need the global
uuid_mutex.

There can't be any mount-point based ioctl threads in this context as
the mount thread is not yet returned. But there can be the btrfs-control
based scan ioctls thread which calls device_list_add().

Here in the mount thread the fs_devices::opened is incremented way before
btrfs_free_extra_devids() is called and in the scan context the fs_devices
which are already opened neither be freed or alloc-able at
device_list_add().

But lets say you change the device-path and call the scan again, then scan
would update the new device path and this operation could race against the
btrfs_free_extra_devids() thread, which might be in the process of
free-ing the same device. So synchronize it by using the
device_list_mutex.

This scenario is a very corner case, and practically the scan and mount
are anyway serialized by the usage so unless the race is instrumented its
very difficult to achieve.

Signed-off-by: Anand Jain 
---
(I didn't see this email in the mailing list, so trying again).
v3->v4: As we traverse through the seed device, fs_device gets updated with
the child seed fs_devices, so make sure we use the same fs_devices
pointer for the mutex_unlock as used for the mutex_lock.


Well, now that I see the change, shouldn't we always hold the
device_list_mutex of the fs_devices that's being processed? Ie. each
time it's switched, the previous is unlocked and new one locked.


 No David. That's because we organize seed device under its parent
 fs_devices ((fs_devices::seed)::seed)..so on, and they are a local
 cloned copy of the original seed fs_devices. So parent's
 fs_devices::device_list_mutex lock will suffice.

Thanks, Anand


--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: csum failed root raveled during balance

2018-05-28 Thread ein
On 05/23/2018 01:03 PM, Austin S. Hemmelgarn wrote:
> On 2018-05-23 06:09, ein wrote:
>> On 05/23/2018 11:09 AM, Duncan wrote:
>>> ein posted on Wed, 23 May 2018 10:03:52 +0200 as excerpted:
>>>
> IMHO the best course of action would be to disable checksumming for
> you
> vm files.

 Do you mean '-o nodatasum' mount flag? Is it possible to disable
 checksumming for singe file by setting some magical chattr? Google
 thinks it's not possible to disable csums for a single file.
>>>
>>> You can use nocow (-C), but of course that has other restrictions (like
>>> setting it on the files when they're zero-length, easiest done for
>>> existing data by setting it on the containing dir and copying files (no
>>> reflink) in) as well as the nocow effects, and nocow becomes cow1
>>> after a
>>> snapshot (which locks the existing copy in place so changes written to a
>>> block /must/ be written elsewhere, thus the cow1, aka cow the first time
>>> written after the snapshot but retain the nocow for repeated writes
>>> between snapshots).
>>>
>>> But if you're disabling checksumming anyway, nocow's likely the way
>>> to go.
>>
>> Disabling checksumming only may be a way to go - we live without it
>> every day. But nocow @ VM files defeats whole purpose of using BTRFS for
>> me, even with huge performance penalty - backup reasons - I mean few
>> snapshots (20-30), send & receive.
>>
> Setting NOCOW on a file doesn't prevent it from being snapshotted, it
> just prevents COW operations from happening under most normal
> circumstances.  In essence, it prevents COW from happening except for
> writing right after the snapshot.  More specifically, the first write to
> a given block in a file set for NOCOW after taking a snapshot will
> trigger a _single_ COW operation for _only_ that block (unless you have
> autodefrag enabled too), after which that block will revert to not doing
> COW operations on write.  This way, you still get consistent and working
> snapshots, but you also don't take the performance hits from COW except
> right after taking a snapshot.

Yeah, just after I've post it, I've found some Duncan post from 2015,
explaining it, thank you anyway.

Is there anything we can do better in random/write VM workload to speed
the BTRFS up and why?

My settings:


  
  
  
  [...]


/dev/mapper/raid10-images on /var/lib/libvirt type btrfs
(rw,noatime,nodiratime,compress=lzo:3,ssd,space_cache,autodefrag,subvolid=5,subvol=/)

md1 : active raid10 sdc1[2] sdb2[1] sdd1[3] sda2[0]
  468596736 blocks super 1.2 512K chunks 2 near-copies [4/4] []
  bitmap: 0/4 pages [0KB], 65536KB chunk

CPU: E3-1246 with: VT-x, VT-d, HT, EPT, TSX-NI, AES-NI on debian's
kernel 4.15.0-0.bpo.2-amd64

As far as I understand compress and autodefrag are impacting negatively
for performance (latency), especially autodefrag. I think also that
nodatacow shall also speed things up and it's a must when using qemu and
BTRFS. Is it better to use virtio or virt-scsi with TRIM support?

-- 
PGP Public Key (RSA/4096b):
ID: 0xF2C6EA10
SHA-1: 51DA 40EE 832A 0572 5AD8 B3C0 7AFF 69E1 F2C6 EA10
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/2] btrfs: kill btrfs_write_inode

2018-05-28 Thread David Sterba
On Tue, May 22, 2018 at 01:47:22PM -0400, Josef Bacik wrote:
> From: Josef Bacik 
> 
> We don't actually need this.  It used to be in place for O_SYNC writes,
> but we've used the normal fsync() path for that for years now.  The
> other case we hit this is through sync(), which will commit the
> transaction anyway.  All this does is make us commit the transaction a
> bunch for no reason, and it could deadlock with delayed iput's.

In what way does it deadlock with delayed iput?
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: csum failed root raveled during balance

2018-05-28 Thread ein
On 05/27/2018 11:41 AM, Nikolay Borisov wrote:
> 
> 
> On 27.05.2018 08:50, Andrei Borzenkov wrote:
>> 23.05.2018 09:32, Nikolay Borisov пишет:
>>>
>>>
>>> On 22.05.2018 23:05, ein wrote:
 Hello devs,

 I tested BTRFS in production for about a month:

 21:08:17 up 34 days,  2:21,  3 users,  load average: 0.06, 0.02, 0.00

 Without power blackout, hardware failure, SSD's SMART is flawless etc.
 The tests ended with:

 root@node0:~# dmesg | grep BTRFS | grep warn
 185:980:[2927472.393557] BTRFS warning (device dm-0): csum failed root
 -9 ino 312 off 608284672 csum 0x7d03a376 expected csum 0x3163a9b7 mirror 1
 186:981:[2927472.394158] BTRFS warning (device dm-0): csum failed root
 -9 ino 312 off 608284672 csum 0x7da1b152 expected csum 0x3163a9b7 mirror 1
 191:986:[2928224.169814] BTRFS warning (device dm-0): csum failed root
 -9 ino 314 off 608284672 csum 0x7d03a376 expected csum 0x3163a9b7 mirror 1
 192:987:[2928224.171433] BTRFS warning (device dm-0): csum failed root
 -9 ino 314 off 608284672 csum 0x7da1b152 expected csum 0x3163a9b7 mirror 1
 206:1001:[2928298.039516] BTRFS warning (device dm-0): csum failed root
 -9 ino 319 off 608284672 csum 0x7d03a376 expected csum 0x3163a9b7 mirror 1
 207:1002:[2928298.043103] BTRFS warning (device dm-0): csum failed root
 -9 ino 319 off 608284672 csum 0x7d03a376 expected csum 0x3163a9b7 mirror 1
 208:1004:[2932213.513424] BTRFS warning (device dm-0): csum failed root
 5 ino 219962 off 4564959232 csum 0xc616afb4 expected csum 0x5425e489
 mirror 1
 209:1005:[2932235.666368] BTRFS warning (device dm-0): csum failed root
 5 ino 219962 off 16989835264 csum 0xd63ed5da expected csum 0x7429caa1
 mirror 1
 210:1072:[2936767.229277] BTRFS warning (device dm-0): csum failed root
 5 ino 219915 off 82318458880 csum 0x83614341 expected csum 0x0b8706f8
 mirror 1
 211:1073:[2936767.276229] BTRFS warning (device dm-0): csum failed root
 5 ino 219915 off 82318458880 csum 0x83614341 expected csum 0x0b8706f8
 mirror 1

 Above has been revealed during below command and quite high IO usage by
 few VMs (Linux on top Ext4 with firebird database, lots of random
 read/writes, two others with Windows 2016 and Windows Update in the
 background):
>>>
>>> I believe you are hitting the issue described here:
>>>
>>> https://www.mail-archive.com/linux-btrfs@vger.kernel.org/msg25656.html
>>>
>>> Essentially the way qemu operates on vm images atop btrfs is prone to
>>> producing such errors. As a matter of fact, other filesystems also
>>> suffer from this(i.e pages modified while being written, however due to
>>> lack of CRC on the data they don't detect it). Can you confirm that
>>> those inodes (312/314/319/219962/219915) belong to vm images files?
>>>
>>> IMHO the best course of action would be to disable checksumming for you
>>> vm files.
>>>
>>>
>>> For some background I suggest you read the following LWN articles:
>>>
>>> https://lwn.net/Articles/486311/
>>> https://lwn.net/Articles/442355/
>>>
>>
>> Hmm ... according to these articles, "pages under writeback are marked
>> as not being writable; any process attempting to write to such a page
>> will block until the writeback completes". And it says this feature is
>> available since 3.0 and btrfs has it. So how comes it still happens?
>> Were stable patches removed since then?
> 
> If you are using buffered writes, then yes you won't have the problem.
> However qemu by default bypasses host's page cache and instead uses DIO:
> 
> https://btrfs.wiki.kernel.org/index.php/Gotchas#Direct_IO_and_CRCs

I can confirm that writing data to the filesystem on guest side is not
buffered at host with config:


  
  
  
  [...]


Because buff/cache memory usage stays unchanged at host during high
sequential writing and there's no kworker/flush process committing the
data. How qemu can avoid dirty page buffering? There's nothing else
than:ppoll, read, io_sumbit and write in strace:

read(52, "\1\0\0\0\0\0\0\0", 512)   = 8
io_submit(0x7f35367f7000, 2, [{pwritev, fildes=19,
iovec=[{iov_base="\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,
iov_len=368640}, {iov_base="\0\0\0\0\0\0
\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,
iov_len=368640},
{iov_base="\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,
iov_len=679
\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., iov_len=368640},
{iov_base="\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,
iov_len=679936}, {iov_bas
\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., iov_len=368640},
{iov_base="\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,
iov_len=679936}, {iov_base="\0\0\0\0\0\
1048576},
{iov_base="\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,
iov_len=1048576},

Re: [PATCH v3 1/2] btrfs: qgroup: Search commit root for rescan to avoid missing extent

2018-05-28 Thread David Sterba
On Wed, May 23, 2018 at 10:32:13AM +0300, Nikolay Borisov wrote:
> 
> 
> On 23.05.2018 10:29, Qu Wenruo wrote:
> > When doing qgroup rescan using the following script (modified from
> > btrfs/017 test case), we can sometimes hit qgroup corruption.
> > 
> > --
> > umount $dev &> /dev/null
> > umount $mnt &> /dev/null
> > 
> > mkfs.btrfs -f -n 64k $dev
> > mount $dev $mnt
> > 
> > extent_size=8192
> > 
> > xfs_io -f -d -c "pwrite 0 $extent_size" $mnt/foo > /dev/null
> > btrfs subvolume snapshot $mnt $mnt/snap
> > 
> > xfs_io -f -c "reflink $mnt/foo" $mnt/foo-reflink > /dev/null
> > xfs_io -f -c "reflink $mnt/foo" $mnt/snap/foo-reflink > /dev/null
> > xfs_io -f -c "reflink $mnt/foo" $mnt/snap/foo-reflink2 > /dev/unll
> > btrfs quota enable $mnt
> > 
> >  # -W is the new option to only wait rescan while not starting new one
> > btrfs quota rescan -W $mnt
> > btrfs qgroup show -prce $mnt
> > umount $mnt
> > 
> >  # Need to patch btrfs-progs to report qgroup mismatch as error
> > btrfs check $dev || _fail
> > --
> > 
> > For fast machine, we can hit some corruption which missed accounting
> > tree blocks:
> > --
> > qgroupid rfer excl max_rfer max_excl parent  child
> >      --  -
> > 0/5   8.00KiB0.00B none none --- ---
> > 0/257 8.00KiB0.00B none none --- ---
> > --
> > 
> > This is due to the fact that we're always searching commit root for
> > btrfs_find_all_roots() at qgroup_rescan_leaf(), but the leaf we get is
> > from current transaction, not commit root.
> > 
> > And if our tree blocks get modified in current transaction, we won't
> > find any owner in commit root, thus causing the corruption.
> > 
> > Fix it by searching commit root for extent tree for
> > qgroup_rescan_leaf().
> > 
> > Reported-by: Nikolay Borisov 
> > Signed-off-by: Qu Wenruo 
> 
> Reviewed-by: Nikolay Borisov 

V3 replaced in misc-next, thanks.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 0/6] btrfs_search_slot cleanups

2018-05-28 Thread David Sterba
On Mon, May 28, 2018 at 04:40:28PM +0200, David Sterba wrote:
> On Fri, May 18, 2018 at 11:00:18AM +0800, Liu Bo wrote:
> > Here are a collection of patches I did for btrfs_search_slot().
> > 
> > v2: more explicit commit log for each patch.
> > 
> > Liu Bo (6):
> >   Btrfs: remove superfluous free_extent_buffer
> >   Btrfs: use more straightforward extent_buffer_uptodate
> >   Btrfs: move get root of btrfs_search_slot to a helper
> >   Btrfs: remove unused check of skip_locking
> >   Btrfs: grab write lock directly if write_lock_level is the max level
> >   Btrfs: remove always true check in unlock_up
> 
> 1,2,4,5,6 merged. Please update patch 3 and add the ASSERT. Thanks.

Sorry, it's the 4/6 that's not merged and I'm awaiting an updated
versino. Thanks.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v4] btrfs: drop uuid_mutex in btrfs_free_extra_devids()

2018-05-28 Thread Anand Jain
btrfs_free_extra_devids() is called only in the mount context which
traverses through the fs_devices::devices and frees the orphan devices
devices in the given %fs_devices if any. As the search for the orphan
device is limited to fs_devices::devices so we don't need the global
uuid_mutex.

There can't be any mount-point based ioctl threads in this context as
the mount thread is not yet returned. But there can be the btrfs-control
based scan ioctls thread which calls device_list_add().

Here in the mount thread the fs_devices::opened is incremented way before
btrfs_free_extra_devids() is called and in the scan context the fs_devices
which are already opened neither be freed or alloc-able at
device_list_add().

But lets say you change the device-path and call the scan again, then scan
would update the new device path and this operation could race against the
btrfs_free_extra_devids() thread, which might be in the process of
free-ing the same device. So synchronize it by using the
device_list_mutex.

This scenario is a very corner case, and practically the scan and mount
are anyway serialized by the usage so unless the race is instrumented its
very difficult to achieve.

Signed-off-by: Anand Jain 
---
v3->v4: As we traverse through the seed device, fs_device gets updated with
the child seed fs_devices, so make sure we use the top most
fs_devices pointer.
v2->v3: Update change log.
(Currently device_list_add() is very lean on its device_list_mutex 
usage,
a cleanup and fix is wip. Given the practicality of the above race
condition this patch is good to merge).
v1->v2: replace uuid_mutex with device_list_mutex instead of delete.
change log updated.
 fs/btrfs/volumes.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index b6757b53c297..f03719221fca 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -924,8 +924,9 @@ void btrfs_free_extra_devids(struct btrfs_fs_devices 
*fs_devices, int step)
 {
struct btrfs_device *device, *next;
struct btrfs_device *latest_dev = NULL;
+   struct btrfs_fs_devices *parent_fs_devices = fs_devices;
 
-   mutex_lock(_mutex);
+   mutex_lock(_fs_devices->device_list_mutex);
 again:
/* This is the initialized path, it is safe to release the devices. */
list_for_each_entry_safe(device, next, _devices->devices, dev_list) {
@@ -979,8 +980,7 @@ void btrfs_free_extra_devids(struct btrfs_fs_devices 
*fs_devices, int step)
}
 
fs_devices->latest_bdev = latest_dev->bdev;
-
-   mutex_unlock(_mutex);
+   mutex_unlock(_fs_devices->device_list_mutex);
 }
 
 static void free_device_rcu(struct rcu_head *head)
-- 
2.7.0

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 00/11] Summer argument cleanup

2018-05-28 Thread Nikolay Borisov
Hello, 

Here is yet another series removing a bunch of extraneous argument. The series 
constitutes no functional changes. Some of the function actually have a kernel 
counterpart (btrfs_lookup_extent_info/clean_tree_block). However the former's 
userspace signature differs and the patch in this series makes it identical to 
the kernel one. 

Nikolay Borisov (11):
  btrfs-progs: check: Remove root argument from delete_extent_records
  btrfs-progs: check: Remove root parameter from
btrfs_fix_block_accounting
  btrfs-progs: check: Remove root parameter from del_pending_extents
  btrfs-progs: check: Remove root argument from finish_current_insert
  btrfs-progs: check: Make update_pinned_extents take btrfs_fs_info
  btrfs-progs: Remove unused argument from clean_tree_block
  btrfs-progs: check: Remove unused root argument from
btrfs_extent_post_op
  btrfs-progs: Change btrfs_root to btrfs_fs_info argument in
btrfs_lookup_extent_info
  btrfs-progs: Remove root argument from btrfs_set_block_flags
  btrfs-progs: Remove root argument from write_one_cache_group
  btrfs-progs: Remove fs_info argument from write_ctree_super

 check/main.c|  31 +++---
 check/mode-lowmem.c |   6 +--
 ctree.c |  17 
 ctree.h |  13 +++---
 disk-io.c   |   9 ++--
 disk-io.h   |   6 +--
 extent-tree.c   | 121 
 free-space-tree.c   |   2 +-
 transaction.c   |   2 +-
 9 files changed, 96 insertions(+), 111 deletions(-)

-- 
2.7.4

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 11/11] btrfs-progs: Remove fs_info argument from write_ctree_super

2018-05-28 Thread Nikolay Borisov
This function already takes a transaction handle which has a reference
to the fs_info, so use that to obtain it.

Signed-off-by: Nikolay Borisov 
---
 disk-io.c | 6 +++---
 disk-io.h | 3 +--
 transaction.c | 2 +-
 3 files changed, 5 insertions(+), 6 deletions(-)

diff --git a/disk-io.c b/disk-io.c
index 456b354cb727..4a609a892be7 100644
--- a/disk-io.c
+++ b/disk-io.c
@@ -1606,10 +1606,10 @@ int write_all_supers(struct btrfs_fs_info *fs_info)
return 0;
 }
 
-int write_ctree_super(struct btrfs_trans_handle *trans,
- struct btrfs_fs_info *fs_info)
+int write_ctree_super(struct btrfs_trans_handle *trans)
 {
int ret;
+   struct btrfs_fs_info *fs_info = trans->fs_info;
struct btrfs_root *tree_root = fs_info->tree_root;
struct btrfs_root *chunk_root = fs_info->chunk_root;
 
@@ -1657,7 +1657,7 @@ int close_ctree_fs_info(struct btrfs_fs_info *fs_info)
BUG_ON(ret);
ret = __commit_transaction(trans, root);
BUG_ON(ret);
-   write_ctree_super(trans, fs_info);
+   write_ctree_super(trans);
kfree(trans);
}
 
diff --git a/disk-io.h b/disk-io.h
index 36fb68cdb86a..fefdb0a75d07 100644
--- a/disk-io.h
+++ b/disk-io.h
@@ -164,8 +164,7 @@ static inline int close_ctree(struct btrfs_root *root)
 }
 
 int write_all_supers(struct btrfs_fs_info *fs_info);
-int write_ctree_super(struct btrfs_trans_handle *trans,
- struct btrfs_fs_info *fs_info);
+int write_ctree_super(struct btrfs_trans_handle *trans);
 int btrfs_read_dev_super(int fd, struct btrfs_super_block *sb, u64 sb_bytenr,
unsigned sbflags);
 int btrfs_map_bh_to_logical(struct btrfs_root *root, struct extent_buffer *bh,
diff --git a/transaction.c b/transaction.c
index ad70572838c1..9619265ef6e8 100644
--- a/transaction.c
+++ b/transaction.c
@@ -168,7 +168,7 @@ int btrfs_commit_transaction(struct btrfs_trans_handle 
*trans,
BUG_ON(ret);
ret = __commit_transaction(trans, root);
BUG_ON(ret);
-   write_ctree_super(trans, fs_info);
+   write_ctree_super(trans);
btrfs_finish_extent_commit(trans, fs_info->extent_root,
   _info->pinned_extents);
kfree(trans);
-- 
2.7.4

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/2] btrfs: Refactor running of delayed inode items during transaction commit

2018-05-28 Thread David Sterba
On Mon, May 28, 2018 at 03:26:58PM +0300, Nikolay Borisov wrote:
>  dmesg looks like:
>  [6.649213] WARNING: CPU: 0 PID: 2838 at fs/btrfs/transaction.c:303 
>  record_root_in_trans+0x38/0xd0

Found in the logs. I reported it to the patch that added the assertion
but I did not suspect your patches.

>  [6.662909]  create_pending_snapshot+0x1ab/0xd00

> So the answer to your question is "yes", in which case indeed this patch
> will have to be reverted.

Both patches removed from misc-next.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 05/11] btrfs-progs: check: Make update_pinned_extents take btrfs_fs_info

2018-05-28 Thread Nikolay Borisov
This function needs btrfs_fs_info and not a root. So make it directly
take btrfs_fs_info,

Signed-off-by: Nikolay Borisov 
---
 extent-tree.c | 12 ++--
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/extent-tree.c b/extent-tree.c
index 89fed5b73b1f..8e7f888b5ce5 100644
--- a/extent-tree.c
+++ b/extent-tree.c
@@ -1979,12 +1979,11 @@ static int update_block_group(struct btrfs_root *root,
return 0;
 }
 
-static int update_pinned_extents(struct btrfs_root *root,
+static int update_pinned_extents(struct btrfs_fs_info *fs_info,
u64 bytenr, u64 num, int pin)
 {
u64 len;
struct btrfs_block_group_cache *cache;
-   struct btrfs_fs_info *fs_info = root->fs_info;
 
if (pin) {
set_extent_dirty(_info->pinned_extents,
@@ -2033,7 +2032,8 @@ int btrfs_finish_extent_commit(struct btrfs_trans_handle 
*trans,
EXTENT_DIRTY);
if (ret)
break;
-   update_pinned_extents(root, start, end + 1 - start, 0);
+   update_pinned_extents(trans->fs_info, start, end + 1 - start,
+ 0);
clear_extent_dirty(unpin, start, end);
set_extent_dirty(free_space_cache, start, end);
}
@@ -2136,7 +2136,7 @@ static int pin_down_bytes(struct btrfs_trans_handle 
*trans,
}
free_extent_buffer(buf);
 pinit:
-   update_pinned_extents(root, bytenr, num_bytes, 1);
+   update_pinned_extents(trans->fs_info, bytenr, num_bytes, 1);
 
BUG_ON(err < 0);
return 0;
@@ -2145,13 +2145,13 @@ static int pin_down_bytes(struct btrfs_trans_handle 
*trans,
 void btrfs_pin_extent(struct btrfs_fs_info *fs_info,
   u64 bytenr, u64 num_bytes)
 {
-   update_pinned_extents(fs_info->extent_root, bytenr, num_bytes, 1);
+   update_pinned_extents(fs_info, bytenr, num_bytes, 1);
 }
 
 void btrfs_unpin_extent(struct btrfs_fs_info *fs_info,
u64 bytenr, u64 num_bytes)
 {
-   update_pinned_extents(fs_info->extent_root, bytenr, num_bytes, 0);
+   update_pinned_extents(fs_info, bytenr, num_bytes, 0);
 }
 
 /*
-- 
2.7.4

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [External] Re: [RFC PATCH v2 00/12] get rid of GFP_ZONE_TABLE/BAD

2018-05-28 Thread Michal Hocko
On Fri 25-05-18 09:43:09, Huaisheng HS1 Ye wrote:
> From: Michal Hocko [mailto:mho...@kernel.org]
> Sent: Thursday, May 24, 2018 8:19 PM> 
> > > Let me try to reply your questions.
> > > Exactly, GFP_ZONE_TABLE is too complicated. I think there are two 
> > > advantages
> > > from the series of patches.
> > >
> > > 1. XOR operation is simple and efficient, GFP_ZONE_TABLE/BAD need to do 
> > > twice
> > > shift operations, the first is for getting a zone_type and the second is 
> > > for
> > > checking the to be returned type is a correct or not. But with these 
> > > patch XOR
> > > operation just needs to use once. Because the bottom 3 bits of GFP 
> > > bitmask have
> > > been used to represent the encoded zone number, we can say there is no 
> > > bad zone
> > > number if all callers could use it without buggy way. Of course, the 
> > > returned
> > > zone type in gfp_zone needs to be no more than ZONE_MOVABLE.
> > 
> > But you are losing the ability to check for wrong usage. And it seems
> > that the sad reality is that the existing code do screw up.
> 
> In my opinion, originally there shouldn't be such many wrong
> combinations of these bottom 3 bits. For any user, whether or
> driver and fs, they should make a decision that which zone is they
> preferred. Matthew's idea is great, because with it the user must
> offer an unambiguous flag to gfp zone bits.

Well, I would argue that those shouldn't really care about any zones at
all. All they should carea bout is whether they really need a low mem
zone (aka directly accessible to the kernel), highmem or they are the
allocation is generally movable. Mixing zones into the picture just
makes the whole thing more complicated and error prone.
[...]
> > That being said. I am not saying that I am in love with GFP_ZONE_TABLE.
> > It always makes my head explode when I look there but it seems to work
> > with the current code and it is optimized for it. If you want to change
> > this then you should make sure you describe reasons _why_ this is an
> > improvement. And I would argue that "we can have more zones" is a
> > relevant one.
> 
> Yes, GFP_ZONE_TABLE is too complicated. The patches have 4 advantages as 
> below.
> 
> * The address zone modifiers have new operation method, that is, user should 
> decide which zone is preferred at first, then give the encoded zone number to 
> bottom 3 bits in GFP mask. That is much direct and clear than before.
> 
> * No bad zone combination, because user should choose just one address zone 
> modifier always.
> * Better performance and efficiency, current gfp_zone has to take shifting 
> operation twice for GFP_ZONE_TABLE and GFP_ZONE_BAD. With these patches, 
> gfp_zone() just needs one XOR.
> * Up to 8 zones can be used. At least it isn't a disadvantage, right?

This should be a part of the changelog. Please note that you should
provide some number if you claim performance benefits. The complexity
will always be subjective.
-- 
Michal Hocko
SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 01/11] btrfs-progs: check: Remove root argument from delete_extent_records

2018-05-28 Thread Nikolay Borisov
This function is always passed the extent_root as "root" parameter. In
turn it uses the root parameter to mostly access fs_info and performs
only a single call to btrfs_update_block_group where it passses the
passed root. This is all redundant since fs_info can be referenced
from the transaction handle and in turn extent_root can be referenced
from the fs_info. So do that to simplify the function's signature.

Signed-off-by: Nikolay Borisov 
---
 check/main.c | 17 -
 1 file changed, 8 insertions(+), 9 deletions(-)

diff --git a/check/main.c b/check/main.c
index 68da994f7ae0..4cf243d9b379 100644
--- a/check/main.c
+++ b/check/main.c
@@ -6295,10 +6295,10 @@ static int free_extent_hook(struct btrfs_trans_handle 
*trans,
 }
 
 static int delete_extent_records(struct btrfs_trans_handle *trans,
-struct btrfs_root *root,
 struct btrfs_path *path,
 u64 bytenr)
 {
+   struct btrfs_fs_info *fs_info = trans->fs_info;
struct btrfs_key key;
struct btrfs_key found_key;
struct extent_buffer *leaf;
@@ -6311,8 +6311,8 @@ static int delete_extent_records(struct 
btrfs_trans_handle *trans,
key.offset = (u64)-1;
 
while (1) {
-   ret = btrfs_search_slot(trans, root->fs_info->extent_root,
-   , path, 0, 1);
+   ret = btrfs_search_slot(trans, fs_info->extent_root, ,
+   path, 0, 1);
if (ret < 0)
break;
 
@@ -6354,7 +6354,7 @@ static int delete_extent_records(struct 
btrfs_trans_handle *trans,
"repair deleting extent record: key [%llu,%u,%llu]\n",
found_key.objectid, found_key.type, found_key.offset);
 
-   ret = btrfs_del_item(trans, root->fs_info->extent_root, path);
+   ret = btrfs_del_item(trans, fs_info->extent_root, path);
if (ret)
break;
btrfs_release_path(path);
@@ -6362,10 +6362,10 @@ static int delete_extent_records(struct 
btrfs_trans_handle *trans,
if (found_key.type == BTRFS_EXTENT_ITEM_KEY ||
found_key.type == BTRFS_METADATA_ITEM_KEY) {
u64 bytes = (found_key.type == BTRFS_EXTENT_ITEM_KEY) ?
-   found_key.offset : root->fs_info->nodesize;
+   found_key.offset : fs_info->nodesize;
 
-   ret = btrfs_update_block_group(root, bytenr,
-  bytes, 0, 0);
+   ret = btrfs_update_block_group(fs_info->extent_root,
+  bytenr, bytes, 0, 0);
if (ret)
break;
}
@@ -7293,8 +7293,7 @@ static int fixup_extent_refs(struct btrfs_fs_info *info,
}
 
/* step two, delete all the existing records */
-   ret = delete_extent_records(trans, info->extent_root, ,
-   rec->start);
+   ret = delete_extent_records(trans, , rec->start);
 
if (ret < 0)
goto out;
-- 
2.7.4

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 07/11] btrfs-progs: check: Remove unused root argument from btrfs_extent_post_op

2018-05-28 Thread Nikolay Borisov
This is no longer used by the callees of that function so remove it.

Signed-off-by: Nikolay Borisov 
---
 check/main.c  | 2 +-
 ctree.h   | 3 +--
 extent-tree.c | 5 ++---
 3 files changed, 4 insertions(+), 6 deletions(-)

diff --git a/check/main.c b/check/main.c
index 71cc16735443..9a7e6b6dbde3 100644
--- a/check/main.c
+++ b/check/main.c
@@ -8626,7 +8626,7 @@ static int reinit_extent_tree(struct btrfs_trans_handle 
*trans,
fprintf(stderr, "Error adding block group\n");
return ret;
}
-   btrfs_extent_post_op(trans, fs_info->extent_root);
+   btrfs_extent_post_op(trans);
}
 
ret = reset_balance(trans, fs_info);
diff --git a/ctree.h b/ctree.h
index c833ad6998b9..138cd22c6c6e 100644
--- a/ctree.h
+++ b/ctree.h
@@ -2504,8 +2504,7 @@ int btrfs_fix_block_accounting(struct btrfs_trans_handle 
*trans);
 void btrfs_pin_extent(struct btrfs_fs_info *fs_info, u64 bytenr, u64 
num_bytes);
 void btrfs_unpin_extent(struct btrfs_fs_info *fs_info,
u64 bytenr, u64 num_bytes);
-int btrfs_extent_post_op(struct btrfs_trans_handle *trans,
-struct btrfs_root *root);
+int btrfs_extent_post_op(struct btrfs_trans_handle *trans);
 struct btrfs_block_group_cache *btrfs_lookup_block_group(struct
 btrfs_fs_info *info,
 u64 bytenr);
diff --git a/extent-tree.c b/extent-tree.c
index e06fe56b5ca8..d90eb8139d8b 100644
--- a/extent-tree.c
+++ b/extent-tree.c
@@ -1426,8 +1426,7 @@ int btrfs_inc_extent_ref(struct btrfs_trans_handle *trans,
return err;
 }
 
-int btrfs_extent_post_op(struct btrfs_trans_handle *trans,
-struct btrfs_root *root)
+int btrfs_extent_post_op(struct btrfs_trans_handle *trans)
 {
finish_current_insert(trans);
del_pending_extents(trans);
@@ -4012,7 +4011,7 @@ static int __btrfs_record_file_extent(struct 
btrfs_trans_handle *trans,
} else if (ret != -EEXIST) {
goto fail;
}
-   btrfs_extent_post_op(trans, extent_root);
+   btrfs_extent_post_op(trans);
extent_bytenr = disk_bytenr;
extent_num_bytes = num_bytes;
extent_offset = 0;
-- 
2.7.4

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 09/11] btrfs-progs: Remove root argument from btrfs_set_block_flags

2018-05-28 Thread Nikolay Borisov
It's used only to get a reference to fs_info, which can be obtained from
the transaction handle.

Signed-off-by: Nikolay Borisov 
---
 ctree.c   |  2 +-
 ctree.h   |  5 ++---
 extent-tree.c | 22 ++
 3 files changed, 13 insertions(+), 16 deletions(-)

diff --git a/ctree.c b/ctree.c
index c89fd11fc48b..2c51580fec65 100644
--- a/ctree.c
+++ b/ctree.c
@@ -236,7 +236,7 @@ static noinline int update_ref_for_cow(struct 
btrfs_trans_handle *trans,
BUG_ON(ret);
}
if (new_flags != 0) {
-   ret = btrfs_set_block_flags(trans, root, buf->start,
+   ret = btrfs_set_block_flags(trans, buf->start,
btrfs_header_level(buf),
new_flags);
BUG_ON(ret);
diff --git a/ctree.h b/ctree.h
index 037e020401d2..c1033c86d484 100644
--- a/ctree.h
+++ b/ctree.h
@@ -2519,9 +2519,8 @@ struct extent_buffer *btrfs_alloc_free_block(struct 
btrfs_trans_handle *trans,
 int btrfs_lookup_extent_info(struct btrfs_trans_handle *trans,
 struct btrfs_fs_info *fs_info, u64 bytenr,
 u64 offset, int metadata, u64 *refs, u64 *flags);
-int btrfs_set_block_flags(struct btrfs_trans_handle *trans,
- struct btrfs_root *root,
- u64 bytenr, int level, u64 flags);
+int btrfs_set_block_flags(struct btrfs_trans_handle *trans, u64 bytenr,
+ int level, u64 flags);
 int btrfs_inc_ref(struct btrfs_trans_handle *trans, struct btrfs_root *root,
  struct extent_buffer *buf, int record_parent);
 int btrfs_dec_ref(struct btrfs_trans_handle *trans, struct btrfs_root *root,
diff --git a/extent-tree.c b/extent-tree.c
index 2fd4e7a0d9eb..24fb96492c13 100644
--- a/extent-tree.c
+++ b/extent-tree.c
@@ -1529,18 +1529,17 @@ int btrfs_lookup_extent_info(struct btrfs_trans_handle 
*trans,
return ret;
 }
 
-int btrfs_set_block_flags(struct btrfs_trans_handle *trans,
- struct btrfs_root *root,
- u64 bytenr, int level, u64 flags)
+int btrfs_set_block_flags(struct btrfs_trans_handle *trans, u64 bytenr,
+ int level, u64 flags)
 {
+   struct btrfs_fs_info *fs_info = trans->fs_info;
struct btrfs_path *path;
int ret;
struct btrfs_key key;
struct extent_buffer *l;
struct btrfs_extent_item *item;
u32 item_size;
-   int skinny_metadata =
-   btrfs_fs_incompat(root->fs_info, SKINNY_METADATA);
+   int skinny_metadata = btrfs_fs_incompat(fs_info, SKINNY_METADATA);
 
path = btrfs_alloc_path();
if (!path)
@@ -1552,13 +1551,12 @@ int btrfs_set_block_flags(struct btrfs_trans_handle 
*trans,
key.offset = level;
key.type = BTRFS_METADATA_ITEM_KEY;
} else {
-   key.offset = root->fs_info->nodesize;
+   key.offset = fs_info->nodesize;
key.type = BTRFS_EXTENT_ITEM_KEY;
}
 
 again:
-   ret = btrfs_search_slot(trans, root->fs_info->extent_root, , path,
-   0, 0);
+   ret = btrfs_search_slot(trans, fs_info->extent_root, , path, 0, 0);
if (ret < 0)
goto out;
 
@@ -1569,13 +1567,13 @@ int btrfs_set_block_flags(struct btrfs_trans_handle 
*trans,
btrfs_item_key_to_cpu(path->nodes[0], ,
  path->slots[0]);
if (key.objectid == bytenr &&
-   key.offset == root->fs_info->nodesize &&
+   key.offset == fs_info->nodesize &&
key.type == BTRFS_EXTENT_ITEM_KEY)
ret = 0;
}
if (ret) {
btrfs_release_path(path);
-   key.offset = root->fs_info->nodesize;
+   key.offset = fs_info->nodesize;
key.type = BTRFS_EXTENT_ITEM_KEY;
goto again;
}
@@ -1591,8 +1589,8 @@ int btrfs_set_block_flags(struct btrfs_trans_handle 
*trans,
item_size = btrfs_item_size_nr(l, path->slots[0]);
 #ifdef BTRFS_COMPAT_EXTENT_TREE_V0
if (item_size < sizeof(*item)) {
-   ret = convert_extent_item_v0(trans, root->fs_info->extent_root,
-path, (u64)-1, 0);
+   ret = convert_extent_item_v0(trans, fs_info->extent_root, path,
+(u64)-1, 0);
if (ret < 0)
goto out;
 
-- 
2.7.4

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  

[PATCH 03/11] btrfs-progs: check: Remove root parameter from del_pending_extents

2018-05-28 Thread Nikolay Borisov
This function always operates on the extent root which can be
referenced from trans->fs_info. Do that to simplify function's
signature.

Signed-off-by: Nikolay Borisov 
---
 extent-tree.c | 26 +-
 1 file changed, 13 insertions(+), 13 deletions(-)

diff --git a/extent-tree.c b/extent-tree.c
index 3132ccc8d44f..e950ba6de3cc 100644
--- a/extent-tree.c
+++ b/extent-tree.c
@@ -56,8 +56,7 @@ static int __free_extent(struct btrfs_trans_handle *trans,
 u64 owner_offset, int refs_to_drop);
 static int finish_current_insert(struct btrfs_trans_handle *trans, struct
 btrfs_root *extent_root);
-static int del_pending_extents(struct btrfs_trans_handle *trans, struct
-  btrfs_root *extent_root);
+static int del_pending_extents(struct btrfs_trans_handle *trans);
 static struct btrfs_block_group_cache *
 btrfs_find_block_group(struct btrfs_root *root, struct btrfs_block_group_cache
   *hint, u64 search_start, int data, int owner);
@@ -1423,7 +1422,7 @@ int btrfs_inc_extent_ref(struct btrfs_trans_handle *trans,
 out:
btrfs_free_path(path);
finish_current_insert(trans, root->fs_info->extent_root);
-   del_pending_extents(trans, root->fs_info->extent_root);
+   del_pending_extents(trans);
BUG_ON(err);
return err;
 }
@@ -1432,7 +1431,7 @@ int btrfs_extent_post_op(struct btrfs_trans_handle *trans,
 struct btrfs_root *root)
 {
finish_current_insert(trans, root->fs_info->extent_root);
-   del_pending_extents(trans, root->fs_info->extent_root);
+   del_pending_extents(trans);
return 0;
 }
 
@@ -1612,7 +1611,7 @@ int btrfs_set_block_flags(struct btrfs_trans_handle 
*trans,
 out:
btrfs_free_path(path);
finish_current_insert(trans, root->fs_info->extent_root);
-   del_pending_extents(trans, root->fs_info->extent_root);
+   del_pending_extents(trans);
return ret;
 }
 
@@ -1728,7 +1727,7 @@ static int write_one_cache_group(struct 
btrfs_trans_handle *trans,
btrfs_release_path(path);
 fail:
finish_current_insert(trans, extent_root);
-   pending_ret = del_pending_extents(trans, extent_root);
+   pending_ret = del_pending_extents(trans);
if (ret)
return ret;
if (pending_ret)
@@ -2397,8 +2396,7 @@ static int __free_extent(struct btrfs_trans_handle *trans,
  * find all the blocks marked as pending in the radix tree and remove
  * them from the extent map
  */
-static int del_pending_extents(struct btrfs_trans_handle *trans, struct
-  btrfs_root *extent_root)
+static int del_pending_extents(struct btrfs_trans_handle *trans)
 {
int ret;
int err = 0;
@@ -2408,6 +2406,8 @@ static int del_pending_extents(struct btrfs_trans_handle 
*trans, struct
struct extent_io_tree *pending_del;
struct extent_io_tree *extent_ins;
struct pending_extent_op *extent_op;
+   struct btrfs_fs_info *fs_info = trans->fs_info;
+   struct btrfs_root *extent_root = fs_info->extent_root;
 
extent_ins = _root->fs_info->extent_ins;
pending_del = _root->fs_info->pending_del;
@@ -2497,7 +2497,7 @@ int btrfs_free_extent(struct btrfs_trans_handle *trans,
}
ret = __free_extent(trans, root, bytenr, num_bytes, parent,
root_objectid, owner, offset, 1);
-   pending_ret = del_pending_extents(trans, root->fs_info->extent_root);
+   pending_ret = del_pending_extents(trans);
return ret ? ret : pending_ret;
 }
 
@@ -2790,7 +2790,7 @@ static int alloc_tree_block(struct btrfs_trans_handle 
*trans,
generation, flags,
key, level, ins);
finish_current_insert(trans, root->fs_info->extent_root);
-   del_pending_extents(trans, root->fs_info->extent_root);
+   del_pending_extents(trans);
}
return ret;
 }
@@ -3318,7 +3318,7 @@ int btrfs_make_block_group(struct btrfs_trans_handle 
*trans,
 
ret = finish_current_insert(trans, extent_root);
BUG_ON(ret);
-   ret = del_pending_extents(trans, extent_root);
+   ret = del_pending_extents(trans);
BUG_ON(ret);
 
return 0;
@@ -3417,7 +3417,7 @@ int btrfs_make_block_groups(struct btrfs_trans_handle 
*trans,
BUG_ON(ret);
 
finish_current_insert(trans, extent_root);
-   ret = del_pending_extents(trans, extent_root);
+   ret = del_pending_extents(trans);
BUG_ON(ret);
 
cur_start = cache->key.objectid + cache->key.offset;
@@ -3805,7 +3805,7 @@ int btrfs_fix_block_accounting(struct btrfs_trans_handle 
*trans)
ret = finish_current_insert(trans, root);
if (ret)
return 

Re: [PATCH] btrfs: drop unused space_info parameter from create_space_info

2018-05-28 Thread David Sterba
On Mon, May 28, 2018 at 02:30:27PM +0800, Lu Fengqi wrote:
> Since commit dc2d3005d27d ("btrfs: remove dead create_space_info
> calls"), there is only one caller btrfs_init_space_info. However, it
> doesn't need create_space_info to return space_info at all.
> 
> Signed-off-by: Lu Fengqi 

Reviewed-by: David Sterba 

Added to 4.18 queue, thanks.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH v4 3/6] Btrfs: push EXCL_OP set into btrfs_rm_device()

2018-05-28 Thread David Sterba
On Thu, May 24, 2018 at 02:41:27PM -0700, Omar Sandoval wrote:
> From: Omar Sandoval 
> 
> btrfs_ioctl_rm_dev() and btrfs_ioctl_rm_dev_v2() both manipulate this
> bit. Let's move it into the common btrfs_rm_device(), which also makes
> the following change to deal with swap files easier.

I'd rather keep it in both places as it's clear where the EXCL_OP
context starts in the ioctl handlers, not in some helper function.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/2] btrfs: Refactor running of delayed inode items during transaction commit

2018-05-28 Thread Nikolay Borisov


On 28.05.2018 14:51, Qu Wenruo wrote:
> 
> 
> On 2018年05月28日 16:51, Nikolay Borisov wrote:
>>
>>
>> On 28.05.2018 11:35, Nikolay Borisov wrote:
>>>
>>>
>>> On 28.05.2018 11:21, Misono Tomohiro wrote:
 Hello,

 I found current misc-next sometimes fails btrfs/152 when the number
 of cpu is >= 4 in my vm and git bisect points this commit.
 (btrfs/152 performs parallel send/receive.)

 The failure is caused by _check_dmesg and sometimes also leads to 
 inconsistent fs.

 dmesg looks like:
 [6.649213] WARNING: CPU: 0 PID: 2838 at fs/btrfs/transaction.c:303 
 record_root_in_trans+0x38/0xd0
 [6.650807] CPU: 0 PID: 2838 Comm: btrfs Tainted: GW 
 4.17.0-rc6+ #113
 [6.651998] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
 1.10.2-2.fc27 04/01/2014
 [6.653325] RIP: 0010:record_root_in_trans+0x38/0xd0
 [6.654090] RSP: 0018:90b781e23a68 EFLAGS: 00010206
 [6.654882] RAX: 8f346a48b720 RBX: 8f3479ce86c0 RCX: 
 
 [6.655965] RDX:  RSI: 8f347c8ec000 RDI: 
 8f346a03a958
 [6.657059] RBP: 90b781e23b80 R08: 00025870 R09: 
 a8c112bb
 [6.658162] R10: 90b781e23ba0 R11: 0002 R12: 
 8f346a03a958
 [6.659290] R13: 8f347c8d9000 R14: 8f347ba2a0b8 R15: 
 
 [6.660374] FS:  7f3b62ebc8c0() GS:8f347fc0() 
 knlGS:
 [6.661349] CS:  0010 DS:  ES:  CR0: 80050033
 [6.661928] CR2: 7f0a641995e0 CR3: 6af5c000 CR4: 
 06f0
 [6.662651] Call Trace:
 [6.662909]  create_pending_snapshot+0x1ab/0xd00
 [6.663391]  ? btrfs_delayed_inode_release_metadata+0xe0/0xf0
 [6.663972]  ? __btrfs_update_delayed_inode+0x1aa/0x210
 [6.664526]  ? __btrfs_release_delayed_node.part.18+0x8f/0x1d0
 [6.665116]  ? create_pending_snapshots+0x81/0xa0
 [6.665597]  create_pending_snapshots+0x81/0xa0
 [6.666068]  btrfs_commit_transaction+0x279/0x860
 [6.666553]  ? start_transaction+0x98/0x3c0
 [6.666989]  btrfs_mksubvol+0x589/0x5a0
 [6.667390]  ? btrfs_opendir+0x39/0x70
 [6.667778]  btrfs_ioctl_snap_create_transid+0x16a/0x1a0
 [6.668327]  btrfs_ioctl_snap_create_v2+0x121/0x180
 [6.668827]  btrfs_ioctl+0x56d/0x25a0
 [6.669205]  ? do_filp_open+0xaa/0x110
 [6.669591]  ? do_vfs_ioctl+0x9f/0x610
 [6.669980]  ? btrfs_ioctl_get_supported_features+0x20/0x20
 [6.670550]  do_vfs_ioctl+0x9f/0x610
 [6.670922]  ksys_ioctl+0x6b/0x80
 [6.671261]  __x64_sys_ioctl+0x11/0x20
 [6.671650]  do_syscall_64+0x49/0x100
 [6.672040]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
 [6.672545] RIP: 0033:0x7f3b61cbddd7
 [6.672899] RSP: 002b:7ffcdbe7b318 EFLAGS: 0246 ORIG_RAX: 
 0010
 [6.673666] RAX: ffda RBX: 7ffcdbe90610 RCX: 
 7f3b61cbddd7
 [6.674430] RDX: 7ffcdbe7b360 RSI: 50009417 RDI: 
 0003
 [6.675154] RBP: 5634403ec020 R08:  R09: 
 1010
 [6.675878] R10:  R11: 0246 R12: 
 5634403ed670
 [6.676604] R13: 7ffcdbe7c470 R14: 0053 R15: 
 7ffcdbe7b360
 [6.677330] Code: f0 01 00 00 a8 02 74 27 48 8b 07 48 39 86 40 03 00 00 
 73 1b 49 39 75 28 0f 84 91 00 00 00 85 d2 75 17 48 8b 06 48 39 46 08 74 0e 
 <0f> 0b eb 0a 85 d2 74 65 49 3b 75 28 74 76 41 89 d4 48 89 f3 48
 [6.679318] ---[ end trace b2378f91e69026c3 ]---

 transaction.c:303:
   WARN_ON(!force && root->commit_root != root->node);

 And 152.out.full looks like:
   *** fsck.btrfs output ***
   root item for root 261, current bytenr 53870592, current gen 149, current
   level 0, new bytenr 54067200, new gen 149, new level 0
   Found 1 roots with an outdated root item.
   Please run a filesystem check with the option --repair to fix them.

 =
 (My observation)

 Call chain is:
   btrfs_commit_transaction()
create_pending_snapshots()
  create_pending_snapshot()
run_delayed_item() <- removed
qgroup_account_snapshot()
run_delayed_item() <- added

 This commit removes run_delayed_item() in create_pending_snapshot() and
 instead calls it after create_pending_snapshots().

 However, as qgroup_account_snapshot() updates commit_root, if there are
 more than one pending snapshots under the same subvolume, it may fail
 to update root item correctly.

 So, I think we cannot remove run_delayed_item() here.
>>>
>>> So btrfs_run_delayed_items is run once before going into
>>> create_pending_snapshots. This means 

[PATCH 3/3] btrfs: Delayed empty block group auto removal to next transaction

2018-05-28 Thread Qu Wenruo
Under certain KVM load and LTP tests, we are possible to hit the
following calltrace if quota is enabled:
--
BTRFS critical (device vda2): unable to find logical 8820195328 length 4096
BTRFS critical (device vda2): unable to find logical 8820195328 length 4096
[ cut here ]
WARNING: CPU: 0 PID: 49 at ../block/blk-core.c:172 blk_status_to_errno+0x1a/0x30
CPU: 0 PID: 49 Comm: kworker/u2:1 Not tainted 4.12.14-15-default #1 SLE15 
(unreleased)
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
1.0.0-prebuilt.qemu-project.org 04/01/2014
Workqueue: btrfs-endio-write btrfs_endio_write_helper [btrfs]
task: 9f827b340bc0 task.stack: b4f8c0304000
RIP: 0010:blk_status_to_errno+0x1a/0x30
Call Trace:
 submit_extent_page+0x191/0x270 [btrfs]
 ? btrfs_create_repair_bio+0x130/0x130 [btrfs]
 __do_readpage+0x2d2/0x810 [btrfs]
 ? btrfs_create_repair_bio+0x130/0x130 [btrfs]
 ? run_one_async_done+0xc0/0xc0 [btrfs]
 __extent_read_full_page+0xe7/0x100 [btrfs]
 ? run_one_async_done+0xc0/0xc0 [btrfs]
 read_extent_buffer_pages+0x1ab/0x2d0 [btrfs]
 ? run_one_async_done+0xc0/0xc0 [btrfs]
 btree_read_extent_buffer_pages+0x94/0xf0 [btrfs]
 read_tree_block+0x31/0x60 [btrfs]
 read_block_for_search.isra.35+0xf0/0x2e0 [btrfs]
 btrfs_search_slot+0x46b/0xa00 [btrfs]
 ? kmem_cache_alloc+0x1a8/0x510
 ? btrfs_get_token_32+0x5b/0x120 [btrfs]
 find_parent_nodes+0x11d/0xeb0 [btrfs]
 ? leaf_space_used+0xb8/0xd0 [btrfs]
 ? btrfs_leaf_free_space+0x49/0x90 [btrfs]
 ? btrfs_find_all_roots_safe+0x93/0x100 [btrfs]
 btrfs_find_all_roots_safe+0x93/0x100 [btrfs]
 btrfs_find_all_roots+0x45/0x60 [btrfs]
 btrfs_qgroup_trace_extent_post+0x20/0x40 [btrfs]
 btrfs_add_delayed_data_ref+0x1a3/0x1d0 [btrfs]
 btrfs_alloc_reserved_file_extent+0x38/0x40 [btrfs]
 insert_reserved_file_extent.constprop.71+0x289/0x2e0 [btrfs]
 btrfs_finish_ordered_io+0x2f4/0x7f0 [btrfs]
 ? pick_next_task_fair+0x2cd/0x530
 ? __switch_to+0x92/0x4b0
 btrfs_worker_helper+0x81/0x300 [btrfs]
 process_one_work+0x1da/0x3f0
 worker_thread+0x2b/0x3f0
 ? process_one_work+0x3f0/0x3f0
 kthread+0x11a/0x130
 ? kthread_create_on_node+0x40/0x40
 ret_from_fork+0x35/0x40
Code: 00 00 5b c3 0f 1f 00 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 40 80 
ff 0c 40 0f b6 c7 77 0b 48 c1 e0 04 8b 80 00 bf c8 bd c3 <0f> 0b b8 fb ff ff ff 
c3 0f 1f 40 00 66 2e 0f 1f 84 00 00 00 00
---[ end trace f079fb809e7a862b ]---
BTRFS critical (device vda2): unable to find logical 8820195328 length 16384
BTRFS: error (device vda2) in btrfs_finish_ordered_io:3023: errno=-5 IO failure
BTRFS info (device vda2): forced readonly
BTRFS error (device vda2): pending csums is 2887680
--

Although we haven't hit the same bug in real world, it shows that since
qgroup is heavily relying on commit root, if block group auto removal
happens and removed the empty block group, qgroup could failed to find
its old referencers.

This patch will add a new member for btrfs_transaction,
pending_unused_bgs.
Now unused bgs will go trans->transaction->pending_unused_bgs, then
fs_info->unused_bgs, and finally get removed by
btrfs_deleted_unused_bgs().

Signed-off-by: Qu Wenruo 
---
 fs/btrfs/ctree.h   |  3 ++-
 fs/btrfs/extent-tree.c | 34 ++
 fs/btrfs/scrub.c   |  2 +-
 fs/btrfs/transaction.c |  7 +++
 fs/btrfs/transaction.h | 10 ++
 5 files changed, 50 insertions(+), 6 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index b19b673485fd..19d7532fa4cf 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -2829,7 +2829,8 @@ void check_system_chunk(struct btrfs_trans_handle *trans,
struct btrfs_fs_info *fs_info, const u64 type);
 u64 add_new_free_space(struct btrfs_block_group_cache *block_group,
   struct btrfs_fs_info *info, u64 start, u64 end);
-void btrfs_mark_bg_unused(struct btrfs_block_group_cache *bg);
+void btrfs_mark_bg_unused(struct btrfs_block_group_cache *bg,
+ struct btrfs_trans_handle *trans);
 
 /* ctree.c */
 int btrfs_bin_search(struct extent_buffer *eb, const struct btrfs_key *key,
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index f0d7e19feca7..80e17bd99f8e 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -6351,7 +6351,7 @@ static int update_block_group(struct btrfs_trans_handle 
*trans,
 * cache writeout.
 */
if (!alloc && old_val == 0)
-   btrfs_mark_bg_unused(cache);
+   btrfs_mark_bg_unused(cache, trans);
 
btrfs_put_block_group(cache);
total -= num_bytes;
@@ -10194,7 +10194,7 @@ int btrfs_read_block_groups(struct btrfs_fs_info *info)
inc_block_group_ro(cache, 1);
} else if (btrfs_block_group_used(>item) == 0) {
ASSERT(list_empty(>bg_list));
-   btrfs_mark_bg_unused(cache);
+   

Re: [RFC PATCH v4 6/6] Btrfs: support swap files

2018-05-28 Thread David Sterba
On Thu, May 24, 2018 at 02:41:30PM -0700, Omar Sandoval wrote:
> From: Omar Sandoval 
> 
> Implement the swap file a_ops on Btrfs. Activation needs to make sure
> that the file can be used as a swap file, which currently means it must
> be fully allocated as nocow with no compression on one device. It also
> sets up the swap extents directly with add_swap_extent(), so export it.
> 
> Signed-off-by: Omar Sandoval 
> ---
>  fs/btrfs/inode.c | 220 +++
>  mm/swapfile.c|   1 +
>  2 files changed, 221 insertions(+)
> 
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index 9228cb866115..6cca8529e307 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -10526,6 +10526,224 @@ void btrfs_set_range_writeback(void *private_data, 
> u64 start, u64 end)
>   }
>  }
>  
> +struct btrfs_swap_info {
> + u64 start;
> + u64 block_start;
> + u64 block_len;
> + u64 lowest_ppage;
> + u64 highest_ppage;
> + unsigned long nr_pages;
> + int nr_extents;
> +};
> +
> +static int btrfs_add_swap_extent(struct swap_info_struct *sis,
> +  struct btrfs_swap_info *bsi)
> +{
> + unsigned long nr_pages;
> + u64 first_ppage, first_ppage_reported, next_ppage;
> + int ret;
> +
> + first_ppage = ALIGN(bsi->block_start, PAGE_SIZE) >> PAGE_SHIFT;
> + next_ppage = ALIGN_DOWN(bsi->block_start + bsi->block_len,
> + PAGE_SIZE) >> PAGE_SHIFT;
> +
> + if (first_ppage >= next_ppage)
> + return 0;
> + nr_pages = next_ppage - first_ppage;
> +
> + first_ppage_reported = first_ppage;
> + if (bsi->start == 0)
> + first_ppage_reported++;
> + if (bsi->lowest_ppage > first_ppage_reported)
> + bsi->lowest_ppage = first_ppage_reported;
> + if (bsi->highest_ppage < (next_ppage - 1))
> + bsi->highest_ppage = next_ppage - 1;
> +
> + ret = add_swap_extent(sis, bsi->nr_pages, nr_pages, first_ppage);
> + if (ret < 0)
> + return ret;
> + bsi->nr_extents += ret;
> + bsi->nr_pages += nr_pages;
> + return 0;
> +}
> +
> +static int btrfs_swap_activate(struct swap_info_struct *sis, struct file 
> *file,
> +sector_t *span)
> +{
> + struct inode *inode = file_inode(file);
> + struct btrfs_fs_info *fs_info = BTRFS_I(inode)->root->fs_info;
> + struct extent_io_tree *io_tree = _I(inode)->io_tree;
> + struct extent_state *cached_state = NULL;
> + struct extent_map *em = NULL;
> + struct btrfs_device *device = NULL;
> + struct btrfs_swap_info bsi = {
> + .lowest_ppage = (sector_t)-1ULL,
> + };
> + int ret = 0;
> + u64 isize = inode->i_size;
> + u64 start;
> +
> + /*
> +  * The inode is locked, so these flags won't change after we check them.
> +  */
> + if (BTRFS_I(inode)->flags & BTRFS_INODE_COMPRESS) {
> + btrfs_err(fs_info, "swapfile is compressed");

I'm not sure this is the right way to phrase the error message. I'd
expect something like "swapfile cannot be compressed", but I also think
I've seen the positive and negative elsewhere so there's no universal
style to follow.

> + return -EINVAL;
> + }
> + if (!(BTRFS_I(inode)->flags & BTRFS_INODE_NODATACOW)) {
> + btrfs_err(fs_info, "swapfile is copy-on-write");
> + return -EINVAL;
> + }
> +
> + /*
> +  * Balance or device remove/replace/resize can move stuff around from
> +  * under us. The EXCL_OP flag makes sure they aren't running/won't run
> +  * concurrently while we are mapping the swap extents, and the fs_info
> +  * nr_swapfiles counter prevents them from running while the swap file
> +  * is active and moving the extents. Note that this also prevents a
> +  * concurrent device add which isn't actually necessary, but it's not
> +  * really worth the trouble to allow it.
> +  */
> + if (test_and_set_bit(BTRFS_FS_EXCL_OP, _info->flags))
> + return -EBUSY;
> + atomic_inc(_info->nr_swapfiles);
> + /*
> +  * Snapshots can create extents which require COW even if NODATACOW is
> +  * set. We use this counter to prevent snapshots. We must increment it
> +  * before walking the extents because we don't want a concurrent
> +  * snapshot to run after we've already checked the extents.
> +  */
> + atomic_inc(_I(inode)->root->nr_swapfiles);
> +
> + lock_extent_bits(io_tree, 0, isize - 1, _state);
> + start = 0;
> + while (start < isize) {
> + u64 end, logical_block_start, physical_block_start;
> + u64 len = isize - start;
> +
> + em = btrfs_get_extent(BTRFS_I(inode), NULL, 0, start, len, 0);
> + if (IS_ERR(em)) {
> + ret = PTR_ERR(em);
> + goto out;
> + }
> + end = 

[PATCH 06/11] btrfs-progs: Remove unused argument from clean_tree_block

2018-05-28 Thread Nikolay Borisov
This function actually uses only the extent_buffer arg but takes 3
arguments. Furthermore, it's current interface doesn't even mirror
the kernel counterpart. Just remove the extra arguments.

Signed-off-by: Nikolay Borisov 
---
 ctree.c   | 12 ++--
 disk-io.c |  3 +--
 disk-io.h |  3 +--
 extent-tree.c |  2 +-
 free-space-tree.c |  2 +-
 5 files changed, 10 insertions(+), 12 deletions(-)

diff --git a/ctree.c b/ctree.c
index 2c3ba70b000c..e79d7aa32319 100644
--- a/ctree.c
+++ b/ctree.c
@@ -251,7 +251,7 @@ static noinline int update_ref_for_cow(struct 
btrfs_trans_handle *trans,
ret = btrfs_dec_ref(trans, root, buf, 1);
BUG_ON(ret);
}
-   clean_tree_block(trans, root, buf);
+   clean_tree_block(buf);
}
return 0;
 }
@@ -717,7 +717,7 @@ static int balance_level(struct btrfs_trans_handle *trans,
root->node = child;
add_root_to_dirty_list(root);
path->nodes[level] = NULL;
-   clean_tree_block(trans, root, mid);
+   clean_tree_block(mid);
/* once for the path */
free_extent_buffer(mid);
 
@@ -770,7 +770,7 @@ static int balance_level(struct btrfs_trans_handle *trans,
u64 bytenr = right->start;
u32 blocksize = right->len;
 
-   clean_tree_block(trans, root, right);
+   clean_tree_block(right);
free_extent_buffer(right);
right = NULL;
wret = btrfs_del_ptr(root, path, level + 1, pslot + 1);
@@ -816,7 +816,7 @@ static int balance_level(struct btrfs_trans_handle *trans,
/* we've managed to empty the middle node, drop it */
u64 bytenr = mid->start;
u32 blocksize = mid->len;
-   clean_tree_block(trans, root, mid);
+   clean_tree_block(mid);
free_extent_buffer(mid);
mid = NULL;
wret = btrfs_del_ptr(root, path, level + 1, pslot);
@@ -2739,7 +2739,7 @@ int btrfs_del_items(struct btrfs_trans_handle *trans, 
struct btrfs_root *root,
if (leaf == root->node) {
btrfs_set_header_level(leaf, 0);
} else {
-   clean_tree_block(trans, root, leaf);
+   clean_tree_block(leaf);
wret = btrfs_del_leaf(trans, root, path, leaf);
BUG_ON(ret);
if (wret)
@@ -2775,7 +2775,7 @@ int btrfs_del_items(struct btrfs_trans_handle *trans, 
struct btrfs_root *root,
}
 
if (btrfs_header_nritems(leaf) == 0) {
-   clean_tree_block(trans, root, leaf);
+   clean_tree_block(leaf);
path->slots[1] = slot;
ret = btrfs_del_leaf(trans, root, path, leaf);
BUG_ON(ret);
diff --git a/disk-io.c b/disk-io.c
index 72d44531a1b5..456b354cb727 100644
--- a/disk-io.c
+++ b/disk-io.c
@@ -1684,8 +1684,7 @@ int close_ctree_fs_info(struct btrfs_fs_info *fs_info)
return err;
 }
 
-int clean_tree_block(struct btrfs_trans_handle *trans, struct btrfs_root *root,
-struct extent_buffer *eb)
+int clean_tree_block(struct extent_buffer *eb)
 {
return clear_extent_buffer_dirty(eb);
 }
diff --git a/disk-io.h b/disk-io.h
index c4496155771f..36fb68cdb86a 100644
--- a/disk-io.h
+++ b/disk-io.h
@@ -131,8 +131,7 @@ struct extent_buffer* btrfs_find_create_tree_block(
 
 void btrfs_setup_root(struct btrfs_root *root, struct btrfs_fs_info *fs_info,
  u64 objectid);
-int clean_tree_block(struct btrfs_trans_handle *trans,
-struct btrfs_root *root, struct extent_buffer *buf);
+int clean_tree_block(struct extent_buffer *buf);
 
 void btrfs_free_fs_info(struct btrfs_fs_info *fs_info);
 struct btrfs_fs_info *btrfs_new_fs_info(int writable, u64 sb_bytenr);
diff --git a/extent-tree.c b/extent-tree.c
index 8e7f888b5ce5..e06fe56b5ca8 100644
--- a/extent-tree.c
+++ b/extent-tree.c
@@ -2129,7 +2129,7 @@ static int pin_down_bytes(struct btrfs_trans_handle 
*trans,
if (header_owner != BTRFS_TREE_LOG_OBJECTID &&
header_transid == trans->transid &&
!btrfs_header_flag(buf, BTRFS_HEADER_FLAG_WRITTEN)) {
-   clean_tree_block(NULL, root, buf);
+   clean_tree_block(buf);
free_extent_buffer(buf);
return 1;
}
diff --git a/free-space-tree.c b/free-space-tree.c
index 69a4eca8a74f..139a031e8483 100644
--- a/free-space-tree.c
+++ b/free-space-tree.c
@@ -135,7 +135,7 @@ int 

[PATCH 3/3] btrfs: Delayed empty block group auto removal to next transaction

2018-05-28 Thread Qu Wenruo
Under certain KVM load and LTP tests, we are possible to hit the
following calltrace if quota is enabled:
--
BTRFS critical (device vda2): unable to find logical 8820195328 length 4096
BTRFS critical (device vda2): unable to find logical 8820195328 length 4096
[ cut here ]
WARNING: CPU: 0 PID: 49 at ../block/blk-core.c:172 blk_status_to_errno+0x1a/0x30
CPU: 0 PID: 49 Comm: kworker/u2:1 Not tainted 4.12.14-15-default #1 SLE15 
(unreleased)
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
1.0.0-prebuilt.qemu-project.org 04/01/2014
Workqueue: btrfs-endio-write btrfs_endio_write_helper [btrfs]
task: 9f827b340bc0 task.stack: b4f8c0304000
RIP: 0010:blk_status_to_errno+0x1a/0x30
Call Trace:
 submit_extent_page+0x191/0x270 [btrfs]
 ? btrfs_create_repair_bio+0x130/0x130 [btrfs]
 __do_readpage+0x2d2/0x810 [btrfs]
 ? btrfs_create_repair_bio+0x130/0x130 [btrfs]
 ? run_one_async_done+0xc0/0xc0 [btrfs]
 __extent_read_full_page+0xe7/0x100 [btrfs]
 ? run_one_async_done+0xc0/0xc0 [btrfs]
 read_extent_buffer_pages+0x1ab/0x2d0 [btrfs]
 ? run_one_async_done+0xc0/0xc0 [btrfs]
 btree_read_extent_buffer_pages+0x94/0xf0 [btrfs]
 read_tree_block+0x31/0x60 [btrfs]
 read_block_for_search.isra.35+0xf0/0x2e0 [btrfs]
 btrfs_search_slot+0x46b/0xa00 [btrfs]
 ? kmem_cache_alloc+0x1a8/0x510
 ? btrfs_get_token_32+0x5b/0x120 [btrfs]
 find_parent_nodes+0x11d/0xeb0 [btrfs]
 ? leaf_space_used+0xb8/0xd0 [btrfs]
 ? btrfs_leaf_free_space+0x49/0x90 [btrfs]
 ? btrfs_find_all_roots_safe+0x93/0x100 [btrfs]
 btrfs_find_all_roots_safe+0x93/0x100 [btrfs]
 btrfs_find_all_roots+0x45/0x60 [btrfs]
 btrfs_qgroup_trace_extent_post+0x20/0x40 [btrfs]
 btrfs_add_delayed_data_ref+0x1a3/0x1d0 [btrfs]
 btrfs_alloc_reserved_file_extent+0x38/0x40 [btrfs]
 insert_reserved_file_extent.constprop.71+0x289/0x2e0 [btrfs]
 btrfs_finish_ordered_io+0x2f4/0x7f0 [btrfs]
 ? pick_next_task_fair+0x2cd/0x530
 ? __switch_to+0x92/0x4b0
 btrfs_worker_helper+0x81/0x300 [btrfs]
 process_one_work+0x1da/0x3f0
 worker_thread+0x2b/0x3f0
 ? process_one_work+0x3f0/0x3f0
 kthread+0x11a/0x130
 ? kthread_create_on_node+0x40/0x40
 ret_from_fork+0x35/0x40
Code: 00 00 5b c3 0f 1f 00 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 40 80 
ff 0c 40 0f b6 c7 77 0b 48 c1 e0 04 8b 80 00 bf c8 bd c3 <0f> 0b b8 fb ff ff ff 
c3 0f 1f 40 00 66 2e 0f 1f 84 00 00 00 00
---[ end trace f079fb809e7a862b ]---
BTRFS critical (device vda2): unable to find logical 8820195328 length 16384
BTRFS: error (device vda2) in btrfs_finish_ordered_io:3023: errno=-5 IO failure
BTRFS info (device vda2): forced readonly
BTRFS error (device vda2): pending csums is 2887680
--

Although we haven't hit the same bug in real world, it shows that since
qgroup is heavily relying on commit root, if block group auto removal
happens and removed the empty block group, qgroup could failed to find
its old referencers.

This patch will add a new member for btrfs_transaction,
pending_unused_bgs.
Now unused bgs will go trans->transaction->pending_unused_bgs, then
fs_info->unused_bgs, and finally get removed by
btrfs_deleted_unused_bgs().

Signed-off-by: Qu Wenruo 
---
 fs/btrfs/ctree.h   |  3 ++-
 fs/btrfs/extent-tree.c | 34 ++
 fs/btrfs/scrub.c   |  2 +-
 fs/btrfs/transaction.c |  7 +++
 fs/btrfs/transaction.h | 10 ++
 5 files changed, 50 insertions(+), 6 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index b19b673485fd..19d7532fa4cf 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -2829,7 +2829,8 @@ void check_system_chunk(struct btrfs_trans_handle *trans,
struct btrfs_fs_info *fs_info, const u64 type);
 u64 add_new_free_space(struct btrfs_block_group_cache *block_group,
   struct btrfs_fs_info *info, u64 start, u64 end);
-void btrfs_mark_bg_unused(struct btrfs_block_group_cache *bg);
+void btrfs_mark_bg_unused(struct btrfs_block_group_cache *bg,
+ struct btrfs_trans_handle *trans);
 
 /* ctree.c */
 int btrfs_bin_search(struct extent_buffer *eb, const struct btrfs_key *key,
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index f0d7e19feca7..80e17bd99f8e 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -6351,7 +6351,7 @@ static int update_block_group(struct btrfs_trans_handle 
*trans,
 * cache writeout.
 */
if (!alloc && old_val == 0)
-   btrfs_mark_bg_unused(cache);
+   btrfs_mark_bg_unused(cache, trans);
 
btrfs_put_block_group(cache);
total -= num_bytes;
@@ -10194,7 +10194,7 @@ int btrfs_read_block_groups(struct btrfs_fs_info *info)
inc_block_group_ro(cache, 1);
} else if (btrfs_block_group_used(>item) == 0) {
ASSERT(list_empty(>bg_list));
-   btrfs_mark_bg_unused(cache);
+   

Re: [RFC PATCH v2 00/12] get rid of GFP_ZONE_TABLE/BAD

2018-05-28 Thread Michal Hocko
On Fri 25-05-18 05:00:44, Matthew Wilcox wrote:
> On Thu, May 24, 2018 at 05:29:43PM +0200, Michal Hocko wrote:
> > > ie if we had more,
> > > could we solve our pain by making them more generic?
> > 
> > Well, if you have more you will consume more bits in the struct pages,
> > right?
> 
> Not necessarily ... the zone number is stored in the struct page
> currently, so either two or three bits are used right now.  In my
> proposal, one can infer the zone of a page from its PFN, except for
> ZONE_MOVABLE.  So we could trim down to just one bit per struct page
> for 32-bit machines while using 3 bits on 64-bit machines, where there
> is plenty of space.

Just be warned that page_zone is called from many hot paths. I am not
sure adding something more complex there is going to fly.

> > > it more-or-less sucks that the devices with 28-bit DMA limits are forced
> > > to allocate from the low 16MB when they're perfectly capable of using the
> > > low 256MB.
> > 
> > Do we actually care all that much about those? If yes then we should
> > probably follow the ZONE_DMA (x86) path and use a CMA region for them.
> > I mean most devices should be good with very limited addressability or
> > below 4G, no?
> 
> Sure.  One other thing I meant to mention was the media devices
> (TV capture cards and so on) which want a vmalloc_32() allocation.
> On 32-bit machines right now, we allocate from LOWMEM, when we really
> should be allocating from the 1GB-4GB region.  32-bit machines generally
> don't have a ZONE_DMA32 today.

Well, _I_ think that vmalloc on 32b is just lost case...

-- 
Michal Hocko
SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 2/3] btrfs: Use btrfs_mark_bg_unused() to replace open code

2018-05-28 Thread Qu Wenruo
Introduce a small helper, btrfs_mark_bg_unused(), to accquire needed
locks and add a block group to unused_bgs list.

No functional modification, and only 3 callers are involved.

Signed-off-by: Qu Wenruo 
---
 fs/btrfs/ctree.h   |  1 +
 fs/btrfs/extent-tree.c | 36 +---
 fs/btrfs/scrub.c   |  9 +
 3 files changed, 19 insertions(+), 27 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 2771cc56a622..b19b673485fd 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -2829,6 +2829,7 @@ void check_system_chunk(struct btrfs_trans_handle *trans,
struct btrfs_fs_info *fs_info, const u64 type);
 u64 add_new_free_space(struct btrfs_block_group_cache *block_group,
   struct btrfs_fs_info *info, u64 start, u64 end);
+void btrfs_mark_bg_unused(struct btrfs_block_group_cache *bg);
 
 /* ctree.c */
 int btrfs_bin_search(struct extent_buffer *eb, const struct btrfs_key *key,
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 8241de81089a..f0d7e19feca7 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -6350,16 +6350,8 @@ static int update_block_group(struct btrfs_trans_handle 
*trans,
 * dirty list to avoid races between cleaner kthread and space
 * cache writeout.
 */
-   if (!alloc && old_val == 0) {
-   spin_lock(>unused_bgs_lock);
-   if (list_empty(>bg_list)) {
-   btrfs_get_block_group(cache);
-   trace_btrfs_add_unused_block_group(cache);
-   list_add_tail(>bg_list,
- >unused_bgs);
-   }
-   spin_unlock(>unused_bgs_lock);
-   }
+   if (!alloc && old_val == 0)
+   btrfs_mark_bg_unused(cache);
 
btrfs_put_block_group(cache);
total -= num_bytes;
@@ -10201,15 +10193,8 @@ int btrfs_read_block_groups(struct btrfs_fs_info *info)
if (btrfs_chunk_readonly(info, cache->key.objectid)) {
inc_block_group_ro(cache, 1);
} else if (btrfs_block_group_used(>item) == 0) {
-   spin_lock(>unused_bgs_lock);
-   /* Should always be true but just in case. */
-   if (list_empty(>bg_list)) {
-   btrfs_get_block_group(cache);
-   trace_btrfs_add_unused_block_group(cache);
-   list_add_tail(>bg_list,
- >unused_bgs);
-   }
-   spin_unlock(>unused_bgs_lock);
+   ASSERT(list_empty(>bg_list));
+   btrfs_mark_bg_unused(cache);
}
}
 
@@ -11133,3 +8,16 @@ void btrfs_wait_for_snapshot_creation(struct 
btrfs_root *root)
   !atomic_read(>will_be_snapshotted));
}
 }
+
+void btrfs_mark_bg_unused(struct btrfs_block_group_cache *bg)
+{
+   struct btrfs_fs_info *fs_info = bg->fs_info;
+
+   spin_lock(_info->unused_bgs_lock);
+   if (list_empty(>bg_list)) {
+   btrfs_get_block_group(bg);
+   trace_btrfs_add_unused_block_group(bg);
+   list_add_tail(>bg_list, _info->unused_bgs);
+   }
+   spin_unlock(_info->unused_bgs_lock);
+}
diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index a59005862010..40086b47a65f 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -3981,14 +3981,7 @@ int scrub_enumerate_chunks(struct scrub_ctx *sctx,
if (!cache->removed && !cache->ro && cache->reserved == 0 &&
btrfs_block_group_used(>item) == 0) {
spin_unlock(>lock);
-   spin_lock(_info->unused_bgs_lock);
-   if (list_empty(>bg_list)) {
-   btrfs_get_block_group(cache);
-   trace_btrfs_add_unused_block_group(cache);
-   list_add_tail(>bg_list,
- _info->unused_bgs);
-   }
-   spin_unlock(_info->unused_bgs_lock);
+   btrfs_mark_bg_unused(cache);
} else {
spin_unlock(>lock);
}
-- 
2.17.0

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 10/11] btrfs-progs: Remove root argument from write_one_cache_group

2018-05-28 Thread Nikolay Borisov
It's not needed since we can acquire a reference to the fs_info from
the transaction handle already passed.

Signed-off-by: Nikolay Borisov 
---
 extent-tree.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/extent-tree.c b/extent-tree.c
index 24fb96492c13..1694c479cc2b 100644
--- a/extent-tree.c
+++ b/extent-tree.c
@@ -1699,13 +1699,12 @@ int btrfs_dec_ref(struct btrfs_trans_handle *trans, 
struct btrfs_root *root,
 }
 
 static int write_one_cache_group(struct btrfs_trans_handle *trans,
-struct btrfs_root *root,
 struct btrfs_path *path,
 struct btrfs_block_group_cache *cache)
 {
int ret;
int pending_ret;
-   struct btrfs_root *extent_root = root->fs_info->extent_root;
+   struct btrfs_root *extent_root = trans->fs_info->extent_root;
unsigned long bi;
struct extent_buffer *leaf;
 
@@ -1765,7 +1764,7 @@ int btrfs_write_dirty_block_groups(struct 
btrfs_trans_handle *trans,
  BLOCK_GROUP_DIRTY);
 
cache = (struct btrfs_block_group_cache *)(unsigned long)ptr;
-   ret = write_one_cache_group(trans, root, path, cache);
+   ret = write_one_cache_group(trans, path, cache);
}
btrfs_free_path(path);
return 0;
-- 
2.7.4

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/2] btrfs: Refactor running of delayed inode items during transaction commit

2018-05-28 Thread Nikolay Borisov


On 28.05.2018 11:21, Misono Tomohiro wrote:
> Hello,
> 
> I found current misc-next sometimes fails btrfs/152 when the number
> of cpu is >= 4 in my vm and git bisect points this commit.
> (btrfs/152 performs parallel send/receive.)
> 
> The failure is caused by _check_dmesg and sometimes also leads to 
> inconsistent fs.
> 
> dmesg looks like:
> [6.649213] WARNING: CPU: 0 PID: 2838 at fs/btrfs/transaction.c:303 
> record_root_in_trans+0x38/0xd0
> [6.650807] CPU: 0 PID: 2838 Comm: btrfs Tainted: GW 
> 4.17.0-rc6+ #113
> [6.651998] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
> 1.10.2-2.fc27 04/01/2014
> [6.653325] RIP: 0010:record_root_in_trans+0x38/0xd0
> [6.654090] RSP: 0018:90b781e23a68 EFLAGS: 00010206
> [6.654882] RAX: 8f346a48b720 RBX: 8f3479ce86c0 RCX: 
> 
> [6.655965] RDX:  RSI: 8f347c8ec000 RDI: 
> 8f346a03a958
> [6.657059] RBP: 90b781e23b80 R08: 00025870 R09: 
> a8c112bb
> [6.658162] R10: 90b781e23ba0 R11: 0002 R12: 
> 8f346a03a958
> [6.659290] R13: 8f347c8d9000 R14: 8f347ba2a0b8 R15: 
> 
> [6.660374] FS:  7f3b62ebc8c0() GS:8f347fc0() 
> knlGS:
> [6.661349] CS:  0010 DS:  ES:  CR0: 80050033
> [6.661928] CR2: 7f0a641995e0 CR3: 6af5c000 CR4: 
> 06f0
> [6.662651] Call Trace:
> [6.662909]  create_pending_snapshot+0x1ab/0xd00
> [6.663391]  ? btrfs_delayed_inode_release_metadata+0xe0/0xf0
> [6.663972]  ? __btrfs_update_delayed_inode+0x1aa/0x210
> [6.664526]  ? __btrfs_release_delayed_node.part.18+0x8f/0x1d0
> [6.665116]  ? create_pending_snapshots+0x81/0xa0
> [6.665597]  create_pending_snapshots+0x81/0xa0
> [6.666068]  btrfs_commit_transaction+0x279/0x860
> [6.666553]  ? start_transaction+0x98/0x3c0
> [6.666989]  btrfs_mksubvol+0x589/0x5a0
> [6.667390]  ? btrfs_opendir+0x39/0x70
> [6.667778]  btrfs_ioctl_snap_create_transid+0x16a/0x1a0
> [6.668327]  btrfs_ioctl_snap_create_v2+0x121/0x180
> [6.668827]  btrfs_ioctl+0x56d/0x25a0
> [6.669205]  ? do_filp_open+0xaa/0x110
> [6.669591]  ? do_vfs_ioctl+0x9f/0x610
> [6.669980]  ? btrfs_ioctl_get_supported_features+0x20/0x20
> [6.670550]  do_vfs_ioctl+0x9f/0x610
> [6.670922]  ksys_ioctl+0x6b/0x80
> [6.671261]  __x64_sys_ioctl+0x11/0x20
> [6.671650]  do_syscall_64+0x49/0x100
> [6.672040]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> [6.672545] RIP: 0033:0x7f3b61cbddd7
> [6.672899] RSP: 002b:7ffcdbe7b318 EFLAGS: 0246 ORIG_RAX: 
> 0010
> [6.673666] RAX: ffda RBX: 7ffcdbe90610 RCX: 
> 7f3b61cbddd7
> [6.674430] RDX: 7ffcdbe7b360 RSI: 50009417 RDI: 
> 0003
> [6.675154] RBP: 5634403ec020 R08:  R09: 
> 1010
> [6.675878] R10:  R11: 0246 R12: 
> 5634403ed670
> [6.676604] R13: 7ffcdbe7c470 R14: 0053 R15: 
> 7ffcdbe7b360
> [6.677330] Code: f0 01 00 00 a8 02 74 27 48 8b 07 48 39 86 40 03 00 00 73 
> 1b 49 39 75 28 0f 84 91 00 00 00 85 d2 75 17 48 8b 06 48 39 46 08 74 0e <0f> 
> 0b eb 0a 85 d2 74 65 49 3b 75 28 74 76 41 89 d4 48 89 f3 48
> [6.679318] ---[ end trace b2378f91e69026c3 ]---
> 
> transaction.c:303:
>   WARN_ON(!force && root->commit_root != root->node);
> 
> And 152.out.full looks like:
>   *** fsck.btrfs output ***
>   root item for root 261, current bytenr 53870592, current gen 149, current
>   level 0, new bytenr 54067200, new gen 149, new level 0
>   Found 1 roots with an outdated root item.
>   Please run a filesystem check with the option --repair to fix them.
> 
> =
> (My observation)
> 
> Call chain is:
>   btrfs_commit_transaction()
>create_pending_snapshots()
>  create_pending_snapshot()
>run_delayed_item() <- removed
>qgroup_account_snapshot()
>run_delayed_item() <- added
> 
> This commit removes run_delayed_item() in create_pending_snapshot() and
> instead calls it after create_pending_snapshots().
> 
> However, as qgroup_account_snapshot() updates commit_root, if there are
> more than one pending snapshots under the same subvolume, it may fail
> to update root item correctly.
> 
> So, I think we cannot remove run_delayed_item() here.

So btrfs_run_delayed_items is run once before going into
create_pending_snapshots. This means that when we go into
create_pending_snapshots the filesystem should be consistent from the
POV of delayed items (as per patch's changelog).

I somehow fail to reconcile how the removal of run delayed items (which
really pertain to inode state) have anything to do with an outdated
root?  IMHO you are hitting some race which my patch just made more
visible due to removing code in the critical path.

> 
> Thanks,
> Tomohiro Misono
> 
> On 2018/02/13 23:16, 

[PATCH 0/3] btrfs: Delay block group auto removal to avoid interfering qgroups

2018-05-28 Thread Qu Wenruo
The patchset can be fetched from github:
https://github.com/adam900710/linux/tree/delayed_bg_removal

It's based on v4.17-rc5 branch.

This bug is reported from SUSE openQA, although it's pretty hard to hit
in real world (even real world VM), it's believed block group auto
removal (anyway, there isn't much thing can remove a chunk mapping in
btrfs) could interfere with qgroup's search on commit root.

Full details can be found in the 3rd patch.

The patchset uses 2 submitted cleanup/refactor patches as basis, and the
3rd patch will ensure unused block group will only be deleted after
current transaction is done.

Qu Wenruo (3):
  btrfs: trace: Add trace points for unused block groups
  btrfs: Use btrfs_mark_bg_unused() to replace open code
  btrfs: Delayed empty block group auto removal to next transaction

 fs/btrfs/ctree.h |  2 ++
 fs/btrfs/extent-tree.c   | 62 ++--
 fs/btrfs/scrub.c |  8 +
 fs/btrfs/transaction.c   |  7 
 fs/btrfs/transaction.h   | 10 ++
 include/trace/events/btrfs.h | 42 
 6 files changed, 107 insertions(+), 24 deletions(-)

-- 
2.17.0

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 1/3] btrfs: trace: Add trace points for unused block groups

2018-05-28 Thread Qu Wenruo
This patch will add the following trace events:
1) btrfs_remove_block_group
   For btrfs_remove_block_group() function.
   Triggered when a block group is really removed.

2) btrfs_add_unused_block_group
   Triggered which block group is added to unused_bgs list.

3) btrfs_skip_unused_block_group
   Triggered which unused block group is not deleted.

These trace events is pretty handy to debug case related to block group
auto remove.

Signed-off-by: Qu Wenruo 
---
 fs/btrfs/extent-tree.c   |  4 
 fs/btrfs/scrub.c |  1 +
 include/trace/events/btrfs.h | 42 
 3 files changed, 47 insertions(+)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 51b5e2da708c..8241de81089a 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -6354,6 +6354,7 @@ static int update_block_group(struct btrfs_trans_handle 
*trans,
spin_lock(>unused_bgs_lock);
if (list_empty(>bg_list)) {
btrfs_get_block_group(cache);
+   trace_btrfs_add_unused_block_group(cache);
list_add_tail(>bg_list,
  >unused_bgs);
}
@@ -10204,6 +10205,7 @@ int btrfs_read_block_groups(struct btrfs_fs_info *info)
/* Should always be true but just in case. */
if (list_empty(>bg_list)) {
btrfs_get_block_group(cache);
+   trace_btrfs_add_unused_block_group(cache);
list_add_tail(>bg_list,
  >unused_bgs);
}
@@ -10391,6 +10393,7 @@ int btrfs_remove_block_group(struct btrfs_trans_handle 
*trans,
BUG_ON(!block_group);
BUG_ON(!block_group->ro);
 
+   trace_btrfs_remove_block_group(block_group);
/*
 * Free the reserved super bytes from this block group before
 * remove it.
@@ -10755,6 +10758,7 @@ void btrfs_delete_unused_bgs(struct btrfs_fs_info 
*fs_info)
 * the ro check in case balance is currently acting on
 * this block group.
 */
+   trace_btrfs_skip_unused_block_group(block_group);
spin_unlock(_group->lock);
up_write(_info->groups_sem);
goto next;
diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index 52b39a0924e9..a59005862010 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -3984,6 +3984,7 @@ int scrub_enumerate_chunks(struct scrub_ctx *sctx,
spin_lock(_info->unused_bgs_lock);
if (list_empty(>bg_list)) {
btrfs_get_block_group(cache);
+   trace_btrfs_add_unused_block_group(cache);
list_add_tail(>bg_list,
  _info->unused_bgs);
}
diff --git a/include/trace/events/btrfs.h b/include/trace/events/btrfs.h
index 965c650a5273..c226bc9a61db 100644
--- a/include/trace/events/btrfs.h
+++ b/include/trace/events/btrfs.h
@@ -1810,6 +1810,48 @@ TRACE_EVENT(btrfs_inode_mod_outstanding_extents,
show_root_type(__entry->root_objectid),
(unsigned long long)__entry->ino, __entry->mod)
 );
+
+DECLARE_EVENT_CLASS(btrfs__block_group,
+   TP_PROTO(const struct btrfs_block_group_cache *bg_cache),
+
+   TP_ARGS(bg_cache),
+
+   TP_STRUCT__entry_btrfs(
+   __field(u64,bytenr  )
+   __field(u64,flags   )
+   __field(u64,len )
+   __field(u64,used)
+   ),
+
+   TP_fast_assign_btrfs(bg_cache->fs_info,
+   __entry->bytenr = bg_cache->key.objectid,
+   __entry->len= bg_cache->key.offset,
+   __entry->flags  = bg_cache->flags;
+   __entry->used   = btrfs_block_group_used(_cache->item);
+   ),
+
+   TP_printk_btrfs("bg bytenr=%llu len=%llu used=%llu flags=%llu(%s)",
+   __entry->bytenr, __entry->len, __entry->used, __entry->flags,
+   __print_flags(__entry->flags, "|", BTRFS_GROUP_FLAGS))
+);
+
+DEFINE_EVENT(btrfs__block_group, btrfs_remove_block_group,
+   TP_PROTO(const struct btrfs_block_group_cache *bg_cache),
+
+   TP_ARGS(bg_cache)
+);
+
+DEFINE_EVENT(btrfs__block_group, btrfs_add_unused_block_group,
+   TP_PROTO(const struct btrfs_block_group_cache *bg_cache),
+
+   TP_ARGS(bg_cache)
+);
+
+DEFINE_EVENT(btrfs__block_group, btrfs_skip_unused_block_group,
+   TP_PROTO(const struct btrfs_block_group_cache *bg_cache),
+
+   TP_ARGS(bg_cache)
+);
 #endif /* _TRACE_BTRFS_H 

[PATCH 02/11] btrfs-progs: check: Remove root parameter from btrfs_fix_block_accounting

2018-05-28 Thread Nikolay Borisov
It's always set to extent_root and the function already takes a
transaction handle where fs_info could be referenced and in turn
the extent_tree.

Signed-off-by: Nikolay Borisov 
---
 check/main.c| 2 +-
 check/mode-lowmem.c | 2 +-
 ctree.h | 3 +--
 extent-tree.c   | 8 +++-
 4 files changed, 6 insertions(+), 9 deletions(-)

diff --git a/check/main.c b/check/main.c
index 4cf243d9b379..71cc16735443 100644
--- a/check/main.c
+++ b/check/main.c
@@ -7671,7 +7671,7 @@ static int check_extent_refs(struct btrfs_root *root,
goto repair_abort;
}
 
-   ret = btrfs_fix_block_accounting(trans, root);
+   ret = btrfs_fix_block_accounting(trans);
if (ret)
goto repair_abort;
ret = btrfs_commit_transaction(trans, root);
diff --git a/check/mode-lowmem.c b/check/mode-lowmem.c
index 9890180d1d3c..410c5f635d4e 100644
--- a/check/mode-lowmem.c
+++ b/check/mode-lowmem.c
@@ -556,7 +556,7 @@ static int repair_block_accounting(struct btrfs_fs_info 
*fs_info)
return ret;
}
 
-   ret = btrfs_fix_block_accounting(trans, root);
+   ret = btrfs_fix_block_accounting(trans);
btrfs_commit_transaction(trans, root);
return ret;
 }
diff --git a/ctree.h b/ctree.h
index 697895a9e098..c833ad6998b9 100644
--- a/ctree.h
+++ b/ctree.h
@@ -2500,8 +2500,7 @@ int btrfs_reserve_extent(struct btrfs_trans_handle *trans,
 u64 num_bytes, u64 empty_size,
 u64 hint_byte, u64 search_end,
 struct btrfs_key *ins, bool is_data);
-int btrfs_fix_block_accounting(struct btrfs_trans_handle *trans,
-struct btrfs_root *root);
+int btrfs_fix_block_accounting(struct btrfs_trans_handle *trans);
 void btrfs_pin_extent(struct btrfs_fs_info *fs_info, u64 bytenr, u64 
num_bytes);
 void btrfs_unpin_extent(struct btrfs_fs_info *fs_info,
u64 bytenr, u64 num_bytes);
diff --git a/extent-tree.c b/extent-tree.c
index 1b8a4f8cb1c3..3132ccc8d44f 100644
--- a/extent-tree.c
+++ b/extent-tree.c
@@ -3788,8 +3788,7 @@ int btrfs_free_block_group(struct btrfs_trans_handle 
*trans,
  * Fixup block accounting. The initial block accounting created by
  * make_block_groups isn't accuracy in this case.
  */
-int btrfs_fix_block_accounting(struct btrfs_trans_handle *trans,
-  struct btrfs_root *root)
+int btrfs_fix_block_accounting(struct btrfs_trans_handle *trans)
 {
int ret = 0;
int slot;
@@ -3799,9 +3798,8 @@ int btrfs_fix_block_accounting(struct btrfs_trans_handle 
*trans,
struct btrfs_key key;
struct extent_buffer *leaf;
struct btrfs_block_group_cache *cache;
-   struct btrfs_fs_info *fs_info = root->fs_info;
-
-   root = root->fs_info->extent_root;
+   struct btrfs_fs_info *fs_info = trans->fs_info;
+   struct btrfs_root *root = fs_info->extent_root;
 
while(extent_root_pending_ops(fs_info)) {
ret = finish_current_insert(trans, root);
-- 
2.7.4

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 04/11] btrfs-progs: check: Remove root argument from finish_current_insert

2018-05-28 Thread Nikolay Borisov
Just reference it directly from trans->fs_info.

Signed-off-by: Nikolay Borisov 
---
 extent-tree.c | 27 +--
 1 file changed, 13 insertions(+), 14 deletions(-)

diff --git a/extent-tree.c b/extent-tree.c
index e950ba6de3cc..89fed5b73b1f 100644
--- a/extent-tree.c
+++ b/extent-tree.c
@@ -54,8 +54,7 @@ static int __free_extent(struct btrfs_trans_handle *trans,
 u64 bytenr, u64 num_bytes, u64 parent,
 u64 root_objectid, u64 owner_objectid,
 u64 owner_offset, int refs_to_drop);
-static int finish_current_insert(struct btrfs_trans_handle *trans, struct
-btrfs_root *extent_root);
+static int finish_current_insert(struct btrfs_trans_handle *trans);
 static int del_pending_extents(struct btrfs_trans_handle *trans);
 static struct btrfs_block_group_cache *
 btrfs_find_block_group(struct btrfs_root *root, struct btrfs_block_group_cache
@@ -1421,7 +1420,7 @@ int btrfs_inc_extent_ref(struct btrfs_trans_handle *trans,
err = ret;
 out:
btrfs_free_path(path);
-   finish_current_insert(trans, root->fs_info->extent_root);
+   finish_current_insert(trans);
del_pending_extents(trans);
BUG_ON(err);
return err;
@@ -1430,7 +1429,7 @@ int btrfs_inc_extent_ref(struct btrfs_trans_handle *trans,
 int btrfs_extent_post_op(struct btrfs_trans_handle *trans,
 struct btrfs_root *root)
 {
-   finish_current_insert(trans, root->fs_info->extent_root);
+   finish_current_insert(trans);
del_pending_extents(trans);
return 0;
 }
@@ -1610,7 +1609,7 @@ int btrfs_set_block_flags(struct btrfs_trans_handle 
*trans,
btrfs_set_extent_flags(l, item, flags);
 out:
btrfs_free_path(path);
-   finish_current_insert(trans, root->fs_info->extent_root);
+   finish_current_insert(trans);
del_pending_extents(trans);
return ret;
 }
@@ -1726,7 +1725,7 @@ static int write_one_cache_group(struct 
btrfs_trans_handle *trans,
btrfs_mark_buffer_dirty(leaf);
btrfs_release_path(path);
 fail:
-   finish_current_insert(trans, extent_root);
+   finish_current_insert(trans);
pending_ret = del_pending_extents(trans);
if (ret)
return ret;
@@ -2056,13 +2055,13 @@ static int extent_root_pending_ops(struct btrfs_fs_info 
*info)
return ret == 0;
 
 }
-static int finish_current_insert(struct btrfs_trans_handle *trans,
-struct btrfs_root *extent_root)
+static int finish_current_insert(struct btrfs_trans_handle *trans)
 {
u64 start;
u64 end;
u64 priv;
-   struct btrfs_fs_info *info = extent_root->fs_info;
+   struct btrfs_fs_info *info = trans->fs_info;
+   struct btrfs_root *extent_root = info->extent_root;
struct pending_extent_op *extent_op;
struct btrfs_key key;
int ret;
@@ -2388,7 +2387,7 @@ static int __free_extent(struct btrfs_trans_handle *trans,
}
 fail:
btrfs_free_path(path);
-   finish_current_insert(trans, extent_root);
+   finish_current_insert(trans);
return ret;
 }
 
@@ -2789,7 +2788,7 @@ static int alloc_tree_block(struct btrfs_trans_handle 
*trans,
ret = alloc_reserved_tree_block(trans, root, root_objectid,
generation, flags,
key, level, ins);
-   finish_current_insert(trans, root->fs_info->extent_root);
+   finish_current_insert(trans);
del_pending_extents(trans);
}
return ret;
@@ -3316,7 +3315,7 @@ int btrfs_make_block_group(struct btrfs_trans_handle 
*trans,
sizeof(cache->item));
BUG_ON(ret);
 
-   ret = finish_current_insert(trans, extent_root);
+   ret = finish_current_insert(trans);
BUG_ON(ret);
ret = del_pending_extents(trans);
BUG_ON(ret);
@@ -3416,7 +3415,7 @@ int btrfs_make_block_groups(struct btrfs_trans_handle 
*trans,
sizeof(cache->item));
BUG_ON(ret);
 
-   finish_current_insert(trans, extent_root);
+   finish_current_insert(trans);
ret = del_pending_extents(trans);
BUG_ON(ret);
 
@@ -3802,7 +3801,7 @@ int btrfs_fix_block_accounting(struct btrfs_trans_handle 
*trans)
struct btrfs_root *root = fs_info->extent_root;
 
while(extent_root_pending_ops(fs_info)) {
-   ret = finish_current_insert(trans, root);
+   ret = finish_current_insert(trans);
if (ret)
return ret;
ret = del_pending_extents(trans);
-- 
2.7.4

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More 

[PATCH 2/3] btrfs: Use btrfs_mark_bg_unused() to replace open code

2018-05-28 Thread Qu Wenruo
Introduce a small helper, btrfs_mark_bg_unused(), to accquire needed
locks and add a block group to unused_bgs list.

No functional modification, and only 3 callers are involved.

Signed-off-by: Qu Wenruo 
---
 fs/btrfs/ctree.h   |  1 +
 fs/btrfs/extent-tree.c | 36 +---
 fs/btrfs/scrub.c   |  9 +
 3 files changed, 19 insertions(+), 27 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 2771cc56a622..b19b673485fd 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -2829,6 +2829,7 @@ void check_system_chunk(struct btrfs_trans_handle *trans,
struct btrfs_fs_info *fs_info, const u64 type);
 u64 add_new_free_space(struct btrfs_block_group_cache *block_group,
   struct btrfs_fs_info *info, u64 start, u64 end);
+void btrfs_mark_bg_unused(struct btrfs_block_group_cache *bg);
 
 /* ctree.c */
 int btrfs_bin_search(struct extent_buffer *eb, const struct btrfs_key *key,
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 8241de81089a..f0d7e19feca7 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -6350,16 +6350,8 @@ static int update_block_group(struct btrfs_trans_handle 
*trans,
 * dirty list to avoid races between cleaner kthread and space
 * cache writeout.
 */
-   if (!alloc && old_val == 0) {
-   spin_lock(>unused_bgs_lock);
-   if (list_empty(>bg_list)) {
-   btrfs_get_block_group(cache);
-   trace_btrfs_add_unused_block_group(cache);
-   list_add_tail(>bg_list,
- >unused_bgs);
-   }
-   spin_unlock(>unused_bgs_lock);
-   }
+   if (!alloc && old_val == 0)
+   btrfs_mark_bg_unused(cache);
 
btrfs_put_block_group(cache);
total -= num_bytes;
@@ -10201,15 +10193,8 @@ int btrfs_read_block_groups(struct btrfs_fs_info *info)
if (btrfs_chunk_readonly(info, cache->key.objectid)) {
inc_block_group_ro(cache, 1);
} else if (btrfs_block_group_used(>item) == 0) {
-   spin_lock(>unused_bgs_lock);
-   /* Should always be true but just in case. */
-   if (list_empty(>bg_list)) {
-   btrfs_get_block_group(cache);
-   trace_btrfs_add_unused_block_group(cache);
-   list_add_tail(>bg_list,
- >unused_bgs);
-   }
-   spin_unlock(>unused_bgs_lock);
+   ASSERT(list_empty(>bg_list));
+   btrfs_mark_bg_unused(cache);
}
}
 
@@ -11133,3 +8,16 @@ void btrfs_wait_for_snapshot_creation(struct 
btrfs_root *root)
   !atomic_read(>will_be_snapshotted));
}
 }
+
+void btrfs_mark_bg_unused(struct btrfs_block_group_cache *bg)
+{
+   struct btrfs_fs_info *fs_info = bg->fs_info;
+
+   spin_lock(_info->unused_bgs_lock);
+   if (list_empty(>bg_list)) {
+   btrfs_get_block_group(bg);
+   trace_btrfs_add_unused_block_group(bg);
+   list_add_tail(>bg_list, _info->unused_bgs);
+   }
+   spin_unlock(_info->unused_bgs_lock);
+}
diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index a59005862010..40086b47a65f 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -3981,14 +3981,7 @@ int scrub_enumerate_chunks(struct scrub_ctx *sctx,
if (!cache->removed && !cache->ro && cache->reserved == 0 &&
btrfs_block_group_used(>item) == 0) {
spin_unlock(>lock);
-   spin_lock(_info->unused_bgs_lock);
-   if (list_empty(>bg_list)) {
-   btrfs_get_block_group(cache);
-   trace_btrfs_add_unused_block_group(cache);
-   list_add_tail(>bg_list,
- _info->unused_bgs);
-   }
-   spin_unlock(_info->unused_bgs_lock);
+   btrfs_mark_bg_unused(cache);
} else {
spin_unlock(>lock);
}
-- 
2.17.0

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 0/3] btrfs: Delay block group auto removal to avoid interfering qgroups

2018-05-28 Thread Qu Wenruo
The patchset can be fetched from github:
https://github.com/adam900710/linux/tree/delayed_bg_removal

It's based on v4.17-rc5 branch.

This bug is reported from SUSE openQA, although it's pretty hard to hit
in real world (even real world VM), it's believed block group auto
removal (anyway, there isn't much thing can remove a chunk mapping in
btrfs) could interfere with qgroup's search on commit root.

Full details can be found in the 3rd patch.

The patchset uses 2 submitted cleanup/refactor patches as basis, and the
3rd patch will ensure unused block group will only be deleted after
current transaction is done.

Qu Wenruo (3):
  btrfs: trace: Add trace points for unused block groups
  btrfs: Use btrfs_mark_bg_unused() to replace open code
  btrfs: Delayed empty block group auto removal to next transaction

 fs/btrfs/ctree.h |  2 ++
 fs/btrfs/extent-tree.c   | 62 ++--
 fs/btrfs/scrub.c |  8 +
 fs/btrfs/transaction.c   |  7 
 fs/btrfs/transaction.h   | 10 ++
 include/trace/events/btrfs.h | 42 
 6 files changed, 107 insertions(+), 24 deletions(-)

-- 
2.17.0

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH v2 0/6] btrfs send stream version 2

2018-05-28 Thread David Sterba
On Sun, May 27, 2018 at 10:14:29PM -0400, Howard McLauchlan wrote:
> This is v2 of send stream version 2. The goal is to provide proper
> versioning/compatibility as new features are implemented. v1 can be found here
> [1].

We need to decide the overall approach to the versioning updates. The
wiki page has several problems with v1 that I don't see addressed in
this patchset. So the question is:

1) fix everything we know about now in send protocol v2

2) incremental fixes (eg. this patchset) and version bumps as the missing
   features/bugs get fixed

I'd vote for 1 because 2 is likely to cause usability problems with
kernel and clients with different version support. But eg. rsync has
protocol version 30 and maybe more so handful of versions does not need
to be an issue in the end. We should be ready with version updates for
btrfs but at this moment I haven't thought about all the usability
issues for 2).
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 08/11] btrfs-progs: Change btrfs_root to btrfs_fs_info argument in btrfs_lookup_extent_info

2018-05-28 Thread Nikolay Borisov
That function really wants an fs_info and not a root. Accidentally,
this also makes the kernel/user space signatures to be coherent.

Signed-off-by: Nikolay Borisov 
---
 check/main.c| 10 +-
 check/mode-lowmem.c |  4 ++--
 ctree.c |  3 ++-
 ctree.h |  2 +-
 extent-tree.c   | 14 ++
 5 files changed, 16 insertions(+), 17 deletions(-)

diff --git a/check/main.c b/check/main.c
index 9a7e6b6dbde3..9e8a83f86ab0 100644
--- a/check/main.c
+++ b/check/main.c
@@ -1622,9 +1622,9 @@ static int walk_down_tree(struct btrfs_root *root, struct 
btrfs_path *path,
refs = nrefs->refs[*level];
ret = 0;
} else {
-   ret = btrfs_lookup_extent_info(NULL, root,
-  path->nodes[*level]->start,
-  *level, 1, , NULL);
+   ret = btrfs_lookup_extent_info(NULL, fs_info,
+  path->nodes[*level]->start,
+  *level, 1, , NULL);
if (ret < 0) {
err = ret;
goto out;
@@ -1664,7 +1664,7 @@ static int walk_down_tree(struct btrfs_root *root, struct 
btrfs_path *path,
if (bytenr == nrefs->bytenr[*level - 1]) {
refs = nrefs->refs[*level - 1];
} else {
-   ret = btrfs_lookup_extent_info(NULL, root, bytenr,
+   ret = btrfs_lookup_extent_info(NULL, fs_info, bytenr,
*level - 1, 1, , NULL);
if (ret < 0) {
refs = 0;
@@ -5928,7 +5928,7 @@ static int run_next_block(struct btrfs_root *root,
 
flags = 0;
if (!init_extent_tree) {
-   ret = btrfs_lookup_extent_info(NULL, root, bytenr,
+   ret = btrfs_lookup_extent_info(NULL, fs_info, bytenr,
   btrfs_header_level(buf), 1, NULL,
   );
if (ret < 0) {
diff --git a/check/mode-lowmem.c b/check/mode-lowmem.c
index 410c5f635d4e..989306f03c2a 100644
--- a/check/mode-lowmem.c
+++ b/check/mode-lowmem.c
@@ -186,8 +186,8 @@ static int update_nodes_refs(struct btrfs_root *root, u64 
bytenr,
 
if (bytenr != (u64)-1) {
/* the return value of this function seems a mistake */
-   ret = btrfs_lookup_extent_info(NULL, root, bytenr,
-  level, 1, , );
+   ret = btrfs_lookup_extent_info(NULL, root->fs_info, bytenr,
+  level, 1, , );
/* temporary fix */
if (ret < 0 && !check_all)
return ret;
diff --git a/ctree.c b/ctree.c
index e79d7aa32319..c89fd11fc48b 100644
--- a/ctree.c
+++ b/ctree.c
@@ -192,7 +192,8 @@ static noinline int update_ref_for_cow(struct 
btrfs_trans_handle *trans,
 */
 
if (btrfs_block_can_be_shared(root, buf)) {
-   ret = btrfs_lookup_extent_info(trans, root, buf->start,
+   ret = btrfs_lookup_extent_info(trans, trans->fs_info,
+  buf->start,
   btrfs_header_level(buf), 1,
   , );
BUG_ON(ret);
diff --git a/ctree.h b/ctree.h
index 138cd22c6c6e..037e020401d2 100644
--- a/ctree.h
+++ b/ctree.h
@@ -2517,7 +2517,7 @@ struct extent_buffer *btrfs_alloc_free_block(struct 
btrfs_trans_handle *trans,
struct btrfs_disk_key *key, int level,
u64 hint, u64 empty_size);
 int btrfs_lookup_extent_info(struct btrfs_trans_handle *trans,
-struct btrfs_root *root, u64 bytenr,
+struct btrfs_fs_info *fs_info, u64 bytenr,
 u64 offset, int metadata, u64 *refs, u64 *flags);
 int btrfs_set_block_flags(struct btrfs_trans_handle *trans,
  struct btrfs_root *root,
diff --git a/extent-tree.c b/extent-tree.c
index d90eb8139d8b..2fd4e7a0d9eb 100644
--- a/extent-tree.c
+++ b/extent-tree.c
@@ -1434,7 +1434,7 @@ int btrfs_extent_post_op(struct btrfs_trans_handle *trans)
 }
 
 int btrfs_lookup_extent_info(struct btrfs_trans_handle *trans,
-struct btrfs_root *root, u64 bytenr,
+struct btrfs_fs_info *fs_info, u64 bytenr,
 u64 offset, int metadata, u64 *refs, u64 *flags)
 {
struct btrfs_path *path;
@@ -1446,9 +1446,8 @@ int btrfs_lookup_extent_info(struct btrfs_trans_handle 
*trans,
u64 num_refs;
u64 extent_flags;
 
-   if (metadata &&
-   !btrfs_fs_incompat(root->fs_info, SKINNY_METADATA)) {
-   offset = 

Re: [PATCH v2 3/4] btrfs: lzo: Add header length check to avoid slab out of bounds access

2018-05-28 Thread David Sterba
On Fri, May 25, 2018 at 09:31:30AM +0800, Qu Wenruo wrote:
> 
> 
> On 2018年05月25日 00:43, David Sterba wrote:
> > On Wed, May 23, 2018 at 07:38:28AM +0800, Qu Wenruo wrote:
>  --- a/fs/btrfs/lzo.c
>  +++ b/fs/btrfs/lzo.c
>  @@ -281,6 +281,7 @@ static int lzo_decompress_bio(struct list_head *ws, 
>  struct compressed_bio *cb)
>   unsigned long working_bytes;
>   size_t in_len;
>   size_t out_len;
>  +size_t max_segment_len = lzo1x_worst_compress(PAGE_SIZE);
> >>>
> >>> The value is a compile-time constant, it does not need to be stored in a
> >>> variable.
> >>
> >> Just to save some long lines, as it's used several times.
> > 
> > My concern is about the eventual stack consumption by the variable. I
> > haven't measured that as the stack-bloat-script is now somehow broken so
> > I'm basing this on my previous evaluations. The long term plan is to
> > remove unnecessary variables or at least help the compiler to optimize
> > them out, every few bytes help and we hope for the cumulative effect.
> 
> I think for such constant value, compiler should be clear enough to
> avoid the usage of stack.
> 
> And in fact it indeed avoids the usage of stack memory, even without
> const prefix.
> All operation related to @max_segment_len are using immediate number 4419.
> (Result is from gcc 8.1)
> --
> lzo_decompress_bio:
> 1:call__fentry__
>   pushq   %r15#
>   movq%rdi, %r15  # ws, ws
>   movq%rsi, %rdi  # cb, cb
>   pushq   %r14#
>   pushq   %r13#
>   pushq   %r12#
>   pushq   %rbp#
>   pushq   %rbx#
>   addq$-128, %rsp #,
> # fs/btrfs//lzo.c:305:u64 disk_start = cb->start;
>   movq24(%rdi), %rcx  # cb_63(D)->start, disk_start
> # fs/btrfs//lzo.c:283: {
>   movq%rsi, 32(%rsp)  # cb, %sfp
>   movq%gs:40, %rax# MEM[( long unsigned int 
> *)40B], tmp197
>   movq%rax, 120(%rsp) # tmp197, D.32453
>   xorl%eax, %eax  # tmp197
> # fs/btrfs//lzo.c:288:size_t srclen = cb->compressed_len;
>   movq40(%rsi), %rax  # cb_63(D)->compressed_len, srclen
> # fs/btrfs//lzo.c:304:struct page **pages_in = cb->compressed_pages;
>   movq8(%rsi), %rsi   # cb_63(D)->compressed_pages, pages_in
> # fs/btrfs//lzo.c:305:u64 disk_start = cb->start;
>   movq%rcx, 40(%rsp)  # disk_start, %sfp
> # ./include/linux/mm.h:1098:  return page_to_virt(page);
>   movabsq $-131941395333120, %rcx #, tmp154
> # fs/btrfs//lzo.c:289:unsigned long total_pages_in =
> DIV_ROUND_UP(srclen, PAGE_SIZE);
>   leaq4095(%rax), %rdx#, tmp147
> # ./include/linux/mm.h:1098:  return page_to_virt(page);
>   movq(%rsi), %rbp# *pages_in_66, tmp148
> # fs/btrfs//lzo.c:304:struct page **pages_in = cb->compressed_pages;
>   movq%rsi, 88(%rsp)  # pages_in, %sfp
> # fs/btrfs//lzo.c:317:if (tot_len > min_t(size_t,
> BTRFS_MAX_COMPRESSED, srclen) ||
>   movl$131072, %esi   #, tmp156
> # fs/btrfs//lzo.c:289:unsigned long total_pages_in =
> DIV_ROUND_UP(srclen, PAGE_SIZE);
>   shrq$12, %rdx   #, tmp147
>   movq%rdx, 80(%rsp)  # tmp147, %sfp
> # fs/btrfs//lzo.c:306:struct bio *orig_bio = cb->orig_bio;
>   movq72(%rdi), %rdx  # cb_63(D)->orig_bio, orig_bio
>   movq%rdx, 24(%rsp)  # orig_bio, %sfp
> # ./include/linux/mm.h:1098:  return page_to_virt(page);
>   movabsq $24189255811072, %rdx   #, tmp149
>   addq%rdx, %rbp  # tmp149, tmp148
>   sarq$6, %rbp#, tmp152
>   salq$12, %rbp   #, tmp153
>   addq%rcx, %rbp  # tmp154, data_in
> # fs/btrfs//lzo.c:317:if (tot_len > min_t(size_t,
> BTRFS_MAX_COMPRESSED, srclen) ||
>   cmpq$131072, %rax   #, srclen
> --
> 
> My question is, if this is a little overkilled for us human to do all
> the work which should be done by compiler.

It depends, if we know the value is a constant and want to give it a
symbolic name, then it's ok to add the variable. That's IMO not an
overkill but a good coding practice. If there's no const but there are
no writes to the variable, the compiler can assume it's a constant and
treat it like that. But this requires additional logic.

> > The replacement by local variable sort of does not need to be in this
> > patch as it's not used for the header length check itself.
> > 
>  @@ -294,6 +295,18 @@ static int lzo_decompress_bio(struct list_head *ws, 
>  struct compressed_bio *cb)
>   
>   data_in = kmap(pages_in[0]);
>   tot_len = read_compress_length(data_in);
>  +/*
>  + * Compressed data header check.
>  + *
>  + * The real compressed size can't exceed extent length, and all 
>  pages
>  + * should be used (a full pending page is not possible).
>  + * If this 

[PATCH 1/3] btrfs: trace: Add trace points for unused block groups

2018-05-28 Thread Qu Wenruo
This patch will add the following trace events:
1) btrfs_remove_block_group
   For btrfs_remove_block_group() function.
   Triggered when a block group is really removed.

2) btrfs_add_unused_block_group
   Triggered which block group is added to unused_bgs list.

3) btrfs_skip_unused_block_group
   Triggered which unused block group is not deleted.

These trace events is pretty handy to debug case related to block group
auto remove.

Signed-off-by: Qu Wenruo 
---
 fs/btrfs/extent-tree.c   |  4 
 fs/btrfs/scrub.c |  1 +
 include/trace/events/btrfs.h | 42 
 3 files changed, 47 insertions(+)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 51b5e2da708c..8241de81089a 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -6354,6 +6354,7 @@ static int update_block_group(struct btrfs_trans_handle 
*trans,
spin_lock(>unused_bgs_lock);
if (list_empty(>bg_list)) {
btrfs_get_block_group(cache);
+   trace_btrfs_add_unused_block_group(cache);
list_add_tail(>bg_list,
  >unused_bgs);
}
@@ -10204,6 +10205,7 @@ int btrfs_read_block_groups(struct btrfs_fs_info *info)
/* Should always be true but just in case. */
if (list_empty(>bg_list)) {
btrfs_get_block_group(cache);
+   trace_btrfs_add_unused_block_group(cache);
list_add_tail(>bg_list,
  >unused_bgs);
}
@@ -10391,6 +10393,7 @@ int btrfs_remove_block_group(struct btrfs_trans_handle 
*trans,
BUG_ON(!block_group);
BUG_ON(!block_group->ro);
 
+   trace_btrfs_remove_block_group(block_group);
/*
 * Free the reserved super bytes from this block group before
 * remove it.
@@ -10755,6 +10758,7 @@ void btrfs_delete_unused_bgs(struct btrfs_fs_info 
*fs_info)
 * the ro check in case balance is currently acting on
 * this block group.
 */
+   trace_btrfs_skip_unused_block_group(block_group);
spin_unlock(_group->lock);
up_write(_info->groups_sem);
goto next;
diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index 52b39a0924e9..a59005862010 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -3984,6 +3984,7 @@ int scrub_enumerate_chunks(struct scrub_ctx *sctx,
spin_lock(_info->unused_bgs_lock);
if (list_empty(>bg_list)) {
btrfs_get_block_group(cache);
+   trace_btrfs_add_unused_block_group(cache);
list_add_tail(>bg_list,
  _info->unused_bgs);
}
diff --git a/include/trace/events/btrfs.h b/include/trace/events/btrfs.h
index 965c650a5273..c226bc9a61db 100644
--- a/include/trace/events/btrfs.h
+++ b/include/trace/events/btrfs.h
@@ -1810,6 +1810,48 @@ TRACE_EVENT(btrfs_inode_mod_outstanding_extents,
show_root_type(__entry->root_objectid),
(unsigned long long)__entry->ino, __entry->mod)
 );
+
+DECLARE_EVENT_CLASS(btrfs__block_group,
+   TP_PROTO(const struct btrfs_block_group_cache *bg_cache),
+
+   TP_ARGS(bg_cache),
+
+   TP_STRUCT__entry_btrfs(
+   __field(u64,bytenr  )
+   __field(u64,flags   )
+   __field(u64,len )
+   __field(u64,used)
+   ),
+
+   TP_fast_assign_btrfs(bg_cache->fs_info,
+   __entry->bytenr = bg_cache->key.objectid,
+   __entry->len= bg_cache->key.offset,
+   __entry->flags  = bg_cache->flags;
+   __entry->used   = btrfs_block_group_used(_cache->item);
+   ),
+
+   TP_printk_btrfs("bg bytenr=%llu len=%llu used=%llu flags=%llu(%s)",
+   __entry->bytenr, __entry->len, __entry->used, __entry->flags,
+   __print_flags(__entry->flags, "|", BTRFS_GROUP_FLAGS))
+);
+
+DEFINE_EVENT(btrfs__block_group, btrfs_remove_block_group,
+   TP_PROTO(const struct btrfs_block_group_cache *bg_cache),
+
+   TP_ARGS(bg_cache)
+);
+
+DEFINE_EVENT(btrfs__block_group, btrfs_add_unused_block_group,
+   TP_PROTO(const struct btrfs_block_group_cache *bg_cache),
+
+   TP_ARGS(bg_cache)
+);
+
+DEFINE_EVENT(btrfs__block_group, btrfs_skip_unused_block_group,
+   TP_PROTO(const struct btrfs_block_group_cache *bg_cache),
+
+   TP_ARGS(bg_cache)
+);
 #endif /* _TRACE_BTRFS_H 

Re: csum failed root raveled during balance

2018-05-28 Thread Nikolay Borisov


On 27.05.2018 08:50, Andrei Borzenkov wrote:
> 23.05.2018 09:32, Nikolay Borisov пишет:
>>
>>
>> On 22.05.2018 23:05, ein wrote:
>>> Hello devs,
>>>
>>> I tested BTRFS in production for about a month:
>>>
>>> 21:08:17 up 34 days,  2:21,  3 users,  load average: 0.06, 0.02, 0.00
>>>
>>> Without power blackout, hardware failure, SSD's SMART is flawless etc.
>>> The tests ended with:
>>>
>>> root@node0:~# dmesg | grep BTRFS | grep warn
>>> 185:980:[2927472.393557] BTRFS warning (device dm-0): csum failed root
>>> -9 ino 312 off 608284672 csum 0x7d03a376 expected csum 0x3163a9b7 mirror 1
>>> 186:981:[2927472.394158] BTRFS warning (device dm-0): csum failed root
>>> -9 ino 312 off 608284672 csum 0x7da1b152 expected csum 0x3163a9b7 mirror 1
>>> 191:986:[2928224.169814] BTRFS warning (device dm-0): csum failed root
>>> -9 ino 314 off 608284672 csum 0x7d03a376 expected csum 0x3163a9b7 mirror 1
>>> 192:987:[2928224.171433] BTRFS warning (device dm-0): csum failed root
>>> -9 ino 314 off 608284672 csum 0x7da1b152 expected csum 0x3163a9b7 mirror 1
>>> 206:1001:[2928298.039516] BTRFS warning (device dm-0): csum failed root
>>> -9 ino 319 off 608284672 csum 0x7d03a376 expected csum 0x3163a9b7 mirror 1
>>> 207:1002:[2928298.043103] BTRFS warning (device dm-0): csum failed root
>>> -9 ino 319 off 608284672 csum 0x7d03a376 expected csum 0x3163a9b7 mirror 1
>>> 208:1004:[2932213.513424] BTRFS warning (device dm-0): csum failed root
>>> 5 ino 219962 off 4564959232 csum 0xc616afb4 expected csum 0x5425e489
>>> mirror 1
>>> 209:1005:[2932235.666368] BTRFS warning (device dm-0): csum failed root
>>> 5 ino 219962 off 16989835264 csum 0xd63ed5da expected csum 0x7429caa1
>>> mirror 1
>>> 210:1072:[2936767.229277] BTRFS warning (device dm-0): csum failed root
>>> 5 ino 219915 off 82318458880 csum 0x83614341 expected csum 0x0b8706f8
>>> mirror 1
>>> 211:1073:[2936767.276229] BTRFS warning (device dm-0): csum failed root
>>> 5 ino 219915 off 82318458880 csum 0x83614341 expected csum 0x0b8706f8
>>> mirror 1
>>>
>>> Above has been revealed during below command and quite high IO usage by
>>> few VMs (Linux on top Ext4 with firebird database, lots of random
>>> read/writes, two others with Windows 2016 and Windows Update in the
>>> background):
>>
>> I believe you are hitting the issue described here:
>>
>> https://www.mail-archive.com/linux-btrfs@vger.kernel.org/msg25656.html
>>
>> Essentially the way qemu operates on vm images atop btrfs is prone to
>> producing such errors. As a matter of fact, other filesystems also
>> suffer from this(i.e pages modified while being written, however due to
>> lack of CRC on the data they don't detect it). Can you confirm that
>> those inodes (312/314/319/219962/219915) belong to vm images files?
>>
>> IMHO the best course of action would be to disable checksumming for you
>> vm files.
>>
>>
>> For some background I suggest you read the following LWN articles:
>>
>> https://lwn.net/Articles/486311/
>> https://lwn.net/Articles/442355/
>>
> 
> Hmm ... according to these articles, "pages under writeback are marked
> as not being writable; any process attempting to write to such a page
> will block until the writeback completes". And it says this feature is
> available since 3.0 and btrfs has it. So how comes it still happens?
> Were stable patches removed since then?

If you are using buffered writes, then yes you won't have the problem.
However qemu by default bypasses host's page cache and instead uses DIO:

https://btrfs.wiki.kernel.org/index.php/Gotchas#Direct_IO_and_CRCs

> 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2nd try v4] btrfs: drop uuid_mutex in btrfs_free_extra_devids()

2018-05-28 Thread David Sterba
On Mon, May 28, 2018 at 10:43:29PM +0800, Anand Jain wrote:
> btrfs_free_extra_devids() is called only in the mount context which
> traverses through the fs_devices::devices and frees the orphan devices
> devices in the given %fs_devices if any. As the search for the orphan
> device is limited to fs_devices::devices so we don't need the global
> uuid_mutex.
> 
> There can't be any mount-point based ioctl threads in this context as
> the mount thread is not yet returned. But there can be the btrfs-control
> based scan ioctls thread which calls device_list_add().
> 
> Here in the mount thread the fs_devices::opened is incremented way before
> btrfs_free_extra_devids() is called and in the scan context the fs_devices
> which are already opened neither be freed or alloc-able at
> device_list_add().
> 
> But lets say you change the device-path and call the scan again, then scan
> would update the new device path and this operation could race against the
> btrfs_free_extra_devids() thread, which might be in the process of
> free-ing the same device. So synchronize it by using the
> device_list_mutex.
> 
> This scenario is a very corner case, and practically the scan and mount
> are anyway serialized by the usage so unless the race is instrumented its
> very difficult to achieve.
> 
> Signed-off-by: Anand Jain 
> ---
> (I didn't see this email in the mailing list, so trying again).
> v3->v4: As we traverse through the seed device, fs_device gets updated with
>   the child seed fs_devices, so make sure we use the same fs_devices
>   pointer for the mutex_unlock as used for the mutex_lock.

Well, now that I see the change, shouldn't we always hold the
device_list_mutex of the fs_devices that's being processed? Ie. each
time it's switched, the previous is unlocked and new one locked.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] btrfs: drop unused space_info parameter from create_space_info

2018-05-28 Thread Nikolay Borisov


On 28.05.2018 09:30, Lu Fengqi wrote:
> Since commit dc2d3005d27d ("btrfs: remove dead create_space_info
> calls"), there is only one caller btrfs_init_space_info. However, it
> doesn't need create_space_info to return space_info at all.
> 
> Signed-off-by: Lu Fengqi 

Nice,

Reviewed-by: Nikolay Borisov 
> ---
>  fs/btrfs/extent-tree.c | 13 +
>  1 file changed, 5 insertions(+), 8 deletions(-)
> 
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index 2c65a09535b6..31627de751d1 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -4007,8 +4007,7 @@ static const char *alloc_name(u64 flags)
>   };
>  }
>  
> -static int create_space_info(struct btrfs_fs_info *info, u64 flags,
> -  struct btrfs_space_info **new)
> +static int create_space_info(struct btrfs_fs_info *info, u64 flags)
>  {
>  
>   struct btrfs_space_info *space_info;
> @@ -4046,7 +4045,6 @@ static int create_space_info(struct btrfs_fs_info 
> *info, u64 flags,
>   return ret;
>   }
>  
> - *new = space_info;
>   list_add_rcu(_info->list, >space_info);
>   if (flags & BTRFS_BLOCK_GROUP_DATA)
>   info->data_sinfo = space_info;
> @@ -10810,7 +10808,6 @@ void btrfs_delete_unused_bgs(struct btrfs_fs_info 
> *fs_info)
>  
>  int btrfs_init_space_info(struct btrfs_fs_info *fs_info)
>  {
> - struct btrfs_space_info *space_info;
>   struct btrfs_super_block *disk_super;
>   u64 features;
>   u64 flags;
> @@ -10826,21 +10823,21 @@ int btrfs_init_space_info(struct btrfs_fs_info 
> *fs_info)
>   mixed = 1;
>  
>   flags = BTRFS_BLOCK_GROUP_SYSTEM;
> - ret = create_space_info(fs_info, flags, _info);
> + ret = create_space_info(fs_info, flags);
>   if (ret)
>   goto out;
>  
>   if (mixed) {
>   flags = BTRFS_BLOCK_GROUP_METADATA | BTRFS_BLOCK_GROUP_DATA;
> - ret = create_space_info(fs_info, flags, _info);
> + ret = create_space_info(fs_info, flags);
>   } else {
>   flags = BTRFS_BLOCK_GROUP_METADATA;
> - ret = create_space_info(fs_info, flags, _info);
> + ret = create_space_info(fs_info, flags);
>   if (ret)
>   goto out;
>  
>   flags = BTRFS_BLOCK_GROUP_DATA;
> - ret = create_space_info(fs_info, flags, _info);
> + ret = create_space_info(fs_info, flags);
>   }
>  out:
>   return ret;
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/2] btrfs: Refactor running of delayed inode items during transaction commit

2018-05-28 Thread Nikolay Borisov


On 28.05.2018 11:35, Nikolay Borisov wrote:
> 
> 
> On 28.05.2018 11:21, Misono Tomohiro wrote:
>> Hello,
>>
>> I found current misc-next sometimes fails btrfs/152 when the number
>> of cpu is >= 4 in my vm and git bisect points this commit.
>> (btrfs/152 performs parallel send/receive.)
>>
>> The failure is caused by _check_dmesg and sometimes also leads to 
>> inconsistent fs.
>>
>> dmesg looks like:
>> [6.649213] WARNING: CPU: 0 PID: 2838 at fs/btrfs/transaction.c:303 
>> record_root_in_trans+0x38/0xd0
>> [6.650807] CPU: 0 PID: 2838 Comm: btrfs Tainted: GW 
>> 4.17.0-rc6+ #113
>> [6.651998] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
>> 1.10.2-2.fc27 04/01/2014
>> [6.653325] RIP: 0010:record_root_in_trans+0x38/0xd0
>> [6.654090] RSP: 0018:90b781e23a68 EFLAGS: 00010206
>> [6.654882] RAX: 8f346a48b720 RBX: 8f3479ce86c0 RCX: 
>> 
>> [6.655965] RDX:  RSI: 8f347c8ec000 RDI: 
>> 8f346a03a958
>> [6.657059] RBP: 90b781e23b80 R08: 00025870 R09: 
>> a8c112bb
>> [6.658162] R10: 90b781e23ba0 R11: 0002 R12: 
>> 8f346a03a958
>> [6.659290] R13: 8f347c8d9000 R14: 8f347ba2a0b8 R15: 
>> 
>> [6.660374] FS:  7f3b62ebc8c0() GS:8f347fc0() 
>> knlGS:
>> [6.661349] CS:  0010 DS:  ES:  CR0: 80050033
>> [6.661928] CR2: 7f0a641995e0 CR3: 6af5c000 CR4: 
>> 06f0
>> [6.662651] Call Trace:
>> [6.662909]  create_pending_snapshot+0x1ab/0xd00
>> [6.663391]  ? btrfs_delayed_inode_release_metadata+0xe0/0xf0
>> [6.663972]  ? __btrfs_update_delayed_inode+0x1aa/0x210
>> [6.664526]  ? __btrfs_release_delayed_node.part.18+0x8f/0x1d0
>> [6.665116]  ? create_pending_snapshots+0x81/0xa0
>> [6.665597]  create_pending_snapshots+0x81/0xa0
>> [6.666068]  btrfs_commit_transaction+0x279/0x860
>> [6.666553]  ? start_transaction+0x98/0x3c0
>> [6.666989]  btrfs_mksubvol+0x589/0x5a0
>> [6.667390]  ? btrfs_opendir+0x39/0x70
>> [6.667778]  btrfs_ioctl_snap_create_transid+0x16a/0x1a0
>> [6.668327]  btrfs_ioctl_snap_create_v2+0x121/0x180
>> [6.668827]  btrfs_ioctl+0x56d/0x25a0
>> [6.669205]  ? do_filp_open+0xaa/0x110
>> [6.669591]  ? do_vfs_ioctl+0x9f/0x610
>> [6.669980]  ? btrfs_ioctl_get_supported_features+0x20/0x20
>> [6.670550]  do_vfs_ioctl+0x9f/0x610
>> [6.670922]  ksys_ioctl+0x6b/0x80
>> [6.671261]  __x64_sys_ioctl+0x11/0x20
>> [6.671650]  do_syscall_64+0x49/0x100
>> [6.672040]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
>> [6.672545] RIP: 0033:0x7f3b61cbddd7
>> [6.672899] RSP: 002b:7ffcdbe7b318 EFLAGS: 0246 ORIG_RAX: 
>> 0010
>> [6.673666] RAX: ffda RBX: 7ffcdbe90610 RCX: 
>> 7f3b61cbddd7
>> [6.674430] RDX: 7ffcdbe7b360 RSI: 50009417 RDI: 
>> 0003
>> [6.675154] RBP: 5634403ec020 R08:  R09: 
>> 1010
>> [6.675878] R10:  R11: 0246 R12: 
>> 5634403ed670
>> [6.676604] R13: 7ffcdbe7c470 R14: 0053 R15: 
>> 7ffcdbe7b360
>> [6.677330] Code: f0 01 00 00 a8 02 74 27 48 8b 07 48 39 86 40 03 00 00 
>> 73 1b 49 39 75 28 0f 84 91 00 00 00 85 d2 75 17 48 8b 06 48 39 46 08 74 0e 
>> <0f> 0b eb 0a 85 d2 74 65 49 3b 75 28 74 76 41 89 d4 48 89 f3 48
>> [6.679318] ---[ end trace b2378f91e69026c3 ]---
>>
>> transaction.c:303:
>>   WARN_ON(!force && root->commit_root != root->node);
>>
>> And 152.out.full looks like:
>>   *** fsck.btrfs output ***
>>   root item for root 261, current bytenr 53870592, current gen 149, current
>>   level 0, new bytenr 54067200, new gen 149, new level 0
>>   Found 1 roots with an outdated root item.
>>   Please run a filesystem check with the option --repair to fix them.
>>
>> =
>> (My observation)
>>
>> Call chain is:
>>   btrfs_commit_transaction()
>>create_pending_snapshots()
>>  create_pending_snapshot()
>>run_delayed_item() <- removed
>>qgroup_account_snapshot()
>>run_delayed_item() <- added
>>
>> This commit removes run_delayed_item() in create_pending_snapshot() and
>> instead calls it after create_pending_snapshots().
>>
>> However, as qgroup_account_snapshot() updates commit_root, if there are
>> more than one pending snapshots under the same subvolume, it may fail
>> to update root item correctly.
>>
>> So, I think we cannot remove run_delayed_item() here.
> 
> So btrfs_run_delayed_items is run once before going into
> create_pending_snapshots. This means that when we go into
> create_pending_snapshots the filesystem should be consistent from the
> POV of delayed items (as per patch's changelog).
> 
> I somehow fail to reconcile how the removal of run delayed items (which
> really pertain to inode state) have anything to do with an outdated
> root?  IMHO you are 

Re: [PATCH 0/4] btrfs check --check-data-csum enhancement for

2018-05-28 Thread David Sterba
On Mon, May 28, 2018 at 08:20:35PM +0800, Qu Wenruo wrote:
> Or do I need to rebase the patchset and resend?

The patches are in devel for a long time. If you have any updates,
please send as incremental patches, thanks.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 4/6] Btrfs: remove unused check of skip_locking

2018-05-28 Thread Nikolay Borisov


On 28.05.2018 17:21, David Sterba wrote:
> On Fri, May 18, 2018 at 01:27:50PM +0800, Qu Wenruo wrote:
>>
>>
>> On 2018年05月18日 11:00, Liu Bo wrote:
>>> The check is superfluous since all of callers who set search_for_commit
>>> also have skip_locking set.
>>>
>>> Signed-off-by: Liu Bo 
>>
>> Reviewed-by: Qu Wenruo 
>>
>> Although more obvious comment about search_commit_root and skip_locking
>> in ctree.h will be much better.
> 
> Not only a comment but also an ASSERT, this is too easy to get wrong.
> That all currenct callers do search_commit_root + skip_locking will not
> catch any future callers. And there was an example reported.

How about my initial suggestion of setting skip)_locking in
btrfs_search_slot if we see commit_root set? Let's try and make the life
of callers as easier as possible.
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 5/6] Btrfs: grab write lock directly if write_lock_level is the max level

2018-05-28 Thread David Sterba
On Fri, May 18, 2018 at 11:00:23AM +0800, Liu Bo wrote:
> Typically, when acquiring root node's lock, btrfs tries its best to get
> read lock and trade for write lock if @write_lock_level implies to do so.
> 
> In case of (cow && (p->keep_locks || p->lowest_level)), write_lock_level
> is set to BTRFS_MAX_LEVEL, which means we need to acquire root node's
> write lock directly.
> 
> In this particular case, the dance of acquiring read lock and then trading
> for write lock can be saved.
> 
> Signed-off-by: Liu Bo 

Reviewed-by: David Sterba 

> ---
>  fs/btrfs/ctree.c | 29 -
>  1 file changed, 16 insertions(+), 13 deletions(-)
> 
> diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
> index 8d3b09038f37..e619f7e01794 100644
> --- a/fs/btrfs/ctree.c
> +++ b/fs/btrfs/ctree.c
> @@ -2633,20 +2633,23 @@ static struct extent_buffer 
> *btrfs_search_slot_get_root(struct btrfs_root *root,
>   goto out;
>   }
>  
> - /*
> -  * we don't know the level of the root node until we actually
> -  * have it read locked
> -  */
> - b = btrfs_read_lock_root_node(root);
> - level = btrfs_header_level(b);
> - if (level > write_lock_level)
> - goto out;

I've added a comment why the check below is done.

> + if (write_lock_level < BTRFS_MAX_LEVEL) {
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 4/6] Btrfs: remove unused check of skip_locking

2018-05-28 Thread David Sterba
On Fri, May 18, 2018 at 01:27:50PM +0800, Qu Wenruo wrote:
> 
> 
> On 2018年05月18日 11:00, Liu Bo wrote:
> > The check is superfluous since all of callers who set search_for_commit
> > also have skip_locking set.
> > 
> > Signed-off-by: Liu Bo 
> 
> Reviewed-by: Qu Wenruo 
> 
> Although more obvious comment about search_commit_root and skip_locking
> in ctree.h will be much better.

Not only a comment but also an ASSERT, this is too easy to get wrong.
That all currenct callers do search_commit_root + skip_locking will not
catch any future callers. And there was an example reported.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/2] btrfs: Refactor running of delayed inode items during transaction commit

2018-05-28 Thread Qu Wenruo


On 2018年05月28日 16:51, Nikolay Borisov wrote:
> 
> 
> On 28.05.2018 11:35, Nikolay Borisov wrote:
>>
>>
>> On 28.05.2018 11:21, Misono Tomohiro wrote:
>>> Hello,
>>>
>>> I found current misc-next sometimes fails btrfs/152 when the number
>>> of cpu is >= 4 in my vm and git bisect points this commit.
>>> (btrfs/152 performs parallel send/receive.)
>>>
>>> The failure is caused by _check_dmesg and sometimes also leads to 
>>> inconsistent fs.
>>>
>>> dmesg looks like:
>>> [6.649213] WARNING: CPU: 0 PID: 2838 at fs/btrfs/transaction.c:303 
>>> record_root_in_trans+0x38/0xd0
>>> [6.650807] CPU: 0 PID: 2838 Comm: btrfs Tainted: GW 
>>> 4.17.0-rc6+ #113
>>> [6.651998] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
>>> 1.10.2-2.fc27 04/01/2014
>>> [6.653325] RIP: 0010:record_root_in_trans+0x38/0xd0
>>> [6.654090] RSP: 0018:90b781e23a68 EFLAGS: 00010206
>>> [6.654882] RAX: 8f346a48b720 RBX: 8f3479ce86c0 RCX: 
>>> 
>>> [6.655965] RDX:  RSI: 8f347c8ec000 RDI: 
>>> 8f346a03a958
>>> [6.657059] RBP: 90b781e23b80 R08: 00025870 R09: 
>>> a8c112bb
>>> [6.658162] R10: 90b781e23ba0 R11: 0002 R12: 
>>> 8f346a03a958
>>> [6.659290] R13: 8f347c8d9000 R14: 8f347ba2a0b8 R15: 
>>> 
>>> [6.660374] FS:  7f3b62ebc8c0() GS:8f347fc0() 
>>> knlGS:
>>> [6.661349] CS:  0010 DS:  ES:  CR0: 80050033
>>> [6.661928] CR2: 7f0a641995e0 CR3: 6af5c000 CR4: 
>>> 06f0
>>> [6.662651] Call Trace:
>>> [6.662909]  create_pending_snapshot+0x1ab/0xd00
>>> [6.663391]  ? btrfs_delayed_inode_release_metadata+0xe0/0xf0
>>> [6.663972]  ? __btrfs_update_delayed_inode+0x1aa/0x210
>>> [6.664526]  ? __btrfs_release_delayed_node.part.18+0x8f/0x1d0
>>> [6.665116]  ? create_pending_snapshots+0x81/0xa0
>>> [6.665597]  create_pending_snapshots+0x81/0xa0
>>> [6.666068]  btrfs_commit_transaction+0x279/0x860
>>> [6.666553]  ? start_transaction+0x98/0x3c0
>>> [6.666989]  btrfs_mksubvol+0x589/0x5a0
>>> [6.667390]  ? btrfs_opendir+0x39/0x70
>>> [6.667778]  btrfs_ioctl_snap_create_transid+0x16a/0x1a0
>>> [6.668327]  btrfs_ioctl_snap_create_v2+0x121/0x180
>>> [6.668827]  btrfs_ioctl+0x56d/0x25a0
>>> [6.669205]  ? do_filp_open+0xaa/0x110
>>> [6.669591]  ? do_vfs_ioctl+0x9f/0x610
>>> [6.669980]  ? btrfs_ioctl_get_supported_features+0x20/0x20
>>> [6.670550]  do_vfs_ioctl+0x9f/0x610
>>> [6.670922]  ksys_ioctl+0x6b/0x80
>>> [6.671261]  __x64_sys_ioctl+0x11/0x20
>>> [6.671650]  do_syscall_64+0x49/0x100
>>> [6.672040]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
>>> [6.672545] RIP: 0033:0x7f3b61cbddd7
>>> [6.672899] RSP: 002b:7ffcdbe7b318 EFLAGS: 0246 ORIG_RAX: 
>>> 0010
>>> [6.673666] RAX: ffda RBX: 7ffcdbe90610 RCX: 
>>> 7f3b61cbddd7
>>> [6.674430] RDX: 7ffcdbe7b360 RSI: 50009417 RDI: 
>>> 0003
>>> [6.675154] RBP: 5634403ec020 R08:  R09: 
>>> 1010
>>> [6.675878] R10:  R11: 0246 R12: 
>>> 5634403ed670
>>> [6.676604] R13: 7ffcdbe7c470 R14: 0053 R15: 
>>> 7ffcdbe7b360
>>> [6.677330] Code: f0 01 00 00 a8 02 74 27 48 8b 07 48 39 86 40 03 00 00 
>>> 73 1b 49 39 75 28 0f 84 91 00 00 00 85 d2 75 17 48 8b 06 48 39 46 08 74 0e 
>>> <0f> 0b eb 0a 85 d2 74 65 49 3b 75 28 74 76 41 89 d4 48 89 f3 48
>>> [6.679318] ---[ end trace b2378f91e69026c3 ]---
>>>
>>> transaction.c:303:
>>>   WARN_ON(!force && root->commit_root != root->node);
>>>
>>> And 152.out.full looks like:
>>>   *** fsck.btrfs output ***
>>>   root item for root 261, current bytenr 53870592, current gen 149, current
>>>   level 0, new bytenr 54067200, new gen 149, new level 0
>>>   Found 1 roots with an outdated root item.
>>>   Please run a filesystem check with the option --repair to fix them.
>>>
>>> =
>>> (My observation)
>>>
>>> Call chain is:
>>>   btrfs_commit_transaction()
>>>create_pending_snapshots()
>>>  create_pending_snapshot()
>>>run_delayed_item() <- removed
>>>qgroup_account_snapshot()
>>>run_delayed_item() <- added
>>>
>>> This commit removes run_delayed_item() in create_pending_snapshot() and
>>> instead calls it after create_pending_snapshots().
>>>
>>> However, as qgroup_account_snapshot() updates commit_root, if there are
>>> more than one pending snapshots under the same subvolume, it may fail
>>> to update root item correctly.
>>>
>>> So, I think we cannot remove run_delayed_item() here.
>>
>> So btrfs_run_delayed_items is run once before going into
>> create_pending_snapshots. This means that when we go into
>> create_pending_snapshots the filesystem should be consistent from the
>> POV of delayed items (as per patch's changelog).
>>
>> I 

Re: [PATCH v2 0/6] btrfs_search_slot cleanups

2018-05-28 Thread David Sterba
On Fri, May 18, 2018 at 11:00:18AM +0800, Liu Bo wrote:
> Here are a collection of patches I did for btrfs_search_slot().
> 
> v2: more explicit commit log for each patch.
> 
> Liu Bo (6):
>   Btrfs: remove superfluous free_extent_buffer
>   Btrfs: use more straightforward extent_buffer_uptodate
>   Btrfs: move get root of btrfs_search_slot to a helper
>   Btrfs: remove unused check of skip_locking
>   Btrfs: grab write lock directly if write_lock_level is the max level
>   Btrfs: remove always true check in unlock_up

1,2,4,5,6 merged. Please update patch 3 and add the ASSERT. Thanks.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 2nd try v4] btrfs: drop uuid_mutex in btrfs_free_extra_devids()

2018-05-28 Thread Anand Jain
btrfs_free_extra_devids() is called only in the mount context which
traverses through the fs_devices::devices and frees the orphan devices
devices in the given %fs_devices if any. As the search for the orphan
device is limited to fs_devices::devices so we don't need the global
uuid_mutex.

There can't be any mount-point based ioctl threads in this context as
the mount thread is not yet returned. But there can be the btrfs-control
based scan ioctls thread which calls device_list_add().

Here in the mount thread the fs_devices::opened is incremented way before
btrfs_free_extra_devids() is called and in the scan context the fs_devices
which are already opened neither be freed or alloc-able at
device_list_add().

But lets say you change the device-path and call the scan again, then scan
would update the new device path and this operation could race against the
btrfs_free_extra_devids() thread, which might be in the process of
free-ing the same device. So synchronize it by using the
device_list_mutex.

This scenario is a very corner case, and practically the scan and mount
are anyway serialized by the usage so unless the race is instrumented its
very difficult to achieve.

Signed-off-by: Anand Jain 
---
(I didn't see this email in the mailing list, so trying again).
v3->v4: As we traverse through the seed device, fs_device gets updated with
the child seed fs_devices, so make sure we use the same fs_devices
pointer for the mutex_unlock as used for the mutex_lock.
v2->v3: Update change log.
(Currently device_list_add() is very lean on its device_list_mutex 
usage,
a cleanup and fix is wip. Given the practicality of the above race
condition this patch is good to merge).
v1->v2: replace uuid_mutex with device_list_mutex instead of delete.
change log updated.
 fs/btrfs/volumes.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index b6757b53c297..f03719221fca 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -924,8 +924,9 @@ void btrfs_free_extra_devids(struct btrfs_fs_devices 
*fs_devices, int step)
 {
struct btrfs_device *device, *next;
struct btrfs_device *latest_dev = NULL;
+   struct btrfs_fs_devices *parent_fs_devices = fs_devices;
 
-   mutex_lock(_mutex);
+   mutex_lock(_fs_devices->device_list_mutex);
 again:
/* This is the initialized path, it is safe to release the devices. */
list_for_each_entry_safe(device, next, _devices->devices, dev_list) {
@@ -979,8 +980,7 @@ void btrfs_free_extra_devids(struct btrfs_fs_devices 
*fs_devices, int step)
}
 
fs_devices->latest_bdev = latest_dev->bdev;
-
-   mutex_unlock(_mutex);
+   mutex_unlock(_fs_devices->device_list_mutex);
 }
 
 static void free_device_rcu(struct rcu_head *head)
-- 
2.7.0

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Questions from aspiring btrfs mini-debugger/mini-developer

2018-05-28 Thread Qu Wenruo


On 2018年05月28日 17:21, james harvey wrote:
> I'm tracking down some more bugs.

Yeah, more bugs for us to fix.

> 
> Useful information for you to track down these bugs isn't in this
> email.  This is more about an aspiring btrfs
> mini-debugger/mini-developer asking for some guidance, to be able to
> get the more useful information.
> 
> I ran across some mirrored files that are nodatacow/nodatasum, with
> differing mirrored extents.  UNLIKE BEFORE, these are uncompressed.
> Mostly /var/cache/samba and /var/lib/mysql files.  This also happened
> during my recent btrfs replace.  Luckily for me, I had the unmodified
> original images, so when I re-did this with a btrfs device add /
> remove, the new ones are fine.

That's the best practice.

> 
> I'm almost positive I have extents with checksums, where its inode is
> marked nodatacow.  This would be a bug, right?  (Confirming before I
> look into this much more.)

Yes, we're aware of this bug, and working on it.

> 
> 
> 
> I've spent a few days familiarizing myself with btrfs (kernel and
> -progs) internals and source.
> 
> I've made some additions to btrfs-progs, that I'll submit once
> finished.  One of them compares mirrored extents, looking for
> differences.

This is a good extension to the original --check-datasum option.

Although I have submitted some patches to enhance --check-datasum to do
better check ("[PATCH 0/4] btrfs check --check-data-csum enhancement
for"), I didn't consider the case of nodatasum case.
It makes sense, and hope the existing patchset could give some clue for you.

>  If I have it check all extents, it brings up every
> problematic file I've found, and the few I mentioned above that I
> wasn't aware of because they were uncompressed.  I'll give more on
> this once I have the details.  I think this must mean scrub doesn't
> verify extents with checksums that are marked nodatacow,

I'm not pretty sure about this.
AFAIK scrub doesn't care about the owner (until it needs to output the
file name), it just iterate through extent tree and compare data with
its data sum.

But I can be totally wrong.

> since it's
> not expecting them to have checksums.
> 
> 
> 
> I have a few questions that would greatly help having answered.
> 
> Am I right that an inode has a single set of btrfs flags (things like
> nodatacow, nodatasum, etc) accessable through btrfs_inode_flags()?

Yes. And you can also check the flags by checking the following pattern
of btrfs-debug-tree (btrfs inspect dump-tree).
--
item 9 key (256 INODE_ITEM 0) itemoff 13702 itemsize 160
generation 10 transid 10 size 262144 nbytes 1310720
block group 0 mode 100600 links 1 uid 0 gid 0 rdev 0
sequence 5 flags
0x1b(NODATASUM|NODATACOW|NOCOMPRESS|PREALLOC)
--

>  I
> want to make sure extents within the same file can't have any varying
> flags, and that a file and its extents across multiple snapshots all
> share the same.
> 
> What about deduplicated extents?  If there's a file whose inode says
> it has checksums, and another file whose inode has nodatasum, and
> there's duplicate blocks, are they deduplicated, or does deduplication
> see this and skip it because of the mismatch?

For deduplicated extents (or called reflink), if an inode has NODATACOW
kernel won't allow reflink from/to it.

And further more, NODATACOW flag can't be set/clear for a non-empty
inode, it should not happen.

> 
> I have files that have some extents compressed, and others without.
> Is this allowed?

Not really defined. I need to double check about how kernel handles
NOCOMPRESS flag.
But it's always good to have define the expected behavior.

>  This might just be on nodatacow files defragmented
> and compressed, so maybe that process left some extents uncompressed.
> Wondering if this is allowed before I dig more to see if it's on files
> that haven't been through that process.
> 
> 
> 
> extent_offset isn't making sense to me.  I have a file whose filefrag 
> includes:
> 
>   28:  896.. 919: 596954..596977: 24: 596978:
> encoded,shared
>   29:  920..1023: 580304..580407:104: 596978: shared
>   30: 1024..1055: 596961..596992: 32: 580408:
> encoded,shared

When compression is involved, old filefrag really doesn't make much sense.

> 
> #29, through btrfs-tree-debug, is:
> 
> item 49 key (71469 EXTENT_DATA 3768320) itemoff 13232 itemsize 53
> generation 218 type 1 (regular)
> extent data disk byte 2373160960 nr 8384512
> extent data offset 3764224 nr 425984 ram 8384512
> extent compression 0 (none)
> 
> Its extents without a data offset (i.e. filefrag #30) look like:
> 
> item 50 key (71469 EXTENT_DATA 4194304) itemoff 13179 itemsize 53
> generation 310 type 1 (regular)
> extent data disk byte 2445152256 nr 49152
> extent data offset 0 nr 

Re: [PATCH 0/4] btrfs check --check-data-csum enhancement for

2018-05-28 Thread Qu Wenruo
Gentle ping.

Or do I need to rebase the patchset and resend?

Thanks,
Qu

On 2018年02月27日 17:12, Qu Wenruo wrote:
> Log-writes tool from Josef exposed btrfs data corruption.
> 
> Although still digging, at least enhance --check-data-csum option so
> that log-replay --check can stop exactly at the point where btrfs data
> get corrupted, instead of manually checking the output.
> 
> Qu Wenruo (4):
>   btrfs-progs: check: Check data csum for all copies
>   btrfs-progs: check: Fix data csum check return value
>   btrfs-progs: check: Continue check even csum error is found
>   btrfs-progs: check: Distingusih csum checking output for
> --check-data-csum
> 
>  check/main.c | 93 
> 
>  1 file changed, 56 insertions(+), 37 deletions(-)
> 



signature.asc
Description: OpenPGP digital signature


[PATCH v2 2/4] fstests: btrfs: nested seed device test

2018-05-28 Thread Anand Jain
Test case to verify that a sprout device can be a seed device

Signed-off-by: Anand Jain 
---
 tests/btrfs/162 | 103 
 tests/btrfs/162.out |   9 +
 tests/btrfs/group   |   1 +
 3 files changed, 113 insertions(+)
 create mode 100755 tests/btrfs/162
 create mode 100644 tests/btrfs/162.out

diff --git a/tests/btrfs/162 b/tests/btrfs/162
new file mode 100755
index ..fe63fb3ad704
--- /dev/null
+++ b/tests/btrfs/162
@@ -0,0 +1,103 @@
+#! /bin/bash
+# FS QA Test 162 nested seed device test
+#
+# Steps:
+# Create a seed device
+# Create a sprout device
+# Make the sprout device a seed device and create a sprout device again
+#
+#---
+# Copyright (c) 2018 Oracle.  All Rights Reserved.
+# Author: Anand Jain 
+#
+# This program is free software; you can redistribute it and/or
+# modify it under the terms of the GNU General Public License as
+# published by the Free Software Foundation.
+#
+# This program is distributed in the hope that it would be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program; if not, write the Free Software Foundation,
+# Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
+#---
+#
+
+seq=`basename $0`
+seqres=$RESULT_DIR/$seq
+echo "QA output created by $seq"
+
+here=`pwd`
+tmp=/tmp/$$
+status=1   # failure is the default!
+trap "_cleanup; exit \$status" 0 1 2 3 15
+
+_cleanup()
+{
+   cd /
+   rm -f $tmp.*
+}
+
+# get standard environment, filters and checks
+. ./common/rc
+. ./common/filter
+. ./common/filter.btrfs
+
+# remove previous $seqres.full before test
+rm -f $seqres.full
+
+# real QA test starts here
+
+# Modify as appropriate.
+_supported_fs btrfs
+_supported_os Linux
+_require_scratch_dev_pool 3
+
+_scratch_dev_pool_get 3
+
+DEV_SEED=$(echo $SCRATCH_DEV_POOL | awk '{print $1}')
+DEV_SPROUT_SEED=$(echo $SCRATCH_DEV_POOL | awk '{print $2}')
+DEV_SPROUT=$(echo $SCRATCH_DEV_POOL | awk '{print $3}')
+
+create_seed()
+{
+   _mkfs_dev $DEV_SEED
+   run_check _mount $DEV_SEED $SCRATCH_MNT
+   $XFS_IO_PROG -f -d -c "pwrite -S 0xab 0 256K" $SCRATCH_MNT/foobar >\
+   /dev/null
+   echo -- gloden --
+   od -x $SCRATCH_MNT/foobar
+   _run_btrfs_util_prog filesystem show -m $SCRATCH_MNT
+   _scratch_unmount
+   $BTRFS_TUNE_PROG -S 1 $DEV_SEED
+}
+
+create_sprout_seed()
+{
+   run_check _mount $DEV_SEED $SCRATCH_MNT
+   _run_btrfs_util_prog device add -f $DEV_SPROUT_SEED $SCRATCH_MNT
+   _scratch_unmount
+   $BTRFS_TUNE_PROG -S 1 $DEV_SPROUT_SEED
+}
+
+create_next_sprout()
+{
+   run_check _mount -o device=$DEV_SEED $DEV_SPROUT_SEED $SCRATCH_MNT
+   _run_btrfs_util_prog device add -f $DEV_SPROUT $SCRATCH_MNT
+   _scratch_unmount
+   run_check _mount -o device=$DEV_SEED,device=$DEV_SPROUT_SEED \
+   $DEV_SPROUT $SCRATCH_MNT
+   echo -- sprout --
+   od -x $SCRATCH_MNT/foobar
+}
+
+create_seed
+create_sprout_seed
+create_next_sprout
+
+_scratch_dev_pool_put
+
+status=0
+exit
diff --git a/tests/btrfs/162.out b/tests/btrfs/162.out
new file mode 100644
index ..ada7cc5507bd
--- /dev/null
+++ b/tests/btrfs/162.out
@@ -0,0 +1,9 @@
+QA output created by 162
+-- gloden --
+000 abab abab abab abab abab abab abab abab
+*
+100
+-- sprout --
+000 abab abab abab abab abab abab abab abab
+*
+100
diff --git a/tests/btrfs/group b/tests/btrfs/group
index fe83631f0f33..cdc8ad32036e 100644
--- a/tests/btrfs/group
+++ b/tests/btrfs/group
@@ -164,3 +164,4 @@
 159 auto quick
 160 auto quick
 161 auto quick
+162 auto quick
-- 
2.7.0

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2 4/4] fstests: btrfs: seed device delete test

2018-05-28 Thread Anand Jain
Test case to verify that a seed device can be deleted

Signed-off-by: Anand Jain 
---
 tests/btrfs/164 | 114 
 tests/btrfs/164.out |   9 +
 tests/btrfs/group   |   1 +
 3 files changed, 124 insertions(+)
 create mode 100755 tests/btrfs/164
 create mode 100644 tests/btrfs/164.out

diff --git a/tests/btrfs/164 b/tests/btrfs/164
new file mode 100755
index ..31f7f1f72d5e
--- /dev/null
+++ b/tests/btrfs/164
@@ -0,0 +1,114 @@
+#! /bin/bash
+# FS QA Test 164 seed device delete test
+#
+# Test case to verify that a seed device can be deleted
+#
+# Steps:
+# Create a seed device
+# Create a sprout device
+# Remount RW
+# Run device delete on the seed device
+#
+#---
+# Copyright (c) 2018 Oracle.  All Rights Reserved.
+# Author: Anand Jain 
+#
+# This program is free software; you can redistribute it and/or
+# modify it under the terms of the GNU General Public License as
+# published by the Free Software Foundation.
+#
+# This program is distributed in the hope that it would be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program; if not, write the Free Software Foundation,
+# Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
+#---
+#
+
+seq=`basename $0`
+seqres=$RESULT_DIR/$seq
+echo "QA output created by $seq"
+
+here=`pwd`
+tmp=/tmp/$$
+status=1   # failure is the default!
+trap "_cleanup; exit \$status" 0 1 2 3 15
+
+_cleanup()
+{
+   cd /
+   rm -f $tmp.*
+}
+
+# get standard environment, filters and checks
+. ./common/rc
+. ./common/module
+
+# remove previous $seqres.full before test
+rm -f $seqres.full
+
+# real QA test starts here
+
+# Modify as appropriate.
+_supported_fs btrfs
+_supported_os Linux
+_require_loadable_fs_module "btrfs"
+_require_scratch_dev_pool 2
+_scratch_dev_pool_get 2
+
+DEV_SEED=$(echo $SCRATCH_DEV_POOL | awk '{print $1}')
+DEV_SPROUT=$(echo $SCRATCH_DEV_POOL | awk '{print $2}')
+
+create_seed()
+{
+   _mkfs_dev $DEV_SEED
+   run_check _mount $DEV_SEED $SCRATCH_MNT
+   $XFS_IO_PROG -f -d -c "pwrite -S 0xab 0 256K" $SCRATCH_MNT/foobar >\
+   /dev/null
+   echo -- gloden --
+   od -x $SCRATCH_MNT/foobar
+   _run_btrfs_util_prog filesystem show -m $SCRATCH_MNT
+   _scratch_unmount
+   $BTRFS_TUNE_PROG -S 1 $DEV_SEED
+   run_check _mount $DEV_SEED $SCRATCH_MNT
+}
+
+add_sprout()
+{
+   _run_btrfs_util_prog device add -f $DEV_SPROUT $SCRATCH_MNT
+   run_check mount -o rw,remount $DEV_SEED $SCRATCH_MNT
+   _run_btrfs_util_prog filesystem show -m $SCRATCH_MNT
+}
+
+delete_seed()
+{
+   _run_btrfs_util_prog device delete $DEV_SEED $SCRATCH_MNT
+   _scratch_unmount
+   _reload_fs_module "btrfs"
+   run_check _mount $DEV_SPROUT $SCRATCH_MNT
+   _run_btrfs_util_prog filesystem show -m $SCRATCH_MNT
+   echo -- sprout --
+   od -x $SCRATCH_MNT/foobar
+   _scratch_unmount
+}
+
+seed_is_mountable()
+{
+   run_check _mount $DEV_SEED $SCRATCH_MNT
+   _run_btrfs_util_prog filesystem show -m $SCRATCH_MNT
+   _scratch_unmount
+}
+
+create_seed
+add_sprout
+delete_seed
+
+seed_is_mountable
+
+_scratch_dev_pool_put
+
+status=0
+exit
diff --git a/tests/btrfs/164.out b/tests/btrfs/164.out
new file mode 100644
index ..c073b7a88ae1
--- /dev/null
+++ b/tests/btrfs/164.out
@@ -0,0 +1,9 @@
+QA output created by 164
+-- gloden --
+000 abab abab abab abab abab abab abab abab
+*
+100
+-- sprout --
+000 abab abab abab abab abab abab abab abab
+*
+100
diff --git a/tests/btrfs/group b/tests/btrfs/group
index 2b7b0caf407e..d3639ba8dbfe 100644
--- a/tests/btrfs/group
+++ b/tests/btrfs/group
@@ -166,3 +166,4 @@
 161 auto quick
 162 auto quick
 163 auto quick
+164 auto quick
-- 
2.7.0

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2 3/4] fstests: btrfs: seed device replace test

2018-05-28 Thread Anand Jain
Test case to verify that a seed device can be replaced

Signed-off-by: Anand Jain 
---
 tests/btrfs/163 | 114 
 tests/btrfs/163.out |   9 +
 tests/btrfs/group   |   1 +
 3 files changed, 124 insertions(+)
 create mode 100755 tests/btrfs/163
 create mode 100644 tests/btrfs/163.out

diff --git a/tests/btrfs/163 b/tests/btrfs/163
new file mode 100755
index ..e59f1deae42e
--- /dev/null
+++ b/tests/btrfs/163
@@ -0,0 +1,114 @@
+#! /bin/bash
+# FS QA Test 163 seed device replace test
+#
+# Test case to verify that a seed device can be replaced
+#
+# Steps:
+#  Create a seed device
+#  Create a sprout device
+#  Remount RW
+#  Run device replace on the seed device
+#---
+# Copyright (c) 2018 Oracle.  All Rights Reserved.
+# Author: Anand Jain 
+#
+# This program is free software; you can redistribute it and/or
+# modify it under the terms of the GNU General Public License as
+# published by the Free Software Foundation.
+#
+# This program is distributed in the hope that it would be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program; if not, write the Free Software Foundation,
+# Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
+#---
+#
+
+seq=`basename $0`
+seqres=$RESULT_DIR/$seq
+echo "QA output created by $seq"
+
+here=`pwd`
+tmp=/tmp/$$
+status=1   # failure is the default!
+trap "_cleanup; exit \$status" 0 1 2 3 15
+
+_cleanup()
+{
+   cd /
+   rm -f $tmp.*
+}
+
+# get standard environment, filters and checks
+. ./common/rc
+. ./common/filter
+. ./common/filter.btrfs
+
+# remove previous $seqres.full before test
+rm -f $seqres.full
+
+# real QA test starts here
+
+# Modify as appropriate.
+_supported_fs btrfs
+_supported_os Linux
+_require_scratch_dev_pool 3
+
+_scratch_dev_pool_get 3
+
+DEV_SEED=$(echo $SCRATCH_DEV_POOL | awk '{print $1}')
+DEV_SPROUT=$(echo $SCRATCH_DEV_POOL | awk '{print $2}')
+DEV_REPLACE_TGT=$(echo $SCRATCH_DEV_POOL | awk '{print $3}')
+
+create_seed()
+{
+   _mkfs_dev $DEV_SEED
+   run_check _mount $DEV_SEED $SCRATCH_MNT
+   $XFS_IO_PROG -f -d -c "pwrite -S 0xab 0 256K" $SCRATCH_MNT/foobar >\
+   /dev/null
+   echo -- gloden --
+   od -x $SCRATCH_MNT/foobar
+   _run_btrfs_util_prog filesystem show -m $SCRATCH_MNT
+   _scratch_unmount
+   $BTRFS_TUNE_PROG -S 1 $DEV_SEED
+   run_check _mount $DEV_SEED $SCRATCH_MNT
+}
+
+add_sprout()
+{
+   _run_btrfs_util_prog device add -f $DEV_SPROUT $SCRATCH_MNT
+   _run_btrfs_util_prog filesystem show -m $SCRATCH_MNT
+}
+
+replace_seed()
+{
+   _run_btrfs_util_prog replace start -fB $DEV_SEED $DEV_REPLACE_TGT 
$SCRATCH_MNT
+   _run_btrfs_util_prog filesystem show -m $SCRATCH_MNT
+   _scratch_unmount
+   run_check _mount $DEV_REPLACE_TGT $SCRATCH_MNT
+   echo -- sprout --
+   od -x $SCRATCH_MNT/foobar
+   _scratch_unmount
+
+}
+
+seed_is_mountable()
+{
+   run_check _mount $DEV_SEED $SCRATCH_MNT
+   _run_btrfs_util_prog filesystem show -m $SCRATCH_MNT
+   _scratch_unmount
+}
+
+create_seed
+add_sprout
+replace_seed
+
+seed_is_mountable
+
+_scratch_dev_pool_put
+
+status=0
+exit
diff --git a/tests/btrfs/163.out b/tests/btrfs/163.out
new file mode 100644
index ..50f46da6df86
--- /dev/null
+++ b/tests/btrfs/163.out
@@ -0,0 +1,9 @@
+QA output created by 163
+-- gloden --
+000 abab abab abab abab abab abab abab abab
+*
+100
+-- sprout --
+000 abab abab abab abab abab abab abab abab
+*
+100
diff --git a/tests/btrfs/group b/tests/btrfs/group
index cdc8ad32036e..2b7b0caf407e 100644
--- a/tests/btrfs/group
+++ b/tests/btrfs/group
@@ -165,3 +165,4 @@
 160 auto quick
 161 auto quick
 162 auto quick
+163 auto quick
-- 
2.7.0

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2 1/4] fstests: btrfs: add seed sprout functionality test

2018-05-28 Thread Anand Jain
Create a seed device and add the sprout device to it.

Signed-off-by: Anand Jain 
---
v1->v2: Use functions to do the respective operations. Add data
verification.

 common/config   |  1 +
 tests/btrfs/161 | 90 +
 tests/btrfs/161.out |  9 ++
 tests/btrfs/group   |  1 +
 4 files changed, 101 insertions(+)
 create mode 100755 tests/btrfs/161
 create mode 100644 tests/btrfs/161.out

diff --git a/common/config b/common/config
index af360cefc804..f0735cb0153d 100644
--- a/common/config
+++ b/common/config
@@ -235,6 +235,7 @@ case "$HOSTOS" in
 export BTRFS_UTIL_PROG="`set_prog_path btrfs`"
 export BTRFS_SHOW_SUPER_PROG="`set_prog_path btrfs-show-super`"
export BTRFS_CONVERT_PROG="`set_prog_path btrfs-convert`"
+   export BTRFS_TUNE_PROG="`set_prog_path btrfstune`"
 export XFS_FSR_PROG="`set_prog_path xfs_fsr`"
 export MKFS_NFS_PROG="false"
 export MKFS_CIFS_PROG="false"
diff --git a/tests/btrfs/161 b/tests/btrfs/161
new file mode 100755
index ..009e95354e85
--- /dev/null
+++ b/tests/btrfs/161
@@ -0,0 +1,90 @@
+#! /bin/bash
+# FS QA Test 161 seed sprout functionality test
+#
+#  Create a seed device, mount it and, add a new device to create a
+#  sprout filesystem.
+#
+#---
+# Copyright (c) 2018 Oracle.  All Rights Reserved.
+# Author: Anand Jain 
+#
+# This program is free software; you can redistribute it and/or
+# modify it under the terms of the GNU General Public License as
+# published by the Free Software Foundation.
+#
+# This program is distributed in the hope that it would be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program; if not, write the Free Software Foundation,
+# Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
+#---
+#
+
+seq=`basename $0`
+seqres=$RESULT_DIR/$seq
+echo "QA output created by $seq"
+
+here=`pwd`
+tmp=/tmp/$$
+status=1   # failure is the default!
+trap "_cleanup; exit \$status" 0 1 2 3 15
+
+_cleanup()
+{
+   cd /
+   rm -f $tmp.*
+}
+
+# get standard environment, filters and checks
+. ./common/rc
+. ./common/filter
+
+# remove previous $seqres.full before test
+rm -f $seqres.full
+
+# real QA test starts here
+
+# Modify as appropriate.
+_supported_fs btrfs
+_supported_os Linux
+_require_scratch_dev_pool 2
+
+_scratch_dev_pool_get 2
+
+DEV_SEED=$(echo $SCRATCH_DEV_POOL | awk '{print $1}')
+DEV_SPROUT=$(echo $SCRATCH_DEV_POOL | awk '{print $2}')
+
+create_seed()
+{
+   _mkfs_dev $DEV_SEED
+   run_check _mount $DEV_SEED $SCRATCH_MNT
+   $XFS_IO_PROG -f -d -c "pwrite -S 0xab 0 256K" $SCRATCH_MNT/foobar >\
+   /dev/null
+   echo -- golden --
+   od -x $SCRATCH_MNT/foobar
+   _run_btrfs_util_prog filesystem show -m $SCRATCH_MNT
+   _scratch_unmount
+   $BTRFS_TUNE_PROG -S 1 $DEV_SEED
+   run_check _mount $DEV_SEED $SCRATCH_MNT
+}
+
+create_sprout()
+{
+   _run_btrfs_util_prog device add -f $DEV_SPROUT $SCRATCH_MNT
+   _scratch_unmount
+   run_check _mount $DEV_SPROUT $SCRATCH_MNT
+   echo -- sprout --
+   od -x $SCRATCH_MNT/foobar
+   _scratch_unmount
+}
+
+create_seed
+create_sprout
+
+_scratch_dev_pool_put
+
+status=0
+exit
diff --git a/tests/btrfs/161.out b/tests/btrfs/161.out
new file mode 100644
index ..363b8217f45c
--- /dev/null
+++ b/tests/btrfs/161.out
@@ -0,0 +1,9 @@
+QA output created by 161
+-- golden --
+000 abab abab abab abab abab abab abab abab
+*
+100
+-- sprout --
+000 abab abab abab abab abab abab abab abab
+*
+100
diff --git a/tests/btrfs/group b/tests/btrfs/group
index f04ee8d5297c..fe83631f0f33 100644
--- a/tests/btrfs/group
+++ b/tests/btrfs/group
@@ -163,3 +163,4 @@
 158 auto quick raid scrub
 159 auto quick
 160 auto quick
+161 auto quick
-- 
2.7.0

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v3 0/4] fstests: btrfs seed test cases

2018-05-28 Thread Anand Jain
1/4 was sent separately which is now integrated into this set, with some
 changes as indicated in the patch.
Though rest of the patches 2/4..4/4 are new, they carry the set version v2.

These test cases verify the seed sprout opertions.

Anand Jain (4):
  fstests: btrfs: add seed sprout functionality test
  fstests: btrfs: nested seed device test
  fstests: btrfs: seed device replace test
  fstests: btrfs: seed device delete test

 common/config   |   1 +
 tests/btrfs/161 |  90 +
 tests/btrfs/161.out |   9 +
 tests/btrfs/162 | 103 +++
 tests/btrfs/162.out |   9 +
 tests/btrfs/163 | 114 
 tests/btrfs/163.out |   9 +
 tests/btrfs/164 | 114 
 tests/btrfs/164.out |   9 +
 tests/btrfs/group   |   4 ++
 10 files changed, 462 insertions(+)
 create mode 100755 tests/btrfs/161
 create mode 100644 tests/btrfs/161.out
 create mode 100755 tests/btrfs/162
 create mode 100644 tests/btrfs/162.out
 create mode 100755 tests/btrfs/163
 create mode 100644 tests/btrfs/163.out
 create mode 100755 tests/btrfs/164
 create mode 100644 tests/btrfs/164.out

-- 
2.7.0

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Questions from aspiring btrfs mini-debugger/mini-developer

2018-05-28 Thread james harvey
I'm tracking down some more bugs.

Useful information for you to track down these bugs isn't in this
email.  This is more about an aspiring btrfs
mini-debugger/mini-developer asking for some guidance, to be able to
get the more useful information.

I ran across some mirrored files that are nodatacow/nodatasum, with
differing mirrored extents.  UNLIKE BEFORE, these are uncompressed.
Mostly /var/cache/samba and /var/lib/mysql files.  This also happened
during my recent btrfs replace.  Luckily for me, I had the unmodified
original images, so when I re-did this with a btrfs device add /
remove, the new ones are fine.

I'm almost positive I have extents with checksums, where its inode is
marked nodatacow.  This would be a bug, right?  (Confirming before I
look into this much more.)



I've spent a few days familiarizing myself with btrfs (kernel and
-progs) internals and source.

I've made some additions to btrfs-progs, that I'll submit once
finished.  One of them compares mirrored extents, looking for
differences.  If I have it check all extents, it brings up every
problematic file I've found, and the few I mentioned above that I
wasn't aware of because they were uncompressed.  I'll give more on
this once I have the details.  I think this must mean scrub doesn't
verify extents with checksums that are marked nodatacow, since it's
not expecting them to have checksums.



I have a few questions that would greatly help having answered.

Am I right that an inode has a single set of btrfs flags (things like
nodatacow, nodatasum, etc) accessable through btrfs_inode_flags()?  I
want to make sure extents within the same file can't have any varying
flags, and that a file and its extents across multiple snapshots all
share the same.

What about deduplicated extents?  If there's a file whose inode says
it has checksums, and another file whose inode has nodatasum, and
there's duplicate blocks, are they deduplicated, or does deduplication
see this and skip it because of the mismatch?

I have files that have some extents compressed, and others without.
Is this allowed?  This might just be on nodatacow files defragmented
and compressed, so maybe that process left some extents uncompressed.
Wondering if this is allowed before I dig more to see if it's on files
that haven't been through that process.



extent_offset isn't making sense to me.  I have a file whose filefrag includes:

  28:  896.. 919: 596954..596977: 24: 596978:
encoded,shared
  29:  920..1023: 580304..580407:104: 596978: shared
  30: 1024..1055: 596961..596992: 32: 580408:
encoded,shared

#29, through btrfs-tree-debug, is:

item 49 key (71469 EXTENT_DATA 3768320) itemoff 13232 itemsize 53
generation 218 type 1 (regular)
extent data disk byte 2373160960 nr 8384512
extent data offset 3764224 nr 425984 ram 8384512
extent compression 0 (none)

Its extents without a data offset (i.e. filefrag #30) look like:

item 50 key (71469 EXTENT_DATA 4194304) itemoff 13179 itemsize 53
generation 310 type 1 (regular)
extent data disk byte 2445152256 nr 49152
extent data offset 0 nr 131072 ram 131072
extent compression 2 (lzo)

So, item 49 is saying there's 8,384,512 bytes on disk, but for this
file extent, only read starting 3,764,224 into the extent_data, and
only read 425,984 bytes?  This is a snapshotted file.  At first, I was
thinking this might mean most of this extent had changed, but 425,984
bytes in the "middle" were the same, so btrfs was re-using that
portion.  Is that's why data_offset is used?  In this case, there is
the file in its normal location plus 43 older snapshots.  All of the
files are completely identical.  It's always possible there could have
been a deleted snapshot that was different, so maybe that's why I'm
not seeing a difference, and maybe it made sense in this way when it
was done.



extent_offset on prealloc data makes even less sense to me, like:

item 47 key (71469 EXTENT_DATA 42098688) itemoff 13739 itemsize 53
generation 293 type 2 (prealloc)
prealloc data disk byte 2426286080 nr 8388608
prealloc data offset 155648 nr 8232960

Am I right that preallocated means no data has actually been written
there?  Why does it even have a disk byte then, isn't that taking up
disk space?  And, why would it have a data offset of 155648 after that
disk byte location if there's no data there?



In the context of uncompressed extents, what's the difference between
extent num_bytes and extent_ram_bytes?  They're usually the same, but
sometimes different:

item 126 key (275 EXTENT_DATA 12288) itemoff 9867 itemsize 53
generation 98 type 1 (regular)
extent data disk byte 1656295424 nr 8192
extent data offset 0 nr 4096 ram 8192
   

Re: RAID-1 refuses to balance large drive

2018-05-28 Thread Duncan
Brad Templeton posted on Sun, 27 May 2018 11:22:07 -0700 as excerpted:

> BTW, I decided to follow the original double replace strategy suggested 
--
> replace 6TB with 8TB and replace 4TB with 6TB.  That should be sure to
> leave the 2 large drives each with 2TB free once expanded, and thus able
> to fully use all space.
> 
> However, the first one has been going for 9 hours and is "189.7% done" 
> and still going.   Some sort of bug in calculating the completion
> status, obviously.  With luck 200% will be enough?

IIRC there was an over-100% completion status bug fixed, I'd guess about 
18 months to two years ago now, long enough it would have slipped 
regular's minds so nobody would have thought about it even knowing you're 
still on 4.4, that being one of the reasons we don't do as well 
supporting stuff that old.

If it is indeed the same bug, anything even half modern should have it 
fixed

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/2] btrfs: Refactor running of delayed inode items during transaction commit

2018-05-28 Thread Misono Tomohiro
Hello,

I found current misc-next sometimes fails btrfs/152 when the number
of cpu is >= 4 in my vm and git bisect points this commit.
(btrfs/152 performs parallel send/receive.)

The failure is caused by _check_dmesg and sometimes also leads to inconsistent 
fs.

dmesg looks like:
[6.649213] WARNING: CPU: 0 PID: 2838 at fs/btrfs/transaction.c:303 
record_root_in_trans+0x38/0xd0
[6.650807] CPU: 0 PID: 2838 Comm: btrfs Tainted: GW 
4.17.0-rc6+ #113
[6.651998] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
1.10.2-2.fc27 04/01/2014
[6.653325] RIP: 0010:record_root_in_trans+0x38/0xd0
[6.654090] RSP: 0018:90b781e23a68 EFLAGS: 00010206
[6.654882] RAX: 8f346a48b720 RBX: 8f3479ce86c0 RCX: 
[6.655965] RDX:  RSI: 8f347c8ec000 RDI: 8f346a03a958
[6.657059] RBP: 90b781e23b80 R08: 00025870 R09: a8c112bb
[6.658162] R10: 90b781e23ba0 R11: 0002 R12: 8f346a03a958
[6.659290] R13: 8f347c8d9000 R14: 8f347ba2a0b8 R15: 
[6.660374] FS:  7f3b62ebc8c0() GS:8f347fc0() 
knlGS:
[6.661349] CS:  0010 DS:  ES:  CR0: 80050033
[6.661928] CR2: 7f0a641995e0 CR3: 6af5c000 CR4: 06f0
[6.662651] Call Trace:
[6.662909]  create_pending_snapshot+0x1ab/0xd00
[6.663391]  ? btrfs_delayed_inode_release_metadata+0xe0/0xf0
[6.663972]  ? __btrfs_update_delayed_inode+0x1aa/0x210
[6.664526]  ? __btrfs_release_delayed_node.part.18+0x8f/0x1d0
[6.665116]  ? create_pending_snapshots+0x81/0xa0
[6.665597]  create_pending_snapshots+0x81/0xa0
[6.666068]  btrfs_commit_transaction+0x279/0x860
[6.666553]  ? start_transaction+0x98/0x3c0
[6.666989]  btrfs_mksubvol+0x589/0x5a0
[6.667390]  ? btrfs_opendir+0x39/0x70
[6.667778]  btrfs_ioctl_snap_create_transid+0x16a/0x1a0
[6.668327]  btrfs_ioctl_snap_create_v2+0x121/0x180
[6.668827]  btrfs_ioctl+0x56d/0x25a0
[6.669205]  ? do_filp_open+0xaa/0x110
[6.669591]  ? do_vfs_ioctl+0x9f/0x610
[6.669980]  ? btrfs_ioctl_get_supported_features+0x20/0x20
[6.670550]  do_vfs_ioctl+0x9f/0x610
[6.670922]  ksys_ioctl+0x6b/0x80
[6.671261]  __x64_sys_ioctl+0x11/0x20
[6.671650]  do_syscall_64+0x49/0x100
[6.672040]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[6.672545] RIP: 0033:0x7f3b61cbddd7
[6.672899] RSP: 002b:7ffcdbe7b318 EFLAGS: 0246 ORIG_RAX: 
0010
[6.673666] RAX: ffda RBX: 7ffcdbe90610 RCX: 7f3b61cbddd7
[6.674430] RDX: 7ffcdbe7b360 RSI: 50009417 RDI: 0003
[6.675154] RBP: 5634403ec020 R08:  R09: 1010
[6.675878] R10:  R11: 0246 R12: 5634403ed670
[6.676604] R13: 7ffcdbe7c470 R14: 0053 R15: 7ffcdbe7b360
[6.677330] Code: f0 01 00 00 a8 02 74 27 48 8b 07 48 39 86 40 03 00 00 73 
1b 49 39 75 28 0f 84 91 00 00 00 85 d2 75 17 48 8b 06 48 39 46 08 74 0e <0f> 0b 
eb 0a 85 d2 74 65 49 3b 75 28 74 76 41 89 d4 48 89 f3 48
[6.679318] ---[ end trace b2378f91e69026c3 ]---

transaction.c:303:
  WARN_ON(!force && root->commit_root != root->node);

And 152.out.full looks like:
  *** fsck.btrfs output ***
  root item for root 261, current bytenr 53870592, current gen 149, current
  level 0, new bytenr 54067200, new gen 149, new level 0
  Found 1 roots with an outdated root item.
  Please run a filesystem check with the option --repair to fix them.

=
(My observation)

Call chain is:
  btrfs_commit_transaction()
   create_pending_snapshots()
 create_pending_snapshot()
   run_delayed_item() <- removed
   qgroup_account_snapshot()
   run_delayed_item() <- added

This commit removes run_delayed_item() in create_pending_snapshot() and
instead calls it after create_pending_snapshots().

However, as qgroup_account_snapshot() updates commit_root, if there are
more than one pending snapshots under the same subvolume, it may fail
to update root item correctly.

So, I think we cannot remove run_delayed_item() here.

Thanks,
Tomohiro Misono

On 2018/02/13 23:16, Nikolay Borisov wrote:
> Currently calls to btrfs_run_delayed_inode items have been scattered
> around the transaction commit code with no real design argument when they 
> should be execute.
> 
> We have one call, after transaction writers go to 0. Then we have the delayed
> items running as part of creating a snapshot (once per pedning snapshot).
> Finally, delayed items are run once more _after_ snapshots have been created.
> All in all this amounts to 2 + N (N = number of snapshots slated for 
> creation).
> In reality we need to flush delayed items once before
> create_pending_snapshots is called to ensure snapshosts are consistent with
> inode data and once after snapshots are created so that newly introduced 
> inode 
> items during snapshot creation 

[PATCH] btrfs: drop unused space_info parameter from create_space_info

2018-05-28 Thread Lu Fengqi
Since commit dc2d3005d27d ("btrfs: remove dead create_space_info
calls"), there is only one caller btrfs_init_space_info. However, it
doesn't need create_space_info to return space_info at all.

Signed-off-by: Lu Fengqi 
---
 fs/btrfs/extent-tree.c | 13 +
 1 file changed, 5 insertions(+), 8 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 2c65a09535b6..31627de751d1 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -4007,8 +4007,7 @@ static const char *alloc_name(u64 flags)
};
 }
 
-static int create_space_info(struct btrfs_fs_info *info, u64 flags,
-struct btrfs_space_info **new)
+static int create_space_info(struct btrfs_fs_info *info, u64 flags)
 {
 
struct btrfs_space_info *space_info;
@@ -4046,7 +4045,6 @@ static int create_space_info(struct btrfs_fs_info *info, 
u64 flags,
return ret;
}
 
-   *new = space_info;
list_add_rcu(_info->list, >space_info);
if (flags & BTRFS_BLOCK_GROUP_DATA)
info->data_sinfo = space_info;
@@ -10810,7 +10808,6 @@ void btrfs_delete_unused_bgs(struct btrfs_fs_info 
*fs_info)
 
 int btrfs_init_space_info(struct btrfs_fs_info *fs_info)
 {
-   struct btrfs_space_info *space_info;
struct btrfs_super_block *disk_super;
u64 features;
u64 flags;
@@ -10826,21 +10823,21 @@ int btrfs_init_space_info(struct btrfs_fs_info 
*fs_info)
mixed = 1;
 
flags = BTRFS_BLOCK_GROUP_SYSTEM;
-   ret = create_space_info(fs_info, flags, _info);
+   ret = create_space_info(fs_info, flags);
if (ret)
goto out;
 
if (mixed) {
flags = BTRFS_BLOCK_GROUP_METADATA | BTRFS_BLOCK_GROUP_DATA;
-   ret = create_space_info(fs_info, flags, _info);
+   ret = create_space_info(fs_info, flags);
} else {
flags = BTRFS_BLOCK_GROUP_METADATA;
-   ret = create_space_info(fs_info, flags, _info);
+   ret = create_space_info(fs_info, flags);
if (ret)
goto out;
 
flags = BTRFS_BLOCK_GROUP_DATA;
-   ret = create_space_info(fs_info, flags, _info);
+   ret = create_space_info(fs_info, flags);
}
 out:
return ret;
-- 
2.17.0



--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html