Re: [PATCH] btrfs: add missing memset while reading compressed inline extents

2017-03-21 Thread Qu Wenruo



At 03/09/2017 10:05 AM, Zygo Blaxell wrote:

On Wed, Mar 08, 2017 at 10:27:33AM +, Filipe Manana wrote:

On Wed, Mar 8, 2017 at 3:18 AM, Zygo Blaxell wrote:

From: Zygo Blaxell 

This is a story about 4 distinct (and very old) btrfs bugs.

Commit c8b978188c ("Btrfs: Add zlib compression support") added
three data corruption bugs for inline extents (bugs #1-3).

Commit 93c82d5750 ("Btrfs: zero page past end of inline file items")
fixed bug #1:  uncompressed inline extents followed by a hole and more
extents could get non-zero data in the hole as they were read.  The fix
was to add a memset in btrfs_get_extent to zero out the hole.

Commit 166ae5a418 ("btrfs: fix inline compressed read err corruption")
fixed bug #2:  compressed inline extents which contained non-zero bytes
might be replaced with zero bytes in some cases.  This patch removed an
unhelpful memset from uncompress_inline, but the case where memset is
required was missed.

There is also a memset in the decompression code, but this only covers
decompressed data that is shorter than the ram_bytes from the extent
ref record.  This memset doesn't cover the region between the end of the
decompressed data and the end of the page.  It has also moved around a
few times over the years, so there's no single patch to refer to.

This patch fixes bug #3:  compressed inline extents followed by a hole
and more extents could get non-zero data in the hole as they were read
(i.e. bug #3 is the same as bug #1, but s/uncompressed/compressed/).
The fix is the same:  zero out the hole in the compressed case too,
by putting a memset back in uncompress_inline, but this time with
correct parameters.
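(The diff itself is not shown in this excerpt. As a rough sketch of the
shape of such a fix, using the names that surround uncompress_inline() in
fs/btrfs/inode.c, and with the exact bounds being what the commit message
calls "correct parameters":

	ret = btrfs_decompress(compress_type, tmp, page,
			       extent_offset, inline_size, max_size);
	/*
	 * The decompression code only zeroes up to max_size; anything
	 * between the end of the inline data and the end of the page
	 * must be zeroed here so the hole reads back as zeroes.
	 */
	if (max_size + pg_offset < PAGE_SIZE) {
		char *map = kmap(page);

		memset(map + pg_offset + max_size, 0,
		       PAGE_SIZE - max_size - pg_offset);
		kunmap(page);
	}

The sketch zeroes the region between the end of the decompressed inline
data and the end of the page, which is exactly the infoleak window
described above.)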

The last and oldest bug, bug #0, is the cause of the offending inline
extent/hole/extent pattern.  Bug #0 is a subtle and mostly-harmless quirk
of behavior somewhere in the btrfs write code.  In a few special cases,
an inline extent and hole are allowed to persist where they normally
would be combined with later extents in the file.

A fast reproducer for bug #0 is presented below.  A few offending extents
are also created in the wild during large rsync transfers with the -S
flag.  A Linux kernel build (git checkout; make allyesconfig; make -j8)
will produce a handful of offending files as well.  Once an offending
file is created, it can present different content to userspace each
time it is read.

Bug #0 is at least 4 and possibly 8 years old.  I verified every vX.Y
kernel back to v3.5 has this behavior.  There are fossil records of this
bug's effects in commits all the way back to v2.6.32.  I have no reason
to believe bug #0 wasn't present at the beginning of btrfs compression
support in v2.6.29, but I can't easily test kernels that old to be sure.

It is not clear whether bug #0 is worth fixing.  A fix would likely
require injecting extra reads into currently write-only paths, and most
of the exceptional cases caused by bug #0 are already handled now.

Whether we like them or not, bug #0's inline extents followed by holes
are part of the btrfs de-facto disk format now, and we need to be able
to read them without data corruption or an infoleak.  So enough about
bug #0, let's get back to bug #3 (this patch).

An example of on-disk structure leading to data corruption:

item 61 key (606890 INODE_ITEM 0) itemoff 9662 itemsize 160
inode generation 50 transid 50 size 47424 nbytes 49141
block group 0 mode 100644 links 1 uid 0 gid 0
rdev 0 flags 0x0(none)
item 62 key (606890 INODE_REF 603050) itemoff 9642 itemsize 20
inode ref index 3 namelen 10 name: DB_File.so
item 63 key (606890 EXTENT_DATA 0) itemoff 8280 itemsize 1362
inline extent data size 1341 ram 4085 compress(zlib)
item 64 key (606890 EXTENT_DATA 4096) itemoff 8227 itemsize 53
extent data disk byte 5367308288 nr 20480
extent data offset 0 nr 45056 ram 45056
extent compression(zlib)


So this case is actually different from the reproducer below, because
once a file has prealloc extents, future writes will never be
compressed. That is, the extent at offset 4096 can not have compressed
content using steps like the reproducer below. I can only imagine that
after an falloc to create the extent at 4K, we cloned some compressed
extent from some other file into that offset.


The above comes from one of my captures of the bug appearing in the wild,
not the reproducer from Liu Bo.  I'll make that clearer.


In fact, I found that btrfs/137 with compress=lzo seems to trigger such an
"inline-then-regular" case with a very high probability.


If I abort the test just after populating the fs, the probability of
getting the inline-then-regular extent layout is over 70%.


Although in the btrfs/137 case it's "inline-then-hole", not exactly
"inline-then-regular".


Hopefully such a reproducer can help to track down the

Re: [PATCH RFC] btrfs: introduce a separate mutex for caching_block_groups list

2017-03-21 Thread Liu Bo
On Sun, Mar 19, 2017 at 07:18:59PM +0200, Alex Lyakas wrote:
> We have a commit_root_sem, which is a read-write semaphore that protects the
> commit roots.
> But it is also used to protect the list of caching block groups.
> 
> As a result, while doing "slow" caching, the following issue is seen:
> 
> Some of the caching threads are scanning the extent tree with
> commit_root_sem
> acquired in shared mode, with stack like:
> [] read_extent_buffer_pages+0x2d2/0x300 [btrfs]
> [] btree_read_extent_buffer_pages.constprop.50+0xb7/0x1e0
> [btrfs]
> [] read_tree_block+0x40/0x70 [btrfs]
> [] read_block_for_search.isra.33+0x12c/0x370 [btrfs]
> [] btrfs_search_slot+0x3c6/0xb10 [btrfs]
> [] caching_thread+0x1b9/0x820 [btrfs]
> [] normal_work_helper+0xc6/0x340 [btrfs]
> [] btrfs_cache_helper+0x12/0x20 [btrfs]
> 
> IO requests that want to allocate space are waiting in cache_block_group()
> to acquire the commit_root_sem in exclusive mode. But they only want to add
> the caching control structure to the list of caching block-groups:
> [] schedule+0x29/0x70
> [] rwsem_down_write_failed+0x145/0x320
> [] call_rwsem_down_write_failed+0x13/0x20
> [] cache_block_group+0x25b/0x450 [btrfs]
> [] find_free_extent+0xd16/0xdb0 [btrfs]
> [] btrfs_reserve_extent+0xaf/0x160 [btrfs]
> 
> Other caching threads want to continue their scanning, and for that they
> are waiting to acquire commit_root_sem in shared mode. But since there are
> IO threads that want the exclusive lock, the caching threads are unable
> to continue the scanning, because (I presume) rw_semaphore guarantees some
> fairness:
> [] schedule+0x29/0x70
> [] rwsem_down_read_failed+0xc5/0x120
> [] call_rwsem_down_read_failed+0x14/0x30
> [] caching_thread+0x1a1/0x820 [btrfs]
> [] normal_work_helper+0xc6/0x340 [btrfs]
> [] btrfs_cache_helper+0x12/0x20 [btrfs]
> [] process_one_work+0x146/0x410
> 
> This causes slowness of the IO, especially when there are many block groups
> that need to be scanned for free space. In some cases it takes minutes
> until a single IO thread is able to allocate free space.
> 
> I don't see a deadlock here, because the caching threads that were able to
> acquire the commit_root_sem will call rwsem_is_contended() and should give
> up the semaphore, so that IO threads are able to acquire it in exclusive mode.
> 
> However, introducing a separate mutex that protects only the list of caching
> block groups makes things move forward much faster.
>

The problem did exist and the patch looks good to me.
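For reference, the give-up-on-contention path described above is the check
in caching_thread(); a sketch based on the 3.18-era fs/btrfs/extent-tree.c
(paraphrased, not quoted verbatim):

	if (need_resched() ||
	    rwsem_is_contended(&fs_info->commit_root_sem)) {
		caching_ctl->progress = last;
		btrfs_release_path(path);
		up_read(&fs_info->commit_root_sem);
		mutex_unlock(&caching_ctl->mutex);
		cond_resched();
		goto again;	/* re-take the semaphore, continue the scan */
	}

The patch below narrows what commit_root_sem protects, so writers queueing
a new caching control no longer pile up behind the scanners.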

> This patch is based on kernel 3.18.
> Unfortunately, I am not able to submit a patch based on one of the latest
> kernels, because here btrfs is part of a larger system, and upgrading the
> kernel is a significant effort. Hence marking the patch as RFC.
> Hopefully, this patch still has some value to the community.
> 
> Signed-off-by: Alex Lyakas 
> 
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index 42d11e7..74feacb 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -1490,6 +1490,8 @@ struct btrfs_fs_info {
> struct list_head trans_list;
> struct list_head dead_roots;
> struct list_head caching_block_groups;
> +/* protects the above list */
> +struct mutex caching_block_groups_mutex;
> 
> spinlock_t delayed_iput_lock;
> struct list_head delayed_iputs;
> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index 5177954..130ec58 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -2229,6 +2229,7 @@ int open_ctree(struct super_block *sb,
> INIT_LIST_HEAD(&fs_info->delayed_iputs);
> INIT_LIST_HEAD(&fs_info->delalloc_roots);
> INIT_LIST_HEAD(&fs_info->caching_block_groups);
> +mutex_init(&fs_info->caching_block_groups_mutex);
> spin_lock_init(&fs_info->delalloc_root_lock);
> spin_lock_init(&fs_info->trans_lock);
> spin_lock_init(&fs_info->fs_roots_radix_lock);
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index a067065..906fb08 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -637,10 +637,10 @@ static int cache_block_group(struct btrfs_block_group_cache *cache,
> return 0;
> }
> 
> -down_write(&fs_info->commit_root_sem);
> +mutex_lock(&fs_info->caching_block_groups_mutex);
> atomic_inc(&caching_ctl->count);
> list_add_tail(&caching_ctl->list, &fs_info->caching_block_groups);
> -up_write(&fs_info->commit_root_sem);
> +mutex_unlock(&fs_info->caching_block_groups_mutex);
> 
> btrfs_get_block_group(cache);
> 
> @@ -5693,6 +5693,7 @@ void btrfs_prepare_extent_commit(struct btrfs_trans_handle *trans,
> 
> down_write(&fs_info->commit_root_sem);
>

With the new mutex, it's not necessary to take commit_root_sem here
because a) pinned_extents could not be modified outside of a
transaction, and b) while at btrfs_prepare_extent_commit(), it's
already at the very end of committing a transaction.

caching_ctl should have at least one reference and
caching_ctl->progress is supposed to be protected by

Re: [RFC PATCH v1 00/30] fs: inode->i_version rework and optimization

2017-03-21 Thread Jeff Layton
On Tue, 2017-03-21 at 15:13 -0400, J. Bruce Fields wrote:
> On Tue, Mar 21, 2017 at 02:46:53PM -0400, Jeff Layton wrote:
> > On Tue, 2017-03-21 at 14:30 -0400, J. Bruce Fields wrote:
> > > On Tue, Mar 21, 2017 at 01:23:24PM -0400, Jeff Layton wrote:
> > > > On Tue, 2017-03-21 at 12:30 -0400, J. Bruce Fields wrote:
> > > > > - It's durable; the above comparison still works if there were reboots
> > > > >   between the two i_version checks.
> > > > >   - I don't know how realistic this is--we may need to figure out
> > > > > if there's a weaker guarantee that's still useful.  Do
> > > > > filesystems actually make ctime/mtime/i_version changes
> > > > > atomically with the changes that caused them?  What if a
> > > > > change attribute is exposed to an NFS client but doesn't make
> > > > > it to disk, and then that value is reused after reboot?
> > > > > 
> > > > 
> > > > Yeah, there could be atomicity there. If we bump i_version, we'll mark
> > > > the inode dirty and I think that will end up with the new i_version at
> > > > least being journalled before __mark_inode_dirty returns.
> > > 
> > > So you think the filesystem can provide the atomicity?  In more detail:
> > > 
> > 
> > Sorry, I hit send too quickly. That should have read:
> > 
> > "Yeah, there could be atomicity issues there."
> > 
> > I think providing that level of atomicity may be difficult, though
> > maybe there's some way to make the querying of i_version block until
> > the inode update has been journalled?
> 
> No idea.  Anyway, I'd like to figure out some reasonable requirement
> that we can document.
> 
> > 
> > >   - if I write to a file, a simultaneous reader should see either
> > > (old data, old i_version) or (new data, new i_version), not a
> > > combination of the two.
> > >   - ditto for metadata modifications.
> > >   - the same should be true if there's a crash.
> > > 
> > > (If that's not possible, then I think we could live with a brief window
> > > of (new data, old i_version) as long as it doesn't persist beyond sync?)
> > > 
> > > > That said, I suppose it is possible for us to bump the counter, hand
> > > > that new counter value out to a NFS client and then the box crashes
> > > > before it makes it to the journal.
> > > > 
> > > > Not sure how big a problem that really is.
> > > 
> > > The other case I was wondering about may have been unclear.  Represent
> > > the state of a file by a (data, i_version) pair.  Say:
> > > 
> > >   - file is modified from (F, V) to (F', V+1).
> > >   - client reads and caches (F', V+1).
> > >   - server crashes before writeback, so disk still has (F, V).
> > >   - after restart, someone else modifies file to (F'', V+1).
> > >   - original client revalidates its cache, sees V+1, concludes
> > > file data is still F'.
> > > 
> > > This may not cause a real problem for clients depending only on
> > > traditional NFS close-to-open semantics.
> > > 
> > > 
> > 
> > No, I think that is a legitimate problem.
> > 
> > That said, after F'', the mtime would almost certainly be different
> > from the time after F', and that would likely be enough to prevent
> > confusion in NFS clients.
> 
> Oh, good point.  So, may be worth saying that anyone wanting to make
> sense of these across reboot should compare times as well (maybe that
> should be in nfs rfc's too).  I think that should be ctime not mtime,
> though?
> 

Yes, it might be worth a mention there. IIRC, it does mention that you
shouldn't just look at a single attribute for cache validation
purposes, but the wording is a bit vague. I can't find the section at
the moment though.

The more I think about it though, simply ensuring that we don't publish
a new change attr until the inode update has hit the journal may be
the best we can do. I'd have to think about how to implement that
though.
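Concretely, the validation rule converged on above is: treat cached data
as valid only when both the change attribute and ctime are unchanged. A
self-contained sketch, with illustrative names and types (not taken from
any particular NFS client):

	#include <stdbool.h>
	#include <stdint.h>
	#include <time.h>

	struct cache_entry {
		uint64_t change_attr;	/* i_version observed when cached */
		struct timespec ctime;	/* ctime observed when cached */
	};

	/*
	 * i_version alone can repeat across a server crash (the F'' case
	 * above reuses V+1), so require ctime to match as well.
	 */
	static bool cache_still_valid(const struct cache_entry *c,
				      uint64_t cur_change_attr,
				      const struct timespec *cur_ctime)
	{
		return cur_change_attr == c->change_attr &&
		       cur_ctime->tv_sec == c->ctime.tv_sec &&
		       cur_ctime->tv_nsec == c->ctime.tv_nsec;
	}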
-- 
Jeff Layton 


Re: [RFC PATCH v1 00/30] fs: inode->i_version rework and optimization

2017-03-21 Thread Dave Chinner
On Tue, Mar 21, 2017 at 01:23:24PM -0400, Jeff Layton wrote:
> On Tue, 2017-03-21 at 12:30 -0400, J. Bruce Fields wrote:
> > - It's durable; the above comparison still works if there were reboots
> >   between the two i_version checks.
> > - I don't know how realistic this is--we may need to figure out
> >   if there's a weaker guarantee that's still useful.  Do
> >   filesystems actually make ctime/mtime/i_version changes
> >   atomically with the changes that caused them?  What if a
> >   change attribute is exposed to an NFS client but doesn't make
> >   it to disk, and then that value is reused after reboot?
> > 
> 
> Yeah, there could be atomicity there. If we bump i_version, we'll mark
> the inode dirty and I think that will end up with the new i_version at
> least being journalled before __mark_inode_dirty returns.

The change may be journalled, but it isn't guaranteed stable until
fsync is run on the inode.

NFS server operations commit the metadata changed by a modification
through ->commit_metadata or sync_inode_metadata() before the
response is sent back to the client, hence guaranteeing that
i_version changes through the NFS server are stable and durable.
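(For context, the server-side hook referred to here is nfsd's
commit_metadata() helper; roughly the following, sketched from the
fs/nfsd/vfs.c of this era, with the in-tree code being authoritative:

	static int commit_metadata(struct svc_fh *fhp)
	{
		struct inode *inode = d_inode(fhp->fh_dentry);
		const struct export_operations *export_ops =
			inode->i_sb->s_export_op;

		if (!EX_ISSYNC(fhp->fh_export))
			return 0;

		/* e.g. XFS implements ->commit_metadata; others fall back */
		if (export_ops->commit_metadata)
			return export_ops->commit_metadata(inode);
		return sync_inode_metadata(inode, 1);
	}

Either branch forces the inode's metadata to stable storage before the
NFS reply goes out.)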

This is not the case for normal operations done through the POSIX
API - the journalling is asynchronous and the only durability
guarantees are provided by fsync().

> That said, I suppose it is possible for us to bump the counter, hand
> that new counter value out to a NFS client and then the box crashes
> before it makes it to the journal.

Yup, this has always been a problem when you mix posix applications
running on the NFS server modifying the same files as the NFS
clients are accessing and requiring synchronisation.

> Not sure how big a problem that really is.

This coherency problem has always existed on the server side...

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [RFC PATCH v1 00/30] fs: inode->i_version rework and optimization

2017-03-21 Thread J. Bruce Fields
On Tue, Mar 21, 2017 at 02:46:53PM -0400, Jeff Layton wrote:
> On Tue, 2017-03-21 at 14:30 -0400, J. Bruce Fields wrote:
> > On Tue, Mar 21, 2017 at 01:23:24PM -0400, Jeff Layton wrote:
> > > On Tue, 2017-03-21 at 12:30 -0400, J. Bruce Fields wrote:
> > > > - It's durable; the above comparison still works if there were reboots
> > > >   between the two i_version checks.
> > > > - I don't know how realistic this is--we may need to figure out
> > > >   if there's a weaker guarantee that's still useful.  Do
> > > >   filesystems actually make ctime/mtime/i_version changes
> > > >   atomically with the changes that caused them?  What if a
> > > >   change attribute is exposed to an NFS client but doesn't make
> > > >   it to disk, and then that value is reused after reboot?
> > > > 
> > > 
> > > Yeah, there could be atomicity there. If we bump i_version, we'll mark
> > > the inode dirty and I think that will end up with the new i_version at
> > > least being journalled before __mark_inode_dirty returns.
> > 
> > So you think the filesystem can provide the atomicity?  In more detail:
> > 
> 
> Sorry, I hit send too quickly. That should have read:
> 
> "Yeah, there could be atomicity issues there."
> 
> I think providing that level of atomicity may be difficult, though
> maybe there's some way to make the querying of i_version block until
> the inode update has been journalled?

No idea.  Anyway, I'd like to figure out some reasonable requirement
that we can document.

> 
> > - if I write to a file, a simultaneous reader should see either
> >   (old data, old i_version) or (new data, new i_version), not a
> >   combination of the two.
> > - ditto for metadata modifications.
> > - the same should be true if there's a crash.
> > 
> > (If that's not possible, then I think we could live with a brief window
> > of (new data, old i_version) as long as it doesn't persist beyond sync?)
> > 
> > > That said, I suppose it is possible for us to bump the counter, hand
> > > that new counter value out to a NFS client and then the box crashes
> > > before it makes it to the journal.
> > > 
> > > Not sure how big a problem that really is.
> > 
> > The other case I was wondering about may have been unclear.  Represent
> > the state of a file by a (data, i_version) pair.  Say:
> > 
> > - file is modified from (F, V) to (F', V+1).
> > - client reads and caches (F', V+1).
> > - server crashes before writeback, so disk still has (F, V).
> > - after restart, someone else modifies file to (F'', V+1).
> > - original client revalidates its cache, sees V+1, concludes
> >   file data is still F'.
> > 
> > This may not cause a real problem for clients depending only on
> > traditional NFS close-to-open semantics.
> > 
> > 
> 
> No, I think that is a legitimate problem.
> 
> That said, after F'', the mtime would almost certainly be different
> from the time after F', and that would likely be enough to prevent
> confusion in NFS clients.

Oh, good point.  So, may be worth saying that anyone wanting to make
sense of these across reboot should compare times as well (maybe that
should be in nfs rfc's too).  I think that should be ctime not mtime,
though?

--b.


Re: [RFC PATCH v1 00/30] fs: inode->i_version rework and optimization

2017-03-21 Thread J. Bruce Fields
On Tue, Mar 21, 2017 at 01:23:24PM -0400, Jeff Layton wrote:
> On Tue, 2017-03-21 at 12:30 -0400, J. Bruce Fields wrote:
> > - It's durable; the above comparison still works if there were reboots
> >   between the two i_version checks.
> > - I don't know how realistic this is--we may need to figure out
> >   if there's a weaker guarantee that's still useful.  Do
> >   filesystems actually make ctime/mtime/i_version changes
> >   atomically with the changes that caused them?  What if a
> >   change attribute is exposed to an NFS client but doesn't make
> >   it to disk, and then that value is reused after reboot?
> > 
> 
> Yeah, there could be atomicity there. If we bump i_version, we'll mark
> the inode dirty and I think that will end up with the new i_version at
> least being journalled before __mark_inode_dirty returns.

So you think the filesystem can provide the atomicity?  In more detail:

- if I write to a file, a simultaneous reader should see either
  (old data, old i_version) or (new data, new i_version), not a
  combination of the two.
- ditto for metadata modifications.
- the same should be true if there's a crash.

(If that's not possible, then I think we could live with a brief window
of (new data, old i_version) as long as it doesn't persist beyond sync?)

> That said, I suppose it is possible for us to bump the counter, hand
> that new counter value out to a NFS client and then the box crashes
> before it makes it to the journal.
> 
> Not sure how big a problem that really is.

The other case I was wondering about may have been unclear.  Represent
the state of a file by a (data, i_version) pair.  Say:

- file is modified from (F, V) to (F', V+1).
- client reads and caches (F', V+1).
- server crashes before writeback, so disk still has (F, V).
- after restart, someone else modifies file to (F'', V+1).
- original client revalidates its cache, sees V+1, concludes
  file data is still F'.

This may not cause a real problem for clients depending only on
traditional NFS close-to-open semantics.

--b.


Re: [RFC PATCH v1 00/30] fs: inode->i_version rework and optimization

2017-03-21 Thread Jeff Layton
On Tue, 2017-03-21 at 14:30 -0400, J. Bruce Fields wrote:
> On Tue, Mar 21, 2017 at 01:23:24PM -0400, Jeff Layton wrote:
> > On Tue, 2017-03-21 at 12:30 -0400, J. Bruce Fields wrote:
> > > - It's durable; the above comparison still works if there were reboots
> > >   between the two i_version checks.
> > >   - I don't know how realistic this is--we may need to figure out
> > > if there's a weaker guarantee that's still useful.  Do
> > > filesystems actually make ctime/mtime/i_version changes
> > > atomically with the changes that caused them?  What if a
> > > change attribute is exposed to an NFS client but doesn't make
> > > it to disk, and then that value is reused after reboot?
> > > 
> > 
> > Yeah, there could be atomicity there. If we bump i_version, we'll mark
> > the inode dirty and I think that will end up with the new i_version at
> > least being journalled before __mark_inode_dirty returns.
> 
> So you think the filesystem can provide the atomicity?  In more detail:
> 

Sorry, I hit send too quickly. That should have read:

"Yeah, there could be atomicity issues there."

I think providing that level of atomicity may be difficult, though
maybe there's some way to make the querying of i_version block until
the inode update has been journalled?

>   - if I write to a file, a simultaneous reader should see either
> (old data, old i_version) or (new data, new i_version), not a
> combination of the two.
>   - ditto for metadata modifications.
>   - the same should be true if there's a crash.
> 
> (If that's not possible, then I think we could live with a brief window
> of (new data, old i_version) as long as it doesn't persist beyond sync?)
> 
> > That said, I suppose it is possible for us to bump the counter, hand
> > that new counter value out to a NFS client and then the box crashes
> > before it makes it to the journal.
> > 
> > Not sure how big a problem that really is.
> 
> The other case I was wondering about may have been unclear.  Represent
> the state of a file by a (data, i_version) pair.  Say:
> 
>   - file is modified from (F, V) to (F', V+1).
>   - client reads and caches (F', V+1).
>   - server crashes before writeback, so disk still has (F, V).
>   - after restart, someone else modifies file to (F'', V+1).
>   - original client revalidates its cache, sees V+1, concludes
> file data is still F'.
> 
> This may not cause a real problem for clients depending only on
> traditional NFS close-to-open semantics.
> 
> 

No, I think that is a legitimate problem.

That said, after F'', the mtime would almost certainly be different
from the time after F', and that would likely be enough to prevent
confusion in NFS clients.

Applications that are just looking at i_version via statx() could get
confused though.

-- 
Jeff Layton 


Re: [RFC PATCH v1 00/30] fs: inode->i_version rework and optimization

2017-03-21 Thread J. Bruce Fields
On Tue, Mar 21, 2017 at 01:37:04PM -0400, J. Bruce Fields wrote:
> On Tue, Mar 21, 2017 at 01:23:24PM -0400, Jeff Layton wrote:
> > On Tue, 2017-03-21 at 12:30 -0400, J. Bruce Fields wrote:
> > >   - NFS doesn't actually require that it increases, but I think it
> > > should.  I assume 64 bits means we don't need a discussion of
> > > wraparound.
> > 
> > I thought NFS spec required that you be able to recognize old change
> > attributes so that they can be discarded. I could be wrong here though.
> > I'd have to go back and look through the spec to be sure.
> 
> https://tools.ietf.org/html/rfc7862#section-10

So, I'm suggesting we implement this one:

NFS4_CHANGE_TYPE_IS_MONOTONIC_INCR:  The change attribute value
  MUST monotonically increase for every atomic change to the file
  attributes, data, or directory contents.

It may be a slight lie--after your patches we wouldn't actually increase
"for every atomic change".  I think that's OK.

--b.


Re: [RFC PATCH v1 00/30] fs: inode->i_version rework and optimization

2017-03-21 Thread J. Bruce Fields
On Tue, Mar 21, 2017 at 01:23:24PM -0400, Jeff Layton wrote:
> On Tue, 2017-03-21 at 12:30 -0400, J. Bruce Fields wrote:
> > - NFS doesn't actually require that it increases, but I think it
> >   should.  I assume 64 bits means we don't need a discussion of
> >   wraparound.
> 
> I thought NFS spec required that you be able to recognize old change
> attributes so that they can be discarded. I could be wrong here though.
> I'd have to go back and look through the spec to be sure.

https://tools.ietf.org/html/rfc7862#section-10

--b.


Re: [RFC PATCH v1 00/30] fs: inode->i_version rework and optimization

2017-03-21 Thread Jeff Layton
On Tue, 2017-03-21 at 12:30 -0400, J. Bruce Fields wrote:
> On Tue, Mar 21, 2017 at 06:45:00AM -0700, Christoph Hellwig wrote:
> > On Mon, Mar 20, 2017 at 05:43:27PM -0400, J. Bruce Fields wrote:
> > > To me, the interesting question is whether this allows us to turn on
> > > i_version updates by default on xfs and ext4.
> > 
> > XFS v5 file systems have it on by default.
> 
> Great, thanks.
> 
> > Although we'll still need to agree on the exact semantics of i_version
> > before it's going to be useful.
> 
> Once it's figured out maybe we should write it up for a manpage that
> could be used if statx starts exposing it to userspace.
> 
> A first attempt:
> 
> - It's a u64.
> 
> - It works for regular files and directories.  (What about symlinks or
>   other special types?)
> 
> - It changes between two checks if and only if there were intervening
>   data or metadata changes.  The change will always be an increase, but
>   the amount of the increase is meaningless.
>   - NFS doesn't actually require that it increases, but I think it
> should.  I assume 64 bits means we don't need a discussion of
> wraparound.

I thought NFS spec required that you be able to recognize old change
attributes so that they can be discarded. I could be wrong here though.
I'd have to go back and look through the spec to be sure.

>   - AFS wants an actual counter: if you get i_version X, then
> write twice, then get i_version X+2, you're allowed to assume
> your writes were the only modifications.  Let's ignore this
> for now.  In the future if someone explains how to count
> operations, then we can extend the interface to tell the
> caller it can get those extra semantics.
> 
> - It's durable; the above comparison still works if there were reboots
>   between the two i_version checks.
>   - I don't know how realistic this is--we may need to figure out
> if there's a weaker guarantee that's still useful.  Do
> filesystems actually make ctime/mtime/i_version changes
> atomically with the changes that caused them?  What if a
> change attribute is exposed to an NFS client but doesn't make
> it to disk, and then that value is reused after reboot?
> 

Yeah, there could be atomicity there. If we bump i_version, we'll mark
the inode dirty and I think that will end up with the new i_version at
least being journalled before __mark_inode_dirty returns.

That said, I suppose it is possible for us to bump the counter, hand
that new counter value out to a NFS client and then the box crashes
before it makes it to the journal.

Not sure how big a problem that really is.

> Am I missing any issues?
> 

No, I think you have it covered, and that's pretty much exactly what I
had in mind as far as semantics go. Thanks for writing it up!

-- 
Jeff Layton 


Re: [RFC PATCH v1 00/30] fs: inode->i_version rework and optimization

2017-03-21 Thread J. Bruce Fields
On Tue, Mar 21, 2017 at 06:45:00AM -0700, Christoph Hellwig wrote:
> On Mon, Mar 20, 2017 at 05:43:27PM -0400, J. Bruce Fields wrote:
> > To me, the interesting question is whether this allows us to turn on
> > i_version updates by default on xfs and ext4.
> 
> XFS v5 file systems have it on by default.

Great, thanks.

> Although we'll still need to agree on the exact semantics of i_version
> before it's going to be useful.

Once it's figured out maybe we should write it up for a manpage that
could be used if statx starts exposing it to userspace.

A first attempt:

- It's a u64.

- It works for regular files and directories.  (What about symlinks or
  other special types?)

- It changes between two checks if and only if there were intervening
  data or metadata changes.  The change will always be an increase, but
  the amount of the increase is meaningless.
- NFS doesn't actually require that it increases, but I think it
  should.  I assume 64 bits means we don't need a discussion of
  wraparound.
- AFS wants an actual counter: if you get i_version X, then
  write twice, then get i_version X+2, you're allowed to assume
  your writes were the only modifications.  Let's ignore this
  for now.  In the future if someone explains how to count
  operations, then we can extend the interface to tell the
  caller it can get those extra semantics.

- It's durable; the above comparison still works if there were reboots
  between the two i_version checks.
- I don't know how realistic this is--we may need to figure out
  if there's a weaker guarantee that's still useful.  Do
  filesystems actually make ctime/mtime/i_version changes
  atomically with the changes that caused them?  What if a
  change attribute is exposed to an NFS client but doesn't make
  it to disk, and then that value is reused after reboot?

Am I missing any issues?

--b.


Re: [RFC PATCH v1 00/30] fs: inode->i_version rework and optimization

2017-03-21 Thread Christoph Hellwig
On Mon, Mar 20, 2017 at 05:43:27PM -0400, J. Bruce Fields wrote:
> To me, the interesting question is whether this allows us to turn on
> i_version updates by default on xfs and ext4.

XFS v5 file systems have it on by default.  Although we'll still need
to agree on the exact semantics of i_version before it's going to be
useful.


Re: [PATCH v2 3/3] fstests: btrfs: Test inband dedupe with data balance.

2017-03-21 Thread Eryu Guan
On Thu, Mar 16, 2017 at 05:08:51PM +0800, Qu Wenruo wrote:
> Btrfs balance will reloate date extent, but its hash is removed too late
 ^^^ relocate
> at run_delayed_ref() time, which will cause extent ref increased
> during balance, cause either find_data_references() gives WARN_ON()
> or even run_delayed_refs() fails and cause transaction abort.
> 
> Add such concurrency test for inband dedupe and data balance.
> 
> Signed-off-by: Qu Wenruo 
> ---
>  tests/btrfs/202 | 109 
> 
>  tests/btrfs/202.out |   3 ++
>  tests/btrfs/group   |   1 +
>  3 files changed, 113 insertions(+)
>  create mode 100755 tests/btrfs/202
>  create mode 100644 tests/btrfs/202.out
> 
> diff --git a/tests/btrfs/202 b/tests/btrfs/202
> new file mode 100755
> index 000..60bb924
> --- /dev/null
> +++ b/tests/btrfs/202
> @@ -0,0 +1,109 @@
> +#! /bin/bash
> +# FS QA Test 203
> +#
> +# Btrfs inband dedupe with balance concurrency test
> +#
> +# This can spot inband dedupe error which will increase delayed ref on
> +# a data extent inside a RO block group
> +#
> +#---
> +# Copyright (c) 2017 Fujitsu.  All Rights Reserved.
> +#
> +# This program is free software; you can redistribute it and/or
> +# modify it under the terms of the GNU General Public License as
> +# published by the Free Software Foundation.
> +#
> +# This program is distributed in the hope that it would be useful,
> +# but WITHOUT ANY WARRANTY; without even the implied warranty of
> +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> +# GNU General Public License for more details.
> +#
> +# You should have received a copy of the GNU General Public License
> +# along with this program; if not, write the Free Software Foundation,
> +# Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
> +#---
> +#
> +
> +seq=`basename $0`
> +seqres=$RESULT_DIR/$seq
> +echo "QA output created by $seq"
> +
> +here=`pwd`
> +tmp=/tmp/$$
> +status=1 # failure is the default!
> +trap "_cleanup; exit \$status" 0 1 2 3 15
> +
> +_cleanup()
> +{
> + cd /
> + kill $populate_pid &> /dev/null
> + kill $balance_pid &> /dev/null
> + wait
> + # Check later comment for reason
> + $BTRFS_UTIL_PROG balance cancel $SCRATCH_MNT &> /dev/null
> + rm -f $tmp.*

I sometimes saw "btrfs balance start" keep $SCRATCH_MNT from being
unmounted and result in fs inconsistency after the test. Some debug code
shows that the "btrfs balance start $SCRATCH_MNT" is still running. If I
add what btrfs/061 does:

while ps aux | grep "balance start" | grep -qv grep; do
	sleep 1
done

the test finishes fine. But btrfs-balance(8) says balance cancel should
block and wait for the background balance process to finish. Seems like a
"balance cancel" bug?

Otherwise these three patches look fine to me overall, except for the
wanted "dedupe feature check"; also, btrfs/200 is quick enough to fit in
the 'quick' group.

I think the "dedupe feature check" should _notrun the tests if current
btrfs has incompatible features set, e.g. compress and/or nodatacow.

Thanks,
Eryu


Re: [PATCH 0/3] Btrfs in-band de-duplication test cases

2017-03-21 Thread Eryu Guan
On Tue, Mar 21, 2017 at 03:36:54PM +0800, Qu Wenruo wrote:
> 
> 
> At 03/21/2017 03:23 PM, Eryu Guan wrote:
> > On Thu, Mar 16, 2017 at 09:50:24AM +0800, Qu Wenruo wrote:
> > > Btrfs in-band de-duplication test cases for in-memory backend, which 
> > > covers
> > > the bugs exposed during the development.
> > 
> > Sorry, I'm having trouble enabling inband dedupe in tests, I always get
> > ioctl failure, $seqres.full shows:
> > 
> > btrfs-progs v4.10-5-g23d4fd5
> > See http://btrfs.wiki.kernel.org for more information.
> > 
> > Label:  (null)
> > UUID:   3d2cedd0-64ae-4667-ba09-9c12e28ecadd
> > Node size:  16384
> > Sector size:4096
> > Filesystem size:15.00GiB
> > Block group profiles:
> >   Data: single        8.00MiB
> >   Metadata: DUP   1.00GiB
> >   System:   DUP   8.00MiB
> > SSD detected:   no
> > Incompat features:  extref, skinny-metadata
> > Number of devices:  1
> > Devices:
> >    ID  SIZE      PATH
> >     1  15.00GiB  /dev/mapper/systemvg-testlv2
> > 
> > # /usr/local/bin/btrfs dedupe enable -f -s inmemory -b 64K /scratch
> > ERROR: failed to (enable) inband deduplication: Inappropriate ioctl for 
> > device
> > ERROR: unsupported dedupe limit combination: nr: 0, mem: 0
> 
> What I forgot to mention is that, the inband dedupe feature is an
> experimental feature, so CONFIG_BTRFS_DEBUG is needed to use this feature.

OK, I'll rebuild with BTRFS_DEBUG and try again, thanks for the info!
But IMHO, giving out an experimental message in dmesg seems sufficient,
like what the XFS reflink feature does.

> 
> Sorry for the inconvenience.
> 
> I'll also add the CONFIG_BTRFS_DEBUG check for /sys/fs/btrfs/feature/,
> so it won't export inband dedupe to users.

I'd suggest adding a new require rule to check if inband dedupe is
supported by the current kernel, by enabling inband dedupe and checking
the result, so that we don't need to update the CONFIG_BTRFS_DEBUG check
when this feature is moved out of BTRFS_DEBUG.

Thanks,
Eryu


Re: [PATCH 0/3] Btrfs in-band de-duplication test cases

2017-03-21 Thread Qu Wenruo



At 03/21/2017 03:23 PM, Eryu Guan wrote:

On Thu, Mar 16, 2017 at 09:50:24AM +0800, Qu Wenruo wrote:

Btrfs in-band de-duplication test cases for in-memory backend, which covers
the bugs exposed during the development.


Sorry, I'm having trouble enabling inband dedupe in tests, I always get
ioctl failure, $seqres.full shows:

btrfs-progs v4.10-5-g23d4fd5
See http://btrfs.wiki.kernel.org for more information.

Label:  (null)
UUID:   3d2cedd0-64ae-4667-ba09-9c12e28ecadd
Node size:  16384
Sector size:4096
Filesystem size:15.00GiB
Block group profiles:
  Data: single        8.00MiB
  Metadata: DUP   1.00GiB
  System:   DUP   8.00MiB
SSD detected:   no
Incompat features:  extref, skinny-metadata
Number of devices:  1
Devices:
   ID  SIZE      PATH
    1  15.00GiB  /dev/mapper/systemvg-testlv2

# /usr/local/bin/btrfs dedupe enable -f -s inmemory -b 64K /scratch
ERROR: failed to (enable) inband deduplication: Inappropriate ioctl for device
ERROR: unsupported dedupe limit combination: nr: 0, mem: 0


What I forgot to mention is that, the inband dedupe feature is an 
experimental feature, so CONFIG_BTRFS_DEBUG is needed to use this feature.


Sorry for the inconvenience.

I'll also add the CONFIG_BTRFS_DEBUG check for
/sys/fs/btrfs/feature/, so it won't export inband dedupe to users.


Thanks,
Qu

failed: '/usr/local/bin/btrfs dedupe enable -f -s inmemory -b 64K /scratch'

I have no special MKFS_OPTIONS or MOUNT_OPTIONS specified, all defaults.
kernel is built from your test tree. Did I miss anything?

Thanks,
Eryu







Re: [PATCH 0/3] Btrfs in-band de-duplication test cases

2017-03-21 Thread Eryu Guan
On Thu, Mar 16, 2017 at 09:50:24AM +0800, Qu Wenruo wrote:
> Btrfs in-band de-duplication test cases for in-memory backend, which covers
> the bugs exposed during the development.

Sorry, I'm having trouble enabling inband dedupe in tests, I always get
ioctl failure, $seqres.full shows:

btrfs-progs v4.10-5-g23d4fd5
See http://btrfs.wiki.kernel.org for more information.

Label:  (null)
UUID:   3d2cedd0-64ae-4667-ba09-9c12e28ecadd
Node size:  16384
Sector size:4096
Filesystem size:15.00GiB
Block group profiles:
  Data: single        8.00MiB
  Metadata: DUP   1.00GiB
  System:   DUP   8.00MiB
SSD detected:   no
Incompat features:  extref, skinny-metadata
Number of devices:  1
Devices:
   ID  SIZE      PATH
    1  15.00GiB  /dev/mapper/systemvg-testlv2

# /usr/local/bin/btrfs dedupe enable -f -s inmemory -b 64K /scratch
ERROR: failed to (enable) inband deduplication: Inappropriate ioctl for device
ERROR: unsupported dedupe limit combination: nr: 0, mem: 0
failed: '/usr/local/bin/btrfs dedupe enable -f -s inmemory -b 64K /scratch'

I have no special MKFS_OPTIONS or MOUNT_OPTIONS specified, all defaults.
kernel is built from your test tree. Did I miss anything?

Thanks,
Eryu


Re: [PATCH v2 0/3] Btrfs in-band de-duplication test cases

2017-03-21 Thread Qu Wenruo



At 03/21/2017 02:05 PM, Eryu Guan wrote:

On Tue, Mar 21, 2017 at 01:22:33PM +0800, Qu Wenruo wrote:



At 03/21/2017 12:51 PM, Eryu Guan wrote:

Hi Qu,

On Thu, Mar 16, 2017 at 05:08:48PM +0800, Qu Wenruo wrote:

Btrfs in-band de-duplication test cases for in-memory backend, which covers
the bugs exposed during the development.


Do you have a kernel tree that contains the in-band patches so that I
and others could try and run these tests?


Of course.

Kernel tree:
https://github.com/adam900710/linux.git wang_dedupe_latest

Btrfs-progs tree:
https://github.com/adam900710/btrfs-progs.git dedupe_20170306


Thanks! But I guess it's branch 'dedupe_20170316' of btrfs-progs repo,
right? There's no dedupe_20170306 branch.


My fault, it's 20170316.

Thanks,
Qu


Thanks,
Eryu







Re: [PATCH 4/5] btrfs: raid56: Don't keep rbio for later steal

2017-03-21 Thread Liu Bo
On Tue, Mar 21, 2017 at 10:23:56AM +0800, Qu Wenruo wrote:
> 
> 
> At 03/21/2017 10:08 AM, Liu Bo wrote:
> > On Tue, Mar 21, 2017 at 08:44:18AM +0800, Qu Wenruo wrote:
> > > 
> > > 
> > > At 03/21/2017 04:23 AM, Liu Bo wrote:
> > > > On Mon, Mar 20, 2017 at 02:21:48PM +0800, Qu Wenruo wrote:
> > > > > 
> > > > > 
> > > > > At 03/18/2017 10:03 AM, Liu Bo wrote:
> > > > > > On Fri, Mar 17, 2017 at 01:28:45PM +0800, Qu Wenruo wrote:
> > > > > > > 
> > > > > > > 
> > > > > > > At 03/17/2017 12:44 PM, Liu Bo wrote:
> > > > > > > > On Fri, Feb 03, 2017 at 04:20:22PM +0800, Qu Wenruo wrote:
> > > > > > > > > Before this patch, btrfs raid56 will keep raid56 rbio even 
> > > > > > > > > all its IO is
> > > > > > > > > done.
> > > > > > > > > This may save some time allocating rbio, but it can cause 
> > > > > > > > > deadly
> > > > > > > > > use-after-free bug, for the following case:
> > > > > > > > > 
> > > > > > > > > Original fs: 4 devices RAID5
> > > > > > > > > 
> > > > > > > > >Process A |  Process B
> > > > > > > > > --
> > > > > > > > >  |  Start device replace
> > > > > > > > >  |Now the fs has 5 devices
> > > > > > > > >  |devid 0: replace device
> > > > > > > > >  |devid 1~4: old devices
> > > > > > > > > btrfs_map_bio()  |
> > > > > > > > > |- __btrfs_map_block()   |
> > > > > > > > > |bbio has 5 stripes  |
> > > > > > > > > |including devid 0   |
> > > > > > > > > |- raid56_parity_write() |
> > > > > > > > >  |
> > > > > > > > > raid_write_end_io()  |
> > > > > > > > > |- rbio_orig_end_io()|
> > > > > > > > >|- unlock_stripe()|
> > > > > > > > >Keeps the old rbio for|
> > > > > > > > >later steal, which has|
> > > > > > > > >stripe for devid 0|
> > > > > > > > >  |  Cancel device replace
> > > > > > > > >  |Now the fs has 4 devices
> > > > > > > > >  |devid 0 is freed
> > > > > > > > > Some IO happens  |
> > > > > > > > > raid_write_end_io()  |
> > > > > > > > > |- rbio_orig_end_io()|
> > > > > > > > >|- unlock_stripe()|
> > > > > > > > >   |- steal_rbio()|
> > > > > > > > >Use old rbio, whose   |
> > > > > > > > >bbio has freed devid 0|
> > > > > > > > >stripe|
> > > > > > > > > Any access to rbio->bbio will|
> > > > > > > > > cause general protection or NULL |
> > > > > > > > > pointer dereference  |
> > > > > > > > > 
> > > > > > > > > Such bug can already be triggered by fstests btrfs/069 for 
> > > > > > > > > RAID5/6
> > > > > > > > > profiles.
> > > > > > > > > 
> > > > > > > > > Fix it by not keeping old rbio in unlock_stripe(), so we just 
> > > > > > > > > free the
> > > > > > > > > finished rbio and rbio->bbio, so the above problem won't happen.
> > > > > > > > > 
> > > > > > > > 
> > > > > > > > I don't think this is acceptable, keeping a cache is important 
> > > > > > > > for
> > > > > > > > raid56 write performance, could you please fix it by checking 
> > > > > > > > if the
> > > > > > > > device is missing?
> > > > > > > 
> > > > > > > Not possible, as it's keeping the btrfs_device pointer and never 
> > > > > > > release it,
> > > > > > > the stolen rbio can be kept forever until umount.
> > > > > > > 
> > > > > > 
> > > > > > steal_rbio() only takes pages from rbio->stripe_pages, so the cached
> > > > > > rbio->bbio is not going to the next IO's rbio because the cached one
> > > > > > got freed immediately in steal_rbio(), where could we dereference
> > > > > > rbio->bbio?
> > > > > 
> > > > > Did you mean in unlock_stripe(), we still keep the rbio in cache, 
> > > > > while
> > > > > release its bbio?
> > > > > 
> > > > 
> > > > I thought it was lock_stripe_add(), OK, so unlock_stripe() just caches
> > > > the current rbio and doesn't free it.  But the point about
> > > > steal_rbio() still stands, steal_rbio() is supposed to take uptodate
> > > > pages from the cached old rbio to the current processing rbio, but it
> > > > doesn't touch the cached old rbio's bbio, nor uses the cached old rbio
> > > > afterwards, instead it is the current processing rbio that would use
> > > > its bbio for writing into target devices, but it has increased its own
> > > > bio_counter.
> > > > 
> > > > > This sounds quite good but I'm afraid it may cause more problems.
> > > > > 
> > > > > Quite a lot of places are accessing rbio->bbio either for their btrfs
> > > > > logical address or stripes or even stripe->dev.
> > > > > 
> > > > 
> > > > I'm confused, 

Re: [PATCH 4/5] btrfs: raid56: Don't keep rbio for later steal

2017-03-21 Thread Qu Wenruo



At 03/21/2017 01:45 PM, Liu Bo wrote:

On Tue, Mar 21, 2017 at 10:23:56AM +0800, Qu Wenruo wrote:



At 03/21/2017 10:08 AM, Liu Bo wrote:

On Tue, Mar 21, 2017 at 08:44:18AM +0800, Qu Wenruo wrote:



At 03/21/2017 04:23 AM, Liu Bo wrote:

On Mon, Mar 20, 2017 at 02:21:48PM +0800, Qu Wenruo wrote:



At 03/18/2017 10:03 AM, Liu Bo wrote:

On Fri, Mar 17, 2017 at 01:28:45PM +0800, Qu Wenruo wrote:



At 03/17/2017 12:44 PM, Liu Bo wrote:

On Fri, Feb 03, 2017 at 04:20:22PM +0800, Qu Wenruo wrote:

Before this patch, btrfs raid56 will keep raid56 rbio even all its IO is
done.
This may save some time allocating rbio, but it can cause deadly
use-after-free bug, for the following case:

Original fs: 4 devices RAID5

   Process A |  Process B
--
 |  Start device replace
 |Now the fs has 5 devices
 |devid 0: replace device
 |devid 1~4: old devices
btrfs_map_bio()  |
|- __btrfs_map_block()   |
|bbio has 5 stripes  |
|including devid 0   |
|- raid56_parity_write() |
 |
raid_write_end_io()  |
|- rbio_orig_end_io()|
   |- unlock_stripe()|
   Keeps the old rbio for|
   later steal, which has|
   stripe for devid 0|
 |  Cancel device replace
 |Now the fs has 4 devices
 |devid 0 is freed
Some IO happens  |
raid_write_end_io()  |
|- rbio_orig_end_io()|
   |- unlock_stripe()|
  |- steal_rbio()|
   Use old rbio, whose   |
   bbio has freed devid 0|
   stripe|
Any access to rbio->bbio will|
cause general protection or NULL |
pointer dereference  |

Such bug can already be triggered by fstests btrfs/069 for RAID5/6
profiles.

Fix it by not keeping old rbio in unlock_stripe(), so we just free the
finished rbio and rbio->bbio, so the above problem won't happen.



I don't think this is acceptable, keeping a cache is important for
raid56 write performance, could you please fix it by checking if the
device is missing?


Not possible, as it's keeping the btrfs_device pointer and never releases it;
the stolen rbio can be kept forever until umount.



steal_rbio() only takes pages from rbio->stripe_pages, so the cached
rbio->bbio is not going to the next IO's rbio because the cached one
got freed immediately in steal_rbio(), where could we dereference
rbio->bbio?


Did you mean in unlock_stripe(), we still keep the rbio in cache, while
release its bbio?



I thought it was lock_stripe_add(), OK, so unlock_stripe() just caches
the current rbio and doesn't free it.  But the point about
steal_rbio() still stands, steal_rbio() is supposed to take uptodate
pages from the cached old rbio to the current processing rbio, but it
doesn't touch the cached old rbio's bbio, nor uses the cached old rbio
afterwards, instead it is the current processing rbio that would use
its bbio for writing into target devices, but it has increased its own
bio_counter.


This sounds quite good but I'm afraid it may cause more problems.

Quite a lot of places are accessing rbio->bbio either for their btrfs
logical address or stripes or even stripe->dev.



I'm confused, could you please specify the call trace of general
protection you got in the commit log?


The 4th and 5th patches are designed to fix the same problem.

If one applies the 5th patch only and runs btrfs/069, it will cause a hang
when aborting the replace, since the btrfs_device of dev 0 is held in
bbio->stripes[] and never gets released.

The 4th patch is used to solve such hang.



OK, I see.

Firstly, the above commit log is a bit misleading because it
says that steal_rbio() dereferences the device of the cached rbio and
that device has got free'd, but steal_rbio() actually doesn't.  Yes,
the cached rbio holds a reference on the free'd device, but I think
the below 'NULL pointer dereference' comes from writing back pages
into target devices when doing RMW with the current rbio instead of
the old cached one, right?


Yes, steal_rbio() is not related to this problem, sorry for the confusion.



Secondly, if it is the current rbio that ends up with this 'NULL pointer
dereference', it could hold a bio_counter and let the replace thread
canceled by scrub wait for this bio_counter to reach zero.  Although
btrfs_dev_replace_finishing() has flushed delalloc IO and committed the
transaction, it seems scrub is an exception because it can continue
after committing the transaction.


If I understand it correctly, did you mean holding the bio_counter while the
rbio holds the bbio?
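(If that is the idea, a sketch using the existing btrfs_bio_counter_*
helpers; tying them to the rbio lifetime would be new code, i.e. this is
only the proposal being discussed, not anything in-tree:

	/* when an rbio takes ownership of a bbio */
	btrfs_bio_counter_inc_noblocked(fs_info);
	rbio->bbio = bbio;

	/* ... */

	/* when the rbio is finally freed */
	btrfs_put_bbio(rbio->bbio);
	btrfs_bio_counter_dec(fs_info);

That way the replace-finish/cancel path, which waits for the bio counter
to drain, could not complete while any rbio still referenced the target
device.)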



We have bio_counter to prevent the race against a 

Re: [PATCH] fstests: add test for btrfs send/receive with sparse files

2017-03-21 Thread Qu Wenruo

Hi Filipe,

At 02/15/2017 04:35 AM, fdman...@kernel.org wrote:

From: Filipe Manana 

Test that both full and incremental btrfs send operations preserve file
holes.


I found that the test case fails with the compress=lzo mount option.

In fact, it turns out to be 2 bugs:

[Inline-then-regular layout]
Furthermore, it seems that it can create an inline-then-regular file
extent layout with compress=lzo.


It does not always create such an inline-then-regular layout, but the
probability seems quite high on my test box, near 80%.


Btrfs dump tree for snap2 (258):
item 9 key (257 INODE_REF 256) itemoff 15736 itemsize 13
inode ref index 2 namelen 3 name: foo
item 10 key (257 EXTENT_DATA 0) itemoff 15663 itemsize 73
generation 7 type 0 (inline)
inline extent data size 52 ram_bytes 4096 compression 2 (lzo)

item 11 key (257 EXTENT_DATA 4096) itemoff 15610 itemsize 53
generation 8 type 1 (regular)
extent data disk byte 0 nr 0
extent data offset 0 nr 1048576 ram 1052672
extent compression 0 (none)
item 12 key (257 EXTENT_DATA 1052672) itemoff 15557 itemsize 53
generation 8 type 1 (regular)
extent data disk byte 12582912 nr 4096
extent data offset 0 nr 4096 ram 4096
extent compression 0 (none)

[Send-stream without hole]
This bug is 100% reproducible: the send stream of snap2 contains no
hole, just the full file contents.


Thanks,
Qu



This used to fail when the filesystem had the NO_HOLES feature enabled,
that is, when the test is run with MKFS_OPTIONS="-O no-holes".

This is fixed by the following patch for the linux kernel:

  "Btrfs: incremental send, fix unnecessary hole writes for sparse files"

Signed-off-by: Filipe Manana 
---
 tests/btrfs/137 | 141 
 tests/btrfs/137.out |  63 +++
 tests/btrfs/group   |   1 +
 3 files changed, 205 insertions(+)
 create mode 100755 tests/btrfs/137
 create mode 100644 tests/btrfs/137.out

diff --git a/tests/btrfs/137 b/tests/btrfs/137
new file mode 100755
index 000..3ff2c6b
--- /dev/null
+++ b/tests/btrfs/137
@@ -0,0 +1,141 @@
+#! /bin/bash
+# FS QA Test No. btrfs/137
+#
+# Test that both incremental and full send operations preserve file holes.
+#
+#---
+#
+# Copyright (C) 2017 SUSE Linux Products GmbH. All Rights Reserved.
+# Author: Filipe Manana 
+#
+# This program is free software; you can redistribute it and/or
+# modify it under the terms of the GNU General Public License as
+# published by the Free Software Foundation.
+#
+# This program is distributed in the hope that it would be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program; if not, write the Free Software Foundation,
+# Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
+#---
+#
+
+seq=`basename $0`
+seqres=$RESULT_DIR/$seq
+echo "QA output created by $seq"
+
+tmp=/tmp/$$
+status=1   # failure is the default!
+trap "_cleanup; exit \$status" 0 1 2 3 15
+
+_cleanup()
+{
+   cd /
+   rm -fr $send_files_dir
+   rm -f $tmp.*
+}
+
+# get standard environment, filters and checks
+. ./common/rc
+. ./common/filter
+. ./common/punch
+
+# real QA test starts here
+_supported_fs btrfs
+_supported_os Linux
+_require_test
+_require_scratch
+_require_xfs_io_command "fiemap"
+
+send_files_dir=$TEST_DIR/btrfs-test-$seq
+
+rm -f $seqres.full
+rm -fr $send_files_dir
+mkdir $send_files_dir
+
+_scratch_mkfs >>$seqres.full 2>&1
+_scratch_mount
+
+# Create the first test file.
+$XFS_IO_PROG -f -c "pwrite -S 0xaa 0 4K" $SCRATCH_MNT/foo | _filter_xfs_io
+
+# Create a second test file with a 1Mb hole.
+$XFS_IO_PROG -f \
+ -c "pwrite -S 0xaa 0 4K" \
+ -c "pwrite -S 0xbb 1028K 4K" \
+ $SCRATCH_MNT/bar | _filter_xfs_io
+
+$BTRFS_UTIL_PROG subvolume snapshot -r $SCRATCH_MNT \
+   $SCRATCH_MNT/snap1 >/dev/null
+
+# Now add one new extent to our first test file, increasing its size and
+# leaving a 1Mb hole between the first extent and this new extent.
+$XFS_IO_PROG -c "pwrite -S 0xbb 1028K 4K" $SCRATCH_MNT/foo | _filter_xfs_io
+
+# Now overwrite the last extent of our second test file.
+$XFS_IO_PROG -c "pwrite -S 0xcc 1028K 4K" $SCRATCH_MNT/bar | _filter_xfs_io
+
+$BTRFS_UTIL_PROG subvolume snapshot -r $SCRATCH_MNT \
+$SCRATCH_MNT/snap2 >/dev/null
+
+echo
+echo "File digests in the original filesystem:"
+md5sum $SCRATCH_MNT/snap1/foo | _filter_scratch
+md5sum $SCRATCH_MNT/snap1/bar | 

Re: [PATCH v2 0/3] Btrfs in-band de-duplication test cases

2017-03-21 Thread Eryu Guan
On Tue, Mar 21, 2017 at 01:22:33PM +0800, Qu Wenruo wrote:
> 
> 
> At 03/21/2017 12:51 PM, Eryu Guan wrote:
> > Hi Qu,
> > 
> > On Thu, Mar 16, 2017 at 05:08:48PM +0800, Qu Wenruo wrote:
> > > Btrfs in-band de-duplication test cases for in-memory backend, which 
> > > covers
> > > the bugs exposed during the development.
> > 
> > Do you have a kernel tree that contains the in-band patches so that I
> > and others could try and run these tests?
> 
> Of course.
> 
> Kernel tree:
> https://github.com/adam900710/linux.git wang_dedupe_latest
> 
> Btrfs-progs tree:
> https://github.com/adam900710/btrfs-progs.git dedupe_20170306

Thanks! But I guess it's branch 'dedupe_20170316' of btrfs-progs repo,
right? There's no dedupe_20170306 branch.

Thanks,
Eryu