Re: [PATCH] btrfs: add missing memset while reading compressed inline extents
At 03/09/2017 10:05 AM, Zygo Blaxell wrote:
On Wed, Mar 08, 2017 at 10:27:33AM +, Filipe Manana wrote:
On Wed, Mar 8, 2017 at 3:18 AM, Zygo Blaxell wrote:
From: Zygo Blaxell

This is a story about 4 distinct (and very old) btrfs bugs.

Commit c8b978188c ("Btrfs: Add zlib compression support") added three data corruption bugs for inline extents (bugs #1-3).

Commit 93c82d5750 ("Btrfs: zero page past end of inline file items") fixed bug #1: uncompressed inline extents followed by a hole and more extents could get non-zero data in the hole as they were read. The fix was to add a memset in btrfs_get_extent to zero out the hole.

Commit 166ae5a418 ("btrfs: fix inline compressed read err corruption") fixed bug #2: compressed inline extents which contained non-zero bytes might be replaced with zero bytes in some cases. This patch removed an unhelpful memset from uncompress_inline, but the case where memset is required was missed.

There is also a memset in the decompression code, but this only covers decompressed data that is shorter than the ram_bytes from the extent ref record. This memset doesn't cover the region between the end of the decompressed data and the end of the page. It has also moved around a few times over the years, so there's no single patch to refer to.

This patch fixes bug #3: compressed inline extents followed by a hole and more extents could get non-zero data in the hole as they were read (i.e. bug #3 is the same as bug #1, but s/uncompressed/compressed/). The fix is the same: zero out the hole in the compressed case too, by putting a memset back in uncompress_inline, but this time with correct parameters.

The last and oldest bug, bug #0, is the cause of the offending inline extent/hole/extent pattern. Bug #0 is a subtle and mostly-harmless quirk of behavior somewhere in the btrfs write code. In a few special cases, an inline extent and hole are allowed to persist where they normally would be combined with later extents in the file.
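The zero-fill described above — everything between the end of the decompressed data and the end of the page must be memset to zero — can be sketched as a userspace model (this is a hedged illustration, not the kernel code; PAGE_SIZE and the helper name are invented for the sketch):

```python
PAGE_SIZE = 4096

def fill_page_from_inline(decompressed: bytes, ram_bytes: int) -> bytes:
    """Toy model of reading a compressed inline extent into one page.

    The decompressed payload may be shorter than both ram_bytes and the
    page; everything past it must be zeroed (the missing memset), or the
    page keeps whatever stale bytes it held -- the infoleak of bug #3.
    """
    page = bytearray(b"\xAA" * PAGE_SIZE)     # simulate stale page contents
    n = min(len(decompressed), ram_bytes, PAGE_SIZE)
    page[:n] = decompressed[:n]
    page[n:PAGE_SIZE] = bytes(PAGE_SIZE - n)  # the memset(0) this patch restores
    return bytes(page)
```

With the item-63 numbers from the dump below (ram 4085 in a 4096-byte page), the last 11 bytes of the page must read back as zeroes rather than stale memory.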
A fast reproducer for bug #0 is presented below. A few offending extents are also created in the wild during large rsync transfers with the -S flag. A Linux kernel build (git checkout; make allyesconfig; make -j8) will produce a handful of offending files as well. Once an offending file is created, it can present different content to userspace each time it is read.

Bug #0 is at least 4 and possibly 8 years old. I verified that every vX.Y kernel back to v3.5 has this behavior. There are fossil records of this bug's effects in commits all the way back to v2.6.32. I have no reason to believe bug #0 wasn't present at the beginning of btrfs compression support in v2.6.29, but I can't easily test kernels that old to be sure.

It is not clear whether bug #0 is worth fixing. A fix would likely require injecting extra reads into currently write-only paths, and most of the exceptional cases caused by bug #0 are already handled now. Whether we like them or not, bug #0's inline extents followed by holes are part of the btrfs de-facto disk format now, and we need to be able to read them without data corruption or an infoleak.

So enough about bug #0, let's get back to bug #3 (this patch). An example of on-disk structure leading to data corruption:

	item 61 key (606890 INODE_ITEM 0) itemoff 9662 itemsize 160
		inode generation 50 transid 50 size 47424 nbytes 49141
		block group 0 mode 100644 links 1 uid 0 gid 0 rdev 0 flags 0x0(none)
	item 62 key (606890 INODE_REF 603050) itemoff 9642 itemsize 20
		inode ref index 3 namelen 10 name: DB_File.so
	item 63 key (606890 EXTENT_DATA 0) itemoff 8280 itemsize 1362
		inline extent data size 1341 ram 4085 compress(zlib)
	item 64 key (606890 EXTENT_DATA 4096) itemoff 8227 itemsize 53
		extent data disk byte 5367308288 nr 20480
		extent data offset 0 nr 45056 ram 45056
		extent compression(zlib)

So this case is actually different from the reproducer below, because once a file has prealloc extents, future writes will never be compressed.
That is, the extent at offset 4096 can not have compressed content using steps like the reproducer below. I can only imagine that after an falloc to create the extent at 4K, we cloned some compressed extent from some other file into that offset.

The above comes from one of my captures of the bug appearing in the wild, not the reproducer from Liu Bo. I'll make that clearer.

In fact, I found that btrfs/137 with compress=lzo seems to trigger such an "inline-then-regular" case with very high probability. If I abort the test just after populating the fs, the probability of having an inline-then-regular extent layout is over 70%. Although in the btrfs/137 case, it's "inline-then-hole", not completely "inline-then-regular". Hope such a reproducer can help to track down the bug.
Re: [PATCH RFC] btrfs: introduce a separate mutex for caching_block_groups list
On Sun, Mar 19, 2017 at 07:18:59PM +0200, Alex Lyakas wrote:
> We have a commit_root_sem, which is a read-write semaphore that protects the
> commit roots. But it is also used to protect the list of caching block
> groups. As a result, while doing "slow" caching, the following issue is seen:
>
> Some of the caching threads are scanning the extent tree with
> commit_root_sem acquired in shared mode, with stack like:
> [] read_extent_buffer_pages+0x2d2/0x300 [btrfs]
> [] btree_read_extent_buffer_pages.constprop.50+0xb7/0x1e0 [btrfs]
> [] read_tree_block+0x40/0x70 [btrfs]
> [] read_block_for_search.isra.33+0x12c/0x370 [btrfs]
> [] btrfs_search_slot+0x3c6/0xb10 [btrfs]
> [] caching_thread+0x1b9/0x820 [btrfs]
> [] normal_work_helper+0xc6/0x340 [btrfs]
> [] btrfs_cache_helper+0x12/0x20 [btrfs]
>
> IO requests that want to allocate space are waiting in cache_block_group()
> to acquire the commit_root_sem in exclusive mode. But they only want to add
> the caching control structure to the list of caching block-groups:
> [] schedule+0x29/0x70
> [] rwsem_down_write_failed+0x145/0x320
> [] call_rwsem_down_write_failed+0x13/0x20
> [] cache_block_group+0x25b/0x450 [btrfs]
> [] find_free_extent+0xd16/0xdb0 [btrfs]
> [] btrfs_reserve_extent+0xaf/0x160 [btrfs]
>
> Other caching threads want to continue their scanning, and for that they
> are waiting to acquire commit_root_sem in shared mode. But since there are
> IO threads that want the exclusive lock, the caching threads are unable
> to continue the scanning, because (I presume) rw_semaphore guarantees some
> fairness:
> [] schedule+0x29/0x70
> [] rwsem_down_read_failed+0xc5/0x120
> [] call_rwsem_down_read_failed+0x14/0x30
> [] caching_thread+0x1a1/0x820 [btrfs]
> [] normal_work_helper+0xc6/0x340 [btrfs]
> [] btrfs_cache_helper+0x12/0x20 [btrfs]
> [] process_one_work+0x146/0x410
>
> This causes slowness of the IO, especially when there are many block groups
> that need to be scanned for free space.
> In some cases it takes minutes until a single IO thread is able to
> allocate free space.
>
> I don't see a deadlock here, because the caching threads that were able
> to acquire the commit_root_sem will call rwsem_is_contended() and should
> give up the semaphore, so that IO threads are able to acquire it in
> exclusive mode.
>
> However, introducing a separate mutex that protects only the list of
> caching block groups makes things move forward much faster.

The problem did exist and the patch looks good to me.

> This patch is based on kernel 3.18. Unfortunately, I am not able to
> submit a patch based on one of the latest kernels, because here btrfs is
> part of the larger system, and upgrading the kernel is a significant
> effort. Hence marking the patch as RFC. Hopefully, this patch still has
> some value to the community.
>
> Signed-off-by: Alex Lyakas
>
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index 42d11e7..74feacb 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -1490,6 +1490,8 @@ struct btrfs_fs_info {
>  	struct list_head trans_list;
>  	struct list_head dead_roots;
>  	struct list_head caching_block_groups;
> +	/* protects the above list */
> +	struct mutex caching_block_groups_mutex;
>
>  	spinlock_t delayed_iput_lock;
>  	struct list_head delayed_iputs;
> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index 5177954..130ec58 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -2229,6 +2229,7 @@ int open_ctree(struct super_block *sb,
>  	INIT_LIST_HEAD(&fs_info->delayed_iputs);
>  	INIT_LIST_HEAD(&fs_info->delalloc_roots);
>  	INIT_LIST_HEAD(&fs_info->caching_block_groups);
> +	mutex_init(&fs_info->caching_block_groups_mutex);
>  	spin_lock_init(&fs_info->delalloc_root_lock);
>  	spin_lock_init(&fs_info->trans_lock);
>  	spin_lock_init(&fs_info->fs_roots_radix_lock);
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index a067065..906fb08 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -637,10 +637,10 @@
> static int cache_block_group(struct btrfs_block_group_cache *cache,
>  		return 0;
>  	}
>
> -	down_write(&fs_info->commit_root_sem);
> +	mutex_lock(&fs_info->caching_block_groups_mutex);
>  	atomic_inc(&caching_ctl->count);
>  	list_add_tail(&caching_ctl->list, &fs_info->caching_block_groups);
> -	up_write(&fs_info->commit_root_sem);
> +	mutex_unlock(&fs_info->caching_block_groups_mutex);
>
>  	btrfs_get_block_group(cache);
>
> @@ -5693,6 +5693,7 @@ void btrfs_prepare_extent_commit(struct btrfs_trans_handle *trans,
>
>  	down_write(&fs_info->commit_root_sem);

With the new mutex, it's not necessary to take commit_root_sem here because a) pinned_extents could not be modified outside of a transaction, and b) while at btrfs_prepare_extent_commit(), it's already at the very end of committing a transaction. caching_ctl should have at least one reference and caching_ctl->progress is supposed to be protected by
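The effect of the locking change can be shown with a toy model (illustrative Python, not kernel code; the class and names are invented for the sketch): after the patch, adding a caching control to the list takes only the new dedicated mutex, so an allocator no longer queues behind commit_root_sem even while caching threads hold it in shared mode.

```python
import threading

class FsInfo:
    """Minimal stand-in for struct btrfs_fs_info's two locks and the list."""
    def __init__(self):
        self.commit_root_sem = threading.Lock()            # stand-in for the rwsem
        self.caching_block_groups_mutex = threading.Lock() # the new mutex
        self.caching_block_groups = []

def cache_block_group(fs_info, caching_ctl):
    # Post-patch behavior: list insertion is guarded by its own mutex only,
    # so it never has to wait for commit_root_sem writers or readers.
    with fs_info.caching_block_groups_mutex:
        fs_info.caching_block_groups.append(caching_ctl)

fs = FsInfo()
# A caching thread is scanning with commit_root_sem held (shared mode in
# the real code); the allocator can still queue its caching_ctl at once.
fs.commit_root_sem.acquire()
cache_block_group(fs, "caching_ctl for block group X")
fs.commit_root_sem.release()
```

Pre-patch, the second step would have blocked until the scan dropped the semaphore, which is exactly the minutes-long stall described in the report.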
Re: [RFC PATCH v1 00/30] fs: inode->i_version rework and optimization
On Tue, 2017-03-21 at 15:13 -0400, J. Bruce Fields wrote: > On Tue, Mar 21, 2017 at 02:46:53PM -0400, Jeff Layton wrote: > > On Tue, 2017-03-21 at 14:30 -0400, J. Bruce Fields wrote: > > > On Tue, Mar 21, 2017 at 01:23:24PM -0400, Jeff Layton wrote: > > > > On Tue, 2017-03-21 at 12:30 -0400, J. Bruce Fields wrote: > > > > > - It's durable; the above comparison still works if there were reboots > > > > > between the two i_version checks. > > > > > - I don't know how realistic this is--we may need to figure out > > > > > if there's a weaker guarantee that's still useful. Do > > > > > filesystems actually make ctime/mtime/i_version changes > > > > > atomically with the changes that caused them? What if a > > > > > change attribute is exposed to an NFS client but doesn't make > > > > > it to disk, and then that value is reused after reboot? > > > > > > > > > > > > > Yeah, there could be atomicity there. If we bump i_version, we'll mark > > > > the inode dirty and I think that will end up with the new i_version at > > > > least being journalled before __mark_inode_dirty returns. > > > > > > So you think the filesystem can provide the atomicity? In more detail: > > > > > > > Sorry, I hit send too quickly. That should have read: > > > > "Yeah, there could be atomicity issues there." > > > > I think providing that level of atomicity may be difficult, though > > maybe there's some way to make the querying of i_version block until > > the inode update has been journalled? > > No idea. Anyway, I'd like to figure out some reasonable requirement > that we can document. > > > > > > - if I write to a file, a simultaneous reader should see either > > > (old data, old i_version) or (new data, new i_version), not a > > > combination of the two. > > > - ditto for metadata modifications. > > > - the same should be true if there's a crash. 
> > > > > > (If that's not possible, then I think we could live with a brief window > > > of (new data, old i_version) as long as it doesn't persist beyond sync?) > > > > > > > That said, I suppose it is possible for us to bump the counter, hand > > > > that new counter value out to a NFS client and then the box crashes > > > > before it makes it to the journal. > > > > > > > > Not sure how big a problem that really is. > > > > > > The other case I was wondering about may have been unclear. Represent > > > the state of a file by a (data, i_version) pair. Say: > > > > > > - file is modified from (F, V) to (F', V+1). > > > - client reads and caches (F', V+1). > > > - server crashes before writeback, so disk still has (F, V). > > > - after restart, someone else modifies file to (F'', V+1). > > > - original client revalidates its cache, sees V+1, concludes > > > file data is still F'. > > > > > > This may not cause a real problem for clients depending only on > > > traditional NFS close-to-open semantics. > > > > > > > > > > No, I think that is a legitimate problem. > > > > That said, after F'', the mtime would almost certainly be different > > from the time after F', and that would likely be enough to prevent > > confusion in NFS clients. > > Oh, good point. So, may be worth saying that anyone wanting to make > sense of these across reboot should compare times as well (maybe that > should be in nfs rfc's too). I think that should be ctime not mtime, > though? > Yes, it might be worth a mention there. IIRC, it does mention that you shouldn't just look at a single attribute for cache validation purposes, but the wording is a bit vague. I can't find the section at the moment though. The more I think about it though, simply ensuring that we don't publish a new change attr until the inode update has hit the journal may be the best we can do. I'd have to think about how to implement that though. 
-- 
Jeff Layton

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
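Jeff's closing idea — not publishing a new change attribute until the inode update has hit the journal — could be modeled like this (a sketch of the proposed semantics only, not an implementation; the class and method names are invented):

```python
class Inode:
    """i_version is bumped with each change, but clients are only shown
    the last value covered by a completed journal commit."""
    def __init__(self):
        self.i_version = 1          # in-memory counter, bumped on change
        self.durable_version = 1    # highest value known to be on disk
        self.data = b""

    def write(self, data):
        self.data = data
        self.i_version += 1         # journalled asynchronously

    def journal_commit(self):
        self.durable_version = self.i_version

    def change_attr_for_client(self):
        # Never hand out a value that a crash could roll back and reuse.
        return self.durable_version
```

Under this scheme a client can never cache data against a version number that does not survive a crash, at the cost of the queried attribute lagging behind in-flight writes until the commit completes.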
Re: [RFC PATCH v1 00/30] fs: inode->i_version rework and optimization
On Tue, Mar 21, 2017 at 01:23:24PM -0400, Jeff Layton wrote:
> On Tue, 2017-03-21 at 12:30 -0400, J. Bruce Fields wrote:
> > - It's durable; the above comparison still works if there were reboots
> > between the two i_version checks.
> > - I don't know how realistic this is--we may need to figure out
> > if there's a weaker guarantee that's still useful. Do
> > filesystems actually make ctime/mtime/i_version changes
> > atomically with the changes that caused them? What if a
> > change attribute is exposed to an NFS client but doesn't make
> > it to disk, and then that value is reused after reboot?
>
> Yeah, there could be atomicity there. If we bump i_version, we'll mark
> the inode dirty and I think that will end up with the new i_version at
> least being journalled before __mark_inode_dirty returns.

The change may be journalled, but it isn't guaranteed stable until fsync
is run on the inode.

NFS server operations commit the metadata changed by a modification
through ->commit_metadata or sync_inode_metadata() before the response
is sent back to the client, hence guaranteeing that i_version changes
through the NFS server are stable and durable. This is not the case for
normal operations done through the POSIX API - the journalling is
asynchronous and the only durability guarantees are provided by fsync().

> That said, I suppose it is possible for us to bump the counter, hand
> that new counter value out to a NFS client and then the box crashes
> before it makes it to the journal.

Yup, this has always been a problem when you mix posix applications
running on the NFS server modifying the same files as the NFS clients
are accessing and requiring synchronisation.

> Not sure how big a problem that really is.

This coherency problem has always existed on the server side...

Cheers,
Dave.
-- 
Dave Chinner
da...@fromorbit.com
Re: [RFC PATCH v1 00/30] fs: inode->i_version rework and optimization
On Tue, Mar 21, 2017 at 02:46:53PM -0400, Jeff Layton wrote: > On Tue, 2017-03-21 at 14:30 -0400, J. Bruce Fields wrote: > > On Tue, Mar 21, 2017 at 01:23:24PM -0400, Jeff Layton wrote: > > > On Tue, 2017-03-21 at 12:30 -0400, J. Bruce Fields wrote: > > > > - It's durable; the above comparison still works if there were reboots > > > > between the two i_version checks. > > > > - I don't know how realistic this is--we may need to figure out > > > > if there's a weaker guarantee that's still useful. Do > > > > filesystems actually make ctime/mtime/i_version changes > > > > atomically with the changes that caused them? What if a > > > > change attribute is exposed to an NFS client but doesn't make > > > > it to disk, and then that value is reused after reboot? > > > > > > > > > > Yeah, there could be atomicity there. If we bump i_version, we'll mark > > > the inode dirty and I think that will end up with the new i_version at > > > least being journalled before __mark_inode_dirty returns. > > > > So you think the filesystem can provide the atomicity? In more detail: > > > > Sorry, I hit send too quickly. That should have read: > > "Yeah, there could be atomicity issues there." > > I think providing that level of atomicity may be difficult, though > maybe there's some way to make the querying of i_version block until > the inode update has been journalled? No idea. Anyway, I'd like to figure out some reasonable requirement that we can document. > > > - if I write to a file, a simultaneous reader should see either > > (old data, old i_version) or (new data, new i_version), not a > > combination of the two. > > - ditto for metadata modifications. > > - the same should be true if there's a crash. > > > > (If that's not possible, then I think we could live with a brief window > > of (new data, old i_version) as long as it doesn't persist beyond sync?) 
> > > That said, I suppose it is possible for us to bump the counter, hand
> > > that new counter value out to a NFS client and then the box crashes
> > > before it makes it to the journal.
> > >
> > > Not sure how big a problem that really is.
> >
> > The other case I was wondering about may have been unclear. Represent
> > the state of a file by a (data, i_version) pair. Say:
> >
> > - file is modified from (F, V) to (F', V+1).
> > - client reads and caches (F', V+1).
> > - server crashes before writeback, so disk still has (F, V).
> > - after restart, someone else modifies file to (F'', V+1).
> > - original client revalidates its cache, sees V+1, concludes
> >   file data is still F'.
> >
> > This may not cause a real problem for clients depending only on
> > traditional NFS close-to-open semantics.
>
> No, I think that is a legitimate problem.
>
> That said, after F'', the mtime would almost certainly be different
> from the time after F', and that would likely be enough to prevent
> confusion in NFS clients.

Oh, good point. So, may be worth saying that anyone wanting to make
sense of these across reboot should compare times as well (maybe that
should be in nfs rfc's too). I think that should be ctime not mtime,
though?

--b.
Re: [RFC PATCH v1 00/30] fs: inode->i_version rework and optimization
On Tue, Mar 21, 2017 at 01:23:24PM -0400, Jeff Layton wrote:
> On Tue, 2017-03-21 at 12:30 -0400, J. Bruce Fields wrote:
> > - It's durable; the above comparison still works if there were reboots
> > between the two i_version checks.
> > - I don't know how realistic this is--we may need to figure out
> > if there's a weaker guarantee that's still useful. Do
> > filesystems actually make ctime/mtime/i_version changes
> > atomically with the changes that caused them? What if a
> > change attribute is exposed to an NFS client but doesn't make
> > it to disk, and then that value is reused after reboot?
>
> Yeah, there could be atomicity there. If we bump i_version, we'll mark
> the inode dirty and I think that will end up with the new i_version at
> least being journalled before __mark_inode_dirty returns.

So you think the filesystem can provide the atomicity? In more detail:

- if I write to a file, a simultaneous reader should see either
  (old data, old i_version) or (new data, new i_version), not a
  combination of the two.
- ditto for metadata modifications.
- the same should be true if there's a crash.

(If that's not possible, then I think we could live with a brief window
of (new data, old i_version) as long as it doesn't persist beyond sync?)

> That said, I suppose it is possible for us to bump the counter, hand
> that new counter value out to a NFS client and then the box crashes
> before it makes it to the journal.
>
> Not sure how big a problem that really is.

The other case I was wondering about may have been unclear. Represent
the state of a file by a (data, i_version) pair. Say:

- file is modified from (F, V) to (F', V+1).
- client reads and caches (F', V+1).
- server crashes before writeback, so disk still has (F, V).
- after restart, someone else modifies file to (F'', V+1).
- original client revalidates its cache, sees V+1, concludes
  file data is still F'.
This may not cause a real problem for clients depending only on
traditional NFS close-to-open semantics.

--b.
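The version-reuse scenario above is easy to reproduce in a toy model (illustrative Python; `Server` and its methods are invented for the sketch, not any real NFS code):

```python
class Server:
    def __init__(self, data, version):
        self.data, self.version = data, version
        self.disk = (data, version)            # state that survives a crash

    def modify(self, new_data):
        self.data, self.version = new_data, self.version + 1

    def writeback(self):
        self.disk = (self.data, self.version)

    def crash_and_restart(self):
        self.data, self.version = self.disk    # in-memory changes are lost

srv = Server("F", 1)
srv.modify("F'")                       # now (F', 2), never written back
client_cache = (srv.data, srv.version) # client caches (F', 2)
srv.crash_and_restart()                # disk still holds (F, 1)
srv.modify("F''")                      # version 2 is minted a second time
# Revalidating by i_version alone now gives a false "still valid":
stale_hit = client_cache[1] == srv.version and client_cache[0] != srv.data
```

Here `stale_hit` comes out true: the client keeps serving F' although the file is F''. Comparing ctime as well, as discussed elsewhere in the thread, would normally catch this because the post-restart modification happens at a different time.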
Re: [RFC PATCH v1 00/30] fs: inode->i_version rework and optimization
On Tue, 2017-03-21 at 14:30 -0400, J. Bruce Fields wrote: > On Tue, Mar 21, 2017 at 01:23:24PM -0400, Jeff Layton wrote: > > On Tue, 2017-03-21 at 12:30 -0400, J. Bruce Fields wrote: > > > - It's durable; the above comparison still works if there were reboots > > > between the two i_version checks. > > > - I don't know how realistic this is--we may need to figure out > > > if there's a weaker guarantee that's still useful. Do > > > filesystems actually make ctime/mtime/i_version changes > > > atomically with the changes that caused them? What if a > > > change attribute is exposed to an NFS client but doesn't make > > > it to disk, and then that value is reused after reboot? > > > > > > > Yeah, there could be atomicity there. If we bump i_version, we'll mark > > the inode dirty and I think that will end up with the new i_version at > > least being journalled before __mark_inode_dirty returns. > > So you think the filesystem can provide the atomicity? In more detail: > Sorry, I hit send too quickly. That should have read: "Yeah, there could be atomicity issues there." I think providing that level of atomicity may be difficult, though maybe there's some way to make the querying of i_version block until the inode update has been journalled? > - if I write to a file, a simultaneous reader should see either > (old data, old i_version) or (new data, new i_version), not a > combination of the two. > - ditto for metadata modifications. > - the same should be true if there's a crash. > > (If that's not possible, then I think we could live with a brief window > of (new data, old i_version) as long as it doesn't persist beyond sync?) > > > That said, I suppose it is possible for us to bump the counter, hand > > that new counter value out to a NFS client and then the box crashes > > before it makes it to the journal. > > > > Not sure how big a problem that really is. > > The other case I was wondering about may have been unclear. 
> Represent the state of a file by a (data, i_version) pair. Say:
>
> - file is modified from (F, V) to (F', V+1).
> - client reads and caches (F', V+1).
> - server crashes before writeback, so disk still has (F, V).
> - after restart, someone else modifies file to (F'', V+1).
> - original client revalidates its cache, sees V+1, concludes
>   file data is still F'.
>
> This may not cause a real problem for clients depending only on
> traditional NFS close-to-open semantics.

No, I think that is a legitimate problem.

That said, after F'', the mtime would almost certainly be different from
the time after F', and that would likely be enough to prevent confusion
in NFS clients. Applications that are just looking at i_version via
statx() could get confused though.

-- 
Jeff Layton
Re: [RFC PATCH v1 00/30] fs: inode->i_version rework and optimization
On Tue, Mar 21, 2017 at 01:37:04PM -0400, J. Bruce Fields wrote:
> On Tue, Mar 21, 2017 at 01:23:24PM -0400, Jeff Layton wrote:
> > On Tue, 2017-03-21 at 12:30 -0400, J. Bruce Fields wrote:
> > > - NFS doesn't actually require that it increases, but I think it
> > > should. I assume 64 bits means we don't need a discussion of
> > > wraparound.
> >
> > I thought NFS spec required that you be able to recognize old change
> > attributes so that they can be discarded. I could be wrong here though.
> > I'd have to go back and look through the spec to be sure.
>
> https://tools.ietf.org/html/rfc7862#section-10

So, I'm suggesting we implement this one:

    NFS4_CHANGE_TYPE_IS_MONOTONIC_INCR: The change attribute value MUST
    monotonically increase for every atomic change to the file
    attributes, data, or directory contents.

It may be a slight lie--after your patches we wouldn't actually increase
"for every atomic change". I think that's OK.

--b.
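The "slight lie" being accepted here — monotonic whenever it changes, but not bumped for literally every change — matches the optimization in the series as I read it: only bump i_version if the current value may have been queried since the last bump. A hedged sketch of that scheme (names invented; not the actual kernel implementation):

```python
class LazyIVersion:
    """Bump the counter on a change only if the current value may have
    been seen by a reader; unseen intermediate changes share one bump."""
    def __init__(self):
        self.version = 1
        self.seen = False

    def modify(self):
        if self.seen:
            self.version += 1
            self.seen = False

    def query(self):
        self.seen = True
        return self.version
```

Two writes between queries cost one bump instead of two, yet any observer that saw the old value is guaranteed to see a strictly larger one after a change, so the monotonic-increase property still holds from the client's point of view.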
Re: [RFC PATCH v1 00/30] fs: inode->i_version rework and optimization
On Tue, Mar 21, 2017 at 01:23:24PM -0400, Jeff Layton wrote:
> On Tue, 2017-03-21 at 12:30 -0400, J. Bruce Fields wrote:
> > - NFS doesn't actually require that it increases, but I think it
> > should. I assume 64 bits means we don't need a discussion of
> > wraparound.
>
> I thought NFS spec required that you be able to recognize old change
> attributes so that they can be discarded. I could be wrong here though.
> I'd have to go back and look through the spec to be sure.

https://tools.ietf.org/html/rfc7862#section-10

--b.
Re: [RFC PATCH v1 00/30] fs: inode->i_version rework and optimization
On Tue, 2017-03-21 at 12:30 -0400, J. Bruce Fields wrote:
> On Tue, Mar 21, 2017 at 06:45:00AM -0700, Christoph Hellwig wrote:
> > On Mon, Mar 20, 2017 at 05:43:27PM -0400, J. Bruce Fields wrote:
> > > To me, the interesting question is whether this allows us to turn on
> > > i_version updates by default on xfs and ext4.
> >
> > XFS v5 file systems have it on by default.
>
> Great, thanks.
>
> > Although we'll still need to agree on the exact semantics of i_version
> > before it's going to be useful.
>
> Once it's figured out maybe we should write it up for a manpage that
> could be used if statx starts exposing it to userspace.
>
> A first attempt:
>
> - It's a u64.
>
> - It works for regular files and directories. (What about symlinks or
>   other special types?)
>
> - It changes between two checks if and only if there were intervening
>   data or metadata changes. The change will always be an increase, but
>   the amount of the increase is meaningless.
>
>   - NFS doesn't actually require that it increases, but I think it
>     should. I assume 64 bits means we don't need a discussion of
>     wraparound.

I thought NFS spec required that you be able to recognize old change
attributes so that they can be discarded. I could be wrong here though.
I'd have to go back and look through the spec to be sure.

> - AFS wants an actual counter: if you get i_version X, then
>   write twice, then get i_version X+2, you're allowed to assume
>   your writes were the only modifications. Let's ignore this
>   for now. In the future if someone explains how to count
>   operations, then we can extend the interface to tell the
>   caller it can get those extra semantics.
>
> - It's durable; the above comparison still works if there were reboots
>   between the two i_version checks.
>
>   - I don't know how realistic this is--we may need to figure out
>     if there's a weaker guarantee that's still useful. Do
>     filesystems actually make ctime/mtime/i_version changes
>     atomically with the changes that caused them? What if a
>     change attribute is exposed to an NFS client but doesn't make
>     it to disk, and then that value is reused after reboot?

Yeah, there could be atomicity there. If we bump i_version, we'll mark
the inode dirty and I think that will end up with the new i_version at
least being journalled before __mark_inode_dirty returns.

That said, I suppose it is possible for us to bump the counter, hand
that new counter value out to a NFS client and then the box crashes
before it makes it to the journal.

Not sure how big a problem that really is.

> Am I missing any issues?

No, I think you have it covered, and that's pretty much exactly what I
had in mind as far as semantics go. Thanks for writing it up!

-- 
Jeff Layton
Re: [RFC PATCH v1 00/30] fs: inode->i_version rework and optimization
On Tue, Mar 21, 2017 at 06:45:00AM -0700, Christoph Hellwig wrote:
> On Mon, Mar 20, 2017 at 05:43:27PM -0400, J. Bruce Fields wrote:
> > To me, the interesting question is whether this allows us to turn on
> > i_version updates by default on xfs and ext4.
>
> XFS v5 file systems have it on by default.

Great, thanks.

> Although we'll still need to agree on the exact semantics of i_version
> before it's going to be useful.

Once it's figured out maybe we should write it up for a manpage that
could be used if statx starts exposing it to userspace.

A first attempt:

- It's a u64.

- It works for regular files and directories. (What about symlinks or
  other special types?)

- It changes between two checks if and only if there were intervening
  data or metadata changes. The change will always be an increase, but
  the amount of the increase is meaningless.

  - NFS doesn't actually require that it increases, but I think it
    should. I assume 64 bits means we don't need a discussion of
    wraparound.

- AFS wants an actual counter: if you get i_version X, then
  write twice, then get i_version X+2, you're allowed to assume
  your writes were the only modifications. Let's ignore this
  for now. In the future if someone explains how to count
  operations, then we can extend the interface to tell the
  caller it can get those extra semantics.

- It's durable; the above comparison still works if there were reboots
  between the two i_version checks.

  - I don't know how realistic this is--we may need to figure out
    if there's a weaker guarantee that's still useful. Do
    filesystems actually make ctime/mtime/i_version changes
    atomically with the changes that caused them? What if a
    change attribute is exposed to an NFS client but doesn't make
    it to disk, and then that value is reused after reboot?

Am I missing any issues?

--b.
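The comparison rule in the proposed semantics ("changes between two checks if and only if there were intervening changes, and the change is always an increase") gives consumers a simple revalidation test; a sketch of client-side use (hypothetical helper names, not a real API):

```python
def cache_state(data, i_version):
    """Record what a client cached together with the version it saw."""
    return {"data": data, "i_version": i_version}

def revalidate(cache, current_i_version):
    """Per the proposed semantics: equal means nothing changed; any
    change strictly increases the value, so larger means refetch."""
    if current_i_version == cache["i_version"]:
        return True                                 # cache still valid
    assert current_i_version > cache["i_version"]   # must never go backwards
    return False
```

Note this relies on the durability bullet above; without it, the crash/reuse scenario discussed earlier in the thread can defeat the equality check.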
Re: [RFC PATCH v1 00/30] fs: inode->i_version rework and optimization
On Mon, Mar 20, 2017 at 05:43:27PM -0400, J. Bruce Fields wrote:
> To me, the interesting question is whether this allows us to turn on
> i_version updates by default on xfs and ext4.

XFS v5 file systems have it on by default.

Although we'll still need to agree on the exact semantics of i_version
before it's going to be useful.
Re: [PATCH v2 3/3] fstests: btrfs: Test inband dedupe with data balance.
On Thu, Mar 16, 2017 at 05:08:51PM +0800, Qu Wenruo wrote:
> Btrfs balance will reloate date extent, but its hash is removed too late
                     ^^^ relocate
> at run_delayed_ref() time, which will cause extent ref increased
> during balance, cause either find_data_references() gives WARN_ON()
> or even run_delayed_refs() fails and cause transaction abort.
>
> Add such concurrency test for inband dedupe and data balance.
>
> Signed-off-by: Qu Wenruo
> ---
>  tests/btrfs/202     | 109
>  tests/btrfs/202.out |   3 ++
>  tests/btrfs/group   |   1 +
>  3 files changed, 113 insertions(+)
>  create mode 100755 tests/btrfs/202
>  create mode 100644 tests/btrfs/202.out
>
> diff --git a/tests/btrfs/202 b/tests/btrfs/202
> new file mode 100755
> index 000..60bb924
> --- /dev/null
> +++ b/tests/btrfs/202
> @@ -0,0 +1,109 @@
> +#! /bin/bash
> +# FS QA Test 203
> +#
> +# Btrfs inband dedupe with balance concurrency test
> +#
> +# This can spot inband dedupe error which will increase delayed ref on
> +# an data extent inside RO block group
> +#
> +#---
> +# Copyright (c) 2017 Fujitsu. All Rights Reserved.
> +#
> +# This program is free software; you can redistribute it and/or
> +# modify it under the terms of the GNU General Public License as
> +# published by the Free Software Foundation.
> +#
> +# This program is distributed in the hope that it would be useful,
> +# but WITHOUT ANY WARRANTY; without even the implied warranty of
> +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> +# GNU General Public License for more details.
> +#
> +# You should have received a copy of the GNU General Public License
> +# along with this program; if not, write the Free Software Foundation,
> +# Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA
> +#---
> +#
> +
> +seq=`basename $0`
> +seqres=$RESULT_DIR/$seq
> +echo "QA output created by $seq"
> +
> +here=`pwd`
> +tmp=/tmp/$$
> +status=1	# failure is the default!
> +trap "_cleanup; exit \$status" 0 1 2 3 15
> +
> +_cleanup()
> +{
> +	cd /
> +	kill $populate_pid &> /dev/null
> +	kill $balance_pid &> /dev/null
> +	wait
> +	# Check later comment for reason
> +	$BTRFS_UTIL_PROG balance cancel $SCRATCH_MNT &> /dev/null
> +	rm -f $tmp.*

I sometimes saw "btrfs balance start" keep $SCRATCH_MNT from umounting
and result in fs inconsistency after the test. Some debug code shows
that the "btrfs balance start $SCRATCH_MNT" is still running.

If I add what btrfs/061 does:

	while ps aux | grep "balance start" | grep -qv grep; do
		sleep 1
	done

the test finishes fine. But btrfs-balance(8) says balance cancel should
block and wait for the background balance process to finish. Seems like
a "balance cancel" bug?

Otherwise these three patches look fine to me overall, except the wanted
"dedupe feature check", and btrfs/200 is quick enough to fit in the
'quick' group.

I think the "dedupe feature check" should _notrun the tests if current
btrfs has incompatible features set, e.g. compress and/or nodatacow.

Thanks,
Eryu
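The btrfs/061-style workaround quoted above can be factored into a small
polling helper. This is only a sketch of the workaround (the function name
is made up here), not a substitute for fixing the possible "balance cancel"
bug:

```shell
# Sketch: poll until no process matching a pattern remains, as btrfs/061
# does for "balance start".  The second grep filters out the grep process
# itself, which would otherwise always match the pattern.
wait_for_process()
{
	# $1: pattern identifying the background process, e.g. "balance start"
	while ps aux | grep "$1" | grep -qv grep; do
		sleep 1
	done
}

# e.g. in _cleanup(), before "balance cancel" and the final umount:
# wait_for_process "balance start"
```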
Re: [PATCH 0/3] Btrfs in-band de-duplication test cases
On Tue, Mar 21, 2017 at 03:36:54PM +0800, Qu Wenruo wrote:
> At 03/21/2017 03:23 PM, Eryu Guan wrote:
> > On Thu, Mar 16, 2017 at 09:50:24AM +0800, Qu Wenruo wrote:
> > > Btrfs in-band de-duplication test cases for in-memory backend, which
> > > covers the bugs exposed during the development.
> >
> > Sorry, I'm having trouble enabling inband dedupe in tests, I always get
> > ioctl failure, $seqres.full shows:
> >
> > btrfs-progs v4.10-5-g23d4fd5
> > See http://btrfs.wiki.kernel.org for more information.
> >
> > Label:              (null)
> > UUID:               3d2cedd0-64ae-4667-ba09-9c12e28ecadd
> > Node size:          16384
> > Sector size:        4096
> > Filesystem size:    15.00GiB
> > Block group profiles:
> >   Data:             single            8.00MiB
> >   Metadata:         DUP               1.00GiB
> >   System:           DUP               8.00MiB
> > SSD detected:       no
> > Incompat features:  extref, skinny-metadata
> > Number of devices:  1
> > Devices:
> >   ID    SIZE      PATH
> >    1    15.00GiB  /dev/mapper/systemvg-testlv2
> >
> > # /usr/local/bin/btrfs dedupe enable -f -s inmemory -b 64K /scratch
> > ERROR: failed to (enable) inband deduplication: Inappropriate ioctl for device
> > ERROR: unsupported dedupe limit combination: nr: 0, mem: 0
>
> What I forgot to mention is that the inband dedupe feature is an
> experimental feature, so CONFIG_BTRFS_DEBUG is needed to use this
> feature.

OK, I'll rebuild with BTRFS_DEBUG and try again, thanks for the info!
But IMHO, giving out an experimental message in dmesg seems sufficient,
like what the XFS reflink feature does.

> Sorry for the inconvenience.
>
> I'll also add the CONFIG_BTRFS_DEBUG check for /sys/fs/btrfs/feature/,
> so it won't export inband dedupe to users.

I'd suggest adding a new require rule to check if inband dedupe is
supported by the current kernel, by enabling inband dedupe and checking
the result, so that we don't need to update the CONFIG_BTRFS_DEBUG check
when this feature is moved out of BTRFS_DEBUG.
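A probe-style require rule along the lines suggested above could look
roughly like this. The helper name is hypothetical and the dedupe
subcommands are assumed from the progs output quoted above; `_notrun`,
`$BTRFS_UTIL_PROG` and `$SCRATCH_MNT` follow the usual fstests
conventions:

```shell
# Hypothetical fstests-style require rule: probe the kernel by actually
# enabling inband dedupe instead of checking CONFIG_BTRFS_DEBUG, so the
# check keeps working once the feature leaves BTRFS_DEBUG.
# Expects a mounted $SCRATCH_MNT.
_require_btrfs_inband_dedupe()
{
	$BTRFS_UTIL_PROG dedupe enable -f -s inmemory -b 64K $SCRATCH_MNT \
		> /dev/null 2>&1 \
		|| _notrun "btrfs inband dedupe not supported by this kernel"
	# Undo the probe so the test starts from a clean state.
	$BTRFS_UTIL_PROG dedupe disable $SCRATCH_MNT > /dev/null 2>&1
}
```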
Thanks,
Eryu
Re: [PATCH 0/3] Btrfs in-band de-duplication test cases
At 03/21/2017 03:23 PM, Eryu Guan wrote:
> On Thu, Mar 16, 2017 at 09:50:24AM +0800, Qu Wenruo wrote:
> > Btrfs in-band de-duplication test cases for in-memory backend, which
> > covers the bugs exposed during the development.
>
> Sorry, I'm having trouble enabling inband dedupe in tests, I always get
> ioctl failure, $seqres.full shows:
>
> btrfs-progs v4.10-5-g23d4fd5
> See http://btrfs.wiki.kernel.org for more information.
>
> Label:              (null)
> UUID:               3d2cedd0-64ae-4667-ba09-9c12e28ecadd
> Node size:          16384
> Sector size:        4096
> Filesystem size:    15.00GiB
> Block group profiles:
>   Data:             single            8.00MiB
>   Metadata:         DUP               1.00GiB
>   System:           DUP               8.00MiB
> SSD detected:       no
> Incompat features:  extref, skinny-metadata
> Number of devices:  1
> Devices:
>   ID    SIZE      PATH
>    1    15.00GiB  /dev/mapper/systemvg-testlv2
>
> # /usr/local/bin/btrfs dedupe enable -f -s inmemory -b 64K /scratch
> ERROR: failed to (enable) inband deduplication: Inappropriate ioctl for device
> ERROR: unsupported dedupe limit combination: nr: 0, mem: 0

What I forgot to mention is that the inband dedupe feature is an
experimental feature, so CONFIG_BTRFS_DEBUG is needed to use this
feature.

Sorry for the inconvenience.

I'll also add the CONFIG_BTRFS_DEBUG check for /sys/fs/btrfs/feature/,
so it won't export inband dedupe to users.

Thanks,
Qu

> failed: '/usr/local/bin/btrfs dedupe enable -f -s inmemory -b 64K /scratch'
>
> I have no special MKFS_OPTIONS or MOUNT_OPTIONS specified, all defaults.
> kernel is built from your test tree. Did I miss anything?
>
> Thanks,
> Eryu
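Since the feature is gated behind CONFIG_BTRFS_DEBUG, it can save a test
cycle to confirm what the running kernel was built with before rerunning.
A small sketch (the helper name is made up; config file locations vary by
distro and `/proc/config.gz` needs CONFIG_IKCONFIG_PROC):

```shell
# Check whether the running kernel was built with CONFIG_BTRFS_DEBUG.
# Neither location is guaranteed to exist, hence the quiet flags.
btrfs_debug_enabled()
{
	grep -qs "^CONFIG_BTRFS_DEBUG=y" "/boot/config-$(uname -r)" && return 0
	zcat /proc/config.gz 2>/dev/null | grep -q "^CONFIG_BTRFS_DEBUG=y"
}

if btrfs_debug_enabled; then
	echo "CONFIG_BTRFS_DEBUG enabled"
else
	echo "CONFIG_BTRFS_DEBUG not enabled (or kernel config not readable)"
fi
```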
Re: [PATCH 0/3] Btrfs in-band de-duplication test cases
On Thu, Mar 16, 2017 at 09:50:24AM +0800, Qu Wenruo wrote:
> Btrfs in-band de-duplication test cases for in-memory backend, which
> covers the bugs exposed during the development.

Sorry, I'm having trouble enabling inband dedupe in tests, I always get
ioctl failure, $seqres.full shows:

btrfs-progs v4.10-5-g23d4fd5
See http://btrfs.wiki.kernel.org for more information.

Label:              (null)
UUID:               3d2cedd0-64ae-4667-ba09-9c12e28ecadd
Node size:          16384
Sector size:        4096
Filesystem size:    15.00GiB
Block group profiles:
  Data:             single            8.00MiB
  Metadata:         DUP               1.00GiB
  System:           DUP               8.00MiB
SSD detected:       no
Incompat features:  extref, skinny-metadata
Number of devices:  1
Devices:
  ID    SIZE      PATH
   1    15.00GiB  /dev/mapper/systemvg-testlv2

# /usr/local/bin/btrfs dedupe enable -f -s inmemory -b 64K /scratch
ERROR: failed to (enable) inband deduplication: Inappropriate ioctl for device
ERROR: unsupported dedupe limit combination: nr: 0, mem: 0
failed: '/usr/local/bin/btrfs dedupe enable -f -s inmemory -b 64K /scratch'

I have no special MKFS_OPTIONS or MOUNT_OPTIONS specified, all defaults.
kernel is built from your test tree. Did I miss anything?

Thanks,
Eryu
Re: [PATCH v2 0/3] Btrfs in-band de-duplication test cases
At 03/21/2017 02:05 PM, Eryu Guan wrote:
> On Tue, Mar 21, 2017 at 01:22:33PM +0800, Qu Wenruo wrote:
> > At 03/21/2017 12:51 PM, Eryu Guan wrote:
> > > Hi Qu,
> > >
> > > On Thu, Mar 16, 2017 at 05:08:48PM +0800, Qu Wenruo wrote:
> > > > Btrfs in-band de-duplication test cases for in-memory backend,
> > > > which covers the bugs exposed during the development.
> > >
> > > Do you have a kernel tree that contains the in-band patches so that
> > > I and others could try and run these tests?
> >
> > Of course.
> >
> > Kernel tree:
> > https://github.com/adam900710/linux.git wang_dedupe_latest
> >
> > Btrfs-progs tree:
> > https://github.com/adam900710/btrfs-progs.git dedupe_20170306
>
> Thanks! But I guess it's branch 'dedupe_20170316' of btrfs-progs repo,
> right? There's no dedupe_20170306 branch.

My fault, it's 20170316.

Thanks,
Qu

> Thanks,
> Eryu
Re: [PATCH 4/5] btrfs: raid56: Don't keep rbio for later steal
On Tue, Mar 21, 2017 at 10:23:56AM +0800, Qu Wenruo wrote:
> At 03/21/2017 10:08 AM, Liu Bo wrote:
> > On Tue, Mar 21, 2017 at 08:44:18AM +0800, Qu Wenruo wrote:
> > > At 03/21/2017 04:23 AM, Liu Bo wrote:
> > > > On Mon, Mar 20, 2017 at 02:21:48PM +0800, Qu Wenruo wrote:
> > > > > At 03/18/2017 10:03 AM, Liu Bo wrote:
> > > > > > On Fri, Mar 17, 2017 at 01:28:45PM +0800, Qu Wenruo wrote:
> > > > > > > At 03/17/2017 12:44 PM, Liu Bo wrote:
> > > > > > > > On Fri, Feb 03, 2017 at 04:20:22PM +0800, Qu Wenruo wrote:
> > > > > > > > > Before this patch, btrfs raid56 will keep raid56 rbio even all its IO is
> > > > > > > > > done.
> > > > > > > > > This may save some time allocating rbio, but it can cause deadly
> > > > > > > > > use-after-free bug, for the following case:
> > > > > > > > >
> > > > > > > > > Original fs: 4 devices RAID5
> > > > > > > > >
> > > > > > > > >    Process A                  |  Process B
> > > > > > > > > ------------------------------+------------------------------
> > > > > > > > >                               | Start device replace
> > > > > > > > >                               |  Now the fs has 5 devices
> > > > > > > > >                               |  devid 0: replace device
> > > > > > > > >                               |  devid 1~4: old devices
> > > > > > > > > btrfs_map_bio()               |
> > > > > > > > > |- __btrfs_map_block()        |
> > > > > > > > > |  bbio has 5 stripes         |
> > > > > > > > > |  including devid 0          |
> > > > > > > > > |- raid56_parity_write()      |
> > > > > > > > >                               |
> > > > > > > > > raid_write_end_io()           |
> > > > > > > > > |- rbio_orig_end_io()         |
> > > > > > > > >    |- unlock_stripe()         |
> > > > > > > > >       Keeps the old rbio for  |
> > > > > > > > >       later steal, which has  |
> > > > > > > > >       stripe for devid 0      |
> > > > > > > > >                               | Cancel device replace
> > > > > > > > >                               |  Now the fs has 4 devices
> > > > > > > > >                               |  devid 0 is freed
> > > > > > > > > Some IO happens               |
> > > > > > > > > raid_write_end_io()           |
> > > > > > > > > |- rbio_orig_end_io()         |
> > > > > > > > >    |- unlock_stripe()         |
> > > > > > > > >       |- steal_rbio()         |
> > > > > > > > >          Use old rbio, whose  |
> > > > > > > > >          bbio has freed       |
> > > > > > > > >          devid 0 stripe       |
> > > > > > > > >
> > > > > > > > > Any access to rbio->bbio will cause general protection or NULL
> > > > > > > > > pointer dereference
> > > > > > > > >
> > > > > > > > > Such bug can already be triggered by fstests btrfs/069 for RAID5/6
> > > > > > > > > profiles.
> > > > > > > > >
> > > > > > > > > Fix it by not keeping old rbio in unlock_stripe(), so we just free the
> > > > > > > > > finished rbio and rbio->bbio, so above problem wont' happen.
> > > > > > > >
> > > > > > > > I don't think this is acceptable, keeping a cache is important for
> > > > > > > > raid56 write performance, could you please fix it by checking if the
> > > > > > > > device is missing?
> > > > > > >
> > > > > > > Not possible, as it's keeping the btrfs_device pointer and never
> > > > > > > release it, the stolen rbio can be kept forever until umount.
> > > > > >
> > > > > > steal_rbio() only takes pages from rbio->stripe_pages, so the cached
> > > > > > rbio->bbio is not going to the next IO's rbio because the cached one
> > > > > > got freed immediately in steal_rbio(), where could we dereference
> > > > > > rbio->bbio?
> > > > >
> > > > > Did you mean in unlock_stripe(), we still keep the rbio in cache,
> > > > > while release its bbio?
> > > >
> > > > I thought it was lock_stripe_add(), OK, so unlock_stripe() just caches
> > > > the current rbio and doesn't free it. But the point about
> > > > steal_rbio() still stands: steal_rbio() is supposed to take uptodate
> > > > pages from the cached old rbio to the current processing rbio, but it
> > > > doesn't touch the cached old rbio's bbio, nor uses the cached old rbio
> > > > afterwards; instead it is the current processing rbio that would use
> > > > its bbio for writing into target devices, but it has increased its own
> > > > bio_counter.
> > > > >
> > > > > This sounds quite good but I'm afraid it may cause more problems.
> > > > >
> > > > > Quite a lot of places are accessing rbio->bbio either for their btrfs
> > > > > logical address or stripes or even stripe->dev.
> > > >
> > > > I'm confused,
Re: [PATCH 4/5] btrfs: raid56: Don't keep rbio for later steal
At 03/21/2017 01:45 PM, Liu Bo wrote:
On Tue, Mar 21, 2017 at 10:23:56AM +0800, Qu Wenruo wrote:
At 03/21/2017 10:08 AM, Liu Bo wrote:
On Tue, Mar 21, 2017 at 08:44:18AM +0800, Qu Wenruo wrote:
At 03/21/2017 04:23 AM, Liu Bo wrote:
On Mon, Mar 20, 2017 at 02:21:48PM +0800, Qu Wenruo wrote:
At 03/18/2017 10:03 AM, Liu Bo wrote:
On Fri, Mar 17, 2017 at 01:28:45PM +0800, Qu Wenruo wrote:
At 03/17/2017 12:44 PM, Liu Bo wrote:
On Fri, Feb 03, 2017 at 04:20:22PM +0800, Qu Wenruo wrote:

Before this patch, btrfs raid56 will keep raid56 rbio even all its IO
is done. This may save some time allocating rbio, but it can cause
deadly use-after-free bug, for the following case:

Original fs: 4 devices RAID5

   Process A                  |  Process B
------------------------------+------------------------------
                              | Start device replace
                              |  Now the fs has 5 devices
                              |  devid 0: replace device
                              |  devid 1~4: old devices
btrfs_map_bio()               |
|- __btrfs_map_block()        |
|  bbio has 5 stripes         |
|  including devid 0          |
|- raid56_parity_write()      |
                              |
raid_write_end_io()           |
|- rbio_orig_end_io()         |
   |- unlock_stripe()         |
      Keeps the old rbio for  |
      later steal, which has  |
      stripe for devid 0      |
                              | Cancel device replace
                              |  Now the fs has 4 devices
                              |  devid 0 is freed
Some IO happens               |
raid_write_end_io()           |
|- rbio_orig_end_io()         |
   |- unlock_stripe()         |
      |- steal_rbio()         |
         Use old rbio, whose  |
         bbio has freed       |
         devid 0 stripe       |

Any access to rbio->bbio will cause general protection or NULL pointer
dereference.

Such bug can already be triggered by fstests btrfs/069 for RAID5/6
profiles.

Fix it by not keeping old rbio in unlock_stripe(), so we just free the
finished rbio and rbio->bbio, so above problem won't happen.

I don't think this is acceptable, keeping a cache is important for
raid56 write performance, could you please fix it by checking if the
device is missing?

Not possible, as it's keeping the btrfs_device pointer and never
release it, the stolen rbio can be kept forever until umount.

steal_rbio() only takes pages from rbio->stripe_pages, so the cached
rbio->bbio is not going to the next IO's rbio because the cached one
got freed immediately in steal_rbio(), where could we dereference
rbio->bbio?

Did you mean in unlock_stripe(), we still keep the rbio in cache, while
release its bbio?

I thought it was lock_stripe_add(), OK, so unlock_stripe() just caches
the current rbio and doesn't free it. But the point about steal_rbio()
still stands: steal_rbio() is supposed to take uptodate pages from the
cached old rbio to the current processing rbio, but it doesn't touch
the cached old rbio's bbio, nor uses the cached old rbio afterwards;
instead it is the current processing rbio that would use its bbio for
writing into target devices, but it has increased its own bio_counter.

This sounds quite good but I'm afraid it may cause more problems.

Quite a lot of places are accessing rbio->bbio either for their btrfs
logical address or stripes or even stripe->dev.

I'm confused, could you please specify the call trace of general
protection you got in the commit log?

The 4th and 5th patches are designed to fix the same problem. If one
applies the 5th patch only and runs btrfs/069, it will cause a hang
when aborting replace, since the btrfs_device of dev 0 is held in
bbio->stripes[] and never gets released. The 4th patch is used to
solve such hang.

OK, I see. Firstly, the above commit log is misleading people a bit
because it says that steal_rbio() dereferences the device of the cached
rbio and that device has got free'd, but steal_rbio() actually doesn't.
Yes, the cached rbio holds a reference on the free'd device, but I
think the below 'NULL pointer dereference' comes from writing back
pages into target devices when doing RMW with the current rbio instead
of the old cached one, right?

Yes, steal_rbio() is not related to this problem, sorry for the
confusion.

Secondly, if it is the current rbio that ends up this 'NULL pointer
dereference', it could hold a bio_counter and let the replace thread
canceled by scrub wait for this bio_counter to be zero. Although
btrfs_dev_replace_finishing() has flushed delalloc IO and committed
transaction, seems like scrub is an exception because it can continue
after committing transaction.

If I understand it correctly, did you mean hold bio_counter when rbio
holds bbio? We have bio_counter to prevent the race against a
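The lifetime problem being debated in this thread can be modeled
abstractly. The toy below is plain Python, not kernel code (class names
only loosely mirror the kernel structures); it just illustrates why a
cached object holding a pointer into a freed device is dangerous, and why
either dropping the cache (the patch's approach) or pinning the device for
as long as the bbio exists (the bio_counter idea) closes the window:

```python
# Toy model of the rbio-cache lifetime bug discussed above.

class Device:
    def __init__(self, devid):
        self.devid = devid
        self.freed = False      # models kfree() of btrfs_device

class Bbio:
    def __init__(self, stripes):
        self.stripes = stripes  # list of Device, like bbio->stripes[]

class Rbio:
    def __init__(self, bbio):
        self.bbio = bbio        # like rbio->bbio

def access_stripes(rbio):
    # Models "any access to rbio->bbio": touching a freed device is the
    # use-after-free the commit message describes.
    for dev in rbio.bbio.stripes:
        if dev.freed:
            raise RuntimeError("use-after-free: devid %d" % dev.devid)
    return len(rbio.bbio.stripes)

# Replace adds devid 0; the rbio's bbio keeps a stripe pointing at it.
replace_dev = Device(0)
cached = Rbio(Bbio([replace_dev] + [Device(i) for i in (1, 2, 3, 4)]))

# Cancelling the replace frees devid 0 while the cached rbio survives.
replace_dev.freed = True

try:
    access_stripes(cached)      # models later IO touching the stale bbio
except RuntimeError as e:
    print(e)                    # use-after-free: devid 0

# Not caching the finished rbio (or holding a counter that delays the
# free until the bbio is gone) removes this window entirely.
```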
Re: [PATCH] fstests: add test for btrfs send/receive with sparse files
Hi Filipe,

At 02/15/2017 04:35 AM, fdman...@kernel.org wrote:

From: Filipe Manana

Test that both a full and incremental btrfs send operation preserves
file holes.

I found the test case fails with compress=lzo mount option. In fact, it
turns out to be 2 bugs:

[Inline-then-regular layout]

Furthermore, it seems that it can create an inline-then-regular file
extent layout with compress=lzo. It does not always create such
inline-then-regular layout, but the possibility seems quite high in my
test box, near 80%.

Btrfs dump tree for snap2 (258):

	item 9 key (257 INODE_REF 256) itemoff 15736 itemsize 13
		inode ref index 2 namelen 3 name: foo
	item 10 key (257 EXTENT_DATA 0) itemoff 15663 itemsize 73
		generation 7 type 0 (inline)
		inline extent data size 52 ram_bytes 4096 compression 2 (lzo)
	item 11 key (257 EXTENT_DATA 4096) itemoff 15610 itemsize 53
		generation 8 type 1 (regular)
		extent data disk byte 0 nr 0
		extent data offset 0 nr 1048576 ram 1052672
		extent compression 0 (none)
	item 12 key (257 EXTENT_DATA 1052672) itemoff 15557 itemsize 53
		generation 8 type 1 (regular)
		extent data disk byte 12582912 nr 4096
		extent data offset 0 nr 4096 ram 4096
		extent compression 0 (none)

[Send-stream without hole]

This bug is 100% reproducible; the send stream of snap2 contains no
hole, just the full file contents.

Thanks,
Qu

This used to fail when the filesystem had the NO_HOLES feature enabled,
that is, when the test is run with MKFS_OPTIONS="-O no-holes". This is
fixed by the following patch for the linux kernel:

  "Btrfs: incremental send, fix unnecessary hole writes for sparse files"

Signed-off-by: Filipe Manana
---
 tests/btrfs/137     | 141
 tests/btrfs/137.out |  63 +++
 tests/btrfs/group   |   1 +
 3 files changed, 205 insertions(+)
 create mode 100755 tests/btrfs/137
 create mode 100644 tests/btrfs/137.out

diff --git a/tests/btrfs/137 b/tests/btrfs/137
new file mode 100755
index 000..3ff2c6b
--- /dev/null
+++ b/tests/btrfs/137
@@ -0,0 +1,141 @@
+#! /bin/bash
+# FS QA Test No. btrfs/137
+#
+# Test that both incremental and full send operations preserve file holes.
+#
+#---
+#
+# Copyright (C) 2017 SUSE Linux Products GmbH. All Rights Reserved.
+# Author: Filipe Manana
+#
+# This program is free software; you can redistribute it and/or
+# modify it under the terms of the GNU General Public License as
+# published by the Free Software Foundation.
+#
+# This program is distributed in the hope that it would be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program; if not, write the Free Software Foundation,
+# Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA
+#---
+#
+
+seq=`basename $0`
+seqres=$RESULT_DIR/$seq
+echo "QA output created by $seq"
+
+tmp=/tmp/$$
+status=1	# failure is the default!
+trap "_cleanup; exit \$status" 0 1 2 3 15
+
+_cleanup()
+{
+	cd /
+	rm -fr $send_files_dir
+	rm -f $tmp.*
+}
+
+# get standard environment, filters and checks
+. ./common/rc
+. ./common/filter
+. ./common/punch
+
+# real QA test starts here
+_supported_fs btrfs
+_supported_os Linux
+_require_test
+_require_scratch
+_require_xfs_io_command "fiemap"
+
+send_files_dir=$TEST_DIR/btrfs-test-$seq
+
+rm -f $seqres.full
+rm -fr $send_files_dir
+mkdir $send_files_dir
+
+_scratch_mkfs >>$seqres.full 2>&1
+_scratch_mount
+
+# Create the first test file.
+$XFS_IO_PROG -f -c "pwrite -S 0xaa 0 4K" $SCRATCH_MNT/foo | _filter_xfs_io
+
+# Create a second test file with a 1Mb hole.
+$XFS_IO_PROG -f \
+	-c "pwrite -S 0xaa 0 4K" \
+	-c "pwrite -S 0xbb 1028K 4K" \
+	$SCRATCH_MNT/bar | _filter_xfs_io
+
+$BTRFS_UTIL_PROG subvolume snapshot -r $SCRATCH_MNT \
+	$SCRATCH_MNT/snap1 >/dev/null
+
+# Now add one new extent to our first test file, increasing its size and leaving
+# a 1Mb hole between the first extent and this new extent.
+$XFS_IO_PROG -c "pwrite -S 0xbb 1028K 4K" $SCRATCH_MNT/foo | _filter_xfs_io
+
+# Now overwrite the last extent of our second test file.
+$XFS_IO_PROG -c "pwrite -S 0xcc 1028K 4K" $SCRATCH_MNT/bar | _filter_xfs_io
+
+$BTRFS_UTIL_PROG subvolume snapshot -r $SCRATCH_MNT \
+	$SCRATCH_MNT/snap2 >/dev/null
+
+echo
+echo "File digests in the original filesystem:"
+md5sum $SCRATCH_MNT/snap1/foo | _filter_scratch
+md5sum $SCRATCH_MNT/snap1/bar |
Re: [PATCH v2 0/3] Btrfs in-band de-duplication test cases
On Tue, Mar 21, 2017 at 01:22:33PM +0800, Qu Wenruo wrote:
> At 03/21/2017 12:51 PM, Eryu Guan wrote:
> > Hi Qu,
> >
> > On Thu, Mar 16, 2017 at 05:08:48PM +0800, Qu Wenruo wrote:
> > > Btrfs in-band de-duplication test cases for in-memory backend,
> > > which covers the bugs exposed during the development.
> >
> > Do you have a kernel tree that contains the in-band patches so that I
> > and others could try and run these tests?
>
> Of course.
>
> Kernel tree:
> https://github.com/adam900710/linux.git wang_dedupe_latest
>
> Btrfs-progs tree:
> https://github.com/adam900710/btrfs-progs.git dedupe_20170306

Thanks! But I guess it's branch 'dedupe_20170316' of btrfs-progs repo,
right? There's no dedupe_20170306 branch.

Thanks,
Eryu