Re: WARNING at fs/btrfs/inode.c:9261 btrfs_destroy_inode()
On Tue, May 10, 2016 at 06:19:20PM -0500, Eric Biggers wrote:
> Hello,
>
> The following warning has been triggering for me since about v4.6-rc3:
>
> WARN_ON(BTRFS_I(inode)->csum_bytes);
>
> On one machine the warning has occurred 657 times since v4.6-rc5. On another
> it has occurred 3 times since v4.6-rc3. Both are now on v4.6-rc7, where I
> have still observed the warning. The warnings occur in groups, and
> do_unlinkat() and evict() are always in the call stack.
>
> Is this a known issue? Here is the first occurrence:

Finally tracked this down today, it's a bug in how we deal with page faults
in the middle of a write. We're testing the fix here and I'll have a patch
on Monday.

Strictly speaking this is a regression in one of Chandan's setup patches for
PAGE_SIZE > sectorsize, but the code was much much too subtle to blame him.
It should be possible to trigger with a non-aligned multi-page write, I'm
trying to nail down a reliable test program so we don't make the same
mistake again.

-chris
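A minimal sketch of the kind of non-aligned multi-page write Chris describes might look like the following xfs_io invocation; the mount point, offset, and length here are made up for illustration, and this is not the reliable reproducer he mentions still working on.

# Hypothetical sketch: a single write that spans several pages while being
# aligned to neither a page nor a sector boundary.
xfs_io -f -c "pwrite -S 0xab 1021 $((3 * 4096 + 509))" /mnt/btrfs/unaligned-test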
Re: [PATCH] fstests: generic: Test SHARED flag about fiemap ioctl before and after sync
On 05/13/2016 01:17 PM, Darrick J. Wong wrote: On Fri, May 13, 2016 at 09:52:52AM +0800, Qu Wenruo wrote: The test case will check SHARED flag returned by fiemap ioctl on reflinked files before and after sync. Normally SHARED flag won't change just due to a normal sync operation. But btrfs doesn't handle SHARED flag well, and this time it won't check any delayed extent tree(reverse extent searching tree) modification, but only metadata already committed to disk. So btrfs will not return correct SHARED flag on reflinked files if there is no sync to commit all metadata. This testcase will just check it. Signed-off-by: Qu Wenruo -- And of course, xfs handles it quite well. Nice work Darrick. Also the test case needs the new infrastructure introduced in previous generic/352 test case. --- tests/generic/353 | 86 +++ tests/generic/353.out | 9 ++ tests/generic/group | 1 + 3 files changed, 96 insertions(+) create mode 100755 tests/generic/353 create mode 100644 tests/generic/353.out diff --git a/tests/generic/353 b/tests/generic/353 new file mode 100755 index 000..1e9117e --- /dev/null +++ b/tests/generic/353 @@ -0,0 +1,86 @@ +#! /bin/bash +# FS QA Test 353 +# +# Check if fiemap ioctl returns correct SHARED flag on reflinked file +# before and after sync the fs +# +# Btrfs has a bug in checking shared extent, which can only handle metadata +# already committed to disk, but not delayed extent tree modification. +# This caused SHARED flag only occurs after sync. I noticed this a while ago, but figured it was just btrfs being btrfs. Ho hum. Thanks for writing a test and getting the problem fixed. +# +#--- +# Copyright (c) 2016 Fujitsu. All Rights Reserved. +# +# This program is free software; you can redistribute it and/or +# modify it under the terms of the GNU General Public License as +# published by the Free Software Foundation. +# +# This program is distributed in the hope that it would be useful, +# but WITHOUT ANY WARRANTY; without even the implied warranty of +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the +# GNU General Public License for more details. +# +# You should have received a copy of the GNU General Public License +# along with this program; if not, write the Free Software Foundation, +# Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA +#--- +# + +seq=`basename $0` +seqres=$RESULT_DIR/$seq +echo "QA output created by $seq" + +here=`pwd` +tmp=/tmp/$$ +status=1 # failure is the default! +trap "_cleanup; exit \$status" 0 1 2 3 15 + +_cleanup() +{ + cd / + rm -f $tmp.* tmp isn't used for anything in this testcase. That's from template, and does no harm. +} + +# get standard environment, filters and checks +. ./common/rc +. ./common/filter +. ./common/reflink +. ./common/punch + +# remove previous $seqres.full before test +rm -f $seqres.full + +# real QA test starts here + +# Modify as appropriate. 
+_supported_fs generic +_supported_os Linux +_require_scratch_reflink +_require_fiemap + +_scratch_mkfs > /dev/null 2>&1 +_scratch_mount + +blocksize=64k +file1="$SCRATCH_MNT/file1" +file2="$SCRATCH_MNT/file2" + +# write the initial file +_pwrite_byte 0xcdcdcdcd 0 $blocksize $file1 | _filter_xfs_io + +# reflink initial file +_reflink_range $file1 0 $file2 0 $blocksize | _filter_xfs_io + +# check their fiemap to make sure it's correct +$XFS_IO_PROG -c "fiemap -v" $file1 | _filter_fiemap_flags +$XFS_IO_PROG -c "fiemap -v" $file2 | _filter_fiemap_flags + +# sync and recheck, to make sure the fiemap doesn't change just +# due to sync +sync +$XFS_IO_PROG -c "fiemap -v" $file1 | _filter_fiemap_flags +$XFS_IO_PROG -c "fiemap -v" $file2 | _filter_fiemap_flags Nowadays, when I write a test that prints similar output one after the other I will also write a comment to the output to distinguish the two cases, e.g.: +XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec) fiemap before sync +0: [0..127]: 0x2001 +0: [0..127]: 0x2001 fiemap after sync +0: [0..127]: 0x2001 +0: [0..127]: 0x2001 This way when a bunch of tests regress some months later it's easier for me to relearn what's going on. Yes, that's pretty nice, I'll also add filename to the output, in case the reflinked or original file layout changes. Thanks for the review. Qu (I wasn't always good at doing that.) Otherwise, everything looks ok. --D + +# success, all done +status=0 +exit diff --git a/tests/generic/353.out b/tests/generic/353.out new file mode 100644 index 000..0cd8981 --- /dev/null +++ b/tests/generic/353.out @@ -0,0 +1,9 @@ +QA output created by 353 +wrote 65536/65536 bytes at offset 0 +XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec) +linked 65536/65536 bytes at offset 0 +XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec) +0: [0.
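As a concrete illustration of Darrick's suggestion above (and Qu's follow-up about also printing the file name), the two fiemap dumps in the test could be labelled roughly as follows; this is only a sketch of the idea, not the final generic/353, and the golden output would need matching updates.

# label each fiemap dump and name the file it belongs to
echo "fiemap before sync"
for f in $file1 $file2; do
	echo "$(basename $f):"
	$XFS_IO_PROG -c "fiemap -v" $f | _filter_fiemap_flags
done

sync

echo "fiemap after sync"
for f in $file1 $file2; do
	echo "$(basename $f):"
	$XFS_IO_PROG -c "fiemap -v" $f | _filter_fiemap_flags
done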
[PATCH 1/7] Btrfs: replace BUG() with WARN_ONCE in raid56
This BUG() has been triggered by a fuzz testing image, but in fact btrfs
can handle this gracefully by returning -EIO. Thus, use WARN_ONCE for
warning purposes and don't leave a possible kernel panic.

Signed-off-by: Liu Bo
---
 fs/btrfs/raid56.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
index 0b7792e..863f7fe 100644
--- a/fs/btrfs/raid56.c
+++ b/fs/btrfs/raid56.c
@@ -2139,7 +2139,7 @@ int raid56_parity_recover(struct btrfs_root *root, struct bio *bio,
 	rbio->faila = find_logical_bio_stripe(rbio, bio);
 	if (rbio->faila == -1) {
-		BUG();
+		WARN_ONCE(1, KERN_WARNING "rbio->faila is -1\n");
 		if (generic_io)
 			btrfs_put_bbio(bbio);
 		kfree(rbio);
-- 
2.5.5
[PATCH 3/7] Btrfs: check if extent buffer is aligned to sectorsize
Thanks to fuzz testing, we can pass an invalid bytenr to the extent buffer
via alloc_extent_buffer(). An unaligned eb can have more pages than it
should have, which ends up in an extent buffer leak or some corrupted
content in the extent buffer. This adds a warning to let us quickly know
what was happening.

Signed-off-by: Liu Bo
---
 fs/btrfs/extent_io.c | 4
 1 file changed, 4 insertions(+)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index d247fc0..e601e0f 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -4868,6 +4868,10 @@ struct extent_buffer *alloc_extent_buffer(struct btrfs_fs_info *fs_info,
 	int uptodate = 1;
 	int ret;
 
+	WARN_ONCE(!IS_ALIGNED(start, fs_info->tree_root->sectorsize),
+		  KERN_WARNING "eb->start(%llu) is not aligned to root->sectorsize(%u)\n",
+		  start, fs_info->tree_root->sectorsize);
+
 	eb = find_extent_buffer(fs_info, start);
 	if (eb)
 		return eb;
-- 
2.5.5
[PATCH 2/7] Btrfs: replace BUG_ON with WARN_ONCE in cow_file_range
This BUG_ON is more like a warning since an invalid
btrfs_super_total_bytes() doesn't affect other stuff. Thus, use WARN_ONCE
for warning purposes and don't leave a possible kernel panic here.

Signed-off-by: Liu Bo
---
 fs/btrfs/inode.c | 7 +--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 2aaba58..5874562 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -974,8 +974,11 @@ static noinline int cow_file_range(struct inode *inode,
 		}
 	}
-	BUG_ON(disk_num_bytes >
-	       btrfs_super_total_bytes(root->fs_info->super_copy));
+	WARN_ONCE(disk_num_bytes >
+		  btrfs_super_total_bytes(root->fs_info->super_copy),
+		  KERN_WARNING "disk_num_bytes(%llu) > btrfs_super_total_bytes(%llu)\n",
+		  disk_num_bytes,
+		  btrfs_super_total_bytes(root->fs_info->super_copy));
 	alloc_hint = get_extent_allocation_hint(inode, start, num_bytes);
 	btrfs_drop_extent_cache(inode, start, start + num_bytes - 1, 0);
-- 
2.5.5
[PATCH 6/7] Btrfs: fix eb memory leak due to readpage failure
eb->io_pages is set in read_extent_buffer_pages().

In case of readpage failure, for pages that have been added to bio, it
calls bio_endio and later readpage_io_failed_hook() does the work.

When this eb's page (couldn't be the 1st page) fails to add itself to bio
due to failure in merge_bio(), it cannot decrease eb->io_pages via
bio_endio, and ends up with a memory leak eventually.

This adds the 'atomic_dec(&eb->io_pages)' to the readpage error handling.

Signed-off-by: Liu Bo
---
 fs/btrfs/extent_io.c | 24
 1 file changed, 24 insertions(+)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 99286d1..2327200 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -3069,6 +3069,30 @@ static int __do_readpage(struct extent_io_tree *tree,
 			*bio_flags = this_bio_flag;
 		} else {
 			SetPageError(page);
+			/*
+			 * Only metadata io request has this issue, for data it
+			 * just unlocks extent and releases page's lock.
+			 *
+			 * eb->io_pages is set in read_extent_buffer_pages().
+			 *
+			 * When this eb's page fails to add itself to bio,
+			 * it cannot decrease eb->io_pages via bio_endio, and
+			 * ends up with extent_buffer_under_io() always being
+			 * true, because of that, eb won't be freed and we have
+			 * a memory leak eventually.
+			 *
+			 * Here we still hold this page's lock, and other tasks
+			 * who're also reading this eb are blocked.
+			 */
+			if (rw & REQ_META) {
+				struct extent_buffer *eb;
+
+				WARN_ON_ONCE(!PagePrivate(page));
+				eb = (struct extent_buffer *)page->private;
+
+				WARN_ON_ONCE(atomic_read(&eb->io_pages) < 1);
+				atomic_dec(&eb->io_pages);
+			}
 			unlock_extent(tree, cur, cur + iosize - 1);
 		}
 		cur = cur + iosize;
-- 
2.5.5
[PATCH 7/7] Btrfs: fix memory leak due to invalid btree height
Thanks to fuzz testing, we can have invalid btree root node height. Btrfs
limits btree height to 7 and if the given height is 9, then btrfs will have
problems in both releasing root node's lock and freeing the node.

Signed-off-by: Liu Bo
---
 fs/btrfs/ctree.c | 7 +++
 1 file changed, 7 insertions(+)

diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
index ec7928a..3fccbcc 100644
--- a/fs/btrfs/ctree.c
+++ b/fs/btrfs/ctree.c
@@ -2756,6 +2756,13 @@ again:
 			}
 		}
 	}
+	if (level > BTRFS_MAX_LEVEL - 1 || level < 0) {
+		WARN_ONCE(1, KERN_WARNING "Invalid btree height %d\n", level);
+		if (!p->skip_locking)
+			btrfs_tree_unlock_rw(b, root_lock);
+		free_extent_buffer(b);
+		return -EINVAL;
+	}
 	p->nodes[level] = b;
 	if (!p->skip_locking)
 		p->locks[level] = root_lock;
-- 
2.5.5
[PATCH 4/7] Btrfs: free sys_array eb as soon as possible
While reading sys_chunk_array in superblock, btrfs creates a temporary
extent buffer. Since we don't use it after finishing reading
sys_chunk_array, we don't need to keep it in memory.

Signed-off-by: Liu Bo
---
 fs/btrfs/volumes.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index bd0f45f..1331606 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -6586,13 +6586,13 @@ int btrfs_read_sys_array(struct btrfs_root *root)
 		sb_array_offset += len;
 		cur_offset += len;
 	}
-	free_extent_buffer(sb);
+	free_extent_buffer_stale(sb);
 	return ret;
 
 out_short_read:
 	printk(KERN_ERR "BTRFS: sys_array too short to read %u bytes at offset %u\n",
 			len, cur_offset);
-	free_extent_buffer(sb);
+	free_extent_buffer_stale(sb);
 	return -EIO;
 }
-- 
2.5.5
[PATCH 5/7] Btrfs: replace BUG_ON with WARN in merge_bio
We have two BUG_ON in merge_bio, but since it can gracefully return errors
to callers, use WARN_ONCE to give error information and don't leave a
possible panic.

Signed-off-by: Liu Bo
---
 fs/btrfs/extent_io.c | 1 -
 fs/btrfs/inode.c     | 6 --
 2 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index e601e0f..99286d1 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2746,7 +2746,6 @@ static int merge_bio(int rw, struct extent_io_tree *tree, struct page *page,
 	if (tree->ops && tree->ops->merge_bio_hook)
 		ret = tree->ops->merge_bio_hook(rw, page, offset, size, bio,
 						bio_flags);
-	BUG_ON(ret < 0);
 	return ret;
 }

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 5874562..3a989e3 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -1827,8 +1827,10 @@ int btrfs_merge_bio_hook(int rw, struct page *page, unsigned long offset,
 	map_length = length;
 	ret = btrfs_map_block(root->fs_info, rw, logical, &map_length, NULL, 0);
-	/* Will always return 0 with map_multi == NULL */
-	BUG_ON(ret < 0);
+	if (ret < 0) {
+		WARN_ONCE(ret < 0, KERN_ERR "ret = %d\n", ret);
+		return ret;
+	}
 	if (map_length < length + size)
 		return 1;
 	return 0;
-- 
2.5.5
Re: [PATCH 2/2] Btrfs: add valid checks for chunk loading
On Wed, May 04, 2016 at 03:56:26PM +0200, David Sterba wrote: > A few minor comments below > > On Mon, May 02, 2016 at 11:15:51AM -0700, Liu Bo wrote: > > --- a/fs/btrfs/volumes.c > > +++ b/fs/btrfs/volumes.c > > @@ -6206,27 +6206,23 @@ struct btrfs_device *btrfs_alloc_device(struct > > btrfs_fs_info *fs_info, > > return dev; > > } > > > > -static int read_one_chunk(struct btrfs_root *root, struct btrfs_key *key, > > - struct extent_buffer *leaf, > > - struct btrfs_chunk *chunk) > > +/* Return -EIO if any error, otherwise return 0. */ > > +static int btrfs_check_chunk_valid(struct btrfs_root *root, > > + struct extent_buffer *leaf, > > + struct btrfs_chunk *chunk, u64 logical) > > { > > - struct btrfs_mapping_tree *map_tree = &root->fs_info->mapping_tree; > > - struct map_lookup *map; > > - struct extent_map *em; > > - u64 logical; > > u64 length; > > u64 stripe_len; > > - u64 devid; > > - u8 uuid[BTRFS_UUID_SIZE]; > > - int num_stripes; > > - int ret; > > - int i; > > + u16 num_stripes; > > + u16 sub_stripes; > > + u64 type; > > > > - logical = key->offset; > > length = btrfs_chunk_length(leaf, chunk); > > stripe_len = btrfs_chunk_stripe_len(leaf, chunk); > > num_stripes = btrfs_chunk_num_stripes(leaf, chunk); > > - /* Validation check */ > > + sub_stripes = btrfs_chunk_sub_stripes(leaf, chunk); > > + type = btrfs_chunk_type(leaf, chunk); > > + > > if (!num_stripes) { > > btrfs_err(root->fs_info, "invalid chunk num_stripes: %u", > > num_stripes); > > @@ -6237,24 +6233,70 @@ static int read_one_chunk(struct btrfs_root *root, > > struct btrfs_key *key, > > "invalid chunk logical %llu", logical); > > return -EIO; > > } > > + if (btrfs_chunk_sector_size(leaf, chunk) != root->sectorsize) { > > + btrfs_err(root->fs_info, "invalid chunk sectorsize %llu", > > + (unsigned long long)btrfs_chunk_sector_size(leaf, > > type cast not necessry > > > + chunk)); > > + return -EIO; > > + } > > if (!length || !IS_ALIGNED(length, root->sectorsize)) { > > btrfs_err(root->fs_info, > > "invalid chunk length %llu", length); > > return -EIO; > > } > > - if (!is_power_of_2(stripe_len)) { > > + if (stripe_len != BTRFS_STRIPE_LEN) { > > Again too strict. As mentined elsewhere, add a helper to validate > stripe_len and use it so we don't open-code it. I'm not sure I understand the comment about open-code, right now the value must be BTRFS_STRIPE_LEN and we don't set any other value, are we going to add a helper for just (stripe_len != BTRFS_STRIPE_LEN)? I fixed other issues. Thanks, -liubo > > > btrfs_err(root->fs_info, "invalid chunk stripe length: %llu", > > stripe_len); > > return -EIO; > > } > > if (~(BTRFS_BLOCK_GROUP_TYPE_MASK | BTRFS_BLOCK_GROUP_PROFILE_MASK) & > > - btrfs_chunk_type(leaf, chunk)) { > > + type) { > > btrfs_err(root->fs_info, "unrecognized chunk type: %llu", > > ~(BTRFS_BLOCK_GROUP_TYPE_MASK | > > BTRFS_BLOCK_GROUP_PROFILE_MASK) & > > btrfs_chunk_type(leaf, chunk)); > > return -EIO; > > } > > + if ((type & BTRFS_BLOCK_GROUP_RAID10 && sub_stripes == 0) || > > + (type & BTRFS_BLOCK_GROUP_RAID1 && num_stripes < 1) || > > + (type & BTRFS_BLOCK_GROUP_RAID5 && num_stripes < 2) || > > + (type & BTRFS_BLOCK_GROUP_RAID5 && num_stripes < 3) || > > + (type & BTRFS_BLOCK_GROUP_DUP && num_stripes > 2) || > > + ((type & BTRFS_BLOCK_GROUP_PROFILE_MASK) == 0 && > > I was looking if we could turn that into some generic checks using the > btrfs_raid_array but seems that the tests do not make a uniform pattern, > eg the DUP and SINGLE disguised as "mask == 0". 
As we don't add new > profiles too often I'm ok with that version. > > > +num_stripes != 1)) { > > + btrfs_err(root->fs_info, "Invalid num_stripes:sub_stripes %u:%u > > for profile %llu", > > "invalid..." (no initial capital letter) and put the string on the next > line so it does not exceed 80 cols > > > + num_stripes, sub_stripes, > > + type & BTRFS_BLOCK_GROUP_PROFILE_MASK); > > + return -EIO; > > + } > > + > > + return 0; > > +} > > + > > +static int read_one_chunk(struct btrfs_root *root, struct btrfs_key *key, > > + struct extent_buffer *leaf, > > + struct btrfs_chunk *chunk) > > +{ > > + struct btrfs_mapping_tree *map_tree = &root->fs_info->mapping_tree; > > + struct map_lookup *map; > > + struct extent_map *em; > > + u64 logical; > > + u64 length;
Re: [PATCH 1/2] Btrfs: add more valid checks for superblock
On 05/14/2016 02:14 AM, Liu Bo wrote: On Mon, May 09, 2016 at 09:31:37AM +0800, Qu Wenruo wrote: David Sterba wrote on 2016/05/06 16:35 +0200: On Thu, May 05, 2016 at 09:08:54AM +0800, Qu Wenruo wrote: An early check can compare against some reasonable value, but the total_bytes value must be equal to the sum of all device sizes (disk_total_bytes). I'm not sure if we have enough information to verify that at this point though. That's what I had in mind, the problem is that only the first device information is recorded in superblock. At this moment We have device_num but we don't know the size of other devices. Thanks, -liubo What about error out if we found sb->total_bytes < sb->dev_item->total_bytes? As we are just doing early check, no need to be comprehensive, but spot obvious problem. Ok. I'm gonna check for total_bytes and num_devices after loading chunk tree. For exact device_num and sb->total_bytes, we may do post check when device tree are loaded? Splitting early_check() and post_check() seems valid for me. (Also I prefer post_check() just warning, not forced exit) Why just warning? Superblock total_bytes and device sizes must be correct, otherwise all sorts of operations can fail randomly. Because if we exit, we can't even do fsck. Maybe we need a new flag to control whether exit or warn at post_check(). Thanks, Qu IMHO for kernel part, we have to exit in order to avoid any panic due to those invalid value. I'm OK with this. For fsck code, we can go forth and back to fix them. In fact I don't think fsck could work out anything, as superblock checksum has _matched_ but the values inside superblock are invalid, in this case we cannot trust other parts in this FS image, then how can we expect fsck to fix it by reading other parts? For rw fsck, that may cause huge problem and I agree with you on error out. But for ro fsck, it's a little overkilled for me. Currently, if we found error in extent tree, we still continue checking fstree, to shows what is really wrong. And for case like btrfs-image restored images, its dev extents doesn't even match with its chunk (may be it's already fixed?), but that's not a big problem for ro btrfsck, and we can go on without problem. So the same case is here for ro btrfsck. As long as that's ro btrfsck, we could just continue as we don't really need the total_bytes in superblock. Thanks, Qu Thanks, -liubo -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: btrfs ate my data in just two days, after a fresh install. ram and disk are ok. it still mounts, but I cannot repair
On Fri, May 13, 2016 at 6:10 AM, Niccolò Belli wrote:
> On venerdì 13 maggio 2016 13:35:01 CEST, Austin S. Hemmelgarn wrote:
>>
>> The fact that you're getting an OOPS involving core kernel threads
>> (kswapd) is a pretty good indication that either there's a bug elsewhere
>> in the kernel, or that something is wrong with your hardware. it's really
>> difficult to be certain if you don't have a reliable test case though.
>
> Talking about reliable test cases, I forgot to say that I definitely found
> an interesting one. It doesn't lead to OOPS but perhaps something even more
> interesting. While running countless stress tests I tried running some
> games to stress the system in different ways. I chose openmw (an open
> source engine for Morrowind) and I played it for a while on my second
> external monitor (while I watched some monitoring tools on my first
> monitor). I noticed that after playing a while I *always* lose internet
> connection (I use an USB3 Gigabit Ethernet adapter). This isn't the only
> thing which happens: even if the game keeps running flawlessly and the
> system *seems* to work fine (I can drag windows, open the terminal...)
> lots of commands simply stall (for example mounting a partition,
> unmounting it, rebooting...). I can reliably reproduce it, it ALWAYS
> happens.

Well there are a bunch of kernel debug options. If your kernel has
CONFIG_SLUB_DEBUG=y CONFIG_SLUB=y at compile time you can boot with boot
parameter slub_debug=1 to enable it and maybe there'll be something more
revealing about the problems you're having. More aggressive is
CONFIG_DEBUG_PAGEALLOC=y but it'll slow things down quite noticeably.

And then there's some Btrfs debug options for compile time, and are enabled
with mount options. But I think the problem you're having isn't specific to
Btrfs or someone else would have run into it.

-- 
Chris Murphy
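For reference, applying Chris's suggestion might look roughly like this; the config check relies on /proc/config.gz being available (CONFIG_IKCONFIG_PROC) and the GRUB paths are Arch-style assumptions, so adjust for the actual setup.

# check whether the running kernel was built with SLUB debugging support
zgrep -E 'CONFIG_SLUB(_DEBUG)?=' /proc/config.gz

# enable SLUB consistency checks on the next boot by appending the
# slub_debug parameter in /etc/default/grub, e.g.
#   GRUB_CMDLINE_LINUX_DEFAULT="... slub_debug=1"
# then regenerate the bootloader config
grub-mkconfig -o /boot/grub/grub.cfg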
Re: fsck: to repair or not to repair
On Fri, May 13, 2016 at 9:28 AM, Nikolaus Rath wrote:
> On May 13 2016, Duncan <1i5t5.dun...@cox.net> wrote:
>> Because btrfs can be multi-device, it needs some way to track which
>> devices belong to each filesystem, and it uses filesystem UUID for this
>> purpose.
>>
>> If you clone a filesystem (for instance using dd or lvm snapshotting,
>> doesn't matter how) and then trigger a btrfs device scan, say by plugging
>> in some other device with btrfs on it so udev triggers a scan, and the
>> kernel sees multiple devices with the same filesystem UUID as a result,
>> and one of those happens to be mounted, you can corrupt both copies as
>> the kernel btrfs won't be able to tell them apart and may write updates
>> to the wrong one.
>
> That seems like a rather odd design. Why isn't btrfs refusing to mount
> in this situation? In the face of ambiguity, guessing is generally a bad
> idea (at least for a computer program).

The logic you describe requires code. It's the absence of code rather than
an intentional design that's the cause of the current behavior. And yes,
it'd be nice if Btrfs weren't stepping on its own tail in this situation.

It could be as simple as refusing to mount anytime there's an ambiguity,
but that's sorta user hostile if there isn't a message that goes along with
it to help the user figure out a way to resolve the problem. And that too
could be fraught with peril if the user makes a mistake.

So, really, what's the right way to do this is part of the problem, but I
agree it's better to be hostile and refuse to mount a given volume UUID at
all when too many devices are found, than corrupt the file system.

-- 
Chris Murphy
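Until something like that is implemented, a user can at least spot the ambiguity before mounting; a rough check along these lines (assuming blkid is available) prints any btrfs filesystem UUID that is currently visible on more than one block device:

# a non-empty result means a clone of a filesystem is visible and
# mounting either copy is risky
blkid -t TYPE=btrfs -o device | xargs -r -n1 blkid -s UUID -o value | sort | uniq -d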
Re: [PATCH 1/2] Btrfs: add more valid checks for superblock
On Mon, May 09, 2016 at 09:31:37AM +0800, Qu Wenruo wrote: > > > David Sterba wrote on 2016/05/06 16:35 +0200: > > On Thu, May 05, 2016 at 09:08:54AM +0800, Qu Wenruo wrote: > > > > > An early check can compare against some reasonable value, but the > > > > > total_bytes value must be equal to the sum of all device sizes > > > > > (disk_total_bytes). I'm not sure if we have enough information to > > > > > verify > > > > > that at this point though. > > > > > > > > That's what I had in mind, the problem is that only the first device > > > > information is recorded in superblock. > > > > > > > > At this moment We have device_num but we don't know the size of other > > > > devices. > > > > > > > > Thanks, > > > > > > > > -liubo > > > > > > > > > > > What about error out if we found sb->total_bytes < > > > sb->dev_item->total_bytes? > > > > > > As we are just doing early check, no need to be comprehensive, but spot > > > obvious problem. > > > > Ok. I'm gonna check for total_bytes and num_devices after loading chunk tree. > > > > > For exact device_num and sb->total_bytes, we may do post check when > > > device tree are loaded? > > > Splitting early_check() and post_check() seems valid for me. > > > (Also I prefer post_check() just warning, not forced exit) > > > > Why just warning? Superblock total_bytes and device sizes must be > > correct, otherwise all sorts of operations can fail randomly. > > > > > Because if we exit, we can't even do fsck. > > Maybe we need a new flag to control whether exit or warn at post_check(). > > Thanks, > Qu > IMHO for kernel part, we have to exit in order to avoid any panic due to those invalid value. For fsck code, we can go forth and back to fix them. In fact I don't think fsck could work out anything, as superblock checksum has _matched_ but the values inside superblock are invalid, in this case we cannot trust other parts in this FS image, then how can we expect fsck to fix it by reading other parts? Thanks, -liubo -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: BTRFS Data at Rest File Corruption
On Thu, May 12, 2016 at 10:49 PM, Richard A. Lochner wrote:
> My apologies, they were from different boots. After the dd, I get
> these:
>
> [109479.550836] BTRFS warning (device sdb1): csum failed ino 1437377
> off 75754369024 csum 1689728329 expected csum 2165338402
> [109479.596626] BTRFS warning (device sdb1): csum failed ino 1437377
> off 75754369024 csum 1689728329 expected csum 2165338402
> [109479.601969] BTRFS warning (device sdb1): csum failed ino 1437377
> off 75754369024 csum 1689728329 expected csum 2165338402
> [109479.602189] BTRFS warning (device sdb1): csum failed ino 1437377
> off 75754369024 csum 1689728329 expected csum 2165338402
> [109479.602323] BTRFS warning (device sdb1): csum failed ino 1437377
> off 75754369024 csum 1689728329 expected csum 2165338402

That's it? Only errors from sdb1? And this time no attempt to fix it?

Normally when there is failure to match data checksums stored in metadata
to the newly computed data checksums as the blocks are read, there's an
attempt to read the mismatching blocks from another stripe. I don't see
that this is being attempted.

>> Also what do you get for these for each device:
>>
>> smartctl scterc -l /dev/sdX
>> cat /sys/block/sdX/device/timeout
>
> # smartctl -l scterc /dev/sdb
> smartctl 6.4 2015-06-04 r4109 [x86_64-linux-4.4.8-300.fc23.x86_64] (local build)
> Copyright (C) 2002-15, Bruce Allen, Christian Franke, www.smartmontools.org
>
> SCT Error Recovery Control:
>            Read: 70 (7.0 seconds)
>           Write: 70 (7.0 seconds)
>
> # smartctl -l scterc /dev/sdc
> smartctl 6.4 2015-06-04 r4109 [x86_64-linux-4.4.8-300.fc23.x86_64] (local build)
> Copyright (C) 2002-15, Bruce Allen, Christian Franke, www.smartmontools.org
>
> SCT Error Recovery Control:
>            Read: 70 (7.0 seconds)
>           Write: 70 (7.0 seconds)
>
> # cat /sys/block/sdb/device/timeout
> 30
> # cat /sys/block/sdc/device/timeout
> 30

That's appropriate. So at least any failures have a chance of being fixed
before the command timer does a reset on the bus.

-- 
Chris Murphy
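For drives or setups where these values are not already sane, the usual adjustments look roughly like this (device names are examples; the scterc command only works on drives that support SCT ERC):

# tell the drive to give up on an unreadable sector after 7 seconds
smartctl -l scterc,70,70 /dev/sdb

# for drives without configurable ERC, raise the kernel's command timer
# instead so the drive's long internal recovery can finish before the
# bus gets reset
echo 180 > /sys/block/sdb/device/timeout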
Re: BTRFS Data at Rest File Corruption
On 2016-05-13 12:28, Goffredo Baroncelli wrote:
> On 2016-05-11 21:26, Austin S. Hemmelgarn wrote:
>> (although it can't tell the difference between a corrupted checksum and a
>> corrupted block of data).
>
> I don't think so. The data checksums are stored in metadata blocks, and as
> metadata blocks, these have their own checksums. So btrfs knows if the
> checksum is correct or not, regardless of whether the data is correct or
> not. Of course if the checksum is wrong, btrfs can't tell if the data is
> correct.
>
> The only exception should be the inline data: in this case the data is
> stored in the metadata block, and this block is protected by only one
> checksum.
>
> I know that I am pedantic :_) but after reading your comment I looked at
> the btrfs data structures to refresh my memory, so I want to share this
> information.

It is fully possible for the block of data to be good and the checksum to
be bad, it's just ridiculously unlikely. I've actually had this happen
before at least twice (I have really bad luck when it comes to disk
corruption).
Re: About in-band dedupe for v4.7
On Fri, May 13, 2016 at 08:14:10AM -0400, Austin S. Hemmelgarn wrote:
> On 2016-05-12 16:54, Mark Fasheh wrote:
> > Now ask yourself the question - would you accept a write cache which is
> > expensive to fill and would only have a hit rate of less than 5%?
> In-band deduplication is a feature that is not used by typical desktop
> users or even many developers because it's computationally expensive, but
> it's used _all the time_ by big data-centers and similar places where
> processor time is cheap and storage efficiency is paramount. Deduplication
> is more useful in general the more data you have. 5% of 1 TB is 20 GB,
> which is not much. 5% of 1 PB is 20 TB, which is at least 3-5 disks, which
> can then be used for storing more data, or providing better resiliency
> against failures.

There's also a big cost difference between a 5-drive and 6-drive array when
all your servers are built with 5-drive cages. Delay that expansion for six
months, and you can buy 5 larger drives for the same price.

I have laptops with filesystems that are _just_ over 512GB before dedup.
SSDs seem to come in 512GB and 1TB sizes with nothing in between, so saving
even a few dozen MB can translate into hundreds of dollars per user (not to
mention the bureaucratic hardware provisioning side-effects which easily
triple that cost).

Working on big software integration projects can generate 60% duplication
rates. Think checkout of entire OS + disk images + embedded media, build +
install trees with copied data, long build times so there are multiple
working directories, multiple active branches of the same project, and oh
yeah it's all in SVN, which stores a complete second copy of everything on
disk just because wasting space is cool.

> To look at it another way, deduplicating an individual's home directory
> will almost never get you decent space savings, the majority of shared
> data is usually file headers and nothing more, which can't be deduplicated
> efficiently because of block size requirements. Deduplicating all the home
> directories on a terminal server with 500 users usually will get you
> decent space savings, as there very likely are a number of files that
> multiple people have exact copies of, but most of them are probably not
> big files. Deduplicating the entirety of a multi-petabyte file server used
> for storing VM disk images will probably save you a very significant
> amount of space, because the probability of having data that can be
> deduplicated goes up as you store more data, and there is likely to be a
> lot of data shared between the disk images.

Mail stores benefit from this too. For some reason, Microsoft Office users
seem to enjoy emailing multi-megabyte runs of MIME-encoded zeros to each
other. ;)
[GIT PULL] Btrfs fixes for 4.7
From: Filipe Manana Hi Chris, Please consider the following changes for the merge window for 4.7. There's an implementation for the rename exchange and rename whiteout operations, from Dan Fuhry, which was originally in David's integration branches and linux-next but then we agreed to move it into my branches since I made a few fixes on top of it. Other than that, there's just the usual set of bug fixes, in particular races between direct IO writes and fsync and between direct IO writes and balance, which lead to data loss and metadata corruption issues. These were recently rebased just to add Reviewed-by tags from Josef and Liu Bo, but otherwise unchanged and have been in testing for quite some time here. Thanks. The following changes since commit 44549e8f5eea4e0a41b487b63e616cb089922b99: Linux 4.6-rc7 (2016-05-08 14:38:32 -0700) are available in the git repository at: git://git.kernel.org/pub/scm/linux/kernel/git/fdmanana/linux.git for-chris-4.7 for you to fetch changes up to 5f9a8a51d8b95505d8de8b7191ae2ed8c504d4af: Btrfs: add semaphore to synchronize direct IO writes with fsync (2016-05-13 01:59:36 +0100) Dan Fuhry (1): btrfs: add support for RENAME_EXCHANGE and RENAME_WHITEOUT Filipe Manana (13): Btrfs: fix for incorrect directory entries after fsync log replay Btrfs: fix empty symlink after creating symlink and fsync parent dir Btrfs: don't wait for unrelated IO to finish before relocation Btrfs: don't do unnecessary delalloc flushes when relocating Btrfs: unpin log if rename operation fails Btrfs: pin log earlier when renaming Btrfs: fix inode leak on failure to setup whiteout inode in rename Btrfs: unpin logs if rename exchange operation fails Btrfs: pin logs earlier when doing a rename exchange operation Btrfs: fix number of transaction units for renames with whiteout Btrfs: fix race between fsync and direct IO writes for prealloc extents Btrfs: fix race between block group relocation and nocow writes Btrfs: add semaphore to synchronize direct IO writes with fsync fs/btrfs/btrfs_inode.h | 10 +++ fs/btrfs/ctree.h| 27 fs/btrfs/dev-replace.c | 4 +- fs/btrfs/extent-tree.c | 122 +-- fs/btrfs/inode.c| 464 +++--- fs/btrfs/ioctl.c| 2 +- fs/btrfs/ordered-data.c | 26 +--- fs/btrfs/ordered-data.h | 6 +- fs/btrfs/relocation.c | 11 ++-- fs/btrfs/super.c| 2 +- fs/btrfs/transaction.c | 2 +- fs/btrfs/tree-log.c | 66 +++ 12 files changed, 608 insertions(+), 134 deletions(-) -- 2.7.0.rc3 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: BTRFS Data at Rest File Corruption
On 2016-05-11 21:26, Austin S. Hemmelgarn wrote:
> (although it can't tell the difference between a corrupted checksum and a
> corrupted block of data).

I don't think so. The data checksums are stored in metadata blocks, and as
metadata blocks, these have their own checksums. So btrfs knows if the
checksum is correct or not, regardless of whether the data is correct or
not. Of course if the checksum is wrong, btrfs can't tell if the data is
correct.

The only exception should be the inline data: in this case the data is
stored in the metadata block, and this block is protected by only one
checksum.

I know that I am pedantic :_) but after reading your comment I looked at
the btrfs data structures to refresh my memory, so I want to share this
information.

BR
G.Baroncelli

-- 
gpg @keyserver.linux.it: Goffredo Baroncelli
Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5
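For anyone who wants to see this on disk, the data checksums Goffredo describes live in the checksum tree (tree id 7) and can be dumped read-only with btrfs-progs; the exact command name depends on the progs version, so treat this as a sketch and use an example device of your own.

# newer btrfs-progs
btrfs inspect-internal dump-tree -t csum /dev/sdb1
# older btrfs-progs ship the same functionality as a separate tool
btrfs-debug-tree -t 7 /dev/sdb1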
Re: fsck: to repair or not to repair
On May 13 2016, Duncan <1i5t5.dun...@cox.net> wrote:
> Because btrfs can be multi-device, it needs some way to track which
> devices belong to each filesystem, and it uses filesystem UUID for this
> purpose.
>
> If you clone a filesystem (for instance using dd or lvm snapshotting,
> doesn't matter how) and then trigger a btrfs device scan, say by plugging
> in some other device with btrfs on it so udev triggers a scan, and the
> kernel sees multiple devices with the same filesystem UUID as a result,
> and one of those happens to be mounted, you can corrupt both copies as
> the kernel btrfs won't be able to tell them apart and may write updates
> to the wrong one.

That seems like a rather odd design. Why isn't btrfs refusing to mount in
this situation? In the face of ambiguity, guessing is generally a bad idea
(at least for a computer program).

Best,
-Nikolaus

-- 
GPG encrypted emails preferred. Key id: 0xD113FCAC3C4E599F
Fingerprint: ED31 791B 2C5C 1613 AF38 8B8A D113 FCAC 3C4E 599F

»Time flies like an arrow, fruit flies like a Banana.«
Re: About in-band dedupe for v4.7
On 05/13/2016 08:14 PM, Austin S. Hemmelgarn wrote: On 2016-05-12 16:54, Mark Fasheh wrote: On Wed, May 11, 2016 at 07:36:59PM +0200, David Sterba wrote: On Tue, May 10, 2016 at 07:52:11PM -0700, Mark Fasheh wrote: Taking your history with qgroups out of this btw, my opinion does not change. With respect to in-memory only dedupe, it is my honest opinion that such a limited feature is not worth the extra maintenance work. In particular there's about 800 lines of code in the userspace patches which I'm sure you'd want merged, because how could we test this then? I like the in-memory dedup backend. It's lightweight, only a heuristic, does not need any IO or persistent storage. OTOH I consider it a subpart of the in-band deduplication that does all the persistency etc. So I treat the ioctl interface from a broader aspect. Those are all nice qualities, but what do they all get us? For example, my 'large' duperemove test involves about 750 gigabytes of general purpose data - quite literally /home off my workstation. After the run I'm usually seeing between 65-75 gigabytes saved for a total of only 10% duplicated data. I would expect this to be fairly 'average' - /home on my machine has the usual stuff - documents, source code, media, etc. So if you were writing your whole fs out you could expect about the same from inline dedupe - 10%-ish. Let's be generous and go with that number though as a general 'this is how much dedupe we get'. What the memory backend is doing then is providing a cache of sha256/block calculations. This cache is very expensive to fill, and every written block must go through it. On top of that, the cache does not persist between mounts, and has items regularly removed from it when we run low on memory. All of this will drive down the amount of duplicated data we can find. So our best case savings is probably way below 10% - let's be _really_ nice and say 5%. Now ask yourself the question - would you accept a write cache which is expensive to fill and would only have a hit rate of less than 5%? In-band deduplication is a feature that is not used by typical desktop users or even many developers because it's computationally expensive, but it's used _all the time_ by big data-centers and similar places where processor time is cheap and storage efficiency is paramount. Deduplication is more useful in general the more data you have. 5% of 1 TB is 20 GB, which is not much. 5% of 1 PB is 20 TB, which is at least 3-5 disks, which can then be used for storing more data, or providing better resiliency against failures. To look at it another way, deduplicating an individual's home directory will almost never get you decent space savings, the majority of shared data is usually file headers and nothing more, which can't be deduplicated efficiently because of block size requirements. Deduplicating all the home directories on a terminal server with 500 users usually will get you decent space savings, as there very likely are a number of files that multiple people have exact copies of, but most of them are probably not big files. Deduplicating the entirety of a multi-petabyte file server used for storing VM disk images will probably save you a very significant amount of space, because the probability of having data that can be deduplicated goes up as you store more data, and there is likely to be a lot of data shared between the disk images. This is exactly why I don't use deduplication on any of my personal systems. 
On my laptop, the space saved is just not worth the time spent doing it, as I fall pretty solidly into the first case (most of the data duplication on my systems is in file headers). On my home server, I'm not storing enough data with sufficient internal duplication that it would save more than 10-20 GB, which doesn't matter for me given that I'm using roughly half of the 2.2 TB of effective storage space I have. However, once we (eventually) get all the file servers where I work moved over to Linux systems running BTRFS, we will absolutely be using deduplication there, as we have enough duplication in our data that it will probably cut our storage requirements by around 20% on average. -- Nice to hear that. Although the feature is not going to be merged soon, but there is some info which may help anyone interested with inband dedupe to plan their usage or tests. 1) CPU core number Since currently in-band dedupe balance all SHA256 calculation using async-thread, for default thread_pool number, it's max(nr_cpu + 2, 8). So if using default thread_pool number, and the system doesn't have more than 8 cores, all CPU can be easily eaten up by dedupe. So either use dedupe with default thread_pool setting when you have more than 8 cores, or tunning thread_pool to nr_cpu - 1 or 2. Or your system will be less responsive then running delalloc (fsync/sync). 2) Dedupe speed. In our internal 1 core test. It's about the same speed compared to
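Qu's point about thread_pool tuning above translates into a mount option; a hedged example for an 8-core machine (the value and device name are only illustrations) would be:

# cap the btrfs worker thread pool so SHA256 calculation for in-band
# dedupe cannot saturate every core
mount -o thread_pool=6 /dev/sdX /mnt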
Re: About in-band dedupe for v4.7
On 2016-05-12 16:54, Mark Fasheh wrote: On Wed, May 11, 2016 at 07:36:59PM +0200, David Sterba wrote: On Tue, May 10, 2016 at 07:52:11PM -0700, Mark Fasheh wrote: Taking your history with qgroups out of this btw, my opinion does not change. With respect to in-memory only dedupe, it is my honest opinion that such a limited feature is not worth the extra maintenance work. In particular there's about 800 lines of code in the userspace patches which I'm sure you'd want merged, because how could we test this then? I like the in-memory dedup backend. It's lightweight, only a heuristic, does not need any IO or persistent storage. OTOH I consider it a subpart of the in-band deduplication that does all the persistency etc. So I treat the ioctl interface from a broader aspect. Those are all nice qualities, but what do they all get us? For example, my 'large' duperemove test involves about 750 gigabytes of general purpose data - quite literally /home off my workstation. After the run I'm usually seeing between 65-75 gigabytes saved for a total of only 10% duplicated data. I would expect this to be fairly 'average' - /home on my machine has the usual stuff - documents, source code, media, etc. So if you were writing your whole fs out you could expect about the same from inline dedupe - 10%-ish. Let's be generous and go with that number though as a general 'this is how much dedupe we get'. What the memory backend is doing then is providing a cache of sha256/block calculations. This cache is very expensive to fill, and every written block must go through it. On top of that, the cache does not persist between mounts, and has items regularly removed from it when we run low on memory. All of this will drive down the amount of duplicated data we can find. So our best case savings is probably way below 10% - let's be _really_ nice and say 5%. Now ask yourself the question - would you accept a write cache which is expensive to fill and would only have a hit rate of less than 5%? In-band deduplication is a feature that is not used by typical desktop users or even many developers because it's computationally expensive, but it's used _all the time_ by big data-centers and similar places where processor time is cheap and storage efficiency is paramount. Deduplication is more useful in general the more data you have. 5% of 1 TB is 20 GB, which is not much. 5% of 1 PB is 20 TB, which is at least 3-5 disks, which can then be used for storing more data, or providing better resiliency against failures. To look at it another way, deduplicating an individual's home directory will almost never get you decent space savings, the majority of shared data is usually file headers and nothing more, which can't be deduplicated efficiently because of block size requirements. Deduplicating all the home directories on a terminal server with 500 users usually will get you decent space savings, as there very likely are a number of files that multiple people have exact copies of, but most of them are probably not big files. Deduplicating the entirety of a multi-petabyte file server used for storing VM disk images will probably save you a very significant amount of space, because the probability of having data that can be deduplicated goes up as you store more data, and there is likely to be a lot of data shared between the disk images. This is exactly why I don't use deduplication on any of my personal systems. 
On my laptop, the space saved is just not worth the time spent doing it, as I fall pretty solidly into the first case (most of the data duplication on my systems is in file headers). On my home server, I'm not storing enough data with sufficient internal duplication that it would save more than 10-20 GB, which doesn't matter for me given that I'm using roughly half of the 2.2 TB of effective storage space I have. However, once we (eventually) get all the file servers where I work moved over to Linux systems running BTRFS, we will absolutely be using deduplication there, as we have enough duplication in our data that it will probably cut our storage requirements by around 20% on average. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: btrfs ate my data in just two days, after a fresh install. ram and disk are ok. it still mounts, but I cannot repair
On venerdì 13 maggio 2016 13:35:01 CEST, Austin S. Hemmelgarn wrote:
> The fact that you're getting an OOPS involving core kernel threads
> (kswapd) is a pretty good indication that either there's a bug elsewhere
> in the kernel, or that something is wrong with your hardware. it's really
> difficult to be certain if you don't have a reliable test case though.

Talking about reliable test cases, I forgot to say that I definitely found
an interesting one. It doesn't lead to OOPS but perhaps something even more
interesting. While running countless stress tests I tried running some games
to stress the system in different ways. I chose openmw (an open source
engine for Morrowind) and I played it for a while on my second external
monitor (while I watched some monitoring tools on my first monitor). I
noticed that after playing a while I *always* lose internet connection (I
use an USB3 Gigabit Ethernet adapter). This isn't the only thing which
happens: even if the game keeps running flawlessly and the system *seems*
to work fine (I can drag windows, open the terminal...) lots of commands
simply stall (for example mounting a partition, unmounting it,
rebooting...). I can reliably reproduce it, it ALWAYS happens.

Niccolò
[PATCH 1/3] [RFC] fstests: btrfs: add functions to set and reset required number of SCRATCH_DEV_POOL
This patch provides functions _scratch_dev_pool_get() _scratch_dev_pool_put() Which will help to set/reset SCRATCH_DEV_POOL with the required number of devices. SCRATCH_DEV_POOL_SAVED will hold all the devices. Usage: _scratch_dev_pool_get() :: do stuff _scratch_dev_pool_put() Signed-off-by: Anand Jain --- common/rc | 47 +++ 1 file changed, 47 insertions(+) diff --git a/common/rc b/common/rc index 91e8f1c8e693..33632fd8e4a3 100644 --- a/common/rc +++ b/common/rc @@ -786,6 +786,53 @@ _scratch_mkfs() esac } +# +# $1 Number of the scratch devs required +# +_scratch_dev_pool_get() +{ + if [ $# != 1 ]; then + _fail "Usage: _scratch_dev_pool_get ndevs" + fi + + local test_ndevs=$1 + local config_ndevs=`echo $SCRATCH_DEV_POOL| wc -w` + local devs[]="( $SCRATCH_DEV_POOL )" + + typeset -p config_ndevs >/dev/null 2>&1 + if [ $? != 0 ]; then + _fail "Bug: unset val, must call _scratch_dev_pool_get before _scratch_dev_pool_put" + fi + + # _require_scratch_dev_pool $test_ndevs + # must have already checked the min required devices + # but just in case, trap here for any potential bugs + # perpetuating any further + if [ $config_ndevs -lt $test_ndevs ]; then + _notrun "Need at least test requested number of ndevs $test_ndevs" + fi + + SCRATCH_DEV_POOL_SAVED=${SCRATCH_DEV_POOL} + export SCRATCH_DEV_POOL_SAVED + SCRATCH_DEV_POOL=${devs[@]:0:$test_ndevs} + export SCRATCH_DEV_POOL +} + +_scratch_dev_pool_put() +{ + typeset -p SCRATCH_DEV_POOL_SAVED >/dev/null 2>&1 + if [ $? != 0 ]; then + _fail "Bug: unset val, must call _scratch_dev_pool_get before _scratch_dev_pool_put" + fi + + if [ -z "$SCRATCH_DEV_POOL_SAVED" ]; then + _fail "Bug: str empty, must call _scratch_dev_pool_get before _scratch_dev_pool_put" + fi + + export SCRATCH_DEV_POOL=$SCRATCH_DEV_POOL_SAVED + export SCRATCH_DEV_POOL_SAVED="" +} + _scratch_pool_mkfs() { case $FSTYP in -- 2.7.0 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 3/3] fstests: btrfs: 027 make use of new device get and put helper functions
Below patches added helper function to get the requested number of devices for scratch and spare device fstest: btrfs: add functions to get and put a device for replace target fstests: btrfs: add functions to set and reset required number of SCRATCH_DEV_POOL This patch makes use of them. Signed-off-by: Anand Jain --- tests/btrfs/027 | 24 1 file changed, 16 insertions(+), 8 deletions(-) diff --git a/tests/btrfs/027 b/tests/btrfs/027 index f0844a14f8e6..905dc0ebaa27 100755 --- a/tests/btrfs/027 +++ b/tests/btrfs/027 @@ -57,17 +57,22 @@ _require_command "$WIPEFS_PROG" wipefs run_test() { local mkfs_opts=$1 - local saved_scratch_dev_pool=$SCRATCH_DEV_POOL - local replace_dev=`echo $SCRATCH_DEV_POOL | awk '{print $NF}'` + local ndevs=`echo $SCRATCH_DEV_POOL | wc -w` + + # reserve one for replace target + ((ndevs--)) + + _scratch_dev_pool_get $ndevs + _spare_dev_get echo "Test $mkfs_opts" >>$seqres.full - SCRATCH_DEV_POOL=`echo $SCRATCH_DEV_POOL | sed -e "s# *$replace_dev *##"` _scratch_pool_mkfs $mkfs_opts >>$seqres.full 2>&1 # make sure we created btrfs with desired options if [ $? -ne 0 ]; then echo "mkfs $mkfs_opts failed" - SCRATCH_DEV_POOL=$saved_scratch_dev_pool + _spare_dev_put + _scratch_dev_pool_put return fi _scratch_mount >>$seqres.full 2>&1 @@ -89,17 +94,19 @@ run_test() _scratch_mount -o degraded >>$seqres.full 2>&1 # replace $missing_dev with $replace_dev and scrub it to double-check - $BTRFS_UTIL_PROG replace start -B -r $missing_dev_id $replace_dev \ + $BTRFS_UTIL_PROG replace start -B -r $missing_dev_id $SPARE_DEV \ $SCRATCH_MNT -f >>$seqres.full 2>&1 if [ $? -ne 0 ]; then echo "btrfs replace failed" - SCRATCH_DEV_POOL=$saved_scratch_dev_pool + _spare_dev_put + _scratch_dev_pool_put return fi $BTRFS_UTIL_PROG scrub start -B $SCRATCH_MNT >>$seqres.full 2>&1 if [ $? -ne 0 ]; then echo "btrfs scrub failed" - SCRATCH_DEV_POOL=$saved_scratch_dev_pool + _spare_dev_put + _scratch_dev_pool_put return fi @@ -107,7 +114,8 @@ run_test() # we called _require_scratch_nocheck instead of _require_scratch # do check after test for each profile config _check_scratch_fs - SCRATCH_DEV_POOL=$saved_scratch_dev_pool + _spare_dev_put + _scratch_dev_pool_put } echo "Silence is golden" -- 2.7.0 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 2/3] [RFC] fstests: btrfs: add functions to get and put a device for replace target
For the replace tests we need a device as a spare device, here functions _spare_dev_get() and _spare_dev_put() will get it from the SCRATCH_DEV_POOL_SAVED, which is set when _scratch_dev_pool_get() is called, and is based on how many has already been assigned to SCRATCH_DEV_POOL. usage: _scratch_dev_pool_get 3 _spare_dev_get SPARE_DEV will have a device set which can be used as the replace target device. _spare_dev_put _scratch_dev_pool_put _spare_dev_get() will pick the next device after SCRATCH_DEV_POOL devices, from the SCRATCH_DEV_POOL_SAVED, and assigns it to SPARE_DEV. _spare_dev_put() will set to SPARE_DEV to null. Signed-off-by: Anand Jain --- common/rc | 45 + 1 file changed, 45 insertions(+) diff --git a/common/rc b/common/rc index 33632fd8e4a3..f2b39ddbee0c 100644 --- a/common/rc +++ b/common/rc @@ -786,6 +786,51 @@ _scratch_mkfs() esac } +_spare_dev_get() +{ + typeset -p SCRATCH_DEV_POOL_SAVED >/dev/null 2>&1 + if [ $? != 0 ]; then + _fail "Bug: unset val, must call _scratch_dev_pool_get before _spare_dev_get" + fi + + if [ -z "$SCRATCH_DEV_POOL_SAVED" ]; then + _fail "Bug: str empty, must call _scratch_dev_pool_get before _spare_dev_get" + fi + + # Check if the spare is already assigned + typeset -p SPARE_DEV >/dev/null 2>&1 + if [ $? == 0 ]; then + if [ ! -z "$SPARE_DEV" ]; then + _fail "Bug: SPARE_DEV = $SPARE_DEV already assigned" + fi + fi + + local ndevs=`echo $SCRATCH_DEV_POOL| wc -w` + local config_ndevs=`echo $SCRATCH_DEV_POOL_SAVED| wc -w` + + if [ $ndevs -eq $config_ndevs ]; then + _notrun "All devs used no spare" + fi + # Get a dev that is not used + local devs[]="( $SCRATCH_DEV_POOL_SAVED )" + SPARE_DEV=${devs[@]:$ndevs:1} + export SPARE_DEV +} + +_spare_dev_put() +{ + typeset -p SPARE_DEV >/dev/null 2>&1 + if [ $? != 0 ]; then + _fail "Bug: unset val, must call _spare_dev_get before its put" + fi + + if [ -z "$SPARE_DEV" ]; then + _fail "Bug: str empty, must call _spare_dev_get before its put" + fi + + export SPARE_DEV="" +} + # # $1 Number of the scratch devs required # -- 2.7.0 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: btrfs ate my data in just two days, after a fresh install. ram and disk are ok. it still mounts, but I cannot repair
On 2016-05-13 07:07, Niccolò Belli wrote: On giovedì 12 maggio 2016 17:43:38 CEST, Austin S. Hemmelgarn wrote: That's probably a good indication of the CPU and the MB being OK, but not necessarily the RAM. There's two other possible options for testing the RAM that haven't been mentioned yet though (which I hadn't thought of myself until now): 1. If you have access to Windows, try the Windows Memory Diagnostic. This runs yet another slightly different set of tests from memtest86 and memtest86+, so it may catch issues they don't. You can start this directly on an EFI system by loading /EFI/Microsoft/Boot/MEMTEST.EFI from the EFI system partition. 2. This is a Dell system. If you still have the utility partition which Dell ships all their per-provisioned systems with, that should have a hardware diagnostics tool. I doubt that this will find anything (it's part of their QA procedure AFAICT), but it's probably worth trying, as the memory testing in that uses yet another slightly different implementation of the typical tests. You can usually find this in the boot interrupt menu accessed by hitting F12 before the boot-loader loads. I tried the Dell System Test, including the enhanced optional ram tests and it was fine. I also tried the Microsoft one, which passed. BUT if I select the advanced test in the Microsoft One it always stops at 21% of first test. The test menus are still working, but fans get quiet and it keeps writing "test running... 21%" forever. I tried it many times and it always got stuck at 21%, so I suspect a test suite bug instead of a ram failure. I've actually seen this before on other systems (different completion percentage on each system, but otherwise the same), all of them ended up actually having a bad CPU or MB, although the ones with CPU issues were fine after BIOS updates which included newer microcode. I also noticed some other interesting behaviours: while I was running the usual scrub+check (both were fine) from the livecd I noticed this in dmesg: [ 261.301159] BTRFS info (device dm-0): bdev /dev/mapper/cryptroot errs: wr 0, rd 0, flush 0, corrupt 4, gen 0 Corrupt? But both scrub and check were fine... I double checked scrub and check and they were still fine. It's worth noting that these are running counts of errors since the last time the stats were reset (and they only get reset manually). If you haven't reset the stats, then this isn't all that surprising. This is what happened another time: https://drive.google.com/open?id=0Bwe9Wtc-5xF1dGtPaWhTZ0w5aUU I was making a backup of my partition USING DD from the livecd. It wasn't even mounted if I recall correctly! The fact that you're getting an OOPS involving core kernel threads (kswapd) is a pretty good indication that either there's a bug elsewhere in the kernel, or that something is wrong with your hardware. it's really difficult to be certain if you don't have a reliable test case though. On giovedì 12 maggio 2016 18:48:17 CEST, Zygo Blaxell wrote: That's what a RAM corruption problem looks like when you run btrfs scrub. Maybe the RAM itself is OK, but *something* is scribbling on it. Does the Arch live usb use the same kernel as your normal system? Yes, except for the point release (the system is slightly ahead of the liveusb). On giovedì 12 maggio 2016 18:48:17 CEST, Zygo Blaxell wrote: Did you try an older (or newer) kernel? I've been running 4.5.x on a few canary systems, but so far none of them have survived more than a day. No (except for point releases from 4.5.0 to 4.5.4), but I will try 4.4. 
FWIW, I've been running 4.5 with almost no issues on my laptop since it came out (the few issues I have had are not unique to 4.5, and are all ultimately firmware issues (Lenovo has been getting _really_ bad recently about having broken ACPI and EFI implementations...)). Of course, I'm also running Gentoo, so everything is built locally, but I doubt that that has much impact on stability. On giovedì 12 maggio 2016 18:48:17 CEST, Zygo Blaxell wrote: It's possible there's a problem that affects only very specific chipsets You seem to have eliminated RAM in isolation, but there could be a problem in the kernel that affects only your chipset. Funny considering it is sold as a Linux laptop. Unfortunately they only tested it with the ancient Ubuntu 14.04. Sadly, this is pretty typical for anything sold as a 'Linux' system that isn't a server. Even for the servers sold as such, it's not unusual for it to only be tested with with old versions of CentOS. Now, I hadn't thought of this before, but it's a Dell system, so you're trapping out to SMBIOS for everything under the sun, and if they don't pass a correct memory map (or correct ACPI tables) to the OS during boot, then there may be some sections of RAM that both Linux and the firmware think they can use, which could definitely result in symptoms like bad RAM while still consistently passing memory tests (because they don't make
Re: btrfs ate my data in just two days, after a fresh install. ram and disk are ok. it still mounts, but I cannot repair
On giovedì 12 maggio 2016 17:43:38 CEST, Austin S. Hemmelgarn wrote: That's probably a good indication of the CPU and the MB being OK, but not necessarily the RAM. There's two other possible options for testing the RAM that haven't been mentioned yet though (which I hadn't thought of myself until now): 1. If you have access to Windows, try the Windows Memory Diagnostic. This runs yet another slightly different set of tests from memtest86 and memtest86+, so it may catch issues they don't. You can start this directly on an EFI system by loading /EFI/Microsoft/Boot/MEMTEST.EFI from the EFI system partition. 2. This is a Dell system. If you still have the utility partition which Dell ships all their per-provisioned systems with, that should have a hardware diagnostics tool. I doubt that this will find anything (it's part of their QA procedure AFAICT), but it's probably worth trying, as the memory testing in that uses yet another slightly different implementation of the typical tests. You can usually find this in the boot interrupt menu accessed by hitting F12 before the boot-loader loads. I tried the Dell System Test, including the enhanced optional ram tests and it was fine. I also tried the Microsoft one, which passed. BUT if I select the advanced test in the Microsoft One it always stops at 21% of first test. The test menus are still working, but fans get quiet and it keeps writing "test running... 21%" forever. I tried it many times and it always got stuck at 21%, so I suspect a test suite bug instead of a ram failure. I also noticed some other interesting behaviours: while I was running the usual scrub+check (both were fine) from the livecd I noticed this in dmesg: [ 261.301159] BTRFS info (device dm-0): bdev /dev/mapper/cryptroot errs: wr 0, rd 0, flush 0, corrupt 4, gen 0 Corrupt? But both scrub and check were fine... I double checked scrub and check and they were still fine. This is what happened another time: https://drive.google.com/open?id=0Bwe9Wtc-5xF1dGtPaWhTZ0w5aUU I was making a backup of my partition USING DD from the livecd. It wasn't even mounted if I recall correctly! On giovedì 12 maggio 2016 18:48:17 CEST, Zygo Blaxell wrote: That's what a RAM corruption problem looks like when you run btrfs scrub. Maybe the RAM itself is OK, but *something* is scribbling on it. Does the Arch live usb use the same kernel as your normal system? Yes, except for the point release (the system is slightly ahead of the liveusb). On giovedì 12 maggio 2016 18:48:17 CEST, Zygo Blaxell wrote: Did you try an older (or newer) kernel? I've been running 4.5.x on a few canary systems, but so far none of them have survived more than a day. No (except for point releases from 4.5.0 to 4.5.4), but I will try 4.4. On giovedì 12 maggio 2016 18:48:17 CEST, Zygo Blaxell wrote: It's possible there's a problem that affects only very specific chipsets You seem to have eliminated RAM in isolation, but there could be a problem in the kernel that affects only your chipset. Funny considering it is sold as a Linux laptop. Unfortunately they only tested it with the ancient Ubuntu 14.04. Niccolò -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [PATCH] btrfs: fix memory leak during RAID 5/6 device replacement
Hi, Scott Talbert

> From: David Sterba [mailto:dste...@suse.cz]
> Sent: Tuesday, May 10, 2016 1:32 AM
> To: Scott Talbert
> Cc: linux-btrfs@vger.kernel.org; Zhao Lei
> Subject: Re: [PATCH] btrfs: fix memory leak during RAID 5/6 device replacement
>
> CC Zhao Lei
>
> On Mon, May 09, 2016 at 09:14:28AM -0400, Scott Talbert wrote:
> > A 'struct bio' is allocated in scrub_missing_raid56_pages(), but it was
> > never freed anywhere.
> >
> > Signed-off-by: Scott Talbert
> > ---
> >  fs/btrfs/scrub.c | 2 ++
> >  1 file changed, 2 insertions(+)
> >
> > diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
> > index 82bedf9..607cc6e 100644
> > --- a/fs/btrfs/scrub.c
> > +++ b/fs/btrfs/scrub.c
> > @@ -2130,6 +2130,8 @@ static void scrub_missing_raid56_end_io(struct bio *bio)
> >  	if (bio->bi_error)
> >  		sblock->no_io_error_seen = 0;
> >
> > +	bio_put(bio);
> > +

Seems good from review. I'll add some debug code to test it in more detail.

Thanks
Zhaolei

> >  	btrfs_queue_work(fs_info->scrub_workers, &sblock->work);
> >  }
> >
> > --
> > 1.9.1
> >
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> > the body of a message to majord...@vger.kernel.org
> > More majordomo info at http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] btrfs-progs: added quiet-option for scripts
Hi, On Thu, May 12, 2016 at 09:49:25PM +0200, bt...@oss.m-berberich.de wrote: > From: M G Berberich > > -q,--quiet to prevent status-messages on stderr > --verbose as alternative for -v > moved 'Mode NO_FILE_DATA enabled' message to stderr > changed default for g_verbose to 1 added the signed-off-by line and did the following tweaks and applied, thanks. --- a/cmds-send.c +++ b/cmds-send.c @@ -44,8 +44,10 @@ #include "send.h" #include "send-utils.h" -/* default is 1 for historical reasons - changing may break scripts */ +/* + * Default is 1 for historical reasons, changing may break scripts that expect + * the 'At subvol' message. + */ static int g_verbose = 1; struct btrfs_send { @@ -445,7 +447,7 @@ int cmd_send(int argc, char **argv) g_verbose++; break; case 'q': - g_verbose--; + g_verbose = 0; break; case 'e': new_end_cmd_semantic = 1; @@ -630,7 +632,8 @@ int cmd_send(int argc, char **argv) } if ((send_flags & BTRFS_SEND_FLAG_NO_FILE_DATA) && g_verbose > 1) - fprintf(stderr, "Mode NO_FILE_DATA enabled\n"); + if (g_verbose > 1) + fprintf(stderr, "Mode NO_FILE_DATA enabled\n"); for (i = optind; i < argc; i++) { int is_first_subvol; @@ -721,9 +724,6 @@ const char * const cmd_send_usage[] = { "which case 'btrfs send' will determine a suitable parent among the", "clone sources itself.", "\n", - "-v, --verboseEnable verbose debug output. Each occurrence of", - " this option increases the verbose level more.", - "-q, --quiet suppress messages to stderr.", "-e If sending multiple subvols at once, use the new", " format and omit the end-cmd between the subvols.", "-p Send an incremental stream from to", @@ -737,6 +737,8 @@ const char * const cmd_send_usage[] = { " does not contain any file data and thus cannot be used", " to transfer changes. This mode is faster and useful to", " show the differences in metadata.", - "--help display this help and exit", + "-v|--verbose enable verbose output to stderr, each occurrence of", + " this option increases verbosity", + "-q|--quiet suppress all messages, except errors", NULL }; -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
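As a rough illustration of the resulting command-line behaviour (subvolume and destination paths are placeholders, and the exact messages depend on the btrfs-progs version carrying this patch):

  # default verbosity: one informational 'At subvol ...' line per subvolume
  btrfs send /mnt/snapshots/home.20160513 | btrfs receive /mnt/backup

  # -q / --quiet: suppress the status messages on stderr, only errors remain
  btrfs send -q /mnt/snapshots/home.20160513 | btrfs receive /mnt/backup

  # -v / --verbose may be repeated for increasingly chatty debug output
  btrfs send -vv /mnt/snapshots/home.20160513 | btrfs receive /mnt/backup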
Re: [PATCH 0/1] btrfs-progs: Typo review of strings and comments
On Thu, May 12, 2016 at 04:56:46PM -0400, Nicholas D Steeves wrote: > Thank you for reviewing this patch. Sorry for missing the context of > the strtoull comment; I should have been able to infer that and am > embarrassed that I failed to. Also, embarrassed because I think I've > used it in some C++ code! > > I learned how to use git rebase and git reset today, and can submit a > v2 patch diffed against master at your earliest convenience. My only > remaining question is this: No need to send v2, I fixed the issues (and usually do if they seem easy). > mkfs.c: printf:("Incompatible features: %s", features_buf) > * Should this be left as "Imcompat features"? Actually yes, it should be 'Incompat', patch updated. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Input/output error on newly created file
On Friday 13 May 2016, Duncan <1i5t5.dun...@cox.net> wrote:
> Szalma László posted on Thu, 12 May 2016 20:28:24 +0200 as excerpted:
>> The files that rarely become unreadable (I/O error but no error in dmesg
>> or anywhere) are mysql MyIsam database files, and they are always small.
>> Like 16kbyte for example, or smaller. Sometimes dropping the fs cache
>> fixes the problem, sometimes not. Umount / mount always fixes the
>> problem. Scrub says the filesystem is OK (when the file is unreadable).
>
> Is it possible the files are always under 4 KiB?

For the record, I was seeing a similar error with dovecot *.index.log files
(see the ML thread started by Szalma László). In my case they are *not* all
under 4 KiB. Looking at some of the affected files, one of them is 25K, and
another is 6.6K. However, perhaps they compress to under 4K? But compressing
the 25K one with lzop only goes down to 5.6K with -9 :-/ .

> While there's a few variables as to max size, with 4 KiB being the
> practical max, btrfs will inline really small files into their metadata
> node instead of writing them out as 4 KiB data block extents. Since
> that's obviously a different code-path, if it's only 4 KiB and smaller
> files, it's possible there's a race in that code-path that when lost,
> results in the file "disappearing" without a dmesg trace, only to
> reappear after reboot.
>
> Actually, now that I think about it, if the files are all OVER say 2 KiB
> but still quite small, say under 16 KiB, and if they had just recently
> grown, it's possible that the race is in the transition from the inline
> metadata state to the multiples of 4 KiB block data extent state.
>
> And if all the files had just shrunk, say from compaction (if done in-
> place, not with a copy and rename), perhaps it's the reverse, the
> transition from written data blocks to inline metadata state.

I'm glad somebody is (publicly) thinking about this :-) !

Greetings
--
Marc Joliet
--
"People who think they know everything really annoy those of us who know we
don't" - Bjarne Stroustrup
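One way to check whether an affected file is actually stored inline, rather than inferring it from the size, is to look at its FIEMAP output; a sketch, assuming filefrag from e2fsprogs and a hypothetical index file path:

  # an inlined file shows a single extent whose flags include "inline"
  filefrag -v /home/user/Maildir/dovecot.index.log

  # rough idea of how small the data compresses (only an approximation of
  # what the kernel's lzo/zlib code would produce)
  lzop -9 -c /home/user/Maildir/dovecot.index.log | wc -c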
Re: Undelete deleted subvolume?
Andrei Borzenkov posted on Thu, 12 May 2016 17:19:11 +0300 as excerpted: > I accidentally deleted wrong snapshot using SUSE snapper. Is it possible > to undelete subvolume? I know that it is possible to extract files from > old tree (although SLES12 does not seem to offer btrfs-find-root), but > is it possible to "reconnect" subvolume back? I don't know of a way to do it directly, altho the devs may be able to suggest something exotic, but... You should at least be able to /emulate/ it, by using btrfs restore (with the filesystem unmounted) to write the files elsewhere (the part you suggested), then mounting the filesystem, creating a new subvolume, and copying the restored files back into it. That does lose the reflinks of a snapshot if that's what you deleted, so will take more space, but if you then run one of the btrfs dedupers on it, you should be able to re-reflink it, sharing extents via reflink and reducing the exclusive space used, once again. (Tho I have no personal experience with the dedupers so can't give you specific help in that regard.) But if you were using it as the reference parent for send/receive, or something, I think it's gone from that regard, as the emulation above would of course have a different subvolume ID. -- Duncan - List replies preferred. No HTML msgs. "Every nonfree program has a lord, a master -- and if you use the program, he is your master." Richard Stallman -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
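A rough sketch of that emulation, with device and mount paths as placeholders; depending on how the subvolume was deleted, btrfs restore may also need an older tree root located with btrfs-find-root and passed via -t:

  # with the filesystem unmounted, copy the lost files out
  btrfs restore /dev/sdXn /mnt/recovered

  # recreate a subvolume and copy the data back in
  mount /dev/sdXn /mnt/btrfs
  btrfs subvolume create /mnt/btrfs/snapshot-restored
  cp -a /mnt/recovered/. /mnt/btrfs/snapshot-restored/

  # optionally re-share extents afterwards with an offline deduper
  duperemove -dr /mnt/btrfs/snapshot-restored /mnt/btrfs/data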
Re: Input/output error on newly created file
Szalma László posted on Thu, 12 May 2016 20:28:24 +0200 as excerpted: > The files that rarely become unreadable (I/O error but no error in dmesg > or anywhere) are mysql MyIsam database files, and they are always small. > Like 16kbyte for example, or smaller. Sometimes dropping the fs cache > fixes the problem, sometimes not. Umount / mount always fixes the > problem. Scrub says the filesystem is OK (when the file is unreadable). Is it possible the files are always under 4 KiB? While there's a few variables as to max size, with 4 KiB being the practical max, btrfs will inline really small files into their metadata node instead of writing them out as 4 KiB data block extents. Since that's obviously a different code-path, if it's only 4 KiB and smaller files, it's possible there's a race in that code-path that when lost, results in the file "disappearing" without a dmesg trace, only to reappear after reboot. Actually, now that I think about it, if the files are all OVER say 2 KiB but still quite small, say under 16 KiB, and if they had just recently grown, it's possible that the race is in the transition from the inline metadata state to the multiples of 4 KiB block data extent state. And if all the files had just shrunk, say from compaction (if done in- place, not with a copy and rename), perhaps it's the reverse, the transition from written data blocks to inline metadata state. -- Duncan - List replies preferred. No HTML msgs. "Every nonfree program has a lord, a master -- and if you use the program, he is your master." Richard Stallman -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
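If the inline code path is the suspect, one low-risk way to test the theory is to turn inlining off with the max_inline mount option and watch whether the unreadable-file symptom disappears (the mount point is only an example):

  # force even tiny files to get regular data extents instead of being
  # inlined into metadata
  mount -o remount,max_inline=0 /var/lib/mysql

  # or persistently, in /etc/fstab:
  # /dev/sdXn  /var/lib/mysql  btrfs  defaults,max_inline=0  0 0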
Re: About in-band dedupe for v4.7
Mark Fasheh posted on Thu, 12 May 2016 13:54:26 -0700 as excerpted: > For example, my 'large' duperemove test involves about 750 gigabytes of > general purpose data - quite literally /home off my workstation. > > After the run I'm usually seeing between 65-75 gigabytes saved for a > total of only 10% duplicated data. I would expect this to be fairly > 'average' - /home on my machine has the usual stuff - documents, source > code, media, etc. > > So if you were writing your whole fs out you could expect about the same > from inline dedupe - 10%-ish. Let's be generous and go with that number > though as a general 'this is how much dedupe we get'. > > What the memory backend is doing then is providing a cache of > sha256/block calculations. This cache is very expensive to fill, and > every written block must go through it. On top of that, the cache does > not persist between mounts, and has items regularly removed from it when > we run low on memory. All of this will drive down the amount of > duplicated data we can find. > > So our best case savings is probably way below 10% - let's be _really_ > nice and say 5%. My understanding is that this "general purpose data" use-case isn't being targeted by the in-memory dedup at all, because indeed it's a very poor fit for exactly the reason you explain. Instead, think data centers where perhaps 50% of all files are duplicated thousands of times... and it's exactly those files that are most frequently used. Totally different use-case, where that 5% on general purpose data could easily skyrocket to 50%+. Refining that a bit, as I understand it, the idea with the in-memory- inline dedup is pretty much opportunity-based dedup. Where there's an easy opportunity seen, grab it, but don't go out of your way to do anything fancy. Then somewhat later, a much more thorough offline dedup process will come along and dedup-pack everything else. In that scenario a quick-opportunity 20% hit rate may be acceptable, while actual hit rates may approach 50% due to skew toward the common. Then the dedup-pack comes along and finishes things, possibly resulting in total savings of say 70% or so. Even if the in-memory doesn't get that common-skew boost and ends up nearer 20%, that's still a significant savings for the initial inline result, with the dedup-packer coming along later to clean things up properly. -- Duncan - List replies preferred. No HTML msgs. "Every nonfree program has a lord, a master -- and if you use the program, he is your master." Richard Stallman -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html