Re: WARNING at fs/btrfs/inode.c:9261 btrfs_destroy_inode()

2016-05-13 Thread Chris Mason
On Tue, May 10, 2016 at 06:19:20PM -0500, Eric Biggers wrote:
> Hello,
> 
> The following warning has been triggering for me since about v4.6-rc3:
> 
>   WARN_ON(BTRFS_I(inode)->csum_bytes);
> 
> On one machine the warning has occurred 657 times since v4.6-rc5.  On
> another it has occurred 3 times since v4.6-rc3.  Both are now on v4.6-rc7,
> where I have still observed the warning.  The warnings occur in groups,
> and do_unlinkat() and evict() are always in the call stack.
> 
> Is this a known issue?  Here is the first occurrence:

Finally tracked this down today, it's a bug in how we deal with page
faults in the middle of a write.  We're testing the fix here and I'll
have a patch on Monday.

Strictly speaking this is a regression in one of Chandan's setup patches
for PAGE_SIZE > sectorsize, but the code was much too subtle to blame
him.  It should be possible to trigger with a non-aligned multi-page
write; I'm trying to nail down a reliable test program so we don't make
the same mistake again.
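
A minimal sketch of the kind of reproducer described above, assuming the
trigger is a page fault taken while the write path copies in its source
buffer: writing several pages at a non-aligned offset from a freshly
mmap'ed (and therefore not yet faulted-in) buffer forces the in-kernel
copy to fault partway through.  The file name and sizes are illustrative;
this is not the actual test program.

	#include <fcntl.h>
	#include <sys/mman.h>
	#include <unistd.h>

	int main(void)
	{
		size_t len = 3 * 4096 + 100;	/* spans several pages */
		off_t off = 4096 + 100;		/* deliberately non-aligned */

		/*
		 * A fresh anonymous mapping: none of its pages have been
		 * faulted in yet, so the atomic copy inside the write path
		 * takes a page fault partway through and must retry.
		 */
		char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
				 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
		int fd = open("testfile", O_CREAT | O_RDWR, 0644);

		if (buf == MAP_FAILED || fd < 0)
			return 1;

		if (pwrite(fd, buf, len, off) != (ssize_t)len)
			return 1;

		close(fd);
		return 0;
	}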

-chris


Re: [PATCH] fstests: generic: Test SHARED flag about fiemap ioctl before and after sync

2016-05-13 Thread Qu Wenruo



On 05/13/2016 01:17 PM, Darrick J. Wong wrote:

On Fri, May 13, 2016 at 09:52:52AM +0800, Qu Wenruo wrote:

The test case will check SHARED flag returned by fiemap ioctl on
reflinked files before and after sync.

Normally SHARED flag won't change just due to a normal sync operation.

But btrfs doesn't handle the SHARED flag well: it doesn't check any
delayed extent tree (reverse extent searching tree) modifications, only
metadata already committed to disk.

So btrfs will not return the correct SHARED flag on reflinked files if
there is no sync to commit all metadata.

This test case checks exactly that.

Signed-off-by: Qu Wenruo 
--
And of course, xfs handles it quite well. Nice work Darrick.
Also the test case needs the new infrastructure introduced in the
previous generic/352 test case.
---
 tests/generic/353 | 86 +++
 tests/generic/353.out |  9 ++
 tests/generic/group   |  1 +
 3 files changed, 96 insertions(+)
 create mode 100755 tests/generic/353
 create mode 100644 tests/generic/353.out

diff --git a/tests/generic/353 b/tests/generic/353
new file mode 100755
index 000..1e9117e
--- /dev/null
+++ b/tests/generic/353
@@ -0,0 +1,86 @@
+#! /bin/bash
+# FS QA Test 353
+#
+# Check if the fiemap ioctl returns the correct SHARED flag on a reflinked
+# file before and after syncing the fs
+#
+# Btrfs has a bug in checking shared extents: it can only handle metadata
+# already committed to disk, not delayed extent tree modifications.
+# This causes the SHARED flag to appear only after sync.


I noticed this a while ago, but figured it was just btrfs being btrfs.  Ho hum.

Thanks for writing a test and getting the problem fixed.


+#
+#---
+# Copyright (c) 2016 Fujitsu. All Rights Reserved.
+#
+# This program is free software; you can redistribute it and/or
+# modify it under the terms of the GNU General Public License as
+# published by the Free Software Foundation.
+#
+# This program is distributed in the hope that it would be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program; if not, write the Free Software Foundation,
+# Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
+#---
+#
+
+seq=`basename $0`
+seqres=$RESULT_DIR/$seq
+echo "QA output created by $seq"
+
+here=`pwd`
+tmp=/tmp/$$
+status=1   # failure is the default!
+trap "_cleanup; exit \$status" 0 1 2 3 15
+
+_cleanup()
+{
+   cd /
+   rm -f $tmp.*


tmp isn't used for anything in this testcase.


That's from the template, and does no harm.




+}
+
+# get standard environment, filters and checks
+. ./common/rc
+. ./common/filter
+. ./common/reflink
+. ./common/punch
+
+# remove previous $seqres.full before test
+rm -f $seqres.full
+
+# real QA test starts here
+
+# Modify as appropriate.
+_supported_fs generic
+_supported_os Linux
+_require_scratch_reflink
+_require_fiemap
+
+_scratch_mkfs > /dev/null 2>&1
+_scratch_mount
+
+blocksize=64k
+file1="$SCRATCH_MNT/file1"
+file2="$SCRATCH_MNT/file2"
+
+# write the initial file
+_pwrite_byte 0xcdcdcdcd 0 $blocksize $file1 | _filter_xfs_io
+
+# reflink initial file
+_reflink_range $file1 0 $file2 0 $blocksize | _filter_xfs_io
+
+# check their fiemap to make sure it's correct
+$XFS_IO_PROG -c "fiemap -v" $file1 | _filter_fiemap_flags
+$XFS_IO_PROG -c "fiemap -v" $file2 | _filter_fiemap_flags
+
+# sync and recheck, to make sure the fiemap doesn't change just
+# due to sync
+sync
+$XFS_IO_PROG -c "fiemap -v" $file1 | _filter_fiemap_flags
+$XFS_IO_PROG -c "fiemap -v" $file2 | _filter_fiemap_flags


Nowadays, when I write a test that prints similar output one after the other I
will also write a comment to the output to distinguish the two cases, e.g.:

+XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
fiemap before sync
+0: [0..127]: 0x2001
+0: [0..127]: 0x2001
fiemap after sync
+0: [0..127]: 0x2001
+0: [0..127]: 0x2001

This way when a bunch of tests regress some months later it's easier
for me to relearn what's going on.


Yes, that's pretty nice. I'll also add the filename to the output, in
case the reflinked or original file layout changes.

Thanks for the review.
Qu


(I wasn't always good at doing that.)

Otherwise, everything looks ok.

--D


+
+# success, all done
+status=0
+exit
diff --git a/tests/generic/353.out b/tests/generic/353.out
new file mode 100644
index 000..0cd8981
--- /dev/null
+++ b/tests/generic/353.out
@@ -0,0 +1,9 @@
+QA output created by 353
+wrote 65536/65536 bytes at offset 0
+XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+linked 65536/65536 bytes at offset 0
+XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+0: [0.
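
For reference, the 0x2001 flag value in the expected output above decodes
as FIEMAP_EXTENT_LAST | FIEMAP_EXTENT_SHARED from linux/fiemap.h.  A
minimal C sketch of querying the SHARED bit directly (the extent count
and error handling are illustrative):

	#include <fcntl.h>
	#include <linux/fiemap.h>
	#include <linux/fs.h>
	#include <stdio.h>
	#include <stdlib.h>
	#include <sys/ioctl.h>

	int main(int argc, char **argv)
	{
		struct fiemap *fm;
		unsigned int i;
		int fd;

		if (argc < 2)
			return 1;

		/* Room for 8 extents; plenty for the files in this test. */
		fm = calloc(1, sizeof(*fm) + 8 * sizeof(struct fiemap_extent));
		if (!fm)
			return 1;
		fm->fm_length = ~0ULL;		/* map the whole file */
		fm->fm_extent_count = 8;

		fd = open(argv[1], O_RDONLY);
		if (fd < 0 || ioctl(fd, FS_IOC_FIEMAP, fm) < 0)
			return 1;

		for (i = 0; i < fm->fm_mapped_extents; i++)
			printf("extent %u: flags 0x%x shared=%d\n", i,
			       fm->fm_extents[i].fe_flags,
			       !!(fm->fm_extents[i].fe_flags &
				  FIEMAP_EXTENT_SHARED));
		return 0;
	}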

[PATCH 1/7] Btrfs: replace BUG() with WARN_ONCE in raid56

2016-05-13 Thread Liu Bo
This BUG() has been triggered by a fuzz-testing image, but in fact
btrfs can handle this case gracefully by returning -EIO.

Thus, use WARN_ONCE to report the problem instead of leaving a possible
kernel panic.

Signed-off-by: Liu Bo 
---
 fs/btrfs/raid56.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
index 0b7792e..863f7fe 100644
--- a/fs/btrfs/raid56.c
+++ b/fs/btrfs/raid56.c
@@ -2139,7 +2139,7 @@ int raid56_parity_recover(struct btrfs_root *root, struct bio *bio,
 
rbio->faila = find_logical_bio_stripe(rbio, bio);
if (rbio->faila == -1) {
-   BUG();
+   WARN_ONCE(1, KERN_WARNING "rbio->faila is -1\n");
if (generic_io)
btrfs_put_bbio(bbio);
kfree(rbio);
-- 
2.5.5



[PATCH 3/7] Btrfs: check if extent buffer is aligned to sectorsize

2016-05-13 Thread Liu Bo
Thanks to fuzz testing, we can pass an invalid bytenr to an extent buffer
via alloc_extent_buffer().  An unaligned eb can have more pages than it
should have, which ends up leaking the extent buffer or corrupting its
content.

This adds a warning to let us quickly know what is happening.

Signed-off-by: Liu Bo 
---
 fs/btrfs/extent_io.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index d247fc0..e601e0f 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -4868,6 +4868,10 @@ struct extent_buffer *alloc_extent_buffer(struct btrfs_fs_info *fs_info,
int uptodate = 1;
int ret;
 
+   WARN_ONCE(!IS_ALIGNED(start, fs_info->tree_root->sectorsize),
+ KERN_WARNING "eb->start(%llu) is not aligned to root->sectorsize(%u)\n",
+ start, fs_info->tree_root->sectorsize);
+
eb = find_extent_buffer(fs_info, start);
if (eb)
return eb;
-- 
2.5.5



[PATCH 2/7] Btrfs: replace BUG_ON with WARN_ONCE in cow_file_range

2016-05-13 Thread Liu Bo
This BUG_ON is more like a warning, since an invalid
btrfs_super_total_bytes() doesn't affect anything else.

Thus, use WARN_ONCE to report the problem instead of leaving a possible
kernel panic here.

Signed-off-by: Liu Bo 
---
 fs/btrfs/inode.c | 7 +--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 2aaba58..5874562 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -974,8 +974,11 @@ static noinline int cow_file_range(struct inode *inode,
}
}
 
-   BUG_ON(disk_num_bytes >
-  btrfs_super_total_bytes(root->fs_info->super_copy));
+   WARN_ONCE(disk_num_bytes >
+ btrfs_super_total_bytes(root->fs_info->super_copy),
+ KERN_WARNING "disk_num_bytes(%llu) > btrfs_super_total_bytes(%llu)\n",
+ disk_num_bytes,
+ btrfs_super_total_bytes(root->fs_info->super_copy));
 
alloc_hint = get_extent_allocation_hint(inode, start, num_bytes);
btrfs_drop_extent_cache(inode, start, start + num_bytes - 1, 0);
-- 
2.5.5



[PATCH 6/7] Btrfs: fix eb memory leak due to readpage failure

2016-05-13 Thread Liu Bo
eb->io_pages is set in read_extent_buffer_pages().

In case of readpage failure, for pages that have been added to bio,
it calls bio_endio and later readpage_io_failed_hook() does the work.

When one of this eb's pages (it cannot be the 1st page) fails to be added
to the bio due to a failure in merge_bio(), it cannot decrease eb->io_pages
via bio_endio, and we end up with a memory leak eventually.

This adds the 'atomic_dec(&eb->io_pages)' to the readpage error handling.

Signed-off-by: Liu Bo 
---
 fs/btrfs/extent_io.c | 24 
 1 file changed, 24 insertions(+)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 99286d1..2327200 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -3069,6 +3069,30 @@ static int __do_readpage(struct extent_io_tree *tree,
*bio_flags = this_bio_flag;
} else {
SetPageError(page);
+   /*
+* Only metadata io request has this issue, for data it
+* just unlocks extent and releases page's lock.
+*
+* eb->io_pages is set in read_extent_buffer_pages().
+*
+* When this eb's page fails to add itself to bio,
+* it cannot decrease eb->io_pages via bio_endio, and
+* ends up with extent_buffer_under_io() always being
+* true, because of that, eb won't be freed and we have
+* a memory leak eventually.
+*
+* Here we still hold this page's lock, and other tasks
+* who're also reading this eb are blocked.
+*/
+   if (rw & REQ_META) {
+   struct extent_buffer *eb;
+
+   WARN_ON_ONCE(!PagePrivate(page));
+   eb = (struct extent_buffer *)page->private;
+
+   WARN_ON_ONCE(atomic_read(&eb->io_pages) < 1);
+   atomic_dec(&eb->io_pages);
+   }
unlock_extent(tree, cur, cur + iosize - 1);
}
cur = cur + iosize;
-- 
2.5.5



[PATCH 7/7] Btrfs: fix memory leak due to invalid btree height

2016-05-13 Thread Liu Bo
Thanks to fuzz testing, we can end up with an invalid btree root node
height.  Btrfs limits btree height to 7, so if the given height is, say, 9,
btrfs will have problems both releasing the root node's lock and freeing
the node.

Signed-off-by: Liu Bo 
---
 fs/btrfs/ctree.c | 7 +++
 1 file changed, 7 insertions(+)

diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
index ec7928a..3fccbcc 100644
--- a/fs/btrfs/ctree.c
+++ b/fs/btrfs/ctree.c
@@ -2756,6 +2756,13 @@ again:
}
}
}
+   if (level > BTRFS_MAX_LEVEL - 1 || level < 0) {
+   WARN_ONCE(1, KERN_WARNING "Invalid btree height %d\n", level);
+   if (!p->skip_locking)
+   btrfs_tree_unlock_rw(b, root_lock);
+   free_extent_buffer(b);
+   return -EINVAL;
+   }
p->nodes[level] = b;
if (!p->skip_locking)
p->locks[level] = root_lock;
-- 
2.5.5



[PATCH 4/7] Btrfs: free sys_array eb as soon as possible

2016-05-13 Thread Liu Bo
While reading sys_chunk_array in the superblock, btrfs creates a temporary
extent buffer.  Since we don't use it after finishing reading
sys_chunk_array, we don't need to keep it in memory.

Signed-off-by: Liu Bo 
---
 fs/btrfs/volumes.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index bd0f45f..1331606 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -6586,13 +6586,13 @@ int btrfs_read_sys_array(struct btrfs_root *root)
sb_array_offset += len;
cur_offset += len;
}
-   free_extent_buffer(sb);
+   free_extent_buffer_stale(sb);
return ret;
 
 out_short_read:
printk(KERN_ERR "BTRFS: sys_array too short to read %u bytes at offset %u\n",
len, cur_offset);
-   free_extent_buffer(sb);
+   free_extent_buffer_stale(sb);
return -EIO;
 }
 
-- 
2.5.5



[PATCH 5/7] Btrfs: replace BUG_ON with WARN in merge_bio

2016-05-13 Thread Liu Bo
We have two BUG_ONs in merge_bio, but since it can gracefully return errors
to its callers, use WARN_ONCE to report the error instead of leaving a
possible panic.

Signed-off-by: Liu Bo 
---
 fs/btrfs/extent_io.c | 1 -
 fs/btrfs/inode.c | 6 --
 2 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index e601e0f..99286d1 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2746,7 +2746,6 @@ static int merge_bio(int rw, struct extent_io_tree *tree, struct page *page,
if (tree->ops && tree->ops->merge_bio_hook)
ret = tree->ops->merge_bio_hook(rw, page, offset, size, bio,
bio_flags);
-   BUG_ON(ret < 0);
return ret;
 
 }
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 5874562..3a989e3 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -1827,8 +1827,10 @@ int btrfs_merge_bio_hook(int rw, struct page *page, unsigned long offset,
map_length = length;
ret = btrfs_map_block(root->fs_info, rw, logical,
  &map_length, NULL, 0);
-   /* Will always return 0 with map_multi == NULL */
-   BUG_ON(ret < 0);
+   if (ret < 0) {
+   WARN_ONCE(ret < 0, KERN_ERR "ret = %d\n", ret);
+   return ret;
+   }
if (map_length < length + size)
return 1;
return 0;
-- 
2.5.5



Re: [PATCH 2/2] Btrfs: add valid checks for chunk loading

2016-05-13 Thread Liu Bo
On Wed, May 04, 2016 at 03:56:26PM +0200, David Sterba wrote:
> A few minor comments below
> 
> On Mon, May 02, 2016 at 11:15:51AM -0700, Liu Bo wrote:
> > --- a/fs/btrfs/volumes.c
> > +++ b/fs/btrfs/volumes.c
> > @@ -6206,27 +6206,23 @@ struct btrfs_device *btrfs_alloc_device(struct btrfs_fs_info *fs_info,
> > return dev;
> >  }
> >  
> > -static int read_one_chunk(struct btrfs_root *root, struct btrfs_key *key,
> > - struct extent_buffer *leaf,
> > - struct btrfs_chunk *chunk)
> > +/* Return -EIO if any error, otherwise return 0. */
> > +static int btrfs_check_chunk_valid(struct btrfs_root *root,
> > +  struct extent_buffer *leaf,
> > +  struct btrfs_chunk *chunk, u64 logical)
> >  {
> > -   struct btrfs_mapping_tree *map_tree = &root->fs_info->mapping_tree;
> > -   struct map_lookup *map;
> > -   struct extent_map *em;
> > -   u64 logical;
> > u64 length;
> > u64 stripe_len;
> > -   u64 devid;
> > -   u8 uuid[BTRFS_UUID_SIZE];
> > -   int num_stripes;
> > -   int ret;
> > -   int i;
> > +   u16 num_stripes;
> > +   u16 sub_stripes;
> > +   u64 type;
> >  
> > -   logical = key->offset;
> > length = btrfs_chunk_length(leaf, chunk);
> > stripe_len = btrfs_chunk_stripe_len(leaf, chunk);
> > num_stripes = btrfs_chunk_num_stripes(leaf, chunk);
> > -   /* Validation check */
> > +   sub_stripes = btrfs_chunk_sub_stripes(leaf, chunk);
> > +   type = btrfs_chunk_type(leaf, chunk);
> > +
> > if (!num_stripes) {
> > btrfs_err(root->fs_info, "invalid chunk num_stripes: %u",
> >   num_stripes);
> > @@ -6237,24 +6233,70 @@ static int read_one_chunk(struct btrfs_root *root, struct btrfs_key *key,
> >   "invalid chunk logical %llu", logical);
> > return -EIO;
> > }
> > +   if (btrfs_chunk_sector_size(leaf, chunk) != root->sectorsize) {
> > +   btrfs_err(root->fs_info, "invalid chunk sectorsize %llu",
> > + (unsigned long long)btrfs_chunk_sector_size(leaf,
> 
> type cast not necessry
> 
> > + chunk));
> > +   return -EIO;
> > +   }
> > if (!length || !IS_ALIGNED(length, root->sectorsize)) {
> > btrfs_err(root->fs_info,
> > "invalid chunk length %llu", length);
> > return -EIO;
> > }
> > -   if (!is_power_of_2(stripe_len)) {
> > +   if (stripe_len != BTRFS_STRIPE_LEN) {
> 
> Again too strict. As mentined elsewhere, add a helper to validate
> stripe_len and use it so we don't open-code it.

I'm not sure I understand the comment about open-coding.  Right now
the value must be BTRFS_STRIPE_LEN and we don't set any other value;
are we going to add a helper for just (stripe_len != BTRFS_STRIPE_LEN)?

I fixed other issues.

Thanks,

-liubo
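
For context, the helper being discussed could be as small as the sketch
below; the name is hypothetical, and it only wraps the comparison Liu Bo
mentions so call sites stop open-coding it:

	/*
	 * Hypothetical helper for the stripe_len validation discussed
	 * above; currently the only value btrfs ever writes is
	 * BTRFS_STRIPE_LEN (64KiB).
	 */
	static inline bool btrfs_is_valid_stripe_len(u64 stripe_len)
	{
		return stripe_len == BTRFS_STRIPE_LEN;
	}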

> 
> > btrfs_err(root->fs_info, "invalid chunk stripe length: %llu",
> >   stripe_len);
> > return -EIO;
> > }
> > if (~(BTRFS_BLOCK_GROUP_TYPE_MASK | BTRFS_BLOCK_GROUP_PROFILE_MASK) &
> > -   btrfs_chunk_type(leaf, chunk)) {
> > +   type) {
> > btrfs_err(root->fs_info, "unrecognized chunk type: %llu",
> >   ~(BTRFS_BLOCK_GROUP_TYPE_MASK |
> > BTRFS_BLOCK_GROUP_PROFILE_MASK) &
> >   btrfs_chunk_type(leaf, chunk));
> > return -EIO;
> > }
> > +   if ((type & BTRFS_BLOCK_GROUP_RAID10 && sub_stripes == 0) ||
> > +   (type & BTRFS_BLOCK_GROUP_RAID1 && num_stripes < 1) ||
> > +   (type & BTRFS_BLOCK_GROUP_RAID5 && num_stripes < 2) ||
> > +   (type & BTRFS_BLOCK_GROUP_RAID6 && num_stripes < 3) ||
> > +   (type & BTRFS_BLOCK_GROUP_DUP && num_stripes > 2) ||
> > +   ((type & BTRFS_BLOCK_GROUP_PROFILE_MASK) == 0 &&
> 
> I was looking if we could turn that into some generic checks using the
> btrfs_raid_array but seems that the tests do not make a uniform pattern,
> eg the DUP and SINGLE disguised as "mask == 0". As we don't add new
> profiles too often I'm ok with that version.
> 
> > +num_stripes != 1)) {
> > +   btrfs_err(root->fs_info, "Invalid num_stripes:sub_stripes %u:%u for profile %llu",
> 
> "invalid..." (no initial capital letter) and put the string on the next
> line so it does not exceed 80 cols
> 
> > + num_stripes, sub_stripes,
> > + type & BTRFS_BLOCK_GROUP_PROFILE_MASK);
> > +   return -EIO;
> > +   }
> > +
> > +   return 0;
> > +}
> > +
> > +static int read_one_chunk(struct btrfs_root *root, struct btrfs_key *key,
> > + struct extent_buffer *leaf,
> > + struct btrfs_chunk *chunk)
> > +{
> > +   struct btrfs_mapping_tree *map_tree = &root->fs_info->mapping_tree;
> > +   struct map_lookup *map;
> > +   struct extent_map *em;
> > +   u64 logical;
> > +   u64 length;

Re: [PATCH 1/2] Btrfs: add more valid checks for superblock

2016-05-13 Thread Qu Wenruo



On 05/14/2016 02:14 AM, Liu Bo wrote:

On Mon, May 09, 2016 at 09:31:37AM +0800, Qu Wenruo wrote:



David Sterba wrote on 2016/05/06 16:35 +0200:

On Thu, May 05, 2016 at 09:08:54AM +0800, Qu Wenruo wrote:

An early check can compare against some reasonable value, but the
total_bytes value must be equal to the sum of all device sizes
(disk_total_bytes). I'm not sure if we have enough information to verify
that at this point though.


That's what I had in mind; the problem is that only the first device's
information is recorded in the superblock.

At this moment we have device_num but we don't know the sizes of the other
devices.

Thanks,

-liubo



What about erroring out if we find sb->total_bytes <
sb->dev_item->total_bytes?

As we are just doing an early check, it doesn't need to be comprehensive,
just spot the obvious problems.


Ok.


I'm gonna check for total_bytes and num_devices after loading chunk
tree.




For the exact device_num and sb->total_bytes, we may do a post check when
the device tree is loaded?
Splitting early_check() and post_check() seems valid to me.
(Also I prefer post_check() to just warn, not force an exit.)


Why just warning? Superblock total_bytes and device sizes must be
correct, otherwise all sorts of operations can fail randomly.



Because if we exit, we can't even do fsck.

Maybe we need a new flag to control whether exit or warn at post_check().




Thanks,
Qu



IMHO for the kernel part, we have to exit in order to avoid any panic due
to those invalid values.


I'm OK with this.



For the fsck code, we can go back and forth to fix them.  In fact I don't
think fsck could work out anything: the superblock checksum has _matched_
but the values inside the superblock are invalid, so we cannot trust the
other parts of this FS image; how then can we expect fsck to fix it by
reading those other parts?


For rw fsck, that may cause huge problems, and I agree with you on erroring
out. But for ro fsck, it's a little overkill to me.

Currently, if we find an error in the extent tree, we still continue
checking the fs tree, to show what is really wrong.


And for cases like btrfs-image restored images, the dev extents don't
even match their chunks (maybe it's already fixed?), but that's not a
big problem for ro btrfsck, and we can go on without problems.


The same applies for ro btrfsck here: as long as it's ro btrfsck, we can
just continue, since we don't really need the total_bytes in the
superblock.


Thanks,
Qu



Thanks,

-liubo




Re: btrfs ate my data in just two days, after a fresh install. ram and disk are ok. it still mounts, but I cannot repair

2016-05-13 Thread Chris Murphy
On Fri, May 13, 2016 at 6:10 AM, Niccolò Belli
 wrote:
> On venerdì 13 maggio 2016 13:35:01 CEST, Austin S. Hemmelgarn wrote:
>>
>> The fact that you're getting an OOPS involving core kernel threads
>> (kswapd) is a pretty good indication that either there's a bug elsewhere in
>> the kernel, or that something is wrong with your hardware.  it's really
>> difficult to be certain if you don't have a reliable test case though.
>
>
> Talking about reliable test cases, I forgot to say that I definitely found
> an interesting one. It doesn't lead to OOPS but perhaps something even more
> interesting. While running countless stress tests I tried running some games
> to stress the system in different ways. I chose openmw (an open source
> engine for Morrowind) and I played it for a while on my second external
> monitor (while I watched at some monitoring tools on my first monitor). I
> noticed that after playing a while I *always* lose internet connection (I
> use an USB3 Gigabit Ethernet adapter). This isn't the only thing which
> happens: even if the game keeps running flawlessly and the system *seems* to
> work fine (I can drag windows, open the terminal...) lots of commands simply
> stall (for example mounting a partition, unmounting it, rebooting...). I can
> reliably reproduce it, it ALWAYS happens.

Well there are a bunch of kernel debug options. If your kernel has
CONFIG_SLUB=y and CONFIG_SLUB_DEBUG=y at compile time, you can boot with
the boot parameter slub_debug=1 to enable it, and maybe there'll be
something more revealing about the problems you're having. More aggressive
is CONFIG_DEBUG_PAGEALLOC=y, but it'll slow things down quite noticeably.

And then there are some Btrfs debug options set at compile time and
enabled with mount options. But I think the problem you're having
isn't specific to Btrfs, or someone else would have run into it.




-- 
Chris Murphy


Re: fsck: to repair or not to repair

2016-05-13 Thread Chris Murphy
On Fri, May 13, 2016 at 9:28 AM, Nikolaus Rath  wrote:
> On May 13 2016, Duncan <1i5t5.dun...@cox.net> wrote:
>> Because btrfs can be multi-device, it needs some way to track which
>> devices belong to each filesystem, and it uses filesystem UUID for this
>> purpose.
>>
>> If you clone a filesystem (for instance using dd or lvm snapshotting,
>> doesn't matter how) and then trigger a btrfs device scan, say by plugging
>> in some other device with btrfs on it so udev triggers a scan, and the
>> kernel sees multiple devices with the same filesystem UUID as a result,
>> and one of those happens to be mounted, you can corrupt both copies as
>> the kernel btrfs won't be able to tell them apart and may write updates
>> to the wrong one.
>
> That seems like a rather odd design. Why isn't btrfs refusing to mount
> in this situation? In the face of ambiguity, guessing is generally bad
> idea (at least for a computer program).

The logic you describe requires code. It's the absence of code rather
than an intentional design that's the cause of the current behavior.
And yes, it'd be nice if Btrfs weren't stepping on its own tail in
this situation. It could be as simple as refusing to mount anytime
there's an ambiguity, but that's sorta user hostile if there isn't a
message that goes along with it to help the user figure out a way to
resolve the problem. And that too could be fraught with peril if the
user makes a mistake. So really, figuring out the right way to do this
is part of the problem, but I agree it's better to be hostile and refuse
to mount a given volume UUID at all when too many devices are found
than to corrupt the file system.

-- 
Chris Murphy


Re: [PATCH 1/2] Btrfs: add more valid checks for superblock

2016-05-13 Thread Liu Bo
On Mon, May 09, 2016 at 09:31:37AM +0800, Qu Wenruo wrote:
> 
> 
> David Sterba wrote on 2016/05/06 16:35 +0200:
> > On Thu, May 05, 2016 at 09:08:54AM +0800, Qu Wenruo wrote:
> > > > > An early check can compare against some reasonable value, but the
> > > > > total_bytes value must be equal to the sum of all device sizes
> > > > > (disk_total_bytes). I'm not sure if we have enough information to 
> > > > > verify
> > > > > that at this point though.
> > > > 
> > > > That's what I had in mind; the problem is that only the first device's
> > > > information is recorded in the superblock.
> > > > 
> > > > At this moment we have device_num but we don't know the sizes of the
> > > > other devices.
> > > > 
> > > > Thanks,
> > > > 
> > > > -liubo
> > > > 
> > > > 
> > > What about erroring out if we find sb->total_bytes <
> > > sb->dev_item->total_bytes?
> > > 
> > > As we are just doing an early check, it doesn't need to be comprehensive,
> > > just spot the obvious problems.
> > 
> > Ok.

I'm gonna check for total_bytes and num_devices after loading chunk
tree.

> > 
> > > For the exact device_num and sb->total_bytes, we may do a post check
> > > when the device tree is loaded?
> > > Splitting early_check() and post_check() seems valid to me.
> > > (Also I prefer post_check() to just warn, not force an exit.)
> > 
> > Why just warning? Superblock total_bytes and device sizes must be
> > correct, otherwise all sorts of operations can fail randomly.
> > 
> > 
> Because if we exit, we can't even do fsck.
> 
> Maybe we need a new flag to control whether exit or warn at post_check().
> 

> Thanks,
> Qu
>

IMHO for the kernel part, we have to exit in order to avoid any panic due
to those invalid values.

For the fsck code, we can go back and forth to fix them.  In fact I don't
think fsck could work out anything: the superblock checksum has _matched_
but the values inside the superblock are invalid, so we cannot trust the
other parts of this FS image; how then can we expect fsck to fix it by
reading those other parts?

Thanks,

-liubo
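
A minimal sketch of the early check proposed in this thread: reject a
superblock whose total_bytes is smaller than the size recorded in its own
embedded dev_item.  The wrapper function and error handling are
illustrative, not the submitted patch:

	/*
	 * Sketch of the proposed early check: sb->total_bytes can never
	 * be smaller than the recorded size of the superblock's own
	 * device (the accessors are the existing btrfs helpers).
	 */
	static int check_super_total_bytes(struct btrfs_super_block *sb)
	{
		u64 total = btrfs_super_total_bytes(sb);
		u64 dev_total = btrfs_stack_device_total_bytes(&sb->dev_item);

		if (total < dev_total) {
			printk(KERN_ERR
		"BTRFS: total_bytes %llu smaller than dev_item total_bytes %llu\n",
			       total, dev_total);
			return -EINVAL;
		}
		return 0;
	}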


Re: BTRFS Data at Rest File Corruption

2016-05-13 Thread Chris Murphy
On Thu, May 12, 2016 at 10:49 PM, Richard A. Lochner  wrote:

> My apologies, they were from different boots.  After the dd, I get
> these:
>
> [109479.550836] BTRFS warning (device sdb1): csum failed ino 1437377
> off 75754369024 csum 1689728329 expected csum 2165338402
> [109479.596626] BTRFS warning (device sdb1): csum failed ino 1437377
> off 75754369024 csum 1689728329 expected csum 2165338402
> [109479.601969] BTRFS warning (device sdb1): csum failed ino 1437377
> off 75754369024 csum 1689728329 expected csum 2165338402
> [109479.602189] BTRFS warning (device sdb1): csum failed ino 1437377
> off 75754369024 csum 1689728329 expected csum 2165338402
> [109479.602323] BTRFS warning (device sdb1): csum failed ino 1437377
> off 75754369024 csum 1689728329 expected csum 2165338402

That's it? Only errors from sdb1? And this time no attempt to fix it?

Normally, when the data checksums stored in metadata fail to match the
newly computed data checksums as the blocks are read, there's an attempt
to read the mismatching blocks from another stripe.
I don't see that being attempted here.


>>
>> Also what do you get for these for each device:
>>
>> smartctl scterc -l /dev/sdX
>> cat /sys/block/sdX/device/timeout
>>
> # smartctl -l scterc  /dev/sdb
> smartctl 6.4 2015-06-04 r4109 [x86_64-linux-4.4.8-300.fc23.x86_64] (local build)
> Copyright (C) 2002-15, Bruce Allen, Christian Franke, www.smartmontools.org
>
> SCT Error Recovery Control:
>Read: 70 (7.0 seconds)
>   Write: 70 (7.0 seconds)
>
> # smartctl -l scterc  /dev/sdc
> smartctl 6.4 2015-06-04 r4109 [x86_64-linux-4.4.8-300.fc23.x86_64] (local build)
> Copyright (C) 2002-15, Bruce Allen, Christian Franke, www.smartmontools.org
>
> SCT Error Recovery Control:
>Read: 70 (7.0 seconds)
>   Write: 70 (7.0 seconds)
>
> # cat /sys/block/sdb/device/timeout
> 30
> # cat /sys/block/sdc/device/timeout
> 30
>>

That's appropriate. So at least any failures have a chance of being
fixed before the command timer does a reset on the bus.


-- 
Chris Murphy


Re: BTRFS Data at Rest File Corruption

2016-05-13 Thread Austin S. Hemmelgarn

On 2016-05-13 12:28, Goffredo Baroncelli wrote:

On 2016-05-11 21:26, Austin S. Hemmelgarn wrote:

(although it can't tell the difference between a corrupted checksum and a 
corrupted block of data).


I don't think so. The data checksums are stored in metadata blocks, and as
metadata blocks, these have their own checksums. So btrfs knows whether the
checksum itself is valid, regardless of whether the data is. Of course,
if the checksum is wrong, btrfs can't tell whether the data is correct.

The only exception should be inline data: in this case the data is stored
in the metadata block, and that block is protected by only one checksum.

I know that I am pedantic :_)  but after reading your comment I looked at
the btrfs data structures to refresh my memory, so I want to share this
information.
It is fully possible for the block of data to be good and the checksum
to be bad, it's just ridiculously unlikely.  I've actually had this
happen before at least twice (I have really bad luck when it comes to
disk corruption).




Re: About in-band dedupe for v4.7

2016-05-13 Thread Zygo Blaxell
On Fri, May 13, 2016 at 08:14:10AM -0400, Austin S. Hemmelgarn wrote:
> On 2016-05-12 16:54, Mark Fasheh wrote:
> >Now ask yourself the question - would you accept a write cache which is
> >expensive to fill and would only have a hit rate of less than 5%?
> In-band deduplication is a feature that is not used by typical desktop users
> or even many developers because it's computationally expensive, but it's
> used _all the time_ by big data-centers and similar places where processor
> time is cheap and storage efficiency is paramount. Deduplication is more
> useful in general the more data you have.  5% of 1 TB is 20 GB, which is not
> much.  5% of 1 PB is 20 TB, which is at least 3-5 disks, which can then be
> used for storing more data, or providing better resiliency against failures.

There's also a big cost difference between a 5-drive and 6-drive array
when all your servers are built with 5-drive cages.  Delay that expansion
for six months, and you can buy 5 larger drives for the same price.

I have laptops with filesystems that are _just_ over 512GB before dedup.
SSDs seem to come in 512GB and 1TB sizes with nothing in between, so
saving even a few dozen MB can translate into hundreds of dollars per
user (not to mention the bureaucratic hardware provisioning side-effects
which easily triple that cost).

Working on big software integration projects can generate 60% duplication
rates.  Think checkout of entire OS + disk images + embedded media,
build + install trees with copied data, long build times so there are
multiple working directories, multiple active branches of the same
project, and oh yeah it's all in SVN, which stores a complete second
copy of everything on disk just because wasting space is cool.

> To look at it another way, deduplicating an individual's home directory will
> almost never get you decent space savings, the majority of shared data is
> usually file headers and nothing more, which can't be deduplicated
> efficiently because of block size requirements. Deduplicating all the home
> directories on a terminal server with 500 users usually will get you decent
> space savings, as there very likely are a number of files that multiple
> people have exact copies of, but most of them are probably not big files.
> Deduplicating the entirety of a multi-petabyte file server used for storing
> VM disk images will probably save you a very significant amount of space,
> because the probability of having data that can be deduplicated goes up as
> you store more data, and there is likely to be a lot of data shared between
> the disk images.

Mail stores benefit from this too.  For some reason, Microsoft Office
users seem to enjoy emailing multi-megabyte runs of MIME-encoded zeros
to each other.  ;)





[GIT PULL] Btrfs fixes for 4.7

2016-05-13 Thread fdmanana
From: Filipe Manana 

Hi Chris,

Please consider the following changes for the merge window for 4.7.
There's an implementation for the rename exchange and rename whiteout
operations, from Dan Fuhry, which was originally in David's integration
branches and linux-next but then we agreed to move it into my branches
since I made a few fixes on top of it. Other than that, there's just the
usual set of bug fixes, in particular races between direct IO writes and
fsync and between direct IO writes and balance, which lead to data loss
and metadata corruption issues.
These were recently rebased just to add Reviewed-by tags from Josef and
Liu Bo, but otherwise unchanged and have been in testing for quite some
time here.

Thanks.

The following changes since commit 44549e8f5eea4e0a41b487b63e616cb089922b99:

  Linux 4.6-rc7 (2016-05-08 14:38:32 -0700)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/fdmanana/linux.git for-chris-4.7

for you to fetch changes up to 5f9a8a51d8b95505d8de8b7191ae2ed8c504d4af:

  Btrfs: add semaphore to synchronize direct IO writes with fsync (2016-05-13 01:59:36 +0100)


Dan Fuhry (1):
  btrfs: add support for RENAME_EXCHANGE and RENAME_WHITEOUT

Filipe Manana (13):
  Btrfs: fix for incorrect directory entries after fsync log replay
  Btrfs: fix empty symlink after creating symlink and fsync parent dir
  Btrfs: don't wait for unrelated IO to finish before relocation
  Btrfs: don't do unnecessary delalloc flushes when relocating
  Btrfs: unpin log if rename operation fails
  Btrfs: pin log earlier when renaming
  Btrfs: fix inode leak on failure to setup whiteout inode in rename
  Btrfs: unpin logs if rename exchange operation fails
  Btrfs: pin logs earlier when doing a rename exchange operation
  Btrfs: fix number of transaction units for renames with whiteout
  Btrfs: fix race between fsync and direct IO writes for prealloc extents
  Btrfs: fix race between block group relocation and nocow writes
  Btrfs: add semaphore to synchronize direct IO writes with fsync

 fs/btrfs/btrfs_inode.h  |  10 +++
 fs/btrfs/ctree.h|  27 
 fs/btrfs/dev-replace.c  |   4 +-
 fs/btrfs/extent-tree.c  | 122 +--
 fs/btrfs/inode.c| 464 +++---
 fs/btrfs/ioctl.c|   2 +-
 fs/btrfs/ordered-data.c |  26 +---
 fs/btrfs/ordered-data.h |   6 +-
 fs/btrfs/relocation.c   |  11 ++--
 fs/btrfs/super.c|   2 +-
 fs/btrfs/transaction.c  |   2 +-
 fs/btrfs/tree-log.c |  66 +++
 12 files changed, 608 insertions(+), 134 deletions(-)
-- 
2.7.0.rc3



Re: BTRFS Data at Rest File Corruption

2016-05-13 Thread Goffredo Baroncelli
On 2016-05-11 21:26, Austin S. Hemmelgarn wrote:
> (although it can't tell the difference between a corrupted checksum and a 
> corrupted block of data).

I don't think so. The data checksums are stored in metadata blocks, and as
metadata blocks, these have their own checksums. So btrfs knows whether the
checksum itself is valid, regardless of whether the data is. Of course,
if the checksum is wrong, btrfs can't tell whether the data is correct.

The only exception should be inline data: in this case the data is stored
in the metadata block, and that block is protected by only one checksum.

I know that I am pedantic :_)  but after reading your comment I looked at
the btrfs data structures to refresh my memory, so I want to share this
information.

BR
G.Baroncelli

-- 
gpg @keyserver.linux.it: Goffredo Baroncelli 
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5


Re: fsck: to repair or not to repair

2016-05-13 Thread Nikolaus Rath
On May 13 2016, Duncan <1i5t5.dun...@cox.net> wrote:
> Because btrfs can be multi-device, it needs some way to track which 
> devices belong to each filesystem, and it uses filesystem UUID for this 
> purpose.
>
> If you clone a filesystem (for instance using dd or lvm snapshotting, 
> doesn't matter how) and then trigger a btrfs device scan, say by plugging 
> in some other device with btrfs on it so udev triggers a scan, and the 
> kernel sees multiple devices with the same filesystem UUID as a result, 
> and one of those happens to be mounted, you can corrupt both copies as 
> the kernel btrfs won't be able to tell them apart and may write updates 
> to the wrong one.

That seems like a rather odd design. Why isn't btrfs refusing to mount
in this situation? In the face of ambiguity, guessing is generally bad
idea (at least for a computer program).


Best,
-Nikolaus
-- 
GPG encrypted emails preferred. Key id: 0xD113FCAC3C4E599F
Fingerprint: ED31 791B 2C5C 1613 AF38 8B8A D113 FCAC 3C4E 599F

 »Time flies like an arrow, fruit flies like a Banana.«


Re: About in-band dedupe for v4.7

2016-05-13 Thread Qu Wenruo



On 05/13/2016 08:14 PM, Austin S. Hemmelgarn wrote:

On 2016-05-12 16:54, Mark Fasheh wrote:

On Wed, May 11, 2016 at 07:36:59PM +0200, David Sterba wrote:

On Tue, May 10, 2016 at 07:52:11PM -0700, Mark Fasheh wrote:

Taking your history with qgroups out of this btw, my opinion does not
change.

With respect to in-memory only dedupe, it is my honest opinion that
such a
limited feature is not worth the extra maintenance work. In particular
there's about 800 lines of code in the userspace patches which I'm sure
you'd want merged, because how could we test this then?


I like the in-memory dedup backend. It's lightweight, only a heuristic,
does not need any IO or persistent storage. OTOH I consider it a subpart
of the in-band deduplication that does all the persistency etc. So I
treat the ioctl interface from a broader aspect.


Those are all nice qualities, but what do they all get us?

For example, my 'large' duperemove test involves about 750 gigabytes of
general purpose data - quite literally /home off my workstation.

After the run I'm usually seeing between 65-75 gigabytes saved for a
total
of only 10% duplicated data. I would expect this to be fairly 'average' -
/home on my machine has the usual stuff - documents, source code, media,
etc.

So if you were writing your whole fs out you could expect about the same
from inline dedupe - 10%-ish. Let's be generous and go with that number
though as a general 'this is how much dedupe we get'.

What the memory backend is doing then is providing a cache of
sha256/block
calculations. This cache is very expensive to fill, and every written
block
must go through it. On top of that, the cache does not persist between
mounts, and has items regularly removed from it when we run low on
memory.
All of this will drive down the amount of duplicated data we can find.

So our best case savings is probably way below 10% - let's be _really_
nice
and say 5%.

Now ask yourself the question - would you accept a write cache which is
expensive to fill and would only have a hit rate of less than 5%?

In-band deduplication is a feature that is not used by typical desktop
users or even many developers because it's computationally expensive,
but it's used _all the time_ by big data-centers and similar places
where processor time is cheap and storage efficiency is paramount.
Deduplication is more useful in general the more data you have.  5% of 1
TB is 50 GB, which is not much.  5% of 1 PB is 50 TB, which is at least
3-5 disks, which can then be used for storing more data, or providing
better resiliency against failures.

To look at it another way, deduplicating an individual's home directory
will almost never get you decent space savings, the majority of shared
data is usually file headers and nothing more, which can't be
deduplicated efficiently because of block size requirements.
Deduplicating all the home directories on a terminal server with 500
users usually will get you decent space savings, as there very likely
are a number of files that multiple people have exact copies of, but
most of them are probably not big files.  Deduplicating the entirety of
a multi-petabyte file server used for storing VM disk images will
probably save you a very significant amount of space, because the
probability of having data that can be deduplicated goes up as you store
more data, and there is likely to be a lot of data shared between the
disk images.

This is exactly why I don't use deduplication on any of my personal
systems.  On my laptop, the space saved is just not worth the time spent
doing it, as I fall pretty solidly into the first case (most of the data
duplication on my systems is in file headers).  On my home server, I'm
not storing enough data with sufficient internal duplication that it
would save more than 10-20 GB, which doesn't matter for me given that
I'm using roughly half of the 2.2 TB of effective storage space I have.
However, once we (eventually) get all the file servers where I work
moved over to Linux systems running BTRFS, we will absolutely be using
deduplication there, as we have enough duplication in our data that it
will probably cut our storage requirements by around 20% on average.
--


Nice to hear that.

Although the feature is not going to be merged soon, there is some
info which may help anyone interested in in-band dedupe to plan their
usage or tests.


1) CPU core number
   Since in-band dedupe currently spreads all SHA256 calculation across
   the async-thread pool, whose default size is
   min(nr_cpu + 2, 8),
   if you use the default thread_pool number and the system doesn't have
   more than 8 cores, all CPUs can easily be eaten up by dedupe.

   So either use dedupe with the default thread_pool setting when you have
   more than 8 cores, or tune thread_pool to nr_cpu - 1 or 2.

   Otherwise your system will be less responsive than when running delalloc
   (fsync/sync).
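
   (For reference, the default pool size mentioned above is set at mount
   time in open_ctree(); in v4.6, fs/btrfs/disk-io.c computes it as in
   this excerpt:)

	/* Default worker-thread pool size used by the async-thread
	 * machinery, from open_ctree() in fs/btrfs/disk-io.c (v4.6). */
	fs_info->thread_pool_size = min_t(unsigned long,
					  num_online_cpus() + 2, 8);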

2) Dedupe speed.
   In our internal 1-core test, it's about the same speed compared to

Re: About in-band dedupe for v4.7

2016-05-13 Thread Austin S. Hemmelgarn

On 2016-05-12 16:54, Mark Fasheh wrote:

On Wed, May 11, 2016 at 07:36:59PM +0200, David Sterba wrote:

On Tue, May 10, 2016 at 07:52:11PM -0700, Mark Fasheh wrote:

Taking your history with qgroups out of this btw, my opinion does not
change.

With respect to in-memory only dedupe, it is my honest opinion that such a
limited feature is not worth the extra maintenance work. In particular
there's about 800 lines of code in the userspace patches which I'm sure
you'd want merged, because how could we test this then?


I like the in-memory dedup backend. It's lightweight, only a heuristic,
does not need any IO or persistent storage. OTOH I consider it a subpart
of the in-band deduplication that does all the persistency etc. So I
treat the ioctl interface from a broader aspect.


Those are all nice qualities, but what do they all get us?

For example, my 'large' duperemove test involves about 750 gigabytes of
general purpose data - quite literally /home off my workstation.

After the run I'm usually seeing between 65-75 gigabytes saved for a total
of only 10% duplicated data. I would expect this to be fairly 'average' -
/home on my machine has the usual stuff - documents, source code, media,
etc.

So if you were writing your whole fs out you could expect about the same
from inline dedupe - 10%-ish. Let's be generous and go with that number
though as a general 'this is how much dedupe we get'.

What the memory backend is doing then is providing a cache of sha256/block
calculations. This cache is very expensive to fill, and every written block
must go through it. On top of that, the cache does not persist between
mounts, and has items regularly removed from it when we run low on memory.
All of this will drive down the amount of duplicated data we can find.

So our best case savings is probably way below 10% - let's be _really_ nice
and say 5%.

Now ask yourself the question - would you accept a write cache which is
expensive to fill and would only have a hit rate of less than 5%?
In-band deduplication is a feature that is not used by typical desktop 
users or even many developers because it's computationally expensive, 
but it's used _all the time_ by big data-centers and similar places 
where processor time is cheap and storage efficiency is paramount. 
Deduplication is more useful in general the more data you have.  5% of 1 
TB is 50 GB, which is not much.  5% of 1 PB is 50 TB, which is at least 
3-5 disks, which can then be used for storing more data, or providing 
better resiliency against failures.


To look at it another way, deduplicating an individual's home directory 
will almost never get you decent space savings, the majority of shared 
data is usually file headers and nothing more, which can't be 
deduplicated efficiently because of block size requirements. 
Deduplicating all the home directories on a terminal server with 500 
users usually will get you decent space savings, as there very likely 
are a number of files that multiple people have exact copies of, but 
most of them are probably not big files.  Deduplicating the entirety of 
a multi-petabyte file server used for storing VM disk images will 
probably save you a very significant amount of space, because the 
probability of having data that can be deduplicated goes up as you store 
more data, and there is likely to be a lot of data shared between the 
disk images.


This is exactly why I don't use deduplication on any of my personal 
systems.  On my laptop, the space saved is just not worth the time spent 
doing it, as I fall pretty solidly into the first case (most of the data 
duplication on my systems is in file headers).  On my home server, I'm 
not storing enough data with sufficient internal duplication that it 
would save more than 10-20 GB, which doesn't matter for me given that 
I'm using roughly half of the 2.2 TB of effective storage space I have. 
However, once we (eventually) get all the file servers where I work 
moved over to Linux systems running BTRFS, we will absolutely be using 
deduplication there, as we have enough duplication in our data that it 
will probably cut our storage requirements by around 20% on average.



Re: btrfs ate my data in just two days, after a fresh install. ram and disk are ok. it still mounts, but I cannot repair

2016-05-13 Thread Niccolò Belli

On venerdì 13 maggio 2016 13:35:01 CEST, Austin S. Hemmelgarn wrote:
The fact that you're getting an OOPS involving core kernel 
threads (kswapd) is a pretty good indication that either there's 
a bug elsewhere in the kernel, or that something is wrong with 
your hardware.  it's really difficult to be certain if you don't 
have a reliable test case though.


Talking about reliable test cases, I forgot to say that I definitely found 
an interesting one. It doesn't lead to OOPS but perhaps something even more 
interesting. While running countless stress tests I tried running some 
games to stress the system in different ways. I chose openmw (an open 
source engine for Morrowind) and I played it for a while on my second 
external monitor (while I watched at some monitoring tools on my first 
monitor). I noticed that after playing a while I *always* lose internet 
connection (I use an USB3 Gigabit Ethernet adapter). This isn't the only 
thing which happens: even if the game keeps running flawlessly and the 
system *seems* to work fine (I can drag windows, open the terminal...) lots 
of commands simply stall (for example mounting a partition, unmounting it, 
rebooting...). I can reliably reproduce it, it ALWAYS happens.


Niccolò


[PATCH 1/3] [RFC] fstests: btrfs: add functions to set and reset required number of SCRATCH_DEV_POOL

2016-05-13 Thread Anand Jain
This patch provides the functions
 _scratch_dev_pool_get()
 _scratch_dev_pool_put()

which help to set/reset SCRATCH_DEV_POOL with the required number
of devices. SCRATCH_DEV_POOL_SAVED will hold all the devices.

Usage:
  _scratch_dev_pool_get <ndevs>
  :: do stuff

  _scratch_dev_pool_put()

Signed-off-by: Anand Jain 
---
 common/rc | 47 +++
 1 file changed, 47 insertions(+)

diff --git a/common/rc b/common/rc
index 91e8f1c8e693..33632fd8e4a3 100644
--- a/common/rc
+++ b/common/rc
@@ -786,6 +786,53 @@ _scratch_mkfs()
 esac
 }
 
+#
+# $1 Number of the scratch devs required
+#
+_scratch_dev_pool_get()
+{
+   if [ $# != 1 ]; then
+   _fail "Usage: _scratch_dev_pool_get ndevs"
+   fi
+
+   local test_ndevs=$1
+   local config_ndevs=`echo $SCRATCH_DEV_POOL| wc -w`
+   local devs=( $SCRATCH_DEV_POOL )
+
+   typeset -p config_ndevs >/dev/null 2>&1
+   if [ $? != 0 ]; then
+   _fail "Bug: unset val, must call _scratch_dev_pool_get before 
_scratch_dev_pool_put"
+   fi
+
+   # _require_scratch_dev_pool $test_ndevs
+   # must have already checked the minimum required devices,
+   # but just in case, trap here to stop any potential bugs
+   # from propagating any further
+   if [ $config_ndevs -lt $test_ndevs ]; then
+   _notrun "Need at least test requested number of ndevs $test_ndevs"
+   fi
+
+   SCRATCH_DEV_POOL_SAVED=${SCRATCH_DEV_POOL}
+   export SCRATCH_DEV_POOL_SAVED
+   SCRATCH_DEV_POOL=${devs[@]:0:$test_ndevs}
+   export SCRATCH_DEV_POOL
+}
+
+_scratch_dev_pool_put()
+{
+   typeset -p SCRATCH_DEV_POOL_SAVED >/dev/null 2>&1
+   if [ $? != 0 ]; then
+   _fail "Bug: unset val, must call _scratch_dev_pool_get before 
_scratch_dev_pool_put"
+   fi
+
+   if [ -z "$SCRATCH_DEV_POOL_SAVED" ]; then
+   _fail "Bug: str empty, must call _scratch_dev_pool_get before 
_scratch_dev_pool_put"
+   fi
+
+   export SCRATCH_DEV_POOL=$SCRATCH_DEV_POOL_SAVED
+   export SCRATCH_DEV_POOL_SAVED=""
+}
+
 _scratch_pool_mkfs()
 {
 case $FSTYP in
-- 
2.7.0



[PATCH 3/3] fstests: btrfs: 027 make use of new device get and put helper functions

2016-05-13 Thread Anand Jain
The patches below added helper functions to get the requested
number of devices for the scratch and spare devices:

fstest: btrfs: add functions to get and put a device for replace target
fstests: btrfs: add functions to set and reset required number of
SCRATCH_DEV_POOL

This patch makes use of them.

Signed-off-by: Anand Jain 
---
 tests/btrfs/027 | 24 
 1 file changed, 16 insertions(+), 8 deletions(-)

diff --git a/tests/btrfs/027 b/tests/btrfs/027
index f0844a14f8e6..905dc0ebaa27 100755
--- a/tests/btrfs/027
+++ b/tests/btrfs/027
@@ -57,17 +57,22 @@ _require_command "$WIPEFS_PROG" wipefs
 run_test()
 {
local mkfs_opts=$1
-   local saved_scratch_dev_pool=$SCRATCH_DEV_POOL
-   local replace_dev=`echo $SCRATCH_DEV_POOL | awk '{print $NF}'`
+   local ndevs=`echo $SCRATCH_DEV_POOL | wc -w`
+
+   # reserve one for replace target
+   ((ndevs--))
+
+   _scratch_dev_pool_get $ndevs
+   _spare_dev_get
 
echo "Test $mkfs_opts" >>$seqres.full
 
-   SCRATCH_DEV_POOL=`echo $SCRATCH_DEV_POOL | sed -e "s# *$replace_dev *##"`
_scratch_pool_mkfs $mkfs_opts >>$seqres.full 2>&1
# make sure we created btrfs with desired options
if [ $? -ne 0 ]; then
echo "mkfs $mkfs_opts failed"
-   SCRATCH_DEV_POOL=$saved_scratch_dev_pool
+   _spare_dev_put
+   _scratch_dev_pool_put
return
fi
_scratch_mount >>$seqres.full 2>&1
@@ -89,17 +94,19 @@ run_test()
_scratch_mount -o degraded >>$seqres.full 2>&1
 
# replace $missing_dev with $replace_dev and scrub it to double-check
-   $BTRFS_UTIL_PROG replace start -B -r $missing_dev_id $replace_dev \
+   $BTRFS_UTIL_PROG replace start -B -r $missing_dev_id $SPARE_DEV \
$SCRATCH_MNT -f >>$seqres.full 2>&1
if [ $? -ne 0 ]; then
echo "btrfs replace failed"
-   SCRATCH_DEV_POOL=$saved_scratch_dev_pool
+   _spare_dev_put
+   _scratch_dev_pool_put
return
fi
$BTRFS_UTIL_PROG scrub start -B $SCRATCH_MNT >>$seqres.full 2>&1
if [ $? -ne 0 ]; then
echo "btrfs scrub failed"
-   SCRATCH_DEV_POOL=$saved_scratch_dev_pool
+   _spare_dev_put
+   _scratch_dev_pool_put
return
fi
 
@@ -107,7 +114,8 @@ run_test()
# we called _require_scratch_nocheck instead of _require_scratch
# do check after test for each profile config
_check_scratch_fs
-   SCRATCH_DEV_POOL=$saved_scratch_dev_pool
+   _spare_dev_put
+   _scratch_dev_pool_put
 }
 
 echo "Silence is golden"
-- 
2.7.0



[PATCH 2/3] [RFC] fstests: btrfs: add functions to get and put a device for replace target

2016-05-13 Thread Anand Jain
For the replace tests we need a device to use as a spare.
The functions _spare_dev_get() and _spare_dev_put() take it
from SCRATCH_DEV_POOL_SAVED, which is set when
_scratch_dev_pool_get() is called, based on how many devices
have already been assigned to SCRATCH_DEV_POOL.

 usage:
   _scratch_dev_pool_get 3
   _spare_dev_get

   SPARE_DEV will have a device set which can be
   used as the replace target device.

   _spare_dev_put
   _scratch_dev_pool_put

_spare_dev_get() picks the next device after the SCRATCH_DEV_POOL
devices from SCRATCH_DEV_POOL_SAVED and assigns it to SPARE_DEV.
_spare_dev_put() sets SPARE_DEV back to an empty string.

Signed-off-by: Anand Jain 
---
 common/rc | 45 +
 1 file changed, 45 insertions(+)

diff --git a/common/rc b/common/rc
index 33632fd8e4a3..f2b39ddbee0c 100644
--- a/common/rc
+++ b/common/rc
@@ -786,6 +786,51 @@ _scratch_mkfs()
 esac
 }
 
+_spare_dev_get()
+{
+   typeset -p SCRATCH_DEV_POOL_SAVED >/dev/null 2>&1
+   if [ $? != 0 ]; then
+   _fail "Bug: unset val, must call _scratch_dev_pool_get before 
_spare_dev_get"
+   fi
+
+   if [ -z "$SCRATCH_DEV_POOL_SAVED" ]; then
+   _fail "Bug: str empty, must call _scratch_dev_pool_get before 
_spare_dev_get"
+   fi
+
+   # Check if the spare is already assigned
+   typeset -p SPARE_DEV >/dev/null 2>&1
+   if [ $? == 0 ]; then
+   if [ ! -z "$SPARE_DEV" ]; then
+   _fail "Bug: SPARE_DEV = $SPARE_DEV already assigned"
+   fi
+   fi
+
+   local ndevs=`echo $SCRATCH_DEV_POOL | wc -w`
+   local config_ndevs=`echo $SCRATCH_DEV_POOL_SAVED | wc -w`
+
+   if [ $ndevs -eq $config_ndevs ]; then
+   _notrun "All devices in use, no spare available"
+   fi
+   # Get a dev that is not used
+   local devs=( $SCRATCH_DEV_POOL_SAVED )
+   SPARE_DEV=${devs[@]:$ndevs:1}
+   export SPARE_DEV
+}
+
+_spare_dev_put()
+{
+   typeset -p SPARE_DEV >/dev/null 2>&1
+   if [ $? != 0 ]; then
+   _fail "Bug: unset val, must call _spare_dev_get before its put"
+   fi
+
+   if [ -z "$SPARE_DEV" ]; then
+   _fail "Bug: str empty, must call _spare_dev_get before its put"
+   fi
+
+   export SPARE_DEV=""
+}
+
 #
 # $1 Number of the scratch devs required
 #
-- 
2.7.0
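
For reference, a minimal sketch of how a replace test might pair these
helpers (the helpers come from this series; the mkfs options and the
use of devid 1 below are illustrative only):

	_scratch_dev_pool_get 3
	_spare_dev_get

	_scratch_pool_mkfs "-m raid1 -d raid1" >>$seqres.full 2>&1
	_scratch_mount

	# ... write some test data, then replace devid 1 with the spare
	$BTRFS_UTIL_PROG replace start -B 1 $SPARE_DEV $SCRATCH_MNT

	_scratch_unmount
	# release in reverse order of acquisition
	_spare_dev_put
	_scratch_dev_pool_put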



Re: btrfs ate my data in just two days, after a fresh install. ram and disk are ok. it still mounts, but I cannot repair

2016-05-13 Thread Austin S. Hemmelgarn

On 2016-05-13 07:07, Niccolò Belli wrote:

On giovedì 12 maggio 2016 17:43:38 CEST, Austin S. Hemmelgarn wrote:

That's probably a good indication of the CPU and the MB being OK, but
not necessarily the RAM.  There are two other possible options for
testing the RAM that haven't been mentioned yet though (which I hadn't
thought of myself until now):
1. If you have access to Windows, try the Windows Memory Diagnostic.
This runs yet another slightly different set of tests from memtest86
and memtest86+, so it may catch issues they don't.  You can start this
directly on an EFI system by loading /EFI/Microsoft/Boot/MEMTEST.EFI
from the EFI system partition.
2. This is a Dell system.  If you still have the utility partition
which Dell ships all their pre-provisioned systems with, that should
have a hardware diagnostics tool.  I doubt that this will find
anything (it's part of their QA procedure AFAICT), but it's probably
worth trying, as the memory testing in that uses yet another slightly
different implementation of the typical tests.  You can usually find
this in the boot interrupt menu accessed by hitting F12 before the
boot-loader loads.


I tried the Dell System Test, including the enhanced optional ram tests
and it was fine. I also tried the Microsoft one, which passed. BUT if I
select the advanced test in the Microsoft one, it always stops at 21% of
the first test. The test menus are still working, but fans get quiet and it
keeps writing "test running... 21%" forever. I tried it many times and
it always got stuck at 21%, so I suspect a test suite bug instead of a
ram failure.
I've actually seen this before on other systems (different completion 
percentage on each system, but otherwise the same), all of them ended up 
actually having a bad CPU or MB, although the ones with CPU issues were 
fine after BIOS updates which included newer microcode.


I also noticed some other interesting behaviours: while I was running
the usual scrub+check (both were fine) from the livecd I noticed this in
dmesg:
[  261.301159] BTRFS info (device dm-0): bdev /dev/mapper/cryptroot
errs: wr 0, rd 0, flush 0, corrupt 4, gen 0
Corrupt? But both scrub and check were fine... I double checked scrub
and check and they were still fine.
It's worth noting that these are running counts of errors since the last 
time the stats were reset (and they only get reset manually).  If you 
haven't reset the stats, then this isn't all that surprising.
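
(To illustrate, assuming a reasonably current btrfs-progs:

	# show the per-device error counters since the last reset
	btrfs device stats /mnt
	# clear them so future output only reflects new errors
	btrfs device stats -z /mnt

A stale "corrupt" count from an old incident will stick around until
the -z reset.)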


This is what happened another time:
https://drive.google.com/open?id=0Bwe9Wtc-5xF1dGtPaWhTZ0w5aUU
I was making a backup of my partition USING DD from the livecd. It
wasn't even mounted if I recall correctly!
The fact that you're getting an OOPS involving core kernel threads 
(kswapd) is a pretty good indication that either there's a bug elsewhere 
in the kernel, or that something is wrong with your hardware.  it's 
really difficult to be certain if you don't have a reliable test case 
though.


On giovedì 12 maggio 2016 18:48:17 CEST, Zygo Blaxell wrote:

That's what a RAM corruption problem looks like when you run btrfs scrub.
Maybe the RAM itself is OK, but *something* is scribbling on it.

Does the Arch live usb use the same kernel as your normal system?


Yes, except for the point release (the system is slightly ahead of the
liveusb).

On giovedì 12 maggio 2016 18:48:17 CEST, Zygo Blaxell wrote:

Did you try an older (or newer) kernel?  I've been running 4.5.x on a few
canary systems, but so far none of them have survived more than a day.


No (except for point releases from 4.5.0 to 4.5.4), but I will try 4.4.
FWIW, I've been running 4.5 with almost no issues on my laptop since it 
came out (the few issues I have had are not unique to 4.5, and are all 
ultimately firmware issues (Lenovo has been getting _really_ bad 
recently about having broken ACPI and EFI implementations...)).  Of 
course, I'm also running Gentoo, so everything is built locally, but I 
doubt that that has much impact on stability.


On giovedì 12 maggio 2016 18:48:17 CEST, Zygo Blaxell wrote:

It's possible there's a problem that affects only very specific chipsets
You seem to have eliminated RAM in isolation, but there could be a
problem
in the kernel that affects only your chipset.


Funny considering it is sold as a Linux laptop. Unfortunately they only
tested it with the ancient Ubuntu 14.04.
Sadly, this is pretty typical for anything sold as a 'Linux' system that 
isn't a server.  Even for the servers sold as such, it's not unusual for 
it to be tested only with old versions of CentOS.


Now, I hadn't thought of this before, but it's a Dell system, so you're 
trapping out to SMBIOS for everything under the sun, and if they don't 
pass a correct memory map (or correct ACPI tables) to the OS during 
boot, then there may be some sections of RAM that both Linux and the 
firmware think they can use, which could definitely result in symptoms 
like bad RAM while still consistently passing memory tests (because they 
don't make

Re: btrfs ate my data in just two days, after a fresh install. ram and disk are ok. it still mounts, but I cannot repair

2016-05-13 Thread Niccolò Belli

On giovedì 12 maggio 2016 17:43:38 CEST, Austin S. Hemmelgarn wrote:
That's probably a good indication of the CPU and the MB being 
OK, but not necessarily the RAM.  There are two other possible 
options for testing the RAM that haven't been mentioned yet 
though (which I hadn't thought of myself until now):
1. If you have access to Windows, try the Windows Memory 
Diagnostic. This runs yet another slightly different set of 
tests from memtest86 and memtest86+, so it may catch issues they 
don't.  You can start this directly on an EFI system by loading 
/EFI/Microsoft/Boot/MEMTEST.EFI from the EFI system partition.
2. This is a Dell system.  If you still have the utility 
partition which Dell ships all their pre-provisioned systems 
with, that should have a hardware diagnostics tool.  I doubt 
that this will find anything (it's part of their QA procedure 
AFAICT), but it's probably worth trying, as the memory testing 
in that uses yet another slightly different implementation of 
the typical tests.  You can usually find this in the boot 
interrupt menu accessed by hitting F12 before the boot-loader 
loads.


I tried the Dell System Test, including the enhanced optional ram tests and 
it was fine. I also tried the Microsoft one, which passed. BUT if I select 
the advanced test in the Microsoft one, it always stops at 21% of the first 
test. The test menus are still working, but fans get quiet and it keeps 
writing "test running... 21%" forever. I tried it many times and it always 
got stuck at 21%, so I suspect a test suite bug instead of a ram failure.


I also noticed some other interesting behaviours: while I was running the 
usual scrub+check (both were fine) from the livecd I noticed this in dmesg:
[  261.301159] BTRFS info (device dm-0): bdev /dev/mapper/cryptroot errs: 
wr 0, rd 0, flush 0, corrupt 4, gen 0
Corrupt? But both scrub and check were fine... I double checked scrub and 
check and they were still fine.


This is what happened another time: 
https://drive.google.com/open?id=0Bwe9Wtc-5xF1dGtPaWhTZ0w5aUU
I was making a backup of my partition USING DD from the livecd. It wasn't 
even mounted if I recall correctly!


On giovedì 12 maggio 2016 18:48:17 CEST, Zygo Blaxell wrote:

That's what a RAM corruption problem looks like when you run btrfs scrub.
Maybe the RAM itself is OK, but *something* is scribbling on it.

Does the Arch live usb use the same kernel as your normal system?


Yes, except for the point release (the system is slightly ahead of the 
liveusb).


On giovedì 12 maggio 2016 18:48:17 CEST, Zygo Blaxell wrote:

Did you try an older (or newer) kernel?  I've been running 4.5.x on a few
canary systems, but so far none of them have survived more than a day.


No (except for point releases from 4.5.0 to 4.5.4), but I will try 4.4.

On giovedì 12 maggio 2016 18:48:17 CEST, Zygo Blaxell wrote:

It's possible there's a problem that affects only very specific chipsets
You seem to have eliminated RAM in isolation, but there could be a problem
in the kernel that affects only your chipset.


Funny considering it is sold as a Linux laptop. Unfortunately they only 
tested it with the ancient Ubuntu 14.04.


Niccolò


RE: [PATCH] btrfs: fix memory leak during RAID 5/6 device replacement

2016-05-13 Thread Zhao Lei
Hi, Scott Talbert

* From: David Sterba [mailto:dste...@suse.cz]
> Sent: Tuesday, May 10, 2016 1:32 AM
> To: Scott Talbert 
> Cc: linux-btrfs@vger.kernel.org; Zhao Lei 
> Subject: Re: [PATCH] btrfs: fix memory leak during RAID 5/6 device replacement
> 
> CC Zhao Lei 
> 
> On Mon, May 09, 2016 at 09:14:28AM -0400, Scott Talbert wrote:
> > A 'struct bio' is allocated in scrub_missing_raid56_pages(), but it was 
> > never
> > freed anywhere.
> >
> > Signed-off-by: Scott Talbert 
> > ---
> >  fs/btrfs/scrub.c | 2 ++
> >  1 file changed, 2 insertions(+)
> >
> > diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
> > index 82bedf9..607cc6e 100644
> > --- a/fs/btrfs/scrub.c
> > +++ b/fs/btrfs/scrub.c
> > @@ -2130,6 +2130,8 @@ static void scrub_missing_raid56_end_io(struct
> bio *bio)
> > if (bio->bi_error)
> > sblock->no_io_error_seen = 0;
> >
> > +   bio_put(bio);
> > +
Looks good from review.
I'll add some debug code and test it in more detail.

Thanks
Zhaolei

> > btrfs_queue_work(fs_info->scrub_workers, &sblock->work);
> >  }
> >
> > --
> > 1.9.1
> >






Re: [PATCH] btrfs-progs: added quiet-option for scripts

2016-05-13 Thread David Sterba
Hi,

On Thu, May 12, 2016 at 09:49:25PM +0200, bt...@oss.m-berberich.de wrote:
> From: M G Berberich 
> 
> -q,--quiet to prevent status-messages on stderr
> --verbose as alternative for -v
> moved 'Mode NO_FILE_DATA enabled' message to stderr
> changed default for g_verbose to 1

added the signed-off-by line and did the following tweaks and applied, thanks.

--- a/cmds-send.c
+++ b/cmds-send.c
@@ -44,8 +44,10 @@
 #include "send.h"
 #include "send-utils.h"

-/* default is 1 for historical reasons
-   changing may break scripts */
+/*
+ * Default is 1 for historical reasons, changing may break scripts that expect
+ * the 'At subvol' message.
+ */
 static int g_verbose = 1;

 struct btrfs_send {
@@ -445,7 +447,7 @@ int cmd_send(int argc, char **argv)
g_verbose++;
break;
case 'q':
-   g_verbose--;
+   g_verbose = 0;
break;
case 'e':
new_end_cmd_semantic = 1;
@@ -630,7 +632,8 @@ int cmd_send(int argc, char **argv)
}

if ((send_flags & BTRFS_SEND_FLAG_NO_FILE_DATA) && g_verbose > 1)
-   fprintf(stderr, "Mode NO_FILE_DATA enabled\n");
+   if (g_verbose > 1)
+   fprintf(stderr, "Mode NO_FILE_DATA enabled\n");

for (i = optind; i < argc; i++) {
int is_first_subvol;
@@ -721,9 +724,6 @@ const char * const cmd_send_usage[] = {
"which case 'btrfs send' will determine a suitable parent among the",
"clone sources itself.",
"\n",
-   "-v, --verboseEnable verbose debug output. Each occurrence of",
-   " this option increases the verbose level more.",
-   "-q, --quiet  suppress messages to stderr.",
"-e   If sending multiple subvols at once, use the new",
" format and omit the end-cmd between the subvols.",
"-p   Send an incremental stream from  to",
@@ -737,6 +737,8 @@ const char * const cmd_send_usage[] = {
" does not contain any file data and thus cannot be 
used",
" to transfer changes. This mode is faster and useful 
to",
" show the differences in metadata.",
-   "--help   display this help and exit",
+   "-v|--verbose enable verbose output to stderr, each occurrence of",
+   " this option increases verbosity",
+   "-q|--quiet   suppress all messages, except errors",
NULL
 };
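
With these tweaks applied, usage looks roughly like this (a sketch;
the snapshot paths are made up):

	# -q suppresses the 'At subvol' status message on stderr,
	# errors are still printed
	btrfs send -q /mnt/snap/home.20160513 | btrfs receive /backup

	# each -v raises the verbosity of the debug output on stderr
	btrfs send -vv -p /mnt/snap/home.20160512 \
		/mnt/snap/home.20160513 > home-incr.stream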


Re: [PATCH 0/1] btrfs-progs: Typo review of strings and comments

2016-05-13 Thread David Sterba
On Thu, May 12, 2016 at 04:56:46PM -0400, Nicholas D Steeves wrote:
> Thank you for reviewing this patch.  Sorry for missing the context of
> the strtoull comment; I should have been able to infer that and am
> embarrassed that I failed to.  Also, embarrassed because I think I've
> used it in some C++ code!
> 
> I learned how to use git rebase and git reset today, and can submit a
> v2 patch diffed against master at your earliest convenience.  My only
> remaining question is this:

No need to send v2, I fixed the issues (and usually do if they seem
easy).

> mkfs.c: printf:("Incompatible features:  %s", features_buf)
>   * Should this be left as "Imcompat features"?

Actually yes, it should be 'Incompat', patch updated.


Re: Input/output error on newly created file

2016-05-13 Thread Marc Joliet
On Friday 13 May 2016, Duncan <1i5t5.dun...@cox.net> wrote:

>Szalma László posted on Thu, 12 May 2016 20:28:24 +0200 as excerpted:
>> The files that rarely become unreadable (I/O error but no error in dmesg
>> or anywhere) are mysql MyIsam database files, and they are always small.
>> Like 16kbyte for example, or smaller. Sometimes dropping the fs cache
>> fixes the problem, sometimes not. Umount / mount always fixes the
>> problem. Scrub says the filesystem is OK (when the file is unreadable).
>
>Is it possible the files are always under 4 KiB?

For the record, I was seeing a similar error with dovecot *.index.log files 
(see the ML thread started by Szalma László).  In my case they are *not* all 
under 4 KiB.  Looking at some of the affected files, one of them is 25K, and 
another is 6.6K.  However, perhaps they compress to under 4K?  But compressing 
the 25K one with lzop only goes down to 5.6K with -9 :-/ .

>While there's a few variables as to max size, with 4 KiB being the
>practical max, btrfs will inline really small files into their metadata
>node instead of writing them out as 4 KiB data block extents.  Since
>that's obviously a different code-path, if it's only 4 KiB and smaller
>files, it's possible there's a race in that code-path that when lost,
>results in the file "disappearing" without a dmesg trace, only to
>reappear after reboot.
>
>Actually, now that I think about it, if the files are all OVER say 2 KiB
>but still quite small, say under 16 KiB, and if they had just recently
>grown, it's possible that the race is in the transition from the inline
>metadata state to the multiples of 4 KiB block data extent state.
>
>And if all the files had just shrunk, say from compaction (if done in-
>place, not with a copy and rename), perhaps it's the reverse, the
>transition from written data blocks to inline metadata state.

I'm glad somebody is (publicly) thinking about this :-) !

Greetings
-- 
Marc Joliet
--
"People who think they know everything really annoy those of us who know we
don't" - Bjarne Stroustrup




Re: Undelete deleted subvolume?

2016-05-13 Thread Duncan
Andrei Borzenkov posted on Thu, 12 May 2016 17:19:11 +0300 as excerpted:

> I accidentally deleted wrong snapshot using SUSE snapper. Is it possible
> to undelete subvolume? I know that it is possible to extract files from
> old tree (although SLES12 does not seem to offer btrfs-find-root), but
> is it possible to "reconnect" subvolume back?

I don't know of a way to do it directly, altho the devs may be able to 
suggest something exotic, but...

You should at least be able to /emulate/ it, by using btrfs restore (with 
the filesystem unmounted) to write the files elsewhere (the part you 
suggested), then mounting the filesystem, creating a new subvolume, and 
copying the restored files back into it.

That does lose the reflinks of a snapshot if that's what you deleted, so 
will take more space, but if you then run one of the btrfs dedupers on 
it, you should be able to re-reflink it, sharing extents via reflink and 
reducing the exclusive space used, once again.  (Tho I have no personal 
experience with the dedupers so can't give you specific help in that 
regard.)

But if you were using it as the reference parent for send/receive, or 
something, I think it's gone from that regard, as the emulation above 
would of course have a different subvolume ID.
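
A rough sketch of that emulation (the device and paths are examples;
btrfs restore runs against the unmounted filesystem):

	# copy the files out of the old tree to spare storage; if the
	# current root no longer references them, find an older root
	# with btrfs-find-root and pass its bytenr via 'btrfs restore -t'
	btrfs restore -v /dev/sdX /mnt/recovered

	# recreate a subvolume and copy the data back in
	mount /dev/sdX /mnt/btrfs
	btrfs subvolume create /mnt/btrfs/restored-snap
	cp -a /mnt/recovered/. /mnt/btrfs/restored-snap/

	# optionally re-share extents with an out-of-band deduper
	duperemove -dr /mnt/btrfs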

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: Input/output error on newly created file

2016-05-13 Thread Duncan
Szalma László posted on Thu, 12 May 2016 20:28:24 +0200 as excerpted:

> The files that rarely become unreadable (I/O error but no error in dmesg
> or anywhere) are mysql MyIsam database files, and they are always small.
> Like 16kbyte for example, or smaller. Sometimes dropping the fs cache
> fixes the problem, sometimes not. Umount / mount always fixes the
> problem. Scrub says the filesystem is OK (when the file is unreadable).

Is it possible the files are always under 4 KiB?

While there's a few variables as to max size, with 4 KiB being the 
practical max, btrfs will inline really small files into their metadata 
node instead of writing them out as 4 KiB data block extents.  Since 
that's obviously a different code-path, if it's only 4 KiB and smaller 
files, it's possible there's a race in that code-path that when lost, 
results in the file "disappearing" without a dmesg trace, only to 
reappear after reboot.

Actually, now that I think about it, if the files are all OVER say 2 KiB 
but still quite small, say under 16 KiB, and if they had just recently 
grown, it's possible that the race is in the transition from the inline 
metadata state to the multiples of 4 KiB block data extent state.

And if all the files had just shrunk, say from compaction (if done in-
place, not with a copy and rename), perhaps it's the reverse, the 
transition from written data blocks to inline metadata state.
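
A quick way to check which side of that boundary a file is on (a
sketch, assuming a btrfs mounted at /mnt; the exact inline cutoff
depends on the max_inline mount option and available leaf space):

	dd if=/dev/urandom of=/mnt/tiny bs=1K count=2
	dd if=/dev/urandom of=/mnt/small bs=1K count=16
	sync
	# an inlined file shows the "inline" flag on its single extent;
	# a regular file shows ordinary block-mapped extents
	filefrag -v /mnt/tiny /mnt/small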

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: About in-band dedupe for v4.7

2016-05-13 Thread Duncan
Mark Fasheh posted on Thu, 12 May 2016 13:54:26 -0700 as excerpted:

> For example, my 'large' duperemove test involves about 750 gigabytes of
> general purpose data - quite literally /home off my workstation.
> 
> After the run I'm usually seeing between 65-75 gigabytes saved for a
> total of only 10% duplicated data. I would expect this to be fairly
> 'average' - /home on my machine has the usual stuff - documents, source
> code, media, etc.
> 
> So if you were writing your whole fs out you could expect about the same
> from inline dedupe - 10%-ish. Let's be generous and go with that number
> though as a general 'this is how much dedupe we get'.
> 
> What the memory backend is doing then is providing a cache of
> sha256/block calculations. This cache is very expensive to fill, and
> every written block must go through it. On top of that, the cache does
> not persist between mounts, and has items regularly removed from it when
> we run low on memory. All of this will drive down the amount of
> duplicated data we can find.
> 
> So our best case savings is probably way below 10% - let's be _really_
> nice and say 5%.

My understanding is that this "general purpose data" use-case isn't being 
targeted by the in-memory dedup at all, because indeed it's a very poor 
fit for exactly the reason you explain.

Instead, think data centers where perhaps 50% of all files are duplicated 
thousands of times... and it's exactly those files that are most 
frequently used.  Totally different use-case, where that 5% on general 
purpose data could easily skyrocket to 50%+.

Refining that a bit, as I understand it, the idea with the in-memory-
inline dedup is pretty much opportunity-based dedup.  Where there's an 
easy opportunity seen, grab it, but don't go out of your way to do 
anything fancy.  Then somewhat later, a much more thorough offline dedup 
process will come along and dedup-pack everything else.

In that scenario a quick-opportunity 20% hit rate may be acceptable, 
while actual hit rates may approach 50% due to skew toward the common.  
Then the dedup-pack comes along and finishes things, possibly resulting 
in total savings of say 70% or so.  Even if the in-memory doesn't get 
that common-skew boost and ends up nearer 20%, that's still a significant 
savings for the initial inline result, with the dedup-packer coming along 
later to clean things up properly.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman
