Re: About difference in extent sharing in btrfs and xfs

2016-11-14 Thread Qu Wenruo



At 11/15/2016 03:47 PM, Darrick J. Wong wrote:

On Tue, Nov 15, 2016 at 02:33:31PM +0800, Qu Wenruo wrote:

Hi, xfs guys and btrfs guys.

Although the test case generic/372 has existed for some time, I noticed that
btrfs always fails it, due to the difference in how btrfs and xfs handle
shared extents.

The difference is that btrfs can handle shared extents which point to a
subset of a larger extent, so it doesn't need to split the reflink source.

In the case of this test:

On Disk Extent A:
Bytenr X
|<---Data=0x61, Length=320K---|


File1: File Extent 0 -> Extent A, offset=0, referred len=320K

File2: File Extent 0 -> Hole
       File Extent 64K -> Extent A, offset=192K, referred len=64K
       File Extent 128K -> Hole
       File Extent 192K -> Extent A, offset=64K, referred len=64K

Unlike XFS, btrfs doesn't split the source extent, as its file extent item
has extra fields which handle the offset/length inside the larger extent.


XFS doesn't split the extent either, internally.  The FIEMAP
implementation cross-references extent data with the refcount records,
using extra struct fiemap_extent to report precisely which blocks are
shared and which aren't.  ocfs2 exhibits the same behavior.

It's only btrfs that reports file1 as having one big shared extent when
only parts of that extent are actually shared.  I'd rather btrfs only
report blocks that are actually shared as shared, but since fiemap
results can be obsolete as soon as the ioctl returns I don't consider it
a huge priority.
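
For reference, a minimal userspace sketch of the FIEMAP check described
above (assumptions: a Linux system with <linux/fiemap.h>, at most 32
extent records, error handling trimmed; on xfs/ocfs2 the
FIEMAP_EXTENT_SHARED flag is set only on records whose blocks really
are shared):

#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <sys/ioctl.h>
#include <linux/fs.h>
#include <linux/fiemap.h>

int main(int argc, char **argv)
{
	struct fiemap *fm;
	unsigned int i;
	int fd;

	if (argc < 2)
		return 1;
	fd = open(argv[1], O_RDONLY);
	if (fd < 0)
		return 1;

	/* header plus room for 32 extent records */
	fm = calloc(1, sizeof(*fm) + 32 * sizeof(struct fiemap_extent));
	fm->fm_start = 0;
	fm->fm_length = FIEMAP_MAX_OFFSET;	/* map the whole file */
	fm->fm_extent_count = 32;
	if (ioctl(fd, FS_IOC_FIEMAP, fm) < 0)
		return 1;

	for (i = 0; i < fm->fm_mapped_extents; i++) {
		struct fiemap_extent *fe = &fm->fm_extents[i];

		printf("logical %llu len %llu%s\n",
		       (unsigned long long)fe->fe_logical,
		       (unsigned long long)fe->fe_length,
		       (fe->fe_flags & FIEMAP_EXTENT_SHARED) ?
				" [shared]" : "");
	}
	return 0;
}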


The btrfs way of handling shared extents has its pros and cons.
For example, it's very flexible, but it wastes more space for COW since the
whole extent can only be freed after all referencers are freed.


Wait, what?  So if I reflinked a single block of file1 into file3 and
then deleted file1 and file2, btrfs would hold on to all 320K?


Yep. That's the biggest problem with such extent bookkeeping.
Normally it's defrag's job to split and free the unused part...

But ironically, current btrfs defrag can't handle shared extents at all.
So defrag is only useful when there is a single referencer left on the extent.




But since this is a generic test case, I think it doesn't take such
btrfs behavior into consideration.
So it always fails on btrfs.

How about moving it to the xfs-specific tests?


I'd prefer the testcase be left in generic/ and _notrun'd on btrfs since
two of the three reflink fses have this behavior.

(Assuming the btrfs behavior doesn't get changed.)


I don't think btrfs will change the behavior soon, so I'd prefer to 
blacklist btrfs for this test case.


Thanks,
Qu


--D



Thanks,
Qu










Re: About difference in extent sharing in btrfs and xfs

2016-11-14 Thread Darrick J. Wong
On Tue, Nov 15, 2016 at 02:33:31PM +0800, Qu Wenruo wrote:
> Hi, xfs guys and btrfs guys.
> 
> Although the test case generic/372 has existed for some time, I noticed that
> btrfs always fails it, due to the difference in how btrfs and xfs handle
> shared extents.
> 
> The difference is that btrfs can handle shared extents which point to a
> subset of a larger extent, so it doesn't need to split the reflink source.
> 
> In the case of this test:
> 
> On Disk Extent A:
> Bytenr X
> |<---Data=0x61, Length=320K---|
> 
> 
> File1: File Extent 0 -> Extent A, offset=0, referred len=320K
> 
> File2: File Extent 0 -> Hole
>        File Extent 64K -> Extent A, offset=192K, referred len=64K
>        File Extent 128K -> Hole
>        File Extent 192K -> Extent A, offset=64K, referred len=64K
> 
> Unlike XFS, btrfs doesn't split the source extent, as its file extent item
> has extra fields which handle the offset/length inside the larger extent.

XFS doesn't split the extent either, internally.  The FIEMAP
implementation cross-references extent data with the refcount records,
using extra struct fiemap_extent to report precisely which blocks are
shared and which aren't.  ocfs2 exhibits the same behavior.

It's only btrfs that reports file1 as having one big shared extent when
only parts of that extent are actually shared.  I'd rather btrfs only
report blocks that are actually shared as shared, but since fiemap
results can be obsolete as soon as the ioctl returns I don't consider it
a huge priority.

> The btrfs way of handling shared extents has its pros and cons.
> For example, it's very flexible, but it wastes more space for COW since the
> whole extent can only be freed after all referencers are freed.

Wait, what?  So if I reflinked a single block of file1 into file3 and
then deleted file1 and file2, btrfs would hold on to all 320K?

> But since this is a generic test case, I think it doesn't take such
> btrfs behavior into consideration.
> So it always fails on btrfs.
> 
> How about moving it to the xfs-specific tests?

I'd prefer the testcase be left in generic/ and _notrun'd on btrfs since
two of the three reflink fses have this behavior.

(Assuming the btrfs behavior doesn't get changed.)

--D

> 
> Thanks,
> Qu
> 
> 


Re: [PATCH] fstests: generic/098 update test for truncating a file into the middle of a hole

2016-11-14 Thread Eryu Guan
On Fri, Nov 11, 2016 at 02:30:04PM -0800, Liu Bo wrote:
> This updates generic/098 by adding a sync option, i.e. 'sync' after the
> second write; with btrfs's NO_HOLES, we could still get a wrong isize
> after remount.
> 
> This gets fixed by the patch
> 
> 'Btrfs: fix truncate down when no_holes feature is enabled'
> 
> Signed-off-by: Liu Bo 

Looks good to me, just some nitpicks inline :)

> ---
>  tests/generic/098     | 57 ++++++++++++++++++++++++++++++++++++---------------------
>  tests/generic/098.out | 10 ++++++++++
>  2 files changed, 46 insertions(+), 21 deletions(-)
> 
> diff --git a/tests/generic/098 b/tests/generic/098
> index 838bb5d..3b89939 100755
> --- a/tests/generic/098
> +++ b/tests/generic/098
> @@ -64,27 +64,42 @@ rm -f $seqres.full
>  _scratch_mkfs >>$seqres.full 2>&1
>  _scratch_mount
>  
> -# Create our test file with some data and durably persist it.
> -$XFS_IO_PROG -f -c "pwrite -S 0xaa 0 128K" $SCRATCH_MNT/foo | _filter_xfs_io
> -sync
> -
> -# Append some data to the file, increasing its size, and leave a hole between
> -# the old size and the start offset if the following write. So our file gets
> -# a hole in the range [128Kb, 256Kb[.
> -$XFS_IO_PROG -c "pwrite -S 0xbb 256K 32K" $SCRATCH_MNT/foo | _filter_xfs_io
> -
> -# Now truncate our file to a smaller size that is in the middle of the hole we
> -# previously created. On most truncate implementations the data we appended
> -# before gets discarded from memory (with truncate_setsize()) and never ends
> -# up being written to disk.
> -$XFS_IO_PROG -c "truncate 160K" $SCRATCH_MNT/foo
> -
> -_scratch_cycle_mount
> -
> -# We expect to see a file with a size of 160Kb, with the first 128Kb of data all
> -# having the value 0xaa and the remaining 32Kb of data all having the value 0x00
> -echo "File content after remount:"
> -od -t x1 $SCRATCH_MNT/foo
> +workout()
> +{
> + NEED_SYNC=$1

Use "local" to declare this var, and in lower case. Usually we use upper
case for global variables.

> +
> + # Create our test file with some data and durably persist it.
> + $XFS_IO_PROG -t -f -c "pwrite -S 0xaa 0 128K" $SCRATCH_MNT/foo | _filter_xfs_io
> + sync
> +
> + # Append some data to the file, increasing its size, and leave a hole between
> + # the old size and the start offset if the following write. So our file gets
> + # a hole in the range [128Kb, 256Kb[.
> + $XFS_IO_PROG -c "pwrite -S 0xbb 256K 32K" $SCRATCH_MNT/foo | _filter_xfs_io
> +
> + if [ $NEED_SYNC -eq 1 ]; then
> + sync
> + fi

Good to see some comments to explain why we need this to test
with/without sync case.

Thanks,
Eryu

> +
> + # Now truncate our file to a smaller size that is in the middle of the hole we
> + # previously created.
> + # If we don't flush dirty page cache above, on most truncate
> + # implementations the data we appended before gets discarded from
> + # memory (with truncate_setsize()) and never ends up being written to
> + # disk.
> + $XFS_IO_PROG -c "truncate 160K" $SCRATCH_MNT/foo
> +
> + _scratch_cycle_mount
> +
> + # We expect to see a file with a size of 160Kb, with the first 128Kb of data all
> + # having the value 0xaa and the remaining 32Kb of data all having the value 0x00
> + echo "File content after remount:"
> + od -t x1 $SCRATCH_MNT/foo
> +}
> +
> +workout 0
> +# flush after each write
> +workout 1
>  
>  status=0
>  exit
> diff --git a/tests/generic/098.out b/tests/generic/098.out
> index 37415ee..f87f046 100644
> --- a/tests/generic/098.out
> +++ b/tests/generic/098.out
> @@ -9,3 +9,13 @@ File content after remount:
>  040 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>  *
>  050
> +wrote 131072/131072 bytes at offset 0
> +XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> +wrote 32768/32768 bytes at offset 262144
> +XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> +File content after remount:
> +000 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
> +*
> +040 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> +*
> +050
> -- 
> 2.5.0
> 


About difference in extent sharing in btrfs and xfs

2016-11-14 Thread Qu Wenruo

Hi, xfs guys and btrfs guys.

Although the test case generic/372 has existed for some time, I noticed that
btrfs always fails it, due to the difference in how btrfs and xfs handle
shared extents.

The difference is that btrfs can handle shared extents which point to a
subset of a larger extent, so it doesn't need to split the reflink source.


In the case of this test:

On Disk Extent A:
Bytenr X
|<---Data=0x61, Length=320K---|


File1: File Extent 0 -> Extent A, offset=0, referred len=320K

File2: File Extent 0 -> Hole
       File Extent 64K -> Extent A, offset=192K, referred len=64K
       File Extent 128K -> Hole
       File Extent 192K -> Extent A, offset=64K, referred len=64K

Unlike XFS, btrfs doesn't split the source extent, as its file extent item
has extra fields which handle the offset/length inside the larger extent.
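
For reference, a minimal sketch of how the File2 layout above can be
built from userspace with the clone-range ioctl (assumptions: fd1/fd2
are already-open descriptors for File1/File2 on btrfs; error handling
trimmed):

#include <sys/ioctl.h>
#include <linux/btrfs.h>

/* Clone 64K from File1 offset 192K to File2 offset 64K, then 64K from
 * File1 offset 64K to File2 offset 192K. Both file extents keep
 * pointing into the single on-disk extent A; the holes stay holes. */
static int build_file2(int fd1, int fd2)
{
	struct btrfs_ioctl_clone_range_args args = {
		.src_fd = fd1,
		.src_offset = 192 * 1024,
		.src_length = 64 * 1024,
		.dest_offset = 64 * 1024,
	};

	if (ioctl(fd2, BTRFS_IOC_CLONE_RANGE, &args) < 0)
		return -1;

	args.src_offset = 64 * 1024;
	args.dest_offset = 192 * 1024;
	return ioctl(fd2, BTRFS_IOC_CLONE_RANGE, &args);
}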



The btrfs way of handling shared extents has its pros and cons.
For example, it's very flexible, but it wastes more space for COW since the
whole extent can only be freed after all referencers are freed.


But since this is a generic test case, I think it doesn't take such
btrfs behavior into consideration.

So it always fails on btrfs.

How about moving it to the xfs-specific tests?

Thanks,
Qu




[PATCH 1/2] btrfs: scrub: Introduce full stripe lock for RAID56

2016-11-14 Thread Qu Wenruo
Unlike mirror-based profiles, RAID5/6 recovery needs to read out the
whole full stripe.

And without proper protection, this can easily cause race conditions.

Introduce 2 new functions, lock_full_stripe() and unlock_full_stripe(),
for RAID5/6, backed by an rb_tree of per-full-stripe mutexes, so scrub
callers can use them to lock a full stripe and avoid races.

Signed-off-by: Qu Wenruo 
---
 fs/btrfs/ctree.h   |   4 ++
 fs/btrfs/extent-tree.c |   3 +
 fs/btrfs/scrub.c   | 177 +
 3 files changed, 184 insertions(+)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 9d8edcb..37d5f29 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -638,6 +638,10 @@ struct btrfs_block_group_cache {
 * Protected by free_space_lock.
 */
int needs_free_space;
+
+   /* Scrub full stripe lock tree for RAID5/6 scrub */
+   struct rb_root scrub_lock_root;
+   spinlock_t scrub_lock;
 };
 
 /* delayed seq elem */
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 4607af3..b098a1f 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -132,6 +132,7 @@ void btrfs_put_block_group(struct btrfs_block_group_cache *cache)
	if (atomic_dec_and_test(&cache->count)) {
WARN_ON(cache->pinned > 0);
WARN_ON(cache->reserved > 0);
+		WARN_ON(!RB_EMPTY_ROOT(&cache->scrub_lock_root));
kfree(cache->free_space_ctl);
kfree(cache);
}
@@ -10122,6 +10123,8 @@ btrfs_create_block_group_cache(struct btrfs_root *root, u64 start, u64 size)
 
	atomic_set(&cache->count, 1);
	spin_lock_init(&cache->lock);
+	spin_lock_init(&cache->scrub_lock);
+	cache->scrub_lock_root = RB_ROOT;
	init_rwsem(&cache->data_rwsem);
	INIT_LIST_HEAD(&cache->list);
	INIT_LIST_HEAD(&cache->cluster_list);
diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index fffb9ab..4fce415 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -240,6 +240,13 @@ struct scrub_warning {
struct btrfs_device *dev;
 };
 
+struct scrub_full_stripe_lock {
+   struct rb_node node;
+   u64 logical;
+   u64 refs;
+   struct mutex mutex;
+};
+
 static void scrub_pending_bio_inc(struct scrub_ctx *sctx);
 static void scrub_pending_bio_dec(struct scrub_ctx *sctx);
 static void scrub_pending_trans_workers_inc(struct scrub_ctx *sctx);
@@ -351,6 +358,176 @@ static void scrub_blocked_if_needed(struct btrfs_fs_info *fs_info)
 }
 
 /*
+ * Caller must hold cache->scrub_lock.
+ *
+ * Return the existing full stripe lock and increase its refs,
+ * or return NULL and insert @fstripe_lock into the bg cache.
+ */
+static struct scrub_full_stripe_lock *
+add_scrub_lock(struct btrfs_block_group_cache *cache,
+  struct scrub_full_stripe_lock *fstripe_lock)
+{
+   struct rb_node **p;
+   struct rb_node *parent = NULL;
+   struct scrub_full_stripe_lock *entry;
+
+	p = &cache->scrub_lock_root.rb_node;
+   while (*p) {
+   parent = *p;
+   entry = rb_entry(parent, struct scrub_full_stripe_lock, node);
+   if (fstripe_lock->logical < entry->logical) {
+   p = &(*p)->rb_left;
+   } else if (fstripe_lock->logical > entry->logical) {
+   p = &(*p)->rb_right;
+   } else {
+   entry->refs++;
+   return entry;
+   }
+   }
+   /* Insert new one */
+	rb_link_node(&fstripe_lock->node, parent, p);
+	rb_insert_color(&fstripe_lock->node, &cache->scrub_lock_root);
+
+   return NULL;
+}
+
+static struct scrub_full_stripe_lock *
+search_scrub_lock(struct btrfs_block_group_cache *cache, u64 bytenr)
+{
+   struct rb_node *node;
+   struct scrub_full_stripe_lock *entry;
+
+   node = cache->scrub_lock_root.rb_node;
+   while (node) {
+   entry = rb_entry(node, struct scrub_full_stripe_lock, node);
+   if (bytenr < entry->logical)
+   node = node->rb_left;
+   else if (bytenr > entry->logical)
+   node = node->rb_right;
+   else
+   return entry;
+   }
+   return NULL;
+}
+
+/*
+ * Helper to get full stripe logical from a normal bytenr.
+ * Thanks to the chaos of scrub structures, we need to get it all
+ * by ourselves, using btrfs_map_sblock().
+ */
+static int get_full_stripe_logical(struct btrfs_fs_info *fs_info, u64 bytenr,
+  u64 *bytenr_ret)
+{
+   struct btrfs_bio *bbio = NULL;
+   u64 len;
+   int ret;
+
+   /* Just use map_sblock() to get full stripe logical */
+	ret = btrfs_map_sblock(fs_info, REQ_GET_READ_MIRRORS, bytenr, &len,
+			       &bbio, 0, 1);
+   if (ret || !bbio || !bbio->raid_map)
+   goto error;
+   *bytenr_ret = bbio->raid_map[0];
+   btrfs_put_bbio(bbio);
+   return 0;
+error:

[PATCH 2/2] btrfs: scrub: Fix RAID56 recovery race condition

2016-11-14 Thread Qu Wenruo
When scrubbing a RAID5 which has recoverable data corruption (only one
data stripe is corrupted), scrub will sometimes report more csum errors
than expected. Sometimes even an unrecoverable error is reported.

The problem can be easily reproduced by the following steps:
1) Create a btrfs with RAID5 data profile with 3 devs
2) Mount it with nospace_cache or space_cache=v2
   To avoid extra data space usage.
3) Create a 128K file and sync the fs, unmount it
   Now the 128K file lies at the beginning of the data chunk
4) Locate the physical bytenr of data chunk on dev3
   Dev3 is the 1st data stripe.
5) Corrupt the first 64K of the data chunk stripe on dev3
6) Mount the fs and scrub it

The correct csum error count should be 16 (assuming x86_64).
A larger csum error count is reported with roughly a 1/3 chance,
and an unrecoverable error with roughly a 1/10 chance.

The root cause of the problem is a race condition in the RAID5/6
recovery code, due to the fact that a full scrub is initiated per device.

For other mirror-based profiles, each mirror is independent of the
others, so the race doesn't cause any big problem.

For example:

       Corrupted       |       Correct        |      Correct
   Scrub dev3 (D1)     |   Scrub dev2 (D2)    |   Scrub dev1 (P)
-----------------------+----------------------+----------------------
Read out D1            |Read out D2           |Read full stripe
Check csum             |Check csum            |Check parity
Csum mismatch          |Csum match, continue  |Parity mismatch
handle_errored_block   |                      |handle_errored_block
 Read out full stripe  |                      | Read out full stripe
 D1 csum error (err++) |                      | D1 csum error (err++)
 Recover D1            |                      | Recover D1

So D1's csum error is counted twice, just because
handle_errored_block() doesn't have enough protection, and the race can
happen.

In an even worse case, if D1's recovery code is re-writing D1/D2/P while
P's recovery code is just reading out the full stripe, we can end up with
an unrecoverable error.

This patch uses the previously introduced lock_full_stripe() and
unlock_full_stripe() to protect the whole scrub_handle_errored_block()
function for RAID56 recovery, so there are no extra csum errors and no
unrecoverable errors.

Reported-by: Goffredo Baroncelli 
Signed-off-by: Qu Wenruo 
---
 fs/btrfs/scrub.c | 15 +++
 1 file changed, 15 insertions(+)

diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index 4fce415..1215a61 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -1067,6 +1067,7 @@ static int scrub_handle_errored_block(struct scrub_block *sblock_to_check)
unsigned int have_csum;
struct scrub_block *sblocks_for_recheck; /* holds one for each mirror */
struct scrub_block *sblock_bad;
+   int lock_ret;
int ret;
int mirror_index;
int page_num;
@@ -1096,6 +1097,17 @@ static int scrub_handle_errored_block(struct scrub_block *sblock_to_check)
have_csum = sblock_to_check->pagev[0]->have_csum;
dev = sblock_to_check->pagev[0]->dev;
 
+	/*
+	 * For RAID5/6, a race can happen between the scrub threads of
+	 * different devices. On data corruption, the parity and the data
+	 * threads will both try to recover the data.
+	 * The race can lead to doubly counted csum errors, or even an
+	 * unrecoverable error.
+	 */
+   ret = lock_full_stripe(fs_info, logical, GFP_NOFS);
+   if (ret < 0)
+   return ret;
+
if (sctx->is_dev_replace && !is_metadata && !have_csum) {
sblocks_for_recheck = NULL;
goto nodatasum_case;
@@ -1432,6 +1444,9 @@ out:
kfree(sblocks_for_recheck);
}
 
+   lock_ret = unlock_full_stripe(fs_info, logical);
+   if (lock_ret < 0)
+   return lock_ret;
return 0;
 }
 
-- 
2.10.2





[PATCH 0/2] RAID5/6 scrub race fix

2016-11-14 Thread Qu Wenruo
Fix the so-called famous RAID5/6 scrub error.

Thanks to Goffredo Baroncelli for reporting the bug and bringing it to
our attention.
(Yes, without the Phoronix report on this,
https://www.phoronix.com/scan.php?page=news_item&px=Btrfs-RAID-56-Is-Bad,
I would never have been aware of it.)

Unlike what many of us (including myself) assumed, it's not a time bomb
buried deep in the RAID5/6 code, but a race condition in the scrub
recovery code.

The problem went unnoticed because normal mirror-based profiles aren't
affected by the race, since the mirrors are independent of each other.

Although this time the fix doesn't touch the scrub code much, it should
warn us that the current scrub code is really hard to maintain.

The abuse of workqueues to delay work, and the per-device full-fs scrub,
are race prone.

An xfstest will follow a little later, as we don't have good enough
tools to corrupt data stripes precisely.

Qu Wenruo (2):
  btrfs: scrub: Introduce full stripe lock for RAID56
  btrfs: scrub: Fix RAID56 recovery race condition

 fs/btrfs/ctree.h   |   4 ++
 fs/btrfs/extent-tree.c |   3 +
 fs/btrfs/scrub.c   | 192 +
 3 files changed, 199 insertions(+)

-- 
2.10.2





Re: [PATCH 1/2] btrfs: remove old tree_root dirent processing in btrfs_real_readdir()

2016-11-14 Thread Jeff Mahoney
On 11/11/16 11:05 AM, Omar Sandoval wrote:
> On Sat, Nov 05, 2016 at 01:26:34PM -0400, je...@suse.com wrote:
>> From: Jeff Mahoney 
>>
>> Commit 3de4586c527 (Btrfs: Allow subvolumes and snapshots anywhere
>> in the directory tree) introduced the current system of placing
>> snapshots in the directory tree.  It also introduced the behavior of
>> creating the snapshot and then creating the directory entries for it.
>>
>> We've kept this code around for compatibility reasons, but it turns
>> out that no file systems with the old tree_root based snapshots can
>> be mounted on newer (>= 2009) kernels anyway.  About a month after the
>> above commit, commit 2a7108ad89e (Btrfs: rev the disk format for the
>> inode compat and csum selection changes) landed, changing the superblock
>> magic number.
>>
>> As a result, we know that we'll never encounter tree_root-based dirents
>> or have to deal with skipping our own snapshot dirents.  Since that
>> also means that we're now only iterating over DIR_INDEX items, which only
>> contain one directory entry per leaf item, we don't need to loop over
>> the leaf item contents anymore either.
>>
>> Signed-off-by: Jeff Mahoney 
> 
> Hi, Jeff,

Hi Omar -

> One comment below.
> 
>> ---
>>  fs/btrfs/inode.c | 118 +++++++++++++++++++++++++++++++++++++---------------------------------------------------------------------------------------------
>>  1 file changed, 37 insertions(+), 81 deletions(-)
>>
>> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
>> index 2b790bd..c52239d 100644
>> --- a/fs/btrfs/inode.c
>> +++ b/fs/btrfs/inode.c
>> @@ -5805,20 +5805,13 @@ static int btrfs_real_readdir(struct file *file, struct dir_context *ctx)
>>  int slot;
>>  unsigned char d_type;
>>  int over = 0;
>> -u32 di_cur;
>> -u32 di_total;
>> -u32 di_len;
>> -int key_type = BTRFS_DIR_INDEX_KEY;
>>  char tmp_name[32];
>>  char *name_ptr;
>>  int name_len;
>>  int is_curr = 0;/* ctx->pos points to the current index? */
>>  bool emitted;
>>  bool put = false;
>> -
>> -/* FIXME, use a real flag for deciding about the key type */
>> -if (root->fs_info->tree_root == root)
>> -key_type = BTRFS_DIR_ITEM_KEY;
>> +struct btrfs_key location;
>>  
>>  if (!dir_emit_dots(file, ctx))
>>  return 0;
>> @@ -5829,14 +5822,11 @@ static int btrfs_real_readdir(struct file *file, struct dir_context *ctx)
>>  
>>  path->reada = READA_FORWARD;
>>  
>> -if (key_type == BTRFS_DIR_INDEX_KEY) {
>> -INIT_LIST_HEAD(&ins_list);
>> -INIT_LIST_HEAD(&del_list);
>> -put = btrfs_readdir_get_delayed_items(inode, &ins_list,
>> -  &del_list);
>> -}
>> +INIT_LIST_HEAD(&ins_list);
>> +INIT_LIST_HEAD(&del_list);
>> +put = btrfs_readdir_get_delayed_items(inode, &ins_list, &del_list);
>>  
>> -key.type = key_type;
>> +key.type = BTRFS_DIR_INDEX_KEY;
>>  key.offset = ctx->pos;
>>  key.objectid = btrfs_ino(inode);
>>  
>> @@ -5862,85 +5852,53 @@ static int btrfs_real_readdir(struct file *file, struct dir_context *ctx)
>>  
>>  if (found_key.objectid != key.objectid)
>>  break;
>> -if (found_key.type != key_type)
>> +if (found_key.type != BTRFS_DIR_INDEX_KEY)
>>  break;
>>  if (found_key.offset < ctx->pos)
>>  goto next;
>> -if (key_type == BTRFS_DIR_INDEX_KEY &&
>> -btrfs_should_delete_dir_index(&del_list,
>> -  found_key.offset))
>> +if (btrfs_should_delete_dir_index(&del_list, found_key.offset))
>>  goto next;
>>  
>>  ctx->pos = found_key.offset;
>>  is_curr = 1;
>>  
>>  di = btrfs_item_ptr(leaf, slot, struct btrfs_dir_item);
>> -di_cur = 0;
>> -di_total = btrfs_item_size(leaf, item);
>> -
>> -while (di_cur < di_total) {
>> -struct btrfs_key location;
>> -
>> -if (verify_dir_item(root, leaf, di))
>> -break;
>> +if (verify_dir_item(root, leaf, di))
>> +goto next;
>>  
>> -name_len = btrfs_dir_name_len(leaf, di);
>> -if (name_len <= sizeof(tmp_name)) {
>> -name_ptr = tmp_name;
>> -} else {
>> -name_ptr = kmalloc(name_len, GFP_KERNEL);
>> -if (!name_ptr) {
>> -ret = -ENOMEM;
>> -goto err;
>> -}
>> +name_len = btrfs_dir_name_len(leaf, di);
>> +if (name_len <= sizeof(tmp_name)) {
>> +name_ptr = tmp_name;
>> +} else {
>> +name_ptr = kmalloc(name_len, GFP_KERNEL);
>> +if 

Re: [PATCH 1/2] Btrfs: fix file extent corruption

2016-11-14 Thread Liu Bo
On Mon, Nov 14, 2016 at 02:06:21PM -0500, Josef Bacik wrote:
> In order to do hole punching we have a block reserve to hold the reservation 
> we
> need to drop the extents in our range.  Since we could end up dropping a lot 
> of
> extents we set rsv->failfast so we can just loop around again and drop the
> remaining of the range.  Unfortunately we unconditionally fill the hole 
> extents
> in and start from the last extent we encountered, which we may or may not have
> dropped.  So this can result in overlapping file extent entries, which can be
> tripped over in a variety of ways, either by hitting BUG_ON(!ret) in
> fill_holes() after the search, or in btrfs_set_item_key_safe() in
> btrfs_drop_extent() at a later time by an unrelated task.  Fix this by only
> setting drop_end to the last extent we did actually drop.  This way our holes
> are filled in properly for the range that we did drop, and the rest of the 
> range
> that remains to be dropped is actually dropped.  Thanks,

Can you please share the reproducer?

Thanks,

-liubo
> 
> Signed-off-by: Josef Bacik 
> ---
>  fs/btrfs/file.c | 13 +++--
>  1 file changed, 11 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
> index cbefdc8..1c15a98 100644
> --- a/fs/btrfs/file.c
> +++ b/fs/btrfs/file.c
> @@ -706,6 +706,7 @@ int __btrfs_drop_extents(struct btrfs_trans_handle *trans,
>   u64 num_bytes = 0;
>   u64 extent_offset = 0;
>   u64 extent_end = 0;
> + u64 last_end = 0;
>   int del_nr = 0;
>   int del_slot = 0;
>   int extent_type;
> @@ -797,8 +798,10 @@ next_slot:
>* extent item in the call to setup_items_for_insert() later
>* in this function.
>*/
> - if (extent_end == key.offset && extent_end >= search_start)
> + if (extent_end == key.offset && extent_end >= search_start) {
> + last_end = extent_end;
>   goto delete_extent_item;
> + }
>  
>   if (extent_end <= search_start) {
>   path->slots[0]++;
> @@ -861,6 +864,12 @@ next_slot:
>   key.offset = start;
>   }
>   /*
> +  * From here on out we will have actually dropped something, so
> +  * last_end can be updated.
> +  */
> + last_end = extent_end;
> +
> + /*
>*  |  range to drop - |
>*  |  extent  |
>*/
> @@ -1010,7 +1019,7 @@ delete_extent_item:
>   if (!replace_extent || !(*key_inserted))
>   btrfs_release_path(path);
>   if (drop_end)
> - *drop_end = found ? min(end, extent_end) : end;
> + *drop_end = found ? min(end, last_end) : end;
>   return ret;
>  }
>  
> -- 
> 2.7.4
> 


Re: Announcing btrfs-dedupe

2016-11-14 Thread Zygo Blaxell
On Mon, Nov 14, 2016 at 09:07:51PM +0100, James Pharaoh wrote:
> On 14/11/16 20:51, Zygo Blaxell wrote:
> >On Mon, Nov 14, 2016 at 01:39:02PM -0500, Austin S. Hemmelgarn wrote:
> >>On 2016-11-14 13:22, James Pharaoh wrote:
> >>>One thing I am keen to understand is if BTRFS will automatically ignore
> >>>a request to deduplicate a file if it is already deduplicated? Given the
> >>>performance I see when doing a repeat deduplication, it seems to me that
> >>>it can't be doing so, although this could be caused by the CPU usage you
> >>>mention above.
> >>
> >>What's happening is that the dedupe ioctl does a byte-wise comparison of the
> >>ranges to make sure they're the same before linking them.  This is actually
> >>what takes most of the time when calling the ioctl, and is part of why it
> >>takes longer the larger the range to deduplicate is.  In essence, it's
> >>behaving like an OS should and not trusting userspace to make reasonable
> >>requests (which is also why there's a separate ioctl to clone a range from
> >>another file instead of deduplicating existing data).
> >
> > - the extent-same ioctl could check to see which extents
> > are referenced by the src and dst ranges, and return success
> > immediately without reading data if they are the same (but
> > userspace should already know this, or it's wasting a huge amount
> > of time before it even calls the kernel).
> 
> Yes, this is what I am talking about. I believe I should be able to read
> data about the BTRFS data structures and determine if this is the case. I
> don't care if there are false matches, due to concurrent updates, but
> there'll be a /lot/ of repeat deduplications unless I do this, because even
> if the file is identical, the mtime etc hasn't changed, and I have a record
> of previously doing a dedupe, there's no guarantee that the file hasn't been
> rewritten in place (eg by rsync), and no way that I know of to reliably
> detect if a file has been changed.
> 
> I am sure there are libraries out there which can look into the data
> structures of a BTRFS file system, I haven't researched this in detail
> though. I imagine that with some kind of lock on a BTRFS root, this could be
> achieved by simply reading the data from the disk, since I believe that
> everything is copy-on-write, so no existing data should be overwritten until
> all roots referring to it are updated. Perhaps I'm missing something
> though...

FIEMAP (VFS) and SEARCH_V2 (btrfs-specific) will both give you access
to the underlying physical block numbers.  SEARCH_V2 is non-trivial
to use without reverse-engineering significant parts of btrfs-progs.
SEARCH_V2 is a generic tree-searching tool which will give you all kinds
of information about btrfs structures...it's essential for a sophisticated
deduplicator and overkill for a simple one.

For full-file dedup using FIEMAP you only need to look at the "physical"
field of the first extent (if it's zero or the same as the other file, the
files cannot be deduplicated or are already deduplicated, respectively).
The source for 'filefrag' (from e2fsprogs) is good for learning how
FIEMAP works.
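
For reference, a sketch of that full-file check (assuming the same
FIEMAP interface filefrag uses; one extent record is enough since only
the first extent's physical address is needed):

#include <string.h>
#include <sys/ioctl.h>
#include <linux/fs.h>
#include <linux/fiemap.h>

/* Return the physical address of the file's first extent, or 0 if
 * there is no mapped extent. Per the rule above: 0, or a value equal
 * to the other file's, means there is nothing to deduplicate. */
static unsigned long long first_extent_physical(int fd)
{
	struct {
		struct fiemap fm;
		struct fiemap_extent fe;	/* room for one record */
	} buf;

	memset(&buf, 0, sizeof(buf));
	buf.fm.fm_length = FIEMAP_MAX_OFFSET;	/* whole file */
	buf.fm.fm_extent_count = 1;
	if (ioctl(fd, FS_IOC_FIEMAP, &buf.fm) < 0 ||
	    !buf.fm.fm_mapped_extents)
		return 0;
	return buf.fm.fm_extents[0].fe_physical;
}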

For block-level dedup you need to look at each extent individually.
That's much slower and full of additional caveats.  If you're going down
that road it's probably better to just improve duperemove instead.

> James




Re: Announcing btrfs-dedupe

2016-11-14 Thread Zygo Blaxell
On Mon, Nov 14, 2016 at 02:56:51PM -0500, Austin S. Hemmelgarn wrote:
> On 2016-11-14 14:51, Zygo Blaxell wrote:
> >Deduplicating an extent that might be concurrently modified during the
> >dedup is a reasonable userspace request.  In the general case there's
> >no way for userspace to ensure that it's not happening.
> I'm not even talking about the locking, I'm talking about the data
> comparison that the ioctl does to ensure they are the same before
> deduplicating them, and specifically that protecting against userspace just
> passing in two random extents that happen to be the same size but not
> contain the same data (because deduplication _should_ reject such a
> situation, that's what the clone ioctl is for).

If I'm deduping a VM image, and the virtual host is writing to said image
(which is likely since an incremental dedup will be intentionally doing
dedup over recently active data sets), the extent I just compared in
userspace might be different by the time the kernel sees it.

This is an important reason why the whole lock/read/compare/replace step
is an atomic operation from userspace's PoV.

The read also saves having to confirm a short/weak hash isn't a collision.
The RAM savings from using weak hashes (~48 bits) are a huge performance
win.

The locking overhead is very small compared to the reading overhead,
and (in the absence of bugs) it will only block concurrent writes to the
same offset range in the src/dst inodes (based on a read of the code...I
don't know if there's also an inode-level or backref-level barrier that
expands the locking scope).

I'm not sure the ioctl is well designed for simply throwing random
data at it, especially not entire files (it can't handle files over
16MB anyway).  It will read more data than it has to compared to a
block-by-block comparison from userspace with prefetches or a pair of
IO threads.  If userspace reads both copies of the data just before
issuing the extent-same call, the kernel will read the data from cache
reasonably quickly.
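
For reference, a minimal sketch of driving that ioctl (assumptions:
<linux/btrfs.h> on a 2016-era system, one destination, a 16K range,
error handling trimmed; status 0 means deduped, BTRFS_SAME_DATA_DIFFERS
means the in-kernel byte compare failed):

#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <sys/ioctl.h>
#include <linux/btrfs.h>

int main(int argc, char **argv)
{
	struct btrfs_ioctl_same_args *args;
	int src, dst;

	if (argc < 3)
		return 1;
	src = open(argv[1], O_RDONLY);
	dst = open(argv[2], O_RDWR);
	if (src < 0 || dst < 0)
		return 1;

	args = calloc(1, sizeof(*args) +
			 sizeof(struct btrfs_ioctl_same_extent_info));
	args->logical_offset = 0;
	args->length = 16 * 1024;	/* both files must cover this range */
	args->dest_count = 1;
	args->info[0].fd = dst;
	args->info[0].logical_offset = 0;

	if (ioctl(src, BTRFS_IOC_FILE_EXTENT_SAME, args) < 0)
		return 1;
	printf("status %d, bytes_deduped %llu\n", args->info[0].status,
	       (unsigned long long)args->info[0].bytes_deduped);
	return 0;
}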

> The locking is perfectly reasonable and shouldn't contribute that much to
> the overhead (unless you're being crazy and deduplicating thousands of tiny
> blocks of data).

Why is deduplicating thousands of blocks of data crazy?  I already
deduplicate four orders of magnitude more than that per week.

> >That said, some optimization is possible (although there are good reasons
> >not to bother with optimization in the kernel):
> >
> > - VFS could recognize when it has two separate references to
> > the same physical extent and not re-read the same data twice
> > (but that requires teaching VFS how to do CoW in general, and is
> > hard for political reasons on top of the obvious technical ones).
> >
> > - the extent-same ioctl could check to see which extents
> > are referenced by the src and dst ranges, and return success
> > immediately without reading data if they are the same (but
> > userspace should already know this, or it's wasting a huge amount
> > of time before it even calls the kernel).
> >
> >>TBH, even though it's kind of annoying from a performance perspective, it's
> >>a rather nice safety net to have.  For example, one of the cases where I do
> >>deduplication is a couple of directories where each directory is an
> >>overlapping partial subset of one large tree which I keep elsewhere.  In
> >>this case, I can tell just by filename exactly what files might be
> >>duplicates, so the ioctl's check lets me just call the ioctl on all
> >>potential duplicates (after checking size, no point in wasting time if the
> >>files obviously aren't duplicates), and have it figure out whether or not
> >>they can be deduplicated.
> >>>
> >>>In any case, I'm considering some digging into the filesystem structures
> >>>to see if I can work this out myself before i do any deduplication. I'm
> >>>fairly sure this should be relatively simple to work out, at least well
> >>>enough for my purposes.
> >>Sadly, there's no way to avoid doing so right now.
> >>
> 




Re: [Bug 186671] New: OOM on system with just rsync running 32GB of ram 30GB of pagecache

2016-11-14 Thread E V
Pretty sure it was from the system after the OOM; I just did a history
search to check. Though it is 3 days afterwards, and several OOMs killed
several processes in somewhat rapid succession; I just listed the 1st.
I'll turn on CONFIG_DEBUG_VM and reboot again.

On Mon, Nov 14, 2016 at 12:04 PM, Vlastimil Babka  wrote:
> On 11/14/2016 02:27 PM, E V wrote:
>> System is an intel dual socket Xeon E5620, 7500/5520/5500/X58 ICH10
>> family according to lspci. Anyways 4.8.4 OOM'd while I was gone. I'll
>> download the current 4.9rc and reboot, but in the mean time here's
>> xxd, vmstat & kern.log output:
>> 8532039 
>
> Hmm this would suggest that the memory is mostly free. But not according
> to vmstat. Is it possible you mistakenly provided the xxd from a fresh
> boot, but vmstat from after the OOM?
>
> But sure, a page_count() of zero is a reason why __isolate_lru_page()
> would fail due to its get_page_unless_zero(). The question is then how
> could it drop to zero without being freed at the same time, as
> put_page() does.
>
> I was going to suspect commit 83929372f6 and a page_ref_sub() it adds to
> delete_from_page_cache(), but that's since 4.8 and you mention problems
> since 4.7.
>
> Anyway it might be worth enabling CONFIG_DEBUG_VM as the relevant code
> usually has VM_BUG_ONs.
>
> Vlastimil
>
>>9324 0100
>>2226 0200
>> 405 0300
>>  80 0400
>>  34 0500
>>  48 0600
>>  17 0700
>>  17 0800
>>  32 0900
>>  19 0a00
>>   1 0c00
>>   1 0d00
>>   1 0e00
>>  12 1000
>>   8 1100
>>  32 1200
>>  10 1300
>>   2 1400
>>  11 1500
>>  12 1600
>>   7 1700
>>   3 1800
>>   5 1900
>>   6 1a00
>>  11 1b00
>>  22 1c00
>>   3 1d00
>>  19 1e00
>>  21 1f00
>>  18 2000
>>  28 2100
>>  40 2200
>>  38 2300
>>  85 2400
>>  59 2500
>>   40520 81ff
>>
>> /proc/vmstat:
>> nr_free_pages 60965
>> nr_zone_inactive_anon 4646
>> nr_zone_active_anon 3265
>> nr_zone_inactive_file 633882
>> nr_zone_active_file 7017458
>> nr_zone_unevictable 0
>> nr_zone_write_pending 0
>> nr_mlock 0
>> nr_slab_reclaimable 299205
>> nr_slab_unreclaimable 195497
>> nr_page_table_pages 935
>> nr_kernel_stack 4976
>> nr_bounce 0
>> numa_hit 3577063288
>> numa_miss 541393191
>> numa_foreign 541393191
>> numa_interleave 19415
>> numa_local 3577063288
>> numa_other 0
>> nr_free_cma 0
>> nr_inactive_anon 4646
>> nr_active_anon 3265
>> nr_inactive_file 633882
>> nr_active_file 7017458
>> nr_unevictable 0
>> nr_isolated_anon 0
>> nr_isolated_file 0
>> nr_pages_scanned 0
>> workingset_refault 42685891
>> workingset_activate 15247281
>> workingset_nodereclaim 26375216
>> nr_anon_pages 5067
>> nr_mapped 5630
>> nr_file_pages 7654746
>> nr_dirty 0
>> nr_writeback 0
>> nr_writeback_temp 0
>> nr_shmem 2504
>> nr_shmem_hugepages 0
>> nr_shmem_pmdmapped 0
>> nr_anon_transparent_hugepages 0
>> nr_unstable 0
>> nr_vmscan_write 5243750485
>> nr_vmscan_immediate_reclaim 4207633857
>> nr_dirtied 1839143430
>> nr_written 1832626107
>> nr_dirty_threshold 1147728
>> nr_dirty_background_threshold 151410
>> pgpgin 166731189
>> pgpgout 7328142335
>> pswpin 98608
>> pswpout 117794
>> pgalloc_dma 29504
>> pgalloc_dma32 1006726216
>> pgalloc_normal 5275218188
>> pgalloc_movable 0
>> allocstall_dma 0
>> allocstall_dma32 0
>> allocstall_normal 36461
>> allocstall_movable 5867
>> pgskip_dma 0
>> pgskip_dma32 0
>> pgskip_normal 6417890
>> pgskip_movable 0
>> pgfree 6309223401
>> pgactivate 35076483
>> pgdeactivate 63556974
>> pgfault 35753842
>> pgmajfault 69126
>> pglazyfreed 0
>> pgrefill 70008598
>> pgsteal_kswapd 3567289713
>> pgsteal_direct 5878057
>> pgscan_kswapd 9059309872
>> pgscan_direct 4239367903
>> pgscan_direct_throttle 0
>> zone_reclaim_failed 0
>> pginodesteal 102916
>> slabs_scanned 460790262
>> kswapd_inodesteal 9130243
>> kswapd_low_wmark_hit_quickly 10634373
>> kswapd_high_wmark_hit_quickly 7348173
>> pageoutrun 18349115
>> pgrotated 16291322
>> drop_pagecache 0
>> drop_slab 0
>> pgmigrate_success 18912908
>> pgmigrate_fail 63382146
>> compact_migrate_scanned 2986269789
>> compact_free_scanned 190451505123
>> compact_isolated 109549437
>> compact_stall 3544
>> compact_fail 8
>> compact_success 3536
>> compact_daemon_wake 1403515
>> htlb_buddy_alloc_success 0
>> htlb_buddy_alloc_fail 0
>> unevictable_pgs_culled 12473
>> unevictable_pgs_scanned 0
>> unevictable_pgs_rescued 11979
>> unevictable_pgs_mlocked 14556
>> unevictable_pgs_munlocked 14556
>> unevictable_pgs_cleared 0
>> 

Re: corrupt leaf, slot offset bad

2016-11-14 Thread Kai Krakow
Am Tue, 11 Oct 2016 07:09:49 -0700
schrieb Liu Bo :

> On Tue, Oct 11, 2016 at 02:48:09PM +0200, David Sterba wrote:
> > Hi,
> > 
> > looks like a lot of random bitflips.
> > 
> > On Mon, Oct 10, 2016 at 11:50:14PM +0200, a...@aron.ws wrote:  
> > > item 109 has a few strange chars in its name (and it's
> > > truncated): 1-x86_64.pkg.tar.xz 0x62 0x14 0x0a 0x0a
> > > 
> > >   item 105 key (261 DIR_ITEM 54556048) itemoff 11723
> > > itemsize 72 location key (606286 INODE_ITEM 0) type FILE
> > >   namelen 42 datalen 0 name:
> > > python2-gobject-3.20.1-1-x86_64.pkg.tar.xz item 106 key (261
> > > DIR_ITEM 56363628) itemoff 11660 itemsize 63 location key (894298
> > > INODE_ITEM 0) type FILE namelen 33 datalen 0 name:
> > > unrar-1:5.4.5-1-x86_64.pkg.tar.xz item 107 key (261 DIR_ITEM
> > > 66963651) itemoff 11600 itemsize 60 location key (1178 INODE_ITEM
> > > 0) type FILE namelen 30 datalen 0 name:
> > > glibc-2.23-5-x86_64.pkg.tar.xz item 108 key (261 DIR_ITEM
> > > 68561395) itemoff 11532 itemsize 68 location key (660578
> > > INODE_ITEM 0) type FILE namelen 38 datalen 0 name:
> > > squashfs-tools-4.3-4-x86_64.pkg.tar.xz item 109 key (261 DIR_ITEM
> > > 76859450) itemoff 11483 itemsize 65 location key (2397184
> > > UNKNOWN.0 7091317839824617472) type 45 namelen 13102 datalen
> > > 13358 name: 1-x86_64.pkg.tar.xzb  
> > 
> > namelen must be smaller than 255, but the number itself does not
> > look like a bitflip (0x332e), the name looks like a fragment of.
> > 
> > The location key is random garbage, likely an overwritten memory,
> > 7091317839824617472 == 0x62696c010023 contains ascii 'bil', the
> > key type is unknown but should be INODE_ITEM.
> >   
> > >   data
> > >   item 110 key (261 DIR_ITEM 9799832789237604651) itemoff
> > > 11405 itemsize 62
> > >   location key (388547 INODE_ITEM 0) type FILE
> > >   namelen 32 datalen 0 name:
> > > intltool-0.51.0-1-any.pkg.tar.xz item 111 key (261 DIR_ITEM
> > > 81211850) itemoff 11344 itemsize 131133  
> > 
> > itemsize 131133 == 0x2003d is a clear bitflip, 0x3d == 61,
> > corresponds to the expected item size.
> > 
> > There's possibly other random bitflips in the keys or other
> > structures. It's hard to estimate the damage and thus the scope of
> > restorable data.  
> 
> It makes sense since this's a ssd we may have only one copy for
> metadata.
> 
> Thanks,
> 
> -liubo

From this point of view it doesn't make sense to store only one copy of
meta data on SSD... The bit flip probably happened in RAM when taking
the other garbage into account, so dup meta data could have helped here.

If the SSD firmware would collapse duplicate meta data into single
blobs, that's perfectly fine. If the dup meta data arrives with bits
flipped, it won't be deduplicated. So this is fine, too.

BTW: I cannot believe that SSD firmwares really do the quite expensive
job of deduplication other than maybe internal compression. Maybe there
are some drives out there but most won't deduplicate. It's just too
little gain for too much complexity. So I personally would always
switch on duplicate meta data even for SSD. It shouldn't add to wear
leveling too much if you do the usual SSD optimization anyways (like
noatime).

PS: I suggest doing an extensive memtest86 before trying any repairs on
this system... Are you probably mixing different model DIMMs in dual
channel slots? Most of the times I've seen bitflips, this was the
culprit...

-- 
Regards,
Kai

Replies to list-only preferred.




Re: Announcing btrfs-dedupe

2016-11-14 Thread James Pharaoh

On 14/11/16 20:51, Zygo Blaxell wrote:

On Mon, Nov 14, 2016 at 01:39:02PM -0500, Austin S. Hemmelgarn wrote:

On 2016-11-14 13:22, James Pharaoh wrote:

One thing I am keen to understand is if BTRFS will automatically ignore
a request to deduplicate a file if it is already deduplicated? Given the
performance I see when doing a repeat deduplication, it seems to me that
it can't be doing so, although this could be caused by the CPU usage you
mention above.


What's happening is that the dedupe ioctl does a byte-wise comparison of the
ranges to make sure they're the same before linking them.  This is actually
what takes most of the time when calling the ioctl, and is part of why it
takes longer the larger the range to deduplicate is.  In essence, it's
behaving like an OS should and not trusting userspace to make reasonable
requests (which is also why there's a separate ioctl to clone a range from
another file instead of deduplicating existing data).


- the extent-same ioctl could check to see which extents
are referenced by the src and dst ranges, and return success
immediately without reading data if they are the same (but
userspace should already know this, or it's wasting a huge amount
of time before it even calls the kernel).


Yes, this is what I am talking about. I believe I should be able to read 
data about the BTRFS data structures and determine if this is the case. 
I don't care if there are false matches, due to concurrent updates, but 
there'll be a /lot/ of repeat deduplications unless I do this, because 
even if the file is identical, the mtime etc hasn't changed, and I have 
a record of previously doing a dedupe, there's no guarantee that the 
file hasn't been rewritten in place (eg by rsync), and no way that I 
know of to reliably detect if a file has been changed.
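
One pragmatic approximation (a sketch only, and an assumption on my
part rather than something stated in this thread: ctime, unlike mtime,
cannot be set from userspace, so an in-place rewrite by rsync does bump
it; clock warps and attribute-only changes still cause false positives):

#include <stdbool.h>
#include <sys/stat.h>

/* Compare the ctime recorded after the previous dedupe pass with the
 * file's current ctime; treat any difference (or stat failure) as
 * "may have changed" and re-queue the file for deduplication. */
static bool maybe_changed(const char *path, const struct timespec *recorded)
{
	struct stat st;

	if (stat(path, &st) < 0)
		return true;	/* be conservative */
	return st.st_ctim.tv_sec != recorded->tv_sec ||
	       st.st_ctim.tv_nsec != recorded->tv_nsec;
}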


I am sure there are libraries out there which can look into the data 
structures of a BTRFS file system, I haven't researched this in detail 
though. I imagine that with some kind of lock on a BTRFS root, this 
could be achieved by simply reading the data from the disk, since I 
believe that everything is copy-on-write, so no existing data should be 
overwritten until all roots referring to it are updated. Perhaps I'm 
missing something though...


James


Re: Announcing btrfs-dedupe

2016-11-14 Thread Austin S. Hemmelgarn

On 2016-11-14 14:51, Zygo Blaxell wrote:

On Mon, Nov 14, 2016 at 01:39:02PM -0500, Austin S. Hemmelgarn wrote:

On 2016-11-14 13:22, James Pharaoh wrote:

One thing I am keen to understand is if BTRFS will automatically ignore
a request to deduplicate a file if it is already deduplicated? Given the
performance I see when doing a repeat deduplication, it seems to me that
it can't be doing so, although this could be caused by the CPU usage you
mention above.

What's happening is that the dedupe ioctl does a byte-wise comparison of the
ranges to make sure they're the same before linking them.  This is actually
what takes most of the time when calling the ioctl, and is part of why it
takes longer the larger the range to deduplicate is.  In essence, it's
behaving like an OS should and not trusting userspace to make reasonable
requests (which is also why there's a separate ioctl to clone a range from
another file instead of deduplicating existing data).


Deduplicating an extent that might be concurrently modified during the
dedup is a reasonable userspace request.  In the general case there's
no way for userspace to ensure that it's not happening.

I'm not even talking about the locking, I'm talking about the data 
comparison that the ioctl does to ensure they are the same before 
deduplicating them, and specifically that protecting against userspace 
just passing in two random extents that happen to be the same size but 
not contain the same data (because deduplication _should_ reject such a 
situation, that's what the clone ioctl is for).


The locking is perfectly reasonable and shouldn't contribute that much 
to the overhead (unless you're being crazy and deduplicating thousands 
of tiny blocks of data).


That said, some optimization is possible (although there are good reasons
not to bother with optimization in the kernel):

- VFS could recognize when it has two separate references to
the same physical extent and not re-read the same data twice
(but that requires teaching VFS how to do CoW in general, and is
hard for political reasons on top of the obvious technical ones).

- the extent-same ioctl could check to see which extents
are referenced by the src and dst ranges, and return success
immediately without reading data if they are the same (but
userspace should already know this, or it's wasting a huge amount
of time before it even calls the kernel).


TBH, even though it's kind of annoying from a performance perspective, it's
a rather nice safety net to have.  For example, one of the cases where I do
deduplication is a couple of directories where each directory is an
overlapping partial subset of one large tree which I keep elsewhere.  In
this case, I can tell just by filename exactly what files might be
duplicates, so the ioctl's check lets me just call the ioctl on all
potential duplicates (after checking size, no point in wasting time if the
files obviously aren't duplicates), and have it figure out whether or not
they can be deduplicated.


In any case, I'm considering some digging into the filesystem structures
to see if I can work this out myself before i do any deduplication. I'm
fairly sure this should be relatively simple to work out, at least well
enough for my purposes.

Sadly, there's no way to avoid doing so right now.



Re: Announcing btrfs-dedupe

2016-11-14 Thread Zygo Blaxell
On Mon, Nov 14, 2016 at 01:39:02PM -0500, Austin S. Hemmelgarn wrote:
> On 2016-11-14 13:22, James Pharaoh wrote:
> >One thing I am keen to understand is if BTRFS will automatically ignore
> >a request to deduplicate a file if it is already deduplicated? Given the
> >performance I see when doing a repeat deduplication, it seems to me that
> >it can't be doing so, although this could be caused by the CPU usage you
> >mention above.
> What's happening is that the dedupe ioctl does a byte-wise comparison of the
> ranges to make sure they're the same before linking them.  This is actually
> what takes most of the time when calling the ioctl, and is part of why it
> takes longer the larger the range to deduplicate is.  In essence, it's
> behaving like an OS should and not trusting userspace to make reasonable
> requests (which is also why there's a separate ioctl to clone a range from
> another file instead of deduplicating existing data).

Deduplicating an extent that might be concurrently modified during the
dedup is a reasonable userspace request.  In the general case there's
no way for userspace to ensure that it's not happening.

That said, some optimization is possible (although there are good reasons
not to bother with optimization in the kernel):

- VFS could recognize when it has two separate references to
the same physical extent and not re-read the same data twice
(but that requires teaching VFS how to do CoW in general, and is
hard for political reasons on top of the obvious technical ones).

- the extent-same ioctl could check to see which extents
are referenced by the src and dst ranges, and return success
immediately without reading data if they are the same (but
userspace should already know this, or it's wasting a huge amount
of time before it even calls the kernel).

> TBH, even though it's kind of annoying from a performance perspective, it's
> a rather nice safety net to have.  For example, one of the cases where I do
> deduplication is a couple of directories where each directory is an
> overlapping partial subset of one large tree which I keep elsewhere.  In
> this case, I can tell just by filename exactly what files might be
> duplicates, so the ioctl's check lets me just call the ioctl on all
> potential duplicates (after checking size, no point in wasting time if the
> files obviously aren't duplicates), and have it figure out whether or not
> they can be deduplicated.
> >
> >In any case, I'm considering some digging into the filesystem structures
> >to see if I can work this out myself before i do any deduplication. I'm
> >fairly sure this should be relatively simple to work out, at least well
> >enough for my purposes.
> Sadly, there's no way to avoid doing so right now.
> 




Re: [PATCH 1/2] Btrfs: fix file extent corruption

2016-11-14 Thread Chris Mason

On 11/14/2016 02:06 PM, Josef Bacik wrote:

In order to do hole punching we have a block reserve to hold the reservation we
need to drop the extents in our range.  Since we could end up dropping a lot of
extents we set rsv->failfast so we can just loop around again and drop the
remaining of the range.  Unfortunately we unconditionally fill the hole extents
in and start from the last extent we encountered, which we may or may not have
dropped.  So this can result in overlapping file extent entries, which can be
tripped over in a variety of ways, either by hitting BUG_ON(!ret) in
fill_holes() after the search, or in btrfs_set_item_key_safe() in
btrfs_drop_extent() at a later time by an unrelated task.  Fix this by only
setting drop_end to the last extent we did actually drop.  This way our holes
are filled in properly for the range that we did drop, and the rest of the range
that remains to be dropped is actually dropped.  Thanks,

Signed-off-by: Josef Bacik 


Thanks for tracking this down Josef.  We should mark it for stable too.

-chris


[PATCH 1/2] Btrfs: fix file extent corruption

2016-11-14 Thread Josef Bacik
In order to do hole punching we have a block reserve to hold the reservation we
need to drop the extents in our range.  Since we could end up dropping a lot of
extents we set rsv->failfast so we can just loop around again and drop the
remaining of the range.  Unfortunately we unconditionally fill the hole extents
in and start from the last extent we encountered, which we may or may not have
dropped.  So this can result in overlapping file extent entries, which can be
tripped over in a variety of ways, either by hitting BUG_ON(!ret) in
fill_holes() after the search, or in btrfs_set_item_key_safe() in
btrfs_drop_extent() at a later time by an unrelated task.  Fix this by only
setting drop_end to the last extent we did actually drop.  This way our holes
are filled in properly for the range that we did drop, and the rest of the range
that remains to be dropped is actually dropped.  Thanks,

Signed-off-by: Josef Bacik 
---
 fs/btrfs/file.c | 13 +++--
 1 file changed, 11 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index cbefdc8..1c15a98 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -706,6 +706,7 @@ int __btrfs_drop_extents(struct btrfs_trans_handle *trans,
u64 num_bytes = 0;
u64 extent_offset = 0;
u64 extent_end = 0;
+   u64 last_end = 0;
int del_nr = 0;
int del_slot = 0;
int extent_type;
@@ -797,8 +798,10 @@ next_slot:
 * extent item in the call to setup_items_for_insert() later
 * in this function.
 */
-   if (extent_end == key.offset && extent_end >= search_start)
+   if (extent_end == key.offset && extent_end >= search_start) {
+   last_end = extent_end;
goto delete_extent_item;
+   }
 
if (extent_end <= search_start) {
path->slots[0]++;
@@ -861,6 +864,12 @@ next_slot:
key.offset = start;
}
/*
+* From here on out we will have actually dropped something, so
+* last_end can be updated.
+*/
+   last_end = extent_end;
+
+   /*
 *  |  range to drop - |
 *  |  extent  |
 */
@@ -1010,7 +1019,7 @@ delete_extent_item:
if (!replace_extent || !(*key_inserted))
btrfs_release_path(path);
if (drop_end)
-   *drop_end = found ? min(end, extent_end) : end;
+   *drop_end = found ? min(end, last_end) : end;
return ret;
 }
 
-- 
2.7.4



[PATCH 2/2] Btrfs: abort transaction if fill_holes() fails

2016-11-14 Thread Josef Bacik
At this point we will have dropped extent entries from the file, so if we fail
to insert the new hole entries then we are leaving the fs in a corrupt state
(albeit an easily fixed one).  Abort the transaction if this happens so we can
avoid corrupting the fs.  Thanks,

Signed-off-by: Josef Bacik 
---
 fs/btrfs/file.c | 17 +++--
 1 file changed, 15 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 1c15a98..d6fc719 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -2234,9 +2234,14 @@ static int fill_holes(struct btrfs_trans_handle *trans, struct inode *inode,
key.offset = offset;
 
ret = btrfs_search_slot(trans, root, &key, path, 0, 1);
-   if (ret < 0)
+   if (ret <= 0) {
+   /* We should have dropped this offset, so if we find it then
+* something has gone horribly wrong.
+*/
+   if (ret == 0)
+   ret = -EINVAL;
return ret;
-   BUG_ON(!ret);
+   }
 
leaf = path->nodes[0];
if (hole_mergeable(inode, leaf, path->slots[0]-1, offset, end)) {
@@ -2539,6 +2544,12 @@ static int btrfs_punch_hole(struct inode *inode, loff_t offset, loff_t len)
ret = fill_holes(trans, inode, path, cur_offset,
 drop_end);
if (ret) {
+   /* If we failed then we didn't insert our hole
+* entries for the area we dropped, so now the
+* fs is corrupted, so we must abort the
+* transaction.
+*/
+   btrfs_abort_transaction(trans, ret);
err = ret;
break;
}
@@ -2603,6 +2614,8 @@ static int btrfs_punch_hole(struct inode *inode, loff_t offset, loff_t len)
if (cur_offset < ino_size && cur_offset < drop_end) {
ret = fill_holes(trans, inode, path, cur_offset, drop_end);
if (ret) {
+   /* Same comment as above. */
+   btrfs_abort_transaction(trans, ret);
err = ret;
goto out_trans;
}
-- 
2.7.4

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2 1/6] btrfs-progs: add the FREE_SPACE_TREE_VALID compat_ro bit definition

2016-11-14 Thread Omar Sandoval
From: Omar Sandoval 

This is just the definition; we don't support it yet.

Signed-off-by: Omar Sandoval 
---
 ctree.h | 13 +
 1 file changed, 13 insertions(+)

diff --git a/ctree.h b/ctree.h
index d47f0ae..d67b852 100644
--- a/ctree.h
+++ b/ctree.h
@@ -467,6 +467,14 @@ struct btrfs_super_block {
  * ones specified below then we will fail to mount
  */
 #define BTRFS_FEATURE_COMPAT_RO_FREE_SPACE_TREE	(1ULL << 0)
+/*
+ * Older kernels on big-endian systems produced broken free space tree bitmaps,
+ * and btrfs-progs also used to corrupt the free space tree. If this bit is
+ * clear, then the free space tree cannot be trusted. btrfs-progs can also
+ * intentionally clear this bit to ask the kernel to rebuild the free space
+ * tree.
+ */
+#define BTRFS_FEATURE_COMPAT_RO_FREE_SPACE_TREE_VALID  (1ULL << 1)
 
 #define BTRFS_FEATURE_INCOMPAT_MIXED_BACKREF   (1ULL << 0)
 #define BTRFS_FEATURE_INCOMPAT_DEFAULT_SUBVOL  (1ULL << 1)
@@ -493,6 +501,11 @@ struct btrfs_super_block {
 
 #define BTRFS_FEATURE_COMPAT_SUPP  0ULL
 
+/*
+ * The FREE_SPACE_TREE and FREE_SPACE_TREE_VALID compat_ro bits must not be
+ * added here until read-write support for the free space tree is implemented in
+ * btrfs-progs.
+ */
 #define BTRFS_FEATURE_COMPAT_RO_SUPP   0ULL
 
 #define BTRFS_FEATURE_INCOMPAT_SUPP\
-- 
2.10.2
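
To make the intended semantics concrete, here is a sketch of the check a
consumer of the superblock might perform once read-write support lands (the
accessor name follows the existing compat_ro helpers; this snippet is not
part of the patch):

	u64 ro_flags = btrfs_super_compat_ro_flags(sb);

	if ((ro_flags & BTRFS_FEATURE_COMPAT_RO_FREE_SPACE_TREE) &&
	    !(ro_flags & BTRFS_FEATURE_COMPAT_RO_FREE_SPACE_TREE_VALID)) {
		/*
		 * A free space tree exists but cannot be trusted; the
		 * kernel is expected to rebuild it on the next mount.
		 */
	}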

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2 6/6] btrfs-progs: document space_cache=v2 more thoroughly

2016-11-14 Thread Omar Sandoval
From: Omar Sandoval 

Signed-off-by: Omar Sandoval 
---
 Documentation/btrfs-man5.asciidoc | 43 +++
 1 file changed, 25 insertions(+), 18 deletions(-)

diff --git a/Documentation/btrfs-man5.asciidoc b/Documentation/btrfs-man5.asciidoc
index d12b059..3b003d3 100644
--- a/Documentation/btrfs-man5.asciidoc
+++ b/Documentation/btrfs-man5.asciidoc
@@ -319,25 +319,32 @@ May be resumed with *btrfs balance resume* or the paused state can be removed
 by *btrfs balance cancel*. The default behaviour is to start interrupted balance.
 
 *space_cache*::
-*space_cache=v2*::
+*space_cache='version'*::
 *nospace_cache*::
-('nospace_cache' since: 3.2, 'space_cache=v2' since 4.5, default: on)
-+
-Options to control the free space cache.  This affects performance as searching
-for new free blocks could take longer if the space cache is not enabled. On the
-other hand, managing the space cache consumes some resources.  It can be
-disabled without clearing at mount time.
-+
-There are two implementations of how the space is tracked. The safe default is
-'v1'.  On large filesystems (many-terabytes) and certain workloads the 'v1'
-performance may degrade.  This problem is addressed by 'v2', that is based on
-b-trees, sometimes referred to as 'free-space-tree'.
-+
-'Compatibility notes:'
-+
-* the 'v2' has to be enabled manually at mount time, once
-* kernel without 'v2' support will be able to mount the filesystem in read-only mode
-* 'v2' can be removed by mounting with 'clear_cache'
+('nospace_cache' since: 3.2, 'space_cache=v1' and 'space_cache=v2' since 4.5, default: 'space_cache=v1')
++
+Options to control the free space cache. The free space cache greatly improves
+performance when reading block group free space into memory. However, managing
+the space cache consumes some resources, including a small amount of disk
+space.
++
+There are two implementations of the free space cache. The original
+implementation, 'v1', is the safe default. The 'v1' space cache can be disabled
+at mount time with 'nospace_cache' without clearing.
++
+On very large filesystems (many terabytes) and certain workloads, the
+performance of the 'v1' space cache may degrade drastically. The 'v2'
+implementation, which adds a new B-tree called the free space tree, addresses
+this issue. Once enabled, the 'v2' space cache will always be used and cannot
+be disabled unless it is cleared. Use 'clear_cache,space_cache=v1' or
+'clear_cache,nospace_cache' to do so. If 'v2' is enabled, kernels without 'v2'
+support will only be able to mount the filesystem in read-only mode. The
+`btrfs(8)` command currently only has read-only support for 'v2'. A read-write
+command may be run on a 'v2' filesystem by clearing the cache, running the
+command, and then remounting with 'space_cache=v2'.
++
+If a version is not explicitly specified, the default implementation will be
+chosen, which is 'v1' as of 4.9.
 
 *ssd*::
 *nossd*::
-- 
2.10.2
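
The transitions described above can also be driven from a program; a sketch
using mount(2), with the device and mount point as placeholders:

#include <stdio.h>
#include <sys/mount.h>

int main(void)
{
	/* Enable the v2 space cache; this creates the free space tree. */
	if (mount("/dev/sdX", "/mnt", "btrfs", 0, "space_cache=v2") < 0)
		perror("mount space_cache=v2");
	/*
	 * Going back to v1 requires clearing the tree first, e.g.
	 * mounting once with "clear_cache,space_cache=v1".
	 */
	return 0;
}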

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2 4/6] btrfs-progs: add btrfs_clear_free_space_tree() from the kernel

2016-11-14 Thread Omar Sandoval
From: Omar Sandoval 

Reviewed-by: Qu Wenruo 
Signed-off-by: Omar Sandoval 
---
 ctree.h   |  6 
 extent-tree.c | 11 +++
 free-space-tree.c | 91 +++
 free-space-tree.h |  1 +
 root-tree.c   | 25 +++
 5 files changed, 134 insertions(+)

diff --git a/ctree.h b/ctree.h
index d67b852..b433bca 100644
--- a/ctree.h
+++ b/ctree.h
@@ -2504,6 +2504,10 @@ int btrfs_inc_ref(struct btrfs_trans_handle *trans, 
struct btrfs_root *root,
  struct extent_buffer *buf, int record_parent);
 int btrfs_dec_ref(struct btrfs_trans_handle *trans, struct btrfs_root *root,
  struct extent_buffer *buf, int record_parent);
+int btrfs_free_tree_block(struct btrfs_trans_handle *trans,
+ struct btrfs_root *root,
+ struct extent_buffer *buf,
+ u64 parent, int last_ref);
 int btrfs_free_extent(struct btrfs_trans_handle *trans,
  struct btrfs_root *root,
  u64 bytenr, u64 num_bytes, u64 parent,
@@ -2664,6 +2668,8 @@ int btrfs_add_root_ref(struct btrfs_trans_handle *trans,
 int btrfs_insert_root(struct btrfs_trans_handle *trans, struct btrfs_root
  *root, struct btrfs_key *key, struct btrfs_root_item
  *item);
+int btrfs_del_root(struct btrfs_trans_handle *trans, struct btrfs_root *root,
+  struct btrfs_key *key);
 int btrfs_update_root(struct btrfs_trans_handle *trans, struct btrfs_root
  *root, struct btrfs_key *key, struct btrfs_root_item
  *item);
diff --git a/extent-tree.c b/extent-tree.c
index 3b1577e..b52c515 100644
--- a/extent-tree.c
+++ b/extent-tree.c
@@ -2467,6 +2467,17 @@ static int del_pending_extents(struct btrfs_trans_handle *trans, struct
return err;
 }
 
+
+int btrfs_free_tree_block(struct btrfs_trans_handle *trans,
+ struct btrfs_root *root,
+ struct extent_buffer *buf,
+ u64 parent, int last_ref)
+{
+   return btrfs_free_extent(trans, root, buf->start, buf->len, parent,
+root->root_key.objectid,
+btrfs_header_level(buf), 0);
+}
+
 /*
  * remove an extent from the root, returns 0 on success
  */
diff --git a/free-space-tree.c b/free-space-tree.c
index 3c7a246..b9e21f7 100644
--- a/free-space-tree.c
+++ b/free-space-tree.c
@@ -20,6 +20,7 @@
 #include "disk-io.h"
 #include "free-space-cache.h"
 #include "free-space-tree.h"
+#include "transaction.h"
 
 static struct btrfs_free_space_info *
 search_free_space_info(struct btrfs_trans_handle *trans,
@@ -67,6 +68,96 @@ static int free_space_test_bit(struct btrfs_block_group_cache *block_group,
return !!extent_buffer_test_bit(leaf, ptr, i);
 }
 
+static int clear_free_space_tree(struct btrfs_trans_handle *trans,
+struct btrfs_root *root)
+{
+   struct btrfs_path *path;
+   struct btrfs_key key;
+   int nr;
+   int ret;
+
+   path = btrfs_alloc_path();
+   if (!path)
+   return -ENOMEM;
+
+   key.objectid = 0;
+   key.type = 0;
+   key.offset = 0;
+
+   while (1) {
+   ret = btrfs_search_slot(trans, root, &key, path, -1, 1);
+   if (ret < 0)
+   goto out;
+
+   nr = btrfs_header_nritems(path->nodes[0]);
+   if (!nr)
+   break;
+
+   path->slots[0] = 0;
+   ret = btrfs_del_items(trans, root, path, 0, nr);
+   if (ret)
+   goto out;
+
+   btrfs_release_path(path);
+   }
+
+   ret = 0;
+out:
+   btrfs_free_path(path);
+   return ret;
+}
+
+int btrfs_clear_free_space_tree(struct btrfs_fs_info *fs_info)
+{
+   struct btrfs_trans_handle *trans;
+   struct btrfs_root *tree_root = fs_info->tree_root;
+   struct btrfs_root *free_space_root = fs_info->free_space_root;
+   int ret;
+   u64 features;
+
+   trans = btrfs_start_transaction(tree_root, 0);
+   if (IS_ERR(trans))
+   return PTR_ERR(trans);
+
+
+   features = btrfs_super_compat_ro_flags(fs_info->super_copy);
+   features &= ~(BTRFS_FEATURE_COMPAT_RO_FREE_SPACE_TREE_VALID |
+ BTRFS_FEATURE_COMPAT_RO_FREE_SPACE_TREE);
+   btrfs_set_super_compat_ro_flags(fs_info->super_copy, features);
+   fs_info->free_space_root = NULL;
+
+   ret = clear_free_space_tree(trans, free_space_root);
+   if (ret)
+   goto abort;
+
+   ret = btrfs_del_root(trans, tree_root, &free_space_root->root_key);
+   if (ret)
+   goto abort;
+
+   list_del(&free_space_root->dirty_list);
+
+   ret = clean_tree_block(trans, tree_root, free_space_root->node);
+   if (ret)
+  

Re: Announcing btrfs-dedupe

2016-11-14 Thread Zygo Blaxell
On Mon, Nov 14, 2016 at 07:22:59PM +0100, James Pharaoh wrote:
> On 14/11/16 19:07, Zygo Blaxell wrote:
> >There is also a still-unresolved problem where the filesystem CPU usage
> >rises exponentially for some operations depending on the number of shared
> >references to an extent.  Files which contain blocks with more than a few
> >thousand shared references can trigger this problem.  A file over 1TB can
> >keep the kernel busy at 100% CPU for over 40 minutes at a time.
> 
> Yes, I see this all the time. For my use cases, I don't really care about
> "shared references" as blocks of files, but am happy to simply deduplicate
> at the whole-file level. I wonder if this still will have the same effect,
> however. I guess that this could be mitigated in a tool, but this is going
> to be both annoying and not the most elegant solution.

If you have huge files (1TB+) this can be a problem even with whole-file
deduplications (which are really just extent-level deduplications applied
to the entire file).  The CPU time is a product of file size and extent
reference count with some other multipliers on top.

I've hacked around it by timing how long it takes to manipulate the data,
and blacklisting any hash value or block address that takes more than
10 seconds to process (if such a block is found after blacklisting, just
skip processing the block/extent/file entirely).  It turns out there are
very few of these in practice (only a few hundred per TB) but these few
hundred block hash values occur millions of times in a large data corpus.
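
A sketch of that mitigation, assuming hypothetical helpers for the blacklist
and for issuing the dedup request (none of these names are a real API; the
10-second cutoff follows the description above):

#include <stdbool.h>
#include <time.h>
#include <sys/types.h>

/* Hypothetical helpers, not a real API. */
bool blacklist_contains(const unsigned char *hash);
void blacklist_add(const unsigned char *hash);
int do_extent_same(int src_fd, int dst_fd, off_t off, off_t len);

static int dedupe_block(const unsigned char *hash, int src_fd, int dst_fd,
			off_t off, off_t len)
{
	struct timespec t0, t1;
	int ret;

	/* Skip hash values already known to be pathologically slow. */
	if (blacklist_contains(hash))
		return 0;

	clock_gettime(CLOCK_MONOTONIC, &t0);
	ret = do_extent_same(src_fd, dst_fd, off, len);
	clock_gettime(CLOCK_MONOTONIC, &t1);

	/* Took too long: never touch this hash/extent again. */
	if (t1.tv_sec - t0.tv_sec > 10)
		blacklist_add(hash);
	return ret;
}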

> One thing I am keen to understand is if BTRFS will automatically ignore a
> request to deduplicate a file if it is already deduplicated? Given the
> performance I see when doing a repeat deduplication, it seems to me that it
> can't be doing so, although this could be caused by the CPU usage you
> mention above.

As far as I can tell btrfs doesn't do anything different in this
case--it'll happily repeat the entire lock/read/compare/delete/insert
sequence even if the outcome cannot be different from the initial
conditions.  Due to limitations of VFS caching it'll read the same blocks
from storage hardware twice, too.

> In any case, I'm considering some digging into the filesystem structures to
> see if I can work this out myself before i do any deduplication. I'm fairly
> sure this should be relatively simple to work out, at least well enough for
> my purposes.

I used FIEMAP (then later replaced it with SEARCH_V2 for speed) to map
the extents to physical addresses before deduping them.  If you're only
going to do whole-file dedup then you only need to care about the physical
address of the first non-hole extent.
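
A sketch of that lookup with FIEMAP (error handling trimmed; FIEMAP only
reports allocated extents, so the first extent returned is the first
non-hole one):

#include <string.h>
#include <sys/ioctl.h>
#include <linux/fs.h>
#include <linux/fiemap.h>

/* Return the physical address of the file's first non-hole extent,
 * or 0 if there is none (or on error). */
static __u64 first_extent_physical(int fd)
{
	char buf[sizeof(struct fiemap) + sizeof(struct fiemap_extent)];
	struct fiemap *fm = (struct fiemap *)buf;

	memset(buf, 0, sizeof(buf));
	fm->fm_length = FIEMAP_MAX_OFFSET;
	fm->fm_flags = FIEMAP_FLAG_SYNC;	/* flush delalloc first */
	fm->fm_extent_count = 1;

	if (ioctl(fd, FS_IOC_FIEMAP, fm) < 0 || fm->fm_mapped_extents == 0)
		return 0;
	return fm->fm_extents[0].fe_physical;
}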



signature.asc
Description: Digital signature


[PATCH v2 3/6] btrfs-progs: add OPEN_CTREE_INVALIDATE_FST flag

2016-11-14 Thread Omar Sandoval
From: Omar Sandoval 

If this flag is passed to open_ctree(), we'll clear the
FREE_SPACE_TREE_VALID compat_ro bit. The kernel will then reconstruct
the free space tree the next time the filesystem is mounted.

Reviewed-by: Qu Wenruo 
Signed-off-by: Omar Sandoval 
---
 chunk-recover.c |  2 +-
 disk-io.c   | 29 +++--
 disk-io.h   |  9 -
 3 files changed, 28 insertions(+), 12 deletions(-)

diff --git a/chunk-recover.c b/chunk-recover.c
index e33ee8b..e6b26ac 100644
--- a/chunk-recover.c
+++ b/chunk-recover.c
@@ -1477,7 +1477,7 @@ open_ctree_with_broken_chunk(struct recover_control *rc)
 
memcpy(fs_info->fsid, &disk_super->fsid, BTRFS_FSID_SIZE);
 
-   ret = btrfs_check_fs_compatibility(disk_super, 1);
+   ret = btrfs_check_fs_compatibility(disk_super, OPEN_CTREE_WRITES);
if (ret)
goto out_devices;
 
diff --git a/disk-io.c b/disk-io.c
index a576300..de25fcd 100644
--- a/disk-io.c
+++ b/disk-io.c
@@ -904,7 +904,8 @@ free_all:
return NULL;
 }
 
-int btrfs_check_fs_compatibility(struct btrfs_super_block *sb, int writable)
+int btrfs_check_fs_compatibility(struct btrfs_super_block *sb,
+unsigned int flags)
 {
u64 features;
 
@@ -923,13 +924,22 @@ int btrfs_check_fs_compatibility(struct btrfs_super_block *sb, int writable)
btrfs_set_super_incompat_flags(sb, features);
}
 
-   features = btrfs_super_compat_ro_flags(sb) &
-   ~BTRFS_FEATURE_COMPAT_RO_SUPP;
-   if (writable && features) {
-   printk("couldn't open RDWR because of unsupported "
-  "option features (%Lx).\n",
-  (unsigned long long)features);
-   return -ENOTSUP;
+   features = btrfs_super_compat_ro_flags(sb);
+   if (flags & OPEN_CTREE_WRITES) {
+   if (flags & OPEN_CTREE_INVALIDATE_FST) {
+   /* Clear the FREE_SPACE_TREE_VALID bit on disk... */
+   features &= ~BTRFS_FEATURE_COMPAT_RO_FREE_SPACE_TREE_VALID;
+   btrfs_set_super_compat_ro_flags(sb, features);
+   /* ... and ignore the free space tree bit. */
+   features &= ~BTRFS_FEATURE_COMPAT_RO_FREE_SPACE_TREE;
+   }
+   if (features & ~BTRFS_FEATURE_COMPAT_RO_SUPP) {
+   printk("couldn't open RDWR because of unsupported "
+  "option features (%Lx).\n",
+  (unsigned long long)features);
+   return -ENOTSUP;
+   }
+
}
return 0;
 }
@@ -1320,8 +1330,7 @@ static struct btrfs_fs_info *__open_ctree_fd(int fp, const char *path,
 
memcpy(fs_info->fsid, &disk_super->fsid, BTRFS_FSID_SIZE);
 
-   ret = btrfs_check_fs_compatibility(fs_info->super_copy,
-  flags & OPEN_CTREE_WRITES);
+   ret = btrfs_check_fs_compatibility(fs_info->super_copy, flags);
if (ret)
goto out_devices;
 
diff --git a/disk-io.h b/disk-io.h
index 8ab36aa..d4a0e7f 100644
--- a/disk-io.h
+++ b/disk-io.h
@@ -74,6 +74,12 @@ enum btrfs_open_ctree_flags {
 
/* Allow to open a partially created filesystem */
OPEN_CTREE_FS_PARTIAL = (1U << 12),
+
+   /*
+* Invalidate the free space tree (i.e., clear the FREE_SPACE_TREE_VALID
+* compat_ro bit).
+*/
+   OPEN_CTREE_INVALIDATE_FST = (1U << 13),
 };
 
 /*
@@ -128,7 +134,8 @@ int clean_tree_block(struct btrfs_trans_handle *trans,
 
 void btrfs_free_fs_info(struct btrfs_fs_info *fs_info);
 struct btrfs_fs_info *btrfs_new_fs_info(int writable, u64 sb_bytenr);
-int btrfs_check_fs_compatibility(struct btrfs_super_block *sb, int writable);
+int btrfs_check_fs_compatibility(struct btrfs_super_block *sb,
+unsigned int flags);
 int btrfs_setup_all_roots(struct btrfs_fs_info *fs_info, u64 root_tree_bytenr,
  unsigned flags);
 void btrfs_release_all_roots(struct btrfs_fs_info *fs_info);
-- 
2.10.2
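
A sketch of the intended caller side (the exact open_ctree entry point and
its argument list vary between btrfs-progs versions, so treat the signature
here as illustrative only):

	unsigned int flags = OPEN_CTREE_WRITES | OPEN_CTREE_INVALIDATE_FST;
	struct btrfs_fs_info *info;

	/*
	 * Opens the filesystem read-write and clears the
	 * FREE_SPACE_TREE_VALID bit in the superblock, so the kernel
	 * rebuilds the free space tree on the next mount.
	 */
	info = open_ctree_fs_info(device_path, 0, 0, 0, flags);
	if (!info)
		return -1;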

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2 5/6] btrfs-progs: implement btrfs check --clear-space-cache v2

2016-11-14 Thread Omar Sandoval
From: Omar Sandoval 

Reviewed-by: Qu Wenruo 
Signed-off-by: Omar Sandoval 
---
 Documentation/btrfs-check.asciidoc | 14 +-
 cmds-check.c   | 34 +-
 2 files changed, 34 insertions(+), 14 deletions(-)

diff --git a/Documentation/btrfs-check.asciidoc b/Documentation/btrfs-check.asciidoc
index 5ef414e..633cbbf 100644
--- a/Documentation/btrfs-check.asciidoc
+++ b/Documentation/btrfs-check.asciidoc
@@ -80,12 +80,16 @@ superblock is damaged.
 
 --clear-space-cache v1|v2::
 completely wipe all free space cache of given type
-
-NOTE: Only v1 free space cache supported is implemented.
 +
-Kernel mount option 'clear_cache' is only designed to rebuild free space cache
-which is modified during the lifetime of that mount option.  It doesn't rebuild
-all free space cache, nor clear them out.
+For free space cache 'v1', the 'clear_cache' kernel mount option only rebuilds
+the free space cache for block groups that are modified while the filesystem is
+mounted with that option. Thus, using this option with 'v1' makes it possible
+to actually clear the entire free space cache.
++
+For free space cache 'v2', the 'clear_cache' kernel mount option does destroy
+the entire free space cache. This option with 'v2' provides an alternative
+method of clearing the free space cache that doesn't require mounting the
+filesystem.
 
 
 DANGEROUS OPTIONS
diff --git a/cmds-check.c b/cmds-check.c
index 57c4300..0eca102 100644
--- a/cmds-check.c
+++ b/cmds-check.c
@@ -11170,7 +11170,6 @@ const char * const cmd_check_usage[] = {
"--chunk-rootuse the given bytenr for the chunk tree 
root",
"-p|--progress   indicate progress",
"--clear-space-cache v1|v2   clear space cache for v1 or v2",
-   "NOTE: v1 support implemented",
NULL
 };
 
@@ -11292,13 +11291,16 @@ int cmd_check(int argc, char **argv)
}
break;
case GETOPT_VAL_CLEAR_SPACE_CACHE:
-   if (strcmp(optarg, "v1") != 0) {
-   error(
-   "only v1 support implmented, unrecognized value %s",
-   optarg);
+   if (strcmp(optarg, "v1") == 0) {
+   clear_space_cache = 1;
+   } else if (strcmp(optarg, "v2") == 0) {
+   clear_space_cache = 2;
ctree_flags |= OPEN_CTREE_INVALIDATE_FST;
+   } else {
+   error("invalid argument to 
--clear-space-cache; "
+ "must be v1 or v2");
exit(1);
}
-   clear_space_cache = 1;
ctree_flags |= OPEN_CTREE_WRITES;
break;
}
@@ -11352,11 +11354,10 @@ int cmd_check(int argc, char **argv)
 
global_info = info;
root = info->fs_root;
-   if (clear_space_cache) {
+   if (clear_space_cache == 1) {
if (btrfs_fs_compat_ro(info,
BTRFS_FEATURE_COMPAT_RO_FREE_SPACE_TREE)) {
-   error(
-   "free space cache v2 detected, clearing not 
implemented");
+   error("free space cache v2 detected; use 
--clear-space-cache v2");
ret = 1;
goto close_out;
}
@@ -11369,6 +11370,21 @@ int cmd_check(int argc, char **argv)
printf("Free space cache cleared\n");
}
goto close_out;
+   } else if (clear_space_cache == 2) {
+   if (!btrfs_fs_compat_ro(info, BTRFS_FEATURE_COMPAT_RO_FREE_SPACE_TREE)) {
+   printf("No free space cache v2 to clear\n");
+   ret = 0;
+   goto close_out;
+   }
+   printf("Clear free space cache v2\n");
+   ret = btrfs_clear_free_space_tree(info);
+   if (ret) {
+   error("Failed to clear free space cache v2");
+   ret = 1;
+   } else {
+   printf("Free space cache v2 cleared\n");
+   }
+   goto close_out;
}
 
/*
-- 
2.10.2

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2 2/6] btrfs-progs: format FREE_SPACE_TREE{,_VALID} nicely in dump-super

2016-11-14 Thread Omar Sandoval
From: Omar Sandoval 

Signed-off-by: Omar Sandoval 
---
 cmds-inspect-dump-super.c | 24 
 1 file changed, 24 insertions(+)

diff --git a/cmds-inspect-dump-super.c b/cmds-inspect-dump-super.c
index d9f7bfb..ba0d708 100644
--- a/cmds-inspect-dump-super.c
+++ b/cmds-inspect-dump-super.c
@@ -197,6 +197,16 @@ struct readable_flag_entry {
char *output;
 };
 
+#define DEF_COMPAT_RO_FLAG_ENTRY(bit_name) \
+   {BTRFS_FEATURE_COMPAT_RO_##bit_name, #bit_name}
+
+static struct readable_flag_entry compat_ro_flags_array[] = {
+   DEF_COMPAT_RO_FLAG_ENTRY(FREE_SPACE_TREE),
+   DEF_COMPAT_RO_FLAG_ENTRY(FREE_SPACE_TREE_VALID),
+};
+static const int compat_ro_flags_num = sizeof(compat_ro_flags_array) /
+  sizeof(struct readable_flag_entry);
+
 #define DEF_INCOMPAT_FLAG_ENTRY(bit_name)  \
{BTRFS_FEATURE_INCOMPAT_##bit_name, #bit_name}
 
@@ -268,6 +278,19 @@ static void __print_readable_flag(u64 flag, struct readable_flag_entry *array,
printf(")\n");
 }
 
+static void print_readable_compat_ro_flag(u64 flag)
+{
+   /*
+* We know about the FREE_SPACE_TREE{,_VALID} bits, but we don't
+* actually support them yet.
+*/
+   return __print_readable_flag(flag, compat_ro_flags_array,
+compat_ro_flags_num,
+BTRFS_FEATURE_COMPAT_RO_SUPP |
+BTRFS_FEATURE_COMPAT_RO_FREE_SPACE_TREE |
+BTRFS_FEATURE_COMPAT_RO_FREE_SPACE_TREE_VALID);
+}
+
 static void print_readable_incompat_flag(u64 flag)
 {
return __print_readable_flag(flag, incompat_flags_array,
@@ -377,6 +400,7 @@ static void dump_superblock(struct btrfs_super_block *sb, int full)
   (unsigned long long)btrfs_super_compat_flags(sb));
printf("compat_ro_flags\t\t0x%llx\n",
   (unsigned long long)btrfs_super_compat_ro_flags(sb));
+   print_readable_compat_ro_flag(btrfs_super_compat_ro_flags(sb));
printf("incompat_flags\t\t0x%llx\n",
   (unsigned long long)btrfs_super_incompat_flags(sb));
print_readable_incompat_flag(btrfs_super_incompat_flags(sb));
-- 
2.10.2

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2 0/6] btrfs-progs: better space_cache=v2 support

2016-11-14 Thread Omar Sandoval
From: Omar Sandoval 

Hi,

Cover letter from v1:

This series implements some support for space_cache=v2 in btrfs-progs.
In particular, this adds support for `btrfs check --clear-space-cache v2`,
proper printing of the free space tree flags for `btrfs inspect-internal
dump-super`, and better documentation.

We'd previously talked about making btrfs-progs always invalidate
the free space tree when doing write operations, but I decided that this
should be an action explicitly requested by the user. It'd also be
unsafe if using a kernel without the free space tree valid bit support,
which is why I didn't implement a `btrfs check --invalidate-free-space-cache`
option. Doing the full clear is always safe.

Still missing is full read-write support, but this should hopefully
cover most btrfs-progs usage.

Changes since v1:

- Change unsigned -> unsigned int argument to
  btrfs_check_fs_compatibility() in patch 3
- Remove BUG_ON() in btrfs_del_root() in patch 4
- Return error from btrfs_free_tree_block() in patch 4
- Handle errors from btrfs_free_tree_block() and clean_tree_block() in
  patch 4
- Add Qu Wenruo's Reviewed-by to patches 3, 4, and 5

Thanks!

Omar Sandoval (6):
  btrfs-progs: add the FREE_SPACE_TREE_VALID compat_ro bit definition
  btrfs-progs: format FREE_SPACE_TREE{,_VALID} nicely in dump-super
  btrfs-progs: add OPEN_CTREE_INVALIDATE_FST flag
  btrfs-progs: add btrfs_clear_free_space_tree() from the kernel
  btrfs-progs: implement btrfs check --clear-space-cache v2
  btrfs-progs: document space_cache=v2 more thoroughly

 Documentation/btrfs-check.asciidoc | 14 +++---
 Documentation/btrfs-man5.asciidoc  | 43 ++
 chunk-recover.c|  2 +-
 cmds-check.c   | 34 ++
 cmds-inspect-dump-super.c  | 24 ++
 ctree.h| 19 
 disk-io.c  | 29 +++-
 disk-io.h  |  9 +++-
 extent-tree.c  | 11 +
 free-space-tree.c  | 91 ++
 free-space-tree.h  |  1 +
 root-tree.c| 25 +++
 12 files changed, 258 insertions(+), 44 deletions(-)

-- 
2.10.2

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Announcing btrfs-dedupe

2016-11-14 Thread Austin S. Hemmelgarn

On 2016-11-14 13:22, James Pharaoh wrote:

On 14/11/16 19:07, Zygo Blaxell wrote:

On Mon, Nov 07, 2016 at 07:49:51PM +0100, James Pharaoh wrote:

Annoyingly I can't find this now, but I definitely remember reading
someone,
apparently someone knowledgable, claim that the latest version of the
kernel
which I was using at the time, still suffered from issues regarding the
dedupe code.



This was a while ago, and I would be very pleased to hear that there
is high
confidence in the current implementation! I'll post a link if I
manage to
find the comments.


I've been running the btrfs dedup ioctl 7 times per second on average
over 42TB of test data for most of a year (and at a lower rate for two
years).  I have not found any data corruptions due to _dedup_.  I did
find
three distinct data corruption kernel bugs unrelated to dedup, and two
test machines with bad RAM, so I'm pretty sure my corruption detection
is working.

That said, I wouldn't run dedup on a kernel older than 4.4.  LTS kernels
might be OK too, but only if they're up to date with backported btrfs
fixes.


Ok, I think this might have referred to the 4.2 kernel, which was newly
released at the time. I wish I could find the post!


Kernels older than 3.13 lack the FILE_EXTENT_SAME ioctl and can
only deduplicate static data (i.e. data you are certain is not being
concurrently modified).  Before 3.12 there are so many bugs you might
as well not bother.


Yes well I don't need to be told that, sadly.


Older kernels are bad for dedup because of non-corruption reasons.
Between 3.13 and 4.4, the following bugs were fixed:

- false-negative capability checks (e.g. same-inode, EOF extent)
reduce dedup efficiency

- ctime updates (older versions would update ctime when a file was
deduped) mess with incremental backup tools, build systems, etc.

- kernel memory leaks (self-explanatory)

- multiple kernel hang/panic bugs (e.g. a deadlock if two threads
try to read the same extent at the same time, and at least one
of those threads is dedup; and there was some race condition
leading to invalid memory access on dedup's comparison reads)
which won't eat your data, but they might ruin your day anyway.


Ok, I have thing I've seen some stuff like this, I certainly have
problems, but never a loss of data. Things can take a LONG time to get
out of the filesystem, though.


There is also a still-unresolved problem where the filesystem CPU usage
rises exponentially for some operations depending on the number of shared
references to an extent.  Files which contain blocks with more than a few
thousand shared references can trigger this problem.  A file over 1TB can
keep the kernel busy at 100% CPU for over 40 minutes at a time.


Yes, I see this all the time. For my use cases, I don't really care
about "shared references" as blocks of files, but am happy to simply
deduplicate at the whole-file level. I wonder if this still will have
the same effect, however. I guess that this could be mitigated in a
tool, but this is going to be both annoying and not the most elegant
solution.
The issue is at the extent level, so it will impact whole files too (but 
it will have less impact on defragmented files that are then 
deduplicated as whole files).  Pretty much anything that pins references 
to extents will impact this, so cloned extents and snapshots will also 
have an impact.



There might also be a correlation between delalloc data and hangs in
extent-same, but I have NOT been able to confirm this.  All I know
at this point is that doing a fsync() on the source FD just before
doing the extent-same ioctl dramatically reduces filesystem hang rates:
several weeks between hangs (or no hangs at all) with fsync, vs. 18 hours
or less without.


Interesting, I'll maybe see if I can make use of this.

One thing I am keen to understand is if BTRFS will automatically ignore
a request to deduplicate a file if it is already deduplicated? Given the
performance I see when doing a repeat deduplication, it seems to me that
it can't be doing so, although this could be caused by the CPU usage you
mention above.
What's happening is that the dedupe ioctl does a byte-wise comparison of 
the ranges to make sure they're the same before linking them.  This is 
actually what takes most of the time when calling the ioctl, and is part 
of why it takes longer the larger the range to deduplicate is.  In 
essence, it's behaving like an OS should and not trusting userspace to 
make reasonable requests (which is also why there's a separate ioctl to 
clone a range from another file instead of deduplicating existing data).
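
For reference, a sketch of driving that ioctl from userspace with the
BTRFS_IOC_FILE_EXTENT_SAME interface from linux/btrfs.h (error handling
trimmed; the fsync() reflects the workaround mentioned earlier in the
thread):

#include <stdlib.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/btrfs.h>

/*
 * Ask the kernel to dedupe `len` bytes at offset 0 of src into offset
 * 0 of dst. The kernel locks both ranges, byte-compares them, and only
 * shares the extents if the contents are identical.
 */
static int dedupe_whole_range(int src_fd, int dst_fd, __u64 len)
{
	struct btrfs_ioctl_same_args *args;
	int ret;

	args = calloc(1, sizeof(*args) + sizeof(args->info[0]));
	if (!args)
		return -1;
	args->logical_offset = 0;
	args->length = len;
	args->dest_count = 1;
	args->info[0].fd = dst_fd;
	args->info[0].logical_offset = 0;

	fsync(src_fd);	/* reported to reduce hang rates, see above */
	ret = ioctl(src_fd, BTRFS_IOC_FILE_EXTENT_SAME, args);
	if (ret == 0 && args->info[0].status != 0)
		ret = -1;	/* e.g. BTRFS_SAME_DATA_DIFFERS */
	free(args);
	return ret;
}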


TBH, even though it's kind of annoying from a performance perspective, 
it's a rather nice safety net to have.  For example, one of the cases 
where I do deduplication is a couple of directories where each directory 
is an overlapping partial subset of one large tree which I keep 
elsewhere.  In this case, I can tell just by filename exactly 

Re: Announcing btrfs-dedupe

2016-11-14 Thread Zygo Blaxell
On Tue, Nov 08, 2016 at 12:06:01PM +0100, Niccolò Belli wrote:
> Nice, you should probably update the btrfs wiki as well, because there is no
> mention of btrfs-dedupe.
> 
> First question, why this name? Don't you plan to support xfs as well?

Does XFS plan to support LOGICAL_INO, INO_PATHS, and something analogous
to SEARCH_V2?

POSIX API + FILE_EXTENT_SAME is OK for the lowest common denominator
across arbitrary filesystems, but a btrfs-specific tool can do a lot
better.  Especially for incremental dedup and low-RAM algorithms.
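
For anyone following along: LOGICAL_INO maps an extent address (as reported
by FIEMAP or SEARCH_V2) back to the inodes referencing it, which is what
makes incremental dedup practical.  A sketch, with buffer handling trimmed;
it needs root and any fd on the filesystem:

#include <stdint.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <linux/btrfs.h>

static int who_references(int fs_fd, __u64 logical)
{
	struct btrfs_ioctl_logical_ino_args args = { 0 };
	struct btrfs_data_container *inodes;
	int ret;

	inodes = calloc(1, 64 * 1024);
	if (!inodes)
		return -1;
	args.logical = logical;
	args.size = 64 * 1024;
	args.inodes = (__u64)(uintptr_t)inodes;

	ret = ioctl(fs_fd, BTRFS_IOC_LOGICAL_INO, &args);
	/* On success, inodes->val[] holds (inum, offset, root) triples. */
	free(inodes);
	return ret;
}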



signature.asc
Description: Digital signature


Re: [PATCH v4] btrfs: make block group flags in balance printks human-readable

2016-11-14 Thread David Sterba
On Mon, Nov 14, 2016 at 06:44:34PM +0100, Adam Borowski wrote:
> They're not even documented anywhere, leaving users with no recourse but
> to RTFS.  It's no big burden to output the bitfield as words.
> 
> Also, display unknown flags as hex.
> 
> Signed-off-by: Adam Borowski 

Added to 4.10 queue, thanks.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Announcing btrfs-dedupe

2016-11-14 Thread James Pharaoh

On 14/11/16 19:07, Zygo Blaxell wrote:

On Mon, Nov 07, 2016 at 07:49:51PM +0100, James Pharaoh wrote:

Annoyingly I can't find this now, but I definitely remember reading someone,
apparently someone knowledgable, claim that the latest version of the kernel
which I was using at the time, still suffered from issues regarding the
dedupe code.



This was a while ago, and I would be very pleased to hear that there is high
confidence in the current implementation! I'll post a link if I manage to
find the comments.


I've been running the btrfs dedup ioctl 7 times per second on average
over 42TB of test data for most of a year (and at a lower rate for two
years).  I have not found any data corruptions due to _dedup_.  I did find
three distinct data corruption kernel bugs unrelated to dedup, and two
test machines with bad RAM, so I'm pretty sure my corruption detection
is working.

That said, I wouldn't run dedup on a kernel older than 4.4.  LTS kernels
might be OK too, but only if they're up to date with backported btrfs
fixes.


Ok, I think this might have referred to the 4.2 kernel, which was newly 
released at the time. I wish I could find the post!



Kernels older than 3.13 lack the FILE_EXTENT_SAME ioctl and can
only deduplicate static data (i.e. data you are certain is not being
concurrently modified).  Before 3.12 there are so many bugs you might
as well not bother.


Yes well I don't need to be told that, sadly.


Older kernels are bad for dedup because of non-corruption reasons.
Between 3.13 and 4.4, the following bugs were fixed:

- false-negative capability checks (e.g. same-inode, EOF extent)
reduce dedup efficiency

- ctime updates (older versions would update ctime when a file was
deduped) mess with incremental backup tools, build systems, etc.

- kernel memory leaks (self-explanatory)

- multiple kernel hang/panic bugs (e.g. a deadlock if two threads
try to read the same extent at the same time, and at least one
of those threads is dedup; and there was some race condition
leading to invalid memory access on dedup's comparison reads)
which won't eat your data, but they might ruin your day anyway.


Ok, I have thing I've seen some stuff like this, I certainly have 
problems, but never a loss of data. Things can take a LONG time to get 
out of the filesystem, though.



There is also a still-unresolved problem where the filesystem CPU usage
rises exponentially for some operations depending on the number of shared
references to an extent.  Files which contain blocks with more than a few
thousand shared references can trigger this problem.  A file over 1TB can
keep the kernel busy at 100% CPU for over 40 minutes at a time.


Yes, I see this all the time. For my use cases, I don't really care 
about "shared references" as blocks of files, but am happy to simply 
deduplicate at the whole-file level. I wonder if this still will have 
the same effect, however. I guess that this could be mitigated in a 
tool, but this is going to be both annoying and not the most elegant 
solution.



There might also be a correlation between delalloc data and hangs in
extent-same, but I have NOT been able to confirm this.  All I know
at this point is that doing a fsync() on the source FD just before
doing the extent-same ioctl dramatically reduces filesystem hang rates:
several weeks between hangs (or no hangs at all) with fsync, vs. 18 hours
or less without.


Interesting, I'll maybe see if I can make use of this.

One thing I am keen to understand is if BTRFS will automatically ignore 
a request to deduplicate a file if it is already deduplicated? Given the 
performance I see when doing a repeat deduplication, it seems to me that 
it can't be doing so, although this could be caused by the CPU usage you 
mention above.


In any case, I'm considering some digging into the filesystem structures 
to see if I can work this out myself before i do any deduplication. I'm 
fairly sure this should be relatively simple to work out, at least well 
enough for my purposes.


James
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Announcing btrfs-dedupe

2016-11-14 Thread Zygo Blaxell
On Mon, Nov 07, 2016 at 07:49:51PM +0100, James Pharaoh wrote:
> Annoyingly I can't find this now, but I definitely remember reading someone,
> apparently someone knowledgable, claim that the latest version of the kernel
> which I was using at the time, still suffered from issues regarding the
> dedupe code.

> This was a while ago, and I would be very pleased to hear that there is high
> confidence in the current implementation! I'll post a link if I manage to
> find the comments.

I've been running the btrfs dedup ioctl 7 times per second on average
over 42TB of test data for most of a year (and at a lower rate for two
years).  I have not found any data corruptions due to _dedup_.  I did find
three distinct data corruption kernel bugs unrelated to dedup, and two
test machines with bad RAM, so I'm pretty sure my corruption detection
is working.

That said, I wouldn't run dedup on a kernel older than 4.4.  LTS kernels
might be OK too, but only if they're up to date with backported btrfs
fixes.

Kernels older than 3.13 lack the FILE_EXTENT_SAME ioctl and can
only deduplicate static data (i.e. data you are certain is not being
concurrently modified).  Before 3.12 there are so many bugs you might
as well not bother.

Older kernels are bad for dedup because of non-corruption reasons.
Between 3.13 and 4.4, the following bugs were fixed:

- false-negative capability checks (e.g. same-inode, EOF extent)
reduce dedup efficiency

- ctime updates (older versions would update ctime when a file was
deduped) mess with incremental backup tools, build systems, etc.

- kernel memory leaks (self-explanatory)

- multiple kernel hang/panic bugs (e.g. a deadlock if two threads
try to read the same extent at the same time, and at least one
of those threads is dedup; and there was some race condition
leading to invalid memory access on dedup's comparison reads)
which won't eat your data, but they might ruin your day anyway.

There is also a still-unresolved problem where the filesystem CPU usage
rises exponentially for some operations depending on the number of shared
references to an extent.  Files which contain blocks with more than a few
thousand shared references can trigger this problem.  A file over 1TB can
keep the kernel busy at 100% CPU for over 40 minutes at a time.

There might also be a correlation between delalloc data and hangs in
extent-same, but I have NOT been able to confirm this.  All I know
at this point is that doing a fsync() on the source FD just before
doing the extent-same ioctl dramatically reduces filesystem hang rates:
several weeks between hangs (or no hangs at all) with fsync, vs. 18 hours
or less without.

> James
> 
> On 07/11/16 18:59, Mark Fasheh wrote:
> >Hi James,
> >
> >Re the following text on your project page:
> >
> >"IMPORTANT CAVEAT — I have read that there are race and/or error
> >conditions which can cause filesystem corruption in the kernel
> >implementation of the deduplication ioctl."
> >
> >Can you expound on that? I'm not aware of any bugs right now but if
> >there is any it'd absolutely be worth having that info on the btrfs
> >list.
> >
> >Thanks,
> >--Mark
> >
> >
> >On Sun, Nov 6, 2016 at 7:30 AM, James Pharaoh
> > wrote:
> >>Hi all,
> >>
> >>I'm pleased to announce my btrfs deduplication utility, written in Rust.
> >>This operates on whole files, is fast, and I believe complements the
> >>existing utilities (duperemove, bedup), which exist currently.
> >>
> >>Please visit the homepage for more information:
> >>
> >>http://btrfs-dedupe.com
> >>
> >>James Pharaoh
> >>--
> >>To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> >>the body of a message to majord...@vger.kernel.org
> >>More majordomo info at  http://vger.kernel.org/majordomo-info.html
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


signature.asc
Description: Digital signature


[PATCH v4] btrfs: make block group flags in balance printks human-readable

2016-11-14 Thread Adam Borowski
They're not even documented anywhere, leaving users with no recourse but
to RTFS.  It's no big burden to output the bitfield as words.

Also, display unknown flags as hex.

Signed-off-by: Adam Borowski 
---
 fs/btrfs/relocation.c | 42 +++---
 1 file changed, 39 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
index c4af0cd..9a3abf3 100644
--- a/fs/btrfs/relocation.c
+++ b/fs/btrfs/relocation.c
@@ -4333,6 +4333,44 @@ static struct reloc_control *alloc_reloc_control(struct btrfs_fs_info *fs_info)
 }
 
 /*
+ * printk the block group being relocated
+ */
+static void describe_relocation(struct btrfs_fs_info *fs_info,
+   struct btrfs_block_group_cache *block_group)
+{
+   char buf[128]; /* prefixed by a '|' that'll be dropped */
+   u64 flags = block_group->flags;
+
+   if (!flags) { /* shouldn't happen */
+   strcpy(buf, "|NONE");
+   } else {
+   char *bp = buf;
+
+#define DESCRIBE_FLAG(f, d) \
+   if (flags & BTRFS_BLOCK_GROUP_##f) { \
+   bp += snprintf(bp, buf - bp + sizeof(buf), "|%s", d); \
+   flags &= ~BTRFS_BLOCK_GROUP_##f; \
+   }
+   DESCRIBE_FLAG(DATA, "data");
+   DESCRIBE_FLAG(SYSTEM,   "system");
+   DESCRIBE_FLAG(METADATA, "metadata");
+   DESCRIBE_FLAG(RAID0,"raid0");
+   DESCRIBE_FLAG(RAID1,"raid1");
+   DESCRIBE_FLAG(DUP,  "dup");
+   DESCRIBE_FLAG(RAID10,   "raid10");
+   DESCRIBE_FLAG(RAID5,"raid5");
+   DESCRIBE_FLAG(RAID6,"raid6");
+   if (flags)
+   snprintf(bp, buf - bp + sizeof(buf), "|0x%llx", flags);
+#undef DESCRIBE_FLAG
+   }
+
+   btrfs_info(fs_info,
+  "relocating block group %llu flags %s",
+  block_group->key.objectid, buf + 1);
+}
+
+/*
  * function to relocate all extents in a block group.
  */
 int btrfs_relocate_block_group(struct btrfs_root *extent_root, u64 group_start)
@@ -4388,9 +4426,7 @@ int btrfs_relocate_block_group(struct btrfs_root *extent_root, u64 group_start)
goto out;
}
 
-   btrfs_info(extent_root->fs_info,
-  "relocating block group %llu flags %llu",
-  rc->block_group->key.objectid, rc->block_group->flags);
+   describe_relocation(extent_root->fs_info, rc->block_group);
 
btrfs_wait_block_group_reservations(rc->block_group);
btrfs_wait_nocow_writers(rc->block_group);
-- 
2.10.2

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


another latest mainline crash in xfstests

2016-11-14 Thread Christoph Hellwig
And if this isn't enough, it seems generic/166 hangs after finishing
the main test (which already takes a very long time):

generic/113 16s ...[ 2498.548221] run fstests generic/113 at 2016-11-14
17:04:56
[ 2498.984322] BTRFS info (device vdb): disk space caching is enabled
[ 2499.983635] BTRFS info (device vdb): disk space caching is enabled
[ 2501.282154] BTRFS info (device vdb): disk space caching is enabled
[ 2528.330899] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s!
[aio-stress:17961]
[ 2528.333617] Modules linked in:
[ 2528.334685] CPU: 0 PID: 17961 Comm: aio-stress Not tainted 4.9.0-rc5 #828
[ 2528.336999] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
1.7.5-20140531_083030-gandalf 04/01/2014
[ 2528.337563] NMI watchdog: BUG: soft lockup - CPU#1 stuck for 22s! 
[aio-stress:17960]
[ 2528.337564] Modules linked in:
[ 2528.337566] CPU: 1 PID: 17960 Comm: aio-stress Not tainted 4.9.0-rc5 #828
[ 2528.337567] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
1.7.5-20140531_083030-gandalf 04/01/2014
[ 2528.337568] task: 880137e98240 task.stack: c91e4000
[ 2528.337573] RIP: 0010:[]  [] 
check_poison_obj+0x4e/0x250
[ 2528.337574] RSP: 0018:c91e7690  EFLAGS: 0297
[ 2528.337575] RAX: 006b RBX: 00a5 RCX: 
[ 2528.337576] RDX: 00a5 RSI: 88013ba7e300 RDI: 88013b800500
[ 2528.337577] RBP: c91e76d8 R08: 0001 R09: 0001
[ 2528.337577] R10:  R11:  R12: ffa5
[ 2528.337578] R13: 88013ba7e300 R14:  R15: 0100
[ 2528.337579] FS:  7fcc22b1e700() GS:88013fc8() 
knlGS:
[ 2528.337580] CS:  0010 DS:  ES:  CR0: 80050033
[ 2528.337581] CR2: 7fcc296e9000 CR3: 00012e2e CR4: 06e0
[ 2528.337584] DR0:  DR1:  DR2: 
[ 2528.337585] DR3:  DR6: fffe0ff0 DR7: 0400
[ 2528.337585] Stack:
[ 2528.337588]  0046 88013ba7e300 88013b800500 
00ff001e76c0
[ 2528.337590]  88013b800500 88013ba7e300 40002800 
81618622
[ 2528.337593]  88013b800500 c91e7708 811dd686 
0286
[ 2528.337594] Call Trace:
[ 2528.337598]  [] ? cache_block_group+0x62/0x3c0
[ 2528.337600]  [] 
cache_alloc_debugcheck_after.isra.71+0x146/0x1c0
[ 2528.337602]  [] ? cache_block_group+0x62/0x3c0
[ 2528.337603]  [] kmem_cache_alloc_trace+0x8e/0x160
[ 2528.337605]  [] cache_block_group+0x62/0x3c0
[ 2528.337607]  [] ? get_caching_control+0x31/0x40
[ 2528.337609]  [] ? wake_up_bit+0x30/0x30
[ 2528.337610]  [] find_free_extent+0x79c/0xef0
[ 2528.337612]  [] btrfs_reserve_extent+0x7e/0x1f0
[ 2528.337614]  [] btrfs_get_blocks_direct+0x360/0x740
[ 2528.337616]  [] __blockdev_direct_IO+0xbc0/0x4410
[ 2528.337619]  [] ? _raw_spin_unlock+0x9/0x10
[ 2528.337620]  [] ? _raw_spin_unlock+0x9/0x10
[ 2528.337622]  [] ? reserve_metadata_bytes+0x180/0x950
[ 2528.337624]  [] ? btrfs_page_exists_in_range+0x110/0x110
[ 2528.337625]  [] ? 
btrfs_endio_direct_write_update_ordered+0xc0/0xc0
[ 2528.337627]  [] btrfs_direct_IO+0x1c2/0x350
[ 2528.337628]  [] ? 
btrfs_endio_direct_write_update_ordered+0xc0/0xc0
[ 2528.337630]  [] generic_file_direct_write+0xa4/0x160
[ 2528.337632]  [] btrfs_file_write_iter+0x175/0x5f0
[ 2528.337634]  [] aio_write+0xb2/0x130
[ 2528.337636]  [] ? do_io_submit+0x224/0x530
[ 2528.337638]  [] ? 
cache_alloc_debugcheck_after.isra.71+0x164/0x1c0
[ 2528.337639]  [] ? do_io_submit+0x224/0x530
[ 2528.337641]  [] ? kmem_cache_alloc+0x14f/0x160
[ 2528.337642]  [] do_io_submit+0x3fd/0x530
[ 2528.337644]  [] SyS_io_submit+0xb/0x10
[ 2528.337646]  [] entry_SYSCALL_64_fastpath+0x1a/0xa9
[ 2528.337668] Code: 7e 75 41 8d 47 ff 48 89 75 c0 45 31 f6 31 db 48 89 7d c8 
41 bc a5 ff ff ff 89 45 d4 3b 5d d4 b8 6b 00 00 00 48 63 d3 41 0f 44 c4 <41> 38 
44 15 00 74 38 45 85 f6 0f 84 41 01 00 00 8d 73 0f 85 db 
[ 2528.344233] NMI watchdog: BUG: soft lockup - CPU#2 stuck for 22s! 
[aio-stress:17962]
[ 2528.344234] Modules linked in:
[ 2528.344236] CPU: 2 PID: 17962 Comm: aio-stress Tainted: G L  4.9.0-rc5 #828
[ 2528.344237] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
1.7.5-20140531_083030-gandalf 04/01/2014
[ 2528.344238] task: 88012aec0380 task.stack: c91f4000
[ 2528.344243] RIP: 0010:[]  [] 
do_raw_spin_lock+0x1/0x120
[ 2528.344244] RSP: 0018:c91f77c0  EFLAGS: 0246
[ 2528.344245] RAX:  RBX: 880138740c00 RCX: 
[ 2528.344245] RDX: 000e RSI: 006b RDI: 88012e5979c0
[ 2528.344246] RBP: c91f77d0 R08: 0430a390 R09: 88013277e700
[ 2528.344247] R10:  R11:  R12: 880131c84000
[ 2528.344247] R13: c91f79b0 R14:  R15: 8801387e6000
[ 2528.344249] FS:  7fcc21b1c700() 

Re: [Bug 186671] New: OOM on system with just rsync running 32GB of ram 30GB of pagecache

2016-11-14 Thread Vlastimil Babka
On 11/14/2016 02:27 PM, E V wrote:
> System is an intel dual socket Xeon E5620, 7500/5520/5500/X58 ICH10
> family according to lspci. Anyways 4.8.4 OOM'd while I was gone. I'll
> download the current 4.9rc and reboot, but in the meantime here's
> xxd, vmstat & kern.log output:
> 8532039 

Hmm this would suggest that the memory is mostly free. But not according
to vmstat. Is it possible you mistakenly provided the xxd from a fresh
boot, but vmstat from after the OOM?

But sure, a page_count() of zero is a reason why __isolate_lru_page()
would fail due to its get_page_unless_zero(). The question is then how
could it drop to zero without being freed at the same time, as
put_page() does.
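
For reference, the helper in question is essentially this (from
include/linux/mm.h of that era):

static inline int get_page_unless_zero(struct page *page)
{
	/* Fails, returning 0, if the refcount is already zero. */
	return page_ref_add_unless(page, 1, 0);
}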

I was going to suspect commit 83929372f6 and a page_ref_sub() it adds to
delete_from_page_cache(), but that's since 4.8 and you mention problems
since 4.7.

Anyway it might be worth enabling CONFIG_DEBUG_VM as the relevant code
usually has VM_BUG_ONs.

Vlastimil

>9324 0100
>2226 0200
> 405 0300
>  80 0400
>  34 0500
>  48 0600
>  17 0700
>  17 0800
>  32 0900
>  19 0a00
>   1 0c00
>   1 0d00
>   1 0e00
>  12 1000
>   8 1100
>  32 1200
>  10 1300
>   2 1400
>  11 1500
>  12 1600
>   7 1700
>   3 1800
>   5 1900
>   6 1a00
>  11 1b00
>  22 1c00
>   3 1d00
>  19 1e00
>  21 1f00
>  18 2000
>  28 2100
>  40 2200
>  38 2300
>  85 2400
>  59 2500
>   40520 81ff
> 
> /proc/vmstat:
> nr_free_pages 60965
> nr_zone_inactive_anon 4646
> nr_zone_active_anon 3265
> nr_zone_inactive_file 633882
> nr_zone_active_file 7017458
> nr_zone_unevictable 0
> nr_zone_write_pending 0
> nr_mlock 0
> nr_slab_reclaimable 299205
> nr_slab_unreclaimable 195497
> nr_page_table_pages 935
> nr_kernel_stack 4976
> nr_bounce 0
> numa_hit 3577063288
> numa_miss 541393191
> numa_foreign 541393191
> numa_interleave 19415
> numa_local 3577063288
> numa_other 0
> nr_free_cma 0
> nr_inactive_anon 4646
> nr_active_anon 3265
> nr_inactive_file 633882
> nr_active_file 7017458
> nr_unevictable 0
> nr_isolated_anon 0
> nr_isolated_file 0
> nr_pages_scanned 0
> workingset_refault 42685891
> workingset_activate 15247281
> workingset_nodereclaim 26375216
> nr_anon_pages 5067
> nr_mapped 5630
> nr_file_pages 7654746
> nr_dirty 0
> nr_writeback 0
> nr_writeback_temp 0
> nr_shmem 2504
> nr_shmem_hugepages 0
> nr_shmem_pmdmapped 0
> nr_anon_transparent_hugepages 0
> nr_unstable 0
> nr_vmscan_write 5243750485
> nr_vmscan_immediate_reclaim 4207633857
> nr_dirtied 1839143430
> nr_written 1832626107
> nr_dirty_threshold 1147728
> nr_dirty_background_threshold 151410
> pgpgin 166731189
> pgpgout 7328142335
> pswpin 98608
> pswpout 117794
> pgalloc_dma 29504
> pgalloc_dma32 1006726216
> pgalloc_normal 5275218188
> pgalloc_movable 0
> allocstall_dma 0
> allocstall_dma32 0
> allocstall_normal 36461
> allocstall_movable 5867
> pgskip_dma 0
> pgskip_dma32 0
> pgskip_normal 6417890
> pgskip_movable 0
> pgfree 6309223401
> pgactivate 35076483
> pgdeactivate 63556974
> pgfault 35753842
> pgmajfault 69126
> pglazyfreed 0
> pgrefill 70008598
> pgsteal_kswapd 3567289713
> pgsteal_direct 5878057
> pgscan_kswapd 9059309872
> pgscan_direct 4239367903
> pgscan_direct_throttle 0
> zone_reclaim_failed 0
> pginodesteal 102916
> slabs_scanned 460790262
> kswapd_inodesteal 9130243
> kswapd_low_wmark_hit_quickly 10634373
> kswapd_high_wmark_hit_quickly 7348173
> pageoutrun 18349115
> pgrotated 16291322
> drop_pagecache 0
> drop_slab 0
> pgmigrate_success 18912908
> pgmigrate_fail 63382146
> compact_migrate_scanned 2986269789
> compact_free_scanned 190451505123
> compact_isolated 109549437
> compact_stall 3544
> compact_fail 8
> compact_success 3536
> compact_daemon_wake 1403515
> htlb_buddy_alloc_success 0
> htlb_buddy_alloc_fail 0
> unevictable_pgs_culled 12473
> unevictable_pgs_scanned 0
> unevictable_pgs_rescued 11979
> unevictable_pgs_mlocked 14556
> unevictable_pgs_munlocked 14556
> unevictable_pgs_cleared 0
> unevictable_pgs_stranded 0
> thp_fault_alloc 0
> thp_fault_fallback 0
> thp_collapse_alloc 0
> thp_collapse_alloc_failed 0
> thp_file_alloc 0
> thp_file_mapped 0
> thp_split_page 0
> thp_split_page_failed 0
> thp_deferred_split_page 0
> thp_split_pmd 0
> thp_zero_page_alloc 0
> thp_zero_page_alloc_failed 0
> 
> kern.log OOM message:
> [737778.724194] snmpd invoked oom-killer:
> gfp_mask=0x24200ca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=0
> [737778.724246] snmpd cpuset=/ mems_allowed=0-1
> [737778.724278] 

Re: [PATCH 3/6] btrfs-progs: add OPEN_CTREE_INVALIDATE_FST flag

2016-11-14 Thread David Sterba
On Mon, Nov 14, 2016 at 08:22:56AM -0800, Omar Sandoval wrote:
> > > --- a/disk-io.c
> > > +++ b/disk-io.c
> > > @@ -904,7 +904,7 @@ free_all:
> > >   return NULL;
> > >  }
> > > 
> > > -int btrfs_check_fs_compatibility(struct btrfs_super_block *sb, int writable)
> > > +int btrfs_check_fs_compatibility(struct btrfs_super_block *sb, unsigned flags)
> > 
> > IIRC kernel checkpatch will warn on single "unsigned", as it's recommended
> > to use "unsigned int".
> 
> I was going for consistency with the rest of disk-io.c, but I can fix
> it.

Not needed IMHO, we don't use unsigned int/unsigned consistently.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 4/6] btrfs-progs: add btrfs_clear_free_space_tree() from the kernel

2016-11-14 Thread Omar Sandoval
On Mon, Nov 14, 2016 at 09:38:19AM +0800, Qu Wenruo wrote:
> 
> 
> At 11/14/2016 03:35 AM, Omar Sandoval wrote:
> > From: Omar Sandoval 
> > 
> > Signed-off-by: Omar Sandoval 
> 
> Only small nits about the BUG_ON() and return value.
> Despite that, feel free to add my reviewed-by:
> 
> Reviewed-by: Qu Wenruo 
> > ---
> >  ctree.h   |  6 
> >  extent-tree.c | 10 +++
> >  free-space-tree.c | 87 +++
> >  free-space-tree.h |  1 +
> >  root-tree.c   | 22 ++
> >  5 files changed, 126 insertions(+)
> > 
> > diff --git a/ctree.h b/ctree.h
> > index d67b852..90193ad 100644
> > --- a/ctree.h
> > +++ b/ctree.h
> > @@ -2504,6 +2504,10 @@ int btrfs_inc_ref(struct btrfs_trans_handle *trans, struct btrfs_root *root,
> >   struct extent_buffer *buf, int record_parent);
> >  int btrfs_dec_ref(struct btrfs_trans_handle *trans, struct btrfs_root *root,
> >   struct extent_buffer *buf, int record_parent);
> > +void btrfs_free_tree_block(struct btrfs_trans_handle *trans,
> > +  struct btrfs_root *root,
> > +  struct extent_buffer *buf,
> > +  u64 parent, int last_ref);
> >  int btrfs_free_extent(struct btrfs_trans_handle *trans,
> >   struct btrfs_root *root,
> >   u64 bytenr, u64 num_bytes, u64 parent,
> > @@ -2664,6 +2668,8 @@ int btrfs_add_root_ref(struct btrfs_trans_handle *trans,
> >  int btrfs_insert_root(struct btrfs_trans_handle *trans, struct btrfs_root
> >   *root, struct btrfs_key *key, struct btrfs_root_item
> >   *item);
> > +int btrfs_del_root(struct btrfs_trans_handle *trans, struct btrfs_root *root,
> > +  struct btrfs_key *key);
> >  int btrfs_update_root(struct btrfs_trans_handle *trans, struct btrfs_root
> >   *root, struct btrfs_key *key, struct btrfs_root_item
> >   *item);
> > diff --git a/extent-tree.c b/extent-tree.c
> > index 3b1577e..d445723 100644
> > --- a/extent-tree.c
> > +++ b/extent-tree.c
> > @@ -2467,6 +2467,16 @@ static int del_pending_extents(struct btrfs_trans_handle *trans, struct
> > return err;
> >  }
> > 
> > +
> > +void btrfs_free_tree_block(struct btrfs_trans_handle *trans,
> > +  struct btrfs_root *root,
> > +  struct extent_buffer *buf,
> > +  u64 parent, int last_ref)
> > +{
> > +   btrfs_free_extent(trans, root, buf->start, buf->len, parent,
> > + root->root_key.objectid, btrfs_header_level(buf), 0);
> > +}
> 
> btrfs_free_extent() will return int.
> 
> Better return it.

Will fix.

> > +
> >  /*
> >   * remove an extent from the root, returns 0 on success
> >   */
> > diff --git a/free-space-tree.c b/free-space-tree.c
> > index 3c7a246..d972f26 100644
> > --- a/free-space-tree.c
> > +++ b/free-space-tree.c
> > @@ -20,6 +20,7 @@
> >  #include "disk-io.h"
> >  #include "free-space-cache.h"
> >  #include "free-space-tree.h"
> > +#include "transaction.h"
> > 
> >  static struct btrfs_free_space_info *
> >  search_free_space_info(struct btrfs_trans_handle *trans,
> > @@ -67,6 +68,92 @@ static int free_space_test_bit(struct btrfs_block_group_cache *block_group,
> > return !!extent_buffer_test_bit(leaf, ptr, i);
> >  }
> > 
> > +static int clear_free_space_tree(struct btrfs_trans_handle *trans,
> > +struct btrfs_root *root)
> > +{
> > +   struct btrfs_path *path;
> > +   struct btrfs_key key;
> > +   int nr;
> > +   int ret;
> > +
> > +   path = btrfs_alloc_path();
> > +   if (!path)
> > +   return -ENOMEM;
> > +
> > +   key.objectid = 0;
> > +   key.type = 0;
> > +   key.offset = 0;
> > +
> > +   while (1) {
> > +   ret = btrfs_search_slot(trans, root, &key, path, -1, 1);
> > +   if (ret < 0)
> > +   goto out;
> > +
> > +   nr = btrfs_header_nritems(path->nodes[0]);
> > +   if (!nr)
> > +   break;
> > +
> > +   path->slots[0] = 0;
> > +   ret = btrfs_del_items(trans, root, path, 0, nr);
> > +   if (ret)
> > +   goto out;
> > +
> > +   btrfs_release_path(path);
> > +   }
> > +
> > +   ret = 0;
> > +out:
> > +   btrfs_free_path(path);
> > +   return ret;
> > +}
> > +
> > +int btrfs_clear_free_space_tree(struct btrfs_fs_info *fs_info)
> > +{
> > +   struct btrfs_trans_handle *trans;
> > +   struct btrfs_root *tree_root = fs_info->tree_root;
> > +   struct btrfs_root *free_space_root = fs_info->free_space_root;
> > +   int ret;
> > +   u64 features;
> > +
> > +   trans = btrfs_start_transaction(tree_root, 0);
> > +   if (IS_ERR(trans))
> > +   return PTR_ERR(trans);
> > +
> > +
> > +   features = btrfs_super_compat_ro_flags(fs_info->super_copy);
> > +   features &= 

Re: [PATCH v3-onstack] btrfs: make block group flags in balance printks human-readable

2016-11-14 Thread David Sterba
On Sat, Nov 12, 2016 at 12:59:59AM +0100, Adam Borowski wrote:
> They're not even documented anywhere, leaving users no recourse but
> to RTFS.  It's no big burden to output the bitfield as words.
> 
> Also, display unknown flags as hex.
> 
> Signed-off-by: Adam Borowski 
> ---
>  fs/btrfs/relocation.c | 41 ++---
>  1 file changed, 38 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
> index c4af0cd..57d867d 100644
> --- a/fs/btrfs/relocation.c
> +++ b/fs/btrfs/relocation.c
> @@ -4333,6 +4333,43 @@ static struct reloc_control 
> *alloc_reloc_control(struct btrfs_fs_info *fs_info)
>  }
>  
>  /*
> + * printk the block group being relocated
> + */
> +static void describe_relocation(struct btrfs_fs_info *fs_info,
> + struct btrfs_block_group_cache *block_group)
> +{
> + char buf[128]; // prefixed by a '|' that'll be dropped

Oh right, moving the buffer to the function is the right way.

As my main objection is addressed, we can proceed to the coding style.
Please don't use the // comments.

> + u64 flags = block_group->flags;
> +
> + if (unlikely(!flags)) // shouldn't happen

add explicit { ... } if there's a multi-statement else block, drop
'unlikely'

> + strcpy(buf, "|NONE");
> + else {
> + char *bp = buf;

newline between declarations and code

> +#define DESCRIBE_FLAG(f, d) \
> + if (flags & BTRFS_BLOCK_GROUP_##f) { \
> + bp += snprintf(bp, buf - bp + sizeof(buf), "|%s", d); \
> + flags &= ~BTRFS_BLOCK_GROUP_##f; \
> + }
> + DESCRIBE_FLAG(DATA, "data");
> + DESCRIBE_FLAG(SYSTEM,   "system");
> + DESCRIBE_FLAG(METADATA, "metadata");
> + DESCRIBE_FLAG(RAID0,"raid0");
> + DESCRIBE_FLAG(RAID1,"raid1");
> + DESCRIBE_FLAG(DUP,  "dup");
> + DESCRIBE_FLAG(RAID10,   "raid10");
> + DESCRIBE_FLAG(RAID5,"raid5");
> + DESCRIBE_FLAG(RAID6,"raid6");
> + if (unlikely(flags))

drop 'unlikely'

> + snprintf(buf, buf - bp + sizeof(buf), "|0x%llx", flags);
> +#undef DESCRIBE_FLAG
> + }
> +
> + btrfs_info(fs_info,
> +"relocating block group %llu flags %s",
> +block_group->key.objectid, buf+1);

   block_group->key.objectid, buf + 1);

space around a binary operator

> +}
> +
> +/*
>   * function to relocate all extents in a block group.
>   */
>  int btrfs_relocate_block_group(struct btrfs_root *extent_root, u64 
> group_start)
> @@ -4388,9 +4425,7 @@ int btrfs_relocate_block_group(struct btrfs_root 
> *extent_root, u64 group_start)
>   goto out;
>   }
>  
> - btrfs_info(extent_root->fs_info,
> -"relocating block group %llu flags %llu",
> -rc->block_group->key.objectid, rc->block_group->flags);
> + describe_relocation(extent_root->fs_info, rc->block_group);
>  
>   btrfs_wait_block_group_reservations(rc->block_group);
>   btrfs_wait_nocow_writers(rc->block_group);
> -- 
> 2.10.2
> 
> This is a version that uses a temp buffer on the stack, but does it in a
> separate function so it doesn't cost us anything when deep call chains are
> involved.  While balance can trigger deep call chains, it's not called
> deeply itself.
> 
> This approach is simpler than mucking with allocs and avoids code
> duplication that would be needed for handling failed alloc.

Agreed.
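
For illustration, a sketch of describe_relocation() with the above
comments applied (C-style comments, braces on the multi-statement else,
no unlikely(), a blank line after declarations, spaces around binary
operators, and the unknown-flags snprintf writing through bp rather than
buf); this is only a sketch, not the actual resend:

static void describe_relocation(struct btrfs_fs_info *fs_info,
                                struct btrfs_block_group_cache *block_group)
{
        char buf[128];  /* prefixed by a '|' that'll be dropped */
        u64 flags = block_group->flags;

        if (!flags) {   /* shouldn't happen */
                strcpy(buf, "|NONE");
        } else {
                char *bp = buf;

#define DESCRIBE_FLAG(f, d) \
                if (flags & BTRFS_BLOCK_GROUP_##f) { \
                        bp += snprintf(bp, buf - bp + sizeof(buf), "|%s", d); \
                        flags &= ~BTRFS_BLOCK_GROUP_##f; \
                }
                DESCRIBE_FLAG(DATA, "data");
                /* ... the remaining flags, as in the patch ... */
                if (flags)
                        snprintf(bp, buf - bp + sizeof(buf), "|0x%llx", flags);
#undef DESCRIBE_FLAG
        }

        btrfs_info(fs_info, "relocating block group %llu flags %s",
                   block_group->key.objectid, buf + 1);
}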


Re: [PATCH 3/6] btrfs-progs: add OPEN_CTREE_INVALIDATE_FST flag

2016-11-14 Thread Omar Sandoval
On Mon, Nov 14, 2016 at 09:22:22AM +0800, Qu Wenruo wrote:
> 
> 
> At 11/14/2016 03:35 AM, Omar Sandoval wrote:
> > From: Omar Sandoval 
> > 
> > If this flag is passed to open_ctree(), we'll clear the
> > FREE_SPACE_TREE_VALID compat_ro bit. The kernel will then reconstruct
> > the free space tree the next time the filesystem is mounted.
> 
> This feature is really helpful.
> 
> Especially for users reporting failures to mount a v2 space_cache fs.
> 
> Despite a small nit commented below, feel free to add my Reviewed-by tag.
> 
> Reviewed-by: Qu Wenruo 
> > 
> > Signed-off-by: Omar Sandoval 
> > ---
> >  chunk-recover.c |  2 +-
> >  disk-io.c   | 28 ++--
> >  disk-io.h   |  8 +++-
> >  3 files changed, 26 insertions(+), 12 deletions(-)
> > 
> > diff --git a/chunk-recover.c b/chunk-recover.c
> > index e33ee8b..e6b26ac 100644
> > --- a/chunk-recover.c
> > +++ b/chunk-recover.c
> > @@ -1477,7 +1477,7 @@ open_ctree_with_broken_chunk(struct recover_control 
> > *rc)
> > 
> > memcpy(fs_info->fsid, _super->fsid, BTRFS_FSID_SIZE);
> > 
> > -   ret = btrfs_check_fs_compatibility(disk_super, 1);
> > +   ret = btrfs_check_fs_compatibility(disk_super, OPEN_CTREE_WRITES);
> > if (ret)
> > goto out_devices;
> > 
> > diff --git a/disk-io.c b/disk-io.c
> > index a576300..a55fcc1 100644
> > --- a/disk-io.c
> > +++ b/disk-io.c
> > @@ -904,7 +904,7 @@ free_all:
> > return NULL;
> >  }
> > 
> > -int btrfs_check_fs_compatibility(struct btrfs_super_block *sb, int 
> > writable)
> > +int btrfs_check_fs_compatibility(struct btrfs_super_block *sb, unsigned 
> > flags)
> 
> IIRC kernel checkpatch will warn on single "unsigned", as it's recommended
> to use "unsigned int".
> 
> Thanks,
> Qu

I was going for consistency with the rest of disk-io.c, but I can fix
it.
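
For reference, the adjusted prototype would then presumably read:

int btrfs_check_fs_compatibility(struct btrfs_super_block *sb,
                                 unsigned int flags);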

Thanks!
-- 
Omar


Re: [PATCH] ioctl.h: add missing kernel compatibility header for BUILD_ASSERT

2016-11-14 Thread David Sterba
On Mon, Nov 14, 2016 at 01:34:49PM +0100, David Sterba wrote:
> On Sat, Nov 12, 2016 at 10:14:49PM +, Sergei Trofimovich wrote:
> > > > Basically gcc is trying to tell us BUILD_ASSERT is not visible.
> > > > 
> > > > BUILD_ASSERT lives in kerncompat.h which this change adds.  
> > > 
> > > I think including the kerncompat.h is too intrusive here, I've fixed by
> > > providing an empty macro if it's not defined. I'll release 4.8.2 soon.
> > 
> > Apologies. I did not test your fix right afterwards. It seems the header
> > is now incomplete due to missing NULL (gcc-6):
> > 
> > btrfs-progs-v4.8.3 $ gcc -c ioctl.h -o /tmp/a.o
> > ioctl.h: In function 'btrfs_err_str':
> > ioctl.h:711:11: error: 'NULL' undeclared (first use in this function)
> > return NULL;
> >^~~~
> > ioctl.h:711:11: note: each undeclared identifier is reported only once for 
> > each function it appears in
> 
> The ioctl.h file can be included in both C and C++ code; I'd rather
> avoid having to ifdef the right definition of NULL, so s/NULL/0/ seems
> like the best fix to me.

Or #include <stddef.h>, now committed to devel.
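
For context, a minimal illustration of the two options discussed,
assuming btrfs_err_str() has roughly the shape implied by the gcc error
above (the exact body is not reproduced here):

#include <stddef.h>     /* defines NULL for both C and C++ */

static inline const char *btrfs_err_str(enum btrfs_err_code err_code)
{
        switch (err_code) {
        /* ... error-code cases ... */
        default:
                return NULL;    /* or plain 0, avoiding the include */
        }
}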


Re: how to understand "btrfs fi show" output? "No space left" issues

2016-11-14 Thread Johannes Hirte
On 2016 Sep 20, Peter Becker wrote:
> Data, RAID1: total=417.12GiB, used=131.33GiB
> 
> You have 417 GiB (total) - 131 GiB (used) in block groups that are only
> partially filled. You should balance your filesystem.
> 
> First you need some free space. You could remove some files / old
> snapshots etc., or add an empty USB stick with at least 4 GB to your
> btrfs pool (after balancing completes you can remove the stick from the
> pool).

He has plenty of space. What you're describing is the case where either
the data pool or the metadata pool is full while the other has enough
space, and nothing is left that could be allocated to the full pool. In
that case rebalancing would help. But in Tomasz' case there is enough
space in every pool, so the allocator should use it. This really sounds
like a bug.

> But first you should try to free empty data and metadata block groups:
> 
> btrfs balance start -musage=0 /mnt
> btrfs balance start -dusage=0 /mnt

Since kernel 3.18 this is done automatically.


regards,
  Johannes


Re: [Bug 186671] New: OOM on system with just rsync running 32GB of ram 30GB of pagecache

2016-11-14 Thread E V
System is an Intel dual-socket Xeon E5620, 7500/5520/5500/X58 ICH10
family according to lspci. Anyway, 4.8.4 OOM'd while I was gone. I'll
download the current 4.9-rc and reboot, but in the meantime here's the
xxd, vmstat & kern.log output:
8532039 
   9324 0100
   2226 0200
405 0300
 80 0400
 34 0500
 48 0600
 17 0700
 17 0800
 32 0900
 19 0a00
  1 0c00
  1 0d00
  1 0e00
 12 1000
  8 1100
 32 1200
 10 1300
  2 1400
 11 1500
 12 1600
  7 1700
  3 1800
  5 1900
  6 1a00
 11 1b00
 22 1c00
  3 1d00
 19 1e00
 21 1f00
 18 2000
 28 2100
 40 2200
 38 2300
 85 2400
 59 2500
  40520 81ff

/proc/vmstat:
nr_free_pages 60965
nr_zone_inactive_anon 4646
nr_zone_active_anon 3265
nr_zone_inactive_file 633882
nr_zone_active_file 7017458
nr_zone_unevictable 0
nr_zone_write_pending 0
nr_mlock 0
nr_slab_reclaimable 299205
nr_slab_unreclaimable 195497
nr_page_table_pages 935
nr_kernel_stack 4976
nr_bounce 0
numa_hit 3577063288
numa_miss 541393191
numa_foreign 541393191
numa_interleave 19415
numa_local 3577063288
numa_other 0
nr_free_cma 0
nr_inactive_anon 4646
nr_active_anon 3265
nr_inactive_file 633882
nr_active_file 7017458
nr_unevictable 0
nr_isolated_anon 0
nr_isolated_file 0
nr_pages_scanned 0
workingset_refault 42685891
workingset_activate 15247281
workingset_nodereclaim 26375216
nr_anon_pages 5067
nr_mapped 5630
nr_file_pages 7654746
nr_dirty 0
nr_writeback 0
nr_writeback_temp 0
nr_shmem 2504
nr_shmem_hugepages 0
nr_shmem_pmdmapped 0
nr_anon_transparent_hugepages 0
nr_unstable 0
nr_vmscan_write 5243750485
nr_vmscan_immediate_reclaim 4207633857
nr_dirtied 1839143430
nr_written 1832626107
nr_dirty_threshold 1147728
nr_dirty_background_threshold 151410
pgpgin 166731189
pgpgout 7328142335
pswpin 98608
pswpout 117794
pgalloc_dma 29504
pgalloc_dma32 1006726216
pgalloc_normal 5275218188
pgalloc_movable 0
allocstall_dma 0
allocstall_dma32 0
allocstall_normal 36461
allocstall_movable 5867
pgskip_dma 0
pgskip_dma32 0
pgskip_normal 6417890
pgskip_movable 0
pgfree 6309223401
pgactivate 35076483
pgdeactivate 63556974
pgfault 35753842
pgmajfault 69126
pglazyfreed 0
pgrefill 70008598
pgsteal_kswapd 3567289713
pgsteal_direct 5878057
pgscan_kswapd 9059309872
pgscan_direct 4239367903
pgscan_direct_throttle 0
zone_reclaim_failed 0
pginodesteal 102916
slabs_scanned 460790262
kswapd_inodesteal 9130243
kswapd_low_wmark_hit_quickly 10634373
kswapd_high_wmark_hit_quickly 7348173
pageoutrun 18349115
pgrotated 16291322
drop_pagecache 0
drop_slab 0
pgmigrate_success 18912908
pgmigrate_fail 63382146
compact_migrate_scanned 2986269789
compact_free_scanned 190451505123
compact_isolated 109549437
compact_stall 3544
compact_fail 8
compact_success 3536
compact_daemon_wake 1403515
htlb_buddy_alloc_success 0
htlb_buddy_alloc_fail 0
unevictable_pgs_culled 12473
unevictable_pgs_scanned 0
unevictable_pgs_rescued 11979
unevictable_pgs_mlocked 14556
unevictable_pgs_munlocked 14556
unevictable_pgs_cleared 0
unevictable_pgs_stranded 0
thp_fault_alloc 0
thp_fault_fallback 0
thp_collapse_alloc 0
thp_collapse_alloc_failed 0
thp_file_alloc 0
thp_file_mapped 0
thp_split_page 0
thp_split_page_failed 0
thp_deferred_split_page 0
thp_split_pmd 0
thp_zero_page_alloc 0
thp_zero_page_alloc_failed 0

kern.log OOM message:
[737778.724194] snmpd invoked oom-killer:
gfp_mask=0x24200ca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=0
[737778.724246] snmpd cpuset=/ mems_allowed=0-1
[737778.724278] CPU: 15 PID: 2976 Comm: snmpd Tainted: GW I 4.8.4 #1
[737778.724352]   81292069 88041e043c48
88041e043c48
[737778.724403]  8118d1f6 88041dd70fc0 88041e043c48
0136236f
[737778.724454]  8170e11e 0001 8112a700
030f
[737778.724505] Call Trace:
[737778.724533]  [] ? dump_stack+0x46/0x5d
[737778.727077]  [] ? dump_header.isra.16+0x56/0x185
[737778.727108]  [] ? oom_kill_process+0x210/0x3c0
[737778.727136]  [] ? out_of_memory+0x34b/0x420
[737778.727165]  [] ? __alloc_pages_nodemask+0xd9a/0xde0
[737778.727195]  [] ? alloc_pages_vma+0xc1/0x240
[737778.727223]  [] ? pagecache_get_page+0x22/0x230
[737778.727253]  [] ? __read_swap_cache_async+0x104/0x180
[737778.727282]  [] ? read_swap_cache_async+0xf/0x30
[737778.727311]  [] ? swapin_readahead+0xec/0x1a0
[737778.727340]  [] ? do_swap_page+0x420/0x5c0
[737778.727369]  [] ? SYSC_recvfrom+0xa8/0x110
[737778.727397]  [] ? handle_mm_fault+0x629/0xe30
[737778.727426]  [] ? 

Re: [PULL] Btrfs fixes for 4.9-rc5

2016-11-14 Thread David Sterba
On Fri, Nov 11, 2016 at 03:45:10PM -0500, Chris Mason wrote:
> On 11/10/2016 10:00 AM, David Sterba wrote:
> > Hi,
> >
> > two minor error-handling fixes and one fix for the balance status item
> > that goes back to 4.4.
> >
> > The branch continues from my last pull that went to Linus' tree, so it would
> > be a good idea to do the same as before. I've added a signed tag for branch.
> 
> Looking at this again before sending off:
> 
> btrfs: remove redundant check of btrfs_iget return value
> 
> Is really a cleanup and we should wait for the merge window.  Since this 
> means rebasing the pull, I'm happy to send the others on Monday.

Technically it is a cleanup, but I look at it more from the correctness
perspective and based on the feeling that "this looks relevant for an rc".
But postponing it to 4.10 is fine for me.


reproducible oops in btrfs/130 with latest mainline

2016-11-14 Thread Christoph Hellwig
btrfs/130   [  384.645337] run fstests btrfs/130 at 2016-11-14
12:33:26
[  384.827333] BTRFS: device fsid bf118b00-e2e0-4a96-a177-765789170093 devid 1 
transid 3 /dev/vdc
[  384.851643] BTRFS info (device vdc): disk space caching is enabled
[  384.852113] BTRFS info (device vdc): flagging fs with big metadata feature
[  384.857043] BTRFS info (device vdc): creating UUID tree
[  384.988347] BTRFS: device fsid 3b92b8c1-295d-4099-8623-d71a3cb270f8 devid 1 
transid 3 /dev/vdc
[  385.001946] BTRFS info (device vdc): disk space caching is enabled
[  385.002846] BTRFS info (device vdc): flagging fs with big metadata
feature
[  385.008870] BTRFS info (device vdc): creating UUID tree
[  416.318581] NMI watchdog: BUG: soft lockup - CPU#3 stuck for 22s! 
[btrfs:12782]
[  416.319139] Modules linked in:
[  416.319366] CPU: 3 PID: 12782 Comm: btrfs Not tainted 4.9.0-rc1 #826
[  416.319789] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
1.7.5-20140531_083030-gandalf 04/01/2014
[  416.320466] task: 8801355a4140 task.stack: c90a4000
[  416.320864] RIP: 0010:[]  [] 
find_parent_nodes+0xb7d/0x1530
[  416.321455] RSP: 0018:c90a79b0  EFLAGS: 0286
[  416.321811] RAX: 88012de45640 RBX:  RCX: c90a7a28
[  416.322285] RDX: 88012de45660 RSI: 01ca8000 RDI: 88013b803e40
[  416.322759] RBP: c90a7ab0 R08: 02400040 R09: 88010077a478
[  416.323317] R10: 880127652f70 R11: 880127652f08 R12: 8800
[  416.323791] R13: 6db6db6db6db6db7 R14: 8801295093b0 R15: 
[  416.324262] FS:  7f83ef8398c0() GS:88013fd8() 
knlGS:
[  416.324795] CS:  0010 DS:  ES:  CR0: 80050033
[  416.325176] CR2: 7f83ee4dbe38 CR3: 000136b56000 CR4: 06e0
[  416.325649] DR0:  DR1:  DR2: 
[  416.326120] DR3:  DR6: fffe0ff0 DR7: 0400
[  416.326590] Stack:
[  416.326730]   880102400040 88012a34 
1063
[  416.327257]  00010001 88012dd0e800 88012a34 
0001
[  416.327780]  88012a34  88013b803e40 
00c4
[  416.328304] Call Trace:
[  416.328475]  [] ? changed_cb+0xb70/0xb70
[  416.328841]  [] iterate_extent_inodes+0xe7/0x270
[  416.329251]  [] ? release_extent_buffer+0x26/0xc0
[  416.329657]  [] ? free_extent_buffer+0x46/0x80
[  416.330068]  [] process_extent+0x69f/0xb00
[  416.330452]  [] changed_cb+0x2cb/0xb70
[  416.330811]  [] ? read_extent_buffer+0xe2/0x140
[  416.331380]  [] ? btrfs_search_slot_for_read+0xc2/0x1b0
[  416.331905]  [] btrfs_ioctl_send+0x1187/0x12c0
[  416.332309]  [] ? kmem_cache_alloc+0x8a/0x160
[  416.332704]  [] btrfs_ioctl+0x7dc/0x21f0
[  416.333071]  [] ? flat_send_IPI_mask+0xc/0x10
[  416.333465]  [] ? default_send_IPI_single+0x2d/0x30
[  416.333893]  [] ? native_smp_send_reschedule+0x27/0x40
[  416.334340]  [] ? resched_curr+0xad/0xb0
[  416.334706]  [] do_vfs_ioctl+0x8b/0x5b0
[  416.335065]  [] ? _do_fork+0x132/0x390
[  416.335423]  [] SyS_ioctl+0x3c/0x70
[  416.335763]  [] entry_SYSCALL_64_fastpath+0x1a/0xa9



Re: [PATCH] ioctl.h: add missing kernel compatibility header for BUILD_ASSERT

2016-11-14 Thread David Sterba
On Sat, Nov 12, 2016 at 10:14:49PM +, Sergei Trofimovich wrote:
> > > Basically gcc is trying to tell us BUILD_ASSERT is not visible.
> > > 
> > > BUILD_ASSERT lives in kerncompat.h which this change adds.  
> > 
> > I think including the kerncompat.h is too intrusive here, I've fixed by
> > providing an empty macro if it's not defined. I'll release 4.8.2 soon.
> 
> Apologies. I did not test your fix right afterwards. It seems the header
> is now incomplete due to missing NULL (gcc-6):
> 
> btrfs-progs-v4.8.3 $ gcc -c ioctl.h -o /tmp/a.o
> ioctl.h: In function 'btrfs_err_str':
> ioctl.h:711:11: error: 'NULL' undeclared (first use in this function)
> return NULL;
>^~~~
> ioctl.h:711:11: note: each undeclared identifier is reported only once for 
> each function it appears in

The ioctl.h file can be included in both C and C++ code; I'd rather
avoid having to ifdef the right definition of NULL, so s/NULL/0/ seems
like the best fix to me.

Compiling with g++ shows other type errors like

ioctl.h:709:5: warning: ISO C++ forbids converting a string constant to ‘char*’ 
[-Wwrite-strings]
 "in progress";

so we'd have to change the type to 'const char*'. I'm not sure how much
breakage this could cause in programs using the header.


Re: Btrfs progs release 4.8.3

2016-11-14 Thread David Sterba
On Sun, Nov 13, 2016 at 02:28:02PM +0100, Henk Slager wrote:
> On Fri, Nov 11, 2016 at 4:38 PM, David Sterba  wrote:
> > Hi,
> >
> > btrfs-progs version 4.8.3 have been released. Handful of fixes and lots of
> > cleanups.
> >
> > Changes:
> >   * check:
> > * support for clearing space cache (v1)
> > * size reduction of inode backref structure
> >   * send:
> > * fix handling of multiple snapshots (-p and -c options)
> > * transfer buffer increased (should reduce number of context switches)
> > * reuse existing file for output (-f), eg. when root cannot create 
> > files (NFS)
> >   * dump-tree:
> > * print missing items for various structures
> > * new: dev stats, balance status item
> > * sync key names with kernel (the persistent items)
> >   * subvol show: now able to print the toplevel subvolume -- the creation 
> > time
> > might be wrong though
> >   * mkfs:
> > * store the creation time of toplevel root inode
> 
> It looks like commit 5c4d53450b2c6ff7169c99f9158c14ae96b7b0a8
>  ("btrfs-progs: mkfs: store creation time of the toplevel subvolume")
> is not enough to display the creation time of a newly created fs with
> tools v4.8.3, or am I missing something?

This is known and mentioned on the changelog entry above mkfs.

> With kernel 4.8.6-2-default as well as 4.9.0-rc4-1-default:
> 
> # /net/src/btrfs-progs/mkfs.btrfs -L ctimetest -m single /dev/loop0
> btrfs-progs v4.8.3
> See http://btrfs.wiki.kernel.org for more information.
> 
> Performing full device TRIM (100.00GiB) ...
> Label:  ctimetest
> UUID:   d65486f0-368b-4b2a-962b-176cd945feb5
> Node size:  16384
> Sector size:4096
> Filesystem size:100.00GiB
> Block group profiles:
>   Data: single8.00MiB
>   Metadata: single8.00MiB
>   System:   single4.00MiB
> SSD detected:   no
> Incompat features:  extref, skinny-metadata
> Number of devices:  1
> Devices:
>IDSIZE  PATH
> 1   100.00GiB  /dev/loop0
> 
> # mount /dev/loop0 /mnt
> # cd /mnt
> # ls > test
> # sync
> # /net/src/btrfs-progs/btrfs sub sh .
> /mnt
> Name:   
> UUID:   -
> Parent UUID:-
> Received UUID:  -
> Creation time:  -
> Subvolume ID:   5
> Generation: 9
> Gen at creation:0
> Parent ID:  0
> Top level ID:   0
> Flags:  -
> Snapshot(s):
> 
> I noticed that   btrfs_set_stack_timespec_sec(_item.otime, now);
> is called twice during mkfs.btrfs, but during btrfs sub sh
> get_ri.otime is 0 just before it is formatted/printed.

Your finding is right, I saw the same and would have fixed already, but
it turned out to be more than a simple fix. The on-disk items are
connected to in-memory ones, and the corresponding root item does not
keep them in sync. Ie. setting the otime in one place does not propagate
to the point where 'subvol show' reads it.

> My idea was to
> patch the code (kernel and/or progs) such that I can also put a time
> in some existing filesystems.

Well, it's possible to write a one-shot tool to set the creation time, if
it's really desired. It would be an arbitrary value set by the user,
because the information about otime cannot be simply derived from the
existing timestamps.


Re: [PATCH v2 0/5] Make btrfs-progs really compatible with any kernel version

2016-11-14 Thread David Sterba
On Tue, Nov 08, 2016 at 09:14:39PM +0800, Anand Jain wrote:
>   This patch isn't integrated, any idea why?

Because it does not cover all use cases. I've committed the preparatory
bits from the branch where I've been working on that feature, so we can
actually get to the hard parts.


[PATCH v14.1 13/16] btrfs: Introduce DEDUPE reserve type to fix false enospc for in-band dedupe

2016-11-14 Thread Qu Wenruo
From: Wang Xiaoguang 

We can trigger a false ENOSPC error when in-band dedupe is enabled.
This happens for the same reason as the compression false ENOSPC error:
a difference in max extent size can lead to too much space being
reserved for metadata.

Since we already have type-based reserve facilities, add a DEDUPE
reserve type to fix such false ENOSPC errors.

Signed-off-by: Wang Xiaoguang 
Signed-off-by: Qu Wenruo 
---
 fs/btrfs/ctree.h   |  4 ++-
 fs/btrfs/dedupe.h  | 18 +
 fs/btrfs/extent-tree.c | 14 +-
 fs/btrfs/extent_io.c   |  9 ---
 fs/btrfs/extent_io.h   |  1 +
 fs/btrfs/file.c|  3 +++
 fs/btrfs/inode.c   | 72 ++
 fs/btrfs/ioctl.c   |  2 ++
 fs/btrfs/relocation.c  |  2 ++
 9 files changed, 91 insertions(+), 34 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index e21b315..d09d924 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -110,9 +110,11 @@ static const int btrfs_csum_sizes[] = { 4 };
 enum btrfs_metadata_reserve_type {
BTRFS_RESERVE_NORMAL,
BTRFS_RESERVE_COMPRESS,
+   BTRFS_RESERVE_DEDUPE,
 };
 
-u64 btrfs_max_extent_size(enum btrfs_metadata_reserve_type reserve_type);
+u64 btrfs_max_extent_size(struct inode *inode,
+ enum btrfs_metadata_reserve_type reserve_type);
 int inode_need_compress(struct inode *inode);
 
 struct btrfs_mapping_tree {
diff --git a/fs/btrfs/dedupe.h b/fs/btrfs/dedupe.h
index 8311ee1..50977c4 100644
--- a/fs/btrfs/dedupe.h
+++ b/fs/btrfs/dedupe.h
@@ -22,6 +22,7 @@
 #include 
 #include 
 #include 
+#include "btrfs_inode.h"
 
 static const int btrfs_hash_sizes[] = { 32 };
 
@@ -63,6 +64,23 @@ struct btrfs_dedupe_info {
 
 struct btrfs_trans_handle;
 
+static inline u64 btrfs_dedupe_blocksize(struct inode *inode)
+{
+   struct btrfs_fs_info *fs_info = BTRFS_I(inode)->root->fs_info;
+
+   return fs_info->dedupe_info->blocksize;
+}
+
+static inline int inode_need_dedupe(struct inode *inode)
+{
+   struct btrfs_fs_info *fs_info = BTRFS_I(inode)->root->fs_info;
+
+   if (!fs_info->dedupe_enabled)
+   return 0;
+
+   return 1;
+}
+
 static inline int btrfs_dedupe_hash_hit(struct btrfs_dedupe_hash *hash)
 {
return (hash && hash->bytenr);
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 3c82730..77a7d41 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -5862,7 +5862,7 @@ static unsigned drop_outstanding_extent(struct inode 
*inode, u64 num_bytes,
unsigned drop_inode_space = 0;
unsigned dropped_extents = 0;
unsigned num_extents = 0;
-   u64 max_extent_size = btrfs_max_extent_size(reserve_type);
+   u64 max_extent_size = btrfs_max_extent_size(inode, reserve_type);
 
num_extents = (unsigned)div64_u64(num_bytes + max_extent_size - 1,
  max_extent_size);
@@ -5935,15 +5935,17 @@ static u64 calc_csum_metadata_size(struct inode *inode, 
u64 num_bytes,
return btrfs_calc_trans_metadata_size(root, old_csums - num_csums);
 }
 
-u64 btrfs_max_extent_size(enum btrfs_metadata_reserve_type reserve_type)
+u64 btrfs_max_extent_size(struct inode *inode,
+ enum btrfs_metadata_reserve_type reserve_type)
 {
if (reserve_type == BTRFS_RESERVE_NORMAL)
return BTRFS_MAX_EXTENT_SIZE;
else if (reserve_type == BTRFS_RESERVE_COMPRESS)
return SZ_128K;
-
-   ASSERT(0);
-   return BTRFS_MAX_EXTENT_SIZE;
+   else if (reserve_type == BTRFS_RESERVE_DEDUPE)
+   return btrfs_dedupe_blocksize(inode);
+   else
+   return BTRFS_MAX_EXTENT_SIZE;
 }
 
 int btrfs_delalloc_reserve_metadata(struct inode *inode, u64 num_bytes,
@@ -5958,7 +5960,7 @@ int btrfs_delalloc_reserve_metadata(struct inode *inode, 
u64 num_bytes,
int ret = 0;
bool delalloc_lock = true;
u64 to_free = 0;
-   u64 max_extent_size = btrfs_max_extent_size(reserve_type);
+   u64 max_extent_size = btrfs_max_extent_size(inode, reserve_type);
unsigned dropped;
bool release_extra = false;
 
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 854379d..d59e91d 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -603,7 +603,7 @@ static int __clear_extent_bit(struct extent_io_tree *tree, 
u64 start, u64 end,
btrfs_debug_check_extent_io_range(tree, start, end);
 
if (bits & EXTENT_DELALLOC)
-   bits |= EXTENT_NORESERVE | EXTENT_COMPRESS;
+   bits |= EXTENT_NORESERVE | EXTENT_COMPRESS | EXTENT_DEDUPE;
 
if (delete)
bits |= ~EXTENT_CTLBITS;
@@ -746,7 +746,7 @@ static void adjust_one_outstanding_extent(struct inode 
*inode, u64 len,
enum btrfs_metadata_reserve_type reserve_type)
 {
unsigned old_extents, new_extents;
-  

[PATCH v14.1 15/16] btrfs: relocation: Enhance error handling to avoid BUG_ON

2016-11-14 Thread Qu Wenruo
Since the introduction of the btrfs dedupe tree, it's possible for
balance to race with dedupe disabling.

When this happens, dedupe_enabled will make btrfs_get_fs_root() return
ERR_PTR(-ENOENT).
But due to a bug in the error handling branch, when this happens
backref_cache->nr_nodes is increased while the node is neither added to
the backref_cache nor is nr_nodes decreased, causing a BUG_ON() in
backref_cache_cleanup():

[ 2611.668810] [ cut here ]
[ 2611.669946] kernel BUG at
/home/sat/ktest/linux/fs/btrfs/relocation.c:243!
[ 2611.670572] invalid opcode:  [#1] SMP
[ 2611.686797] Call Trace:
[ 2611.687034]  []
btrfs_relocate_block_group+0x1b3/0x290 [btrfs]
[ 2611.687706]  []
btrfs_relocate_chunk.isra.40+0x47/0xd0 [btrfs]
[ 2611.688385]  [] btrfs_balance+0xb22/0x11e0 [btrfs]
[ 2611.688966]  [] btrfs_ioctl_balance+0x391/0x3a0
[btrfs]
[ 2611.689587]  [] btrfs_ioctl+0x1650/0x2290 [btrfs]
[ 2611.690145]  [] ? lru_cache_add+0x3a/0x80
[ 2611.690647]  [] ?
lru_cache_add_active_or_unevictable+0x4c/0xc0
[ 2611.691310]  [] ? handle_mm_fault+0xcd4/0x17f0
[ 2611.691842]  [] ? cp_new_stat+0x153/0x180
[ 2611.692342]  [] ? __vma_link_rb+0xfd/0x110
[ 2611.692842]  [] ? vma_link+0xb9/0xc0
[ 2611.693303]  [] do_vfs_ioctl+0xa1/0x5a0
[ 2611.693781]  [] ? __do_page_fault+0x1b4/0x400
[ 2611.694310]  [] SyS_ioctl+0x41/0x70
[ 2611.694758]  [] entry_SYSCALL_64_fastpath+0x12/0x71
[ 2611.695331] Code: ff 48 8b 45 bf 49 83 af a8 05 00 00 01 49 89 87 a0
05 00 00 e9 2e fd ff ff b8 f4 ff ff ff e9 e4 fb ff ff 0f 0b 0f 0b 0f 0b
0f 0b <0f> 0b 0f 0b 41 89 c6 e9 b8 fb ff ff e8 9e a6 e8 e0 4c 89 e7 44
[ 2611.697870] RIP  []
relocate_block_group+0x741/0x7a0 [btrfs]
[ 2611.698818]  RSP 

This patch calls remove_backref_node() in the error handling branch,
catches the returned -ENOENT in relocate_tree_blocks(), and continues
balancing.

Reported-by: Satoru Takeuchi 
Signed-off-by: Qu Wenruo 
---
 fs/btrfs/relocation.c | 22 +-
 1 file changed, 17 insertions(+), 5 deletions(-)

diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
index 9881a78..8b6b695 100644
--- a/fs/btrfs/relocation.c
+++ b/fs/btrfs/relocation.c
@@ -889,6 +889,13 @@ again:
root = read_fs_root(rc->extent_root->fs_info, key.offset);
if (IS_ERR(root)) {
err = PTR_ERR(root);
+   /*
+* Don't forget to cleanup current node.
+* As it may not be added to backref_cache but nr_node
+* increased.
+* This will cause BUG_ON() in backref_cache_cleanup().
+*/
+   remove_backref_node(>backref_cache, cur);
goto out;
}
 
@@ -3022,14 +3029,21 @@ int relocate_tree_blocks(struct btrfs_trans_handle 
*trans,
}
 
rb_node = rb_first(blocks);
-   while (rb_node) {
+   for (rb_node = rb_first(blocks); rb_node; rb_node = rb_next(rb_node)) {
block = rb_entry(rb_node, struct tree_block, rb_node);
 
node = build_backref_tree(rc, >key,
  block->level, block->bytenr);
if (IS_ERR(node)) {
+   /*
+* The root(dedupe tree yet) of the tree block is
+* going to be freed and can't be reached.
+* Just skip it and continue balancing.
+*/
+   if (PTR_ERR(node) == -ENOENT)
+   continue;
err = PTR_ERR(node);
-   goto out;
+   break;
}
 
ret = relocate_tree_block(trans, rc, node, >key,
@@ -3037,11 +3051,9 @@ int relocate_tree_blocks(struct btrfs_trans_handle 
*trans,
if (ret < 0) {
if (ret != -EAGAIN || rb_node == rb_first(blocks))
err = ret;
-   goto out;
+   break;
}
-   rb_node = rb_next(rb_node);
}
-out:
err = finish_pending_nodes(trans, rc, path, err);
 
 out_free_path:
-- 
2.10.2





[PATCH v14.1 16/16] btrfs: dedupe: Introduce new reconfigure ioctl

2016-11-14 Thread Qu Wenruo
Introduce a new reconfigure ioctl, and a new FORCE flag for the in-band
dedupe ioctls.

Now the dedupe enable and reconfigure ioctls are stateful.


|---------------|---------|-------------|
| Current state |  Ioctl  | Next state  |
|---------------|---------|-------------|
| Disabled      | enable  | Enabled     |
| Enabled       | enable  | Not allowed |
| Enabled       | reconf  | Enabled     |
| Enabled       | disable | Disabled    |
| Disabled      | disable | Disabled    |
| Disabled      | reconf  | Not allowed |
|---------------|---------|-------------|

(While disable is always stateless.)

For those who prefer stateless ioctls (myself, for example), a new FORCE
flag is introduced.

In FORCE mode, enable/disable is completely stateless.

|---------------|---------|------------|
| Current state |  Ioctl  | Next state |
|---------------|---------|------------|
| Disabled      | enable  | Enabled    |
| Enabled       | enable  | Enabled    |
| Enabled       | disable | Disabled   |
| Disabled      | disable | Disabled   |
|---------------|---------|------------|


Also, the reconfigure ioctl will only modify the specified fields;
unspecified fields keep their current values. With enable, by contrast,
unspecified fields are reset to their default values.

For example:
 # btrfs dedupe enable --block-size 64k /mnt
 # btrfs dedupe reconfigure --limit-hash 1m /mnt
This leads to:
 dedupe blocksize: 64K
 dedupe hash limit nr: 1m

While for enable:
 # btrfs dedupe enable --force --block-size 64k /mnt
 # btrfs dedupe enable --force --limit-hash 1m /mnt
This resets the blocksize to its default value:
 dedupe blocksize: 128K << reset
 dedupe hash limit nr: 1m

Suggested-by: David Sterba 
Signed-off-by: Qu Wenruo 
---
 fs/btrfs/dedupe.c  | 131 -
 fs/btrfs/dedupe.h  |  13 +
 fs/btrfs/ioctl.c   |  13 +
 include/uapi/linux/btrfs.h |  11 +++-
 4 files changed, 143 insertions(+), 25 deletions(-)

diff --git a/fs/btrfs/dedupe.c b/fs/btrfs/dedupe.c
index 37b5a05..5fd4a9c 100644
--- a/fs/btrfs/dedupe.c
+++ b/fs/btrfs/dedupe.c
@@ -41,6 +41,40 @@ static inline struct inmem_hash *inmem_alloc_hash(u16 algo)
GFP_NOFS);
 }
 
+/*
+ * Copy from current dedupe info to fill dargs.
+ * For reconf case, only fill members which is uninitialized.
+ */
+static void get_dedupe_status(struct btrfs_dedupe_info *dedupe_info,
+ struct btrfs_ioctl_dedupe_args *dargs)
+{
+   int reconf = (dargs->cmd == BTRFS_DEDUPE_CTL_RECONF);
+
+   dargs->status = 1;
+
+   if (!reconf || (reconf && dargs->blocksize == (u64)-1))
+   dargs->blocksize = dedupe_info->blocksize;
+   if (!reconf || (reconf && dargs->backend == (u16)-1))
+   dargs->backend = dedupe_info->backend;
+   if (!reconf || (reconf && dargs->hash_algo == (u16)-1))
+   dargs->hash_algo = dedupe_info->hash_algo;
+
+   /*
+* For re-configure case, if not modifying limit,
+* their limit will be set to 0, unlike other fields
+*/
+   if (!reconf || !(dargs->limit_nr || dargs->limit_mem)) {
+   dargs->limit_nr = dedupe_info->limit_nr;
+   dargs->limit_mem = dedupe_info->limit_nr *
+   (sizeof(struct inmem_hash) +
+btrfs_hash_sizes[dedupe_info->hash_algo]);
+   }
+
+   /* current_nr doesn't makes sense for reconfig case */
+   if (!reconf)
+   dargs->current_nr = dedupe_info->current_nr;
+}
+
 void btrfs_dedupe_status(struct btrfs_fs_info *fs_info,
 struct btrfs_ioctl_dedupe_args *dargs)
 {
@@ -57,15 +91,7 @@ void btrfs_dedupe_status(struct btrfs_fs_info *fs_info,
return;
}
mutex_lock(_info->lock);
-   dargs->status = 1;
-   dargs->blocksize = dedupe_info->blocksize;
-   dargs->backend = dedupe_info->backend;
-   dargs->hash_algo = dedupe_info->hash_algo;
-   dargs->limit_nr = dedupe_info->limit_nr;
-   dargs->limit_mem = dedupe_info->limit_nr *
-   (sizeof(struct inmem_hash) +
-btrfs_hash_sizes[dedupe_info->hash_algo]);
-   dargs->current_nr = dedupe_info->current_nr;
+   get_dedupe_status(dedupe_info, dargs);
mutex_unlock(_info->lock);
memset(dargs->__unused, -1, sizeof(dargs->__unused));
 }
@@ -114,17 +140,50 @@ static int init_dedupe_info(struct btrfs_dedupe_info 
**ret_info,
 static int check_dedupe_parameter(struct btrfs_fs_info *fs_info,
  struct btrfs_ioctl_dedupe_args *dargs)
 {
-   u64 blocksize = dargs->blocksize;
-   u64 limit_nr = dargs->limit_nr;
-   u64 limit_mem = dargs->limit_mem;
-   u16 hash_algo = dargs->hash_algo;
-   u8 backend = dargs->backend;
+   struct btrfs_dedupe_info *dedupe_info = fs_info->dedupe_info;
+
+   u64 blocksize;
+   u64 limit_nr;
+   u64 

[PATCH v14.1 14/16] btrfs: dedupe: Add ioctl for inband deduplication

2016-11-14 Thread Qu Wenruo
From: Wang Xiaoguang 

Add an ioctl interface for in-band deduplication, which includes:
1) enable
2) disable
3) status

Also add a pseudo RO compat flag to indicate that btrfs now supports
in-band dedupe.
However, there is no on-disk format change; it's just a pseudo RO
compat flag.

All these ioctl interfaces are stateless, which means callers don't need
to care about the previous dedupe state before calling them, only about
the final desired state.

For example, if a user wants to enable dedupe with a specified block size
and limit, they just fill the ioctl structure and call the enable ioctl.
There is no need to check whether dedupe is already running.

These ioctls will handle things like re-configure or disable quite well.

Also, for invalid parameters, the enable ioctl interface will set the
field of the first encountered invalid parameter to (-1) to inform the
caller, while for limit_nr/limit_mem the value will be (0).

Signed-off-by: Qu Wenruo 
Signed-off-by: Wang Xiaoguang 
---
 fs/btrfs/dedupe.c  | 50 ++
 fs/btrfs/dedupe.h  | 17 
 fs/btrfs/disk-io.c |  3 +++
 fs/btrfs/ioctl.c   | 67 ++
 fs/btrfs/sysfs.c   |  2 ++
 include/uapi/linux/btrfs.h | 12 -
 6 files changed, 145 insertions(+), 6 deletions(-)

diff --git a/fs/btrfs/dedupe.c b/fs/btrfs/dedupe.c
index d0d2f8a..37b5a05 100644
--- a/fs/btrfs/dedupe.c
+++ b/fs/btrfs/dedupe.c
@@ -41,6 +41,35 @@ static inline struct inmem_hash *inmem_alloc_hash(u16 algo)
GFP_NOFS);
 }
 
+void btrfs_dedupe_status(struct btrfs_fs_info *fs_info,
+struct btrfs_ioctl_dedupe_args *dargs)
+{
+   struct btrfs_dedupe_info *dedupe_info = fs_info->dedupe_info;
+
+   if (!fs_info->dedupe_enabled || !dedupe_info) {
+   dargs->status = 0;
+   dargs->blocksize = 0;
+   dargs->backend = 0;
+   dargs->hash_algo = 0;
+   dargs->limit_nr = 0;
+   dargs->current_nr = 0;
+   memset(dargs->__unused, -1, sizeof(dargs->__unused));
+   return;
+   }
+   mutex_lock(_info->lock);
+   dargs->status = 1;
+   dargs->blocksize = dedupe_info->blocksize;
+   dargs->backend = dedupe_info->backend;
+   dargs->hash_algo = dedupe_info->hash_algo;
+   dargs->limit_nr = dedupe_info->limit_nr;
+   dargs->limit_mem = dedupe_info->limit_nr *
+   (sizeof(struct inmem_hash) +
+btrfs_hash_sizes[dedupe_info->hash_algo]);
+   dargs->current_nr = dedupe_info->current_nr;
+   mutex_unlock(_info->lock);
+   memset(dargs->__unused, -1, sizeof(dargs->__unused));
+}
+
 static int init_dedupe_info(struct btrfs_dedupe_info **ret_info,
struct btrfs_ioctl_dedupe_args *dargs)
 {
@@ -420,6 +449,27 @@ static void unblock_all_writers(struct btrfs_fs_info 
*fs_info)
percpu_up_write(sb->s_writers.rw_sem + SB_FREEZE_WRITE - 1);
 }
 
+int btrfs_dedupe_cleanup(struct btrfs_fs_info *fs_info)
+{
+   struct btrfs_dedupe_info *dedupe_info;
+
+   fs_info->dedupe_enabled = 0;
+   /* same as disable */
+   smp_wmb();
+   dedupe_info = fs_info->dedupe_info;
+   fs_info->dedupe_info = NULL;
+
+   if (!dedupe_info)
+   return 0;
+
+   if (dedupe_info->backend == BTRFS_DEDUPE_BACKEND_INMEMORY)
+   inmem_destroy(dedupe_info);
+
+   crypto_free_shash(dedupe_info->dedupe_driver);
+   kfree(dedupe_info);
+   return 0;
+}
+
 int btrfs_dedupe_disable(struct btrfs_fs_info *fs_info)
 {
struct btrfs_dedupe_info *dedupe_info;
diff --git a/fs/btrfs/dedupe.h b/fs/btrfs/dedupe.h
index 50977c4..3b15592 100644
--- a/fs/btrfs/dedupe.h
+++ b/fs/btrfs/dedupe.h
@@ -109,6 +109,15 @@ static inline struct btrfs_dedupe_hash 
*btrfs_dedupe_alloc_hash(u16 algo)
 int btrfs_dedupe_enable(struct btrfs_fs_info *fs_info,
struct btrfs_ioctl_dedupe_args *dargs);
 
+
+ /*
+ * Get inband dedupe info
+ * Since it needs to access different backends' hash size, which
+ * is not exported, we need such simple function.
+ */
+void btrfs_dedupe_status(struct btrfs_fs_info *fs_info,
+struct btrfs_ioctl_dedupe_args *dargs);
+
 /*
  * Disable dedupe and invalidate all its dedupe data.
  * Called at dedupe disable time.
@@ -120,12 +129,10 @@ int btrfs_dedupe_enable(struct btrfs_fs_info *fs_info,
 int btrfs_dedupe_disable(struct btrfs_fs_info *fs_info);
 
 /*
- * Get current dedupe status.
- * Return 0 for success
- * No possible error yet
+ * Cleanup current btrfs_dedupe_info
+ * Called in umount time
  */
-void btrfs_dedupe_status(struct btrfs_fs_info *fs_info,
-struct btrfs_ioctl_dedupe_args *dargs);
+int btrfs_dedupe_cleanup(struct btrfs_fs_info *fs_info);
 
 /*
  * Calculate hash for dedupe.
diff --git 

[PATCH v14.1 11/16] btrfs: ordered-extent: Add support for dedupe

2016-11-14 Thread Qu Wenruo
From: Wang Xiaoguang 

Add ordered-extent support for dedupe.

Note that the current ordered-extent support only covers non-compressed
source extents; support for compressed source extents will be added
later.

Signed-off-by: Qu Wenruo 
Signed-off-by: Wang Xiaoguang 
Reviewed-by: Josef Bacik 
---
 fs/btrfs/ordered-data.c | 46 ++
 fs/btrfs/ordered-data.h | 13 +
 2 files changed, 55 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c
index b2d1e95..dc989af 100644
--- a/fs/btrfs/ordered-data.c
+++ b/fs/btrfs/ordered-data.c
@@ -26,6 +26,7 @@
 #include "extent_io.h"
 #include "disk-io.h"
 #include "compression.h"
+#include "dedupe.h"
 
 static struct kmem_cache *btrfs_ordered_extent_cache;
 
@@ -184,7 +185,8 @@ static inline struct rb_node *tree_search(struct 
btrfs_ordered_inode_tree *tree,
  */
 static int __btrfs_add_ordered_extent(struct inode *inode, u64 file_offset,
  u64 start, u64 len, u64 disk_len,
- int type, int dio, int compress_type)
+ int type, int dio, int compress_type,
+ struct btrfs_dedupe_hash *hash)
 {
struct btrfs_root *root = BTRFS_I(inode)->root;
struct btrfs_ordered_inode_tree *tree;
@@ -204,6 +206,33 @@ static int __btrfs_add_ordered_extent(struct inode *inode, 
u64 file_offset,
entry->inode = igrab(inode);
entry->compress_type = compress_type;
entry->truncated_len = (u64)-1;
+   entry->hash = NULL;
+   /*
+* A hash hit means we have already incremented the extents delayed
+* ref.
+* We must handle this even if another process is trying to
+* turn off dedupe, otherwise we will leak a reference.
+*/
+   if (hash && (hash->bytenr || root->fs_info->dedupe_enabled)) {
+   struct btrfs_dedupe_info *dedupe_info;
+
+   dedupe_info = root->fs_info->dedupe_info;
+   if (WARN_ON(dedupe_info == NULL)) {
+   kmem_cache_free(btrfs_ordered_extent_cache,
+   entry);
+   return -EINVAL;
+   }
+   entry->hash = btrfs_dedupe_alloc_hash(dedupe_info->hash_algo);
+   if (!entry->hash) {
+   kmem_cache_free(btrfs_ordered_extent_cache, entry);
+   return -ENOMEM;
+   }
+   entry->hash->bytenr = hash->bytenr;
+   entry->hash->num_bytes = hash->num_bytes;
+   memcpy(entry->hash->hash, hash->hash,
+  btrfs_hash_sizes[dedupe_info->hash_algo]);
+   }
+
if (type != BTRFS_ORDERED_IO_DONE && type != BTRFS_ORDERED_COMPLETE)
set_bit(type, >flags);
 
@@ -250,15 +279,23 @@ int btrfs_add_ordered_extent(struct inode *inode, u64 
file_offset,
 {
return __btrfs_add_ordered_extent(inode, file_offset, start, len,
  disk_len, type, 0,
- BTRFS_COMPRESS_NONE);
+ BTRFS_COMPRESS_NONE, NULL);
 }
 
+int btrfs_add_ordered_extent_dedupe(struct inode *inode, u64 file_offset,
+  u64 start, u64 len, u64 disk_len, int type,
+  struct btrfs_dedupe_hash *hash)
+{
+   return __btrfs_add_ordered_extent(inode, file_offset, start, len,
+ disk_len, type, 0,
+ BTRFS_COMPRESS_NONE, hash);
+}
 int btrfs_add_ordered_extent_dio(struct inode *inode, u64 file_offset,
 u64 start, u64 len, u64 disk_len, int type)
 {
return __btrfs_add_ordered_extent(inode, file_offset, start, len,
  disk_len, type, 1,
- BTRFS_COMPRESS_NONE);
+ BTRFS_COMPRESS_NONE, NULL);
 }
 
 int btrfs_add_ordered_extent_compress(struct inode *inode, u64 file_offset,
@@ -267,7 +304,7 @@ int btrfs_add_ordered_extent_compress(struct inode *inode, 
u64 file_offset,
 {
return __btrfs_add_ordered_extent(inode, file_offset, start, len,
  disk_len, type, 0,
- compress_type);
+ compress_type, NULL);
 }
 
 /*
@@ -577,6 +614,7 @@ void btrfs_put_ordered_extent(struct btrfs_ordered_extent 
*entry)
list_del(>list);
kfree(sum);
}
+   kfree(entry->hash);
kmem_cache_free(btrfs_ordered_extent_cache, entry);
}
 }
diff --git a/fs/btrfs/ordered-data.h 

[PATCH v14.1 10/16] btrfs: dedupe: Implement btrfs_dedupe_calc_hash interface

2016-11-14 Thread Qu Wenruo
From: Wang Xiaoguang 

Whether the in-memory or the on-disk dedupe backend is used, only the
SHA256 hash method is supported so far, so implement the
btrfs_dedupe_calc_hash() interface using SHA256.

Signed-off-by: Qu Wenruo 
Signed-off-by: Wang Xiaoguang 
Reviewed-by: Josef Bacik 
---
 fs/btrfs/dedupe.c | 46 ++
 1 file changed, 46 insertions(+)

diff --git a/fs/btrfs/dedupe.c b/fs/btrfs/dedupe.c
index ef4968f..d0d2f8a 100644
--- a/fs/btrfs/dedupe.c
+++ b/fs/btrfs/dedupe.c
@@ -639,3 +639,49 @@ int btrfs_dedupe_search(struct btrfs_fs_info *fs_info,
}
return ret;
 }
+
+int btrfs_dedupe_calc_hash(struct btrfs_fs_info *fs_info,
+  struct inode *inode, u64 start,
+  struct btrfs_dedupe_hash *hash)
+{
+   int i;
+   int ret;
+   struct page *p;
+   struct btrfs_dedupe_info *dedupe_info = fs_info->dedupe_info;
+   struct crypto_shash *tfm = dedupe_info->dedupe_driver;
+   SHASH_DESC_ON_STACK(sdesc, tfm);
+   u64 dedupe_bs;
+   u64 sectorsize = BTRFS_I(inode)->root->sectorsize;
+
+   if (!fs_info->dedupe_enabled || !hash)
+   return 0;
+
+   if (WARN_ON(dedupe_info == NULL))
+   return -EINVAL;
+
+   WARN_ON(!IS_ALIGNED(start, sectorsize));
+
+   dedupe_bs = dedupe_info->blocksize;
+
+   sdesc->tfm = tfm;
+   sdesc->flags = 0;
+   ret = crypto_shash_init(sdesc);
+   if (ret)
+   return ret;
+   for (i = 0; sectorsize * i < dedupe_bs; i++) {
+   char *d;
+
+   p = find_get_page(inode->i_mapping,
+ (start >> PAGE_SHIFT) + i);
+   if (WARN_ON(!p))
+   return -ENOENT;
+   d = kmap(p);
+   ret = crypto_shash_update(sdesc, d, sectorsize);
+   kunmap(p);
+   put_page(p);
+   if (ret)
+   return ret;
+   }
+   ret = crypto_shash_final(sdesc, hash->hash);
+   return ret;
+}
-- 
2.10.2





[PATCH v14.1 01/16] btrfs: improve inode's outstanding_extents computation

2016-11-14 Thread Qu Wenruo
From: Wang Xiaoguang 

This issue was revealed by modifying BTRFS_MAX_EXTENT_SIZE (128MB) to
64KB. With that change, fsstress tests often trigger these warnings from
btrfs_destroy_inode():
WARN_ON(BTRFS_I(inode)->outstanding_extents);
WARN_ON(BTRFS_I(inode)->reserved_extents);

The simple test program below can reproduce this issue reliably.
Note: you need to modify BTRFS_MAX_EXTENT_SIZE to 64KB for this test,
otherwise there won't be such a WARNING.
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <string.h>

int main(void)
{
        int fd;
        char buf[68 * 1024];

        memset(buf, 0, 68 * 1024);
        fd = open("testfile", O_CREAT | O_EXCL | O_RDWR, 0644);
        pwrite(fd, buf, 68 * 1024, 64 * 1024);
        return 0;
}

When BTRFS_MAX_EXTENT_SIZE is 64KB, and the buffered data range is:

64KB            128KB  132KB
|---------------|------|
     64KB + 4KB

1) for above data range, btrfs_delalloc_reserve_metadata() will reserve
metadata and set BTRFS_I(inode)->outstanding_extents to 2.
(68KB + 64KB - 1) / 64KB == 2

Outstanding_extents: 2

2) then btrfs_dirty_page() will be called to dirty the pages and set the
EXTENT_DELALLOC flag. In this case, btrfs_set_bit_hook() will be called
twice.
The 1st set_bit_hook() call will set the DELALLOC flag for the first 64KB.

64KB            128KB
|---------------|
  64KB DELALLOC

Outstanding_extents: 2

set_bit_hook() uses the FIRST_DELALLOC flag to avoid re-increasing the
outstanding_extents counter.
So the 1st set_bit_hook() call won't modify outstanding_extents; it's
still 2.

Then the FIRST_DELALLOC flag is *CLEARED*.

3) 2nd btrfs_set_bit_hook() call.
Because FIRST_DELALLOC has been cleared by the previous set_bit_hook(),
btrfs_set_bit_hook() will increase BTRFS_I(inode)->outstanding_extents by
one, so now BTRFS_I(inode)->outstanding_extents is 3.

64KB            128KB  132KB
|---------------|------|
  64KB DELALLOC  4KB DELALLOC

Outstanding_extents: 3

But the correct outstanding_extents number should be 2, not 3.
The 2nd btrfs_set_bit_hook() call just screwed this up, and leads to the
WARN_ON().

Normally, we could solve this by only increasing outstanding_extents in
set_bit_hook().
But the problem is that delalloc_reserve/release_metadata() only have a
'length' parameter and therefore calculate an inaccurate
outstanding_extents count.
If we relied only on set_bit_hook(), release_metadata() would screw
things up, as it would decrease by an inaccurate number.

So the fix we use is:
1) Increase *INACCURATE* outstanding_extents at btrfs_delalloc_reserve_metadata()
   Just as a placeholder.
2) Increase *accurate* outstanding_extents at set_bit_hook()
   This is the real increase.
3) Decrease *INACCURATE* outstanding_extents before returning
   This brings outstanding_extents to the correct value.

For the 128MB BTRFS_MAX_EXTENT_SIZE, due to a limitation of
__btrfs_buffered_write(), each iteration will only handle about 2MB of
data, so btrfs_dirty_pages() won't need to handle cases crossing 2
extents.
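
The extent-count arithmetic above is a plain ceiling division; a tiny
standalone sketch (not from the patch) that reproduces the 68KB/64KB
example:

#include <stdio.h>

/* How many extents of at most max_extent_size cover num_bytes. */
static unsigned int count_extents(unsigned long long num_bytes,
                                  unsigned long long max_extent_size)
{
        return (num_bytes + max_extent_size - 1) / max_extent_size;
}

int main(void)
{
        /* 68KB of buffered data, BTRFS_MAX_EXTENT_SIZE == 64KB */
        printf("%u\n", count_extents(68 * 1024, 64 * 1024)); /* prints 2 */
        return 0;
}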

Signed-off-by: Wang Xiaoguang 
---
 fs/btrfs/ctree.h |  2 ++
 fs/btrfs/inode.c | 65 ++--
 fs/btrfs/ioctl.c |  6 ++
 3 files changed, 62 insertions(+), 11 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 9d8edcb..766d152 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -3139,6 +3139,8 @@ int btrfs_start_delalloc_roots(struct btrfs_fs_info 
*fs_info, int delay_iput,
   int nr);
 int btrfs_set_extent_delalloc(struct inode *inode, u64 start, u64 end,
  struct extent_state **cached_state, int dedupe);
+int btrfs_set_extent_defrag(struct inode *inode, u64 start, u64 end,
+   struct extent_state **cached_state);
 int btrfs_create_subvol_root(struct btrfs_trans_handle *trans,
 struct btrfs_root *new_root,
 struct btrfs_root *parent_root,
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 1f980ef..25e0083 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -1601,6 +1601,9 @@ static void btrfs_split_extent_hook(struct inode *inode,
if (!(orig->state & EXTENT_DELALLOC))
return;
 
+   if (btrfs_is_free_space_inode(inode))
+   return;
+
size = orig->end - orig->start + 1;
if (size > BTRFS_MAX_EXTENT_SIZE) {
u64 num_extents;
@@ -1643,6 +1646,9 @@ static void btrfs_merge_extent_hook(struct inode *inode,
if (!(other->state & EXTENT_DELALLOC))
return;
 
+   if 

[PATCH v14.1 07/16] btrfs: dedupe: Introduce function to remove hash from in-memory tree

2016-11-14 Thread Qu Wenruo
From: Wang Xiaoguang 

Introduce the static function inmem_del() to remove a hash from the
in-memory dedupe tree, and implement the btrfs_dedupe_del() and
btrfs_dedupe_disable() interfaces.

Also, for btrfs_dedupe_disable(), add new functions to wait for existing
writers and block incoming writers, eliminating all possible races.

Cc: Mark Fasheh 
Signed-off-by: Qu Wenruo 
Signed-off-by: Wang Xiaoguang 
---
 fs/btrfs/dedupe.c | 132 +++---
 1 file changed, 126 insertions(+), 6 deletions(-)

diff --git a/fs/btrfs/dedupe.c b/fs/btrfs/dedupe.c
index e51412b..14c57fa 100644
--- a/fs/btrfs/dedupe.c
+++ b/fs/btrfs/dedupe.c
@@ -186,12 +186,6 @@ enable:
return ret;
 }
 
-int btrfs_dedupe_disable(struct btrfs_fs_info *fs_info)
-{
-   /* Place holder for bisect, will be implemented in later patches */
-   return 0;
-}
-
 static int inmem_insert_hash(struct rb_root *root,
 struct inmem_hash *hash, int hash_len)
 {
@@ -334,3 +328,129 @@ int btrfs_dedupe_add(struct btrfs_trans_handle *trans,
return inmem_add(dedupe_info, hash);
return -EINVAL;
 }
+
+static struct inmem_hash *
+inmem_search_bytenr(struct btrfs_dedupe_info *dedupe_info, u64 bytenr)
+{
+   struct rb_node **p = _info->bytenr_root.rb_node;
+   struct rb_node *parent = NULL;
+   struct inmem_hash *entry = NULL;
+
+   while (*p) {
+   parent = *p;
+   entry = rb_entry(parent, struct inmem_hash, bytenr_node);
+
+   if (bytenr < entry->bytenr)
+   p = &(*p)->rb_left;
+   else if (bytenr > entry->bytenr)
+   p = &(*p)->rb_right;
+   else
+   return entry;
+   }
+
+   return NULL;
+}
+
+/* Delete a hash from in-memory dedupe tree */
+static int inmem_del(struct btrfs_dedupe_info *dedupe_info, u64 bytenr)
+{
+   struct inmem_hash *hash;
+
+   mutex_lock(_info->lock);
+   hash = inmem_search_bytenr(dedupe_info, bytenr);
+   if (!hash) {
+   mutex_unlock(_info->lock);
+   return 0;
+   }
+
+   __inmem_del(dedupe_info, hash);
+   mutex_unlock(_info->lock);
+   return 0;
+}
+
+/* Remove a dedupe hash from dedupe tree */
+int btrfs_dedupe_del(struct btrfs_trans_handle *trans,
+struct btrfs_fs_info *fs_info, u64 bytenr)
+{
+   struct btrfs_dedupe_info *dedupe_info = fs_info->dedupe_info;
+
+   if (!fs_info->dedupe_enabled)
+   return 0;
+
+   if (WARN_ON(dedupe_info == NULL))
+   return -EINVAL;
+
+   if (dedupe_info->backend == BTRFS_DEDUPE_BACKEND_INMEMORY)
+   return inmem_del(dedupe_info, bytenr);
+   return -EINVAL;
+}
+
+static void inmem_destroy(struct btrfs_dedupe_info *dedupe_info)
+{
+   struct inmem_hash *entry, *tmp;
+
+   mutex_lock(_info->lock);
+   list_for_each_entry_safe(entry, tmp, _info->lru_list, lru_list)
+   __inmem_del(dedupe_info, entry);
+   mutex_unlock(_info->lock);
+}
+
+/*
+ * Helper function to wait and block all incoming writers
+ *
+ * Use rw_sem introduced for freeze to wait/block writers.
+ * So during the block time, no new write will happen, so we can
+ * do something quite safe, especially helpful for dedupe disable,
+ * as it affects buffered writes.
+ */
+static void block_all_writers(struct btrfs_fs_info *fs_info)
+{
+   struct super_block *sb = fs_info->sb;
+
+   percpu_down_write(sb->s_writers.rw_sem + SB_FREEZE_WRITE - 1);
+   down_write(>s_umount);
+}
+
+static void unblock_all_writers(struct btrfs_fs_info *fs_info)
+{
+   struct super_block *sb = fs_info->sb;
+
+   up_write(>s_umount);
+   percpu_up_write(sb->s_writers.rw_sem + SB_FREEZE_WRITE - 1);
+}
+
+int btrfs_dedupe_disable(struct btrfs_fs_info *fs_info)
+{
+   struct btrfs_dedupe_info *dedupe_info;
+   int ret;
+
+   dedupe_info = fs_info->dedupe_info;
+
+   if (!dedupe_info)
+   return 0;
+
+   /* Don't allow disable status change in RO mount */
+   if (fs_info->sb->s_flags & MS_RDONLY)
+   return -EROFS;
+
+   /*
+* Wait for all unfinished writers and block further writers.
+* Then sync the whole fs so all current write will go through
+* dedupe, and all later write won't go through dedupe.
+*/
+   block_all_writers(fs_info);
+   ret = sync_filesystem(fs_info->sb);
+   fs_info->dedupe_enabled = 0;
+   fs_info->dedupe_info = NULL;
+   unblock_all_writers(fs_info);
+   if (ret < 0)
+   return ret;
+
+   /* now we are OK to clean up everything */
+   if (dedupe_info->backend == BTRFS_DEDUPE_BACKEND_INMEMORY)
+   inmem_destroy(dedupe_info);
+
+   crypto_free_shash(dedupe_info->dedupe_driver);
+   kfree(dedupe_info);
+  

[PATCH v14.1 09/16] btrfs: dedupe: Introduce function to search for an existing hash

2016-11-14 Thread Qu Wenruo
From: Wang Xiaoguang 

Introduce the static function inmem_search() to handle the job for the
in-memory hash tree.

The trick is that we must ensure the delayed ref head is not being run
at the time we search for the hash.

With inmem_search(), we can implement the btrfs_dedupe_search()
interface.

Signed-off-by: Qu Wenruo 
Signed-off-by: Wang Xiaoguang 
Reviewed-by: Josef Bacik 
---
 fs/btrfs/dedupe.c | 185 ++
 1 file changed, 185 insertions(+)

diff --git a/fs/btrfs/dedupe.c b/fs/btrfs/dedupe.c
index 14c57fa..ef4968f 100644
--- a/fs/btrfs/dedupe.c
+++ b/fs/btrfs/dedupe.c
@@ -20,6 +20,7 @@
 #include "btrfs_inode.h"
 #include "transaction.h"
 #include "delayed-ref.h"
+#include "qgroup.h"
 
 struct inmem_hash {
struct rb_node hash_node;
@@ -454,3 +455,187 @@ int btrfs_dedupe_disable(struct btrfs_fs_info *fs_info)
kfree(dedupe_info);
return 0;
 }
+
+/*
+ * Caller must ensure the corresponding ref head is not being run.
+ */
+static struct inmem_hash *
+inmem_search_hash(struct btrfs_dedupe_info *dedupe_info, u8 *hash)
+{
+   struct rb_node **p = _info->hash_root.rb_node;
+   struct rb_node *parent = NULL;
+   struct inmem_hash *entry = NULL;
+   u16 hash_algo = dedupe_info->hash_algo;
+   int hash_len = btrfs_hash_sizes[hash_algo];
+
+   while (*p) {
+   parent = *p;
+   entry = rb_entry(parent, struct inmem_hash, hash_node);
+
+   if (memcmp(hash, entry->hash, hash_len) < 0) {
+   p = &(*p)->rb_left;
+   } else if (memcmp(hash, entry->hash, hash_len) > 0) {
+   p = &(*p)->rb_right;
+   } else {
+   /* Found, need to re-add it to LRU list head */
+   list_del(>lru_list);
+   list_add(>lru_list, _info->lru_list);
+   return entry;
+   }
+   }
+   return NULL;
+}
+
+static int inmem_search(struct btrfs_dedupe_info *dedupe_info,
+   struct inode *inode, u64 file_pos,
+   struct btrfs_dedupe_hash *hash)
+{
+   int ret;
+   struct btrfs_root *root = BTRFS_I(inode)->root;
+   struct btrfs_trans_handle *trans;
+   struct btrfs_delayed_ref_root *delayed_refs;
+   struct btrfs_delayed_ref_head *head;
+   struct btrfs_delayed_ref_head *insert_head;
+   struct btrfs_delayed_data_ref *insert_dref;
+   struct btrfs_qgroup_extent_record *insert_qrecord = NULL;
+   struct inmem_hash *found_hash;
+   int free_insert = 1;
+   u64 bytenr;
+   u32 num_bytes;
+
+   insert_head = kmem_cache_alloc(btrfs_delayed_ref_head_cachep, GFP_NOFS);
+   if (!insert_head)
+   return -ENOMEM;
+   insert_head->extent_op = NULL;
+   insert_dref = kmem_cache_alloc(btrfs_delayed_data_ref_cachep, GFP_NOFS);
+   if (!insert_dref) {
+   kmem_cache_free(btrfs_delayed_ref_head_cachep, insert_head);
+   return -ENOMEM;
+   }
+   if (test_bit(BTRFS_FS_QUOTA_ENABLED, >fs_info->flags) &&
+   is_fstree(root->root_key.objectid)) {
+   insert_qrecord = kmalloc(sizeof(*insert_qrecord), GFP_NOFS);
+   if (!insert_qrecord) {
+   kmem_cache_free(btrfs_delayed_ref_head_cachep,
+   insert_head);
+   kmem_cache_free(btrfs_delayed_data_ref_cachep,
+   insert_dref);
+   return -ENOMEM;
+   }
+   }
+
+   trans = btrfs_join_transaction(root);
+   if (IS_ERR(trans)) {
+   ret = PTR_ERR(trans);
+   goto free_mem;
+   }
+
+again:
+   mutex_lock(&dedupe_info->lock);
+   found_hash = inmem_search_hash(dedupe_info, hash->hash);
+   /* If we don't find a duplicated extent, just return. */
+   if (!found_hash) {
+   ret = 0;
+   goto out;
+   }
+   bytenr = found_hash->bytenr;
+   num_bytes = found_hash->num_bytes;
+
+   delayed_refs = &trans->transaction->delayed_refs;
+
+   spin_lock(&delayed_refs->lock);
+   head = btrfs_find_delayed_ref_head(trans, bytenr);
+   if (!head) {
+   /*
+* We can safely insert a new delayed_ref as long as we
+* hold delayed_refs->lock.
+* Only need to use atomic inc_extent_ref()
+*/
+   btrfs_add_delayed_data_ref_locked(root->fs_info, trans,
+   insert_dref, insert_head, insert_qrecord,
+   bytenr, num_bytes, 0, root->root_key.objectid,
+   btrfs_ino(inode), file_pos, 0,
+   BTRFS_ADD_DELAYED_REF);
+   spin_unlock(&delayed_refs->lock);
+
+   /* 

[PATCH v14.1 08/16] btrfs: delayed-ref: Add support for increasing data ref under spinlock

2016-11-14 Thread Qu Wenruo
For in-band dedupe, btrfs needs to increase a data ref while
delayed_refs->lock is held, so add a new function
btrfs_add_delayed_data_ref_locked() to increase an extent ref with
delayed_refs already locked.
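
Since the structures cannot be allocated while the spinlock is held,
the expected calling pattern is: allocate everything up front, then
insert under the lock. A minimal hedged sketch of a caller (error
handling omitted; names taken from these patches):

    insert_head = kmem_cache_alloc(btrfs_delayed_ref_head_cachep, GFP_NOFS);
    insert_dref = kmem_cache_alloc(btrfs_delayed_data_ref_cachep, GFP_NOFS);
    /* NULL checks and -ENOMEM handling omitted for brevity */

    spin_lock(&delayed_refs->lock);
    /* no sleeping allocations from here on: everything already exists */
    btrfs_add_delayed_data_ref_locked(fs_info, trans, insert_dref,
                    insert_head, NULL /* qrecord */, bytenr, num_bytes,
                    0 /* parent */, root_objectid, owner, offset,
                    0 /* reserved */, BTRFS_ADD_DELAYED_REF);
    spin_unlock(&delayed_refs->lock);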

Signed-off-by: Qu Wenruo 
Reviewed-by: Josef Bacik 
---
 fs/btrfs/delayed-ref.c | 30 +++---
 fs/btrfs/delayed-ref.h |  8 
 2 files changed, 31 insertions(+), 7 deletions(-)

diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c
index 8d93854..dccce10 100644
--- a/fs/btrfs/delayed-ref.c
+++ b/fs/btrfs/delayed-ref.c
@@ -805,6 +805,26 @@ free_ref:
 }
 
 /*
+ * Do real delayed data ref insert.
+ * Caller must hold delayed_refs->lock and have allocated memory
+ * for dref, head_ref and record.
+ */
+void btrfs_add_delayed_data_ref_locked(struct btrfs_fs_info *fs_info,
+   struct btrfs_trans_handle *trans,
+   struct btrfs_delayed_data_ref *dref,
+   struct btrfs_delayed_ref_head *head_ref,
+   struct btrfs_qgroup_extent_record *qrecord,
+   u64 bytenr, u64 num_bytes, u64 parent, u64 ref_root,
+   u64 owner, u64 offset, u64 reserved, int action)
+{
+   head_ref = add_delayed_ref_head(fs_info, trans, &head_ref->node,
+   qrecord, bytenr, num_bytes, ref_root, reserved,
+   action, 1);
+   add_delayed_data_ref(fs_info, trans, head_ref, &dref->node, bytenr,
+   num_bytes, parent, ref_root, owner, offset, action);
+}
+
+/*
  * add a delayed data ref. it's similar to btrfs_add_delayed_tree_ref.
  */
 int btrfs_add_delayed_data_ref(struct btrfs_fs_info *fs_info,
@@ -850,13 +870,9 @@ int btrfs_add_delayed_data_ref(struct btrfs_fs_info *fs_info,
 * insert both the head node and the new ref without dropping
 * the spin lock
 */
-   head_ref = add_delayed_ref_head(fs_info, trans, &head_ref->node, record,
-   bytenr, num_bytes, ref_root, reserved,
-   action, 1);
-
-   add_delayed_data_ref(fs_info, trans, head_ref, &ref->node, bytenr,
-  num_bytes, parent, ref_root, owner, offset,
-  action);
+   btrfs_add_delayed_data_ref_locked(fs_info, trans, ref, head_ref, record,
+   bytenr, num_bytes, parent, ref_root, owner, offset,
+   reserved, action);
spin_unlock(&delayed_refs->lock);
 
return 0;
diff --git a/fs/btrfs/delayed-ref.h b/fs/btrfs/delayed-ref.h
index 43f3629..d3a4369 100644
--- a/fs/btrfs/delayed-ref.h
+++ b/fs/btrfs/delayed-ref.h
@@ -239,11 +239,19 @@ static inline void btrfs_put_delayed_ref(struct btrfs_delayed_ref_node *ref)
}
 }
 
+struct btrfs_qgroup_extent_record;
 int btrfs_add_delayed_tree_ref(struct btrfs_fs_info *fs_info,
   struct btrfs_trans_handle *trans,
   u64 bytenr, u64 num_bytes, u64 parent,
   u64 ref_root, int level, int action,
   struct btrfs_delayed_extent_op *extent_op);
+void btrfs_add_delayed_data_ref_locked(struct btrfs_fs_info *fs_info,
+   struct btrfs_trans_handle *trans,
+   struct btrfs_delayed_data_ref *dref,
+   struct btrfs_delayed_ref_head *head_ref,
+   struct btrfs_qgroup_extent_record *qrecord,
+   u64 bytenr, u64 num_bytes, u64 parent, u64 ref_root,
+   u64 owner, u64 offset, u64 reserved, int action);
 int btrfs_add_delayed_data_ref(struct btrfs_fs_info *fs_info,
   struct btrfs_trans_handle *trans,
   u64 bytenr, u64 num_bytes,
-- 
2.10.2





[PATCH v14.1 05/16] btrfs: dedupe: Introduce function to initialize dedupe info

2016-11-14 Thread Qu Wenruo
From: Wang Xiaoguang 

Add a generic function to initialize the dedupe info structure.
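
For context, a hedged userspace sketch of enabling the in-memory
backend. The struct fields match the hunk below; the ioctl name
BTRFS_IOC_DEDUPE_CTL, the cmd field and the BTRFS_DEDUPE_CTL_ENABLE /
BTRFS_DEDUPE_HASH_SHA256 constants are assumptions from the wider
patchset, not shown in this mail:

    #include <sys/ioctl.h>
    #include <linux/btrfs.h>    /* with this series applied */

    struct btrfs_ioctl_dedupe_args dargs = {0};

    dargs.cmd = BTRFS_DEDUPE_CTL_ENABLE;        /* assumed name */
    dargs.backend = BTRFS_DEDUPE_BACKEND_INMEMORY;
    dargs.blocksize = 128 * 1024;               /* power of 2, >= PAGE_SIZE */
    dargs.hash_algo = BTRFS_DEDUPE_HASH_SHA256; /* only SHA256 yet */
    dargs.limit_nr = 32768;                     /* any non-zero limit */

    if (ioctl(fd, BTRFS_IOC_DEDUPE_CTL, &dargs) < 0)  /* assumed ioctl */
            perror("dedupe enable");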

Signed-off-by: Qu Wenruo 
Signed-off-by: Wang Xiaoguang 
Reviewed-by: Josef Bacik 
---
 fs/btrfs/Makefile  |   2 +-
 fs/btrfs/dedupe.c  | 185 +
 fs/btrfs/dedupe.h  |  13 +++-
 include/uapi/linux/btrfs.h |   4 +-
 4 files changed, 200 insertions(+), 4 deletions(-)
 create mode 100644 fs/btrfs/dedupe.c

diff --git a/fs/btrfs/Makefile b/fs/btrfs/Makefile
index 128ce17..1b8c627 100644
--- a/fs/btrfs/Makefile
+++ b/fs/btrfs/Makefile
@@ -9,7 +9,7 @@ btrfs-y += super.o ctree.o extent-tree.o print-tree.o root-tree.o dir-item.o \
   export.o tree-log.o free-space-cache.o zlib.o lzo.o \
   compression.o delayed-ref.o relocation.o delayed-inode.o scrub.o \
   reada.o backref.o ulist.o qgroup.o send.o dev-replace.o raid56.o \
-  uuid-tree.o props.o hash.o free-space-tree.o
+  uuid-tree.o props.o hash.o free-space-tree.o dedupe.o
 
 btrfs-$(CONFIG_BTRFS_FS_POSIX_ACL) += acl.o
 btrfs-$(CONFIG_BTRFS_FS_CHECK_INTEGRITY) += check-integrity.o
diff --git a/fs/btrfs/dedupe.c b/fs/btrfs/dedupe.c
new file mode 100644
index 000..b14166a
--- /dev/null
+++ b/fs/btrfs/dedupe.c
@@ -0,0 +1,185 @@
+/*
+ * Copyright (C) 2016 Fujitsu.  All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public
+ * License v2 as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public
+ * License along with this program; if not, write to the
+ * Free Software Foundation, Inc., 59 Temple Place - Suite 330,
+ * Boston, MA 021110-1307, USA.
+ */
+#include "ctree.h"
+#include "dedupe.h"
+#include "btrfs_inode.h"
+#include "transaction.h"
+#include "delayed-ref.h"
+
+struct inmem_hash {
+   struct rb_node hash_node;
+   struct rb_node bytenr_node;
+   struct list_head lru_list;
+
+   u64 bytenr;
+   u32 num_bytes;
+
+   u8 hash[];
+};
+
+static int init_dedupe_info(struct btrfs_dedupe_info **ret_info,
+   struct btrfs_ioctl_dedupe_args *dargs)
+{
+   struct btrfs_dedupe_info *dedupe_info;
+
+   dedupe_info = kzalloc(sizeof(*dedupe_info), GFP_NOFS);
+   if (!dedupe_info)
+   return -ENOMEM;
+
+   dedupe_info->hash_algo = dargs->hash_algo;
+   dedupe_info->backend = dargs->backend;
+   dedupe_info->blocksize = dargs->blocksize;
+   dedupe_info->limit_nr = dargs->limit_nr;
+
+   /* only support SHA256 yet */
+   dedupe_info->dedupe_driver = crypto_alloc_shash("sha256", 0, 0);
+   if (IS_ERR(dedupe_info->dedupe_driver)) {
+   int ret;
+
+   ret = PTR_ERR(dedupe_info->dedupe_driver);
+   kfree(dedupe_info);
+   return ret;
+   }
+
+   dedupe_info->hash_root = RB_ROOT;
+   dedupe_info->bytenr_root = RB_ROOT;
+   dedupe_info->current_nr = 0;
+   INIT_LIST_HEAD(&dedupe_info->lru_list);
+   mutex_init(&dedupe_info->lock);
+
+   *ret_info = dedupe_info;
+   return 0;
+}
+
+/*
+ * Helper to check if parameters are valid.
+ * The first invalid field will be set to (-1), to inform the user which
+ * parameter is invalid.
+ * Except for dargs->limit_nr and dargs->limit_mem: in that case, 0 will be
+ * returned, since the user can specify any value as the limit, except 0.
+ */
+static int check_dedupe_parameter(struct btrfs_fs_info *fs_info,
+ struct btrfs_ioctl_dedupe_args *dargs)
+{
+   u64 blocksize = dargs->blocksize;
+   u64 limit_nr = dargs->limit_nr;
+   u64 limit_mem = dargs->limit_mem;
+   u16 hash_algo = dargs->hash_algo;
+   u8 backend = dargs->backend;
+
+   /*
+* Set all reserved fields to -1, allowing the user to detect
+* unsupported optional parameters.
+*/
+   memset(dargs->__unused, -1, sizeof(dargs->__unused));
+   if (blocksize > BTRFS_DEDUPE_BLOCKSIZE_MAX ||
+   blocksize < BTRFS_DEDUPE_BLOCKSIZE_MIN ||
+   blocksize < fs_info->tree_root->sectorsize ||
+   !is_power_of_2(blocksize) ||
+   blocksize < PAGE_SIZE) {
+   dargs->blocksize = (u64)-1;
+   return -EINVAL;
+   }
+   if (hash_algo >= ARRAY_SIZE(btrfs_hash_sizes)) {
+   dargs->hash_algo = (u16)-1;
+   return -EINVAL;
+   }
+   if (backend >= BTRFS_DEDUPE_BACKEND_COUNT) {
+   dargs->backend = (u8)-1;
+   return -EINVAL;
+   }
+
+   /* Backend specific check */
+   if 

[PATCH v14.1 12/16] btrfs: dedupe: Inband in-memory only de-duplication implement

2016-11-14 Thread Qu Wenruo
Core implementation of inband de-duplication.
It reuses the async_cow_start() facility to calculate the dedupe hash,
and uses that hash to do inband de-duplication at the extent level.

The workflow is as below:
1) Run delalloc range for an inode
2) Calculate hash for the delalloc range at the unit of dedupe_bs
3) For the hash match (duplicated) case, just increase the source
   extent ref and insert the file extent.
   For the hash mismatch case, go through the normal cow_file_range()
   fallback, and add the hash into the dedupe tree.
   Compression for the hash-miss case is not supported yet.

The current implementation stores all dedupe hashes in an in-memory
rb-tree, with LRU behavior to control the memory limit.
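
Put together, the per-block flow looks roughly like the hedged sketch
below. btrfs_dedupe_calc_hash() and btrfs_dedupe_search() are
introduced by earlier patches in this series; insert_dedupe_extent()
and fallback_to_cow() are hypothetical names standing in for the glue
inside the delalloc path:

    u64 cur;
    int ret = 0;

    for (cur = start; cur < end; cur += dedupe_bs) {
            ret = btrfs_dedupe_calc_hash(fs_info, inode, cur, hash);
            if (ret < 0)
                    break;
            ret = btrfs_dedupe_search(fs_info, inode, cur, hash);
            if (ret > 0) {
                    /* hit: search already added a delayed ref to the
                     * existing extent, only insert the file extent item */
                    ret = insert_dedupe_extent(inode, cur, hash);
            } else if (ret == 0) {
                    /* miss: normal COW write, then record the new hash
                     * so later writes can dedupe against it */
                    ret = fallback_to_cow(inode, cur, dedupe_bs, hash);
            }
            if (ret < 0)
                    break;
    }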

Signed-off-by: Wang Xiaoguang 
Signed-off-by: Qu Wenruo 
---
 fs/btrfs/extent-tree.c |  20 
 fs/btrfs/inode.c   | 253 +++--
 fs/btrfs/relocation.c  |  16 
 3 files changed, 258 insertions(+), 31 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index d5691b0..3c82730 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -37,6 +37,7 @@
 #include "math.h"
 #include "sysfs.h"
 #include "qgroup.h"
+#include "dedupe.h"
 
 #undef SCRAMBLE_DELAYED_REFS
 
@@ -2399,6 +2400,8 @@ static int run_one_delayed_ref(struct btrfs_trans_handle *trans,
 
if (btrfs_delayed_ref_is_head(node)) {
struct btrfs_delayed_ref_head *head;
+   struct btrfs_fs_info *fs_info = root->fs_info;
+
/*
 * we've hit the end of the chain and we were supposed
 * to insert this extent into the tree.  But, it got
@@ -2414,6 +2417,18 @@ static int run_one_delayed_ref(struct btrfs_trans_handle *trans,
btrfs_pin_extent(root, node->bytenr,
 node->num_bytes, 1);
if (head->is_data) {
+   /*
+* If insert_reserved is given, it means
+* a new extent was reserved, then deleted
+* in one transaction, and the inc/dec refs merged to 0.
+*
+* In this case, we need to remove its dedupe
+* hash.
+*/
+   ret = btrfs_dedupe_del(trans, fs_info,
+  node->bytenr);
+   if (ret < 0)
+   return ret;
ret = btrfs_del_csums(trans, root,
  node->bytenr,
  node->num_bytes);
@@ -7087,6 +7102,11 @@ static int __btrfs_free_extent(struct btrfs_trans_handle *trans,
btrfs_release_path(path);
 
if (is_data) {
+   ret = btrfs_dedupe_del(trans, info, bytenr);
+   if (ret < 0) {
+   btrfs_abort_transaction(trans, ret);
+   goto out;
+   }
ret = btrfs_del_csums(trans, root, bytenr, num_bytes);
if (ret) {
btrfs_abort_transaction(trans, ret);
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index f3ea0e0..1f384e1 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -338,6 +338,7 @@ struct async_extent {
struct page **pages;
unsigned long nr_pages;
int compress_type;
+   struct btrfs_dedupe_hash *hash;
struct list_head list;
 };
 
@@ -349,6 +350,7 @@ struct async_cow {
u64 end;
struct list_head extents;
struct btrfs_work work;
+   enum btrfs_metadata_reserve_type reserve_type;
 };
 
 static noinline int add_async_extent(struct async_cow *cow,
@@ -356,7 +358,8 @@ static noinline int add_async_extent(struct async_cow *cow,
 u64 compressed_size,
 struct page **pages,
 unsigned long nr_pages,
-int compress_type)
+int compress_type,
+struct btrfs_dedupe_hash *hash)
 {
struct async_extent *async_extent;
 
@@ -368,6 +371,7 @@ static noinline int add_async_extent(struct async_cow *cow,
async_extent->pages = pages;
async_extent->nr_pages = nr_pages;
async_extent->compress_type = compress_type;
+   async_extent->hash = hash;
list_add_tail(&async_extent->list, &cow->extents);
return 0;
 }
@@ -600,7 +604,7 @@ cont:
 */
add_async_extent(async_cow, start, num_bytes,
total_compressed, pages, 

[PATCH v14.1 06/16] btrfs: dedupe: Introduce function to add hash into in-memory tree

2016-11-14 Thread Qu Wenruo
From: Wang Xiaoguang 

Introduce the static function inmem_add() to add a hash into the
in-memory tree. With it, we can implement the btrfs_dedupe_add()
interface.
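
The reason for indexing the same inmem_hash from two rb-trees: writes
look entries up by content via hash_root, while extent freeing must
drop an entry by its on-disk location via bytenr_root. A hedged sketch
of the bytenr-side lookup (the series likely adds something equivalent
for deleting hashes by bytenr; this is illustrative, mirroring
inmem_insert_bytenr() below):

    static struct inmem_hash *
    inmem_search_bytenr(struct btrfs_dedupe_info *dedupe_info, u64 bytenr)
    {
            struct rb_node *n = dedupe_info->bytenr_root.rb_node;
            struct inmem_hash *entry;

            while (n) {
                    entry = rb_entry(n, struct inmem_hash, bytenr_node);
                    if (bytenr < entry->bytenr)
                            n = n->rb_left;
                    else if (bytenr > entry->bytenr)
                            n = n->rb_right;
                    else
                            return entry;
            }
            return NULL;
    }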

Signed-off-by: Qu Wenruo 
Signed-off-by: Wang Xiaoguang 
Reviewed-by: Josef Bacik 
---
 fs/btrfs/dedupe.c | 151 ++
 1 file changed, 151 insertions(+)

diff --git a/fs/btrfs/dedupe.c b/fs/btrfs/dedupe.c
index b14166a..e51412b 100644
--- a/fs/btrfs/dedupe.c
+++ b/fs/btrfs/dedupe.c
@@ -32,6 +32,14 @@ struct inmem_hash {
u8 hash[];
 };
 
+static inline struct inmem_hash *inmem_alloc_hash(u16 algo)
+{
+   if (WARN_ON(algo >= ARRAY_SIZE(btrfs_hash_sizes)))
+   return NULL;
+   return kzalloc(sizeof(struct inmem_hash) + btrfs_hash_sizes[algo],
+   GFP_NOFS);
+}
+
 static int init_dedupe_info(struct btrfs_dedupe_info **ret_info,
struct btrfs_ioctl_dedupe_args *dargs)
 {
@@ -183,3 +191,146 @@ int btrfs_dedupe_disable(struct btrfs_fs_info *fs_info)
/* Place holder for bisect, will be implemented in later patches */
return 0;
 }
+
+static int inmem_insert_hash(struct rb_root *root,
+struct inmem_hash *hash, int hash_len)
+{
+   struct rb_node **p = &root->rb_node;
+   struct rb_node *parent = NULL;
+   struct inmem_hash *entry = NULL;
+
+   while (*p) {
+   parent = *p;
+   entry = rb_entry(parent, struct inmem_hash, hash_node);
+   if (memcmp(hash->hash, entry->hash, hash_len) < 0)
+   p = &(*p)->rb_left;
+   else if (memcmp(hash->hash, entry->hash, hash_len) > 0)
+   p = &(*p)->rb_right;
+   else
+   return 1;
+   }
+   rb_link_node(&hash->hash_node, parent, p);
+   rb_insert_color(&hash->hash_node, root);
+   return 0;
+}
+
+static int inmem_insert_bytenr(struct rb_root *root,
+  struct inmem_hash *hash)
+{
+   struct rb_node **p = &root->rb_node;
+   struct rb_node *parent = NULL;
+   struct inmem_hash *entry = NULL;
+
+   while (*p) {
+   parent = *p;
+   entry = rb_entry(parent, struct inmem_hash, bytenr_node);
+   if (hash->bytenr < entry->bytenr)
+   p = &(*p)->rb_left;
+   else if (hash->bytenr > entry->bytenr)
+   p = &(*p)->rb_right;
+   else
+   return 1;
+   }
+   rb_link_node(&hash->bytenr_node, parent, p);
+   rb_insert_color(&hash->bytenr_node, root);
+   return 0;
+}
+
+static void __inmem_del(struct btrfs_dedupe_info *dedupe_info,
+   struct inmem_hash *hash)
+{
+   list_del(&hash->lru_list);
+   rb_erase(&hash->hash_node, &dedupe_info->hash_root);
+   rb_erase(&hash->bytenr_node, &dedupe_info->bytenr_root);
+
+   if (!WARN_ON(dedupe_info->current_nr == 0))
+   dedupe_info->current_nr--;
+
+   kfree(hash);
+}
+
+/*
+ * Insert a hash into the in-memory dedupe tree.
+ * Will evict the least recently used hashes when the limit is exceeded.
+ *
+ * If the hash matches an existing one, we won't insert it, to
+ * save memory.
+ */
+static int inmem_add(struct btrfs_dedupe_info *dedupe_info,
+struct btrfs_dedupe_hash *hash)
+{
+   int ret = 0;
+   u16 algo = dedupe_info->hash_algo;
+   struct inmem_hash *ihash;
+
+   ihash = inmem_alloc_hash(algo);
+
+   if (!ihash)
+   return -ENOMEM;
+
+   /* Copy the data out */
+   ihash->bytenr = hash->bytenr;
+   ihash->num_bytes = hash->num_bytes;
+   memcpy(ihash->hash, hash->hash, btrfs_hash_sizes[algo]);
+
+   mutex_lock(&dedupe_info->lock);
+
+   ret = inmem_insert_bytenr(&dedupe_info->bytenr_root, ihash);
+   if (ret > 0) {
+   kfree(ihash);
+   ret = 0;
+   goto out;
+   }
+
+   ret = inmem_insert_hash(&dedupe_info->hash_root, ihash,
+   btrfs_hash_sizes[algo]);
+   if (ret > 0) {
+   /*
+* We only keep one hash in tree to save memory, so if
+* hash conflicts, free the one to insert.
+*/
+   rb_erase(&ihash->bytenr_node, &dedupe_info->bytenr_root);
+   kfree(ihash);
+   ret = 0;
+   goto out;
+   }
+
+   list_add(&ihash->lru_list, &dedupe_info->lru_list);
+   dedupe_info->current_nr++;
+
+   /* Remove the last dedupe hash if we exceed limit */
+   while (dedupe_info->current_nr > dedupe_info->limit_nr) {
+   struct inmem_hash *last;
+
+   last = list_entry(dedupe_info->lru_list.prev,
+ struct inmem_hash, lru_list);
+   __inmem_del(dedupe_info, last);
+   }
+out:
+   mutex_unlock(&dedupe_info->lock);
+   return 0;
+}
+
+int btrfs_dedupe_add(struct