Re: Another ENOSPC situation

2016-04-01 Thread Chris Murphy
On Fri, Apr 1, 2016 at 10:55 PM, Duncan <1i5t5.dun...@cox.net> wrote:
> Marc Haber posted on Fri, 01 Apr 2016 15:40:29 +0200 as excerpted:

>> [4/502]mh@swivel:~$ sudo btrfs fi usage /
>> Overall:
>>     Device size:                 600.00GiB
>>     Device allocated:            600.00GiB
>>     Device unallocated:            1.00MiB
>
> That's the problem right there.  The admin didn't do his job and spot the
> near full allocation issue


I don't yet agree this is an admin problem. This is the 2nd or 3rd
case we've seen only recently where there's plenty of space in all
chunk types and yet ENOSPC happens, seemingly only because there's no
unallocated space remaining. I don't know that this is a regression
for sure, but it sure seems like one.



>>
>> Data,single: Size:553.93GiB, Used:405.73GiB
>>    /dev/mapper/swivelbtr 553.93GiB
>>
>> Metadata,DUP: Size:23.00GiB, Used:3.83GiB
>>    /dev/mapper/swivelbtr  46.00GiB
>>
>> System,DUP: Size:32.00MiB, Used:112.00KiB
>>    /dev/mapper/swivelbtr  64.00MiB
>>
>> Unallocated:
>>    /dev/mapper/swivelbtr   1.00MiB
>> [5/503]mh@swivel:~$
>
> Both data and metadata have several GiB free, data ~140 GiB free, and
> metadata isn't into global reserve, so the system isn't totally wedged,
> only partially, due to the lack of unallocated space.

Unallocated space alone hasn't ever caused this that I can remember.
It's most often been totally full metadata chunks, with free space in
allocated data chunks, with no unallocated space out of which to
create another metadata chunk to write out changes.
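
To make that distinction concrete, here is a minimal userspace sketch (not btrfs code) of the "classic" wedge described above. The struct, the function name, and the min_chunk and slack parameters are all assumptions made up for illustration; the real allocator logic is more involved.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Sizes in MiB, as read off 'btrfs fi usage' (hypothetical struct). */
struct space_summary {
	uint64_t metadata_size;  /* space allocated to metadata chunks */
	uint64_t metadata_used;  /* space used inside those chunks */
	uint64_t unallocated;    /* device space not assigned to any chunk */
};

/*
 * Returns true for the classic pattern: metadata chunks are (nearly)
 * full AND there is not enough unallocated space for a new one.
 * min_chunk is an assumed size for a new metadata chunk, slack is how
 * little free metadata we still treat as "full". Both are illustrative.
 */
bool classic_metadata_enospc(const struct space_summary *s,
			     uint64_t min_chunk, uint64_t slack)
{
	uint64_t metadata_free;

	assert(s->metadata_used <= s->metadata_size);
	metadata_free = s->metadata_size - s->metadata_used;

	return metadata_free <= slack && s->unallocated < min_chunk;
}
```

Fed with the numbers from this report (metadata 23 GiB allocated, about 3.83 GiB used, 1 MiB unallocated) the check comes back false, which is the point: this case has no unallocated space, but it is not the full-metadata situation.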

There should be plenty of space for either a -dusage=1 or -musage=1
balance to free up a bunch of partially allocated chunks. Offhand I
don't think the profiles filter is helpful in this case.

OK, so where I could be wrong is that I'm expecting balance not to
require unallocated space to work. I'd expect that it can COW extents
from one chunk into another existing chunk (of the same type) and then,
once that's successful, free up the source chunk, i.e. revert it back to
unallocated. If balance can only copy into newly allocated chunks,
that seems like a big problem. I thought that problem had been fixed
a very long time ago.

And what we don't see from 'usage' that we will see from 'df' is the
GlobalReserve values. I'd like to see that.

Anyway, in the meantime there is a workaround:

btrfs dev add

Just add a device, even if it's an 8GiB flash drive. But it can be a
spare space on a partition, or it can be a logical volume, or whatever
you want. That'll add some gigs of unallocated space. Now the balance
will work, or for absolutely sure there's a bug (and a new one because
this has always worked in the past). After whatever filtered or full
balance is done, make sure to 'btrfs dev rem' and confirm it's gone
with 'btrfs fi show' before removing the device. It's a two device
volume until that device is successfully removed and is in something
of a fragile state until then because any loss of data on that 2nd
device has a good chance of face planting the file system.



-- 
Chris Murphy
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 10/13] btrfs: introduce helper functions to perform hot replace

2016-04-01 Thread kbuild test robot
Hi Anand,

[auto build test ERROR on btrfs/next]
[also build test ERROR on v4.6-rc1 next-20160401]
[if your patch is applied to the wrong git tree, please drop us a note to help 
improving the system]

url:
https://github.com/0day-ci/linux/commits/Anand-Jain/Introduce-device-state-failed-Hot-spare-and-Auto-replace/20160402-093528
base:   https://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git 
next
config: x86_64-rhel (attached as .config)
reproduce:
# save the attached .config to linux build tree
make ARCH=x86_64 

All errors (new ones prefixed by >>):

   fs/btrfs/dev-replace.c: In function 'btrfs_auto_replace_start':
   fs/btrfs/dev-replace.c:981:38: warning: passing argument 2 of 
'btrfs_dev_replace_start' from incompatible pointer type
 ret = btrfs_dev_replace_start(root, tgt_path,
 ^
   fs/btrfs/dev-replace.c:308:5: note: expected 'struct 
btrfs_ioctl_dev_replace_args *' but argument is of type 'char *'
int btrfs_dev_replace_start(struct btrfs_root *root,
^
>> fs/btrfs/dev-replace.c:981:8: error: too many arguments to function 
>> 'btrfs_dev_replace_start'
 ret = btrfs_dev_replace_start(root, tgt_path,
   ^
   fs/btrfs/dev-replace.c:308:5: note: declared here
int btrfs_dev_replace_start(struct btrfs_root *root,
^

vim +/btrfs_dev_replace_start +981 fs/btrfs/dev-replace.c

   975  src_path = kstrdup(rcu_str_deref(src_device->name), GFP_ATOMIC);
   976  rcu_read_unlock();
   977  if (!src_path) {
   978  kfree(tgt_path);
   979  return -ENOMEM;
   980  }
 > 981  ret = btrfs_dev_replace_start(root, tgt_path,
   982  src_device->devid, src_path,
   983  BTRFS_IOCTL_DEV_REPLACE_CONT_READING_FROM_SRCDEV_MODE_AVOID);
   984  if (ret)

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all   Intel Corporation




Re: Another ENOSPC situation

2016-04-01 Thread Duncan
Marc Haber posted on Fri, 01 Apr 2016 15:40:29 +0200 as excerpted:

> Hi,
> 
> just for a change, this is another btrfs on a different host. The host
> is also running Debian unstable with mainline kernels, the btrfs in
> question was created (not converted) in March 2015 with btrfs-tools
> 3.17. It is the root fs of my main work notebook which is under
> workstation load, with lots of snapshots being created and deleted.
> 
> Balance immediately fails with ENOSPC
> 
> balance -dprofiles=single -dusage=1 goes through "fine" ("had to
> relocate 0 out of 602 chunks")
> 
> balance -dprofiles=single -dusage=2 also ENOSPCes immediately.
> 
> [4/502]mh@swivel:~$ sudo btrfs fi usage /
> Overall:
>     Device size:                 600.00GiB
>     Device allocated:            600.00GiB
>     Device unallocated:            1.00MiB

That's the problem right there.  The admin didn't do his job and spot the 
near full allocation issue (perhaps with the help of some script set to 
run periodically and tell him about it) before it got critical, and now 
there's no room left to balance, to fix the problem.

This despite the fact that the admin chose to run a not yet entirely 
stable filesystem that's well known to run off the rails in precisely 
this sort of way, occasionally, with specific use-cases such as heavy 
snapshotting more often than others.

>     Device missing:                  0.00B
>     Used:                        413.40GiB
>     Free (estimated):            148.20GiB  (min: 148.20GiB)

Tho the used vs. free isn't all that bad... it's just that the allocated 
vs. unallocated was allowed to run off the rails and get the filesystem 
in a bind.

But that does mean it should be possible to do something about it. =:^)

>     Data ratio:                       1.00
>     Metadata ratio:                   2.00
>     Global reserve:              512.00MiB  (used: 0.00B)
> 
> Data,single: Size:553.93GiB, Used:405.73GiB
>    /dev/mapper/swivelbtr 553.93GiB
> 
> Metadata,DUP: Size:23.00GiB, Used:3.83GiB
>    /dev/mapper/swivelbtr  46.00GiB
> 
> System,DUP: Size:32.00MiB, Used:112.00KiB
>    /dev/mapper/swivelbtr  64.00MiB
> 
> Unallocated:
>    /dev/mapper/swivelbtr   1.00MiB
> [5/503]mh@swivel:~$

Both data and metadata have several GiB free, data ~140 GiB free, and 
metadata isn't into global reserve, so the system isn't totally wedged, 
only partially, due to the lack of unallocated space.

> btrfs balance -mprofiles seems to do something. One kworker and one
> btrfs-transaction process hog one CPU core each for hours, while
> blocking the filesystem for minutes apiece, which leads to the host
> being nearly unusable up to the point of "clock and mouse pointer
> frozen for nearly ten minutes".
> 
> The btrfs balance cancel I issued after four hours of this state took
> eleven minutes alone to complete.

It's worth noting as an aside that Linux isn't necessarily tuned for 
interactivity by default, tho there are definitely ways to make it more 
so.  Additionally, on some mobos at least, it's possible to tweak the 
BIOS balance between interactivity and thruput.  An old Tyan board (PCI 
not the newer PCIE, which avoids some of the problems with multiple 
dedicated buses) I had was tilted a bit heavily toward thruput, which did 
make sense as it was actually a server board, until I tweaked things a 
bit.  That made a LOT of difference, curing the dragging, but also curing 
occasional audio runouts, etc.  Turns out it was simply tuned to do huge 
bus "packets" (I forgot the proper in-context term, and that board died a 
few years ago, so...), increasing thruput, but also increasing latency 
beyond what the sound card and keyboard/mouse (or in that case the human 
operating them) could reasonably deal with.  By shortening the PCI 
"packet length", it reduced thruput a bit but greatly improved latency, 
letting other users have their turn when they needed it, not some time 
later.

Of course in addition to PCIE putting many of those things on dedicated 
buses these days, ssds are so much faster that a lot of things that could 
potentially be problems on spinning rust, simply don't tend to be issues 
on ssds.  As much as anything, I think that's what a lot of users 
bothered by such problems are turning to, and I'd bet that's a good part 
of why SSDs are as popular as they are, as well.  I know I've simply not 
had many of the problems here that others had, and while I think part of 
it is the multiple relatively small but independent filesystems and part 
of it may be because I don't use snapshotting, I also think a major part 
of it is simply that the SSDs I'm running btrfs on are simply so much 
faster than spinning rust that the problems either don't occur, or if 
they do, they're done before I even notice them.

FWIW, I do still use spinning rust, but for my media partition and 
(second) backups, not for anything speed critical at all.  And FWIW, I 
still use reiserfs on 

Re: Global hotspare functionality

2016-04-01 Thread Anand Jain



On 04/02/2016 09:33 AM, Yauhen Kharuzhy wrote:

On Sat, Apr 02, 2016 at 09:15:56AM +0800, Anand Jain wrote:



On 03/30/2016 03:47 AM, Yauhen Kharuzhy wrote:

On Tue, Mar 29, 2016 at 10:41:36PM +0800, Anand Jain wrote:


Hi Yauhen,





Issue 2.
At the start of autoreplacing a drive by hotspare, the kernel crashes in
transaction handling code (inside of btrfs_commit_transaction() called by
the autoreplace initiating routines). I 'fixed' this by removing the
closing of bdev in btrfs_close_one_device_dont_free(), see
https://bitbucket.org/jekhor/linux-btrfs/commits/dfa441c9ec7b3833f6a5e4d0b6f8c678faea29bb?at=master
(oops text is attached also). Bdev is closed after replacing by
btrfs_dev_replace_finishing(), so this is safe but doesn't seem
to be the right way.


  I have sent out V2. I don't see that issue with this,
  could you pls try ?


Yes, it reproduced on v4.4.5 kernel. I will try with current
'for-linus-4.6' Chris' tree soon.

To emulate a drive failure, I disconnect the drive in VirtualBox, so bdev
can be freed by kernel after releasing of all references to it.


   So far the raid group profile would adapt to a lower suitable
   group profile when a device is missing/failed. This appears
   not to be happening with RAID56, OR there is stale IO which wasn't
   flushed out. Anyway, to have this fixed I am moving the patch
btrfs: introduce device dynamic state transition to offline or failed
   to the top in v3 for any potential changes.
   But first we need a reliable test case, or a very carefully
   crafted test case which can create this situation.

   Below is the dm-error test that I am using, which
   apparently doesn't reproduce this issue. Could you please try on V3?
   (Please note the device names are hard coded in the test script,
   sorry about that.) This would eventually be an fstests script.


Sure. But I don't see any V3 patches in the list. Are you still
preparing to send them or I missed something?


 It's out now. There was a little distraction when I was about to send it.



Re: Global hotspare functionality

2016-04-01 Thread Yauhen Kharuzhy
On Sat, Apr 02, 2016 at 09:15:56AM +0800, Anand Jain wrote:
> 
> 
> On 03/30/2016 03:47 AM, Yauhen Kharuzhy wrote:
> >On Tue, Mar 29, 2016 at 10:41:36PM +0800, Anand Jain wrote:
> >>
> >>Hi Yauhen,
> >>
> >
> >>>
> >>>Issue 2.
> >>>At the start of autoreplacing a drive by hotspare, the kernel crashes in
> >>>transaction handling code (inside of btrfs_commit_transaction() called by
> >>>the autoreplace initiating routines). I 'fixed' this by removing the
> >>>closing of bdev in btrfs_close_one_device_dont_free(), see
> >>>https://bitbucket.org/jekhor/linux-btrfs/commits/dfa441c9ec7b3833f6a5e4d0b6f8c678faea29bb?at=master
> >>>(oops text is attached also). Bdev is closed after replacing by
> >>>btrfs_dev_replace_finishing(), so this is safe but doesn't seem
> >>>to be the right way.
> >>
> >>  I have sent out V2. I don't see that issue with this,
> >>  could you pls try ?
> >
> >Yes, it reproduced on v4.4.5 kernel. I will try with current
> >'for-linus-4.6' Chris' tree soon.
> >
> >To emulate a drive failure, I disconnect the drive in VirtualBox, so bdev
> >can be freed by kernel after releasing of all references to it.
> 
>   So far the raid group profile would adapt to a lower suitable
>   group profile when a device is missing/failed. This appears
>   not to be happening with RAID56, OR there is stale IO which wasn't
>   flushed out. Anyway, to have this fixed I am moving the patch
>btrfs: introduce device dynamic state transition to offline or failed
>   to the top in v3 for any potential changes.
>   But first we need a reliable test case, or a very carefully
>   crafted test case which can create this situation.
> 
>   Below is the dm-error test that I am using, which
>   apparently doesn't reproduce this issue. Could you please try on V3?
>   (Please note the device names are hard coded in the test script,
>   sorry about that.) This would eventually be an fstests script.

Sure. But I don't see any V3 patches in the list. Are you still
preparing to send them or I missed something?


-- 
Yauhen Kharuzhy


[PATCH 00/13 v3] Introduce device state 'failed', Hot spare and Auto replace

2016-04-01 Thread Anand Jain
Thanks for various comments, tests and feedback.

Background: Hot spare and Auto replace:
 Hot spare is predominantly used to mitigate or narrow the time
 window of a degraded mode, during which any further disk
 failure might lead to catastrophic data loss. Data center
 storage will generally have a couple of disks reserved as spares,
 so that one will automatically kick in to resilver
 the storage pool and bring the pool back to a healthy state.
 Mainly this is a storage feature rather than a FS feature.
 I believe people acquainted with enterprise storage use cases
 will appreciate the need for it, and so most/all enterprise
 storage has a hot spare feature.

Btrfs device states:
 This patch-set adds the 'failed' state and makes provision to use
 the 'offline' state, as two new device states. To summarize the
 various device states and their meanings:

 /* missing: device wasn't found at the time of mount */
 int missing;

 /*
  * failed: device confirmed to have experienced critical
  * io failure
  */
 int failed;

 /*
  * offline: there is no confirmation that a disk has
  * failed, only an interim communication breakdown, so it is
  * not necessarily a candidate for device replace.
  * Device might be online after user intervention or after
  * block transport layer error recovery.
  */
 int offline;
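
 As a rough illustration of how the three flags relate, here is a small userspace sketch. It is not part of the patch set: the struct and helper names are hypothetical, and the precedence order (failed over offline over missing) is an assumption, chosen because a forced-closed device may also carry the missing flag.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-in for the three flags above; not the real struct btrfs_device. */
struct toy_dev_state {
	int missing;
	int failed;
	int offline;
};

enum toy_dev_status {
	TOY_DEV_ONLINE,
	TOY_DEV_MISSING,
	TOY_DEV_OFFLINE,
	TOY_DEV_FAILED,
};

/* Collapse the flags into one status, most severe first (assumed order). */
enum toy_dev_status toy_classify_dev(const struct toy_dev_state *d)
{
	assert(d != NULL);
	if (d->failed)
		return TOY_DEV_FAILED;
	if (d->offline)
		return TOY_DEV_OFFLINE;
	if (d->missing)
		return TOY_DEV_MISSING;
	return TOY_DEV_ONLINE;
}

/* Only a confirmed failure makes a device a candidate for auto replace. */
bool toy_is_replace_candidate(const struct toy_dev_state *d)
{
	return toy_classify_dev(d) == TOY_DEV_FAILED;
}
```

 An offline device is deliberately not a replace candidate here, since it may come back after user intervention or transport error recovery.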


Device state transition Tuning and visualization:
 Sysfs interfaces are planned to provide the required tuning for
 device state transition, sensitivities and visualization of device
 states. However, the sysfs framework which could provide such an interface
 is being reviewed/tested and is not yet ready. So for the
 testing and debugging of these features I have used an updated
 version of the procfs patch which is on the ML.

  [PATCH] btrfs: debug: procfs-devlist: introduce procfs interface for
the device list for debugging

 I find the above patch very useful, easy to use (as compared to
 sysfs to visualize the device state) and stable.

This patch set does not depend on any of the sysfs patches as such.

Backward compatibility:
 Adds a new incompatibility feature flag
 (BTRFS_FEATURE_INCOMPAT_SPARE_DEV) to manage the spare device
 when older kernels are used. It has been tested to work fine
 with older kernel/progs versions.


Auto replace:
 Replace happens automatically: when any write or flush
 fails, the device will be marked as failed, which
 will stop any further IO attempts to that device. In the next
 commit cycle the auto replace will pick the spare device to
 replace the failed device, and so the btrfs volume is back to a
 healthy state.
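
 The flow described above can be sketched as a toy userspace model. Everything here (struct toy_volume, toy_io_error(), toy_commit_cycle()) is hypothetical and only mirrors the description; it is not the kernel implementation, and treating read errors as harmless is an assumption.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_MAX_DEVS 8

/* Toy model of a volume plus the global spare pool. */
struct toy_volume {
	bool failed[TOY_MAX_DEVS]; /* per-devid 'failed' flag */
	int nr_devs;
	int nr_spares;             /* spares left in the global pool */
	int nr_replaced;           /* replaces started so far */
};

/* A write or flush error marks the device failed (reads ignored here). */
void toy_io_error(struct toy_volume *v, int devid, bool write_or_flush)
{
	assert(devid >= 0 && devid < v->nr_devs);
	if (write_or_flush)
		v->failed[devid] = true;
}

/*
 * Called once per commit cycle: replace the first failed device with a
 * spare. Returns the devid being replaced, or -1 if there is nothing to
 * replace or no spare is left.
 */
int toy_commit_cycle(struct toy_volume *v)
{
	for (int i = 0; i < v->nr_devs; i++) {
		if (!v->failed[i])
			continue;
		if (v->nr_spares == 0)
			return -1;
		v->nr_spares--;
		v->nr_replaced++;
		v->failed[i] = false; /* the spare now stands in for devid i */
		return i;
	}
	return -1;
}
```

 The point of the split is that the IO path only flips a flag, while the heavier work of picking a spare and starting the replace is deferred to the commit cycle.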

Per FSID spare vs Global spare:
 As of now only global hot spare is supported, that is, hot spare(s)
 are for all the btrfs FSes in the system. However, in the future there will
 be a fs_info->no_auto_replace tunable which can be tuned by the user
 to limit the use of the global spare.


Example use case:
 Here below is an example use case of the hot spare setup.

 Add a spare device:
btrfs spare add /dev/sde -f

 If there is a spare device which is already added before the,
 just run

btrfs dev scan [/dev/sde]

 Which will register the spare device to the kernel.

btrfs fi show
 Label: none uuid: 52f170c1-725c-457d-8cfd-d57090460091
  Total devices 2 FS bytes used 112.00KiB
  devid 1 size 2.00GiB used 417.50MiB path /dev/sdc
  devid 2 size 2.00GiB used 417.50MiB path /dev/sdd

Global spare
  device size 3.00GiB path /dev/sde


Patches:

Kernel:
 First, it needs Qu's per-chunk missing device patchset, which is
 part of the set.

 Next, patches 6-9 add support for the spare device. On a kernel without
 the spare feature, the spare device is kept away. And when the kernel
 supports the spare device, it will inhibit mounting it. Further,
 this patch set provides helper functions to pick a spare device and
 release a spare device back to the spare device pool.

 Patch 10 provides helper function to auto replace.
 Patch 11 provides helper function to bring a device to failed state.
 Patch 12 marks a device as failed based on flush and write errors,
  and avoids any further IO to it.
 Last 13 triggers auto replace.

Progs:
 Needs the 4 patches below, which add the sub cli 'spare' to manage
 the spare device. As of now deleting a spare device has to be
 managed using wipefs. However, in the long run we would want a proper
 btrfs command to do that job.

V2->V3:
Kernel:
  Thanks to Yauhen and Austin for the review comments.
  Again split Patch 11 and 12, which were merged in V2, for the better.
  Patch numbers are reordered (sorry about that) but for the better.
  Fix rcu issue in btrfs_get_spare_device(), we don't need rcu
   as it's under uuid_mutex
  Fix rcu issue and to check for replace lock at
   btrfs_auto_replace_start()
  Cleanup old: casualty_kthread() new: health_kthread() with
changes as per
838fe188 'btrfs: cleaner_kthread() doesn't need explicit freeze'
(thanks 

[PATCH 10/13] btrfs: introduce helper functions to perform hot replace

2016-04-01 Thread Anand Jain
Hot replace / auto replace is an important volume manager feature
and is critical to data center operations, so that the degraded
volume can be brought back to a healthy state at the earliest and
without manual intervention.

This modifies the existing replace code to suit the needs of auto
replace; in the long run I hope both code paths will be merged.

Signed-off-by: Anand Jain 
Tested-by: Austin S. Hemmelgarn 
---
 fs/btrfs/dev-replace.c | 43 +++
 fs/btrfs/dev-replace.h |  1 +
 2 files changed, 44 insertions(+)

diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c
index 2b926867d136..ceab4c51db32 100644
--- a/fs/btrfs/dev-replace.c
+++ b/fs/btrfs/dev-replace.c
@@ -957,3 +957,46 @@ void btrfs_bio_counter_inc_blocked(struct btrfs_fs_info *fs_info)
 &fs_info->fs_state));
}
 }
+
+int btrfs_auto_replace_start(struct btrfs_root *root,
+   struct btrfs_device *src_device)
+{
+   int ret;
+   char *tgt_path;
+   char *src_path;
+   struct btrfs_fs_info *fs_info = root->fs_info;
+
+   if (fs_info->sb->s_flags & MS_RDONLY)
+   return -EROFS;
+
+   btrfs_dev_replace_lock(&fs_info->dev_replace, 0);
+   if (btrfs_dev_replace_is_ongoing(&fs_info->dev_replace)) {
+   btrfs_dev_replace_unlock(&fs_info->dev_replace, 0);
+   return -EBUSY;
+   }
+   btrfs_dev_replace_unlock(&fs_info->dev_replace, 0);
+
+   if (btrfs_get_spare_device(&tgt_path)) {
+   btrfs_err(root->fs_info,
+   "No spare device found/configured in the kernel");
+   return -EINVAL;
+   }
+
+   rcu_read_lock();
+   src_path = kstrdup(rcu_str_deref(src_device->name), GFP_ATOMIC);
+   rcu_read_unlock();
+   if (!src_path) {
+   kfree(tgt_path);
+   return -ENOMEM;
+   }
+   ret = btrfs_dev_replace_start(root, tgt_path,
+   src_device->devid, src_path,
+   BTRFS_IOCTL_DEV_REPLACE_CONT_READING_FROM_SRCDEV_MODE_AVOID);
+   if (ret)
+   btrfs_put_spare_device(tgt_path);
+
+   kfree(tgt_path);
+   kfree(src_path);
+
+   return 0;
+}
diff --git a/fs/btrfs/dev-replace.h b/fs/btrfs/dev-replace.h
index e922b42d91df..b918b9d6e5df 100644
--- a/fs/btrfs/dev-replace.h
+++ b/fs/btrfs/dev-replace.h
@@ -46,4 +46,5 @@ static inline void btrfs_dev_replace_stats_inc(atomic64_t *stat_value)
 {
atomic64_inc(stat_value);
 }
+int btrfs_auto_replace_start(struct btrfs_root *root, struct btrfs_device *src_device);
 #endif
-- 
2.7.0



[PATCH 02/13] btrfs: Do per-chunk check for mount time check

2016-04-01 Thread Anand Jain
From: Qu Wenruo 

Now use btrfs_check_degradable() to do the mount time degraded check.

With this patch, we can now mount in the following case:
 # mkfs.btrfs -f -m raid1 -d single /dev/sdb /dev/sdc
 # wipefs -a /dev/sdc
 # mount /dev/sdb /mnt/btrfs -o degraded
 As the single data chunk is only on sdb, it's OK to mount as degraded,
 as missing one device is OK for RAID1.

But it still fails in the following case, as expected:
 # mkfs.btrfs -f -m raid1 -d single /dev/sdb /dev/sdc
 # wipefs -a /dev/sdb
 # mount /dev/sdc /mnt/btrfs -o degraded
 As the data chunk is only on sdb, it's not OK to mount it as degraded.
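
The idea behind the two cases can be sketched in plain userspace C. This is a simplified model, not the kernel function: the toy_* names, the bitmask representation of devices, and the reduction to only the single and RAID1 profiles are assumptions. The return convention (negative: not mountable, positive: mountable only as degraded, zero: nothing missing) is inferred from how the diff below uses the result.

```c
#include <assert.h>
#include <stddef.h>

enum toy_profile { TOY_SINGLE, TOY_RAID1 };

struct toy_chunk {
	enum toy_profile profile;
	unsigned int dev_mask; /* bit i set: chunk has a stripe on device i */
};

/* How many missing devices a chunk of this profile survives. */
static int tolerated_missing(enum toy_profile p)
{
	return p == TOY_RAID1 ? 1 : 0;
}

static int count_bits(unsigned int x)
{
	int n = 0;

	for (; x; x &= x - 1)
		n++;
	return n;
}

/*
 * Check every chunk on its own instead of using one global limit.
 * Returns -1 if some chunk lost more devices than its profile
 * tolerates, 1 if some chunk is degraded but still usable, else 0.
 */
int toy_check_degradable(const struct toy_chunk *chunks, size_t n,
			 unsigned int missing_mask)
{
	int degraded = 0;

	assert(chunks != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		int missing = count_bits(chunks[i].dev_mask & missing_mask);

		if (missing > tolerated_missing(chunks[i].profile))
			return -1;
		if (missing > 0)
			degraded = 1;
	}
	return degraded;
}
```

With sdb as bit 0 and sdc as bit 1, a RAID1 metadata chunk has mask 0x3 and the single data chunk has mask 0x1. Losing sdc (missing mask 0x2) gives 1, while losing sdb (missing mask 0x1) gives -1, matching the two mount examples.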

Reported-by: Zhao Lei 
Reported-by: Anand Jain 
Signed-off-by: Qu Wenruo 

[Btrfs: use btrfs_error instead of btrfs_err during mount]
Signed-off-by: Anand Jain 
---
 fs/btrfs/disk-io.c | 18 ++
 1 file changed, 10 insertions(+), 8 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index c95e3ce9f22e..bfea0f8f6a87 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -2880,6 +2880,16 @@ int open_ctree(struct super_block *sb,
goto fail_tree_roots;
}
 
+   ret = btrfs_check_degradable(fs_info, fs_info->sb->s_flags);
+   if (ret < 0) {
+   btrfs_err(fs_info, "degraded writable mount failed %d", ret);
+   goto fail_tree_roots;
+   } else if (ret > 0 && !btrfs_test_opt(chunk_root, DEGRADED)) {
+   btrfs_warn(fs_info,
+   "Some device missing, but still degraded mountable, please mount with -o degraded option");
+   ret = -EACCES;
+   goto fail_tree_roots;
+   }
/*
 * keep the device that is marked to be the target device for the
 * dev_replace procedure
@@ -2983,14 +2993,6 @@ retry_root_backup:
}
fs_info->num_tolerated_disk_barrier_failures =
btrfs_calc_num_tolerated_disk_barrier_failures(fs_info);
-   if (fs_info->fs_devices->missing_devices >
-fs_info->num_tolerated_disk_barrier_failures &&
-   !(sb->s_flags & MS_RDONLY)) {
-   pr_warn("BTRFS: missing devices(%llu) exceeds the limit(%d), writeable mount is not allowed\n",
-   fs_info->fs_devices->missing_devices,
-   fs_info->num_tolerated_disk_barrier_failures);
-   goto fail_sysfs;
-   }
 
fs_info->cleaner_kthread = kthread_run(cleaner_kthread, tree_root,
   "btrfs-cleaner");
-- 
2.7.0



[PATCH 05/13] btrfs: Cleanup num_tolerated_disk_barrier_failures

2016-04-01 Thread Anand Jain
From: Qu Wenruo 

As we now use the per-chunk degradable check, the global
num_tolerated_disk_barrier_failures is of no use. So clean it up.

Signed-off-by: Qu Wenruo 

[Btrfs: resolve conflict to apply 'btrfs: Cleanup num_tolerated_disk_barrier_failures']
Signed-off-by: Anand Jain 
---
 fs/btrfs/ctree.h   |  2 --
 fs/btrfs/disk-io.c | 56 --
 fs/btrfs/disk-io.h |  2 --
 fs/btrfs/volumes.c | 17 -
 4 files changed, 77 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 84a6a5b3384a..e0a50f478e01 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1829,8 +1829,6 @@ struct btrfs_fs_info {
/* next backup root to be overwritten */
int backup_root_index;
 
-   int num_tolerated_disk_barrier_failures;
-
/* device replace state */
struct btrfs_dev_replace dev_replace;
 
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 85e26d62c089..7f02f1766037 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -2991,8 +2991,6 @@ retry_root_backup:
 printk(KERN_ERR "BTRFS: Failed to read block groups: %d\n", ret);
goto fail_sysfs;
}
-   fs_info->num_tolerated_disk_barrier_failures =
-   btrfs_calc_num_tolerated_disk_barrier_failures(fs_info);
 
fs_info->cleaner_kthread = kthread_run(cleaner_kthread, tree_root,
   "btrfs-cleaner");
@@ -3559,60 +3557,6 @@ int btrfs_get_num_tolerated_disk_barrier_failures(u64 flags)
return min_tolerated;
 }
 
-int btrfs_calc_num_tolerated_disk_barrier_failures(
-   struct btrfs_fs_info *fs_info)
-{
-   struct btrfs_ioctl_space_info space;
-   struct btrfs_space_info *sinfo;
-   u64 types[] = {BTRFS_BLOCK_GROUP_DATA,
-  BTRFS_BLOCK_GROUP_SYSTEM,
-  BTRFS_BLOCK_GROUP_METADATA,
-  BTRFS_BLOCK_GROUP_DATA | BTRFS_BLOCK_GROUP_METADATA};
-   int i;
-   int c;
-   int num_tolerated_disk_barrier_failures =
-   (int)fs_info->fs_devices->num_devices;
-
-   for (i = 0; i < ARRAY_SIZE(types); i++) {
-   struct btrfs_space_info *tmp;
-
-   sinfo = NULL;
-   rcu_read_lock();
-   list_for_each_entry_rcu(tmp, &fs_info->space_info, list) {
-   if (tmp->flags == types[i]) {
-   sinfo = tmp;
-   break;
-   }
-   }
-   rcu_read_unlock();
-
-   if (!sinfo)
-   continue;
-
-   down_read(&sinfo->groups_sem);
-   for (c = 0; c < BTRFS_NR_RAID_TYPES; c++) {
-   u64 flags;
-
-   if (list_empty(&sinfo->block_groups[c]))
-   continue;
-
-   btrfs_get_block_group_info(&sinfo->block_groups[c],
-  &space);
-   if (space.total_bytes == 0 || space.used_bytes == 0)
-   continue;
-   flags = space.flags;
-
-   num_tolerated_disk_barrier_failures = min(
-   num_tolerated_disk_barrier_failures,
-   btrfs_get_num_tolerated_disk_barrier_failures(
-   flags));
-   }
-   up_read(&sinfo->groups_sem);
-   }
-
-   return num_tolerated_disk_barrier_failures;
-}
-
 static int write_all_supers(struct btrfs_root *root, int max_mirrors)
 {
struct list_head *head;
diff --git a/fs/btrfs/disk-io.h b/fs/btrfs/disk-io.h
index 8e79d0070bcf..dd155621f95f 100644
--- a/fs/btrfs/disk-io.h
+++ b/fs/btrfs/disk-io.h
@@ -141,8 +141,6 @@ struct btrfs_root *btrfs_create_tree(struct btrfs_trans_handle *trans,
 int btree_lock_page_hook(struct page *page, void *data,
void (*flush_fn)(void *));
 int btrfs_get_num_tolerated_disk_barrier_failures(u64 flags);
-int btrfs_calc_num_tolerated_disk_barrier_failures(
-   struct btrfs_fs_info *fs_info);
 int __init btrfs_end_io_wq_init(void);
 void btrfs_end_io_wq_exit(void);
 
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index a840d78ba127..dff2deaf88d3 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -1876,9 +1876,6 @@ int btrfs_rm_device(struct btrfs_root *root, char *device_path)
free_fs_devices(cur_devices);
}
 
-   root->fs_info->num_tolerated_disk_barrier_failures =
-   btrfs_calc_num_tolerated_disk_barrier_failures(root->fs_info);
-
/*
 * at this point, the device is zero sized.  We want to
 * remove it from the devices list and zero out the old super
@@ -2405,8 +2402,6 @@ int btrfs_init_new_device(struct btrfs_root *root, char *device_path)
 

[PATCH 11/13] btrfs: introduce device dynamic state transition to offline or failed

2016-04-01 Thread Anand Jain
This patch provides helper functions to force a device to offline
or failed, and we need these device states for the following reasons:
1) a. it can be reported that a device has failed when it does
   b. close the device when it goes offline so that the block layer can
  clean up
2) identify the candidate for the auto replace
3) avoid further commit errors reported against the failing device and
4) a device in a multi device btrfs may go offline from the system
   (but as of now in some system configs btrfs gets unmounted in this
context, which is not correct behavior)

Signed-off-by: Anand Jain 
Tested-by: Austin S. Hemmelgarn 
---
 fs/btrfs/volumes.c | 137 +
 fs/btrfs/volumes.h |  13 +
 2 files changed, 150 insertions(+)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 072cefac958c..eb9f28504d3f 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -7149,3 +7149,140 @@ out:
 read_unlock(&map_tree->map_tree.lock);
return ret;
 }
+
+static void __close_device(struct work_struct *work)
+{
+   struct btrfs_device *device;
+
+   device = container_of(work, struct btrfs_device, rcu_work);
+
+   if (device->bdev)
+   blkdev_put(device->bdev, device->mode);
+
+   device->bdev = NULL;
+}
+
+static void close_device(struct rcu_head *head)
+{
+   struct btrfs_device *device;
+
+   device = container_of(head, struct btrfs_device, rcu);
+
+   INIT_WORK(&device->rcu_work, __close_device);
+   schedule_work(&device->rcu_work);
+}
+
+void btrfs_close_one_device_dont_free(struct btrfs_device *device)
+{
+   struct btrfs_fs_devices *fs_devices = device->fs_devices;
+
+   if (device->bdev)
+   fs_devices->open_devices--;
+
+   if (device->writeable &&
+   device->devid != BTRFS_DEV_REPLACE_DEVID) {
+   list_del_init(&device->dev_alloc_list);
+   fs_devices->rw_devices--;
+   }
+
+   device->writeable = 0;
+
+   call_rcu(&device->rcu, close_device);
+}
+
+void force_device_close(struct btrfs_device *device)
+{
+   struct btrfs_device *next_device;
+   struct btrfs_fs_devices *fs_devices;
+
+   fs_devices = device->fs_devices;
+
+   mutex_lock(&fs_devices->device_list_mutex);
+   lock_chunks(fs_devices->fs_info->fs_root);
+
+   next_device = list_entry(fs_devices->devices.next,
+   struct btrfs_device, dev_list);
+   if (device->bdev == fs_devices->fs_info->sb->s_bdev)
+   fs_devices->fs_info->sb->s_bdev = next_device->bdev;
+
+   if (device->bdev == fs_devices->latest_bdev)
+   fs_devices->latest_bdev = next_device->bdev;
+
+   btrfs_close_one_device_dont_free(device);
+
+   /*
+* TODO: works for now, but its better to keep the state of
+* missing and offline different, and update rest of the
+* places where we check for only missing and not for failed
+* or offline as of now.
+*/
+   device->missing = 1;
+   fs_devices->missing_devices++;
+   device->writeable = 0;
+
+   rcu_barrier();
+
+   unlock_chunks(fs_devices->fs_info->fs_root);
+   mutex_unlock(&fs_devices->device_list_mutex);
+}
+
+void btrfs_enforce_device_state(struct btrfs_device *dev, char *why)
+{
+   bool degrade_option;
+   int tolerated_fail;
+   struct btrfs_fs_info *fs_info;
+   struct btrfs_fs_devices *fs_devices;
+
+   fs_devices = dev->fs_devices;
+   fs_info = fs_devices->fs_info;
+   degrade_option = btrfs_test_opt(fs_info->fs_root, DEGRADED);
+
+   /* todo: support seed later */
+   if (fs_devices->seeding)
+   return;
+
+   /* this shouldn't be called if device is already missing */
+   if (dev->missing || !dev->bdev)
+   return;
+
+   if (dev->offline || dev->failed)
+   return;
+
+   /* Only RW device is requested to force close let FS handle it*/
+   if (fs_devices->rw_devices == 1) {
+   btrfs_std_error(fs_info, -EIO,
+   "force offline last RW device");
+   return;
+   }
+
+   if (!strcmp(why, "offline"))
+   dev->offline = 1;
+   else if (!strcmp(why, "failed"))
+   dev->failed = 1;
+   else
+   return;
+
+   btrfs_sysfs_rm_device_link(fs_devices, dev);
+
+   force_device_close(dev);
+
+   tolerated_fail = btrfs_check_degradable(fs_info,
+   fs_info->sb->s_flags);
+   if (tolerated_fail > 0) {
+   btrfs_warn_in_rcu(fs_info, "device %s %s, chunks degraded",
+   rcu_str_deref(dev->name), why);
+   } else if(tolerated_fail < 0) {
+   btrfs_warn_in_rcu(fs_info,
+   "device %s %s, chunks failed",
+   rcu_str_deref(dev->name), why);
+   

[PATCH 03/13] btrfs: Do per-chunk degraded check for remount

2016-04-01 Thread Anand Jain
From: Qu Wenruo 

As with the mount-time check, use the new btrfs_check_degradable() to do a
per-chunk check.

Signed-off-by: Qu Wenruo 

Btrfs: use btrfs_error instead of btrfs_err during remount

Signed-off-by: Anand Jain 
---
 fs/btrfs/super.c | 11 +++
 1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index 00b8f37cc306..87639fa53b10 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -1767,11 +1767,14 @@ static int btrfs_remount(struct super_block *sb, int *flags, char *data)
goto restore;
}
 
-   if (fs_info->fs_devices->missing_devices >
-fs_info->num_tolerated_disk_barrier_failures &&
-   !(*flags & MS_RDONLY)) {
+   ret = btrfs_check_degradable(fs_info, *flags);
+   if (ret < 0) {
+   btrfs_err(fs_info,
+   "degraded writable remount failed %d", ret);
+   goto restore;
+   } else if (ret > 0 && !btrfs_test_opt(root, DEGRADED)) {
btrfs_warn(fs_info,
-   "too many missing devices, writeable remount is not allowed");
+   "some device missing, but still degraded mountable, please remount with -o degraded option");
ret = -EACCES;
goto restore;
}
-- 
2.7.0

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 01/13] btrfs: Introduce a new function to check if all chunks are OK for degraded mount

2016-04-01 Thread Anand Jain
From: Qu Wenruo 

Introduce a new function, btrfs_check_degradable(), to judge whether all chunks
in a btrfs filesystem are OK for a degraded mount.

It provides a new basis for accurate mount/remount and even runtime
degraded-mount checks, replacing the old one-size-fits-all method.

Signed-off-by: Qu Wenruo 
---
 fs/btrfs/volumes.c | 63 ++
 fs/btrfs/volumes.h |  1 +
 2 files changed, 64 insertions(+)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index e2b54d546b7c..dd3dc53a302a 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -7042,3 +7042,66 @@ static void btrfs_close_one_device(struct btrfs_device *device)
 
call_rcu(&device->rcu, free_device);
 }
+
+/*
+ * Check if all chunks in the fs are OK for degraded mount
+ * Caller itself should do extra check if DEGRADED mount option is given
+ * for >0 return value.
+ *
+ * Return 0 if all chunks are OK.
+ * Return >0 if all chunks are degradable but not all OK.
+ * Return <0 if any chunk is not degradable or other bug.
+ */
+int btrfs_check_degradable(struct btrfs_fs_info *fs_info, unsigned flags)
+{
+   struct btrfs_mapping_tree *map_tree = &fs_info->mapping_tree;
+   struct extent_map *em;
+   u64 next_start = 0;
+   int ret = 0;
+
+   if (flags & MS_RDONLY)
+   return 0;
+
+   read_lock(&map_tree->map_tree.lock);
+   em = lookup_extent_mapping(&map_tree->map_tree, 0, (u64)(-1));
+   /* No chunk at all? Should be a huge bug */
+   if (!em) {
+   ret = -ENOENT;
+   goto out;
+   }
+
+   while (em) {
+   struct map_lookup *map;
+   int missing = 0;
+   int max_tolerated;
+   int i;
+
+   map = (struct map_lookup *) em->bdev;
+   max_tolerated =
+   btrfs_get_num_tolerated_disk_barrier_failures(
+   map->type);
+   for (i = 0; i < map->num_stripes; i++) {
+   if (map->stripes[i].dev->missing)
+   missing++;
+   }
+   if (missing > max_tolerated) {
+   ret = -EIO;
+   btrfs_warn(fs_info,
+  "missing devices(%d) exceeds the limit(%d), writeable mount is not allowed",
+  missing, max_tolerated);
+   goto out;
+   } else if (missing)
+   ret = 1;
+   next_start = extent_map_end(em);
+
+   /*
+* Always search range [next_start, (u64)-1) to find the next
+* chunk map
+*/
+   em = lookup_extent_mapping(&map_tree->map_tree, next_start,
+  (u64)(-1) - next_start);
+   }
+out:
+   read_unlock(&map_tree->map_tree.lock);
+   return ret;
+}
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index 1939ebde63df..351431a3f5aa 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -566,5 +566,6 @@ static inline void unlock_chunks(struct btrfs_root *root)
 struct list_head *btrfs_get_fs_uuids(void);
 void btrfs_set_fs_info_ptr(struct btrfs_fs_info *fs_info);
 void btrfs_reset_fs_info_ptr(struct btrfs_fs_info *fs_info);
+int btrfs_check_degradable(struct btrfs_fs_info *fs_info, unsigned flags);
 
 #endif
-- 
2.7.0
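
The per-chunk idea in the patch above can be modelled outside the kernel. The following is a minimal user-space sketch, not the kernel code: the `model_*` names and structures are hypothetical stand-ins, and `max_tolerated` is supplied directly instead of being derived from the chunk profile. It only mirrors the documented return convention (0, >0, <0).

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical user-space stand-ins for the kernel structures. */
struct model_stripe {
    int dev_missing;            /* 1 if the backing device is missing */
};

struct model_chunk {
    int max_tolerated;          /* failures this chunk's profile can absorb */
    size_t num_stripes;
    const struct model_stripe *stripes;
};

/*
 * Same return convention as documented in the patch:
 *   0   every chunk has all of its devices (or the mount is read-only),
 *   > 0 some chunk is degraded, but every chunk is within its tolerance,
 *   < 0 some chunk lost more devices than it tolerates, or no chunk exists.
 */
static int model_check_degradable(const struct model_chunk *chunks, size_t n,
                                  int read_only)
{
    int ret = 0;
    size_t c, i;

    assert(chunks != NULL || n == 0);

    if (read_only)
        return 0;
    if (n == 0)
        return -ENOENT;

    for (c = 0; c < n; c++) {
        int missing = 0;

        for (i = 0; i < chunks[c].num_stripes; i++)
            if (chunks[c].stripes[i].dev_missing)
                missing++;

        if (missing > chunks[c].max_tolerated)
            return -EIO;        /* this chunk cannot be written safely */
        if (missing)
            ret = 1;            /* degraded, but still within tolerance */
    }
    return ret;
}
```

In this model a two-stripe chunk that tolerates one failure and has lost one device yields 1, while a one-stripe chunk with zero tolerance on the same lost device yields -EIO. That is the case a single filesystem-wide counter cannot tell apart.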



[PATCH 07/13] btrfs: add check not to mount a spare device

2016-04-01 Thread Anand Jain
Spare devices can be scanned but shouldn't be mountable.

Signed-off-by: Anand Jain 
Tested-by: Austin S. Hemmelgarn 
---
 fs/btrfs/disk-io.c | 8 
 1 file changed, 8 insertions(+)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 7f02f1766037..b99329e37965 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -2806,6 +2806,14 @@ int open_ctree(struct super_block *sb,
goto fail_alloc;
}
 
+   if (btrfs_super_incompat_flags(disk_super) &
+   BTRFS_FEATURE_INCOMPAT_SPARE_DEV) {
+   /*You can only scan a spare device but not mount*/
+   printk(KERN_ERR "BTRFS: You can't mount a spare device\n");
+   err = -ENOTSUPP;
+   goto fail_alloc;
+   }
+
/*
 * Needn't use the lock because there is no other task which will
 * update the flag.
-- 
2.7.0



[PATCH 08/13] btrfs: support btrfs dev scan for spare device

2016-04-01 Thread Anand Jain
When the user or the system calls the BTRFS_IOC_SCAN_DEV ioctl,
this patch makes sure the device is added to the device
list and marked as spare.

The same applies to BTRFS_IOC_DEVICES_READY, since that
ioctl has historically performed a scan as well.

Signed-off-by: Anand Jain 
Tested-by: Austin S. Hemmelgarn 
---
 fs/btrfs/volumes.c | 4 
 fs/btrfs/volumes.h | 2 ++
 2 files changed, 6 insertions(+)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index dff2deaf88d3..d729539f9612 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -604,6 +604,10 @@ static noinline int device_list_add(const char *path,
if (IS_ERR(fs_devices))
return PTR_ERR(fs_devices);
 
+   if (btrfs_super_incompat_flags(disk_super) &
+   BTRFS_FEATURE_INCOMPAT_SPARE_DEV)
+   fs_devices->spare = 1;
+
list_add(&fs_devices->list, &fs_uuids);
 
device = NULL;
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index 48ced5cc09e4..51cf716eb35b 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -263,6 +263,8 @@ struct btrfs_fs_devices {
struct kobject fsid_kobj;
struct kobject *device_dir_kobj;
struct completion kobj_unregister;
+
+   int spare;
 };
 
 #define BTRFS_BIO_INLINE_CSUM_SIZE 64
-- 
2.7.0



[PATCH 12/13] btrfs: check device for critical errors and mark failed

2016-04-01 Thread Anand Jain
Write and flush errors are considered critical errors,
upon which the device is brought offline and marked as
failed. Write and flush errors are identified using the device
error statistics, which are monitored by a kthread,
btrfs_health.

Signed-off-by: Anand Jain 
Tested-by: Austin S. Hemmelgarn 
---
 fs/btrfs/ctree.h   |   2 ++
 fs/btrfs/disk-io.c | 101 -
 fs/btrfs/volumes.c |   1 +
 fs/btrfs/volumes.h |   4 +++
 4 files changed, 107 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index aa693cfdc9f0..47e9cd9dd29a 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1569,6 +1569,7 @@ struct btrfs_fs_info {
struct mutex tree_log_mutex;
struct mutex transaction_kthread_mutex;
struct mutex cleaner_mutex;
+   struct mutex health_mutex;
struct mutex chunk_mutex;
struct mutex volume_mutex;
 
@@ -1686,6 +1687,7 @@ struct btrfs_fs_info {
struct btrfs_workqueue *extent_workers;
struct task_struct *transaction_kthread;
struct task_struct *cleaner_kthread;
+   struct task_struct *health_kthread;
int thread_pool_size;
 
struct kobject *space_info_kobj;
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index b99329e37965..b523e56b34e9 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1869,6 +1869,93 @@ sleep:
return 0;
 }
 
+/*
+ * returns:
+ * < 0 : Check didn't run, std error
+ *   0 : No errors found
+ * > 0 : # of devices having fatal errors
+ */
+static int btrfs_update_devices_health(struct btrfs_root *root)
+{
+   int ret = 0;
+   struct btrfs_device *device;
+   struct btrfs_fs_info *fs_info = root->fs_info;
+
+   if (btrfs_fs_closing(fs_info))
+   return -EBUSY;
+
+   /* mark disk(s) with write or flush error(s) as failed */
+   mutex_lock(&fs_info->volume_mutex);
+   list_for_each_entry_rcu(device,
+   &fs_info->fs_devices->devices, dev_list) {
+   int c_err;
+
+   if (device->failed) {
+   ret++;
+   continue;
+   }
+
+   /*
+* todo: replace target device's write/flush error,
+* skip for now
+*/
+   if (device->is_tgtdev_for_dev_replace)
+   continue;
+
+   if (!device->dev_stats_valid)
+   continue;
+
+   c_err = atomic_read(&device->new_critical_errs);
+   atomic_sub(c_err, &device->new_critical_errs);
+   if (c_err) {
+   btrfs_crit_in_rcu(fs_info,
+   "fatal error on device %s",
+   rcu_str_deref(device->name));
+   btrfs_enforce_device_state(device, "failed");
+   ret ++;
+   }
+   }
+   mutex_unlock(&fs_info->volume_mutex);
+
+   return ret;
+}
+
+/*
+ * Devices health maintenance kthread, gets woken-up by transaction
+ * kthread, once sysfs is ready, this should publish the report
+ * through sysfs so that user land scripts can invoke actions.
+ */
+static int health_kthread(void *arg)
+{
+   struct btrfs_root *root = arg;
+
+   do {
+   if (btrfs_need_cleaner_sleep(root))
+   goto sleep;
+
+   if (!mutex_trylock(&root->fs_info->health_mutex))
+   goto sleep;
+
+   if (btrfs_need_cleaner_sleep(root)) {
+   mutex_unlock(&root->fs_info->health_mutex);
+   goto sleep;
+   }
+
+   /* Check devices health */
+   btrfs_update_devices_health(root);
+
+   mutex_unlock(&root->fs_info->health_mutex);
+
+sleep:
+   set_current_state(TASK_INTERRUPTIBLE);
+   if (!kthread_should_stop())
+   schedule();
+   __set_current_state(TASK_RUNNING);
+   } while (!kthread_should_stop());
+
+   return 0;
+}
+
 static int transaction_kthread(void *arg)
 {
struct btrfs_root *root = arg;
@@ -1915,6 +2002,7 @@ static int transaction_kthread(void *arg)
btrfs_end_transaction(trans, root);
}
 sleep:
+   wake_up_process(root->fs_info->health_kthread);
wake_up_process(root->fs_info->cleaner_kthread);
mutex_unlock(&root->fs_info->transaction_kthread_mutex);
 
@@ -2663,6 +2751,7 @@ int open_ctree(struct super_block *sb,
mutex_init(&fs_info->chunk_mutex);
mutex_init(&fs_info->transaction_kthread_mutex);
mutex_init(&fs_info->cleaner_mutex);
+   mutex_init(&fs_info->health_mutex);
mutex_init(&fs_info->volume_mutex);
mutex_init(&fs_info->ro_block_group_mutex);
init_rwsem(&fs_info->commit_root_sem);
@@ -3005,11 +3094,16 @@ retry_root_backup:
if 

[PATCH 06/13] btrfs: introduce BTRFS_FEATURE_INCOMPAT_SPARE_DEV

2016-04-01 Thread Anand Jain
Add the BTRFS_FEATURE_INCOMPAT_SPARE_DEV (0x400) flag to identify
a spare device.

Along with this, the mount path checks for the flag and fails the
mount, as spare devices aren't mountable.

Signed-off-by: Anand Jain 
Tested-by: Austin S. Hemmelgarn 
---
 fs/btrfs/ctree.h | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index e0a50f478e01..2c185a8e92f0 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -531,6 +531,7 @@ struct btrfs_super_block {
 #define BTRFS_FEATURE_INCOMPAT_RAID56  (1ULL << 7)
 #define BTRFS_FEATURE_INCOMPAT_SKINNY_METADATA (1ULL << 8)
 #define BTRFS_FEATURE_INCOMPAT_NO_HOLES(1ULL << 9)
+#define BTRFS_FEATURE_INCOMPAT_SPARE_DEV   (1ULL << 10)
 
 #define BTRFS_FEATURE_COMPAT_SUPP  0ULL
 #define BTRFS_FEATURE_COMPAT_SAFE_SET  0ULL
@@ -551,7 +552,8 @@ struct btrfs_super_block {
 BTRFS_FEATURE_INCOMPAT_RAID56 |\
 BTRFS_FEATURE_INCOMPAT_EXTENDED_IREF | \
 BTRFS_FEATURE_INCOMPAT_SKINNY_METADATA |   \
-BTRFS_FEATURE_INCOMPAT_NO_HOLES)
+BTRFS_FEATURE_INCOMPAT_NO_HOLES |  \
+BTRFS_FEATURE_INCOMPAT_SPARE_DEV)
 
 #define BTRFS_FEATURE_INCOMPAT_SAFE_SET\
(BTRFS_FEATURE_INCOMPAT_EXTENDED_IREF)
-- 
2.7.0
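
For readers unfamiliar with incompat feature bits, here is a small user-space sketch of the bit test that the spare-device patches rely on. The bit position (1ULL << 10) comes from the patch; the `MODEL_`/`model_` names are hypothetical, and the sketch returns the POSIX -ENOTSUP where the kernel patch uses its internal -ENOTSUPP.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Bit 10, as defined in the patch; numerically this is 0x400. */
#define MODEL_INCOMPAT_SPARE_DEV (1ULL << 10)

_Static_assert(MODEL_INCOMPAT_SPARE_DEV == 0x400, "bit 10 is 0x400");

/* Hypothetical helper: nonzero when the superblock flags mark a spare. */
static int model_is_spare(uint64_t incompat_flags)
{
    return (incompat_flags & MODEL_INCOMPAT_SPARE_DEV) != 0;
}

/*
 * Hypothetical model of the mount-time decision: a device carrying the
 * spare flag may be scanned, but a mount attempt is refused.
 */
static int model_mount_check(uint64_t incompat_flags)
{
    if (model_is_spare(incompat_flags))
        return -ENOTSUP;
    return 0;
}
```

Other incompat bits set alongside the spare bit do not change the outcome in this model; only bit 10 is inspected.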



[PATCH 04/13] btrfs: Allow barrier_all_devices to do per-chunk device check

2016-04-01 Thread Anand Jain
From: Qu Wenruo 

The last user of num_tolerated_disk_barrier_failures is
barrier_all_devices(), but it can easily be changed to use the new per-chunk
degradable check framework.

Now btrfs_device has two extra members, representing send/wait
errors, set at write_dev_flush() time. They are then checked in a way that is
similar to, but more accurate than, the old code.

Signed-off-by: Qu Wenruo 
---
 fs/btrfs/disk-io.c | 13 +
 fs/btrfs/volumes.c |  6 +-
 fs/btrfs/volumes.h |  4 
 3 files changed, 14 insertions(+), 9 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index bfea0f8f6a87..85e26d62c089 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -3491,8 +3491,6 @@ static int barrier_all_devices(struct btrfs_fs_info *info)
 {
struct list_head *head;
struct btrfs_device *dev;
-   int errors_send = 0;
-   int errors_wait = 0;
int ret;
 
/* send down all the barriers */
@@ -3501,7 +3499,7 @@ static int barrier_all_devices(struct btrfs_fs_info *info)
if (dev->missing)
continue;
if (!dev->bdev) {
-   errors_send++;
+   dev->err_send = 1;
continue;
}
if (!dev->in_fs_metadata || !dev->writeable)
@@ -3509,7 +3507,7 @@ static int barrier_all_devices(struct btrfs_fs_info *info)
 
ret = write_dev_flush(dev, 0);
if (ret)
-   errors_send++;
+   dev->err_send = 1;
}
 
/* wait for all the barriers */
@@ -3517,7 +3515,7 @@ static int barrier_all_devices(struct btrfs_fs_info *info)
if (dev->missing)
continue;
if (!dev->bdev) {
-   errors_wait++;
+   dev->err_wait = 1;
continue;
}
if (!dev->in_fs_metadata || !dev->writeable)
@@ -3525,10 +3523,9 @@ static int barrier_all_devices(struct btrfs_fs_info *info)
 
ret = write_dev_flush(dev, 1);
if (ret)
-   errors_wait++;
+   dev->err_wait = 1;
}
-   if (errors_send > info->num_tolerated_disk_barrier_failures ||
-   errors_wait > info->num_tolerated_disk_barrier_failures)
+   if (btrfs_check_degradable(info, info->sb->s_flags) < 0)
return -EIO;
return 0;
 }
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index dd3dc53a302a..a840d78ba127 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -7081,8 +7081,12 @@ int btrfs_check_degradable(struct btrfs_fs_info *fs_info, unsigned flags)
btrfs_get_num_tolerated_disk_barrier_failures(
map->type);
for (i = 0; i < map->num_stripes; i++) {
-   if (map->stripes[i].dev->missing)
+   if (map->stripes[i].dev->missing ||
+   map->stripes[i].dev->err_wait ||
+   map->stripes[i].dev->err_send)
missing++;
+   map->stripes[i].dev->err_wait = 0;
+   map->stripes[i].dev->err_send = 0;
}
if (missing > max_tolerated) {
ret = -EIO;
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index 351431a3f5aa..48ced5cc09e4 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -76,6 +76,10 @@ struct btrfs_device {
int can_discard;
int is_tgtdev_for_dev_replace;
 
+   /* for barrier_all_devices() check */
+   int err_send;
+   int err_wait;
+
 #ifdef __BTRFS_NEED_DEVICE_DATA_ORDERED
seqcount_t data_seqcount;
 #endif
-- 
2.7.0
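
The way the patch above folds the per-device flush flags into the per-chunk count can be sketched in user space as follows. This is only a model with hypothetical `model_*` names; it copies the check-then-clear order of the stripe loop in the hunk and nothing else.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the fields the patch adds to btrfs_device. */
struct model_dev {
    int missing;
    int err_send;   /* flush could not be sent */
    int err_wait;   /* flush did not complete */
};

struct model_chunk {
    size_t num_stripes;
    struct model_dev **devs;
};

/*
 * Count the devices of one chunk that are missing or failed a flush,
 * clearing the flush flags as each stripe is visited (same order as
 * the stripe loop in the patch).
 */
static int model_count_failed(struct model_chunk *chunk)
{
    int failed = 0;
    size_t i;

    assert(chunk != NULL);

    for (i = 0; i < chunk->num_stripes; i++) {
        struct model_dev *d = chunk->devs[i];

        if (d->missing || d->err_wait || d->err_send)
            failed++;
        d->err_wait = 0;
        d->err_send = 0;
    }
    return failed;
}
```

One thing this model makes visible: because the flags are cleared while the first chunk is walked, a second chunk that shares the same device no longer sees the flush error. Whether that matters for the real code is a question for review, not something this sketch settles.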



[PATCH 09/13] btrfs: provide framework to get and put a spare device

2016-04-01 Thread Anand Jain
This adds functions to get and put a spare device from the list,
so that the hot replace code can pick a spare device when needed.

Signed-off-by: Anand Jain 
Tested-by: Austin S. Hemmelgarn 
---
 fs/btrfs/ctree.h   |  1 +
 fs/btrfs/super.c   |  5 +
 fs/btrfs/volumes.c | 53 +
 fs/btrfs/volumes.h |  2 ++
 4 files changed, 61 insertions(+)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 2c185a8e92f0..aa693cfdc9f0 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -4185,6 +4185,7 @@ void btrfs_sysfs_remove_mounted(struct btrfs_fs_info *fs_info);
 ssize_t btrfs_listxattr(struct dentry *dentry, char *buffer, size_t size);
 
 /* super.c */
+struct file_system_type *btrfs_get_fs_type(void);
 int btrfs_parse_options(struct btrfs_root *root, char *options,
unsigned long new_flags);
 int btrfs_sync_fs(struct super_block *sb, int wait);
diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index 87639fa53b10..49ba899b2d36 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -69,6 +69,11 @@ static struct file_system_type btrfs_fs_type;
 
 static int btrfs_remount(struct super_block *sb, int *flags, char *data);
 
+struct file_system_type *btrfs_get_fs_type()
+{
+   return &btrfs_fs_type;
+}
+
 const char *btrfs_decode_error(int errno)
 {
char *errstr = "unknown";
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index d729539f9612..072cefac958c 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -525,6 +525,59 @@ static void pending_bios_fn(struct btrfs_work *work)
run_scheduled_bios(device);
 }
 
+int btrfs_get_spare_device(char **path)
+{
+   int ret = 1;
+   struct btrfs_fs_devices *fs_devices;
+   struct btrfs_device *device;
+   struct list_head *fs_uuids = btrfs_get_fs_uuids();
+
+   mutex_lock(&uuid_mutex);
+   list_for_each_entry(fs_devices, fs_uuids, list) {
+   if (!fs_devices->spare)
+   continue;
+
+   /* as of now there is only one device in the spare fs_devices */
+   device = list_entry(fs_devices->devices.next,
+   struct btrfs_device, dev_list);
+
+   if (!device || !device->name)
+   continue;
+
+   fs_devices->spare = 0;
+   /*
+* Its under uuid_mutex and there is one spare per fsid
+* so rcu lock is actually not required
+*/
+   *path = kstrdup(device->name->str, GFP_KERNEL);
+   if (*path)
+   ret = 0;
+   else
+   ret = -ENOMEM;
+   break;
+   }
+
+   if (!ret) {
+   btrfs_sysfs_remove_fsid(fs_devices);
+   list_del(&fs_devices->list);
+   free_fs_devices(fs_devices);
+   }
+   mutex_unlock(&uuid_mutex);
+
+   return ret;
+}
+
+void btrfs_put_spare_device(char *path)
+{
+   struct file_system_type *btrfs_fs_type;
+   struct btrfs_fs_devices *fs_devices;
+
+   btrfs_fs_type = btrfs_get_fs_type();
+
+   if (btrfs_scan_one_device(path, FMODE_READ,
+   btrfs_fs_type, &fs_devices))
+   printk(KERN_INFO "failed to return spare device\n");
+}
 
 void btrfs_free_stale_device(struct btrfs_device *cur_dev)
 {
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index 51cf716eb35b..b4308afa3097 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -469,6 +469,8 @@ int btrfs_init_new_device(struct btrfs_root *root, char *path);
 int btrfs_init_dev_replace_tgtdev(struct btrfs_root *root, char *device_path,
  struct btrfs_device *srcdev,
  struct btrfs_device **device_out);
+int btrfs_get_spare_device(char **path);
+void btrfs_put_spare_device(char *path);
 int btrfs_balance(struct btrfs_balance_control *bctl,
  struct btrfs_ioctl_balance_args *bargs);
 int btrfs_resume_balance_async(struct btrfs_fs_info *fs_info);
-- 
2.7.0
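
The return convention of btrfs_get_spare_device() in the patch above (1 when no spare is found, 0 with a newly allocated path on success, -ENOMEM on allocation failure) is easy to miss. Here is a user-space sketch of it, under the assumption that a plain array can stand in for the fs_uuids list; all `model_*` names are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for one fs_devices entry holding a single device. */
struct model_fs_devices {
    int spare;              /* 1 while the entry is an unclaimed spare */
    const char *dev_name;   /* device path, or NULL if unknown */
};

/*
 * Claim the first available spare.
 * Returns 1 if no spare exists, 0 on success (*path is heap-allocated
 * and owned by the caller), or -ENOMEM if the copy cannot be allocated.
 */
static int model_get_spare(struct model_fs_devices *pool, size_t n, char **path)
{
    int ret = 1;
    size_t i;

    assert(path != NULL);

    for (i = 0; i < n; i++) {
        size_t len;

        if (!pool[i].spare || !pool[i].dev_name)
            continue;

        pool[i].spare = 0;  /* claimed: no longer offered to others */
        len = strlen(pool[i].dev_name) + 1;
        *path = malloc(len);
        if (*path) {
            memcpy(*path, pool[i].dev_name, len);
            ret = 0;
        } else {
            ret = -ENOMEM;
        }
        break;
    }
    return ret;
}
```

A caller only has to test for nonzero to know that no spare was obtained, which is how btrfs_auto_replace_start() uses the real function in the reply quoted later in this thread. On success the caller must free the returned path.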



[PATCH 13/13] btrfs: check for failed device and hot replace

2016-04-01 Thread Anand Jain
This patch checks for a failed device and kicks off auto
replace. If the user decides to disable auto replace,
that can be done through a future sysfs or ioctl interface
that sets the fs_info->no_auto_replace parameter to 1.

Signed-off-by: Anand Jain 
Tested-by: Austin S. Hemmelgarn 
---
 fs/btrfs/ctree.h   |  2 ++
 fs/btrfs/disk-io.c | 34 ++
 2 files changed, 36 insertions(+)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 47e9cd9dd29a..67bb36bb82ee 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1862,6 +1862,8 @@ struct btrfs_fs_info {
struct list_head pinned_chunks;
 
int creating_free_space_tree;
+
+   int no_auto_replace;
 };
 
 struct btrfs_subvolume_writers {
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index b523e56b34e9..f205e7e94948 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1869,6 +1869,38 @@ sleep:
return 0;
 }
 
+static int btrfs_recuperate(struct btrfs_root *root)
+{
+   int ret;
+   int found = 0;
+   struct btrfs_device *device;
+   struct btrfs_fs_devices *fs_devices;
+
+   fs_devices = root->fs_info->fs_devices;
+
+   mutex_lock(&fs_devices->device_list_mutex);
+   rcu_read_lock();
+   list_for_each_entry_rcu(device,
+   &fs_devices->devices, dev_list) {
+   if (device->failed) {
+   found = 1;
+   break;
+   }
+   }
+   rcu_read_unlock();
+   mutex_unlock(&fs_devices->device_list_mutex);
+
+   /*
+* We are using the replace code which should be interrupt-able
+* during unmount, and as of now there is no user land stop
+* request that we support and this will run until its complete
+*/
+   if (found && !root->fs_info->no_auto_replace)
+   ret = btrfs_auto_replace_start(root, device);
+
+   return ret;
+}
+
 /*
  * returns:
  * < 0 : Check didn't run, std error
@@ -1944,6 +1976,8 @@ static int health_kthread(void *arg)
/* Check devices health */
btrfs_update_devices_health(root);
 
+   btrfs_recuperate(root);
+
mutex_unlock(&root->fs_info->health_mutex);
 
 sleep:
-- 
2.7.0



Re: Global hotspare functionality

2016-04-01 Thread Anand Jain



On 03/31/2016 06:17 AM, Yauhen Kharuzhy wrote:

On Tue, Mar 29, 2016 at 10:40:40PM +0300, Yauhen Kharuzhy wrote:

Hi.

I am now testing hotspare v2 on kernel v4.4.5 with lockdep debugging enabled
(I will try Chris' latest tree later). When replacement starts, a lockdep warning
is displayed, because kstrdup() is called with GFP_NOFS inside an
rcu_read_lock()/rcu_read_unlock() block (a GFP_NOFS allocation can sleep).


There is a similar problem in btrfs_auto_replace_start(): rcu_str_deref() is
called without rcu_read_lock():

int btrfs_auto_replace_start(struct btrfs_root *root,
 struct btrfs_device *src_device)
{
 int ret;
 char *tgt_path;

 if (btrfs_get_spare_device(&tgt_path)) {
 btrfs_err(root->fs_info,
 "No spare device found/configured in the kernel");
 return -EINVAL;
 }

 ret = btrfs_dev_replace_start(root, tgt_path,
 src_device->devid,
 rcu_str_deref(src_device->name),


This is fixed in V3.

Thanks, Anand



 BTRFS_IOCTL_DEV_REPLACE_CONT_READING_FROM_SRCDEV_MODE_AVOID);
 if (ret)
 btrfs_put_spare_device(tgt_path);

 kfree(tgt_path);

 return 0;
}

[  156.168133] ===
[  156.168963] [ INFO: suspicious RCU usage. ]
[  156.169822] 4.4.5-scst31x+ #20 Not tainted
[  156.170656] ---
[  156.171488] fs/btrfs/dev-replace.c:990 suspicious rcu_dereference_check() usage!
[  156.172920]
[  156.172920] other info that might help us debug this:
[  156.172920]
[  156.174825]
[  156.174825] rcu_scheduler_active = 1, debug_locks = 0
[  156.176152] 1 lock held by btrfs-casualty/4807:
[  156.181917]  #0:  (&fs_info->casualty_mutex){+.+...}, at: [] casualty_kthread+0x64/0x390 [btrfs]
[  156.193511]
[  156.193511] stack backtrace:
[  156.194680] CPU: 0 PID: 4807 Comm: btrfs-casualty Not tainted 4.4.5-scst31x+ #20
[  156.201650] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
[  156.219100]   88005d79fda0 813529e3 
88005e19c600
[  156.221216]  0001 88005d79fdd0 810d6407 

[  156.224287]   88005f4a0c00 88005da36000 
88005d79fe08
[  156.226375] Call Trace:
[  156.227078]  [] dump_stack+0x85/0xc2
[  156.228152]  [] lockdep_rcu_suspicious+0xd7/0x110
[  156.229418]  [] btrfs_auto_replace_start+0xa6/0xd0 [btrfs]
[  156.230714]  [] casualty_kthread+0x2c4/0x390 [btrfs]
[  156.231915]  [] ? casualty_kthread+0x19c/0x390 [btrfs]
[  156.233105]  [] ? btrfs_check_devices+0x200/0x200 [btrfs]
[  156.234339]  [] kthread+0xef/0x110
[  156.235309]  [] ? __raw_callee_save___pv_queued_spin_unlock+0x11/0x20
[  156.236940]  [] ? kthread_create_on_node+0x200/0x200
[  156.239489]  [] ret_from_fork+0x3f/0x70
[  156.240533]  [] ? kthread_create_on_node+0x200/0x200





Re: Global hotspare functionality

2016-04-01 Thread Anand Jain



On 03/30/2016 03:47 AM, Yauhen Kharuzhy wrote:

On Tue, Mar 29, 2016 at 10:41:36PM +0800, Anand Jain wrote:


Hi Yauhen,





Issue 2.
At the start of autoreplacing a drive with a hotspare, the kernel crashes in the
transaction handling code (inside btrfs_commit_transaction(), called by the
autoreplace initiating routines). I 'fixed' this by removing the closing of the
bdev in btrfs_close_one_device_dont_free(), see
https://bitbucket.org/jekhor/linux-btrfs/commits/dfa441c9ec7b3833f6a5e4d0b6f8c678faea29bb?at=master
(the oops text is also attached). The bdev is closed after replacement by
btrfs_dev_replace_finishing(), so this is safe but doesn't seem
to be the right way.


  I have sent out V2. I don't see that issue with it;
  could you please try it?


Yes, it reproduces on the v4.4.5 kernel. I will try Chris' current
'for-linus-4.6' tree soon.

To emulate a drive failure, I disconnect the drive in VirtualBox, so the bdev
can be freed by the kernel once all references to it are released.


  So far the raid group profile would adapt to a lower suitable
  group profile when a device is missing/failed. This appears
  not to be happening with RAID56, or there is stale IO which wasn't
  flushed out. Anyway, to have this fixed I am moving the patch
   btrfs: introduce device dynamic state transition to offline or failed
  to the top in v3 for any potential changes.
  But first we need a reliable test case, or a very carefully
  crafted test case, which can create this situation.

  Below is the dm-error test that I am using, which
  apparently doesn't reproduce this issue. Could you please try it on V3?
  (Please note the device names are hard-coded in the test script,
  sorry about that.) This will eventually become an fstests script.



# cat util
run()
{
local ret

echo -- ${*} --
echo ${*} | bash
ret=$?
if [ $ret -ne 0 ]; then
echo
echo "## FAILED: RET $ret #"
echo
exit $ret
fi
echo
#echo "OK?"; read
}

runnt()
{
local ret

echo -- ${*} --
echo ${*} | bash
ret=$?
echo
#echo "OK?"; read
}

wipeall()
{
runnt "wipefs -a /dev/sd[c-h] > /dev/null"
}

create_err_dev_raid1()
{
dm_backing_dev="/dev/sdd"
blk_dev_size=`blockdev --getsz $dm_backing_dev`
dmerror_dev="/dev/mapper/dm-sdd"
dmlinear_table="0 $blk_dev_size linear $dm_backing_dev 0"
dmerror_table="0 $blk_dev_size error $dm_backing_dev 0"

echo -e dm_backing_dev'\t'= $dm_backing_dev
echo -e blk_dev_size'\t'= $blk_dev_size
echo -e dmerror_dev'\t'= $dmerror_dev
echo -e dmlinear_table'\t'= $dmlinear_table
echo -e dmerror_table'\t'= $dmerror_table
echo

runnt "dmsetup remove dm-sdd > /dev/null 2>&1"
run "dmsetup create dm-sdd --table '${dmlinear_table}'"

run "mkfs.btrfs -f -draid1 -mraid1 /dev/sdc $dmerror_dev > /dev/null 2>&1"
run mount /dev/sdc /btrfs
run "fillfs /btrfs 1000 > /dev/null 2>&1"
run "dd if=/dev/zero of=/btrfs/tf1 bs=4096 count=100 > /dev/null 2>&1"

run btrfs fi show

#   run sleep 32

run dmsetup suspend dm-sdd
run "dmsetup load dm-sdd --table '$dmerror_table'"
run dmsetup resume dm-sdd
run "dd if=/dev/zero of=/btrfs/tf1 bs=4096 count=100 > /dev/null 2>&1"

run btrfs fi show
}

create_err_dev_raid56()
{
dm_backing_dev="/dev/sdd"
blk_dev_size=`blockdev --getsz $dm_backing_dev`
dmerror_dev="/dev/mapper/dm-sdd"
dmlinear_table="0 $blk_dev_size linear $dm_backing_dev 0"
dmerror_table="0 $blk_dev_size error $dm_backing_dev 0"

echo -e dm_backing_dev'\t'= $dm_backing_dev
echo -e blk_dev_size'\t'= $blk_dev_size
echo -e dmerror_dev'\t'= $dmerror_dev
echo -e dmlinear_table'\t'= $dmlinear_table
echo -e dmerror_table'\t'= $dmerror_table
echo

runnt "dmsetup remove dm-sdd > /dev/null 2>&1"
run "dmsetup create dm-sdd --table '${dmlinear_table}'"

run "mkfs.btrfs -f -draid5 -mraid5 /dev/sdc /dev/sdf $dmerror_dev > /dev/null 2>&1"

run mount /dev/sdc /btrfs
run "fillfs /btrfs 1000 > /dev/null 2>&1"
run "dd if=/dev/zero of=/btrfs/tf1 bs=4096 count=100 > /dev/null 2>&1"

run btrfs fi show

#   run sleep 32

run dmsetup suspend dm-sdd
run "dmsetup load dm-sdd --table '$dmerror_table'"
run dmsetup resume dm-sdd
run "dd if=/dev/zero of=/btrfs/tf1 bs=4096 count=100 > /dev/null 2>&1"

run btrfs fi show
}

# cat auto-replace-test56
source $(dirname $0)/util

wipeall

run btrfs spare add /dev/sde

#run cat /proc/fs/btrfs/devlist

create_err_dev_raid56
--


Thanks, Anand




[ 1464.232552] BTRFS info (device sdc): dev_replace from  (devid 4) to /dev/sdg started
[ 1464.255824] BUG: unable to handle kernel NULL pointer dereference at 0548
[ 1464.291760] 

Re: [PATCH 12/12] btrfs: check device for critical errors and mark failed

2016-04-01 Thread Anand Jain



On 03/30/2016 08:49 AM, Yauhen Kharuzhy wrote:

On Tue, Mar 29, 2016 at 10:22:29PM +0800, Anand Jain wrote:

Write and Flush errors are considered as critical errors,
upon which the device will be brought offline and marked as
failed. Write and Flush errors are identified using device
error statistics.

Signed-off-by: Anand Jain 

btrfs: check for failed device and hot replace

This patch creates casualty_kthread to check for the failed
devices, and triggers device replace.

Signed-off-by: Anand Jain 
---
  fs/btrfs/ctree.h   |   2 +
  fs/btrfs/disk-io.c | 161 -
  fs/btrfs/disk-io.h |   2 +
  fs/btrfs/volumes.c |   1 +
  fs/btrfs/volumes.h |   4 ++
  5 files changed, 169 insertions(+), 1 deletion(-)


btrfs_check_and_handle_casualty() tries to perform auto-replacement
only once after each failure. If no hotspare was added to the system before the
failure, the only remaining way to replace the drive is to perform the replace
manually. This sounds reasonable, so just a clarification: are you sure that we
shouldn't start autoreplacement if a hotspare is added after the drive failure?

V1 of the patchset tried to perform autoreplace endlessly until a replacement
drive was added.


Yeah. I made that change on purpose, but I have reverted it in V3, so
that the code is more flexible and easier to control and change.

Thanks, Anand




Re: [PATCH 12/12] btrfs: check device for critical errors and mark failed

2016-04-01 Thread Anand Jain



On 03/30/2016 06:41 AM, Yauhen Kharuzhy wrote:

On Tue, Mar 29, 2016 at 10:22:29PM +0800, Anand Jain wrote:

Write and Flush errors are considered as critical errors,
upon which the device will be brought offline and marked as
failed. Write and Flush errors are identified using device
error statistics.

Signed-off-by: Anand Jain 

btrfs: check for failed device and hot replace

This patch creates casualty_kthread to check for the failed
devices, and triggers device replace.

Signed-off-by: Anand Jain 
---
  fs/btrfs/ctree.h   |   2 +
  fs/btrfs/disk-io.c | 161 -
  fs/btrfs/disk-io.h |   2 +
  fs/btrfs/volumes.c |   1 +
  fs/btrfs/volumes.h |   4 ++
  5 files changed, 169 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 2c185a8e92f0..36f1c29e00a0 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1569,6 +1569,7 @@ struct btrfs_fs_info {
struct mutex tree_log_mutex;
struct mutex transaction_kthread_mutex;
struct mutex cleaner_mutex;
+   struct mutex casualty_mutex;
struct mutex chunk_mutex;
struct mutex volume_mutex;

@@ -1686,6 +1687,7 @@ struct btrfs_fs_info {
struct btrfs_workqueue *extent_workers;
struct task_struct *transaction_kthread;
struct task_struct *cleaner_kthread;
+   struct task_struct *casualty_kthread;
int thread_pool_size;

struct kobject *space_info_kobj;
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index b99329e37965..650e26e0acda 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1869,6 +1869,153 @@ sleep:
return 0;
  }

+static int btrfs_check_and_handle_casualty(void *arg)
+{
+   int ret;
+   int found = 0;
+   struct btrfs_device *device;
+   struct btrfs_root *root = arg;
+   struct btrfs_fs_info *fs_info = root->fs_info;
+   struct btrfs_fs_devices *fs_devices = fs_info->fs_devices;
+
+   btrfs_dev_replace_lock(&fs_info->dev_replace, 0);
+   if (btrfs_dev_replace_is_ongoing(&fs_info->dev_replace)) {
+   btrfs_dev_replace_unlock(&fs_info->dev_replace, 0);
+   return -EBUSY;
+   }
+   btrfs_dev_replace_unlock(&fs_info->dev_replace, 0);
+
+   ret = btrfs_check_devices(fs_devices);
+   if (ret == 1) {
+   /*
+* There were some casualties, and if its beyond a
+* chunk group can tolerate, then FS will already
+* be in readonly, so check that. And that's best
+* btrfs could do as of now and no replace will help.
+*/
+   if (fs_info->sb->s_flags & MS_RDONLY)
+   return -EROFS;
+
+   mutex_lock(&fs_devices->device_list_mutex);
+   rcu_read_lock();
+   list_for_each_entry_rcu(device,
+   &fs_devices->devices, dev_list) {
+   if (device->failed) {
+   found = 1;
+   break;
+   }
+   }
+   rcu_read_unlock();
+   mutex_unlock(&fs_devices->device_list_mutex);
+   }
+
+   /*
+* We are using the replace code which should be interrupt-able
+* during unmount, and as of now there is no user land stop
+* request that we support and this will run until its complete
+*/
+   if (found)
+   ret = btrfs_auto_replace_start(root, device);
+
+   return ret;
+}
+
+/*
+ * A kthread to check if any auto maintenance be required. This is
+ * multithread safe, and kthread is running only if
+ * fs_info->casualty_kthread is not NULL, fixme: atomic ?
+ */
+static int casualty_kthread(void *arg)
+{
+   int ret;
+   int again;
+   struct btrfs_root *root = arg;
+
+   do {
+   again = 0;
+
+   if (btrfs_need_cleaner_sleep(root))
+   goto sleep;
+
+   if (!mutex_trylock(&root->fs_info->casualty_mutex))
+   goto sleep;
+
+   if (btrfs_need_cleaner_sleep(root)) {
+   mutex_unlock(&root->fs_info->casualty_mutex);
+   goto sleep;
+   }
+
+   ret = btrfs_check_and_handle_casualty(arg);
+   if (ret == -EROFS) {
+   /*
+* When checking and fixing the devices, the
+* FS may be marked as RO in some situations.
+* And on ROFS casualty thread has no work.
+* So optimize here, to stop this thread until
+* FS is back to RW.
+*/
+   }
+   mutex_unlock(&root->fs_info->casualty_mutex);
+
+sleep:
+   if (!try_to_freeze() && !again) {


This block was copy-pasted from the cleaner_kthread(). 'again' variable

Re: Another ENOSPC situation

2016-04-01 Thread Henk Slager
On Fri, Apr 1, 2016 at 10:40 PM, Marc Haber  wrote:
> On Fri, Apr 01, 2016 at 09:20:52PM +0200, Henk Slager wrote:
>> On Fri, Apr 1, 2016 at 6:50 PM, Marc Haber  
>> wrote:
>> > On Fri, Apr 01, 2016 at 06:30:20PM +0200, Marc Haber wrote:
>> >> On Fri, Apr 01, 2016 at 05:44:30PM +0200, Henk Slager wrote:
>> >> > On Fri, Apr 1, 2016 at 3:40 PM, Marc Haber 
>> >> >  wrote:
>> >> > > btrfs balance -mprofiles seems to do something. One kworker and one
>> >> > > btrfs-transaction process hog one CPU core each for hours, while
>> >> > > blocking the filesystem for minutes apiece, which leads to the host
>> >> > > being nearly unusable up to the point of "clock and mouse pointer
>> >> > > frozen for nearly ten minutes".
>> >> >
>> >> > I assume you still have your every 10 minutes snapshotting running
>> >> > while balancing?
>> >>
>> >> No, I disabled the cronjob before trying the balance. I might be
>> >> crazy, but not stup^wunexperienced.
>> >
>> > That being said, I would still expect the code not to allow _this_
>> > kind of effect on the entire system when two allegedly incompatible
>> > operations run simultaneously. I mean, Linux is a multi-user,
>> > multi-tasking operating system where one simply cannot expect all
>> > processes to be cooperative to each other. We have the operating
>> > systems to prevent this kind of issues, not to cause them.
>>
>> Maybe look at it differently: Does user mh have trouble using this
>> laptop w.r.t. storing files?
>
> No. I would have cried murder otherwise.
>
>> In openSUSE Tumbleweed (the snapshot from end of march), root access
>> is needed to change the default snapshotting config, otherwise you
>> will have a 10 year history. After that change has been done according
>> to needs of the user, there is no need to run manual balance.
>
> So you are saying that balancing a filesystem should never be
> necessary? Or what are you trying to say?

There is a package, btrfsmaintenance, which does balancing for the
user after it is configured by root according to user's wishes and
needs.

The key thing I want to say is that you should change your snapshotting
rate and/or policy. It has been hinted before and it is more a
psychological issue than technical I think.


[GIT PULL] Btrfs

2016-04-01 Thread Chris Mason
Hi Linus,

My for-linus-4.6 branch:

git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git 
for-linus-4.6

Has a few fixes Dave Sterba had queued up.  These are all pretty small,
but since they were tested I decided against waiting for more:

Alex Lyakas (2) commits (+18/-10):
btrfs: do not write corrupted metadata blocks to disk (+13/-2)
btrfs: csum_tree_block: return proper errno value (+5/-8)

Jiri Kosina (2) commits (+7/-10):
btrfs: cleaner_kthread() doesn't need explicit freeze (+1/-1)
btrfs: transaction_kthread() is not freezable (+6/-9)

Total: (4) commits (+25/-20)

 fs/btrfs/disk-io.c | 45 +
 1 file changed, 25 insertions(+), 20 deletions(-)


RE: Compression causes kernel crashes if there are I/O or checksum errors (was: RE: kernel BUG at fs/btrfs/volumes.c:5519 when hot-removing device in RAID-1)

2016-04-01 Thread James Johnston
> I grabbed this part from the log after the machine crashed again
> following trying to transfer a bunch of files that included ones with
> csum errors, let me know if this looks like the same issue you were
> having:
> 

Idk?  You hit a soft lockup, mine got a "kernel BUG at..."

Your stack trace diverges from mine after bio_endio.

James 





Re: Another ENOSPC situation

2016-04-01 Thread Marc Haber
On Fri, Apr 01, 2016 at 09:20:52PM +0200, Henk Slager wrote:
> On Fri, Apr 1, 2016 at 6:50 PM, Marc Haber  
> wrote:
> > On Fri, Apr 01, 2016 at 06:30:20PM +0200, Marc Haber wrote:
> >> On Fri, Apr 01, 2016 at 05:44:30PM +0200, Henk Slager wrote:
> >> > On Fri, Apr 1, 2016 at 3:40 PM, Marc Haber  
> >> > wrote:
> >> > > btrfs balance -mprofiles seems to do something. One kworker and one
> >> > > btrfs-transaction process hog one CPU core each for hours, while
> >> > > blocking the filesystem for minutes apiece, which leads to the host
> >> > > being nearly unusable up to the point of "clock and mouse pointer
> >> > > frozen for nearly ten minutes".
> >> >
> >> > I assume you still have your every 10 minutes snapshotting running
> >> > while balancing?
> >>
> >> No, I disabled the cronjob before trying the balance. I might be
> >> crazy, but not stup^wunexperienced.
> >
> > That being said, I would still expect the code not to allow _this_
> > kind of effect on the entire system when two allegedly incompatible
> > operations run simultaneously. I mean, Linux is a multi-user,
> > multi-tasking operating system where one simply cannot expect all
> > processes to be cooperative to each other. We have the operating
> > systems to prevent this kind of issues, not to cause them.
> 
> Maybe look at it differently: Does user mh have trouble using this
> laptop w.r.t. storing files?

No. I would have cried murder otherwise.

> In openSUSE Tumbleweed (the snapshot from end of march), root access
> is needed to change the default snapshotting config, otherwise you
> will have a 10 year history. After that change has been done according
> to needs of the user, there is no need to run manual balance.

So you are saying that balancing a filesystem should never be
necessary? Or what are you trying to say?

Greetings
Marc

-- 
-
Marc Haber | "I don't trust Computers. They | Mailadresse im Header
Leimen, Germany|  lose things."Winona Ryder | Fon: *49 6224 1600402
Nordisch by Nature |  How to make an American Quilt | Fax: *49 6224 1600421


[PATCH 0/8] btrfs: uapi migration for user-visible API components

2016-04-01 Thread Jeff Mahoney
commit 55e301fd57a (Btrfs: move fs/btrfs/ioctl.h to
include/uapi/linux/btrfs.h) was intended to make the ioctl definitions
available to userspace.  Unfortunately, moving just that file wasn't
enough and many of the ioctls aren't actually usable without the
userspace programmer filling in the gaps.  Specifically, for the routine
ioctls like BTRFS_IOC_SETLABEL, BTRFS_LABEL_SIZE wasn't defined so the
ioctl definition would be incomplete.  We were also missing
the argument structure for defrag.  Beyond that, many of the ioctl
structures have a flags field that may or may not be independent of
the btrfs internals.  Lastly, the SEARCH_TREE ioctl exposes all of the
internal items of the tree to userspace programmers so the item
structures should be exposed so that they can be parsed properly.

So, to make all this more convenient for consumers of these APIs, I've
moved the flags used by the ioctl structures into btrfs.h and
moved the item definitions, key IDs, tree root objectids, and other
well-known objectids into a new btrfs_tree.h.  ctree.h includes this
new header directly, so there aren't any changes to .c files at all.

The only part of this set that isn't just a direct cut-and-paste is
the last one which converts u8 and u64 values to __u8 and __u64 since
the former aren't exported via include/uapi.

The goal is that everything required to use the btrfs ioctls for a
particular kernel release should be made available by exporting the uapi
headers for that release.

I intend to use these for the strace ioctl decoding patch I've been
working on so that I don't need to duplicate all of the definitions in the
code I send upstream as the final version of the patch.  Prior to this
patchset, I had to duplicate nearly 100 defines and several structures --
and that's without doing any item decoding at all.

I do expect there might be some discussion here. :)

-Jeff

Jeff Mahoney (8):
  btrfs: uapi/linux/btrfs.h migration, move BTRFS_LABEL_SIZE
  btrfs: uapi/linux/btrfs.h migration, qgroup limit flags
  btrfs: uapi/linux/btrfs.h migration, document subvol flags
  btrfs: uapi/linux/btrfs.h migration, move feature flags
  btrfs: uapi/linux/btrfs.h migration, move balance flags
  btrfs: uapi/linux/btrfs.h migration, move struct
btrfs_ioctl_defrag_range_args
  btrfs: uapi/linux/btrfs_tree.h migration, item types and defines
  btrfs: uapi/linux/btrfs_tree.h, use __u8 and __u64

 fs/btrfs/ctree.h| 1014 +--
 fs/btrfs/volumes.h  |   46 --
 include/uapi/linux/btrfs.h  |  173 ++-
 include/uapi/linux/btrfs_tree.h |  966 +
 4 files changed, 1135 insertions(+), 1064 deletions(-)
 create mode 100644 include/uapi/linux/btrfs_tree.h

-- 
2.7.1
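
As an illustration of why the item structures need to be visible to
userspace, here is a minimal sketch of decoding one packed on-disk key
found inside item data. It assumes the layout of struct btrfs_disk_key
from patch 8/8 of this series (__le64 objectid, __u8 type, __le64
offset, packed, so 17 bytes). The names cpu_key, get_le64 and
parse_disk_key are made up for this example and are not part of the
patchset.

```c
#include <stddef.h>
#include <stdint.h>

/* Packed on-disk size: 8 (objectid) + 1 (type) + 8 (offset). */
#define DISK_KEY_SIZE 17

/* Hypothetical CPU-order key for this sketch, mirroring struct btrfs_key. */
struct cpu_key {
	uint64_t objectid;
	uint8_t type;
	uint64_t offset;
};

/* Read a little-endian 64-bit value byte by byte, so the result does
 * not depend on the endianness of the host. */
static uint64_t get_le64(const uint8_t *p)
{
	uint64_t v = 0;

	for (int i = 7; i >= 0; i--)
		v = (v << 8) | p[i];
	return v;
}

/* Decode one packed disk key from buf.
 * Returns 0 on success, -1 if buf holds fewer than DISK_KEY_SIZE bytes. */
int parse_disk_key(const uint8_t *buf, size_t len, struct cpu_key *out)
{
	if (len < DISK_KEY_SIZE)
		return -1;
	out->objectid = get_le64(buf);
	out->type = buf[8];
	out->offset = get_le64(buf + 9);
	return 0;
}
```

With the exported headers a consumer could apply the real struct and
the usual le64 conversion helpers instead of hand-rolled offsets; the
sketch only shows what has to be known about the layout to do so.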



[PATCH 4/8] btrfs: uapi/linux/btrfs.h migration, move feature flags

2016-04-01 Thread Jeff Mahoney
The compat/compat_ro/incompat feature flags are used by the feature set/get
ioctls.

Signed-off-by: Jeff Mahoney 
---
 fs/btrfs/ctree.h   | 25 -
 include/uapi/linux/btrfs.h | 31 +++
 2 files changed, 31 insertions(+), 25 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index c228b39..378482c 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -506,31 +506,6 @@ struct btrfs_super_block {
  * Compat flags that we support.  If any incompat flags are set other than the
  * ones specified below then we will fail to mount
  */
-#define BTRFS_FEATURE_COMPAT_RO_FREE_SPACE_TREE (1ULL << 0)
-
-#define BTRFS_FEATURE_INCOMPAT_MIXED_BACKREF   (1ULL << 0)
-#define BTRFS_FEATURE_INCOMPAT_DEFAULT_SUBVOL  (1ULL << 1)
-#define BTRFS_FEATURE_INCOMPAT_MIXED_GROUPS (1ULL << 2)
-#define BTRFS_FEATURE_INCOMPAT_COMPRESS_LZO (1ULL << 3)
-/*
- * some patches floated around with a second compression method
- * lets save that incompat here for when they do get in
- * Note we don't actually support it, we're just reserving the
- * number
- */
-#define BTRFS_FEATURE_INCOMPAT_COMPRESS_LZOv2  (1ULL << 4)
-
-/*
- * older kernels tried to do bigger metadata blocks, but the
- * code was pretty buggy.  Lets not let them try anymore.
- */
-#define BTRFS_FEATURE_INCOMPAT_BIG_METADATA (1ULL << 5)
-
-#define BTRFS_FEATURE_INCOMPAT_EXTENDED_IREF   (1ULL << 6)
-#define BTRFS_FEATURE_INCOMPAT_RAID56  (1ULL << 7)
-#define BTRFS_FEATURE_INCOMPAT_SKINNY_METADATA (1ULL << 8)
-#define BTRFS_FEATURE_INCOMPAT_NO_HOLES (1ULL << 9)
-
 #define BTRFS_FEATURE_COMPAT_SUPP  0ULL
 #define BTRFS_FEATURE_COMPAT_SAFE_SET  0ULL
 #define BTRFS_FEATURE_COMPAT_SAFE_CLEAR 0ULL
diff --git a/include/uapi/linux/btrfs.h b/include/uapi/linux/btrfs.h
index 0316e23..de98717 100644
--- a/include/uapi/linux/btrfs.h
+++ b/include/uapi/linux/btrfs.h
@@ -222,6 +222,37 @@ struct btrfs_ioctl_fs_info_args {
__u64 reserved[122];/* pad to 1k */
 };
 
+/*
+ * feature flags
+ *
+ * Used by:
+ * struct btrfs_ioctl_feature_flags
+ */
+#define BTRFS_FEATURE_COMPAT_RO_FREE_SPACE_TREE (1ULL << 0)
+
+#define BTRFS_FEATURE_INCOMPAT_MIXED_BACKREF   (1ULL << 0)
+#define BTRFS_FEATURE_INCOMPAT_DEFAULT_SUBVOL  (1ULL << 1)
+#define BTRFS_FEATURE_INCOMPAT_MIXED_GROUPS (1ULL << 2)
+#define BTRFS_FEATURE_INCOMPAT_COMPRESS_LZO (1ULL << 3)
+/*
+ * some patches floated around with a second compression method
+ * lets save that incompat here for when they do get in
+ * Note we don't actually support it, we're just reserving the
+ * number
+ */
+#define BTRFS_FEATURE_INCOMPAT_COMPRESS_LZOv2  (1ULL << 4)
+
+/*
+ * older kernels tried to do bigger metadata blocks, but the
+ * code was pretty buggy.  Lets not let them try anymore.
+ */
+#define BTRFS_FEATURE_INCOMPAT_BIG_METADATA (1ULL << 5)
+
+#define BTRFS_FEATURE_INCOMPAT_EXTENDED_IREF   (1ULL << 6)
+#define BTRFS_FEATURE_INCOMPAT_RAID56  (1ULL << 7)
+#define BTRFS_FEATURE_INCOMPAT_SKINNY_METADATA (1ULL << 8)
+#define BTRFS_FEATURE_INCOMPAT_NO_HOLES (1ULL << 9)
+
 struct btrfs_ioctl_feature_flags {
__u64 compat_flags;
__u64 compat_ro_flags;
-- 
2.7.1
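
To show what a userspace consumer might do with these flags once they
are exported, here is a rough sketch that turns an incompat bitmask
into readable names. The macro values are copied from the hunk above
so that it builds without kernel headers; describe_incompat(), struct
flag_name and the lowercase labels are invented for this example and
are not defined by the patch.

```c
#include <stdint.h>
#include <stdio.h>

/* Values copied from the patch above (assumption: unchanged upstream). */
#define BTRFS_FEATURE_INCOMPAT_MIXED_BACKREF	(1ULL << 0)
#define BTRFS_FEATURE_INCOMPAT_DEFAULT_SUBVOL	(1ULL << 1)
#define BTRFS_FEATURE_INCOMPAT_MIXED_GROUPS	(1ULL << 2)
#define BTRFS_FEATURE_INCOMPAT_COMPRESS_LZO	(1ULL << 3)
#define BTRFS_FEATURE_INCOMPAT_COMPRESS_LZOv2	(1ULL << 4)
#define BTRFS_FEATURE_INCOMPAT_BIG_METADATA	(1ULL << 5)
#define BTRFS_FEATURE_INCOMPAT_EXTENDED_IREF	(1ULL << 6)
#define BTRFS_FEATURE_INCOMPAT_RAID56		(1ULL << 7)
#define BTRFS_FEATURE_INCOMPAT_SKINNY_METADATA	(1ULL << 8)
#define BTRFS_FEATURE_INCOMPAT_NO_HOLES		(1ULL << 9)

/* Hypothetical lookup table; the labels are illustrative only. */
struct flag_name {
	uint64_t bit;
	const char *name;
};

static const struct flag_name incompat_names[] = {
	{ BTRFS_FEATURE_INCOMPAT_MIXED_BACKREF,   "mixed_backref" },
	{ BTRFS_FEATURE_INCOMPAT_DEFAULT_SUBVOL,  "default_subvol" },
	{ BTRFS_FEATURE_INCOMPAT_MIXED_GROUPS,    "mixed_groups" },
	{ BTRFS_FEATURE_INCOMPAT_COMPRESS_LZO,    "compress_lzo" },
	{ BTRFS_FEATURE_INCOMPAT_COMPRESS_LZOv2,  "compress_lzov2" },
	{ BTRFS_FEATURE_INCOMPAT_BIG_METADATA,    "big_metadata" },
	{ BTRFS_FEATURE_INCOMPAT_EXTENDED_IREF,   "extended_iref" },
	{ BTRFS_FEATURE_INCOMPAT_RAID56,          "raid56" },
	{ BTRFS_FEATURE_INCOMPAT_SKINNY_METADATA, "skinny_metadata" },
	{ BTRFS_FEATURE_INCOMPAT_NO_HOLES,        "no_holes" },
};

/* Write the names of all known set bits into buf, separated by spaces.
 * Returns the mask of set bits that have no entry in the table. */
uint64_t describe_incompat(uint64_t flags, char *buf, size_t buflen)
{
	uint64_t unknown = flags;
	size_t used = 0;

	if (buflen > 0)
		buf[0] = '\0';
	for (size_t i = 0; i < sizeof(incompat_names) / sizeof(incompat_names[0]); i++) {
		const struct flag_name *f = &incompat_names[i];
		int n;

		if (!(flags & f->bit))
			continue;
		unknown &= ~f->bit;
		if (used >= buflen)
			continue;	/* buffer already full, keep scanning bits */
		n = snprintf(buf + used, buflen - used, "%s%s",
			     used ? " " : "", f->name);
		if (n < 0 || (size_t)n >= buflen - used) {
			used = buflen;	/* output truncated, stop appending */
			continue;
		}
		used += (size_t)n;
	}
	return unknown;
}
```

Returning the unknown bits separately lets a caller report flags from
a newer kernel instead of silently ignoring them.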



[PATCH 6/8] btrfs: uapi/linux/btrfs.h migration, move struct btrfs_ioctl_defrag_range_args

2016-04-01 Thread Jeff Mahoney
struct btrfs_ioctl_defrag_range_args is used by the BTRFS_IOC_DEFRAG_RANGE
ioctl.

Signed-off-by: Jeff Mahoney 
---
 fs/btrfs/ctree.h   | 31 ---
 include/uapi/linux/btrfs.h | 38 +-
 2 files changed, 37 insertions(+), 32 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 378482c..89f36b6 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1992,37 +1992,6 @@ struct btrfs_root {
atomic_t qgroup_meta_rsv;
 };
 
-struct btrfs_ioctl_defrag_range_args {
-   /* start of the defrag operation */
-   __u64 start;
-
-   /* number of bytes to defrag, use (u64)-1 to say all */
-   __u64 len;
-
-   /*
-* flags for the operation, which can include turning
-* on compression for this one defrag
-*/
-   __u64 flags;
-
-   /*
-* any extent bigger than this will be considered
-* already defragged.  Use 0 to take the kernel default
-* Use 1 to say every single extent must be rewritten
-*/
-   __u32 extent_thresh;
-
-   /*
-* which compression method to use if turning on compression
-* for this defrag operation.  If unspecified, zlib will
-* be used
-*/
-   __u32 compress_type;
-
-   /* spare for later */
-   __u32 unused[4];
-};
-
 
 /*
  * inode items have the data typically returned from stat and store other
diff --git a/include/uapi/linux/btrfs.h b/include/uapi/linux/btrfs.h
index abae362..98aff38 100644
--- a/include/uapi/linux/btrfs.h
+++ b/include/uapi/linux/btrfs.h
@@ -474,9 +474,45 @@ struct btrfs_ioctl_clone_range_args {
   __u64 dest_offset;
 };
 
-/* flags for the defrag range ioctl */
+/*
+ * flags definition for the defrag range ioctl
+ *
+ * Used by:
+ * struct btrfs_ioctl_defrag_range_args.flags
+ */
 #define BTRFS_DEFRAG_RANGE_COMPRESS 1
 #define BTRFS_DEFRAG_RANGE_START_IO 2
+struct btrfs_ioctl_defrag_range_args {
+   /* start of the defrag operation */
+   __u64 start;
+
+   /* number of bytes to defrag, use (u64)-1 to say all */
+   __u64 len;
+
+   /*
+* flags for the operation, which can include turning
+* on compression for this one defrag
+*/
+   __u64 flags;
+
+   /*
+* any extent bigger than this will be considered
+* already defragged.  Use 0 to take the kernel default
+* Use 1 to say every single extent must be rewritten
+*/
+   __u32 extent_thresh;
+
+   /*
+* which compression method to use if turning on compression
+* for this defrag operation.  If unspecified, zlib will
+* be used
+*/
+   __u32 compress_type;
+
+   /* spare for later */
+   __u32 unused[4];
+};
+
 
 #define BTRFS_SAME_DATA_DIFFERS 1
 /* For extent-same ioctl */
-- 
2.7.1
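
For reference, a small sketch of how a caller might fill in the
argument structure before issuing BTRFS_IOC_DEFRAG_RANGE, following
the field comments in the hunk above. The struct is mirrored locally
with <stdint.h> types so the example builds without kernel headers;
struct defrag_range_args and defrag_args_whole_file() are names made
up for this example, and the ioctl call itself is left out.

```c
#include <stdint.h>
#include <string.h>

/* Flag values copied from the patch above. */
#define BTRFS_DEFRAG_RANGE_COMPRESS 1
#define BTRFS_DEFRAG_RANGE_START_IO 2

/* Local mirror of struct btrfs_ioctl_defrag_range_args (assumption:
 * uint64_t/uint32_t stand in for __u64/__u32). */
struct defrag_range_args {
	uint64_t start;		/* start of the defrag operation */
	uint64_t len;		/* bytes to defrag, (u64)-1 means all */
	uint64_t flags;		/* BTRFS_DEFRAG_RANGE_* */
	uint32_t extent_thresh;	/* 0 takes the kernel default */
	uint32_t compress_type;	/* unspecified means zlib */
	uint32_t unused[4];	/* spare for later */
};

/* Prepare args to defragment an entire file with the kernel's default
 * extent threshold, optionally requesting compression. */
void defrag_args_whole_file(struct defrag_range_args *args, int compress)
{
	memset(args, 0, sizeof(*args));	/* also clears the spare fields */
	args->start = 0;
	args->len = (uint64_t)-1;
	args->extent_thresh = 0;
	if (compress)
		args->flags |= BTRFS_DEFRAG_RANGE_COMPRESS;
	/* compress_type stays 0, i.e. unspecified */
}
```

Zeroing the whole structure first keeps the unused words at a known
value, which seems the safer habit for fields reserved for later use.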



[PATCH 2/8] btrfs: uapi/linux/btrfs.h migration, qgroup limit flags

2016-04-01 Thread Jeff Mahoney
The BTRFS_QGROUP_LIMIT_* flags are required to tell the kernel which
fields are valid when using the BTRFS_IOC_QGROUP_LIMIT ioctl.

Signed-off-by: Jeff Mahoney 
---
 fs/btrfs/ctree.h   |  8 
 include/uapi/linux/btrfs.h | 22 +-
 2 files changed, 21 insertions(+), 9 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 3beaa24..c228b39 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1154,14 +1154,6 @@ struct btrfs_qgroup_info_item {
__le64 excl_cmpr;
 } __attribute__ ((__packed__));
 
-/* flags definition for qgroup limits */
-#define BTRFS_QGROUP_LIMIT_MAX_RFER (1ULL << 0)
-#define BTRFS_QGROUP_LIMIT_MAX_EXCL (1ULL << 1)
-#define BTRFS_QGROUP_LIMIT_RSV_RFER (1ULL << 2)
-#define BTRFS_QGROUP_LIMIT_RSV_EXCL (1ULL << 3)
-#define BTRFS_QGROUP_LIMIT_RFER_CMPR   (1ULL << 4)
-#define BTRFS_QGROUP_LIMIT_EXCL_CMPR   (1ULL << 5)
-
 struct btrfs_qgroup_limit_item {
/*
 * only updated when any of the other values change
diff --git a/include/uapi/linux/btrfs.h b/include/uapi/linux/btrfs.h
index 11eee34..9651af3 100644
--- a/include/uapi/linux/btrfs.h
+++ b/include/uapi/linux/btrfs.h
@@ -41,7 +41,19 @@ struct btrfs_ioctl_vol_args {
 #define BTRFS_UUID_SIZE 16
 #define BTRFS_UUID_UNPARSED_SIZE   37
 
-#define BTRFS_QGROUP_INHERIT_SET_LIMITS (1ULL << 0)
+/*
+ * flags definition for qgroup limits
+ *
+ * Used by:
+ * struct btrfs_qgroup_limit.flags
+ * struct btrfs_qgroup_limit_item.flags
+ */
+#define BTRFS_QGROUP_LIMIT_MAX_RFER (1ULL << 0)
+#define BTRFS_QGROUP_LIMIT_MAX_EXCL (1ULL << 1)
+#define BTRFS_QGROUP_LIMIT_RSV_RFER (1ULL << 2)
+#define BTRFS_QGROUP_LIMIT_RSV_EXCL (1ULL << 3)
+#define BTRFS_QGROUP_LIMIT_RFER_CMPR   (1ULL << 4)
+#define BTRFS_QGROUP_LIMIT_EXCL_CMPR   (1ULL << 5)
 
 struct btrfs_qgroup_limit {
__u64   flags;
@@ -51,6 +63,14 @@ struct btrfs_qgroup_limit {
__u64   rsv_excl;
 };
 
+/*
+ * flags definition for qgroup inheritance
+ *
+ * Used by:
+ * struct btrfs_qgroup_inherit.flags
+ */
+#define BTRFS_QGROUP_INHERIT_SET_LIMITS (1ULL << 0)
+
 struct btrfs_qgroup_inherit {
__u64   flags;
__u64   num_qgroups;
-- 
2.7.1
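
To make the "which fields are valid" point concrete, here is a sketch
of setting a limit together with its flag bit. The hunks above show
only the first and last members of struct btrfs_qgroup_limit; the
middle members (max_rfer, max_excl, rsv_rfer) are filled in from my
reading of the upstream header and should be treated as an
assumption. struct qgroup_limit and the helper names are made up for
this example, and the BTRFS_IOC_QGROUP_LIMIT call is not shown.

```c
#include <stdint.h>
#include <string.h>

/* Flag values copied from the patch above. */
#define BTRFS_QGROUP_LIMIT_MAX_RFER	(1ULL << 0)
#define BTRFS_QGROUP_LIMIT_MAX_EXCL	(1ULL << 1)

/* Local mirror of struct btrfs_qgroup_limit (assumed member order). */
struct qgroup_limit {
	uint64_t flags;
	uint64_t max_rfer;
	uint64_t max_excl;
	uint64_t rsv_rfer;
	uint64_t rsv_excl;
};

/* Start with no field marked valid. */
void qgroup_limit_init(struct qgroup_limit *lim)
{
	memset(lim, 0, sizeof(*lim));
}

/* Set the referenced-bytes limit and mark that field as valid. */
void qgroup_limit_set_max_rfer(struct qgroup_limit *lim, uint64_t bytes)
{
	lim->max_rfer = bytes;
	lim->flags |= BTRFS_QGROUP_LIMIT_MAX_RFER;
}

/* Set the exclusive-bytes limit and mark that field as valid. */
void qgroup_limit_set_max_excl(struct qgroup_limit *lim, uint64_t bytes)
{
	lim->max_excl = bytes;
	lim->flags |= BTRFS_QGROUP_LIMIT_MAX_EXCL;
}
```

Tying each setter to its flag bit avoids the case where a value is
written but the kernel is never told to look at it.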



[PATCH 7/8] btrfs: uapi/linux/btrfs_tree.h migration, item types and defines

2016-04-01 Thread Jeff Mahoney
The BTRFS_IOC_SEARCH_TREE ioctl returns file system items directly
to userspace.  In order to decode them, full type information is required.

Create a new header, btrfs_tree.h, to contain these since most users won't
need them.

Signed-off-by: Jeff Mahoney 
---
 fs/btrfs/ctree.h| 949 +--
 include/uapi/linux/btrfs_tree.h | 966 
 2 files changed, 967 insertions(+), 948 deletions(-)
 create mode 100644 include/uapi/linux/btrfs_tree.h

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 89f36b6..cf34fb5 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -33,6 +33,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -64,98 +65,6 @@ struct btrfs_ordered_sum;
 
 #define BTRFS_COMPAT_EXTENT_TREE_V0
 
-/* holds pointers to all of the tree roots */
-#define BTRFS_ROOT_TREE_OBJECTID 1ULL
-
-/* stores information about which extents are in use, and reference counts */
-#define BTRFS_EXTENT_TREE_OBJECTID 2ULL
-
-/*
- * chunk tree stores translations from logical -> physical block numbering
- * the super block points to the chunk tree
- */
-#define BTRFS_CHUNK_TREE_OBJECTID 3ULL
-
-/*
- * stores information about which areas of a given device are in use.
- * one per device.  The tree of tree roots points to the device tree
- */
-#define BTRFS_DEV_TREE_OBJECTID 4ULL
-
-/* one per subvolume, storing files and directories */
-#define BTRFS_FS_TREE_OBJECTID 5ULL
-
-/* directory objectid inside the root tree */
-#define BTRFS_ROOT_TREE_DIR_OBJECTID 6ULL
-
-/* holds checksums of all the data extents */
-#define BTRFS_CSUM_TREE_OBJECTID 7ULL
-
-/* holds quota configuration and tracking */
-#define BTRFS_QUOTA_TREE_OBJECTID 8ULL
-
-/* for storing items that use the BTRFS_UUID_KEY* types */
-#define BTRFS_UUID_TREE_OBJECTID 9ULL
-
-/* tracks free space in block groups. */
-#define BTRFS_FREE_SPACE_TREE_OBJECTID 10ULL
-
-/* device stats in the device tree */
-#define BTRFS_DEV_STATS_OBJECTID 0ULL
-
-/* for storing balance parameters in the root tree */
-#define BTRFS_BALANCE_OBJECTID -4ULL
-
-/* orhpan objectid for tracking unlinked/truncated files */
-#define BTRFS_ORPHAN_OBJECTID -5ULL
-
-/* does write ahead logging to speed up fsyncs */
-#define BTRFS_TREE_LOG_OBJECTID -6ULL
-#define BTRFS_TREE_LOG_FIXUP_OBJECTID -7ULL
-
-/* for space balancing */
-#define BTRFS_TREE_RELOC_OBJECTID -8ULL
-#define BTRFS_DATA_RELOC_TREE_OBJECTID -9ULL
-
-/*
- * extent checksums all have this objectid
- * this allows them to share the logging tree
- * for fsyncs
- */
-#define BTRFS_EXTENT_CSUM_OBJECTID -10ULL
-
-/* For storing free space cache */
-#define BTRFS_FREE_SPACE_OBJECTID -11ULL
-
-/*
- * The inode number assigned to the special inode for storing
- * free ino cache
- */
-#define BTRFS_FREE_INO_OBJECTID -12ULL
-
-/* dummy objectid represents multiple objectids */
-#define BTRFS_MULTIPLE_OBJECTIDS -255ULL
-
-/*
- * All files have objectids in this range.
- */
-#define BTRFS_FIRST_FREE_OBJECTID 256ULL
-#define BTRFS_LAST_FREE_OBJECTID -256ULL
-#define BTRFS_FIRST_CHUNK_TREE_OBJECTID 256ULL
-
-
-/*
- * the device items go into the chunk tree.  The key is in the form
- * [ 1 BTRFS_DEV_ITEM_KEY device_id ]
- */
-#define BTRFS_DEV_ITEMS_OBJECTID 1ULL
-
-#define BTRFS_BTREE_INODE_OBJECTID 1
-
-#define BTRFS_EMPTY_SUBVOL_DIR_OBJECTID 2
-
-#define BTRFS_DEV_REPLACE_DEVID 0ULL
-
 /*
  * the max metadata block size.  This limit is somewhat artificial,
  * but the memmove costs go through the roof for larger blocks.
@@ -175,12 +84,6 @@ struct btrfs_ordered_sum;
  */
 #define BTRFS_LINK_MAX 65535U
 
-/* 32 bytes in various csum fields */
-#define BTRFS_CSUM_SIZE 32
-
-/* csum types */
-#define BTRFS_CSUM_TYPE_CRC32  0
-
 static const int btrfs_csum_sizes[] = { 4 };
 
 /* four bytes for CRC32 */
@@ -189,17 +92,6 @@ static const int btrfs_csum_sizes[] = { 4 };
 /* spefic to btrfs_map_block(), therefore not in include/linux/blk_types.h */
 #define REQ_GET_READ_MIRRORS   (1 << 30)
 
-#define BTRFS_FT_UNKNOWN   0
-#define BTRFS_FT_REG_FILE  1
-#define BTRFS_FT_DIR   2
-#define BTRFS_FT_CHRDEV 3
-#define BTRFS_FT_BLKDEV 4
-#define BTRFS_FT_FIFO  5
-#define BTRFS_FT_SOCK  6
-#define BTRFS_FT_SYMLINK   7
-#define BTRFS_FT_XATTR 8
-#define BTRFS_FT_MAX   9
-
 /* ioprio of readahead is set to idle */
 #define BTRFS_IOPRIO_READA (IOPRIO_PRIO_VALUE(IOPRIO_CLASS_IDLE, 0))
 
@@ -207,138 +99,10 @@ static const int btrfs_csum_sizes[] = { 4 };
 
 #define BTRFS_MAX_EXTENT_SIZE SZ_128M
 
-/*
- * The key defines the order in the tree, and so it also defines (optimal)
- * block layout.
- *
- * objectid corresponds to the inode number.
- *
- * type tells us things about the object, and is a kind of stream selector.
- * so for a given inode, keys with type of 1 might refer to the inode data,
- * type of 2 may point to file data in the btree and type 

[PATCH 3/8] btrfs: uapi/linux/btrfs.h migration, document subvol flags

2016-04-01 Thread Jeff Mahoney
Signed-off-by: Jeff Mahoney 
---
 include/uapi/linux/btrfs.h | 17 ++---
 1 file changed, 14 insertions(+), 3 deletions(-)

diff --git a/include/uapi/linux/btrfs.h b/include/uapi/linux/btrfs.h
index 9651af3..0316e23 100644
--- a/include/uapi/linux/btrfs.h
+++ b/include/uapi/linux/btrfs.h
@@ -34,9 +34,6 @@ struct btrfs_ioctl_vol_args {
 
 #define BTRFS_DEVICE_PATH_NAME_MAX 1024
 
-#define BTRFS_SUBVOL_CREATE_ASYNC  (1ULL << 0)
-#define BTRFS_SUBVOL_RDONLY (1ULL << 1)
-#define BTRFS_SUBVOL_QGROUP_INHERIT (1ULL << 2)
 #define BTRFS_FSID_SIZE 16
 #define BTRFS_UUID_SIZE 16
 #define BTRFS_UUID_UNPARSED_SIZE   37
@@ -85,6 +82,20 @@ struct btrfs_ioctl_qgroup_limit_args {
struct btrfs_qgroup_limit lim;
 };
 
+/*
+ * flags for subvolumes
+ *
+ * Used by:
+ * struct btrfs_ioctl_vol_args_v2.flags
+ *
+ * BTRFS_SUBVOL_RDONLY is also provided/consumed by the following ioctls:
+ * - BTRFS_IOC_SUBVOL_GETFLAGS
+ * - BTRFS_IOC_SUBVOL_SETFLAGS
+ */
+#define BTRFS_SUBVOL_CREATE_ASYNC  (1ULL << 0)
+#define BTRFS_SUBVOL_RDONLY (1ULL << 1)
+#define BTRFS_SUBVOL_QGROUP_INHERIT (1ULL << 2)
+
 #define BTRFS_SUBVOL_NAME_MAX 4039
 struct btrfs_ioctl_vol_args_v2 {
__s64 fd;
-- 
2.7.1
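
Since BTRFS_SUBVOL_RDONLY is both returned by BTRFS_IOC_SUBVOL_GETFLAGS
and accepted by BTRFS_IOC_SUBVOL_SETFLAGS, a caller would presumably
read the current flags, change only that bit, and write the result
back. A minimal sketch of the bit handling follows; the ioctl calls
are omitted and subvol_flags_set_readonly() is a name invented for
this example.

```c
#include <stdint.h>

/* Flag values copied from the patch above. */
#define BTRFS_SUBVOL_CREATE_ASYNC	(1ULL << 0)
#define BTRFS_SUBVOL_RDONLY		(1ULL << 1)
#define BTRFS_SUBVOL_QGROUP_INHERIT	(1ULL << 2)

/* Return flags with the read-only bit set or cleared, leaving every
 * other bit exactly as it was read. */
uint64_t subvol_flags_set_readonly(uint64_t flags, int readonly)
{
	if (readonly)
		return flags | BTRFS_SUBVOL_RDONLY;
	return flags & ~BTRFS_SUBVOL_RDONLY;
}
```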



[PATCH 8/8] btrfs: uapi/linux/btrfs_tree.h, use __u8 and __u64

2016-04-01 Thread Jeff Mahoney
u8 and u64 aren't exported to userspace, while __u8 and __u64 are.

Signed-off-by: Jeff Mahoney 
---
 include/uapi/linux/btrfs_tree.h | 52 -
 1 file changed, 26 insertions(+), 26 deletions(-)

diff --git a/include/uapi/linux/btrfs_tree.h b/include/uapi/linux/btrfs_tree.h
index 1e87505..d5ad15a 100644
--- a/include/uapi/linux/btrfs_tree.h
+++ b/include/uapi/linux/btrfs_tree.h
@@ -334,14 +334,14 @@
  */
 struct btrfs_disk_key {
__le64 objectid;
-   u8 type;
+   __u8 type;
__le64 offset;
 } __attribute__ ((__packed__));
 
 struct btrfs_key {
-   u64 objectid;
-   u8 type;
-   u64 offset;
+   __u64 objectid;
+   __u8 type;
+   __u64 offset;
 } __attribute__ ((__packed__));
 
 struct btrfs_dev_item {
@@ -379,22 +379,22 @@ struct btrfs_dev_item {
__le32 dev_group;
 
/* seek speed 0-100 where 100 is fastest */
-   u8 seek_speed;
+   __u8 seek_speed;
 
/* bandwidth 0-100 where 100 is fastest */
-   u8 bandwidth;
+   __u8 bandwidth;
 
/* btrfs generated uuid for this device */
-   u8 uuid[BTRFS_UUID_SIZE];
+   __u8 uuid[BTRFS_UUID_SIZE];
 
/* uuid of FS who owns this device */
-   u8 fsid[BTRFS_UUID_SIZE];
+   __u8 fsid[BTRFS_UUID_SIZE];
 } __attribute__ ((__packed__));
 
 struct btrfs_stripe {
__le64 devid;
__le64 offset;
-   u8 dev_uuid[BTRFS_UUID_SIZE];
+   __u8 dev_uuid[BTRFS_UUID_SIZE];
 } __attribute__ ((__packed__));
 
 struct btrfs_chunk {
@@ -433,7 +433,7 @@ struct btrfs_chunk {
 struct btrfs_free_space_entry {
__le64 offset;
__le64 bytes;
-   u8 type;
+   __u8 type;
 } __attribute__ ((__packed__));
 
 struct btrfs_free_space_header {
@@ -486,7 +486,7 @@ struct btrfs_extent_item_v0 {
 
 struct btrfs_tree_block_info {
struct btrfs_disk_key key;
-   u8 level;
+   __u8 level;
 } __attribute__ ((__packed__));
 
 struct btrfs_extent_data_ref {
@@ -501,7 +501,7 @@ struct btrfs_shared_data_ref {
 } __attribute__ ((__packed__));
 
 struct btrfs_extent_inline_ref {
-   u8 type;
+   __u8 type;
__le64 offset;
 } __attribute__ ((__packed__));
 
@@ -523,7 +523,7 @@ struct btrfs_dev_extent {
__le64 chunk_objectid;
__le64 chunk_offset;
__le64 length;
-   u8 chunk_tree_uuid[BTRFS_UUID_SIZE];
+   __u8 chunk_tree_uuid[BTRFS_UUID_SIZE];
 } __attribute__ ((__packed__));
 
 struct btrfs_inode_ref {
@@ -583,7 +583,7 @@ struct btrfs_dir_item {
__le64 transid;
__le16 data_len;
__le16 name_len;
-   u8 type;
+   __u8 type;
 } __attribute__ ((__packed__));
 
 #define BTRFS_ROOT_SUBVOL_RDONLY   (1ULL << 0)
@@ -605,8 +605,8 @@ struct btrfs_root_item {
__le64 flags;
__le32 refs;
struct btrfs_disk_key drop_progress;
-   u8 drop_level;
-   u8 level;
+   __u8 drop_level;
+   __u8 level;
 
/*
 * The following fields appear after subvol_uuids+subvol_times
@@ -625,9 +625,9 @@ struct btrfs_root_item {
 * when invalidating the fields.
 */
__le64 generation_v2;
-   u8 uuid[BTRFS_UUID_SIZE];
-   u8 parent_uuid[BTRFS_UUID_SIZE];
-   u8 received_uuid[BTRFS_UUID_SIZE];
+   __u8 uuid[BTRFS_UUID_SIZE];
+   __u8 parent_uuid[BTRFS_UUID_SIZE];
+   __u8 received_uuid[BTRFS_UUID_SIZE];
__le64 ctransid; /* updated when an inode changes */
__le64 otransid; /* trans when created */
__le64 stransid; /* trans when sent. non-zero for received subvol */
@@ -751,12 +751,12 @@ struct btrfs_file_extent_item {
 * it is treated like an incompat flag for reading and writing,
 * but not for stat.
 */
-   u8 compression;
-   u8 encryption;
+   __u8 compression;
+   __u8 encryption;
__le16 other_encoding; /* spare for later use */
 
/* are we inline data or a real extent? */
-   u8 type;
+   __u8 type;
 
/*
 * disk space consumed by the extent, checksum blocks are included
@@ -783,7 +783,7 @@ struct btrfs_file_extent_item {
 } __attribute__ ((__packed__));
 
 struct btrfs_csum_item {
-   u8 csum;
+   __u8 csum;
 } __attribute__ ((__packed__));
 
 struct btrfs_dev_stats_item {
@@ -874,14 +874,14 @@ enum btrfs_raid_types {
 #define BTRFS_EXTENDED_PROFILE_MASK    (BTRFS_BLOCK_GROUP_PROFILE_MASK | \
 BTRFS_AVAIL_ALLOC_BIT_SINGLE)
 
-static inline u64 chunk_to_extended(u64 flags)
+static inline __u64 chunk_to_extended(__u64 flags)
 {
if ((flags & BTRFS_BLOCK_GROUP_PROFILE_MASK) == 0)
flags |= BTRFS_AVAIL_ALLOC_BIT_SINGLE;
 
return flags;
 }
-static inline u64 extended_to_chunk(u64 flags)
+static inline __u64 extended_to_chunk(__u64 flags)
 {
return flags & ~BTRFS_AVAIL_ALLOC_BIT_SINGLE;
 }
@@ -900,7 +900,7 @@ struct btrfs_free_space_info {
 #define 

[PATCH 5/8] btrfs: uapi/linux/btrfs.h migration, move balance flags

2016-04-01 Thread Jeff Mahoney
The BTRFS_BALANCE_* flags are used by struct btrfs_ioctl_balance_args.flags
and btrfs_ioctl_balance_args.{data,meta,sys}.flags in the BTRFS_IOC_BALANCE
ioctl.

Signed-off-by: Jeff Mahoney 
---
 fs/btrfs/volumes.h | 46 -
 include/uapi/linux/btrfs.h | 64 ++
 2 files changed, 64 insertions(+), 46 deletions(-)

diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index 1939ebd..144cec3 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -357,52 +357,6 @@ struct map_lookup {
 #define map_lookup_size(n) (sizeof(struct map_lookup) + \
(sizeof(struct btrfs_bio_stripe) * (n)))
 
-/*
- * Restriper's general type filter
- */
-#define BTRFS_BALANCE_DATA (1ULL << 0)
-#define BTRFS_BALANCE_SYSTEM   (1ULL << 1)
-#define BTRFS_BALANCE_METADATA (1ULL << 2)
-
-#define BTRFS_BALANCE_TYPE_MASK        (BTRFS_BALANCE_DATA |   \
-BTRFS_BALANCE_SYSTEM | \
-BTRFS_BALANCE_METADATA)
-
-#define BTRFS_BALANCE_FORCE    (1ULL << 3)
-#define BTRFS_BALANCE_RESUME   (1ULL << 4)
-
-/*
- * Balance filters
- */
-#define BTRFS_BALANCE_ARGS_PROFILES    (1ULL << 0)
-#define BTRFS_BALANCE_ARGS_USAGE   (1ULL << 1)
-#define BTRFS_BALANCE_ARGS_DEVID   (1ULL << 2)
-#define BTRFS_BALANCE_ARGS_DRANGE  (1ULL << 3)
-#define BTRFS_BALANCE_ARGS_VRANGE  (1ULL << 4)
-#define BTRFS_BALANCE_ARGS_LIMIT   (1ULL << 5)
-#define BTRFS_BALANCE_ARGS_LIMIT_RANGE (1ULL << 6)
-#define BTRFS_BALANCE_ARGS_STRIPES_RANGE (1ULL << 7)
-#define BTRFS_BALANCE_ARGS_USAGE_RANGE (1ULL << 10)
-
-#define BTRFS_BALANCE_ARGS_MASK                 \
-   (BTRFS_BALANCE_ARGS_PROFILES |  \
-BTRFS_BALANCE_ARGS_USAGE | \
-BTRFS_BALANCE_ARGS_DEVID | \
-BTRFS_BALANCE_ARGS_DRANGE |\
-BTRFS_BALANCE_ARGS_VRANGE |\
-BTRFS_BALANCE_ARGS_LIMIT | \
-BTRFS_BALANCE_ARGS_LIMIT_RANGE |   \
-BTRFS_BALANCE_ARGS_STRIPES_RANGE | \
-BTRFS_BALANCE_ARGS_USAGE_RANGE)
-
-/*
- * Profile changing flags.  When SOFT is set we won't relocate chunk if
- * it already has the target profile (even though it may be
- * half-filled).
- */
-#define BTRFS_BALANCE_ARGS_CONVERT (1ULL << 8)
-#define BTRFS_BALANCE_ARGS_SOFT    (1ULL << 9)
-
 struct btrfs_balance_args;
 struct btrfs_balance_progress;
 struct btrfs_balance_control {
diff --git a/include/uapi/linux/btrfs.h b/include/uapi/linux/btrfs.h
index de98717..abae362 100644
--- a/include/uapi/linux/btrfs.h
+++ b/include/uapi/linux/btrfs.h
@@ -317,6 +317,70 @@ struct btrfs_balance_progress {
__u64 completed;/* # of chunks relocated so far */
 };
 
+/*
+ * flags definition for balance
+ *
+ * Restriper's general type filter
+ *
+ * Used by:
+ * btrfs_ioctl_balance_args.flags
+ * btrfs_balance_control.flags (internal)
+ */
+#define BTRFS_BALANCE_DATA (1ULL << 0)
+#define BTRFS_BALANCE_SYSTEM   (1ULL << 1)
+#define BTRFS_BALANCE_METADATA (1ULL << 2)
+
+#define BTRFS_BALANCE_TYPE_MASK        (BTRFS_BALANCE_DATA |   \
+BTRFS_BALANCE_SYSTEM | \
+BTRFS_BALANCE_METADATA)
+
+#define BTRFS_BALANCE_FORCE    (1ULL << 3)
+#define BTRFS_BALANCE_RESUME   (1ULL << 4)
+
+/*
+ * flags definitions for per-type balance args
+ *
+ * Balance filters
+ *
+ * Used by:
+ * struct btrfs_balance_args
+ */
+#define BTRFS_BALANCE_ARGS_PROFILES    (1ULL << 0)
+#define BTRFS_BALANCE_ARGS_USAGE   (1ULL << 1)
+#define BTRFS_BALANCE_ARGS_DEVID   (1ULL << 2)
+#define BTRFS_BALANCE_ARGS_DRANGE  (1ULL << 3)
+#define BTRFS_BALANCE_ARGS_VRANGE  (1ULL << 4)
+#define BTRFS_BALANCE_ARGS_LIMIT   (1ULL << 5)
+#define BTRFS_BALANCE_ARGS_LIMIT_RANGE (1ULL << 6)
+#define BTRFS_BALANCE_ARGS_STRIPES_RANGE (1ULL << 7)
+#define BTRFS_BALANCE_ARGS_USAGE_RANGE (1ULL << 10)
+
+#define BTRFS_BALANCE_ARGS_MASK                 \
+   (BTRFS_BALANCE_ARGS_PROFILES |  \
+BTRFS_BALANCE_ARGS_USAGE | \
+BTRFS_BALANCE_ARGS_DEVID | \
+BTRFS_BALANCE_ARGS_DRANGE |\
+BTRFS_BALANCE_ARGS_VRANGE |\
+BTRFS_BALANCE_ARGS_LIMIT | \
+BTRFS_BALANCE_ARGS_LIMIT_RANGE |   \
+BTRFS_BALANCE_ARGS_STRIPES_RANGE | \
+BTRFS_BALANCE_ARGS_USAGE_RANGE)
+
+/*
+ * Profile changing flags.  When SOFT is set we won't relocate chunk if
+ * it already has the target profile (even though it may be
+ * half-filled).
+ */
+#define BTRFS_BALANCE_ARGS_CONVERT (1ULL << 8)
+#define BTRFS_BALANCE_ARGS_SOFT    (1ULL << 9)
+
+
+/*
+ * flags definition for balance state
+ *
+ * Used by:
+ * 
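
The flag layout moved by this patch can be tried out in user space. What follows is only a minimal sketch: the macro values are copied from the hunk above, while the helpers balance_type_bits(), balance_args_known_bits() and balance_args_has_unknown_bits() are hypothetical names made up for illustration and are not claimed to match what the kernel or btrfs-progs actually do when validating a balance request.

```c
#include <assert.h>
#include <stdint.h>

/* Values copied from the hunk above. */
#define BTRFS_BALANCE_DATA		(1ULL << 0)
#define BTRFS_BALANCE_SYSTEM		(1ULL << 1)
#define BTRFS_BALANCE_METADATA		(1ULL << 2)
#define BTRFS_BALANCE_TYPE_MASK		(BTRFS_BALANCE_DATA | \
					 BTRFS_BALANCE_SYSTEM | \
					 BTRFS_BALANCE_METADATA)
#define BTRFS_BALANCE_FORCE		(1ULL << 3)
#define BTRFS_BALANCE_RESUME		(1ULL << 4)

#define BTRFS_BALANCE_ARGS_PROFILES	(1ULL << 0)
#define BTRFS_BALANCE_ARGS_USAGE	(1ULL << 1)
#define BTRFS_BALANCE_ARGS_DEVID	(1ULL << 2)
#define BTRFS_BALANCE_ARGS_DRANGE	(1ULL << 3)
#define BTRFS_BALANCE_ARGS_VRANGE	(1ULL << 4)
#define BTRFS_BALANCE_ARGS_LIMIT	(1ULL << 5)
#define BTRFS_BALANCE_ARGS_LIMIT_RANGE	(1ULL << 6)
#define BTRFS_BALANCE_ARGS_STRIPES_RANGE (1ULL << 7)
#define BTRFS_BALANCE_ARGS_USAGE_RANGE	(1ULL << 10)

#define BTRFS_BALANCE_ARGS_MASK			\
	(BTRFS_BALANCE_ARGS_PROFILES |		\
	 BTRFS_BALANCE_ARGS_USAGE |		\
	 BTRFS_BALANCE_ARGS_DEVID |		\
	 BTRFS_BALANCE_ARGS_DRANGE |		\
	 BTRFS_BALANCE_ARGS_VRANGE |		\
	 BTRFS_BALANCE_ARGS_LIMIT |		\
	 BTRFS_BALANCE_ARGS_LIMIT_RANGE |	\
	 BTRFS_BALANCE_ARGS_STRIPES_RANGE |	\
	 BTRFS_BALANCE_ARGS_USAGE_RANGE)

#define BTRFS_BALANCE_ARGS_CONVERT	(1ULL << 8)
#define BTRFS_BALANCE_ARGS_SOFT		(1ULL << 9)

/* Hypothetical helper: chunk types selected by a top-level flags word. */
static uint64_t balance_type_bits(uint64_t flags)
{
	return flags & BTRFS_BALANCE_TYPE_MASK;
}

/* Hypothetical helper: every per-type bit defined in the hunk above. */
static uint64_t balance_args_known_bits(void)
{
	/* CONVERT and SOFT are defined outside BTRFS_BALANCE_ARGS_MASK. */
	assert((BTRFS_BALANCE_ARGS_MASK &
		(BTRFS_BALANCE_ARGS_CONVERT | BTRFS_BALANCE_ARGS_SOFT)) == 0);
	return BTRFS_BALANCE_ARGS_MASK |
	       BTRFS_BALANCE_ARGS_CONVERT |
	       BTRFS_BALANCE_ARGS_SOFT;
}

/* Hypothetical check: non-zero if a bit outside the definitions is set. */
static int balance_args_has_unknown_bits(uint64_t args_flags)
{
	return (args_flags & ~balance_args_known_bits()) != 0;
}
```

For instance, balance_type_bits(BTRFS_BALANCE_METADATA | BTRFS_BALANCE_FORCE) keeps only the metadata bit, and a per-type word with bit 11 set is reported as unknown because the hunk defines nothing above bit 10.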

[PATCH 1/8] btrfs: uapi/linux/btrfs.h migration, move BTRFS_LABEL_SIZE

2016-04-01 Thread Jeff Mahoney
BTRFS_LABEL_SIZE is required to define the BTRFS_IOC_GET_FSLABEL and
BTRFS_IOC_SET_FSLABEL ioctls.

Signed-off-by: Jeff Mahoney 
---
 fs/btrfs/ctree.h   | 1 -
 include/uapi/linux/btrfs.h | 1 +
 2 files changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 84a6a5b..3beaa24 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -410,7 +410,6 @@ struct btrfs_header {
  * room to translate 14 chunks with 3 stripes each.
  */
 #define BTRFS_SYSTEM_CHUNK_ARRAY_SIZE 2048
-#define BTRFS_LABEL_SIZE 256
 
 /*
  * just in case we somehow lose the roots and are not able to mount,
diff --git a/include/uapi/linux/btrfs.h b/include/uapi/linux/btrfs.h
index dea8931..11eee34 100644
--- a/include/uapi/linux/btrfs.h
+++ b/include/uapi/linux/btrfs.h
@@ -23,6 +23,7 @@
 
 #define BTRFS_IOCTL_MAGIC 0x94
 #define BTRFS_VOL_NAME_MAX 255
+#define BTRFS_LABEL_SIZE 256
 
 /* this should be 4k */
 #define BTRFS_PATH_NAME_MAX 4087
-- 
2.7.1
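
Why the label ioctls need BTRFS_LABEL_SIZE can be seen from how ioctl request numbers are built: the _IOR/_IOW macros fold sizeof() of the payload type into the number. A rough sketch follows. The SKETCH_IOC_* names are made up here, the request numbers 49 and 50 are an assumption to be checked against include/uapi/linux/btrfs.h, and fill_label() is a hypothetical helper that assumes the label has to be NUL-terminated inside the buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <linux/ioctl.h>

/* Both values appear in the hunk above. */
#define BTRFS_IOCTL_MAGIC 0x94
#define BTRFS_LABEL_SIZE 256

/* Assumed request numbers; the payload type is what needs BTRFS_LABEL_SIZE. */
#define SKETCH_IOC_GET_FSLABEL \
	_IOR(BTRFS_IOCTL_MAGIC, 49, char[BTRFS_LABEL_SIZE])
#define SKETCH_IOC_SET_FSLABEL \
	_IOW(BTRFS_IOCTL_MAGIC, 50, char[BTRFS_LABEL_SIZE])

/* Payload size encoded in the request number. */
static unsigned int label_ioctl_payload_size(void)
{
	return _IOC_SIZE(SKETCH_IOC_GET_FSLABEL);
}

/*
 * Hypothetical helper: copy src into a zeroed label buffer.
 * Returns -1 if src would not fit together with its terminating NUL.
 */
static int fill_label(char dst[BTRFS_LABEL_SIZE], const char *src)
{
	size_t len;

	assert(dst != NULL && src != NULL);
	len = strlen(src);
	if (len >= BTRFS_LABEL_SIZE)
		return -1;
	memset(dst, 0, BTRFS_LABEL_SIZE);
	memcpy(dst, src, len);
	return 0;
}
```

A caller would fill a char[BTRFS_LABEL_SIZE] buffer this way and pass it to ioctl() on an open file descriptor inside the filesystem; that call is left out so the sketch stays runnable without a btrfs mount.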

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Another ENOSPC situation

2016-04-01 Thread Henk Slager
On Fri, Apr 1, 2016 at 6:50 PM, Marc Haber  wrote:
> On Fri, Apr 01, 2016 at 06:30:20PM +0200, Marc Haber wrote:
>> On Fri, Apr 01, 2016 at 05:44:30PM +0200, Henk Slager wrote:
>> > On Fri, Apr 1, 2016 at 3:40 PM, Marc Haber  
>> > wrote:
>> > > btrfs balance -mprofiles seems to do something. One kworker and one
>> > > btrfs-transaction process hog one CPU core each for hours, while
>> > > blocking the filesystem for minutes apiece, which leaves the host
>> > > nearly unusable, up to the point of "clock and mouse pointer
>> > > frozen for nearly ten minutes".
>> >
>> > I assume you still have your every 10 minutes snapshotting running
>> > while balancing?
>>
>> No, I disabled the cronjob before trying the balance. I might be
>> crazy, but not stup^wunexperienced.
>
> That being said, I would still expect the code not to allow _this_
> kind of effect on the entire system when two allegedly incompatible
> operations run simultaneously. I mean, Linux is a multi-user,
> multi-tasking operating system where one simply cannot expect all
> processes to be cooperative with each other. We have operating
> systems to prevent this kind of issue, not to cause it.

Maybe look at it differently: Does user mh have trouble using this
laptop w.r.t. storing files?

In openSUSE Tumbleweed (the snapshot from the end of March), root access
is needed to change the default snapshotting config; otherwise you
will have a 10-year history. Once that change has been made according
to the needs of the user, there is no need to run a manual balance.


Re: Compression causes kernel crashes if there are I/O or checksum errors (was: RE: kernel BUG at fs/btrfs/volumes.c:5519 when hot-removing device in RAID-1)

2016-04-01 Thread mitch
I grabbed this part from the log after the machine crashed again
following trying to transfer a bunch of files that included ones with
csum errors, let me know if this looks like the same issue you were
having:


Mar 31 00:49:42 sl-server kernel: NMI watchdog: BUG: soft lockup -
CPU#21 stuck for 22s! [kworker/u67:5:80994]
Mar 31 00:49:42 sl-server kernel: Modules linked in: fuse xt_CHECKSUM
ipt_MASQUERADE nf_nat_masquerade_ipv4 tun ip6t_rpfilter ip6t_REJECT
nf_reject_ipv6 ipt_REJECT nf_reject_ipv4 xt_conntrack ebtable_nat
ebtable_broute ebtable_filter ebtables ip6table_nat nf_conntrack_ipv6
nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_security
ip6table_raw ip6table_filter ip6_tables iptable_nat nf_conntrack_ipv4
nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle
iptable_security iptable_raw iptable_filter dm_mirror dm_region_hash
dm_log dm_mod kvm_amd kvm irqbypass crct10dif_pclmul crc32_pclmul
ghash_clmulni_intel xfs aesni_intel lrw gf128mul glue_helper libcrc32c
ablk_helper cryptd joydev input_leds edac_mce_amd k10temp edac_core
fam15h_power sp5100_tco sg i2c_piix4 8250_fintek acpi_cpufreq shpchp
nfsd auth_rpcgss nfs_acl
Mar 31 00:49:42 sl-server kernel:  lockd grace sunrpc ip_tables btrfs
xor ata_generic pata_acpi raid6_pq sd_mod mgag200 crc32c_intel
drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm ahci
serio_raw pata_atiixp libahci igb drm ptp pps_core mpt3sas dca
raid_class libata i2c_algo_bit scsi_transport_sas fjes uas usb_storage
Mar 31 00:49:42 sl-server kernel: CPU: 21 PID: 80994 Comm:
kworker/u67:5 Not tainted 4.5.0-1.el7.elrepo.x86_64 #1
Mar 31 00:49:42 sl-server kernel: Hardware name: Supermicro
H8DG6/H8DGi/H8DG6/H8DGi, BIOS 3.511/25/2013
Mar 31 00:49:42 sl-server kernel: Workqueue: btrfs-endio
btrfs_endio_helper [btrfs]
Mar 31 00:49:42 sl-server kernel: task: 8817f6fa8000 ti:
8800b731 task.ti: 8800b731
Mar 31 00:49:42 sl-server kernel: RIP:
0010:[]  []
btrfs_decompress_buf2page+0x123/0x200 [btrfs]
Mar 31 00:49:42 sl-server kernel: RSP: 0018:8800b7313be0  EFLAGS:
0246
Mar 31 00:49:42 sl-server kernel: RAX:  RBX:
 RCX: 
Mar 31 00:49:42 sl-server kernel: RDX:  RSI:
c9000e3d8000 RDI: 88144c7cc000
Mar 31 00:49:42 sl-server kernel: RBP: 8800b7313c48 R08:
8810f0295000 R09: 0020
Mar 31 00:49:42 sl-server kernel: R10: 8810d2ba7869 R11:
00010008 R12: 8817f6fa8000
Mar 31 00:49:42 sl-server kernel: R13: 8800b7313ce0 R14:
0008 R15: 1000
Mar 31 00:49:42 sl-server kernel: FS:  7efce58fb740()
GS:881807d4() knlGS:
Mar 31 00:49:42 sl-server kernel: CS:  0010 DS:  ES:  CR0:
8005003b
Mar 31 00:49:42 sl-server kernel: CR2: 7f00caf249e8 CR3:
001062121000 CR4: 000406e0
Mar 31 00:49:42 sl-server kernel: Stack:
Mar 31 00:49:42 sl-server kernel:  0020 f000
8810f0295000 8744
Mar 31 00:49:42 sl-server kernel:  00010008 c9000e3d7000
ea005131f300 0001
Mar 31 00:49:42 sl-server kernel:  0797 2869
0869 8810d2ba7000
Mar 31 00:49:42 sl-server kernel: Call Trace:
Mar 31 00:49:42 sl-server kernel:  []
lzo_decompress_biovec+0x202/0x300 [btrfs]
Mar 31 00:49:42 sl-server kernel:  []
end_compressed_bio_read+0x1f6/0x2f0 [btrfs]
Mar 31 00:49:42 sl-server kernel:  []
bio_endio+0x40/0x60
Mar 31 00:49:42 sl-server kernel:  []
end_workqueue_fn+0x3c/0x40 [btrfs]
Mar 31 00:49:42 sl-server kernel:  []
normal_work_helper+0xc0/0x2c0 [btrfs]
Mar 31 00:49:42 sl-server kernel:  []
btrfs_endio_helper+0x12/0x20 [btrfs]
Mar 31 00:49:42 sl-server kernel:  []
process_one_work+0x14f/0x400
Mar 31 00:49:42 sl-server kernel:  []
worker_thread+0x125/0x4b0
Mar 31 00:49:42 sl-server kernel:  [] ?
rescuer_thread+0x370/0x370
Mar 31 00:49:42 sl-server kernel:  []
kthread+0xd8/0xf0
Mar 31 00:49:42 sl-server kernel:  [] ?
kthread_park+0x60/0x60
Mar 31 00:49:42 sl-server kernel:  []
ret_from_fork+0x3f/0x70
Mar 31 00:49:42 sl-server kernel:  [] ?
kthread_park+0x60/0x60
Mar 31 00:49:42 sl-server kernel: Code: c7 48 8b 45 c0 49 03 7d 00 4a
8d 34 38 e8 06 18 00 e1 41 83 ac 24 28 12 00 00 01 41 8b 84 24 28 12 00
00 85 c0 0f 88 bf 00 00 00 <48> 89 d8 49 03 45 00 49 01 df 49 29 de 48
01 5d d0 48 3d 00 10 
Mar 31 00:49:43 sl-server sh[1297]: abrt-dump-oops: Found oopses: 1
Mar 31 00:49:43 sl-server sh[1297]: abrt-dump-oops: Creating problem
directories
Mar 31 00:49:43 sl-server sh[1297]: abrt-dump-oops: Not going to make
dump directories world readable because PrivateReports is on


Re: btrfs_destroy_inode WARN_ON.

2016-04-01 Thread Dave Jones
On Fri, Apr 01, 2016 at 02:12:27PM -0400, Dave Jones wrote:
 > BUG: workqueue lockup - pool cpus=1 node=0 flags=0x0 nice=0 stuck for 30s!
 > Showing busy workqueues and worker pools:
 > workqueue events: flags=0x0
 >   pwq 6: cpus=3 node=0 flags=0x0 nice=0 active=1/256
 > pending: vmstat_shepherd
 >   pwq 4: cpus=2 node=0 flags=0x0 nice=0 active=1/256
 > pending: check_corruption
 >   pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=3/256
 > pending: usb_serial_port_work, lru_add_drain_per_cpu BAR(17230), 
 > e1000_watchdog_task
 > workqueue events_power_efficient: flags=0x82
 >   pwq 8: cpus=0-3 flags=0x4 nice=0 active=3/256
 > pending: fb_flashcursor, neigh_periodic_work, neigh_periodic_work
 > workqueue events_freezable_power_: flags=0x86
 >   pwq 8: cpus=0-3 flags=0x4 nice=0 active=1/256
 > pending: disk_events_workfn
 > workqueue netns: flags=0x6000a
 >   pwq 8: cpus=0-3 flags=0x4 nice=0 active=1/1
 > in-flight: 10038:cleanup_net
 > workqueue writeback: flags=0x4e
 >   pwq 8: cpus=0-3 flags=0x4 nice=0 active=2/256
 > pending: wb_workfn, wb_workfn
 > workqueue kblockd: flags=0x18
 >   pwq 3: cpus=1 node=0 flags=0x0 nice=-20 active=2/256
 > pending: blk_mq_timeout_work, blk_mq_timeout_work
 > workqueue vmstat: flags=0xc
 >   pwq 4: cpus=2 node=0 flags=0x0 nice=0 active=1/256
 > pending: vmstat_update
 >   pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=1/256
 > pending: vmstat_update
 >   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256
 > pending: vmstat_update
 > pool 8: cpus=0-3 flags=0x4 nice=0 hung=0s workers=11 idle: 11638 10276 609 
 > 17937 606 9237 605 891 15998 14100
 > note: trinity-c13[18815] exited with preempt_count 1

This has wedged userspace too:

23082 pts/2    SN+    0:00  |   \_ /bin/bash scripts/test-multi.sh
14140 pts/2    SNL+   0:15  |   \_ ../trinity -q -l off -N 100 -a64 -x fsync -x fdatasync
16900 ?        DNs    0:04  |   \_ ../trinity -q -l off -N 100 -a64 -x fsync -x fdata
18894 ?        DNs    0:02  |   \_ ../trinity -q -l off -N 100 -a64 -x fsync -x fdata

(14:16:02:davej@think:trinity[master])$ stack 16900
[] wait_on_page_bit_killable+0x156/0x1b0
[] __lock_page_or_retry+0x112/0x1b0
[] filemap_fault+0x367/0xb30
[] __do_fault+0x167/0x3d0
[] handle_mm_fault+0x1837/0x2520
[] __do_page_fault+0x248/0x770
[] do_page_fault+0x39/0xa0
[] page_fault+0x1f/0x30
[] mm_release+0x1ec/0x230
[] do_exit+0x5d0/0x18c0
[] do_group_exit+0xac/0x190
[] get_signal+0x48f/0xeb0
[] do_signal+0xa0/0xb50
[] exit_to_usermode_loop+0xd9/0x100
[] do_syscall_64+0x238/0x2b0
[] return_from_SYSCALL_64+0x0/0x7a
[] 0x

(14:16:09:davej@think:trinity[master])$ stack 18894
[] btrfs_file_write_iter+0xe8/0x9a0 [btrfs]
[] __vfs_write+0x279/0x2e0
[] vfs_write+0x11e/0x2b0
[] SyS_write+0xd2/0x1a0
[] do_syscall_64+0x103/0x2b0
[] return_from_SYSCALL_64+0x0/0x7a
[] 0x

I tried to ftrace the latter process, and the box completely hung.

Dave


Re: btrfs_destroy_inode WARN_ON.

2016-04-01 Thread Dave Jones
On Sun, Mar 27, 2016 at 09:14:00PM -0400, Dave Jones wrote:
 
 >  > WARNING: CPU: 2 PID: 32570 at fs/btrfs/inode.c:9261 
 > btrfs_destroy_inode+0x389/0x3f0 [btrfs]
 >  > CPU: 2 PID: 32570 Comm: rm Not tainted 4.5.0-think+ #14
 >  >  c039baf9 ef721ef0 88025966fc08 8957bcdb
 >  >    88025966fc50 890b41f1
 >  >  88045d918040 242d4eed6048 88024eed6048 88024eed6048
 >  > Call Trace:
 >  >  [] ? btrfs_destroy_inode+0x389/0x3f0 [btrfs]
 >  >  [] dump_stack+0x68/0x9d
 >  >  [] __warn+0x111/0x130
 >  >  [] warn_slowpath_null+0x1d/0x20
 >  >  [] btrfs_destroy_inode+0x389/0x3f0 [btrfs]
 >  >  [] destroy_inode+0x67/0x90
 >  >  [] evict+0x1b7/0x240
 >  >  [] iput+0x3ae/0x4e0
 >  >  [] ? dput+0x20e/0x460
 >  >  [] do_unlinkat+0x256/0x440
 >  >  [] ? do_rmdir+0x350/0x350
 >  >  [] ? syscall_trace_enter_phase1+0x87/0x260
 >  >  [] ? enter_from_user_mode+0x50/0x50
 >  >  [] ? __lock_is_held+0x25/0xd0
 >  >  [] ? mark_held_locks+0x22/0xc0
 >  >  [] ? syscall_trace_enter_phase2+0x12d/0x3d0
 >  >  [] ? SyS_rmdir+0x20/0x20
 >  >  [] SyS_unlinkat+0x1b/0x30
 >  >  [] do_syscall_64+0xf4/0x240
 >  >  [] entry_SYSCALL64_slow_path+0x25/0x25
 >  > ---[ end trace a48ce4e6a1b5e409 ]---
 >  > 
 >  > That's WARN_ON(BTRFS_I(inode)->csum_bytes);
 >  > 
 >  > *maybe* it's a bad disk, but there's no indication in dmesg of anything 
 > awry.
 >  > Spinning rust on SATA, nothing special.
 > 
 > Same WARN_ON is reachable from umount too..
 > 
 > WARNING: CPU: 2 PID: 20092 at fs/btrfs/inode.c:9261 
 > btrfs_destroy_inode+0x40c/0x480 [btrfs]
 > CPU: 2 PID: 20092 Comm: umount Tainted: GW   4.5.0-think+ #1
 >   a32c482b 8803cd187b60 9d63af84
 >    c05c5e40 c04d316c
 >  8803cd187ba8 9d0c4c27 880460d80040 242dcd187bb0
 > Call Trace:
 >  [] dump_stack+0x95/0xe1
 >  [] ? btrfs_destroy_inode+0x40c/0x480 [btrfs]
 >  [] __warn+0x147/0x170
 >  [] warn_slowpath_null+0x31/0x40
 >  [] btrfs_destroy_inode+0x40c/0x480 [btrfs]
 >  [] ? btrfs_test_destroy_inode+0x40/0x40 [btrfs]
 >  [] destroy_inode+0x77/0xb0
 >  [] evict+0x20e/0x2c0
 >  [] dispose_list+0x70/0xb0
 >  [] evict_inodes+0x26f/0x2c0
 >  [] ? inode_add_lru+0x60/0x60
 >  [] ? fsnotify_unmount_inodes+0x215/0x2c0
 >  [] generic_shutdown_super+0x76/0x1c0
 >  [] kill_anon_super+0x29/0x40
 >  [] btrfs_kill_super+0x31/0x130 [btrfs]
 >  [] deactivate_locked_super+0x6f/0xb0
 >  [] deactivate_super+0x99/0xb0
 >  [] cleanup_mnt+0x70/0xd0
 >  [] __cleanup_mnt+0x1b/0x20
 >  [] task_work_run+0xef/0x130
 >  [] exit_to_usermode_loop+0xf9/0x100
 >  [] do_syscall_64+0x238/0x2b0
 >  [] entry_SYSCALL64_slow_path+0x25/0x25

Additional fallout:

BTRFS: assertion failed: num_extents, file: fs/btrfs/extent-tree.c, line: 5584
[ cut here ]
kernel BUG at fs/btrfs/ctree.h:4320!
invalid opcode:  [#1] PREEMPT SMP DEBUG_PAGEALLOC KASAN
CPU: 1 PID: 18815 Comm: trinity-c13 Tainted: GW   4.6.0-rc1-think+ 
#1
task: 88045de10040 ti: 8803afa38000 task.ti: 8803afa38000
RIP: 0010:[]  [] 
assfail.constprop.88+0x2b/0x2d [btrfs]
RSP: 0018:8803afa3f838  EFLAGS: 00010282
RAX: 004e RBX: c046e200 RCX: 
RDX:  RSI: 0003 RDI: ed0075f47efb
RBP: 8803afa3f848 R08: 0001 R09: 0001
R10:  R11: 0001 R12: 15d0
R13: 8803fda0e048 R14: 8803fda0dc38 R15: 8803fda0dc58
FS:  7fa0566d6700() GS:880468a0() knlGS:
CS:  0010 DS:  ES:  CR0: 80050033
CR2: 7fa0566d9000 CR3: 000333bc4000 CR4: 001406e0
DR0: 7fa0554fb000 DR1:  DR2: 
DR3:  DR6: 0ff0 DR7: 0600
Stack:
  8803fda0e048 8803afa3f880 c032288b
  880460bb33f8 8803fda0e048 8803fda0dc38
 8803fda0dc58 8803afa3f8c8 c032f851 0001
Call Trace:
 [] drop_outstanding_extent+0x10b/0x130 [btrfs]
 [] btrfs_delalloc_release_metadata+0x71/0x480 [btrfs]
 [] ? __btrfs_buffered_write+0xa6f/0xb50 [btrfs]
 [] btrfs_delalloc_release_space+0x27/0x50 [btrfs]
 [] __btrfs_buffered_write+0xa28/0xb50 [btrfs]
 [] ? btrfs_dirty_pages+0x1c0/0x1c0 [btrfs]
 [] ? filemap_fdatawait_range+0x3e/0x50
 [] ? generic_file_direct_write+0x237/0x2f0
 [] ? filemap_write_and_wait_range+0xa0/0xa0
 [] ? btrfs_file_write_iter+0x670/0x9a0 [btrfs]
 [] btrfs_file_write_iter+0x74d/0x9a0 [btrfs]
 [] do_iter_readv_writev+0x153/0x1f0
 [] ? btrfs_sync_file+0x920/0x920 [btrfs]
 [] ? vfs_iter_read+0x1e0/0x1e0
 [] ? preempt_count_sub+0xb9/0x130
 [] ? percpu_down_read+0x57/0xa0
 [] ? __sb_start_write+0xee/0x130
 [] ? btrfs_sync_file+0x920/0x920 [btrfs]
 [] do_readv_writev+0x30f/0x460
 [] ? vfs_write+0x2b0/0x2b0
 [] ? 

Re: Another ENOSPC situation

2016-04-01 Thread Marc Haber
On Fri, Apr 01, 2016 at 06:30:20PM +0200, Marc Haber wrote:
> On Fri, Apr 01, 2016 at 05:44:30PM +0200, Henk Slager wrote:
> > On Fri, Apr 1, 2016 at 3:40 PM, Marc Haber  
> > wrote:
> > > btrfs balance -mprofiles seems to do something. One kworker and one
> > > btrfs-transaction process hog one CPU core each for hours, while
> > > blocking the filesystem for minutes apiece, which leaves the host
> > > nearly unusable, up to the point of "clock and mouse pointer
> > > frozen for nearly ten minutes".
> > 
> > I assume you still have your every 10 minutes snapshotting running
> > while balancing?
> 
> No, I disabled the cronjob before trying the balance. I might be
> crazy, but not stup^wunexperienced.

That being said, I would still expect the code not to allow _this_
kind of effect on the entire system when two allegedly incompatible
operations run simultaneously. I mean, Linux is a multi-user,
multi-tasking operating system where one simply cannot expect all
processes to be cooperative with each other. We have operating
systems to prevent this kind of issue, not to cause it.

Greetings
Marc

-- 
-
Marc Haber | "I don't trust Computers. They | Mailadresse im Header
Leimen, Germany|  lose things."Winona Ryder | Fon: *49 6224 1600402
Nordisch by Nature |  How to make an American Quilt | Fax: *49 6224 1600421


Re: [PATCH v9 00/19] Btrfs dedupe framework

2016-04-01 Thread David Sterba
On Fri, Apr 01, 2016 at 08:26:43AM +0800, Qu Wenruo wrote:
> 
> 
> David Sterba wrote on 2016/03/31 18:12 +0200:
> > On Wed, Mar 30, 2016 at 03:55:55PM +0800, Qu Wenruo wrote:
> >> This March 30th patchset update mostly addresses the patchset structure
> >> comment from David:
> >> 1) Change the patchset sequence
> >> Now, if only the first 14 patches are applied, they provide the full
> >> backward-compatible in-memory-only dedupe backend.
> >>
> >> Only starts from patch 15, on-disk format will be changed.
> >>
> >> So patch 1~14 is going to be pushed for next merge window, while I'll
> >> still submit them all for review purpose.
> >
> > I'll buy 1-10 with the ioctl hidden under the BTRFS_DEBUG config option
> > until the interface is settled.
> >
> >
> Nice to hear that.
> 
> I'll add BTRFS_DEBUG config then.

Independent of the next merge window, I'll add them to my for-next after
you send the updated version. I'll also try to review them next week,
but I don't remember any critical issue during first reading, so there's
no blocker.

> BTW, any comment on btrfs-convert rewrite?

This is not the right place to ask; better to ping as a reply to the
thread, as I could miss it. Nevertheless, the answer is that it's going
to the devel branch; the convert tests passed (as the required minimum),
but the patchset is still not reviewed to my satisfaction.


Re: Another ENOSPC situation

2016-04-01 Thread Marc Haber
On Fri, Apr 01, 2016 at 05:44:30PM +0200, Henk Slager wrote:
> On Fri, Apr 1, 2016 at 3:40 PM, Marc Haber  
> wrote:
> > btrfs balance -mprofiles seems to do something. One kworker and one
> > btrfs-transaction process hog one CPU core each for hours, while
> > blocking the filesystem for minutes apiece, which leaves the host
> > nearly unusable, up to the point of "clock and mouse pointer
> > frozen for nearly ten minutes".
> 
> I assume you still have your every 10 minutes snapshotting running
> while balancing?

No, I disabled the cronjob before trying the balance. I might be
crazy, but not stup^wunexperienced.

Greetings
Marc

-- 
-
Marc Haber | "I don't trust Computers. They | Mailadresse im Header
Leimen, Germany|  lose things."Winona Ryder | Fon: *49 6224 1600402
Nordisch by Nature |  How to make an American Quilt | Fax: *49 6224 1600421


Re: Another ENOSPC situation

2016-04-01 Thread Henk Slager
On Fri, Apr 1, 2016 at 3:40 PM, Marc Haber  wrote:
> Hi,
>
> just for a change, this is another btrfs on a different host. The host
> is also running Debian unstable with mainline kernels, the btrfs in
> question was created (not converted) in March 2015 with btrfs-tools
> 3.17. It is the root fs of my main work notebook which is under
> workstation load, with lots of snapshots being created and deleted.
>
> Balance immediately fails with ENOSPC
>
> balance -dprofiles=single -dusage=1 goes through "fine" ("had to
> relocate 0 out of 602 chunks")
>
> balance -dprofiles=single -dusage=2 also ENOSPCes immediately.
>
> [4/502]mh@swivel:~$ sudo btrfs fi usage /
> Overall:
>     Device size:                 600.00GiB
>     Device allocated:            600.00GiB
>     Device unallocated:            1.00MiB
>     Device missing:                  0.00B
>     Used:                        413.40GiB
>     Free (estimated):            148.20GiB      (min: 148.20GiB)
>     Data ratio:                       1.00
>     Metadata ratio:                   2.00
>     Global reserve:              512.00MiB      (used: 0.00B)
>
> Data,single: Size:553.93GiB, Used:405.73GiB
>/dev/mapper/swivelbtr 553.93GiB
>
> Metadata,DUP: Size:23.00GiB, Used:3.83GiB
>/dev/mapper/swivelbtr  46.00GiB
>
> System,DUP: Size:32.00MiB, Used:112.00KiB
>/dev/mapper/swivelbtr  64.00MiB
>
> Unallocated:
>/dev/mapper/swivelbtr   1.00MiB
> [5/503]mh@swivel:~$
>
> btrfs balance -mprofiles seems to do something. One kworker and one
> btrfs-transaction process hog one CPU core each for hours, while
> blocking the filesystem for minutes apiece, which leaves the host
> nearly unusable, up to the point of "clock and mouse pointer
> frozen for nearly ten minutes".

I assume you still have your every 10 minutes snapshotting running
while balancing?

> The btrfs balance cancel I issued after four hours of this state took
> eleven minutes alone to complete.
>
> These are all log entries that were obtained after starting btrfs
> balance -mprofiles at 09:43:
> Apr  1 12:18:21 swivel kernel: [253651.970413] BTRFS info (device dm-14): 
> found 3523 extents
> Apr  1 12:18:21 swivel kernel: [253652.035572] BTRFS info (device dm-14): 
> relocating block group 1538365849600 flags 36
> Apr  1 13:30:57 swivel kernel: [258007.653597] BTRFS info (device dm-14): 
> found 3585 extents
> Apr  1 13:30:57 swivel kernel: [258007.746541] BTRFS info (device dm-14): 
> relocating block group 1536755236864 flags 36
> Apr  1 13:49:39 swivel kernel: [259130.296184] BTRFS info (device dm-14): 
> found 3047 extents
> Apr  1 13:49:39 swivel kernel: [259130.357314] BTRFS info (device dm-14): 
> relocating block group 1528702173184 flags 36
> Apr  1 14:30:00 swivel kernel: [261550.776348] BTRFS info (device dm-14): 
> found 4200 extents
>
> This kernel trace from 11:16 is not btrfs-related, is it? I guess it's
> Bluetooth-related, since it happened at the same time as the Bluetooth
> device popping out and in:
> Apr  1 11:16:38 swivel kernel: [249948.993751] usb 1-1.4: USB disconnect, 
> device number 39
> Apr  1 11:16:38 swivel systemd[1]: Starting Load/Save RF Kill Switch Status...
> Apr  1 11:16:38 swivel systemd[1]: Started Load/Save RF Kill Switch Status.
> Apr  1 11:16:38 swivel systemd[1]: bluetooth.target: Unit not needed anymore. 
> Stopping.
> Apr  1 11:16:38 swivel systemd[1]: Stopped target Bluetooth.
> Apr  1 11:16:38 swivel laptop-mode: Laptop mode
> Apr  1 11:16:38 swivel laptop-mode: enabled, not active
> Apr  1 11:16:39 swivel kernel: [249949.211549] usb 1-1.4: new full-speed USB 
> device number 40 using ehci-pci
> Apr  1 11:16:39 swivel kernel: [249949.308386] usb 1-1.4: New USB device 
> found, idVendor=0a5c, idProduct=217f
> Apr  1 11:16:39 swivel kernel: [249949.308397] usb 1-1.4: New USB device 
> strings: Mfr=1, Product=2, SerialNumber=3
> Apr  1 11:16:39 swivel kernel: [249949.308402] usb 1-1.4: Product: Broadcom 
> Bluetooth Device
> Apr  1 11:16:39 swivel kernel: [249949.308407] usb 1-1.4: Manufacturer: 
> Broadcom Corp
> Apr  1 11:16:39 swivel kernel: [249949.308412] usb 1-1.4: SerialNumber: 
> CCAF78F1274F
> Apr  1 11:16:39 swivel systemd[1]: Reached target Bluetooth.
> Apr  1 11:16:39 swivel kernel: [249949.507794] [ cut here 
> ]
> Apr  1 11:16:39 swivel kernel: [249949.507810] WARNING: CPU: 1 PID: 11 at 
> arch/x86/kernel/cpu/perf_event_intel_ds.c:325 reserve_ds_buffers+0x102/0x326()
> Apr  1 11:16:39 swivel kernel: [249949.507813] alloc_bts_buffer: BTS buffer 
> allocation failure
> Apr  1 11:16:39 swivel kernel: [249949.507816] Modules linked in: cpuid 
> hid_generic usbhid hid e1000e tun ctr ccm rfcomm bridge stp llc 
> cpufreq_userspace cpufreq_stats cpufreq_conservative cpufreq_powersave 
> nf_conntrack_netlink nfnetlink bnep binfmt_misc intel_rapl 
> x86_pkg_temp_thermal arc4 intel_powerclamp kvm_intel kvm irqbypass iwldvm 
> snd_hda_codec_conexant 

Re: Again, no space left on device while rebalancing and recipe doesnt work

2016-04-01 Thread Marc Haber
On Sat, Feb 27, 2016 at 10:14:50PM +0100, Marc Haber wrote:
> I have again the issue of no space left on device while rebalancing
> (with btrfs-tools 4.4.1 on kernel 4.4.2 on Debian unstable):

just for the record: The host started acting up in more and more
interesting ways, and after a call of rm during kernel build resulted
in SIGSEGV, I did the backup-format-restore routine for this system
back to ext4 just to find out whether I have bad hardware or a bad
filesystem.

And, since going back to ext4, the system is just fine again. So it's
not bad hardware.

This system's root drive is going to stay on ext4 for a loong
time. If the btrfs phenomena I experience on other hosts get
solved at some time in the future, I might migrate /home back to
btrfs, but that's not going to happen in the next six months.

This is a really bad experience which has made me lose a lot of faith
in the new filesystem. I really feel sad about that.

Greetings
Marc

-- 
-
Marc Haber | "I don't trust Computers. They | Mailadresse im Header
Leimen, Germany|  lose things."Winona Ryder | Fon: *49 6224 1600402
Nordisch by Nature |  How to make an American Quilt | Fax: *49 6224 1600421


Another ENOSPC situation

2016-04-01 Thread Marc Haber
Hi,

just for a change, this is another btrfs on a different host. The host
is also running Debian unstable with mainline kernels, the btrfs in
question was created (not converted) in March 2015 with btrfs-tools
3.17. It is the root fs of my main work notebook which is under
workstation load, with lots of snapshots being created and deleted.

Balance immediately fails with ENOSPC

balance -dprofiles=single -dusage=1 goes through "fine" ("had to
relocate 0 out of 602 chunks")

balance -dprofiles=single -dusage=2 also ENOSPCes immediately.

[4/502]mh@swivel:~$ sudo btrfs fi usage /
Overall:
    Device size:                 600.00GiB
    Device allocated:            600.00GiB
    Device unallocated:            1.00MiB
    Device missing:                  0.00B
    Used:                        413.40GiB
    Free (estimated):            148.20GiB      (min: 148.20GiB)
    Data ratio:                       1.00
    Metadata ratio:                   2.00
    Global reserve:              512.00MiB      (used: 0.00B)

Data,single: Size:553.93GiB, Used:405.73GiB
   /dev/mapper/swivelbtr 553.93GiB

Metadata,DUP: Size:23.00GiB, Used:3.83GiB
   /dev/mapper/swivelbtr  46.00GiB

System,DUP: Size:32.00MiB, Used:112.00KiB
   /dev/mapper/swivelbtr  64.00MiB

Unallocated:
   /dev/mapper/swivelbtr   1.00MiB
[5/503]mh@swivel:~$ 
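
As a rough cross-check of the figures above, the following sketch redoes the arithmetic: raw allocation is the chunk size times the number of copies (1 for single, 2 for DUP), and the free space inside a chunk type is its size minus what is used. The struct and function names are made up for this illustration and the numbers are simply copied from the output, so this is not how btrfs-progs itself computes them.

```c
#include <stdio.h>

/* One chunk type from the `btrfs fi usage` output above; sizes in GiB. */
struct chunk_type {
	const char *name;
	double size;	/* logical size of all allocated chunks */
	double used;	/* space used inside those chunks */
	int copies;	/* 1 for single, 2 for DUP */
};

/* Values copied from the report; the names are hypothetical. */
static const struct chunk_type swivel_data = { "Data,single", 553.93, 405.73, 1 };
static const struct chunk_type swivel_meta = { "Metadata,DUP", 23.00, 3.83, 2 };
static const struct chunk_type swivel_sys = { "System,DUP", 32.0 / 1024, 112.0 / (1024 * 1024), 2 };

/* Raw device space consumed by a chunk type. */
static double raw_allocated(const struct chunk_type *c)
{
	return c->size * c->copies;
}

/* Space that is allocated to this chunk type but not yet used. */
static double free_in_chunks(const struct chunk_type *c)
{
	return c->size - c->used;
}

/* Sum of the raw allocations of all three chunk types. */
static double total_raw_allocated(void)
{
	return raw_allocated(&swivel_data) + raw_allocated(&swivel_meta) +
	       raw_allocated(&swivel_sys);
}

/* Print one summary line for a chunk type. */
static void print_chunk(const struct chunk_type *c)
{
	printf("%-13s raw %8.2f GiB, free inside chunks %8.2f GiB\n",
	       c->name, raw_allocated(c), free_in_chunks(c));
}
```

Calling print_chunk() on the three entries shows that data and metadata chunks both still have room inside them, while total_raw_allocated() comes out at practically the whole 600 GiB device, which is why there is no unallocated space left to create a new chunk from.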

btrfs balance -mprofiles seems to do something: one kworker and one
btrfs-transaction process hog one CPU core each for hours, while
blocking the filesystem for minutes at a time, which leaves the host
nearly unusable, up to the point of "clock and mouse pointer
frozen for nearly ten minutes".

The btrfs balance cancel I issued after four hours of this state took
eleven minutes on its own to complete.

These are all the log entries that were obtained after starting btrfs
balance -mprofiles at 09:43:
Apr  1 12:18:21 swivel kernel: [253651.970413] BTRFS info (device dm-14): found 3523 extents
Apr  1 12:18:21 swivel kernel: [253652.035572] BTRFS info (device dm-14): relocating block group 1538365849600 flags 36
Apr  1 13:30:57 swivel kernel: [258007.653597] BTRFS info (device dm-14): found 3585 extents
Apr  1 13:30:57 swivel kernel: [258007.746541] BTRFS info (device dm-14): relocating block group 1536755236864 flags 36
Apr  1 13:49:39 swivel kernel: [259130.296184] BTRFS info (device dm-14): found 3047 extents
Apr  1 13:49:39 swivel kernel: [259130.357314] BTRFS info (device dm-14): relocating block group 1528702173184 flags 36
Apr  1 14:30:00 swivel kernel: [261550.776348] BTRFS info (device dm-14): found 4200 extents

This kernel trace from 11:16 is not btrfs-related, is it? I guess it's
bluetooth-related, since it happened at the same time as the bluetooth
device popping out and in:
Apr  1 11:16:38 swivel kernel: [249948.993751] usb 1-1.4: USB disconnect, device number 39
Apr  1 11:16:38 swivel systemd[1]: Starting Load/Save RF Kill Switch Status...
Apr  1 11:16:38 swivel systemd[1]: Started Load/Save RF Kill Switch Status.
Apr  1 11:16:38 swivel systemd[1]: bluetooth.target: Unit not needed anymore. Stopping.
Apr  1 11:16:38 swivel systemd[1]: Stopped target Bluetooth.
Apr  1 11:16:38 swivel laptop-mode: Laptop mode
Apr  1 11:16:38 swivel laptop-mode: enabled, not active
Apr  1 11:16:39 swivel kernel: [249949.211549] usb 1-1.4: new full-speed USB device number 40 using ehci-pci
Apr  1 11:16:39 swivel kernel: [249949.308386] usb 1-1.4: New USB device found, idVendor=0a5c, idProduct=217f
Apr  1 11:16:39 swivel kernel: [249949.308397] usb 1-1.4: New USB device strings: Mfr=1, Product=2, SerialNumber=3
Apr  1 11:16:39 swivel kernel: [249949.308402] usb 1-1.4: Product: Broadcom Bluetooth Device
Apr  1 11:16:39 swivel kernel: [249949.308407] usb 1-1.4: Manufacturer: Broadcom Corp
Apr  1 11:16:39 swivel kernel: [249949.308412] usb 1-1.4: SerialNumber: CCAF78F1274F
Apr  1 11:16:39 swivel systemd[1]: Reached target Bluetooth.
Apr  1 11:16:39 swivel kernel: [249949.507794] ------------[ cut here ]------------
Apr  1 11:16:39 swivel kernel: [249949.507810] WARNING: CPU: 1 PID: 11 at arch/x86/kernel/cpu/perf_event_intel_ds.c:325 reserve_ds_buffers+0x102/0x326()
Apr  1 11:16:39 swivel kernel: [249949.507813] alloc_bts_buffer: BTS buffer allocation failure
Apr  1 11:16:39 swivel kernel: [249949.507816] Modules linked in: cpuid 
hid_generic usbhid hid e1000e tun ctr ccm rfcomm bridge stp llc 
cpufreq_userspace cpufreq_stats cpufreq_conservative cpufreq_powersave 
nf_conntrack_netlink nfnetlink bnep binfmt_misc intel_rapl x86_pkg_temp_thermal 
arc4 intel_powerclamp kvm_intel kvm irqbypass iwldvm snd_hda_codec_conexant 
snd_hda_codec_generic mac80211 input_leds btusb btbcm i2c_i801 snd_hda_intel 
btintel snd_hda_codec bluetooth iwlwifi snd_hda_core cfg80211 snd_hwdep sg 
snd_pcm_oss snd_mixer_oss lpc_ich mfd_core snd_pcm shpchp snd_timer 
thinkpad_acpi nvram snd battery soundcore rfkill ac tpm_tis tpm evdev processor 
xt_TCPMSS xt_tcpudp iptable_mangle iptable_filter 

Re: [PATCH] btrfs-progs: fsck: Fix a false metadata extent warning

2016-04-01 Thread David Sterba
On Fri, Apr 01, 2016 at 08:09:56PM +0800, Qu Wenruo wrote:
> 
> 
> On 04/01/2016 07:39 PM, David Sterba wrote:
> > On Fri, Apr 01, 2016 at 04:50:06PM +0800, Qu Wenruo wrote:
> >>> After another look, why don't we use nodesize directly? Or stripesize
> >>> where it applies. With max_size == 0 the test does not make sense, we
> >>> ought to know the alignment.
> >>>
> >> Yes, my first thought was also to use nodesize directly, which should
> >> always be correct.
> >>
> >> But the problem is, the related function call stack doesn't have any
> >> member to reach btrfs_root or btrfs_fs_info.
> >>
> >> In the very first version of this crossing stripe check, I added
> >> a btrfs_root/btrfs_fs_info parameter to the function.
> >>
> >> But the code changes were too many, so I used 'max_size'.
> >>
> >> I can try to redo that modification, but IIRC it didn't give a good
> >> result.
> >
> > Yes, it would require refactoring, which would be good in itself, because
> > add_extent_rec takes 12(!) parameters. Some of its callers would need to
> > be updated, but it seems doable.
> 
> I'll try to refactor.

I'm working on it.


Re: "bad metadata" not fixed by btrfs repair

2016-04-01 Thread Marc Haber
On Thu, Mar 31, 2016 at 08:42:46PM +0200, Henk Slager wrote:
> So also false alerts.

btrfs-tools 4.5.1 with Qu's patch from patchwork doesn't show those
warnings any more.

Greetings
Marc

-- 
-
Marc Haber | "I don't trust Computers. They | Mailadresse im Header
Leimen, Germany|  lose things."Winona Ryder | Fon: *49 6224 1600402
Nordisch by Nature |  How to make an American Quilt | Fax: *49 6224 1600421


Re: [PATCH] btrfs-progs: fsck: Fix a false metadata extent warning

2016-04-01 Thread Qu Wenruo



On 04/01/2016 07:39 PM, David Sterba wrote:
> On Fri, Apr 01, 2016 at 04:50:06PM +0800, Qu Wenruo wrote:
>>> After another look, why don't we use nodesize directly? Or stripesize
>>> where it applies. With max_size == 0 the test does not make sense, we
>>> ought to know the alignment.
>>>
>> Yes, my first thought was also to use nodesize directly, which should
>> always be correct.
>>
>> But the problem is, the related function call stack doesn't have any
>> member to reach btrfs_root or btrfs_fs_info.
>>
>> In the very first version of this crossing stripe check, I added
>> a btrfs_root/btrfs_fs_info parameter to the function.
>>
>> But the code changes were too many, so I used 'max_size'.
>>
>> I can try to redo that modification, but IIRC it didn't give a good
>> result.
>
> Yes, it would require refactoring, which would be good in itself, because
> add_extent_rec takes 12(!) parameters. Some of its callers would need to
> be updated, but it seems doable.

I'll try to refactor.
I thought the current extent-tree rework would change all this mess, but
considering the case of btrfs-convert, I'd better refactor the current code
rather than wait for other reviewers to appear.


Thanks,
Qu




Re: [PATCH] btrfs-progs: fsck: Fix a false metadata extent warning

2016-04-01 Thread David Sterba
On Fri, Apr 01, 2016 at 04:50:06PM +0800, Qu Wenruo wrote:
> > After another look, why don't we use nodesize directly? Or stripesize
> > where it applies. With max_size == 0 the test does not make sense, we
> > ought to know the alignment.
> >
> Yes, my first thought was also to use nodesize directly, which should
> always be correct.
>
> But the problem is, the related function call stack doesn't have any
> member to reach btrfs_root or btrfs_fs_info.
>
> In the very first version of this crossing stripe check, I added
> a btrfs_root/btrfs_fs_info parameter to the function.
>
> But the code changes were too many, so I used 'max_size'.
>
> I can try to redo that modification, but IIRC it didn't give a good
> result.

Yes, it would require refactoring, which would be good in itself, because
add_extent_rec takes 12(!) parameters. Some of its callers would need to
be updated, but it seems doable.


Re: empty disk reports full

2016-04-01 Thread Hugo Mills
On Fri, Apr 01, 2016 at 11:50:50AM +0200, Alejandro Vargas wrote:
> I am using a 2TB disk for incremental backups.
>
> I use rsync to back up to a subvolume, and each day I create a snapshot
> of the latest snapshot and run rsync into it.
>
> When the disk becomes nearly full (100GB or less available) I delete the
> oldest subvolume (with btrfs subvolume delete).
>
> My problem is that *even after removing ALL the subvolumes*, the free space
> does not change. It continues reporting the same size (disk is nearly full).
>
> I tried "btrfs balance start /mnt/backup" but it takes hours and hours.
>
> I'm using linux 4.1.15
> btrfs-progs v4.1.2

   Can you show us the output of both "sudo btrfs fi show" and "btrfs
fi df /mnt/backup", please?

   Hugo.

-- 
Hugo Mills | The Creature from the Black Logon
hugo@... carfax.org.uk |
http://carfax.org.uk/  |
PGP: E2AB1DE4  |




Re: [PATCH v10 02/21] btrfs: dedupe: Introduce function to initialize dedupe info

2016-04-01 Thread kbuild test robot
Hi Wang,

[auto build test ERROR on btrfs/next]
[also build test ERROR on v4.6-rc1 next-20160401]
[if your patch is applied to the wrong git tree, please drop us a note to help 
improving the system]

url:    https://github.com/0day-ci/linux/commits/Qu-Wenruo/Btrfs-dedupe-framework/20160401-143937
base:   https://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git next
config: x86_64-rhel (attached as .config)
reproduce:
        # save the attached .config to linux build tree
        make ARCH=x86_64

Note: the linux-review/Qu-Wenruo/Btrfs-dedupe-framework/20160401-143937 HEAD 0a445f5009c064ee1d3fc966e41bb75627594afe builds fine.
      It only hurts bisectibility.

All errors (new ones prefixed by >>):

>> ERROR: "btrfs_dedupe_disable" [fs/btrfs/btrfs.ko] undefined!

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation




empty disk reports full

2016-04-01 Thread Alejandro Vargas
I am using a 2TB disk for incremental backups.

I use rsync to back up to a subvolume, and each day I create a snapshot
of the latest snapshot and run rsync into it.

When the disk becomes nearly full (100GB or less available) I delete the
oldest subvolume (with btrfs subvolume delete).

My problem is that *even after removing ALL the subvolumes*, the free space
does not change. It continues reporting the same size (disk is nearly full).

I tried "btrfs balance start /mnt/backup" but it takes hours and hours.

I'm using linux 4.1.15
btrfs-progs v4.1.2

 

[PATCH v10 16/21] btrfs: dedupe: Add basic tree structure for on-disk dedupe method

2016-04-01 Thread Qu Wenruo
Introduce a new tree, the dedupe tree, to record on-disk dedupe hashes,
as persistent hash storage instead of the in-memory-only implementation.

Unlike Liu Bo's implementation, in this version we won't use a hack for
bytenr -> hash search, but add a new type, DEDUP_BYTENR_ITEM, for that
search case, just like the in-memory backend.
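
To illustrate the two lookup directions described above, here is a minimal sketch of how the two keys could be built from one hash and one bytenr, following the layout given in the ctree.h comments of this patch (hash item: objectid = last 64 bits of the hash, offset = bytenr; bytenr item: the reverse). The struct and helper names are hypothetical, and reading the hash tail as little-endian is an assumption made only for this example.

```c
#include <stdint.h>

/* Key type values taken from the ctree.h hunk of this patch. */
#define SKETCH_DEDUPE_HASH_ITEM_KEY	231
#define SKETCH_DEDUPE_BYTENR_ITEM_KEY	232
#define SKETCH_SHA256_LEN		32

/* Simplified stand-in for a btrfs key (hypothetical, not the kernel struct). */
struct sketch_key {
	uint64_t objectid;
	uint8_t type;
	uint64_t offset;
};

/* Last 64 bits of the hash, read as little-endian (an assumption). */
static uint64_t sketch_hash_tail(const uint8_t hash[SKETCH_SHA256_LEN])
{
	uint64_t v = 0;
	int i;

	for (i = 7; i >= 0; i--)
		v = (v << 8) | hash[SKETCH_SHA256_LEN - 8 + i];
	return v;
}

/* Key for hash -> bytenr search. */
static struct sketch_key sketch_hash_key(const uint8_t hash[SKETCH_SHA256_LEN],
					 uint64_t bytenr)
{
	struct sketch_key key;

	key.objectid = sketch_hash_tail(hash);
	key.type = SKETCH_DEDUPE_HASH_ITEM_KEY;
	key.offset = bytenr;
	return key;
}

/* Key for bytenr -> hash search (the free_extent case). */
static struct sketch_key sketch_bytenr_key(const uint8_t hash[SKETCH_SHA256_LEN],
					   uint64_t bytenr)
{
	struct sketch_key key;

	key.objectid = bytenr;
	key.type = SKETCH_DEDUPE_BYTENR_ITEM_KEY;
	key.offset = sketch_hash_tail(hash);
	return key;
}
```

Because the two keys are mirror images of each other, a search starting from either the hash or the bytenr is an ordinary key lookup in the tree, which is the point of adding the separate bytenr item type.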

Signed-off-by: Liu Bo 
Signed-off-by: Wang Xiaoguang 
Signed-off-by: Qu Wenruo 
---
Fix a small rebase bug, which missed 4 lines.
---
 fs/btrfs/ctree.h | 53 +++-
 fs/btrfs/dedupe.h|  5 +
 fs/btrfs/disk-io.c   |  6 +
 fs/btrfs/relocation.c|  3 ++-
 include/trace/events/btrfs.h |  3 ++-
 5 files changed, 67 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 0e8933c..659790c 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -100,6 +100,9 @@ struct btrfs_ordered_sum;
 /* tracks free space in block groups. */
 #define BTRFS_FREE_SPACE_TREE_OBJECTID 10ULL
 
+/* on-disk dedupe tree (EXPERIMENTAL) */
+#define BTRFS_DEDUPE_TREE_OBJECTID 11ULL
+
 /* device stats in the device tree */
 #define BTRFS_DEV_STATS_OBJECTID 0ULL
 
@@ -538,7 +541,8 @@ struct btrfs_super_block {
 #define BTRFS_FEATURE_COMPAT_SAFE_CLEAR0ULL
 
 #define BTRFS_FEATURE_COMPAT_RO_SUPP   \
-   (BTRFS_FEATURE_COMPAT_RO_FREE_SPACE_TREE)
+   (BTRFS_FEATURE_COMPAT_RO_FREE_SPACE_TREE |  \
+BTRFS_FEATURE_COMPAT_RO_DEDUPE)
 
 #define BTRFS_FEATURE_COMPAT_RO_SAFE_SET   0ULL
 #define BTRFS_FEATURE_COMPAT_RO_SAFE_CLEAR 0ULL
@@ -960,6 +964,36 @@ struct btrfs_csum_item {
u8 csum;
 } __attribute__ ((__packed__));
 
+/*
+ * Objectid: 0
+ * Type: BTRFS_DEDUPE_STATUS_ITEM_KEY
+ * Offset: 0
+ */
+struct btrfs_dedupe_status_item {
+   __le64 blocksize;
+   __le64 limit_nr;
+   __le16 hash_type;
+   __le16 backend;
+} __attribute__ ((__packed__));
+
+/*
+ * Objectid: Last 64 bit of the hash
+ * Type: BTRFS_DEDUPE_HASH_ITEM_KEY
+ * Offset: Bytenr of the hash
+ *
+ * Used for hash <-> bytenr search
+ * Hash exclude the last 64 bit follows
+ */
+
+/*
+ * Objectid: bytenr
+ * Type: BTRFS_DEDUPE_BYTENR_ITEM_KEY
+ * offset: Last 64 bit of the hash
+ *
+ * Used for bytenr <-> hash search (for free_extent)
+ * Its itemsize should always be 0.
+ */
+
 struct btrfs_dev_stats_item {
/*
 * grow this item struct at the end for future enhancements and keep
@@ -2168,6 +2202,13 @@ struct btrfs_ioctl_defrag_range_args {
 #define BTRFS_CHUNK_ITEM_KEY   228
 
 /*
+ * Dedup item and status
+ */
+#define BTRFS_DEDUPE_STATUS_ITEM_KEY   230
+#define BTRFS_DEDUPE_HASH_ITEM_KEY 231
+#define BTRFS_DEDUPE_BYTENR_ITEM_KEY   232
+
+/*
  * Records the overall state of the qgroups.
  * There's only one instance of this key present,
  * (0, BTRFS_QGROUP_STATUS_KEY, 0)
@@ -3265,6 +3306,16 @@ static inline unsigned long btrfs_leaf_data(struct extent_buffer *l)
return offsetof(struct btrfs_leaf, items);
 }
 
+/* btrfs_dedupe_status */
+BTRFS_SETGET_FUNCS(dedupe_status_blocksize, struct btrfs_dedupe_status_item,
+  blocksize, 64);
+BTRFS_SETGET_FUNCS(dedupe_status_limit, struct btrfs_dedupe_status_item,
+  limit_nr, 64);
+BTRFS_SETGET_FUNCS(dedupe_status_hash_type, struct btrfs_dedupe_status_item,
+  hash_type, 16);
+BTRFS_SETGET_FUNCS(dedupe_status_backend, struct btrfs_dedupe_status_item,
+  backend, 16);
+
 /* struct btrfs_file_extent_item */
 BTRFS_SETGET_FUNCS(file_extent_type, struct btrfs_file_extent_item, type, 8);
 BTRFS_SETGET_STACK_FUNCS(stack_file_extent_disk_bytenr,
diff --git a/fs/btrfs/dedupe.h b/fs/btrfs/dedupe.h
index f5d2b45..1ac1bcb 100644
--- a/fs/btrfs/dedupe.h
+++ b/fs/btrfs/dedupe.h
@@ -60,6 +60,8 @@ struct btrfs_dedupe_hash {
u8 hash[];
 };
 
+struct btrfs_root;
+
 struct btrfs_dedupe_info {
/* dedupe blocksize */
u64 blocksize;
@@ -75,6 +77,9 @@ struct btrfs_dedupe_info {
struct list_head lru_list;
u64 limit_nr;
u64 current_nr;
+
+   /* for persist data like dedup-hash and dedupe status */
+   struct btrfs_root *dedupe_root;
 };
 
 struct btrfs_trans_handle;
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index ed6a6fd..c7eda03 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -184,6 +184,7 @@ static struct btrfs_lockdep_keyset {
{ .id = BTRFS_DATA_RELOC_TREE_OBJECTID, .name_stem = "dreloc"   },
{ .id = BTRFS_UUID_TREE_OBJECTID,   .name_stem = "uuid" },
{ .id = BTRFS_FREE_SPACE_TREE_OBJECTID, .name_stem = "free-space" },
+   { .id = BTRFS_DEDUPE_TREE_OBJECTID, .name_stem = "dedupe"   },
{ .id = 0,  .name_stem = "tree" },
 };
 
@@ -1678,6 +1679,11 @@ struct btrfs_root *btrfs_get_fs_root(struct btrfs_fs_info *fs_info,
if (location->objectid == 

Re: [PATCH] btrfs-progs: fsck: Fix a false metadata extent warning

2016-04-01 Thread Qu Wenruo



David Sterba wrote on 2016/04/01 10:44 +0200:
> On Fri, Apr 01, 2016 at 08:28:18AM +0800, Qu Wenruo wrote:
>> David Sterba wrote on 2016/03/31 18:30 +0200:
>>> On Thu, Mar 31, 2016 at 10:19:34AM +0800, Qu Wenruo wrote:
>>>> At least 2 users from the mailing list reported that btrfsck reported a
>>>> false alert of "bad metadata [,) crossing stripe boundary".
>>>>
>>>> But the reported numbers are all inside the same 64K boundary.
>>>> After some checking, all the false alerts share the same bytenr feature:
>>>> the bytenr can be divided by the stripe size (64K).
>>>>
>>>> The cause seems to be that the initial 'max_size' can be 0, causing
>>>> 'start' + 'max_size' - 1 to cross the stripe boundary.
>>>>
>>>> Fix it by always updating extent_record->cross_stripe when the
>>>> extent_record is updated, to avoid a temporary false alert being reported.
>>>>
>>>> Signed-off-by: Qu Wenruo
>>>
>>> Applied, thanks.
>>>
>>> Do you have a test image for that?
>>
>> Unfortunately, no.
>>
>> Although I figured out the cause of the false alert, I still didn't
>> find an image/method to reproduce it, except the images of the reporters.
>>
>> I can dig a little further and try to make an image.
>
> After another look, why don't we use nodesize directly? Or stripesize
> where it applies. With max_size == 0 the test does not make sense, we
> ought to know the alignment.

Yes, my first thought was also to use nodesize directly, which should
always be correct.

But the problem is, the related function call stack doesn't have any
member to reach btrfs_root or btrfs_fs_info.

In the very first version of this crossing stripe check, I added
a btrfs_root/btrfs_fs_info parameter to the function.

But the code changes were too many, so I used 'max_size'.

I can try to redo that modification, but IIRC it didn't give a good
result.


Thanks,
Qu




Re: [PATCH] btrfs-progs: fsck: Fix a false metadata extent warning

2016-04-01 Thread David Sterba
On Fri, Apr 01, 2016 at 08:28:18AM +0800, Qu Wenruo wrote:
> 
> 
> David Sterba wrote on 2016/03/31 18:30 +0200:
> > On Thu, Mar 31, 2016 at 10:19:34AM +0800, Qu Wenruo wrote:
> >> At least 2 users from the mailing list reported that btrfsck reported a
> >> false alert of "bad metadata [,) crossing stripe boundary".
> >>
> >> But the reported numbers are all inside the same 64K boundary.
> >> After some checking, all the false alerts share the same bytenr feature:
> >> the bytenr can be divided by the stripe size (64K).
> >>
> >> The cause seems to be that the initial 'max_size' can be 0, causing
> >> 'start' + 'max_size' - 1 to cross the stripe boundary.
> >>
> >> Fix it by always updating extent_record->cross_stripe when the
> >> extent_record is updated, to avoid a temporary false alert being reported.
> >>
> >> Signed-off-by: Qu Wenruo 
> >
> > Applied, thanks.
> >
> > Do you have a test image for that?
> >
> >
> Unfortunately, no.
> 
> Although I figured out the cause of the false alert, I still didn't
> find an image/method to reproduce it, except the images of the reporters.
> 
> I can dig a little further and try to make an image.

After another look, why don't we use nodesize directly? Or stripesize
where it applies. With max_size == 0 the test does not make sense, we
ought to know the alignment.
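
The effect described in the quoted patch description can be reproduced with a small stand-alone sketch. The helper below is an assumption about the shape of the check (comparing the stripe index of the first and last byte of the extent), not a copy of the btrfs-progs code, and the name sketch_crosses_stripe is made up for this example.

```c
#include <stdint.h>

/* 64K stripe size, as mentioned in the patch description. */
#define SKETCH_STRIPE_LEN	(64 * 1024ULL)

/*
 * Returns 1 if the byte range [start, start + len) touches more than
 * one 64K stripe, 0 otherwise. Hypothetical helper for illustration.
 */
static int sketch_crosses_stripe(uint64_t start, uint64_t len)
{
	return (start / SKETCH_STRIPE_LEN) !=
	       ((start + len - 1) / SKETCH_STRIPE_LEN);
}
```

With len == 0 and a start that is a multiple of 64K, start + len - 1 is the last byte of the previous stripe, so the check reports a crossing although the extent does not cross anything. Passing the real length (for example a 16K nodesize) at the same start gives the expected "no crossing" answer.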


[PATCH v7 5/8] btrfs-progs: Add dedupe feature for mkfs and convert

2016-04-01 Thread Qu Wenruo
Add a new DEDUPE ro compat flag and the corresponding mkfs/convert flag
'dedupe'.

Since the dedupe tree is completely isolated from the fs tree, even an old
kernel can still mount the filesystem read-only.
So add it as an RO compat flag instead of a common incompat flag.
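
The difference between the two flag classes can be sketched as follows: an unknown incompat bit blocks any mount, while an unknown ro-compat bit only blocks a read-write mount. The function name and the bit values are illustrative assumptions for this sketch (in particular the DEDUPE bit position is not taken from the patch), and the real kernel check carries more context than this.

```c
#include <stdint.h>

/* Illustrative bit values; the DEDUPE bit position is hypothetical. */
#define SKETCH_COMPAT_RO_FREE_SPACE_TREE	(1ULL << 0)
#define SKETCH_COMPAT_RO_DEDUPE			(1ULL << 1)

/*
 * Decide whether a kernel that supports the given flag sets may mount a
 * filesystem whose superblock carries sb_incompat / sb_compat_ro.
 * Returns 1 if the mount is allowed, 0 if it must be refused.
 */
static int sketch_can_mount(uint64_t sb_incompat, uint64_t sb_compat_ro,
			    uint64_t supp_incompat, uint64_t supp_compat_ro,
			    int read_only)
{
	/* Any unknown incompat bit: refuse, even read-only. */
	if (sb_incompat & ~supp_incompat)
		return 0;
	/* Unknown ro-compat bit: only a read-only mount is safe. */
	if (!read_only && (sb_compat_ro & ~supp_compat_ro))
		return 0;
	return 1;
}
```

Under these assumptions, a kernel that only knows the free space tree bit can still mount a dedupe filesystem read-only, which is the behaviour the commit message asks for.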

Signed-off-by: Qu Wenruo 
---
 Documentation/mkfs.btrfs.asciidoc |  9 
 btrfs-convert.c   | 19 +++-
 mkfs.c|  8 +--
 utils.c   | 47 +--
 utils.h   |  7 +++---
 5 files changed, 67 insertions(+), 23 deletions(-)

diff --git a/Documentation/mkfs.btrfs.asciidoc b/Documentation/mkfs.btrfs.asciidoc
index e4321de..12a41c6 100644
--- a/Documentation/mkfs.btrfs.asciidoc
+++ b/Documentation/mkfs.btrfs.asciidoc
@@ -208,6 +208,15 @@ reduced-size metadata for extent references, saves a few percent of metadata
 improved representation of file extents where holes are not explicitly
 stored as an extent, saves a few percent of metadata if sparse files are used
 
+*dedupe*::
+allow btrfs to use new on-disk format designed for in-band(write time)
+de-duplication.
++
+on-disk storage backend and persist de-duplication status needs this feature.
++
+this feature is RO compat feature, means old kernel can still mount it
+read-only.
+
 BLOCK GROUPS, CHUNKS, RAID
 --
 
diff --git a/btrfs-convert.c b/btrfs-convert.c
index 4474489..77e72f6 100644
--- a/btrfs-convert.c
+++ b/btrfs-convert.c
@@ -2453,7 +2453,7 @@ static int convert_open_fs(const char *devname,
 
 static int do_convert(const char *devname, int datacsum, int packing, int noxattr,
u32 nodesize, int copylabel, const char *fslabel, int progress,
-   u64 features)
+   u64 features, u64 ro_features)
 {
int i, ret, blocks_per_node;
int fd = -1;
@@ -2504,8 +2504,9 @@ static int do_convert(const char *devname, int datacsum, int packing, int noxatt
fprintf(stderr, "unable to open %s\n", devname);
goto fail;
}
-   btrfs_parse_features_to_string(features_buf, features);
-   if (features == BTRFS_MKFS_DEFAULT_FEATURES)
+   btrfs_parse_features_to_string(features_buf, features, ro_features);
+   if (features == BTRFS_MKFS_DEFAULT_FEATURES &&
+   ro_features == 0)
strcat(features_buf, " (default)");
 
printf("create btrfs filesystem:\n");
@@ -2521,6 +2522,7 @@ static int do_convert(const char *devname, int datacsum, int packing, int noxatt
mkfs_cfg.sectorsize = blocksize;
mkfs_cfg.stripesize = blocksize;
mkfs_cfg.features = features;
+   mkfs_cfg.ro_features = ro_features;
 
ret = make_btrfs(fd, &mkfs_cfg);
if (ret) {
@@ -3071,6 +3073,7 @@ int main(int argc, char *argv[])
char *file;
char fslabel[BTRFS_LABEL_SIZE];
u64 features = BTRFS_MKFS_DEFAULT_FEATURES;
+   u64 ro_features = 0;
 
while(1) {
enum { GETOPT_VAL_NO_PROGRESS = 256 };
@@ -3128,7 +3131,8 @@ int main(int argc, char *argv[])
char *orig = strdup(optarg);
char *tmp = orig;
 
-   tmp = btrfs_parse_fs_features(tmp, &features);
+   tmp = btrfs_parse_fs_features(tmp, &features,
+ &ro_features);
if (tmp) {
fprintf(stderr,
"Unrecognized filesystem feature '%s'\n",
@@ -3146,7 +3150,9 @@ int main(int argc, char *argv[])
char buf[64];
 
btrfs_parse_features_to_string(buf,
-   features & ~BTRFS_CONVERT_ALLOWED_FEATURES);
+   features &
+   ~BTRFS_CONVERT_ALLOWED_FEATURES,
+   ro_features);
fprintf(stderr,
"ERROR: features not allowed for convert: %s\n",
buf);
@@ -3196,7 +3202,8 @@ int main(int argc, char *argv[])
ret = do_rollback(file);
} else {
ret = do_convert(file, datacsum, packing, noxattr, nodesize,
-   copylabel, fslabel, progress, features);
+   copylabel, fslabel, progress, features,
+   ro_features);
}
if (ret)
return 1;
diff --git a/mkfs.c b/mkfs.c
index 5e79e0b..5071060 100644
--- a/mkfs.c
+++ b/mkfs.c
@@ -1369,6 +1369,7 @@ int main(int argc, char **argv)
int saved_optind;
char fs_uuid[BTRFS_UUID_UNPARSED_SIZE] = { 0 };
 

[PATCH v7 2/8] btrfs-progs: dedupe: Add enable command for dedupe command group

2016-04-01 Thread Qu Wenruo
Add enable subcommand for dedupe command group.

Signed-off-by: Qu Wenruo 
---
 Documentation/btrfs-dedupe.asciidoc | 105 +++-
 btrfs-completion|   6 +-
 cmds-dedupe.c   | 155 
 ioctl.h |   2 +
 4 files changed, 266 insertions(+), 2 deletions(-)

diff --git a/Documentation/btrfs-dedupe.asciidoc b/Documentation/btrfs-dedupe.asciidoc
index 5d63c32..8ab40ab 100644
--- a/Documentation/btrfs-dedupe.asciidoc
+++ b/Documentation/btrfs-dedupe.asciidoc
@@ -21,7 +21,110 @@ use with caution.
 
 SUBCOMMAND
 --
-Nothing yet
+*enable* [options] ::
+Enable in-band de-duplication for a filesystem.
++
+`Options`
++
+-s|--storage-backend 
+Specify de-duplication hash storage backend.
+Supported backends are 'ondisk' and 'inmemory'.
+If not specified, default value is 'inmemory'.
++
+Refer to *BACKENDS* section for more information.
+
+-b|--blocksize 
+Specify dedupe block size.
+Supported values are power of 2 from '16K' to '8M'.
+Default value is '128K'.
++
+Refer to *BLOCKSIZE* section for more information.
+
+-a|--hash-algorithm 
+Specify hash algorithm.
+Only 'sha256' is supported yet.
+
+-l|--limit-hash 
+Specify maximum number of hashes stored in memory.
+Only works for 'inmemory' backend.
+Conflicts with '-m' option.
++
+Only positive values are valid.
+Default value is '32K'.
+
+-m|--limit-memory 
+Specify maximum memory used for hashes.
+Only works for 'inmemory' backend.
+Conflicts with '-l' option.
++
+Only value larger than or equal to '1024' is valid.
+No default value.
++
+NOTE: Memory limit will be rounded down to kernel internal hash size,
+so the memory limit shown in 'btrfs dedupe status' may be different
+from the .
+
+WARNING: Too large value for '-l' or '-m' will easily trigger OOM.
+Please use with caution according to system memory or use 'ondisk' backend
+if memory usage is critical.
+
+BACKENDS
+
+Btrfs in-band de-duplication support two different backends with their own
+features.
+
+In-memory backend::
+This backend provides backward-compatibility, and more fine-tuning options.
+But hash pool is non-persistent and may exhaust kernel memory if not setup
+properly.
++
+This backend can be used on old btrfs(without '-O dedupe' mkfs option).
+When used on old btrfs, this backend needs to be enabled manually after mount.
++
+Designed for fast hash search speed, in-memory backend will keep all dedupe
+hashes in memory. (Although overall performance is still much the same with
+'ondisk' backend)
++
+And only keeps limited number of hash in memory to avoid exhausting memory.
+Hashes over the limit will be dropped following Least-Recently-Used behavior.
+So this backend has a consistent overhead for a given limit but can\'t ensure
+that all duplicated blocks will be de-duplicated.
++
+After umount and mount, in-memory backend need to refill its hash pool.
+
+On-disk backend::
+This backend provides persistent hash pool, with more smart memory management
+for hash pool.
+But it\'s not backward-compatible, meaning it must be used with '-O dedupe' mkfs
+option and older kernels can\'t mount it read-write.
++
+Designed for de-duplication rate, hash pool is stored as B+ tree on disk.
+Although this behavior may cause extra disk IO for hash search under extreme
+high memory pressure,
+under most case the overall performance should be on par with 'inmemory'
+backend.
++
+After umount and mount, on-disk backend still has its hash on disk, no need to
+refill its dedupe hash pool.
+
+DEDUPE BLOCK SIZE
+
+In-band de-duplication is done at dedupe block size.
+Any data smaller than dedupe block size won\'t go through in-band
+de-duplication.
+
+And dedupe block size affects dedupe rate and fragmentation heavily.
+
+Smaller block size will cause more fragments, but higher dedupe rate.
+
+Larger block size will cause less fragments, but lower dedupe rate.
+
+In-band de-duplication rate is highly related to the workload pattern.
+So it\'s highly recommended to align dedupe block size to the workload
+block size to make full use of de-duplication.
+
+And dedupe block size larger than 128K will cause compression unavailable, as
+compression only support maximum extent size of 128K.
 
 EXIT STATUS
 ---
diff --git a/btrfs-completion b/btrfs-completion
index 3ede77b..50f7ea2 100644
--- a/btrfs-completion
+++ b/btrfs-completion
@@ -29,7 +29,7 @@ _btrfs()
 
local cmd=${words[1]}
 
-commands='subvolume filesystem balance device scrub check rescue restore inspect-internal property send receive quota qgroup replace help version'
+commands='subvolume filesystem balance device scrub check rescue restore inspect-internal property send receive quota qgroup dedupe replace help version'
 commands_subvolume='create delete list snapshot find-new get-default set-default show sync'
 commands_filesystem='defragment 
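
The eviction behaviour that the BACKENDS section describes for the in-memory backend can be sketched in userspace C. This is only an illustration under stated assumptions: `POOL_LIMIT`, `struct hash_pool`, `pool_search()` and `pool_add()` are hypothetical names, a 64-bit integer stands in for the real digest, and a linear scan replaces whatever lookup structure the kernel side uses.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define POOL_LIMIT 3	/* hypothetical limit, stands in for the hash limit */

struct pool_entry {
	uint64_t hash;		/* stand-in for a real digest */
	uint64_t bytenr;	/* where the already written block lives */
	uint64_t last_use;	/* logical clock value of the last hit or insert */
	int used;
};

struct hash_pool {
	struct pool_entry slot[POOL_LIMIT];
	uint64_t clock;
};

static void pool_init(struct hash_pool *p)
{
	memset(p, 0, sizeof(*p));
}

/* Returns 1 and stores the block address on a hit, 0 on a miss. */
static int pool_search(struct hash_pool *p, uint64_t hash, uint64_t *bytenr)
{
	for (size_t i = 0; i < POOL_LIMIT; i++) {
		if (p->slot[i].used && p->slot[i].hash == hash) {
			p->slot[i].last_use = ++p->clock;
			*bytenr = p->slot[i].bytenr;
			return 1;
		}
	}
	return 0;
}

/* Adds a hash, evicting the least recently used entry once the pool is full. */
static void pool_add(struct hash_pool *p, uint64_t hash, uint64_t bytenr)
{
	size_t victim = 0;

	for (size_t i = 0; i < POOL_LIMIT; i++) {
		if (!p->slot[i].used) {
			victim = i;
			break;
		}
		if (p->slot[i].last_use < p->slot[victim].last_use)
			victim = i;
	}
	p->slot[victim] = (struct pool_entry){ hash, bytenr, ++p->clock, 1 };
}
```

Once an entry has been evicted, a later write of the same data misses the pool and is stored again, which is why this backend cannot promise that every duplicate is caught.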

[PATCH v7 3/8] btrfs-progs: dedupe: Add disable support for inband dedupelication

2016-04-01 Thread Qu Wenruo
Add disable subcommand for dedupe command group.

Signed-off-by: Qu Wenruo 
---
 Documentation/btrfs-dedupe.asciidoc |  5 +
 btrfs-completion|  2 +-
 cmds-dedupe.c   | 42 +
 3 files changed, 48 insertions(+), 1 deletion(-)

diff --git a/Documentation/btrfs-dedupe.asciidoc b/Documentation/btrfs-dedupe.asciidoc
index 8ab40ab..28fe05f 100644
--- a/Documentation/btrfs-dedupe.asciidoc
+++ b/Documentation/btrfs-dedupe.asciidoc
@@ -21,6 +21,11 @@ use with caution.
 
 SUBCOMMAND
 --
+*disable* ::
+Disable in-band de-duplication for a filesystem.
++
+This will trash all stored dedupe hash.
++
 *enable* [options] ::
 Enable in-band de-duplication for a filesystem.
 +
diff --git a/btrfs-completion b/btrfs-completion
index 50f7ea2..9a6c73b 100644
--- a/btrfs-completion
+++ b/btrfs-completion
@@ -40,7 +40,7 @@ _btrfs()
 commands_property='get set list'
 commands_quota='enable disable rescan'
 commands_qgroup='assign remove create destroy show limit'
-commands_dedupe='enable'
+commands_dedupe='enable disable'
 commands_replace='start status cancel'
 
if [[ "$cur" == -* && $cword -le 3 && "$cmd" != "help" ]]; then
diff --git a/cmds-dedupe.c b/cmds-dedupe.c
index d9dcb10..64ac0f2 100644
--- a/cmds-dedupe.c
+++ b/cmds-dedupe.c
@@ -190,9 +190,51 @@ out:
return ret;
 }
 
+static const char * const cmd_dedupe_disable_usage[] = {
+	"btrfs dedupe disable <path>",
+	"Disable in-band(write time) de-duplication of a btrfs.",
+	NULL
+};
+
+static int cmd_dedupe_disable(int argc, char **argv)
+{
+	struct btrfs_ioctl_dedupe_args dargs;
+	DIR *dirstream;
+	char *path;
+	int fd;
+	int ret;
+
+	if (check_argc_exact(argc, 2))
+		usage(cmd_dedupe_disable_usage);
+
+	path = argv[1];
+	fd = open_file_or_dir(path, &dirstream);
+	if (fd < 0) {
+		error("failed to open file or directory: %s", path);
+		return 1;
+	}
+	memset(&dargs, 0, sizeof(dargs));
+	dargs.cmd = BTRFS_DEDUPE_CTL_DISABLE;
+
+	ret = ioctl(fd, BTRFS_IOC_DEDUPE_CTL, &dargs);
+	if (ret < 0) {
+		error("failed to disable inband deduplication: %s",
+		      strerror(errno));
+		ret = 1;
+		goto out;
+	}
+	ret = 0;
+
+out:
+	close_file_or_dir(fd, dirstream);
+	return ret;
+}
+
 const struct cmd_group dedupe_cmd_group = {
 	dedupe_cmd_group_usage, dedupe_cmd_group_info, {
 		{ "enable", cmd_dedupe_enable, cmd_dedupe_enable_usage, NULL, 0},
+		{ "disable", cmd_dedupe_disable, cmd_dedupe_disable_usage,
+		  NULL, 0},
 		NULL_CMD_STRUCT
 	}
 };
-- 
2.7.4



--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v7 0/8] Inband dedupe for btrfs-progs

2016-04-01 Thread Qu Wenruo
Not much change from the previous version.
1) Rebased to latest devel branch

2) Update ctree.h to follow kernel structure change

3) Update print-tree to follow kernel structure change

Qu Wenruo (7):
  btrfs-progs: Basic framework for dedupe command group
  btrfs-progs: dedupe: Add enable command for dedupe command group
  btrfs-progs: dedupe: Add disable support for inband dedupelication
  btrfs-progs: dedupe: Add status subcommand
  btrfs-progs: Add dedupe feature for mkfs and convert
  btrfs-progs: Add show-super support for new DEDUPE flag
  btrfs-progs: debug-tree: Add dedupe tree support

Wang Xiaoguang (1):
  btrfs-progs: property: add a dedupe property

 Documentation/Makefile.in |   1 +
 Documentation/btrfs-dedupe.asciidoc   | 150 
 Documentation/btrfs-property.asciidoc |   2 +
 Documentation/btrfs.asciidoc  |   4 +
 Documentation/mkfs.btrfs.asciidoc |   9 +
 Makefile.in   |   3 +-
 btrfs-completion  |   6 +-
 btrfs-convert.c   |  19 +-
 btrfs.c   |   1 +
 cmds-dedupe.c | 329 ++
 cmds-inspect-dump-super.c |  18 ++
 cmds-inspect-dump-tree.c  |   4 +
 commands.h|   2 +
 ctree.h   |  46 -
 dedupe.h  |  42 +
 ioctl.h   |  23 +++
 mkfs.c|   8 +-
 print-tree.c  | 118 
 props.c   |  73 
 utils.c   |  47 +++--
 utils.h   |   7 +-
 21 files changed, 886 insertions(+), 26 deletions(-)
 create mode 100644 Documentation/btrfs-dedupe.asciidoc
 create mode 100644 cmds-dedupe.c
 create mode 100644 dedupe.h

-- 
2.7.4





[PATCH v7 8/8] btrfs-progs: property: add a dedupe property

2016-04-01 Thread Qu Wenruo
From: Wang Xiaoguang 

Normally if we enable online dedupe for a fs, it's filesystem wide
de-duplication. With this property, we can explicitly disable data
de-duplication for specified files.

Signed-off-by: Wang Xiaoguang 
---
 Documentation/btrfs-property.asciidoc |  2 +
 props.c   | 73 +++
 2 files changed, 75 insertions(+)

diff --git a/Documentation/btrfs-property.asciidoc b/Documentation/btrfs-property.asciidoc
index 8b9b7f0..ca90035 100644
--- a/Documentation/btrfs-property.asciidoc
+++ b/Documentation/btrfs-property.asciidoc
@@ -44,6 +44,8 @@ label
 label of device
 compression
 compression setting for an inode: lzo, zlib, or "" (empty string)
+dedupe
+online dedupe setting for an inode: disable or "" (empty string)
 
 *list* [-t ] ::
 Lists available properties with their descriptions for the given object.
diff --git a/props.c b/props.c
index 5b74932..d8f6925 100644
--- a/props.c
+++ b/props.c
@@ -187,6 +187,77 @@ out:
return ret;
 }
 
+static int prop_dedupe(enum prop_object_type type, const char *object,
+   const char *name, const char *value)
+{
+   int ret;
+   ssize_t sret;
+   int fd = -1;
+   DIR *dirstream = NULL;
+   char *buf = NULL;
+   char *xattr_name = NULL;
+   int open_flags = value ? O_RDWR : O_RDONLY;
+
+   fd = open_file_or_dir3(object, &dirstream, open_flags);
+   if (fd == -1) {
+   ret = -errno;
+   fprintf(stderr, "ERROR: open %s failed. %s\n",
+   object, strerror(-ret));
+   goto out;
+   }
+
+   xattr_name = malloc(XATTR_BTRFS_PREFIX_LEN + strlen(name) + 1);
+   if (!xattr_name) {
+   ret = -ENOMEM;
+   goto out;
+   }
+   memcpy(xattr_name, XATTR_BTRFS_PREFIX, XATTR_BTRFS_PREFIX_LEN);
+   memcpy(xattr_name + XATTR_BTRFS_PREFIX_LEN, name, strlen(name));
+   xattr_name[XATTR_BTRFS_PREFIX_LEN + strlen(name)] = '\0';
+
+   if (value)
+   sret = fsetxattr(fd, xattr_name, value, strlen(value), 0);
+   else
+   sret = fgetxattr(fd, xattr_name, NULL, 0);
+   if (sret < 0) {
+   ret = -errno;
+   if (ret != -ENOATTR)
+   fprintf(stderr,
+   "ERROR: failed to %s dedupe for %s. %s\n",
+   value ? "set" : "get", object, strerror(-ret));
+   else
+   ret = 0;
+   goto out;
+   }
+   if (!value) {
+   size_t len = sret;
+
+   buf = malloc(len);
+   if (!buf) {
+   ret = -ENOMEM;
+   goto out;
+   }
+   sret = fgetxattr(fd, xattr_name, buf, len);
+   if (sret < 0) {
+   ret = -errno;
+   fprintf(stderr,
+   "ERROR: failed to get dedupe for %s. %s\n",
+   object, strerror(-ret));
+   goto out;
+   }
+   fprintf(stdout, "dedupe=%.*s\n", (int)len, buf);
+   }
+
+   ret = 0;
+out:
+   free(xattr_name);
+   free(buf);
+   if (fd >= 0)
+   close_file_or_dir(fd, dirstream);
+
+   return ret;
+}
+
 const struct prop_handler prop_handlers[] = {
{"ro", "Set/get read-only flag of subvolume.", 0, prop_object_subvol,
 prop_read_only},
@@ -194,5 +265,7 @@ const struct prop_handler prop_handlers[] = {
 prop_object_dev | prop_object_root, prop_label},
{"compression", "Set/get compression for a file or directory", 0,
 prop_object_inode, prop_compression},
+   {"dedupe", "Set/get dedupe for a file or directory", 0,
+prop_object_inode, prop_dedupe},
{NULL, NULL, 0, 0, NULL}
 };
-- 
2.7.4
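
As a reading aid for prop_dedupe(), the way the extended attribute name is assembled from a prefix and the property name can be sketched on its own. This is a minimal userspace sketch: `build_xattr_name()` is a hypothetical helper, and `XATTR_PREFIX` is assumed to mirror `XATTR_BTRFS_PREFIX` ("btrfs.").

```c
#include <stdlib.h>
#include <string.h>

#define XATTR_PREFIX     "btrfs."	/* assumed to mirror XATTR_BTRFS_PREFIX */
#define XATTR_PREFIX_LEN (sizeof(XATTR_PREFIX) - 1)

/* Returns a malloc()ed "btrfs.<name>" string, or NULL on allocation failure. */
static char *build_xattr_name(const char *name)
{
	size_t name_len = strlen(name);
	char *out = malloc(XATTR_PREFIX_LEN + name_len + 1);

	if (!out)
		return NULL;
	memcpy(out, XATTR_PREFIX, XATTR_PREFIX_LEN);
	/* name_len + 1 copies the terminating NUL as well */
	memcpy(out + XATTR_PREFIX_LEN, name, name_len + 1);
	return out;
}
```

For the property added here, the resulting name would be "btrfs.dedupe", which is then passed to fsetxattr() or fgetxattr(). The caller owns the returned buffer and must free() it.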





[PATCH v7 6/8] btrfs-progs: Add show-super support for new DEDUPE flag

2016-04-01 Thread Qu Wenruo
Now btrfs-show-super can handle DEDUPE ro compat flag.

Signed-off-by: Qu Wenruo 
---
 cmds-inspect-dump-super.c | 18 ++
 1 file changed, 18 insertions(+)

diff --git a/cmds-inspect-dump-super.c b/cmds-inspect-dump-super.c
index 3e09ee8..6a939c9 100644
--- a/cmds-inspect-dump-super.c
+++ b/cmds-inspect-dump-super.c
@@ -198,6 +198,16 @@ struct readable_flag_entry {
char *output;
 };
 
+#define DEF_RO_COMPAT_FLAG_ENTRY(bit_name) \
+   {BTRFS_FEATURE_COMPAT_RO_##bit_name, #bit_name}
+
+struct readable_flag_entry ro_compat_flags_array[] = {
+   DEF_RO_COMPAT_FLAG_ENTRY(DEDUPE)
+};
+
+static const int ro_compat_flags_num = sizeof(ro_compat_flags_array) /
+ sizeof(struct readable_flag_entry);
+
 #define DEF_INCOMPAT_FLAG_ENTRY(bit_name)  \
{BTRFS_FEATURE_INCOMPAT_##bit_name, #bit_name}
 
@@ -269,6 +279,13 @@ static void __print_readable_flag(u64 flag, struct readable_flag_entry *array,
printf(")\n");
 }
 
+static void print_readable_ro_compat_flag(u64 ro_flag)
+{
+   return __print_readable_flag(ro_flag, ro_compat_flags_array,
+ro_compat_flags_num,
+BTRFS_FEATURE_COMPAT_RO_SUPP);
+}
+
 static void print_readable_incompat_flag(u64 flag)
 {
return __print_readable_flag(flag, incompat_flags_array,
@@ -360,6 +377,7 @@ static void dump_superblock(struct btrfs_super_block *sb, int full)
   (unsigned long long)btrfs_super_compat_flags(sb));
printf("compat_ro_flags\t\t0x%llx\n",
   (unsigned long long)btrfs_super_compat_ro_flags(sb));
+   print_readable_ro_compat_flag(btrfs_super_compat_ro_flags(sb));
printf("incompat_flags\t\t0x%llx\n",
   (unsigned long long)btrfs_super_incompat_flags(sb));
print_readable_incompat_flag(btrfs_super_incompat_flags(sb));
-- 
2.7.4
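
The DEF_RO_COMPAT_FLAG_ENTRY() macro relies on two preprocessor operators: `##` pastes the argument onto a prefix, and `#` turns it into a string. A small standalone sketch of that table-driven approach follows. The bit values, `struct flag_entry` and `format_ro_flags()` are assumptions made for illustration, and the output format is not meant to match what __print_readable_flag() prints.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical bit values, chosen only for this illustration. */
#define FEATURE_COMPAT_RO_FREE_SPACE_TREE (1ULL << 0)
#define FEATURE_COMPAT_RO_DEDUPE          (1ULL << 1)

struct flag_entry {
	uint64_t bit;
	const char *name;
};

/* ## pastes bit_name onto the prefix, #bit_name stringifies it. */
#define DEF_FLAG_ENTRY(bit_name) { FEATURE_COMPAT_RO_##bit_name, #bit_name }

static const struct flag_entry ro_flags[] = {
	DEF_FLAG_ENTRY(FREE_SPACE_TREE),
	DEF_FLAG_ENTRY(DEDUPE),
};

/* Writes e.g. "( DEDUPE )" into buf; bits missing from the table show as "unknown". */
static void format_ro_flags(uint64_t flags, char *buf, size_t len)
{
	size_t n = sizeof(ro_flags) / sizeof(ro_flags[0]);
	uint64_t known = 0;
	size_t used;

	snprintf(buf, len, "(");
	for (size_t i = 0; i < n; i++) {
		known |= ro_flags[i].bit;
		if (flags & ro_flags[i].bit) {
			used = strlen(buf);
			snprintf(buf + used, len - used, " %s", ro_flags[i].name);
		}
	}
	if (flags & ~known) {
		used = strlen(buf);
		snprintf(buf + used, len - used, " unknown");
	}
	used = strlen(buf);
	snprintf(buf + used, len - used, " )");
}
```

Adding a new readable flag then only needs one more DEF_FLAG_ENTRY() line in the table.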





[PATCH v7 7/8] btrfs-progs: debug-tree: Add dedupe tree support

2016-04-01 Thread Qu Wenruo
Add dedupe tree support for btrfs-debug-tree.

Signed-off-by: Qu Wenruo 
---
 cmds-inspect-dump-tree.c |   4 ++
 ctree.h  |   7 +++
 print-tree.c | 118 +++
 3 files changed, 129 insertions(+)

diff --git a/cmds-inspect-dump-tree.c b/cmds-inspect-dump-tree.c
index 43c8b67..0c75a3c 100644
--- a/cmds-inspect-dump-tree.c
+++ b/cmds-inspect-dump-tree.c
@@ -496,6 +496,10 @@ again:
printf("multiple");
}
break;
+   case BTRFS_DEDUPE_TREE_OBJECTID:
+   if (!skip)
+   printf("dedupe");
+   break;
default:
if (!skip) {
printf("file");
diff --git a/ctree.h b/ctree.h
index 87ea684..15504b2 100644
--- a/ctree.h
+++ b/ctree.h
@@ -79,6 +79,9 @@ struct btrfs_free_space_ctl;
 /* tracks free space in block groups. */
 #define BTRFS_FREE_SPACE_TREE_OBJECTID 10ULL
 
+/* on-disk dedupe tree (EXPERIMENTAL) */
+#define BTRFS_DEDUPE_TREE_OBJECTID 11ULL
+
 /* for storing balance parameters in the root tree */
 #define BTRFS_BALANCE_OBJECTID -4ULL
 
@@ -1219,6 +1222,10 @@ struct btrfs_root {
 #define BTRFS_DEV_ITEM_KEY 216
 #define BTRFS_CHUNK_ITEM_KEY   228
 
+#define BTRFS_DEDUPE_STATUS_ITEM_KEY   230
+#define BTRFS_DEDUPE_HASH_ITEM_KEY 231
+#define BTRFS_DEDUPE_BYTENR_ITEM_KEY   232
+
 #define BTRFS_BALANCE_ITEM_KEY 248
 
 /*
diff --git a/print-tree.c b/print-tree.c
index d0f37a5..5b8b90c 100644
--- a/print-tree.c
+++ b/print-tree.c
@@ -25,6 +25,7 @@
 #include "disk-io.h"
 #include "print-tree.h"
 #include "utils.h"
+#include "dedupe.h"
 
 
 static void print_dir_item_type(struct extent_buffer *eb,
@@ -687,11 +688,31 @@ static void print_key_type(u64 objectid, u8 type)
case BTRFS_UUID_KEY_RECEIVED_SUBVOL:
printf("UUID_KEY_RECEIVED_SUBVOL");
break;
+   case BTRFS_DEDUPE_STATUS_ITEM_KEY:
+   printf("DEDUPE_STATUS_ITEM");
+   break;
+   case BTRFS_DEDUPE_HASH_ITEM_KEY:
+   printf("DEDUPE_HASH_ITEM");
+   break;
+   case BTRFS_DEDUPE_BYTENR_ITEM_KEY:
+   printf("DEDUPE_BYTENR_ITEM");
+   break;
default:
printf("UNKNOWN.%d", type);
};
 }
 
+static void print_64bit_hash(u64 hash)
+{
+   int i;
+   unsigned char buf[8];
+
+   memcpy(buf, &hash, 8);
+   printf("0x");
+   for (i = 0; i < 8; i++)
+   printf("%02x", buf[i]);
+}
+
 static void print_objectid(u64 objectid, u8 type)
 {
switch (type) {
@@ -706,6 +727,9 @@ static void print_objectid(u64 objectid, u8 type)
case BTRFS_UUID_KEY_RECEIVED_SUBVOL:
printf("0x%016llx", (unsigned long long)objectid);
return;
+   case BTRFS_DEDUPE_HASH_ITEM_KEY:
+   print_64bit_hash(objectid);
+   return;
}
 
switch (objectid) {
@@ -772,6 +796,9 @@ static void print_objectid(u64 objectid, u8 type)
case BTRFS_MULTIPLE_OBJECTIDS:
printf("MULTIPLE");
break;
+   case BTRFS_DEDUPE_TREE_OBJECTID:
+   printf("DEDUPE_TREE");
+   break;
case (u64)-1:
printf("-1");
break;
@@ -807,6 +834,9 @@ void btrfs_print_key(struct btrfs_disk_key *disk_key)
case BTRFS_UUID_KEY_RECEIVED_SUBVOL:
printf(" 0x%016llx)", (unsigned long long)offset);
break;
+   case BTRFS_DEDUPE_BYTENR_ITEM_KEY:
+   print_64bit_hash(offset);
+   break;
default:
if (offset == (u64)-1)
printf(" -1)");
@@ -835,6 +865,85 @@ static void print_uuid_item(struct extent_buffer *l, unsigned long offset,
}
 }
 
+static void print_dedupe_status(struct extent_buffer *node, int slot)
+{
+   struct btrfs_dedupe_status_item *status_item;
+   u64 blocksize;
+   u64 limit;
+   u16 hash_type;
+   u16 backend;
+
+   status_item = btrfs_item_ptr(node, slot,
+   struct btrfs_dedupe_status_item);
+   blocksize = btrfs_dedupe_status_blocksize(node, status_item);
+   limit = btrfs_dedupe_status_limit(node, status_item);
+   hash_type = btrfs_dedupe_status_hash_type(node, status_item);
+   backend = btrfs_dedupe_status_backend(node, status_item);
+
+   printf("\t\tdedupe status item ");
+   if (backend == BTRFS_DEDUPE_BACKEND_INMEMORY)
+   printf("backend: inmemory\n");
+   else if (backend == BTRFS_DEDUPE_BACKEND_ONDISK)
+   printf("backend: ondisk\n");
+   else
+   printf("backend: Unrecognized(%u)\n", backend);
+
+   if (hash_type == 

[PATCH v7 4/8] btrfs-progs: dedupe: Add status subcommand

2016-04-01 Thread Qu Wenruo
Add status subcommand for dedupe command group.

Signed-off-by: Qu Wenruo 
---
 Documentation/btrfs-dedupe.asciidoc |  3 ++
 btrfs-completion|  2 +-
 cmds-dedupe.c   | 84 +
 3 files changed, 88 insertions(+), 1 deletion(-)

diff --git a/Documentation/btrfs-dedupe.asciidoc b/Documentation/btrfs-dedupe.asciidoc
index 28fe05f..5a5bf52 100644
--- a/Documentation/btrfs-dedupe.asciidoc
+++ b/Documentation/btrfs-dedupe.asciidoc
@@ -73,6 +73,9 @@ WARNING: Too large value for '-l' or '-m' will easily trigger OOM.
 Please use with caution according to system memory or use 'ondisk' backend
 if memory usage is critical.
 
+*status* ::
+Show current in-band de-duplication status of a filesystem.
+
 BACKENDS
 
 Btrfs in-band de-duplication support two different backends with their own
diff --git a/btrfs-completion b/btrfs-completion
index 9a6c73b..fbaae0c 100644
--- a/btrfs-completion
+++ b/btrfs-completion
@@ -40,7 +40,7 @@ _btrfs()
 commands_property='get set list'
 commands_quota='enable disable rescan'
 commands_qgroup='assign remove create destroy show limit'
-commands_dedupe='enable disable'
+commands_dedupe='enable disable status'
 commands_replace='start status cancel'
 
if [[ "$cur" == -* && $cword -le 3 && "$cmd" != "help" ]]; then
diff --git a/cmds-dedupe.c b/cmds-dedupe.c
index 64ac0f2..8005b6e 100644
--- a/cmds-dedupe.c
+++ b/cmds-dedupe.c
@@ -230,11 +230,95 @@ out:
 	return ret;
 }
 
+static const char * const cmd_dedupe_status_usage[] = {
+	"btrfs dedupe status <path>",
+	"Show current in-band(write time) de-duplication status of a btrfs.",
+	NULL
+};
+
+static int cmd_dedupe_status(int argc, char **argv)
+{
+	struct btrfs_ioctl_dedupe_args dargs;
+	DIR *dirstream = NULL;
+	char *path;
+	int fd;
+	int ret;
+	int print_limit = 1;
+
+	if (check_argc_exact(argc, 2))
+		usage(cmd_dedupe_status_usage);
+
+	path = argv[1];
+	fd = open_file_or_dir(path, &dirstream);
+	if (fd < 0) {
+		error("failed to open file or directory: %s", path);
+		ret = 1;
+		goto out;
+	}
+	memset(&dargs, 0, sizeof(dargs));
+	dargs.cmd = BTRFS_DEDUPE_CTL_STATUS;
+
+	ret = ioctl(fd, BTRFS_IOC_DEDUPE_CTL, &dargs);
+	if (ret < 0) {
+		error("failed to get inband deduplication status: %s",
+		      strerror(errno));
+		ret = 1;
+		goto out;
+	}
+	ret = 0;
+	if (dargs.status == 0) {
+		printf("Status: \t\t\tDisabled\n");
+		goto out;
+	}
+	printf("Status:\t\t\tEnabled\n");
+
+	if (dargs.hash_type == BTRFS_DEDUPE_HASH_SHA256)
+		printf("Hash algorithm:\t\tSHA-256\n");
+	else
+		printf("Hash algorithm:\t\tUnrecognized(%x)\n",
+			dargs.hash_type);
+
+	if (dargs.backend == BTRFS_DEDUPE_BACKEND_INMEMORY) {
+		printf("Backend:\t\tIn-memory\n");
+		print_limit = 1;
+	} else if (dargs.backend == BTRFS_DEDUPE_BACKEND_ONDISK) {
+		printf("Backend:\t\tOn-disk\n");
+		print_limit = 0;
+	} else  {
+		printf("Backend:\t\tUnrecognized(%x)\n",
+			dargs.backend);
+	}
+
+	printf("Dedup Blocksize:\t%llu\n", dargs.blocksize);
+
+	if (print_limit) {
+		u64 cur_mem;
+
+		/* Limit nr may be 0 */
+		if (dargs.limit_nr)
+			cur_mem = dargs.current_nr * (dargs.limit_mem /
+					dargs.limit_nr);
+		else
+			cur_mem = 0;
+
+		printf("Number of hash: \t[%llu/%llu]\n", dargs.current_nr,
+			dargs.limit_nr);
+		printf("Memory usage: \t\t[%s/%s]\n",
+			pretty_size(cur_mem),
+			pretty_size(dargs.limit_mem));
+	}
+out:
+	close_file_or_dir(fd, dirstream);
+	return ret;
+}
+
 const struct cmd_group dedupe_cmd_group = {
 	dedupe_cmd_group_usage, dedupe_cmd_group_info, {
 		{ "enable", cmd_dedupe_enable, cmd_dedupe_enable_usage, NULL, 0},
 		{ "disable", cmd_dedupe_disable, cmd_dedupe_disable_usage,
 		  NULL, 0},
+		{ "status", cmd_dedupe_status, cmd_dedupe_status_usage,
+		  NULL, 0},
 		NULL_CMD_STRUCT
 	}
 };
-- 
2.7.4
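
The "Memory usage" figure printed by the status subcommand is an estimate: the per-hash cost is taken as limit_mem / limit_nr and multiplied by the current number of hashes. Isolated into a helper it looks roughly like this; `dedupe_cur_mem()` is a hypothetical name, and plain `uint64_t` stands in for the `u64` fields of the ioctl structure.

```c
#include <stdint.h>

/*
 * Estimate the memory held by the in-memory hash pool.
 * The per-hash cost is limit_mem / limit_nr using integer division, so the
 * result is rounded down. A zero limit_nr yields 0 instead of dividing by zero.
 */
static uint64_t dedupe_cur_mem(uint64_t current_nr, uint64_t limit_nr,
			       uint64_t limit_mem)
{
	if (limit_nr == 0)
		return 0;
	return current_nr * (limit_mem / limit_nr);
}
```

With a limit of 1024 hashes in 1 MiB, each hash is accounted as 1024 bytes, so 512 stored hashes are reported as 512 KiB.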





[PATCH v7 1/8] btrfs-progs: Basic framework for dedupe command group

2016-04-01 Thread Qu Wenruo
Add basic ioctl header and command group framework for later use.
Along with basic man page doc.

Signed-off-by: Qu Wenruo 
---
 Documentation/Makefile.in   |  1 +
 Documentation/btrfs-dedupe.asciidoc | 39 ++
 Documentation/btrfs.asciidoc|  4 
 Makefile.in |  3 ++-
 btrfs.c |  1 +
 cmds-dedupe.c   | 48 +
 commands.h  |  2 ++
 ctree.h | 39 +-
 dedupe.h| 42 
 ioctl.h | 21 
 10 files changed, 198 insertions(+), 2 deletions(-)
 create mode 100644 Documentation/btrfs-dedupe.asciidoc
 create mode 100644 cmds-dedupe.c
 create mode 100644 dedupe.h

diff --git a/Documentation/Makefile.in b/Documentation/Makefile.in
index aea2cb4..24fd35e 100644
--- a/Documentation/Makefile.in
+++ b/Documentation/Makefile.in
@@ -28,6 +28,7 @@ MAN8_TXT += btrfs-qgroup.asciidoc
 MAN8_TXT += btrfs-replace.asciidoc
 MAN8_TXT += btrfs-restore.asciidoc
 MAN8_TXT += btrfs-property.asciidoc
+MAN8_TXT += btrfs-dedupe.asciidoc
 
 # Category 5 manual page
 MAN5_TXT += btrfs-man5.asciidoc
diff --git a/Documentation/btrfs-dedupe.asciidoc b/Documentation/btrfs-dedupe.asciidoc
new file mode 100644
index 000..5d63c32
--- /dev/null
+++ b/Documentation/btrfs-dedupe.asciidoc
@@ -0,0 +1,39 @@
+btrfs-dedupe(8)
+==
+
+NAME
+
+btrfs-dedupe - manage in-band (write time) de-duplication of a btrfs filesystem
+
+SYNOPSIS
+
+*btrfs dedupe*  
+
+DESCRIPTION
+---
+*btrfs dedupe* is used to enable/disable or show current in-band de-duplication
+status of a btrfs filesystem.
+
+Kernel support for in-band de-duplication starts from 4.6.
+
+WARNING: In-band de-duplication is still an experimental feature of btrfs,
+use with caution.
+
+SUBCOMMAND
+--
+Nothing yet
+
+EXIT STATUS
+---
+*btrfs dedupe* returns a zero exit status if it succeeds. Non zero is
+returned in case of failure.
+
+AVAILABILITY
+
+*btrfs* is part of btrfs-progs.
+Please refer to the btrfs wiki http://btrfs.wiki.kernel.org for
+further details.
+
+SEE ALSO
+
+`mkfs.btrfs`(8),
diff --git a/Documentation/btrfs.asciidoc b/Documentation/btrfs.asciidoc
index 6a77a85..8ded842 100644
--- a/Documentation/btrfs.asciidoc
+++ b/Documentation/btrfs.asciidoc
@@ -43,6 +43,10 @@ COMMANDS
Do off-line check on a btrfs filesystem. +
See `btrfs-check`(8) for details.
 
+*dedupe*::
+   Control btrfs in-band(write time) de-duplication. +
+   See `btrfs-dedupe`(8) for details.
+
 *device*::
Manage devices managed by btrfs, including add/delete/scan and so
on. +
diff --git a/Makefile.in b/Makefile.in
index 0a1aece..c3f7072 100644
--- a/Makefile.in
+++ b/Makefile.in
@@ -76,7 +76,8 @@ cmds_objects = cmds-subvolume.o cmds-filesystem.o cmds-device.o cmds-scrub.o \
   cmds-quota.o cmds-qgroup.o cmds-replace.o cmds-check.o \
   cmds-restore.o cmds-rescue.o chunk-recover.o super-recover.o \
   cmds-property.o cmds-fi-usage.o cmds-inspect-dump-tree.o \
-  cmds-inspect-dump-super.o cmds-inspect-tree-stats.o cmds-fi-du.o
+  cmds-inspect-dump-super.o cmds-inspect-tree-stats.o cmds-fi-du.o \
+  cmds-dedupe.o
 libbtrfs_objects = send-stream.o send-utils.o rbtree.o btrfs-list.o crc32c.o \
   uuid-tree.o utils-lib.o rbtree-utils.o
 libbtrfs_headers = send-stream.h send-utils.h send.h rbtree.h btrfs-list.h \
diff --git a/btrfs.c b/btrfs.c
index cc70515..c0c8f27 100644
--- a/btrfs.c
+++ b/btrfs.c
@@ -199,6 +199,7 @@ static const struct cmd_group btrfs_cmd_group = {
{ "receive", cmd_receive, cmd_receive_usage, NULL, 0 },
{ "quota", cmd_quota, NULL, &quota_cmd_group, 0 },
{ "qgroup", cmd_qgroup, NULL, &qgroup_cmd_group, 0 },
+   { "dedupe", cmd_dedupe, NULL, &dedupe_cmd_group, 0 },
{ "replace", cmd_replace, NULL, &replace_cmd_group, 0 },
{ "help", cmd_help, cmd_help_usage, NULL, 0 },
{ "version", cmd_version, cmd_version_usage, NULL, 0 },
diff --git a/cmds-dedupe.c b/cmds-dedupe.c
new file mode 100644
index 000..b25b8db
--- /dev/null
+++ b/cmds-dedupe.c
@@ -0,0 +1,48 @@
+/*
+ * Copyright (C) 2015 Fujitsu.  All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public
+ * License v2 as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy 

[PATCH v10 07/21] btrfs: dedupe: Implement btrfs_dedupe_calc_hash interface

2016-04-01 Thread Qu Wenruo
From: Wang Xiaoguang 

Unlike the dedupe backends, where both in-memory and on-disk methods
exist, only the SHA256 hash method is supported so far, so implement the
btrfs_dedupe_calc_hash() interface using SHA256.

Signed-off-by: Qu Wenruo 
Signed-off-by: Wang Xiaoguang 
---
 fs/btrfs/dedupe.c | 49 +
 1 file changed, 49 insertions(+)

diff --git a/fs/btrfs/dedupe.c b/fs/btrfs/dedupe.c
index 9175a5f..bdaea3a 100644
--- a/fs/btrfs/dedupe.c
+++ b/fs/btrfs/dedupe.c
@@ -593,3 +593,52 @@ int btrfs_dedupe_search(struct btrfs_fs_info *fs_info,
}
return ret;
 }
+
+int btrfs_dedupe_calc_hash(struct btrfs_fs_info *fs_info,
+  struct inode *inode, u64 start,
+  struct btrfs_dedupe_hash *hash)
+{
+   int i;
+   int ret;
+   struct page *p;
+   struct btrfs_dedupe_info *dedupe_info = fs_info->dedupe_info;
+   struct crypto_shash *tfm = dedupe_info->dedupe_driver;
+   struct {
+   struct shash_desc desc;
+   char ctx[crypto_shash_descsize(tfm)];
+   } sdesc;
+   u64 dedupe_bs;
+   u64 sectorsize = BTRFS_I(inode)->root->sectorsize;
+
+   if (!fs_info->dedupe_enabled || !hash)
+   return 0;
+
+   if (WARN_ON(dedupe_info == NULL))
+   return -EINVAL;
+
+   WARN_ON(!IS_ALIGNED(start, sectorsize));
+
+   dedupe_bs = dedupe_info->blocksize;
+
+   sdesc.desc.tfm = tfm;
+   sdesc.desc.flags = 0;
+   ret = crypto_shash_init(&sdesc.desc);
+   if (ret)
+   return ret;
+   for (i = 0; sectorsize * i < dedupe_bs; i++) {
+   char *d;
+
+   p = find_get_page(inode->i_mapping,
+ (start >> PAGE_CACHE_SHIFT) + i);
+   if (WARN_ON(!p))
+   return -ENOENT;
+   d = kmap(p);
+   ret = crypto_shash_update(&sdesc.desc, d, sectorsize);
+   kunmap(p);
+   page_cache_release(p);
+   if (ret)
+   return ret;
+   }
+   ret = crypto_shash_final(&sdesc.desc, hash->hash);
+   return ret;
+}
-- 
2.7.4
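
btrfs_dedupe_calc_hash() follows the usual init/update/final pattern: the digest state is set up once, fed one sector at a time, and finalized after the whole dedupe block has been consumed. The userspace sketch below shows that pattern only. It swaps SHA256 for 64-bit FNV-1a so that it needs no crypto library, and `struct fnv_state`, `fnv_*()` and `hash_block()` are hypothetical names. It assumes `dedupe_bs` is a multiple of `sectorsize`.

```c
#include <stddef.h>
#include <stdint.h>

/* FNV-1a is used here purely as a stand-in for SHA256. */
struct fnv_state {
	uint64_t h;
};

static void fnv_init(struct fnv_state *s)
{
	s->h = 0xcbf29ce484222325ULL;	/* FNV-1a 64-bit offset basis */
}

static void fnv_update(struct fnv_state *s, const unsigned char *d, size_t len)
{
	for (size_t i = 0; i < len; i++) {
		s->h ^= d[i];
		s->h *= 0x100000001b3ULL;	/* FNV-1a 64-bit prime */
	}
}

static uint64_t fnv_final(const struct fnv_state *s)
{
	return s->h;
}

/* Hash one dedupe block by feeding it to the digest sector by sector. */
static uint64_t hash_block(const unsigned char *block, size_t dedupe_bs,
			   size_t sectorsize)
{
	struct fnv_state s;

	fnv_init(&s);
	for (size_t i = 0; sectorsize * i < dedupe_bs; i++)
		fnv_update(&s, block + sectorsize * i, sectorsize);
	return fnv_final(&s);
}
```

Feeding the data in pieces gives the same digest as hashing the whole block in one call, which is what lets the kernel code walk the page cache one page at a time.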





[PATCH v10 12/21] btrfs: dedupe: add an inode nodedupe flag

2016-04-01 Thread Qu Wenruo
From: Wang Xiaoguang 

Introduce the BTRFS_INODE_NODEDUPE flag, so that we can explicitly disable
online data deduplication for specified files.

Signed-off-by: Wang Xiaoguang 
---
 fs/btrfs/ctree.h | 1 +
 fs/btrfs/ioctl.c | 6 +-
 2 files changed, 6 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 85044bf..0e8933c 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -2381,6 +2381,7 @@ do { \
 #define BTRFS_INODE_NOATIME    (1 << 9)
 #define BTRFS_INODE_DIRSYNC    (1 << 10)
 #define BTRFS_INODE_COMPRESS   (1 << 11)
+#define BTRFS_INODE_NODEDUPE   (1 << 12)
 
 #define BTRFS_INODE_ROOT_ITEM_INIT (1 << 31)
 
diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index f659ed5..1fca655 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -161,7 +161,8 @@ void btrfs_update_iflags(struct inode *inode)
 /*
  * Inherit flags from the parent inode.
  *
- * Currently only the compression flags and the cow flags are inherited.
+ * Currently only the compression flags, dedupe flags and the cow flags
+ * are inherited.
  */
 void btrfs_inherit_iflags(struct inode *inode, struct inode *dir)
 {
@@ -186,6 +187,9 @@ void btrfs_inherit_iflags(struct inode *inode, struct inode *dir)
BTRFS_I(inode)->flags |= BTRFS_INODE_NODATASUM;
}
 
+   if (flags & BTRFS_INODE_NODEDUPE)
+   BTRFS_I(inode)->flags |= BTRFS_INODE_NODEDUPE;
+
btrfs_update_iflags(inode);
 }
 
-- 
2.7.4





[PATCH v10 14/21] btrfs: dedupe: add per-file online dedupe control

2016-04-01 Thread Qu Wenruo
From: Wang Xiaoguang 

Introduce inode_need_dedupe() to implement per-file online dedupe control.

Signed-off-by: Wang Xiaoguang 
---
 fs/btrfs/inode.c | 15 ++-
 1 file changed, 14 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 96790d0..c80fd74 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -708,6 +708,18 @@ static void end_dedupe_extent(struct inode *inode, u64 start,
}
 }
 
+static inline int inode_need_dedupe(struct btrfs_fs_info *fs_info,
+   struct inode *inode)
+{
+   if (!fs_info->dedupe_enabled)
+   return 0;
+
+   if (BTRFS_I(inode)->flags & BTRFS_INODE_NODEDUPE)
+   return 0;
+
+   return 1;
+}
+
 /*
  * phase two of compressed writeback.  This is the ordered portion
  * of the code, which only gets called in the order the work was
@@ -1680,7 +1692,8 @@ static int run_delalloc_range(struct inode *inode, struct page *locked_page,
} else if (BTRFS_I(inode)->flags & BTRFS_INODE_PREALLOC && !force_cow) {
ret = run_delalloc_nocow(inode, locked_page, start, end,
 page_started, 0, nr_written);
-   } else if (!inode_need_compress(inode) && !fs_info->dedupe_enabled) {
+   } else if (!inode_need_compress(inode) &&
+  !inode_need_dedupe(fs_info, inode)) {
ret = cow_file_range(inode, locked_page, start, end,
  page_started, nr_written, 1, NULL);
} else {
-- 
2.7.4
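
The per-file decision combines a filesystem-wide switch with a per-inode opt-out bit, and new inodes pick up the opt-out from their parent directory. A minimal standalone model of that logic is given below. `struct fs_state`, `struct inode_state`, `need_dedupe()` and `inherit_nodedupe()` are hypothetical stand-ins for the kernel structures and functions, and `INODE_NODEDUPE` uses the same bit position that the series assigns to BTRFS_INODE_NODEDUPE.

```c
#include <stdint.h>

#define INODE_NODEDUPE (1u << 12)	/* same bit position as BTRFS_INODE_NODEDUPE */

struct fs_state {
	int dedupe_enabled;	/* filesystem-wide switch */
};

struct inode_state {
	uint32_t flags;		/* per-inode flag bits */
};

/* Returns 1 when a write to this inode should go through in-band dedupe. */
static int need_dedupe(const struct fs_state *fs,
		       const struct inode_state *inode)
{
	if (!fs->dedupe_enabled)
		return 0;
	if (inode->flags & INODE_NODEDUPE)
		return 0;
	return 1;
}

/* A new inode copies the opt-out bit from its parent directory. */
static void inherit_nodedupe(struct inode_state *inode,
			     const struct inode_state *dir)
{
	if (dir->flags & INODE_NODEDUPE)
		inode->flags |= INODE_NODEDUPE;
}
```

In this model, setting the flag on a directory excludes files created in it afterwards, while files that already exist keep their own setting.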





[PATCH v10 10/21] btrfs: try more times to alloc metadata reserve space

2016-04-01 Thread Qu Wenruo
From: Wang Xiaoguang 

In btrfs_delalloc_reserve_metadata(), the number of metadata bytes we try
to reserve is calculated by the difference between outstanding_extents and
reserved_extents.

When reserve_metadata_bytes() fails to reserve the desired metadata space,
it has already done some reclaim work, such as writing ordered extents.

In that case, outstanding_extents and reserved_extents may have already
changed, and a retry may be able to reserve enough metadata space.

So this patch retries the reservation up to 3 times before concluding
that we have really run out of space.

Such false ENOSPC is mainly caused by small file extents and time
consuming delalloc functions, which mainly affects in-band
de-duplication. (Compress should also be affected, but LZO/zlib is
faster than SHA256, so still harder to trigger than dedupe).

Signed-off-by: Wang Xiaoguang 
---
 fs/btrfs/extent-tree.c | 25 ++---
 1 file changed, 22 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index dabd721..016d2ec 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -2421,7 +2421,7 @@ static int run_one_delayed_ref(struct btrfs_trans_handle *trans,
 * a new extent is revered, then deleted
 * in one tran, and inc/dec get merged to 0.
 *
-* In this case, we need to remove its dedup
+* In this case, we need to remove its dedupe
 * hash.
 */
btrfs_dedupe_del(trans, fs_info, node->bytenr);
@@ -5675,6 +5675,7 @@ int btrfs_delalloc_reserve_metadata(struct inode *inode, u64 num_bytes)
bool delalloc_lock = true;
u64 to_free = 0;
unsigned dropped;
+   int loops = 0;
 
/* If we are a free space inode we need to not flush since we will be in
 * the middle of a transaction commit.  We also don't need the delalloc
@@ -5690,11 +5691,12 @@ int btrfs_delalloc_reserve_metadata(struct inode *inode, u64 num_bytes)
btrfs_transaction_in_commit(root->fs_info))
schedule_timeout(1);
 
+   num_bytes = ALIGN(num_bytes, root->sectorsize);
+
+again:
if (delalloc_lock)
 mutex_lock(&BTRFS_I(inode)->delalloc_mutex);
 
-   num_bytes = ALIGN(num_bytes, root->sectorsize);
-
 spin_lock(&BTRFS_I(inode)->lock);
nr_extents = (unsigned)div64_u64(num_bytes +
 BTRFS_MAX_EXTENT_SIZE - 1,
@@ -5815,6 +5817,23 @@ out_fail:
}
if (delalloc_lock)
 mutex_unlock(&BTRFS_I(inode)->delalloc_mutex);
+   /*
+* The number of metadata bytes is calculated by the difference
+* between outstanding_extents and reserved_extents. Sometimes
+* reserve_metadata_bytes() fails to reserve the wanted metadata bytes
+* even though it has already done some work to reclaim metadata space,
+* so both outstanding_extents and reserved_extents may have changed and
+* the bytes we try to reserve may have changed too (they may be smaller).
+* So here we try to reserve again. This is very useful for online
+* dedupe, which will easily eat almost all meta space.
+*
+* XXX: The value 3 is chosen arbitrarily; it's a good workaround for
+* online dedupe, later we should find a better method to avoid dedupe
+* enospc issue.
+*/
+   if (unlikely(ret == -ENOSPC && loops++ < 3))
+   goto again;
+
return ret;
 }
 
-- 
2.7.4
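
The control flow added to btrfs_delalloc_reserve_metadata() is a bounded retry on -ENOSPC. Reduced to its skeleton it can be sketched as follows. `reserve_with_retry()`, the `reserve_fn` callback and the `flaky_reserve()` test helper are hypothetical, and the real function redoes its extent accounting on each pass rather than calling a single callback.

```c
#include <errno.h>

/* Hypothetical reservation callback: returns 0 on success or -ENOSPC. */
typedef int (*reserve_fn)(void *ctx);

/* Same shape as the patch: retry only on -ENOSPC, and only a bounded number of times. */
static int reserve_with_retry(reserve_fn reserve, void *ctx, int max_retries)
{
	int loops = 0;
	int ret;

again:
	ret = reserve(ctx);
	if (ret == -ENOSPC && loops++ < max_retries)
		goto again;
	return ret;
}

/* Helper for experiments: fails with -ENOSPC for the first fail_first calls. */
struct flaky {
	int calls;
	int fail_first;
};

static int flaky_reserve(void *ctx)
{
	struct flaky *f = ctx;

	f->calls++;
	return f->calls <= f->fail_first ? -ENOSPC : 0;
}
```

With `max_retries` set to 3, a condition of the form `loops++ < 3` allows one initial attempt plus up to three retries, so the reservation is attempted at most four times before -ENOSPC is returned to the caller.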





[PATCH v10 02/21] btrfs: dedupe: Introduce function to initialize dedupe info

2016-04-01 Thread Qu Wenruo
From: Wang Xiaoguang 

Add generic function to initialize dedupe info.

Signed-off-by: Qu Wenruo 
Signed-off-by: Wang Xiaoguang 
---
 fs/btrfs/Makefile |   2 +-
 fs/btrfs/dedupe.c | 154 ++
 fs/btrfs/dedupe.h |  16 +-
 3 files changed, 169 insertions(+), 3 deletions(-)
 create mode 100644 fs/btrfs/dedupe.c

diff --git a/fs/btrfs/Makefile b/fs/btrfs/Makefile
index 128ce17..1b8c627 100644
--- a/fs/btrfs/Makefile
+++ b/fs/btrfs/Makefile
@@ -9,7 +9,7 @@ btrfs-y += super.o ctree.o extent-tree.o print-tree.o 
root-tree.o dir-item.o \
   export.o tree-log.o free-space-cache.o zlib.o lzo.o \
   compression.o delayed-ref.o relocation.o delayed-inode.o scrub.o \
   reada.o backref.o ulist.o qgroup.o send.o dev-replace.o raid56.o \
-  uuid-tree.o props.o hash.o free-space-tree.o
+  uuid-tree.o props.o hash.o free-space-tree.o dedupe.o
 
 btrfs-$(CONFIG_BTRFS_FS_POSIX_ACL) += acl.o
 btrfs-$(CONFIG_BTRFS_FS_CHECK_INTEGRITY) += check-integrity.o
diff --git a/fs/btrfs/dedupe.c b/fs/btrfs/dedupe.c
new file mode 100644
index 000..2211588
--- /dev/null
+++ b/fs/btrfs/dedupe.c
@@ -0,0 +1,154 @@
+/*
+ * Copyright (C) 2016 Fujitsu.  All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public
+ * License v2 as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public
+ * License along with this program; if not, write to the
+ * Free Software Foundation, Inc., 59 Temple Place - Suite 330,
+ * Boston, MA 021110-1307, USA.
+ */
+#include "ctree.h"
+#include "dedupe.h"
+#include "btrfs_inode.h"
+#include "transaction.h"
+#include "delayed-ref.h"
+
+struct inmem_hash {
+   struct rb_node hash_node;
+   struct rb_node bytenr_node;
+   struct list_head lru_list;
+
+   u64 bytenr;
+   u32 num_bytes;
+
+   u8 hash[];
+};
+
+static int init_dedupe_info(struct btrfs_dedupe_info **ret_info, u16 type,
+   u16 backend, u64 blocksize, u64 limit)
+{
+   struct btrfs_dedupe_info *dedupe_info;
+
+   dedupe_info = kzalloc(sizeof(*dedupe_info), GFP_NOFS);
+   if (!dedupe_info)
+   return -ENOMEM;
+
+   dedupe_info->hash_type = type;
+   dedupe_info->backend = backend;
+   dedupe_info->blocksize = blocksize;
+   dedupe_info->limit_nr = limit;
+
+   /* only support SHA256 yet */
+   dedupe_info->dedupe_driver = crypto_alloc_shash("sha256", 0, 0);
+   if (IS_ERR(dedupe_info->dedupe_driver)) {
+   int ret;
+
+   ret = PTR_ERR(dedupe_info->dedupe_driver);
+   kfree(dedupe_info);
+   return ret;
+   }
+
+   dedupe_info->hash_root = RB_ROOT;
+   dedupe_info->bytenr_root = RB_ROOT;
+   dedupe_info->current_nr = 0;
+   INIT_LIST_HEAD(&dedupe_info->lru_list);
+   mutex_init(&dedupe_info->lock);
+
+   *ret_info = dedupe_info;
+   return 0;
+}
+
+static int check_dedupe_parameter(struct btrfs_fs_info *fs_info, u16 hash_type,
+ u16 backend, u64 blocksize, u64 limit_nr,
+ u64 limit_mem, u64 *ret_limit)
+{
+   if (blocksize > BTRFS_DEDUPE_BLOCKSIZE_MAX ||
+   blocksize < BTRFS_DEDUPE_BLOCKSIZE_MIN ||
+   blocksize < fs_info->tree_root->sectorsize ||
+   !is_power_of_2(blocksize))
+   return -EINVAL;
+   /*
+* For a new backend or hash type, we return a special return code,
+* as they can easily be extended.
+*/
+   if (hash_type >= ARRAY_SIZE(btrfs_dedupe_sizes))
+   return -EOPNOTSUPP;
+   if (backend >= BTRFS_DEDUPE_BACKEND_COUNT)
+   return -EOPNOTSUPP;
+
+   /* Backend specific check */
+   if (backend == BTRFS_DEDUPE_BACKEND_INMEMORY) {
+   if (!limit_nr && !limit_mem)
+   *ret_limit = BTRFS_DEDUPE_LIMIT_NR_DEFAULT;
+   else {
+   u64 tmp = (u64)-1;
+
+   if (limit_mem) {
+   tmp = limit_mem / (sizeof(struct inmem_hash) +
+   btrfs_dedupe_hash_size(hash_type));
+   /* Too small limit_mem to fill a hash item */
+   if (!tmp)
+   return -EINVAL;
+   }
+   if (!limit_nr)
+   limit_nr = (u64)-1;
+
+   *ret_limit = min(tmp, limit_nr);
+   

[PATCH v10 16/21] btrfs: dedupe: Add basic tree structure for on-disk dedupe method

2016-04-01 Thread Qu Wenruo
Introduce a new tree, the dedupe tree, to record on-disk dedupe hashes.
It serves as persistent hash storage, instead of the in-memory only
implementation.

Unlike Liu Bo's implementation, this version does not use a hack for the
bytenr -> hash search. Instead it adds a new item type,
BTRFS_DEDUPE_BYTENR_ITEM_KEY, for that search, just like the in-memory
backend.
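
As a rough illustration of the two item types, the keys for both search directions can be built from one hash as below. `struct sketch_key` is a simplified, hypothetical stand-in for struct btrfs_key, and the sketch copies the last 8 bytes of the hash as raw bytes without any endianness conversion; the key type numbers are the ones this patch defines.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define HASH_ITEM_KEY   231	/* BTRFS_DEDUPE_HASH_ITEM_KEY in this patch */
#define BYTENR_ITEM_KEY 232	/* BTRFS_DEDUPE_BYTENR_ITEM_KEY in this patch */

/* Simplified, hypothetical stand-in for struct btrfs_key. */
struct sketch_key {
	uint64_t objectid;
	uint8_t type;
	uint64_t offset;
};

/* Copy the last 8 bytes of the hash into a u64, as raw bytes. */
static uint64_t hash_tail(const uint8_t *hash, size_t hash_len)
{
	uint64_t tail;

	assert(hash_len > 8);
	memcpy(&tail, hash + hash_len - 8, sizeof(tail));
	return tail;
}

/* hash -> bytenr search: (hash tail, HASH_ITEM, bytenr) */
static struct sketch_key make_hash_key(const uint8_t *hash, size_t hash_len,
				       uint64_t bytenr)
{
	struct sketch_key key;

	key.objectid = hash_tail(hash, hash_len);
	key.type = HASH_ITEM_KEY;
	key.offset = bytenr;
	return key;
}

/* bytenr -> hash search: (bytenr, BYTENR_ITEM, hash tail) */
static struct sketch_key make_bytenr_key(const uint8_t *hash, size_t hash_len,
					 uint64_t bytenr)
{
	struct sketch_key key;

	key.objectid = bytenr;
	key.type = BYTENR_ITEM_KEY;
	key.offset = hash_tail(hash, hash_len);
	return key;
}
```

The two keys carry the same information with objectid and offset swapped, so each direction becomes an ordinary key search in the tree.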

Signed-off-by: Liu Bo 
Signed-off-by: Wang Xiaoguang 
Signed-off-by: Qu Wenruo 
---
 fs/btrfs/ctree.h | 53 +++-
 fs/btrfs/dedupe.h|  5 +
 fs/btrfs/disk-io.c   |  6 +
 fs/btrfs/relocation.c|  3 ++-
 include/trace/events/btrfs.h |  3 ++-
 5 files changed, 67 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 0e8933c..659790c 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -100,6 +100,9 @@ struct btrfs_ordered_sum;
 /* tracks free space in block groups. */
 #define BTRFS_FREE_SPACE_TREE_OBJECTID 10ULL
 
+/* on-disk dedupe tree (EXPERIMENTAL) */
+#define BTRFS_DEDUPE_TREE_OBJECTID 11ULL
+
 /* device stats in the device tree */
 #define BTRFS_DEV_STATS_OBJECTID 0ULL
 
@@ -538,7 +541,8 @@ struct btrfs_super_block {
 #define BTRFS_FEATURE_COMPAT_SAFE_CLEAR0ULL
 
 #define BTRFS_FEATURE_COMPAT_RO_SUPP   \
-   (BTRFS_FEATURE_COMPAT_RO_FREE_SPACE_TREE)
+   (BTRFS_FEATURE_COMPAT_RO_FREE_SPACE_TREE |  \
+BTRFS_FEATURE_COMPAT_RO_DEDUPE)
 
 #define BTRFS_FEATURE_COMPAT_RO_SAFE_SET   0ULL
 #define BTRFS_FEATURE_COMPAT_RO_SAFE_CLEAR 0ULL
@@ -960,6 +964,36 @@ struct btrfs_csum_item {
u8 csum;
 } __attribute__ ((__packed__));
 
+/*
+ * Objectid: 0
+ * Type: BTRFS_DEDUPE_STATUS_ITEM_KEY
+ * Offset: 0
+ */
+struct btrfs_dedupe_status_item {
+   __le64 blocksize;
+   __le64 limit_nr;
+   __le16 hash_type;
+   __le16 backend;
+} __attribute__ ((__packed__));
+
+/*
+ * Objectid: Last 64 bit of the hash
+ * Type: BTRFS_DEDUPE_HASH_ITEM_KEY
+ * Offset: Bytenr of the hash
+ *
+ * Used for hash <-> bytenr search
+ * Hash exclude the last 64 bit follows
+ */
+
+/*
+ * Objectid: bytenr
+ * Type: BTRFS_DEDUPE_BYTENR_ITEM_KEY
+ * offset: Last 64 bit of the hash
+ *
+ * Used for bytenr <-> hash search (for free_extent)
+ * Its itemsize should always be 0.
+ */
+
 struct btrfs_dev_stats_item {
/*
 * grow this item struct at the end for future enhancements and keep
@@ -2168,6 +2202,13 @@ struct btrfs_ioctl_defrag_range_args {
 #define BTRFS_CHUNK_ITEM_KEY   228
 
 /*
+ * Dedup item and status
+ */
+#define BTRFS_DEDUPE_STATUS_ITEM_KEY   230
+#define BTRFS_DEDUPE_HASH_ITEM_KEY 231
+#define BTRFS_DEDUPE_BYTENR_ITEM_KEY   232
+
+/*
  * Records the overall state of the qgroups.
  * There's only one instance of this key present,
  * (0, BTRFS_QGROUP_STATUS_KEY, 0)
@@ -3265,6 +3306,16 @@ static inline unsigned long btrfs_leaf_data(struct 
extent_buffer *l)
return offsetof(struct btrfs_leaf, items);
 }
 
+/* btrfs_dedupe_status */
+BTRFS_SETGET_FUNCS(dedupe_status_blocksize, struct btrfs_dedupe_status_item,
+  blocksize, 64);
+BTRFS_SETGET_FUNCS(dedupe_status_limit, struct btrfs_dedupe_status_item,
+  limit_nr, 64);
+BTRFS_SETGET_FUNCS(dedupe_status_hash_type, struct btrfs_dedupe_status_item,
+  hash_type, 16);
+BTRFS_SETGET_FUNCS(dedupe_status_backend, struct btrfs_dedupe_status_item,
+  backend, 16);
+
 /* struct btrfs_file_extent_item */
 BTRFS_SETGET_FUNCS(file_extent_type, struct btrfs_file_extent_item, type, 8);
 BTRFS_SETGET_STACK_FUNCS(stack_file_extent_disk_bytenr,
diff --git a/fs/btrfs/dedupe.h b/fs/btrfs/dedupe.h
index f5d2b45..1ac1bcb 100644
--- a/fs/btrfs/dedupe.h
+++ b/fs/btrfs/dedupe.h
@@ -60,6 +60,8 @@ struct btrfs_dedupe_hash {
u8 hash[];
 };
 
+struct btrfs_root;
+
 struct btrfs_dedupe_info {
/* dedupe blocksize */
u64 blocksize;
@@ -75,6 +77,9 @@ struct btrfs_dedupe_info {
struct list_head lru_list;
u64 limit_nr;
u64 current_nr;
+
+   /* for persist data like dedup-hash and dedupe status */
+   struct btrfs_root *dedupe_root;
 };
 
 struct btrfs_trans_handle;
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index ed6a6fd..c7eda03 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -184,6 +184,7 @@ static struct btrfs_lockdep_keyset {
{ .id = BTRFS_DATA_RELOC_TREE_OBJECTID, .name_stem = "dreloc"   },
{ .id = BTRFS_UUID_TREE_OBJECTID,   .name_stem = "uuid" },
{ .id = BTRFS_FREE_SPACE_TREE_OBJECTID, .name_stem = "free-space" },
+   { .id = BTRFS_DEDUPE_TREE_OBJECTID, .name_stem = "dedupe"   },
{ .id = 0,  .name_stem = "tree" },
 };
 
@@ -1678,6 +1679,11 @@ struct btrfs_root *btrfs_get_fs_root(struct 
btrfs_fs_info *fs_info,
if (location->objectid == BTRFS_FREE_SPACE_TREE_OBJECTID)

[PATCH v10 11/21] btrfs: dedupe: Add ioctl for inband deduplication

2016-04-01 Thread Qu Wenruo
From: Wang Xiaoguang 

Add an ioctl interface for inband deduplication, which includes:
1) enable
2) disable
3) status

Also add a pseudo RO compat flag to indicate that btrfs now supports
inband dedupe.
It does not come with any on-disk format change; it is just a pseudo RO
compat flag.

All these ioctl interfaces are stateless, which means the caller does not
need to care about the previous dedupe state before calling them, only
about the final desired state.

For example, if the user wants to enable dedupe with a specified block
size and limit, just fill the ioctl structure and call the enable ioctl.
There is no need to check whether dedupe is already running.

These ioctls handle cases like reconfiguration or disabling.

Signed-off-by: Qu Wenruo 
Signed-off-by: Wang Xiaoguang 
---
 fs/btrfs/ctree.h   |  1 +
 fs/btrfs/dedupe.c  | 48 
 fs/btrfs/dedupe.h  | 15 ++
 fs/btrfs/disk-io.c |  3 ++
 fs/btrfs/ioctl.c   | 68 ++
 fs/btrfs/sysfs.c   |  2 ++
 include/uapi/linux/btrfs.h | 23 
 7 files changed, 160 insertions(+)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 022ab61..85044bf 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -508,6 +508,7 @@ struct btrfs_super_block {
  * ones specified below then we will fail to mount
  */
 #define BTRFS_FEATURE_COMPAT_RO_FREE_SPACE_TREE(1ULL << 0)
+#define BTRFS_FEATURE_COMPAT_RO_DEDUPE (1ULL << 1)
 
 #define BTRFS_FEATURE_INCOMPAT_MIXED_BACKREF   (1ULL << 0)
 #define BTRFS_FEATURE_INCOMPAT_DEFAULT_SUBVOL  (1ULL << 1)
diff --git a/fs/btrfs/dedupe.c b/fs/btrfs/dedupe.c
index bdaea3a..cfb7fea 100644
--- a/fs/btrfs/dedupe.c
+++ b/fs/btrfs/dedupe.c
@@ -41,6 +41,33 @@ static inline struct inmem_hash *inmem_alloc_hash(u16 type)
GFP_NOFS);
 }
 
+void btrfs_dedupe_status(struct btrfs_fs_info *fs_info,
+struct btrfs_ioctl_dedupe_args *dargs)
+{
+   struct btrfs_dedupe_info *dedupe_info = fs_info->dedupe_info;
+
+   if (!fs_info->dedupe_enabled || !dedupe_info) {
+   dargs->status = 0;
+   dargs->blocksize = 0;
+   dargs->backend = 0;
+   dargs->hash_type = 0;
+   dargs->limit_nr = 0;
+   dargs->current_nr = 0;
+   return;
+   }
+   mutex_lock(&dedupe_info->lock);
+   dargs->status = 1;
+   dargs->blocksize = dedupe_info->blocksize;
+   dargs->backend = dedupe_info->backend;
+   dargs->hash_type = dedupe_info->hash_type;
+   dargs->limit_nr = dedupe_info->limit_nr;
+   dargs->limit_mem = dedupe_info->limit_nr *
+   (sizeof(struct inmem_hash) +
+btrfs_dedupe_sizes[dedupe_info->hash_type]);
+   dargs->current_nr = dedupe_info->current_nr;
+   mutex_unlock(&dedupe_info->lock);
+}
+
 static int init_dedupe_info(struct btrfs_dedupe_info **ret_info, u16 type,
u16 backend, u64 blocksize, u64 limit)
 {
@@ -371,6 +398,27 @@ static void inmem_destroy(struct btrfs_dedupe_info 
*dedupe_info)
mutex_unlock(&dedupe_info->lock);
 }
 
+int btrfs_dedupe_cleanup(struct btrfs_fs_info *fs_info)
+{
+   struct btrfs_dedupe_info *dedupe_info;
+
+   fs_info->dedupe_enabled = 0;
+   /* same as disable */
+   smp_wmb();
+   dedupe_info = fs_info->dedupe_info;
+   fs_info->dedupe_info = NULL;
+
+   if (!dedupe_info)
+   return 0;
+
+   if (dedupe_info->backend == BTRFS_DEDUPE_BACKEND_INMEMORY)
+   inmem_destroy(dedupe_info);
+
+   crypto_free_shash(dedupe_info->dedupe_driver);
+   kfree(dedupe_info);
+   return 0;
+}
+
 int btrfs_dedupe_disable(struct btrfs_fs_info *fs_info)
 {
struct btrfs_dedupe_info *dedupe_info;
diff --git a/fs/btrfs/dedupe.h b/fs/btrfs/dedupe.h
index e5d0d34..f5d2b45 100644
--- a/fs/btrfs/dedupe.h
+++ b/fs/btrfs/dedupe.h
@@ -103,6 +103,15 @@ static inline struct btrfs_dedupe_hash 
*btrfs_dedupe_alloc_hash(u16 type)
 int btrfs_dedupe_enable(struct btrfs_fs_info *fs_info, u16 type, u16 backend,
u64 blocksize, u64 limit_nr, u64 limit_mem);
 
+
+ /*
+ * Get inband dedupe info
+ * Since it needs to access different backends' hash size, which
+ * is not exported, we need such simple function.
+ */
+void btrfs_dedupe_status(struct btrfs_fs_info *fs_info,
+struct btrfs_ioctl_dedupe_args *dargs);
+
 /*
  * Disable dedupe and invalidate all its dedupe data.
  * Called at dedupe disable time.
@@ -110,6 +119,12 @@ int btrfs_dedupe_enable(struct btrfs_fs_info *fs_info, u16 
type, u16 backend,
 int btrfs_dedupe_disable(struct btrfs_fs_info *fs_info);
 
 /*
+ * Cleanup current btrfs_dedupe_info
+ * Called in umount time
+ */
+int btrfs_dedupe_cleanup(struct btrfs_fs_info *fs_info);
+
+/*
  * Calculate hash for dedup.
  * Caller must 

[PATCH v10 15/21] btrfs: relocation: Enhance error handling to avoid BUG_ON

2016-04-01 Thread Qu Wenruo
Since the introduction of the btrfs dedupe tree, it's possible for
balance to race with dedupe disabling.

When this happens, dedupe_enabled will make btrfs_get_fs_root() return
ERR_PTR(-ENOENT).
But due to a bug in the error handling branch, when this happens
backref_cache->nr_nodes is increased, yet the node is neither added to
backref_cache nor is nr_nodes decreased.
This causes a BUG_ON() in backref_cache_cleanup():

[ 2611.668810] [ cut here ]
[ 2611.669946] kernel BUG at
/home/sat/ktest/linux/fs/btrfs/relocation.c:243!
[ 2611.670572] invalid opcode:  [#1] SMP
[ 2611.686797] Call Trace:
[ 2611.687034]  []
btrfs_relocate_block_group+0x1b3/0x290 [btrfs]
[ 2611.687706]  []
btrfs_relocate_chunk.isra.40+0x47/0xd0 [btrfs]
[ 2611.688385]  [] btrfs_balance+0xb22/0x11e0 [btrfs]
[ 2611.688966]  [] btrfs_ioctl_balance+0x391/0x3a0
[btrfs]
[ 2611.689587]  [] btrfs_ioctl+0x1650/0x2290 [btrfs]
[ 2611.690145]  [] ? lru_cache_add+0x3a/0x80
[ 2611.690647]  [] ?
lru_cache_add_active_or_unevictable+0x4c/0xc0
[ 2611.691310]  [] ? handle_mm_fault+0xcd4/0x17f0
[ 2611.691842]  [] ? cp_new_stat+0x153/0x180
[ 2611.692342]  [] ? __vma_link_rb+0xfd/0x110
[ 2611.692842]  [] ? vma_link+0xb9/0xc0
[ 2611.693303]  [] do_vfs_ioctl+0xa1/0x5a0
[ 2611.693781]  [] ? __do_page_fault+0x1b4/0x400
[ 2611.694310]  [] SyS_ioctl+0x41/0x70
[ 2611.694758]  [] entry_SYSCALL_64_fastpath+0x12/0x71
[ 2611.695331] Code: ff 48 8b 45 bf 49 83 af a8 05 00 00 01 49 89 87 a0
05 00 00 e9 2e fd ff ff b8 f4 ff ff ff e9 e4 fb ff ff 0f 0b 0f 0b 0f 0b
0f 0b <0f> 0b 0f 0b 41 89 c6 e9 b8 fb ff ff e8 9e a6 e8 e0 4c 89 e7 44
[ 2611.697870] RIP  []
relocate_block_group+0x741/0x7a0 [btrfs]
[ 2611.698818]  RSP 

This patch calls remove_backref_node() in the error handling branch,
catches the returned -ENOENT in relocate_tree_blocks() and continues
balancing.

Reported-by: Satoru Takeuchi 
Signed-off-by: Qu Wenruo 
---
 fs/btrfs/relocation.c | 22 +-
 1 file changed, 17 insertions(+), 5 deletions(-)

diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
index 33183ce..d72a981 100644
--- a/fs/btrfs/relocation.c
+++ b/fs/btrfs/relocation.c
@@ -887,6 +887,13 @@ again:
root = read_fs_root(rc->extent_root->fs_info, key.offset);
if (IS_ERR(root)) {
err = PTR_ERR(root);
+   /*
+* Don't forget to cleanup current node.
+* As it may not be added to backref_cache but nr_node
+* increased.
+* This will cause BUG_ON() in backref_cache_cleanup().
+*/
+   remove_backref_node(&rc->backref_cache, cur);
goto out;
}
 
@@ -2990,14 +2997,21 @@ int relocate_tree_blocks(struct btrfs_trans_handle 
*trans,
}
 
rb_node = rb_first(blocks);
-   while (rb_node) {
+   for (rb_node = rb_first(blocks); rb_node; rb_node = rb_next(rb_node)) {
block = rb_entry(rb_node, struct tree_block, rb_node);
 
node = build_backref_tree(rc, &block->key,
  block->level, block->bytenr);
if (IS_ERR(node)) {
+   /*
+* The root(dedupe tree yet) of the tree block is
+* going to be freed and can't be reached.
+* Just skip it and continue balancing.
+*/
+   if (PTR_ERR(node) == -ENOENT)
+   continue;
err = PTR_ERR(node);
-   goto out;
+   break;
}
 
ret = relocate_tree_block(trans, rc, node, &block->key,
@@ -3005,11 +3019,9 @@ int relocate_tree_blocks(struct btrfs_trans_handle 
*trans,
if (ret < 0) {
if (ret != -EAGAIN || rb_node == rb_first(blocks))
err = ret;
-   goto out;
+   break;
}
-   rb_node = rb_next(rb_node);
}
-out:
err = finish_pending_nodes(trans, rc, path, err);
 
 out_free_path:
-- 
2.7.4
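
The control-flow change in relocate_tree_blocks() can be sketched as below. This is an assumption-laden model, not the kernel code: `results[]` is a hypothetical array of per-block outcomes standing in for build_backref_tree() and relocate_tree_block(). Note why the patch switches from `while` to `for`: with the old loop a `continue` would skip the `rb_node = rb_next(rb_node)` step at the bottom and spin forever.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/*
 * results[i] is a hypothetical per-block outcome: 0 for success or a
 * negative errno. Returns the first fatal error (or 0) and stores the
 * number of blocks actually relocated in *done.
 */
static int relocate_all(const int *results, size_t n, size_t *done)
{
	int err = 0;
	size_t i;

	assert(results != NULL && done != NULL);
	*done = 0;

	/* The for loop advances even when an iteration hits "continue". */
	for (i = 0; i < n; i++) {
		if (results[i] == -ENOENT)
			continue;	/* the root is going away: skip this block */
		if (results[i] < 0) {
			err = results[i];
			break;		/* fall through to the common cleanup */
		}
		(*done)++;
	}
	/* The common cleanup (finish_pending_nodes() in the patch) runs here. */
	return err;
}
```

Using `break` instead of `goto out` keeps a single exit path, so the cleanup step always sees the final error value.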





[PATCH v10 08/21] btrfs: ordered-extent: Add support for dedupe

2016-04-01 Thread Qu Wenruo
From: Wang Xiaoguang 

Add ordered-extent support for dedupe.

Note that the current ordered-extent support only handles non-compressed
source extents.
Support for compressed source extents will be added later.

Signed-off-by: Qu Wenruo 
Signed-off-by: Wang Xiaoguang 
---
 fs/btrfs/ordered-data.c | 44 
 fs/btrfs/ordered-data.h | 13 +
 2 files changed, 53 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c
index 0de7da5..ef24ad1 100644
--- a/fs/btrfs/ordered-data.c
+++ b/fs/btrfs/ordered-data.c
@@ -26,6 +26,7 @@
 #include "extent_io.h"
 #include "disk-io.h"
 #include "compression.h"
+#include "dedupe.h"
 
 static struct kmem_cache *btrfs_ordered_extent_cache;
 
@@ -184,7 +185,8 @@ static inline struct rb_node *tree_search(struct 
btrfs_ordered_inode_tree *tree,
  */
 static int __btrfs_add_ordered_extent(struct inode *inode, u64 file_offset,
  u64 start, u64 len, u64 disk_len,
- int type, int dio, int compress_type)
+ int type, int dio, int compress_type,
+ struct btrfs_dedupe_hash *hash)
 {
struct btrfs_root *root = BTRFS_I(inode)->root;
struct btrfs_ordered_inode_tree *tree;
@@ -204,6 +206,31 @@ static int __btrfs_add_ordered_extent(struct inode *inode, 
u64 file_offset,
entry->inode = igrab(inode);
entry->compress_type = compress_type;
entry->truncated_len = (u64)-1;
+   entry->hash = NULL;
+   /*
+* Hash hit must go through dedupe routine at all cost, even dedupe
+* is disabled. As its delayed ref is already increased.
+*/
+   if (hash && (hash->bytenr || root->fs_info->dedupe_enabled)) {
+   struct btrfs_dedupe_info *dedupe_info;
+
+   dedupe_info = root->fs_info->dedupe_info;
+   if (WARN_ON(dedupe_info == NULL)) {
+   kmem_cache_free(btrfs_ordered_extent_cache,
+   entry);
+   return -EINVAL;
+   }
+   entry->hash = btrfs_dedupe_alloc_hash(dedupe_info->hash_type);
+   if (!entry->hash) {
+   kmem_cache_free(btrfs_ordered_extent_cache, entry);
+   return -ENOMEM;
+   }
+   entry->hash->bytenr = hash->bytenr;
+   entry->hash->num_bytes = hash->num_bytes;
+   memcpy(entry->hash->hash, hash->hash,
+  btrfs_dedupe_sizes[dedupe_info->hash_type]);
+   }
+
if (type != BTRFS_ORDERED_IO_DONE && type != BTRFS_ORDERED_COMPLETE)
set_bit(type, &entry->flags);
 
@@ -250,15 +277,23 @@ int btrfs_add_ordered_extent(struct inode *inode, u64 
file_offset,
 {
return __btrfs_add_ordered_extent(inode, file_offset, start, len,
  disk_len, type, 0,
- BTRFS_COMPRESS_NONE);
+ BTRFS_COMPRESS_NONE, NULL);
 }
 
+int btrfs_add_ordered_extent_dedupe(struct inode *inode, u64 file_offset,
+  u64 start, u64 len, u64 disk_len, int type,
+  struct btrfs_dedupe_hash *hash)
+{
+   return __btrfs_add_ordered_extent(inode, file_offset, start, len,
+ disk_len, type, 0,
+ BTRFS_COMPRESS_NONE, hash);
+}
 int btrfs_add_ordered_extent_dio(struct inode *inode, u64 file_offset,
 u64 start, u64 len, u64 disk_len, int type)
 {
return __btrfs_add_ordered_extent(inode, file_offset, start, len,
  disk_len, type, 1,
- BTRFS_COMPRESS_NONE);
+ BTRFS_COMPRESS_NONE, NULL);
 }
 
 int btrfs_add_ordered_extent_compress(struct inode *inode, u64 file_offset,
@@ -267,7 +302,7 @@ int btrfs_add_ordered_extent_compress(struct inode *inode, 
u64 file_offset,
 {
return __btrfs_add_ordered_extent(inode, file_offset, start, len,
  disk_len, type, 0,
- compress_type);
+ compress_type, NULL);
 }
 
 /*
@@ -577,6 +612,7 @@ void btrfs_put_ordered_extent(struct btrfs_ordered_extent 
*entry)
list_del(&sum->list);
kfree(sum);
}
+   kfree(entry->hash);
kmem_cache_free(btrfs_ordered_extent_cache, entry);
}
 }
diff --git a/fs/btrfs/ordered-data.h b/fs/btrfs/ordered-data.h
index 23c9605..8a54476 100644
--- a/fs/btrfs/ordered-data.h
+++ b/fs/btrfs/ordered-data.h
@@ -139,6 +139,16 

[PATCH v10 17/21] btrfs: dedupe: Introduce interfaces to resume and cleanup dedupe info

2016-04-01 Thread Qu Wenruo
Since we will introduce a new on-disk based dedupe method, introduce new
interfaces to resume a previous dedupe setup.

And since we introduce a new tree for the status, also add a disable
handler for it.

Signed-off-by: Wang Xiaoguang 
Signed-off-by: Qu Wenruo 
---
 fs/btrfs/dedupe.c  | 197 -
 fs/btrfs/dedupe.h  |  13 
 fs/btrfs/disk-io.c |  25 ++-
 fs/btrfs/disk-io.h |   1 +
 4 files changed, 232 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/dedupe.c b/fs/btrfs/dedupe.c
index cfb7fea..a274c1c 100644
--- a/fs/btrfs/dedupe.c
+++ b/fs/btrfs/dedupe.c
@@ -21,6 +21,8 @@
 #include "transaction.h"
 #include "delayed-ref.h"
 #include "qgroup.h"
+#include "disk-io.h"
+#include "locking.h"
 
 struct inmem_hash {
struct rb_node hash_node;
@@ -102,10 +104,69 @@ static int init_dedupe_info(struct btrfs_dedupe_info 
**ret_info, u16 type,
return 0;
 }
 
+static int init_dedupe_tree(struct btrfs_fs_info *fs_info,
+   struct btrfs_dedupe_info *dedupe_info)
+{
+   struct btrfs_root *dedupe_root;
+   struct btrfs_key key;
+   struct btrfs_path *path;
+   struct btrfs_dedupe_status_item *status;
+   struct btrfs_trans_handle *trans;
+   int ret;
+
+   path = btrfs_alloc_path();
+   if (!path)
+   return -ENOMEM;
+
+   trans = btrfs_start_transaction(fs_info->tree_root, 2);
+   if (IS_ERR(trans)) {
+   ret = PTR_ERR(trans);
+   goto out;
+   }
+   dedupe_root = btrfs_create_tree(trans, fs_info,
+  BTRFS_DEDUPE_TREE_OBJECTID);
+   if (IS_ERR(dedupe_root)) {
+   ret = PTR_ERR(dedupe_root);
+   btrfs_abort_transaction(trans, fs_info->tree_root, ret);
+   goto out;
+   }
+   dedupe_info->dedupe_root = dedupe_root;
+
+   key.objectid = 0;
+   key.type = BTRFS_DEDUPE_STATUS_ITEM_KEY;
+   key.offset = 0;
+
+   ret = btrfs_insert_empty_item(trans, dedupe_root, path, &key,
+ sizeof(*status));
+   if (ret < 0) {
+   btrfs_abort_transaction(trans, fs_info->tree_root, ret);
+   goto out;
+   }
+
+   status = btrfs_item_ptr(path->nodes[0], path->slots[0],
+   struct btrfs_dedupe_status_item);
+   btrfs_set_dedupe_status_blocksize(path->nodes[0], status,
+dedupe_info->blocksize);
+   btrfs_set_dedupe_status_limit(path->nodes[0], status,
+   dedupe_info->limit_nr);
+   btrfs_set_dedupe_status_hash_type(path->nodes[0], status,
+   dedupe_info->hash_type);
+   btrfs_set_dedupe_status_backend(path->nodes[0], status,
+   dedupe_info->backend);
+   btrfs_mark_buffer_dirty(path->nodes[0]);
+out:
+   btrfs_free_path(path);
+   if (ret == 0)
+   btrfs_commit_transaction(trans, fs_info->tree_root);
+   return ret;
+}
+
 static int check_dedupe_parameter(struct btrfs_fs_info *fs_info, u16 hash_type,
  u16 backend, u64 blocksize, u64 limit_nr,
  u64 limit_mem, u64 *ret_limit)
 {
+   u64 compat_ro_flag = btrfs_super_compat_ro_flags(fs_info->super_copy);
+
if (blocksize > BTRFS_DEDUPE_BLOCKSIZE_MAX ||
blocksize < BTRFS_DEDUPE_BLOCKSIZE_MIN ||
blocksize < fs_info->tree_root->sectorsize ||
@@ -140,8 +201,12 @@ static int check_dedupe_parameter(struct btrfs_fs_info 
*fs_info, u16 hash_type,
*ret_limit = min(tmp, limit_nr);
}
}
-   if (backend == BTRFS_DEDUPE_BACKEND_ONDISK)
+   if (backend == BTRFS_DEDUPE_BACKEND_ONDISK) {
+   /* Ondisk backend must use RO compat feature */
+   if (!(compat_ro_flag & BTRFS_FEATURE_COMPAT_RO_DEDUPE))
+   return -EOPNOTSUPP;
*ret_limit = 0;
+   }
return 0;
 }
 
@@ -150,11 +215,16 @@ int btrfs_dedupe_enable(struct btrfs_fs_info *fs_info, 
u16 type, u16 backend,
 {
struct btrfs_dedupe_info *dedupe_info;
u64 limit = 0;
+   u64 compat_ro_flag = btrfs_super_compat_ro_flags(fs_info->super_copy);
+   int create_tree;
int ret = 0;
 
/* only one limit is accepted for enable*/
if (limit_nr && limit_mem)
return -EINVAL;
+   /* enable and disable may modify ondisk data, so block RO fs*/
+   if (fs_info->sb->s_flags & MS_RDONLY)
+   return -EROFS;
 
ret = check_dedupe_parameter(fs_info, type, backend, blocksize,
 limit_nr, limit_mem, &limit);
@@ -179,9 +249,19 @@ int btrfs_dedupe_enable(struct btrfs_fs_info *fs_info, u16 
type, u16 backend,
}
 
 enable:
+   create_tree = compat_ro_flag & 

[PATCH v10 20/21] btrfs: dedupe: Add support for adding hash for on-disk backend

2016-04-01 Thread Qu Wenruo
The on-disk backend can now add hashes.

Since all the needed on-disk backend functions have been added, also
allow the on-disk backend to be used, by changing
BTRFS_DEDUPE_BACKEND_COUNT from 1 (inmemory only) to 2 (inmemory +
ondisk).

Signed-off-by: Wang Xiaoguang 
Signed-off-by: Qu Wenruo 
---
 fs/btrfs/dedupe.c | 83 +++
 fs/btrfs/dedupe.h |  3 +-
 2 files changed, 84 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/dedupe.c b/fs/btrfs/dedupe.c
index 7c5d58a..1f0178e 100644
--- a/fs/btrfs/dedupe.c
+++ b/fs/btrfs/dedupe.c
@@ -437,6 +437,87 @@ out:
return 0;
 }
 
+static int ondisk_search_bytenr(struct btrfs_trans_handle *trans,
+   struct btrfs_dedupe_info *dedupe_info,
+   struct btrfs_path *path, u64 bytenr,
+   int prepare_del);
+static int ondisk_search_hash(struct btrfs_dedupe_info *dedupe_info, u8 *hash,
+ u64 *bytenr_ret, u32 *num_bytes_ret);
+static int ondisk_add(struct btrfs_trans_handle *trans,
+ struct btrfs_dedupe_info *dedupe_info,
+ struct btrfs_dedupe_hash *hash)
+{
+   struct btrfs_path *path;
+   struct btrfs_root *dedupe_root = dedupe_info->dedupe_root;
+   struct btrfs_key key;
+   u64 hash_offset;
+   u64 bytenr;
+   u32 num_bytes;
+   int hash_len = btrfs_dedupe_sizes[dedupe_info->hash_type];
+   int ret;
+
+   if (WARN_ON(hash_len <= 8 ||
+   !IS_ALIGNED(hash->bytenr, dedupe_root->sectorsize)))
+   return -EINVAL;
+
+   path = btrfs_alloc_path();
+   if (!path)
+   return -ENOMEM;
+
+   mutex_lock(&dedupe_info->lock);
+
+   ret = ondisk_search_bytenr(NULL, dedupe_info, path, hash->bytenr, 0);
+   if (ret < 0)
+   goto out;
+   if (ret > 0) {
+   ret = 0;
+   goto out;
+   }
+   btrfs_release_path(path);
+
+   ret = ondisk_search_hash(dedupe_info, hash->hash, &bytenr, &num_bytes);
+   if (ret < 0)
+   goto out;
+   /* Same hash found, don't re-add to save dedupe tree space */
+   if (ret > 0) {
+   ret = 0;
+   goto out;
+   }
+
+   /* Insert hash->bytenr item */
+   memcpy(&key.objectid, hash->hash + hash_len - 8, 8);
+   key.type = BTRFS_DEDUPE_HASH_ITEM_KEY;
+   key.offset = hash->bytenr;
+
+   /* The last 8 bytes of the hash are in the key, not in the item */
+   ret = btrfs_insert_empty_item(trans, dedupe_root, path, &key,
+ hash_len - 8);
+   WARN_ON(ret == -EEXIST);
+   if (ret < 0)
+   goto out;
+   hash_offset = btrfs_item_ptr_offset(path->nodes[0], path->slots[0]);
+   write_extent_buffer(path->nodes[0], hash->hash,
+   hash_offset, hash_len - 8);
+   btrfs_mark_buffer_dirty(path->nodes[0]);
+   btrfs_release_path(path);
+
+   /* Then bytenr->hash item */
+   key.objectid = hash->bytenr;
+   key.type = BTRFS_DEDUPE_BYTENR_ITEM_KEY;
+   memcpy(&key.offset, hash->hash + hash_len - 8, 8);
+
+   ret = btrfs_insert_empty_item(trans, dedupe_root, path, &key, 0);
+   WARN_ON(ret == -EEXIST);
+   if (ret < 0)
+   goto out;
+   btrfs_mark_buffer_dirty(path->nodes[0]);
+
+out:
+   mutex_unlock(&dedupe_info->lock);
+   btrfs_free_path(path);
+   return ret;
+}
+
 int btrfs_dedupe_add(struct btrfs_trans_handle *trans,
 struct btrfs_fs_info *fs_info,
 struct btrfs_dedupe_hash *hash)
@@ -458,6 +539,8 @@ int btrfs_dedupe_add(struct btrfs_trans_handle *trans,
 
if (dedupe_info->backend == BTRFS_DEDUPE_BACKEND_INMEMORY)
return inmem_add(dedupe_info, hash);
+   if (dedupe_info->backend == BTRFS_DEDUPE_BACKEND_ONDISK)
+   return ondisk_add(trans, dedupe_info, hash);
return -EINVAL;
 }
 
diff --git a/fs/btrfs/dedupe.h b/fs/btrfs/dedupe.h
index bfcacd7..1573456 100644
--- a/fs/btrfs/dedupe.h
+++ b/fs/btrfs/dedupe.h
@@ -31,8 +31,7 @@
 #define BTRFS_DEDUPE_BACKEND_INMEMORY  0
 #define BTRFS_DEDUPE_BACKEND_ONDISK1
 
-/* Only support inmemory yet, so count is still only 1 */
-#define BTRFS_DEDUPE_BACKEND_COUNT 1
+#define BTRFS_DEDUPE_BACKEND_COUNT 2
 
 /* Dedup block size limit and default value */
 #define BTRFS_DEDUPE_BLOCKSIZE_MAX (8 * 1024 * 1024)
-- 
2.7.4
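
The "search both indexes first, insert only when neither matches" flow of ondisk_add() can be modelled in user space. This is a sketch under assumptions: a small fixed array with linear scans stands in for the dedupe tree and its two key searches, and `struct store`, `find_bytenr()`, `find_hash()` and `store_add()` are hypothetical names.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <string.h>

#define HASH_LEN 32	/* SHA-256 size, the only hash type in this series */
#define MAX_ENTRIES 16

struct entry {
	uint64_t bytenr;
	uint8_t hash[HASH_LEN];
};

/* Hypothetical stand-in for the dedupe tree. */
struct store {
	struct entry e[MAX_ENTRIES];
	int nr;
};

static int find_bytenr(const struct store *s, uint64_t bytenr)
{
	for (int i = 0; i < s->nr; i++)
		if (s->e[i].bytenr == bytenr)
			return i;
	return -1;
}

static int find_hash(const struct store *s, const uint8_t *hash)
{
	for (int i = 0; i < s->nr; i++)
		if (memcmp(s->e[i].hash, hash, HASH_LEN) == 0)
			return i;
	return -1;
}

/* Returns 0 when the entry was added or was already covered. */
static int store_add(struct store *s, uint64_t bytenr, const uint8_t *hash)
{
	assert(s != NULL && hash != NULL);

	/* This extent is already recorded: nothing to do. */
	if (find_bytenr(s, bytenr) >= 0)
		return 0;
	/* Same hash already recorded: do not re-add, to save space. */
	if (find_hash(s, hash) >= 0)
		return 0;
	if (s->nr == MAX_ENTRIES)
		return -ENOSPC;

	s->e[s->nr].bytenr = bytenr;
	memcpy(s->e[s->nr].hash, hash, HASH_LEN);
	s->nr++;
	return 0;
}
```

Both "already present" cases return success, so callers do not need to distinguish a fresh insert from a duplicate.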





[PATCH v10 04/21] btrfs: dedupe: Introduce function to remove hash from in-memory tree

2016-04-01 Thread Qu Wenruo
From: Wang Xiaoguang 

Introduce static function inmem_del() to remove a hash from the in-memory
dedupe tree.
And implement the btrfs_dedupe_del() and btrfs_dedupe_disable() interfaces.

Signed-off-by: Qu Wenruo 
Signed-off-by: Wang Xiaoguang 
---
 fs/btrfs/dedupe.c | 105 ++
 1 file changed, 105 insertions(+)

diff --git a/fs/btrfs/dedupe.c b/fs/btrfs/dedupe.c
index 4e8455e..a229ded 100644
--- a/fs/btrfs/dedupe.c
+++ b/fs/btrfs/dedupe.c
@@ -303,3 +303,108 @@ int btrfs_dedupe_add(struct btrfs_trans_handle *trans,
return inmem_add(dedupe_info, hash);
return -EINVAL;
 }
+
+static struct inmem_hash *
+inmem_search_bytenr(struct btrfs_dedupe_info *dedupe_info, u64 bytenr)
+{
+   struct rb_node **p = &dedupe_info->bytenr_root.rb_node;
+   struct rb_node *parent = NULL;
+   struct inmem_hash *entry = NULL;
+
+   while (*p) {
+   parent = *p;
+   entry = rb_entry(parent, struct inmem_hash, bytenr_node);
+
+   if (bytenr < entry->bytenr)
+   p = &(*p)->rb_left;
+   else if (bytenr > entry->bytenr)
+   p = &(*p)->rb_right;
+   else
+   return entry;
+   }
+
+   return NULL;
+}
+
+/* Delete a hash from in-memory dedupe tree */
+static int inmem_del(struct btrfs_dedupe_info *dedupe_info, u64 bytenr)
+{
+   struct inmem_hash *hash;
+
+   mutex_lock(&dedupe_info->lock);
+   hash = inmem_search_bytenr(dedupe_info, bytenr);
+   if (!hash) {
+   mutex_unlock(&dedupe_info->lock);
+   return 0;
+   }
+
+   __inmem_del(dedupe_info, hash);
+   mutex_unlock(&dedupe_info->lock);
+   return 0;
+}
+
+/* Remove a dedupe hash from dedupe tree */
+int btrfs_dedupe_del(struct btrfs_trans_handle *trans,
+struct btrfs_fs_info *fs_info, u64 bytenr)
+{
+   struct btrfs_dedupe_info *dedupe_info = fs_info->dedupe_info;
+
+   if (!fs_info->dedupe_enabled)
+   return 0;
+
+   if (WARN_ON(dedupe_info == NULL))
+   return -EINVAL;
+
+   if (dedupe_info->backend == BTRFS_DEDUPE_BACKEND_INMEMORY)
+   return inmem_del(dedupe_info, bytenr);
+   return -EINVAL;
+}
+
+static void inmem_destroy(struct btrfs_dedupe_info *dedupe_info)
+{
+   struct inmem_hash *entry, *tmp;
+
+   mutex_lock(&dedupe_info->lock);
+   list_for_each_entry_safe(entry, tmp, &dedupe_info->lru_list, lru_list)
+   __inmem_del(dedupe_info, entry);
+   mutex_unlock(&dedupe_info->lock);
+}
+
+int btrfs_dedupe_disable(struct btrfs_fs_info *fs_info)
+{
+   struct btrfs_dedupe_info *dedupe_info;
+   int ret;
+
+   /* Here we don't want to increase refs of dedupe_info */
+   fs_info->dedupe_enabled = 0;
+
+   dedupe_info = fs_info->dedupe_info;
+
+   if (!dedupe_info)
+   return 0;
+
+   /* Don't allow disable status change in RO mount */
+   if (fs_info->sb->s_flags & MS_RDONLY)
+   return -EROFS;
+
+   /*
+* Wait for all unfinished write to complete dedupe routine
+* As disable operation is not a frequent operation, we are
+* OK to use heavy but safe sync_filesystem().
+*/
+   down_read(&fs_info->sb->s_umount);
+   ret = sync_filesystem(fs_info->sb);
+   up_read(&fs_info->sb->s_umount);
+   if (ret < 0)
+   return ret;
+
+   fs_info->dedupe_info = NULL;
+
+   /* now we are OK to clean up everything */
+   if (dedupe_info->backend == BTRFS_DEDUPE_BACKEND_INMEMORY)
+   inmem_destroy(dedupe_info);
+
+   crypto_free_shash(dedupe_info->dedupe_driver);
+   kfree(dedupe_info);
+   return 0;
+}
-- 
2.7.4





[PATCH v10 06/21] btrfs: dedupe: Introduce function to search for an existing hash

2016-04-01 Thread Qu Wenruo
From: Wang Xiaoguang 

Introduce static function inmem_search() to handle the job for in-memory
hash tree.

The trick is, we must ensure the delayed ref head is not being run at
the time we search for the hash.

With inmem_search(), we can implement the btrfs_dedupe_search()
interface.
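
For readers who want to see the lookup-and-touch behaviour in isolation,
here is a minimal userspace sketch. It is not the patch's code: the patch
walks an rb-tree under dedupe_info->lock, while this sketch walks a plain
doubly linked list, and all names (lru_entry, lru_search, ...) are
hypothetical. It only shows the idea that a hit is re-added at the LRU
list head.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define HASH_LEN 32	/* SHA256 digest size, as in btrfs_dedupe_sizes[] */

/* Hypothetical stand-in for struct inmem_hash. */
struct lru_entry {
	struct lru_entry *prev, *next;
	unsigned long long bytenr;
	unsigned char hash[HASH_LEN];
};

/* Circular list with a sentinel; head.next is the most recently used. */
struct lru_list {
	struct lru_entry head;
};

static void lru_init(struct lru_list *l)
{
	l->head.prev = l->head.next = &l->head;
}

static void lru_unlink(struct lru_entry *e)
{
	e->prev->next = e->next;
	e->next->prev = e->prev;
}

static void lru_push_front(struct lru_list *l, struct lru_entry *e)
{
	e->prev = &l->head;
	e->next = l->head.next;
	l->head.next->prev = e;
	l->head.next = e;
}

/* Linear search by hash; on a hit the entry moves to the list head. */
static struct lru_entry *lru_search(struct lru_list *l,
				    const unsigned char *hash)
{
	struct lru_entry *e;

	for (e = l->head.next; e != &l->head; e = e->next) {
		if (memcmp(hash, e->hash, HASH_LEN) == 0) {
			lru_unlink(e);
			lru_push_front(l, e);
			return e;
		}
	}
	return NULL;
}
```

A real caller would hold a lock around lru_search(), since even a
"read" modifies the list order.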

Signed-off-by: Qu Wenruo 
Signed-off-by: Wang Xiaoguang 
---
 fs/btrfs/dedupe.c | 185 ++
 1 file changed, 185 insertions(+)

diff --git a/fs/btrfs/dedupe.c b/fs/btrfs/dedupe.c
index a229ded..9175a5f 100644
--- a/fs/btrfs/dedupe.c
+++ b/fs/btrfs/dedupe.c
@@ -20,6 +20,7 @@
 #include "btrfs_inode.h"
 #include "transaction.h"
 #include "delayed-ref.h"
+#include "qgroup.h"
 
 struct inmem_hash {
struct rb_node hash_node;
@@ -408,3 +409,187 @@ int btrfs_dedupe_disable(struct btrfs_fs_info *fs_info)
kfree(dedupe_info);
return 0;
 }
+
+/*
+ * Caller must ensure the corresponding ref head is not being run.
+ */
+static struct inmem_hash *
+inmem_search_hash(struct btrfs_dedupe_info *dedupe_info, u8 *hash)
+{
+   struct rb_node **p = &dedupe_info->hash_root.rb_node;
+   struct rb_node *parent = NULL;
+   struct inmem_hash *entry = NULL;
+   u16 hash_type = dedupe_info->hash_type;
+   int hash_len = btrfs_dedupe_sizes[hash_type];
+
+   while (*p) {
+   parent = *p;
+   entry = rb_entry(parent, struct inmem_hash, hash_node);
+
+   if (memcmp(hash, entry->hash, hash_len) < 0) {
+   p = &(*p)->rb_left;
+   } else if (memcmp(hash, entry->hash, hash_len) > 0) {
+   p = &(*p)->rb_right;
+   } else {
+   /* Found, need to re-add it to LRU list head */
+   list_del(&entry->lru_list);
+   list_add(&entry->lru_list, &dedupe_info->lru_list);
+   return entry;
+   }
+   }
+   return NULL;
+}
+
+static int inmem_search(struct btrfs_dedupe_info *dedupe_info,
+   struct inode *inode, u64 file_pos,
+   struct btrfs_dedupe_hash *hash)
+{
+   int ret;
+   struct btrfs_root *root = BTRFS_I(inode)->root;
+   struct btrfs_trans_handle *trans;
+   struct btrfs_delayed_ref_root *delayed_refs;
+   struct btrfs_delayed_ref_head *head;
+   struct btrfs_delayed_ref_head *insert_head;
+   struct btrfs_delayed_data_ref *insert_dref;
+   struct btrfs_qgroup_extent_record *insert_qrecord = NULL;
+   struct inmem_hash *found_hash;
+   int free_insert = 1;
+   u64 bytenr;
+   u32 num_bytes;
+
+   insert_head = kmem_cache_alloc(btrfs_delayed_ref_head_cachep, GFP_NOFS);
+   if (!insert_head)
+   return -ENOMEM;
+   insert_head->extent_op = NULL;
+   insert_dref = kmem_cache_alloc(btrfs_delayed_data_ref_cachep, GFP_NOFS);
+   if (!insert_dref) {
+   kmem_cache_free(btrfs_delayed_ref_head_cachep, insert_head);
+   return -ENOMEM;
+   }
+   if (root->fs_info->quota_enabled &&
+   is_fstree(root->root_key.objectid)) {
+   insert_qrecord = kmalloc(sizeof(*insert_qrecord), GFP_NOFS);
+   if (!insert_qrecord) {
+   kmem_cache_free(btrfs_delayed_ref_head_cachep,
+   insert_head);
+   kmem_cache_free(btrfs_delayed_data_ref_cachep,
+   insert_dref);
+   return -ENOMEM;
+   }
+   }
+
+   trans = btrfs_join_transaction(root);
+   if (IS_ERR(trans)) {
+   ret = PTR_ERR(trans);
+   goto free_mem;
+   }
+
+again:
+   mutex_lock(&dedupe_info->lock);
+   found_hash = inmem_search_hash(dedupe_info, hash->hash);
+   /* If we don't find a duplicated extent, just return. */
+   if (!found_hash) {
+   ret = 0;
+   goto out;
+   }
+   bytenr = found_hash->bytenr;
+   num_bytes = found_hash->num_bytes;
+
+   delayed_refs = &trans->transaction->delayed_refs;
+
+   spin_lock(&delayed_refs->lock);
+   head = btrfs_find_delayed_ref_head(trans, bytenr);
+   if (!head) {
+   /*
+* We can safely insert a new delayed_ref as long as we
+* hold delayed_refs->lock.
+* Only need to use atomic inc_extent_ref()
+*/
+   btrfs_add_delayed_data_ref_locked(root->fs_info, trans,
+   insert_dref, insert_head, insert_qrecord,
+   bytenr, num_bytes, 0, root->root_key.objectid,
+   btrfs_ino(inode), file_pos, 0,
+   BTRFS_ADD_DELAYED_REF);
+   spin_unlock(&delayed_refs->lock);
+
+   /* add_delayed_data_ref_locked will free unused memory */
+  

[PATCH v10 19/21] btrfs: dedupe: Add support to delete hash for on-disk backend

2016-04-01 Thread Qu Wenruo
The on-disk backend can now delete hashes.

Signed-off-by: Wang Xiaoguang 
Signed-off-by: Qu Wenruo 
---
 fs/btrfs/dedupe.c | 100 ++
 1 file changed, 100 insertions(+)

diff --git a/fs/btrfs/dedupe.c b/fs/btrfs/dedupe.c
index 00f2a01..7c5d58a 100644
--- a/fs/btrfs/dedupe.c
+++ b/fs/btrfs/dedupe.c
@@ -500,6 +500,104 @@ static int inmem_del(struct btrfs_dedupe_info *dedupe_info, u64 bytenr)
return 0;
 }
 
+/*
+ * If prepare_del is given, this will setup search_slot() for delete.
+ * Caller needs to do proper locking.
+ *
+ * Return > 0 for found.
+ * Return 0 for not found.
+ * Return < 0 for error.
+ */
+static int ondisk_search_bytenr(struct btrfs_trans_handle *trans,
+   struct btrfs_dedupe_info *dedupe_info,
+   struct btrfs_path *path, u64 bytenr,
+   int prepare_del)
+{
+   struct btrfs_key key;
+   struct btrfs_root *dedupe_root = dedupe_info->dedupe_root;
+   int ret;
+   int ins_len = 0;
+   int cow = 0;
+
+   if (prepare_del) {
+   if (WARN_ON(trans == NULL))
+   return -EINVAL;
+   cow = 1;
+   ins_len = -1;
+   }
+
+   key.objectid = bytenr;
+   key.type = BTRFS_DEDUPE_BYTENR_ITEM_KEY;
+   key.offset = (u64)-1;
+
+   ret = btrfs_search_slot(trans, dedupe_root, &key, path,
+   ins_len, cow);
+
+   if (ret < 0)
+   return ret;
+   /*
+* Although it's almost impossible, it's still possible that
+* the last 64bits are all 1.
+*/
+   if (ret == 0)
+   return 1;
+
+   ret = btrfs_previous_item(dedupe_root, path, bytenr,
+ BTRFS_DEDUPE_BYTENR_ITEM_KEY);
+   if (ret < 0)
+   return ret;
+   if (ret > 0)
+   return 0;
+   return 1;
+}
+
+static int ondisk_del(struct btrfs_trans_handle *trans,
+ struct btrfs_dedupe_info *dedupe_info, u64 bytenr)
+{
+   struct btrfs_root *dedupe_root = dedupe_info->dedupe_root;
+   struct btrfs_path *path;
+   struct btrfs_key key;
+   int ret;
+
+   path = btrfs_alloc_path();
+   if (!path)
+   return -ENOMEM;
+
+   key.objectid = bytenr;
+   key.type = BTRFS_DEDUPE_BYTENR_ITEM_KEY;
+   key.offset = 0;
+
+   mutex_lock(&dedupe_info->lock);
+
+   ret = ondisk_search_bytenr(trans, dedupe_info, path, bytenr, 1);
+   if (ret <= 0)
+   goto out;
+
+   btrfs_item_key_to_cpu(path->nodes[0], &key, path->slots[0]);
+   ret = btrfs_del_item(trans, dedupe_root, path);
+   btrfs_release_path(path);
+   if (ret < 0)
+   goto out;
+   /* Search for hash item and delete it */
+   key.objectid = key.offset;
+   key.type = BTRFS_DEDUPE_HASH_ITEM_KEY;
+   key.offset = bytenr;
+
+   ret = btrfs_search_slot(trans, dedupe_root, &key, path, -1, 1);
+   if (WARN_ON(ret > 0)) {
+   ret = -ENOENT;
+   goto out;
+   }
+   if (ret < 0)
+   goto out;
+   ret = btrfs_del_item(trans, dedupe_root, path);
+
+out:
+   btrfs_free_path(path);
+   mutex_unlock(&dedupe_info->lock);
+   return ret;
+}
+
 /* Remove a dedupe hash from dedupe tree */
 int btrfs_dedupe_del(struct btrfs_trans_handle *trans,
 struct btrfs_fs_info *fs_info, u64 bytenr)
@@ -514,6 +612,8 @@ int btrfs_dedupe_del(struct btrfs_trans_handle *trans,
 
if (dedupe_info->backend == BTRFS_DEDUPE_BACKEND_INMEMORY)
return inmem_del(dedupe_info, bytenr);
+   if (dedupe_info->backend == BTRFS_DEDUPE_BACKEND_ONDISK)
+   return ondisk_del(trans, dedupe_info, bytenr);
return -EINVAL;
 }
 
-- 
2.7.4





[PATCH v10 05/21] btrfs: delayed-ref: Add support for increasing data ref under spinlock

2016-04-01 Thread Qu Wenruo
For in-band dedupe, btrfs needs to increase data ref with delayed_refs
locked, so add a new function btrfs_add_delayed_data_ref_locked() to
increase extent ref with delayed_refs already locked.
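
The "_locked" split described above is a common pattern, sketched below
in userspace C. Everything here is hypothetical and simplified: toy_lock
is a flag standing in for the delayed_refs spinlock, and ref_table
stands in for the delayed ref structures. The point is only that a
caller who must look something up and add a ref under one lock hold
calls the _locked variant directly, while ordinary callers keep using
the wrapper.

```c
#include <assert.h>

/* Hypothetical stand-in for a spinlock: a flag we can assert on. */
struct toy_lock {
	int held;
};

static void toy_lock_acquire(struct toy_lock *l)
{
	assert(!l->held);
	l->held = 1;
}

static void toy_lock_release(struct toy_lock *l)
{
	assert(l->held);
	l->held = 0;
}

struct ref_table {
	struct toy_lock lock;
	int refs;
};

/* Core insert: caller must already hold t->lock. */
static void add_ref_locked(struct ref_table *t, int n)
{
	assert(t->lock.held);
	t->refs += n;
}

/* Ordinary entry point: takes the lock, then reuses the core. */
static void add_ref(struct ref_table *t, int n)
{
	toy_lock_acquire(&t->lock);
	add_ref_locked(t, n);
	toy_lock_release(&t->lock);
}

/*
 * Dedupe-like caller: inspect state and add the ref without dropping
 * the lock in between. Returns the ref count seen before the add.
 */
static int lookup_then_add(struct ref_table *t, int n)
{
	int before;

	toy_lock_acquire(&t->lock);
	before = t->refs;
	add_ref_locked(t, n);
	toy_lock_release(&t->lock);
	return before;
}
```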

Signed-off-by: Qu Wenruo 
---
 fs/btrfs/delayed-ref.c | 30 +++---
 fs/btrfs/delayed-ref.h |  8 
 2 files changed, 31 insertions(+), 7 deletions(-)

diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c
index 430b368..07474e8 100644
--- a/fs/btrfs/delayed-ref.c
+++ b/fs/btrfs/delayed-ref.c
@@ -805,6 +805,26 @@ free_ref:
 }
 
 /*
+ * Do real delayed data ref insert.
+ * Caller must hold delayed_refs->lock and allocate memory
+ * for dref, head_ref and record.
+ */
+void btrfs_add_delayed_data_ref_locked(struct btrfs_fs_info *fs_info,
+   struct btrfs_trans_handle *trans,
+   struct btrfs_delayed_data_ref *dref,
+   struct btrfs_delayed_ref_head *head_ref,
+   struct btrfs_qgroup_extent_record *qrecord,
+   u64 bytenr, u64 num_bytes, u64 parent, u64 ref_root,
+   u64 owner, u64 offset, u64 reserved, int action)
+{
+   head_ref = add_delayed_ref_head(fs_info, trans, &head_ref->node,
+   qrecord, bytenr, num_bytes, ref_root, reserved,
+   action, 1);
+   add_delayed_data_ref(fs_info, trans, head_ref, &dref->node, bytenr,
+   num_bytes, parent, ref_root, owner, offset, action);
+}
+
+/*
  * add a delayed data ref. it's similar to btrfs_add_delayed_tree_ref.
  */
 int btrfs_add_delayed_data_ref(struct btrfs_fs_info *fs_info,
@@ -849,13 +869,9 @@ int btrfs_add_delayed_data_ref(struct btrfs_fs_info *fs_info,
 * insert both the head node and the new ref without dropping
 * the spin lock
 */
-   head_ref = add_delayed_ref_head(fs_info, trans, &head_ref->node, record,
-   bytenr, num_bytes, ref_root, reserved,
-   action, 1);
-
-   add_delayed_data_ref(fs_info, trans, head_ref, &ref->node, bytenr,
-  num_bytes, parent, ref_root, owner, offset,
-  action);
+   btrfs_add_delayed_data_ref_locked(fs_info, trans, ref, head_ref, record,
+   bytenr, num_bytes, parent, ref_root, owner, offset,
+   reserved, action);
spin_unlock(&delayed_refs->lock);
 
return 0;
diff --git a/fs/btrfs/delayed-ref.h b/fs/btrfs/delayed-ref.h
index c24b653..2765858 100644
--- a/fs/btrfs/delayed-ref.h
+++ b/fs/btrfs/delayed-ref.h
@@ -239,11 +239,19 @@ static inline void btrfs_put_delayed_ref(struct btrfs_delayed_ref_node *ref)
}
 }
 
+struct btrfs_qgroup_extent_record;
 int btrfs_add_delayed_tree_ref(struct btrfs_fs_info *fs_info,
   struct btrfs_trans_handle *trans,
   u64 bytenr, u64 num_bytes, u64 parent,
   u64 ref_root, int level, int action,
   struct btrfs_delayed_extent_op *extent_op);
+void btrfs_add_delayed_data_ref_locked(struct btrfs_fs_info *fs_info,
+   struct btrfs_trans_handle *trans,
+   struct btrfs_delayed_data_ref *dref,
+   struct btrfs_delayed_ref_head *head_ref,
+   struct btrfs_qgroup_extent_record *qrecord,
+   u64 bytenr, u64 num_bytes, u64 parent, u64 ref_root,
+   u64 owner, u64 offset, u64 reserved, int action);
 int btrfs_add_delayed_data_ref(struct btrfs_fs_info *fs_info,
   struct btrfs_trans_handle *trans,
   u64 bytenr, u64 num_bytes,
-- 
2.7.4





[PATCH v10 18/21] btrfs: dedupe: Add support for on-disk hash search

2016-04-01 Thread Qu Wenruo
The on-disk backend can now search for hashes.

Signed-off-by: Wang Xiaoguang 
Signed-off-by: Qu Wenruo 
---
 fs/btrfs/dedupe.c | 167 --
 fs/btrfs/dedupe.h |   1 +
 2 files changed, 151 insertions(+), 17 deletions(-)

diff --git a/fs/btrfs/dedupe.c b/fs/btrfs/dedupe.c
index a274c1c..00f2a01 100644
--- a/fs/btrfs/dedupe.c
+++ b/fs/btrfs/dedupe.c
@@ -652,6 +652,112 @@ int btrfs_dedupe_disable(struct btrfs_fs_info *fs_info)
 }
 
 /*
+ * Compare ondisk hash with src.
+ * Return 0 if hash matches.
+ * Return non-zero for hash mismatch
+ *
+ * Caller should ensure the slot contains a valid hash item.
+ */
+static int memcmp_ondisk_hash(const struct btrfs_key *key,
+ struct extent_buffer *node, int slot,
+ int hash_len, const u8 *src)
+{
+   u64 offset;
+   int ret;
+
+   /* Return value doesn't make sense in this case though */
+   if (WARN_ON(hash_len <= 8 || key->type != BTRFS_DEDUPE_HASH_ITEM_KEY))
+   return -EINVAL;
+
+   /* compare the hash excluding the last 64 bits */
+   offset = btrfs_item_ptr_offset(node, slot);
+   ret = memcmp_extent_buffer(node, src, offset, hash_len - 8);
+   if (ret)
+   return ret;
+   return memcmp(&key->objectid, src + hash_len - 8, 8);
+}
+
+ /*
+ * Return 0 for not found
+ * Return >0 for found and set bytenr_ret
+ * Return <0 for error
+ */
+static int ondisk_search_hash(struct btrfs_dedupe_info *dedupe_info, u8 *hash,
+ u64 *bytenr_ret, u32 *num_bytes_ret)
+{
+   struct btrfs_path *path;
+   struct btrfs_key key;
+   struct btrfs_root *dedupe_root = dedupe_info->dedupe_root;
+   u8 *buf = NULL;
+   u64 hash_key;
+   int hash_len = btrfs_dedupe_sizes[dedupe_info->hash_type];
+   int ret;
+
+   path = btrfs_alloc_path();
+   if (!path)
+   return -ENOMEM;
+
+   buf = kmalloc(hash_len, GFP_NOFS);
+   if (!buf) {
+   ret = -ENOMEM;
+   goto out;
+   }
+
+   memcpy(&hash_key, hash + hash_len - 8, 8);
+   key.objectid = hash_key;
+   key.type = BTRFS_DEDUPE_HASH_ITEM_KEY;
+   key.offset = (u64)-1;
+
+   ret = btrfs_search_slot(NULL, dedupe_root, &key, path, 0, 0);
+   if (ret < 0)
+   goto out;
+   WARN_ON(ret == 0);
+   while (1) {
+   struct extent_buffer *node;
+   struct btrfs_dedupe_hash_item *hash_item;
+   int slot;
+
+   ret = btrfs_previous_item(dedupe_root, path, hash_key,
+ BTRFS_DEDUPE_HASH_ITEM_KEY);
+   if (ret < 0)
+   break;
+   if (ret > 0) {
+   ret = 0;
+   break;
+   }
+
+   node = path->nodes[0];
+   slot = path->slots[0];
+   btrfs_item_key_to_cpu(node, &key, slot);
+
+   /*
+* Type or objectid mismatch means no previous item may
+* hit, exit searching
+*/
+   if (key.type != BTRFS_DEDUPE_HASH_ITEM_KEY ||
+   memcmp(&key, &hash_key, 8))
+   break;
+   hash_item = btrfs_item_ptr(node, slot,
+   struct btrfs_dedupe_hash_item);
+   /*
+* If the hash mismatch, it's still possible that previous item
+* has the desired hash.
+*/
+   if (memcmp_ondisk_hash(&key, node, slot, hash_len, hash))
+   continue;
+   /* Found */
+   ret = 1;
+   *bytenr_ret = key.offset;
+   *num_bytes_ret = dedupe_info->blocksize;
+   break;
+   }
+out:
+   kfree(buf);
+   btrfs_free_path(path);
+   return ret;
+}
+
+/*
  * Caller must ensure the corresponding ref head is not being run.
  */
 static struct inmem_hash *
@@ -681,9 +787,36 @@ inmem_search_hash(struct btrfs_dedupe_info *dedupe_info, u8 *hash)
return NULL;
 }
 
-static int inmem_search(struct btrfs_dedupe_info *dedupe_info,
-   struct inode *inode, u64 file_pos,
-   struct btrfs_dedupe_hash *hash)
+/* Wrapper for different backends, caller needs to hold dedupe_info->lock */
+static inline int generic_search_hash(struct btrfs_dedupe_info *dedupe_info,
+ u8 *hash, u64 *bytenr_ret,
+ u32 *num_bytes_ret)
+{
+   if (dedupe_info->backend == BTRFS_DEDUPE_BACKEND_INMEMORY) {
+   struct inmem_hash *found_hash;
+   int ret;
+
+   found_hash = inmem_search_hash(dedupe_info, hash);
+   if (found_hash) {
+   ret = 1;
+   *bytenr_ret = found_hash->bytenr;
+   

[PATCH v10 03/21] btrfs: dedupe: Introduce function to add hash into in-memory tree

2016-04-01 Thread Qu Wenruo
From: Wang Xiaoguang 

Introduce static function inmem_add() to add hash into in-memory tree.
And now we can implement the btrfs_dedupe_add() interface.
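
The add path boils down to "skip duplicates, insert at the LRU head,
evict from the tail while over the limit". The userspace sketch below
illustrates just that, under loud assumptions: a linked list replaces
the two rb-trees, entries are keyed by bytenr only, and the names
(toy_pool, toy_add, ...) are made up for this example rather than taken
from the patch.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct inmem_hash (no hash payload). */
struct toy_hash {
	struct toy_hash *prev, *next;
	unsigned long long bytenr;
};

/* lru is a sentinel: lru.next is the newest, lru.prev the oldest. */
struct toy_pool {
	struct toy_hash lru;
	unsigned long limit_nr;
	unsigned long current_nr;
};

static void toy_pool_init(struct toy_pool *pool, unsigned long limit_nr)
{
	pool->lru.prev = pool->lru.next = &pool->lru;
	pool->limit_nr = limit_nr;
	pool->current_nr = 0;
}

/* Unlink and free one entry, keeping the counter in sync. */
static void toy_del(struct toy_pool *pool, struct toy_hash *h)
{
	h->prev->next = h->next;
	h->next->prev = h->prev;
	assert(pool->current_nr > 0);
	pool->current_nr--;
	free(h);
}

static int toy_contains(const struct toy_pool *pool,
			unsigned long long bytenr)
{
	const struct toy_hash *h;

	for (h = pool->lru.next; h != &pool->lru; h = h->next)
		if (h->bytenr == bytenr)
			return 1;
	return 0;
}

/* Returns 0 on success (including "already present"), -ENOMEM on OOM. */
static int toy_add(struct toy_pool *pool, unsigned long long bytenr)
{
	struct toy_hash *h;

	if (toy_contains(pool, bytenr))
		return 0;

	h = malloc(sizeof(*h));
	if (!h)
		return -ENOMEM;
	h->bytenr = bytenr;

	/* Newest entry goes to the list head. */
	h->prev = &pool->lru;
	h->next = pool->lru.next;
	pool->lru.next->prev = h;
	pool->lru.next = h;
	pool->current_nr++;

	/* Evict the least recently used entries while over the limit. */
	while (pool->current_nr > pool->limit_nr)
		toy_del(pool, pool->lru.prev);
	return 0;
}
```

With a limit of 2, adding three distinct bytenrs leaves only the two
most recent ones in the pool.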

Signed-off-by: Qu Wenruo 
Signed-off-by: Wang Xiaoguang 
---
 fs/btrfs/dedupe.c | 151 ++
 1 file changed, 151 insertions(+)

diff --git a/fs/btrfs/dedupe.c b/fs/btrfs/dedupe.c
index 2211588..4e8455e 100644
--- a/fs/btrfs/dedupe.c
+++ b/fs/btrfs/dedupe.c
@@ -32,6 +32,14 @@ struct inmem_hash {
u8 hash[];
 };
 
+static inline struct inmem_hash *inmem_alloc_hash(u16 type)
+{
+   if (WARN_ON(type >= ARRAY_SIZE(btrfs_dedupe_sizes)))
+   return NULL;
+   return kzalloc(sizeof(struct inmem_hash) + btrfs_dedupe_sizes[type],
+   GFP_NOFS);
+}
+
 static int init_dedupe_info(struct btrfs_dedupe_info **ret_info, u16 type,
u16 backend, u64 blocksize, u64 limit)
 {
@@ -152,3 +160,146 @@ enable:
fs_info->dedupe_enabled = 1;
return ret;
 }
+
+static int inmem_insert_hash(struct rb_root *root,
+struct inmem_hash *hash, int hash_len)
+{
+   struct rb_node **p = &root->rb_node;
+   struct rb_node *parent = NULL;
+   struct inmem_hash *entry = NULL;
+
+   while (*p) {
+   parent = *p;
+   entry = rb_entry(parent, struct inmem_hash, hash_node);
+   if (memcmp(hash->hash, entry->hash, hash_len) < 0)
+   p = &(*p)->rb_left;
+   else if (memcmp(hash->hash, entry->hash, hash_len) > 0)
+   p = &(*p)->rb_right;
+   else
+   return 1;
+   }
+   rb_link_node(&hash->hash_node, parent, p);
+   rb_insert_color(&hash->hash_node, root);
+   return 0;
+}
+
+static int inmem_insert_bytenr(struct rb_root *root,
+  struct inmem_hash *hash)
+{
+   struct rb_node **p = &root->rb_node;
+   struct rb_node *parent = NULL;
+   struct inmem_hash *entry = NULL;
+
+   while (*p) {
+   parent = *p;
+   entry = rb_entry(parent, struct inmem_hash, bytenr_node);
+   if (hash->bytenr < entry->bytenr)
+   p = &(*p)->rb_left;
+   else if (hash->bytenr > entry->bytenr)
+   p = &(*p)->rb_right;
+   else
+   return 1;
+   }
+   rb_link_node(&hash->bytenr_node, parent, p);
+   rb_insert_color(&hash->bytenr_node, root);
+   return 0;
+}
+
+static void __inmem_del(struct btrfs_dedupe_info *dedupe_info,
+   struct inmem_hash *hash)
+{
+   list_del(&hash->lru_list);
+   rb_erase(&hash->hash_node, &dedupe_info->hash_root);
+   rb_erase(&hash->bytenr_node, &dedupe_info->bytenr_root);
+
+   if (!WARN_ON(dedupe_info->current_nr == 0))
+   dedupe_info->current_nr--;
+
+   kfree(hash);
+}
+
+/*
+ * Insert a hash into in-memory dedupe tree
+ * Will remove the least recently used hashes once the limit is exceeded.
+ *
+ * If the hash matches an existing one, we won't insert it, to
+ * save memory
+ */
+static int inmem_add(struct btrfs_dedupe_info *dedupe_info,
+struct btrfs_dedupe_hash *hash)
+{
+   int ret = 0;
+   u16 type = dedupe_info->hash_type;
+   struct inmem_hash *ihash;
+
+   ihash = inmem_alloc_hash(type);
+
+   if (!ihash)
+   return -ENOMEM;
+
+   /* Copy the data out */
+   ihash->bytenr = hash->bytenr;
+   ihash->num_bytes = hash->num_bytes;
+   memcpy(ihash->hash, hash->hash, btrfs_dedupe_sizes[type]);
+
+   mutex_lock(&dedupe_info->lock);
+
+   ret = inmem_insert_bytenr(&dedupe_info->bytenr_root, ihash);
+   if (ret > 0) {
+   kfree(ihash);
+   ret = 0;
+   goto out;
+   }
+
+   ret = inmem_insert_hash(&dedupe_info->hash_root, ihash,
+   btrfs_dedupe_sizes[type]);
+   if (ret > 0) {
+   /*
+* We only keep one hash in tree to save memory, so if
+* hash conflicts, free the one to insert.
+*/
+   rb_erase(&ihash->bytenr_node, &dedupe_info->bytenr_root);
+   kfree(ihash);
+   ret = 0;
+   goto out;
+   }
+
+   list_add(&ihash->lru_list, &dedupe_info->lru_list);
+   dedupe_info->current_nr++;
+
+   /* Remove the last dedupe hash if we exceed limit */
+   while (dedupe_info->current_nr > dedupe_info->limit_nr) {
+   struct inmem_hash *last;
+
+   last = list_entry(dedupe_info->lru_list.prev,
+ struct inmem_hash, lru_list);
+   __inmem_del(dedupe_info, last);
+   }
+out:
+   mutex_unlock(&dedupe_info->lock);
+   return 0;
+}
+
+int btrfs_dedupe_add(struct btrfs_trans_handle *trans,
+struct btrfs_fs_info *fs_info,
+struct btrfs_dedupe_hash 

[PATCH v10 13/21] btrfs: dedupe: add a property handler for online dedupe

2016-04-01 Thread Qu Wenruo
From: Wang Xiaoguang 

We use btrfs extended attribute "btrfs.dedupe" to record per-file online
dedupe status, so add a dedupe property handler.
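
As a rough illustration of the validate/apply pair for such a property,
here is a userspace sketch. It is an assumption-laden toy, not the
handler in this patch: the flag value and function names are invented,
and it deliberately compares the full length, because strncmp() bounded
by len alone also accepts a prefix such as "dis".

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical flag bit standing in for BTRFS_INODE_NODEDUPE. */
#define TOY_NODEDUPE	0x1u

/* Accept only the exact string "disable" (length checked explicitly). */
static int toy_dedupe_validate(const char *value, size_t len)
{
	static const char want[] = "disable";

	if (len == strlen(want) && memcmp(value, want, len) == 0)
		return 0;
	return -EINVAL;
}

/*
 * Empty value clears the "no dedupe" flag, "disable" sets it,
 * anything else is rejected.
 */
static int toy_dedupe_apply(unsigned int *flags, const char *value,
			    size_t len)
{
	if (len == 0) {
		*flags &= ~TOY_NODEDUPE;
		return 0;
	}
	if (toy_dedupe_validate(value, len) == 0) {
		*flags |= TOY_NODEDUPE;
		return 0;
	}
	return -EINVAL;
}
```

Whether prefix matching matters in practice depends on how callers pass
len; the stricter check above is just one possible design choice.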

Signed-off-by: Wang Xiaoguang 
---
 fs/btrfs/props.c | 41 +
 1 file changed, 41 insertions(+)

diff --git a/fs/btrfs/props.c b/fs/btrfs/props.c
index 3699212..a430886 100644
--- a/fs/btrfs/props.c
+++ b/fs/btrfs/props.c
@@ -42,6 +42,11 @@ static int prop_compression_apply(struct inode *inode,
  size_t len);
 static const char *prop_compression_extract(struct inode *inode);
 
+static int prop_dedupe_validate(const char *value, size_t len);
+static int prop_dedupe_apply(struct inode *inode, const char *value,
+size_t len);
+static const char *prop_dedupe_extract(struct inode *inode);
+
 static struct prop_handler prop_handlers[] = {
{
.xattr_name = XATTR_BTRFS_PREFIX "compression",
@@ -50,6 +55,13 @@ static struct prop_handler prop_handlers[] = {
.extract = prop_compression_extract,
.inheritable = 1
},
+   {
+   .xattr_name = XATTR_BTRFS_PREFIX "dedupe",
+   .validate = prop_dedupe_validate,
+   .apply = prop_dedupe_apply,
+   .extract = prop_dedupe_extract,
+   .inheritable = 1
+   },
 };
 
 void __init btrfs_props_init(void)
@@ -426,4 +438,33 @@ static const char *prop_compression_extract(struct inode *inode)
return NULL;
 }
 
+static int prop_dedupe_validate(const char *value, size_t len)
+{
+   if (!strncmp("disable", value, len))
+   return 0;
+
+   return -EINVAL;
+}
+
+static int prop_dedupe_apply(struct inode *inode, const char *value, size_t len)
+{
+   if (len == 0) {
+   BTRFS_I(inode)->flags &= ~BTRFS_INODE_NODEDUPE;
+   return 0;
+   }
+
+   if (!strncmp("disable", value, len)) {
+   BTRFS_I(inode)->flags |= BTRFS_INODE_NODEDUPE;
+   return 0;
+   }
+
+   return -EINVAL;
+}
+
+static const char *prop_dedupe_extract(struct inode *inode)
+{
+   if (BTRFS_I(inode)->flags & BTRFS_INODE_NODEDUPE)
+   return "disable";
 
+   return NULL;
+}
-- 
2.7.4





[PATCH v10 01/21] btrfs: dedupe: Introduce dedupe framework and its header

2016-04-01 Thread Qu Wenruo
From: Wang Xiaoguang 

Introduce the header for the btrfs online (write time) de-duplication
framework.

The new de-duplication framework is going to support 2 different dedupe
backends (in-memory and on-disk) and 1 dedupe hash (SHA256).

Signed-off-by: Qu Wenruo 
Signed-off-by: Wang Xiaoguang 
---
 fs/btrfs/ctree.h   |   5 ++
 fs/btrfs/dedupe.h  | 134 +
 fs/btrfs/disk-io.c |   1 +
 3 files changed, 140 insertions(+)
 create mode 100644 fs/btrfs/dedupe.h

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 84a6a5b..022ab61 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1860,6 +1860,11 @@ struct btrfs_fs_info {
struct list_head pinned_chunks;
 
int creating_free_space_tree;
+
+   /* Inband de-duplication related structures*/
+   unsigned int dedupe_enabled:1;
+   struct btrfs_dedupe_info *dedupe_info;
+   struct mutex dedupe_ioctl_lock;
 };
 
 struct btrfs_subvolume_writers {
diff --git a/fs/btrfs/dedupe.h b/fs/btrfs/dedupe.h
new file mode 100644
index 000..40f4808
--- /dev/null
+++ b/fs/btrfs/dedupe.h
@@ -0,0 +1,134 @@
+/*
+ * Copyright (C) 2015 Fujitsu.  All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public
+ * License v2 as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public
+ * License along with this program; if not, write to the
+ * Free Software Foundation, Inc., 59 Temple Place - Suite 330,
+ * Boston, MA 021110-1307, USA.
+ */
+
+#ifndef __BTRFS_DEDUPE__
+#define __BTRFS_DEDUPE__
+
+#include 
+#include 
+#include 
+
+/*
+ * Dedupe storage backend
+ * On-disk is persistent storage but its overhead is large
+ * In-memory is fast but will lose all its hashes on umount
+ */
+#define BTRFS_DEDUPE_BACKEND_INMEMORY  0
+#define BTRFS_DEDUPE_BACKEND_ONDISK1
+
+/* Only support inmemory yet, so count is still only 1 */
+#define BTRFS_DEDUPE_BACKEND_COUNT 1
+
+/* Dedup block size limit and default value */
+#define BTRFS_DEDUPE_BLOCKSIZE_MAX (8 * 1024 * 1024)
+#define BTRFS_DEDUPE_BLOCKSIZE_MIN (16 * 1024)
+#define BTRFS_DEDUPE_BLOCKSIZE_DEFAULT (128 * 1024)
+
+/* Hash algorithm, only support SHA256 yet */
+#define BTRFS_DEDUPE_HASH_SHA256   0
+
+static int btrfs_dedupe_sizes[] = { 32 };
+
+/*
+ * For caller outside of dedup.c
+ *
+ * Different dedupe backends should have their own hash structure
+ */
+struct btrfs_dedupe_hash {
+   u64 bytenr;
+   u32 num_bytes;
+
+   /* last field is a variable length array of dedupe hash */
+   u8 hash[];
+};
+
+struct btrfs_dedupe_info {
+   /* dedupe blocksize */
+   u64 blocksize;
+   u16 backend;
+   u16 hash_type;
+
+   struct crypto_shash *dedupe_driver;
+   struct mutex lock;
+
+   /* following members are only used in in-memory dedupe mode */
+   struct rb_root hash_root;
+   struct rb_root bytenr_root;
+   struct list_head lru_list;
+   u64 limit_nr;
+   u64 current_nr;
+};
+
+struct btrfs_trans_handle;
+
+static inline int btrfs_dedupe_hash_hit(struct btrfs_dedupe_hash *hash)
+{
+   return (hash && hash->bytenr);
+}
+
+int btrfs_dedupe_hash_size(u16 type);
+struct btrfs_dedupe_hash *btrfs_dedupe_alloc_hash(u16 type);
+
+/*
+ * Initialize inband dedupe info
+ * Called at dedupe enable time.
+ */
+int btrfs_dedupe_enable(struct btrfs_fs_info *fs_info, u16 type, u16 backend,
+   u64 blocksize, u64 limit_nr, u64 limit_mem);
+
+/*
+ * Disable dedupe and invalidate all its dedupe data.
+ * Called at dedupe disable time.
+ */
+int btrfs_dedupe_disable(struct btrfs_fs_info *fs_info);
+
+/*
+ * Calculate hash for dedup.
+ * Caller must ensure [start, start + dedupe_bs) has valid data.
+ */
+int btrfs_dedupe_calc_hash(struct btrfs_fs_info *fs_info,
+  struct inode *inode, u64 start,
+  struct btrfs_dedupe_hash *hash);
+
+/*
+ * Search for duplicated extents by calculated hash
+ * Caller must call btrfs_dedupe_calc_hash() first to get the hash.
+ *
+ * @inode: the inode we are writing to
+ * @file_pos: offset inside the inode
+ * As we will increase extent ref immediately after a hash match,
+ * we need @file_pos and @inode in this case.
+ *
+ * Return > 0 for a hash match, and the extent ref will be
+ * *INCREASED*, and hash->bytenr/num_bytes will record the existing
+ * extent data.
+ * Return 0 for a hash miss. Nothing is done
+ */
+int btrfs_dedupe_search(struct btrfs_fs_info *fs_info,
+   struct inode *inode, u64 file_pos,

[PATCH v10 21/21] btrfs: dedupe: Preparation for compress-dedupe co-work

2016-04-01 Thread Qu Wenruo
For dedupe to work with compression, new members recording compression
algorithm and on-disk extent length are needed.

Add them for later compress-dedupe co-work.
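
To make the resulting on-disk item layout concrete, here is a userspace
sketch of how a small packed header plus a truncated hash could be
packed and matched. It is illustrative only: the toy_* names are
hypothetical, the header is kept in host byte order (the real item uses
__le32), and the packed attribute assumes GCC or Clang. The idea it
shows is that the last 8 bytes of the hash live in the key's objectid,
so the item body carries the header plus hash_len - 8 bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TOY_HASH_LEN 32	/* SHA256 digest size */

/* Hypothetical header in front of the truncated hash. */
struct toy_hash_item {
	uint32_t disk_len;	/* on-disk length of the dedupe range */
	uint8_t compression;	/* compression algorithm id */
} __attribute__((__packed__));

/* Hypothetical key: objectid = last 8 hash bytes, offset = bytenr. */
struct toy_key {
	uint64_t objectid;
	uint64_t offset;
};

/* Header plus the hash without its last 64 bits: 5 + 32 - 8 = 29. */
static size_t toy_item_size(void)
{
	return sizeof(struct toy_hash_item) + TOY_HASH_LEN - 8;
}

static void toy_pack(const uint8_t *hash, uint64_t bytenr,
		     uint32_t disk_len, uint8_t compression,
		     struct toy_key *key, uint8_t *item)
{
	struct toy_hash_item hdr = { disk_len, compression };

	memcpy(&key->objectid, hash + TOY_HASH_LEN - 8, 8);
	key->offset = bytenr;
	memcpy(item, &hdr, sizeof(hdr));
	memcpy(item + sizeof(hdr), hash, TOY_HASH_LEN - 8);
}

/* Returns 1 if key + item together encode exactly this hash. */
static int toy_match(const struct toy_key *key, const uint8_t *item,
		     const uint8_t *hash)
{
	if (memcmp(item + sizeof(struct toy_hash_item), hash,
		   TOY_HASH_LEN - 8) != 0)
		return 0;
	return memcmp(&key->objectid, hash + TOY_HASH_LEN - 8, 8) == 0;
}
```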

Signed-off-by: Qu Wenruo 
---
 fs/btrfs/ctree.h| 22 +-
 fs/btrfs/dedupe.c   | 78 -
 fs/btrfs/dedupe.h   |  2 ++
 fs/btrfs/inode.c|  2 ++
 fs/btrfs/ordered-data.c |  2 ++
 5 files changed, 85 insertions(+), 21 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 659790c..fdbe66b 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -982,8 +982,22 @@ struct btrfs_dedupe_status_item {
  * Offset: Bytenr of the hash
  *
  * Used for hash <-> bytenr search
- * Hash exclude the last 64 bit follows
  */
+struct btrfs_dedupe_hash_item {
+   /*
+* length of dedupe range on disk
+* For in-memory length, it's always
+* dedupe_info->block_size
+*/
+   __le32 disk_len;
+
+   u8 compression;
+
+   /*
+* Hash follows, exclude the last 64bit,
+* as it's already in key.objectid.
+*/
+} __attribute__ ((__packed__));
 
 /*
  * Objectid: bytenr
@@ -3316,6 +3330,12 @@ BTRFS_SETGET_FUNCS(dedupe_status_hash_type, struct btrfs_dedupe_status_item,
 BTRFS_SETGET_FUNCS(dedupe_status_backend, struct btrfs_dedupe_status_item,
   backend, 16);
 
+/* btrfs_dedupe_hash_item */
+BTRFS_SETGET_FUNCS(dedupe_hash_disk_len, struct btrfs_dedupe_hash_item,
+  disk_len, 32);
+BTRFS_SETGET_FUNCS(dedupe_hash_compression, struct btrfs_dedupe_hash_item,
+  compression, 8);
+
 /* struct btrfs_file_extent_item */
 BTRFS_SETGET_FUNCS(file_extent_type, struct btrfs_file_extent_item, type, 8);
 BTRFS_SETGET_STACK_FUNCS(stack_file_extent_disk_bytenr,
diff --git a/fs/btrfs/dedupe.c b/fs/btrfs/dedupe.c
index 1f0178e..e91420d 100644
--- a/fs/btrfs/dedupe.c
+++ b/fs/btrfs/dedupe.c
@@ -31,6 +31,8 @@ struct inmem_hash {
 
u64 bytenr;
u32 num_bytes;
+   u32 disk_num_bytes;
+   u8 compression;
 
u8 hash[];
 };
@@ -397,6 +399,8 @@ static int inmem_add(struct btrfs_dedupe_info *dedupe_info,
/* Copy the data out */
ihash->bytenr = hash->bytenr;
ihash->num_bytes = hash->num_bytes;
+   ihash->disk_num_bytes = hash->disk_num_bytes;
+   ihash->compression = hash->compression;
memcpy(ihash->hash, hash->hash, btrfs_dedupe_sizes[type]);
 
mutex_lock(&dedupe_info->lock);
@@ -442,7 +446,8 @@ static int ondisk_search_bytenr(struct btrfs_trans_handle *trans,
struct btrfs_path *path, u64 bytenr,
int prepare_del);
 static int ondisk_search_hash(struct btrfs_dedupe_info *dedupe_info, u8 *hash,
- u64 *bytenr_ret, u32 *num_bytes_ret);
+ u64 *bytenr_ret, u32 *num_bytes_ret,
+ u32 *disk_num_bytes_ret, u8 *compression);
 static int ondisk_add(struct btrfs_trans_handle *trans,
  struct btrfs_dedupe_info *dedupe_info,
  struct btrfs_dedupe_hash *hash)
@@ -450,7 +455,7 @@ static int ondisk_add(struct btrfs_trans_handle *trans,
struct btrfs_path *path;
struct btrfs_root *dedupe_root = dedupe_info->dedupe_root;
struct btrfs_key key;
-   u64 hash_offset;
+   struct btrfs_dedupe_hash_item *hash_item;
u64 bytenr;
u32 num_bytes;
int hash_len = btrfs_dedupe_sizes[dedupe_info->hash_type];
@@ -475,7 +480,8 @@ static int ondisk_add(struct btrfs_trans_handle *trans,
}
btrfs_release_path(path);
 
-   ret = ondisk_search_hash(dedupe_info, hash->hash, &bytenr, &num_bytes);
+   ret = ondisk_search_hash(dedupe_info, hash->hash, &bytenr, &num_bytes,
+NULL, NULL);
if (ret < 0)
goto out;
/* Same hash found, don't re-add to save dedupe tree space */
@@ -491,13 +497,18 @@ static int ondisk_add(struct btrfs_trans_handle *trans,
 
/* The last 8 bit will not be included into hash */
ret = btrfs_insert_empty_item(trans, dedupe_root, path, &key,
- hash_len - 8);
+ sizeof(*hash_item) + hash_len - 8);
WARN_ON(ret == -EEXIST);
if (ret < 0)
goto out;
-   hash_offset = btrfs_item_ptr_offset(path->nodes[0], path->slots[0]);
+   hash_item = btrfs_item_ptr(path->nodes[0], path->slots[0],
+  struct btrfs_dedupe_hash_item);
+   btrfs_set_dedupe_hash_disk_len(path->nodes[0], hash_item,
+  hash->disk_num_bytes);
+   btrfs_set_dedupe_hash_compression(path->nodes[0], hash_item,
+ hash->compression);
write_extent_buffer(path->nodes[0], hash->hash,
-   hash_offset, hash_len - 8);

[PATCH v10 09/21] btrfs: dedupe: Inband in-memory only de-duplication implement

2016-04-01 Thread Qu Wenruo
Core implementation of inband de-duplication.
It reuses the async_cow_start() facility to calculate the dedupe hash,
and uses that hash to do inband de-duplication at extent level.

The work flow is as below:
1) Run delalloc range for an inode
2) Calculate hash for the delalloc range at the unit of dedupe_bs
3) For hash match(duplicated) case, just increase source extent ref
   and insert file extent.
   For hash mismatch case, go through the normal cow_file_range()
   fallback, and add hash into dedupe_tree.
   Compress for hash miss case is not supported yet.

The current implementation stores all dedupe hashes in an in-memory
rb-tree, with LRU behavior to enforce the limit.
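
The work flow can be sketched in userspace as "walk the range in
dedupe_bs units, hash each unit, count a hit if the hash is already
known, otherwise record it". The sketch below is a toy under stated
assumptions: FNV-1a (64-bit) stands in for SHA256, a fixed array stands
in for the rb-tree, and nothing is written or ref-counted. All names
are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_MAX_HASHES 64

/* FNV-1a, used here only as a cheap stand-in for SHA256. */
static uint64_t toy_fnv1a(const unsigned char *p, size_t n)
{
	uint64_t h = 14695981039346656037ULL;
	size_t i;

	for (i = 0; i < n; i++) {
		h ^= p[i];
		h *= 1099511628211ULL;
	}
	return h;
}

/* Hashes seen so far (stand-in for the in-memory dedupe tree). */
struct toy_seen {
	uint64_t hashes[TOY_MAX_HASHES];
	size_t nr;
};

struct toy_result {
	size_t hits;	/* blocks that would reuse an existing extent */
	size_t misses;	/* blocks that would go through normal COW */
};

/* Process [data, data + len) in units of bs; a partial tail is skipped. */
static struct toy_result toy_dedupe_range(struct toy_seen *seen,
					  const unsigned char *data,
					  size_t len, size_t bs)
{
	struct toy_result r = { 0, 0 };
	size_t off, i;

	for (off = 0; off + bs <= len; off += bs) {
		uint64_t h = toy_fnv1a(data + off, bs);

		for (i = 0; i < seen->nr; i++)
			if (seen->hashes[i] == h)
				break;

		if (i < seen->nr) {
			r.hits++;		/* duplicate: only a ref is added */
		} else {
			r.misses++;		/* new data: remember its hash */
			if (seen->nr < TOY_MAX_HASHES)
				seen->hashes[seen->nr++] = h;
		}
	}
	return r;
}
```

A real implementation also has to confirm that the matched extent is
still alive before taking a reference; that is what the delayed ref
handling elsewhere in this series is about.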

Signed-off-by: Wang Xiaoguang 
Signed-off-by: Qu Wenruo 
---
 fs/btrfs/extent-tree.c |  18 
 fs/btrfs/inode.c   | 235 ++---
 fs/btrfs/relocation.c  |  16 
 3 files changed, 236 insertions(+), 33 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 53e1297..dabd721 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -37,6 +37,7 @@
 #include "math.h"
 #include "sysfs.h"
 #include "qgroup.h"
+#include "dedupe.h"
 
 #undef SCRAMBLE_DELAYED_REFS
 
@@ -2399,6 +2400,8 @@ static int run_one_delayed_ref(struct btrfs_trans_handle *trans,
 
if (btrfs_delayed_ref_is_head(node)) {
struct btrfs_delayed_ref_head *head;
+   struct btrfs_fs_info *fs_info = root->fs_info;
+
/*
 * we've hit the end of the chain and we were supposed
 * to insert this extent into the tree.  But, it got
@@ -2413,6 +2416,15 @@ static int run_one_delayed_ref(struct btrfs_trans_handle *trans,
btrfs_pin_extent(root, node->bytenr,
 node->num_bytes, 1);
if (head->is_data) {
+   /*
+* If insert_reserved is given, it means
+* a new extent is reserved, then deleted
+* in one transaction, and inc/dec get merged to 0.
+*
+* In this case, we need to remove its dedupe
+* hash.
+*/
+   btrfs_dedupe_del(trans, fs_info, node->bytenr);
ret = btrfs_del_csums(trans, root,
  node->bytenr,
  node->num_bytes);
@@ -6713,6 +6725,12 @@ static int __btrfs_free_extent(struct btrfs_trans_handle *trans,
btrfs_release_path(path);
 
if (is_data) {
+   ret = btrfs_dedupe_del(trans, info, bytenr);
+   if (ret < 0) {
+   btrfs_abort_transaction(trans, extent_root,
+   ret);
+   goto out;
+   }
ret = btrfs_del_csums(trans, root, bytenr, num_bytes);
if (ret) {
btrfs_abort_transaction(trans, extent_root, ret);
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 41a5688..96790d0 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -60,6 +60,7 @@
 #include "hash.h"
 #include "props.h"
 #include "qgroup.h"
+#include "dedupe.h"
 
 struct btrfs_iget_args {
struct btrfs_key *location;
@@ -106,7 +107,8 @@ static int btrfs_finish_ordered_io(struct btrfs_ordered_extent *ordered_extent);
 static noinline int cow_file_range(struct inode *inode,
   struct page *locked_page,
   u64 start, u64 end, int *page_started,
-  unsigned long *nr_written, int unlock);
+  unsigned long *nr_written, int unlock,
+  struct btrfs_dedupe_hash *hash);
 static struct extent_map *create_pinned_em(struct inode *inode, u64 start,
   u64 len, u64 orig_start,
   u64 block_start, u64 block_len,
@@ -335,6 +337,7 @@ struct async_extent {
struct page **pages;
unsigned long nr_pages;
int compress_type;
+   struct btrfs_dedupe_hash *hash;
struct list_head list;
 };
 
@@ -353,7 +356,8 @@ static noinline int add_async_extent(struct async_cow *cow,
 u64 compressed_size,
 struct page **pages,
 unsigned long nr_pages,
-int compress_type)
+int compress_type,
+struct 

[PATCH v10 00/21] Btrfs dedupe framework

2016-04-01 Thread Qu Wenruo
This patchset can be fetched from github:
https://github.com/adam900710/linux.git wang_dedupe_20160401

In this patchset, we're proud to bring a completely new storage backend:
Khala backend.

With the Khala backend, all dedupe hashes will be stored in the Khala,
shared with every Kalai protoss, with unlimited storage and almost zero
search latency.
A perfect backend for any Kalai protoss. "My life for Aiur!"

Unfortunately, such a backend is not available to humans.


OK, apart from the super-fancy, date-related backend, this is still a
serious patchset.
In it, we mostly address the on-disk format change comments from
Chris:
1) Reduced dedupe hash item and bytenr item.
   The dedupe hash item structure size is now reduced from 41 bytes
   (9 bytes hash_item + 32 bytes hash)
   to 29 bytes (5 bytes hash_item + 24 bytes hash).
   Without the last patch, it's even smaller, at only 24 bytes
   (24 bytes hash only).
   And the dedupe bytenr item structure size is reduced from 32 bytes
   (full hash) to 0.

2) Hide dedupe ioctls behind CONFIG_BTRFS_DEBUG
   As advised by David, this makes btrfs dedupe an experimental feature
   for advanced users.
   It allows this patchset to be merged while still allowing us to
   change the ioctl in the future.

3) Add back missing bug fix patches
   I missed 2 bug fix patches in the previous iteration.
   They are added back here.
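
The size figures in item 1) can be checked with a small C sketch. Only
the byte counts come from the text above; the struct and field names
are hypothetical. A 4-byte length plus a 1-byte compression type is one
layout that would give a 5-byte header, but that split is an
assumption, and the old 9-byte header is kept opaque because its layout
is not shown here. The savings helper further assumes one hash item and
one bytenr item per stored hash. The packed attribute is a GCC/Clang
extension.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical layouts; only the byte counts are taken from the text. */
struct toy_hash_item_old {
	uint8_t header[9];    /* old hash_item header, layout not shown */
	uint8_t hash[32];     /* full hash */
} __attribute__((packed));

struct toy_hash_item_new {
	uint32_t disk_len;    /* assumed 4-byte field */
	uint8_t compression;  /* assumed 1-byte field */
	uint8_t hash[24];     /* shortened hash */
} __attribute__((packed));

#define TOY_BYTENR_ITEM_OLD 32u  /* bytenr item used to carry a full hash */
#define TOY_BYTENR_ITEM_NEW 0u   /* bytenr item now carries no payload */

static_assert(sizeof(struct toy_hash_item_old) == 41, "9 + 32 bytes");
static_assert(sizeof(struct toy_hash_item_new) == 29, "5 + 24 bytes");

/* Item payload bytes saved for nr_hashes stored hashes (toy estimate). */
static uint64_t toy_bytes_saved(uint64_t nr_hashes)
{
	uint64_t per_hash =
		(sizeof(struct toy_hash_item_old) -
		 sizeof(struct toy_hash_item_new)) +
		(TOY_BYTENR_ITEM_OLD - TOY_BYTENR_ITEM_NEW);

	return nr_hashes * per_hash;
}
```

Under these assumptions each stored hash takes 44 fewer bytes of item
payload: 12 from the hash item and 32 from the bytenr item.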

Patches 1~11 provide the full backward-compatible in-memory backend.
Patches 12~14 provide the per-file dedupe flag feature.
Patches 15~20 provide the on-disk dedupe backend, with persistent
dedupe state for the in-memory backend.
The last patch is just preparation for possible dedupe-compress co-work.


Changelog:
v2:
  Totally reworked to handle multiple backends
v3:
  Fix a stupid but deadly on-disk backend bug
  Add handling for the multiple-hashes-on-same-bytenr corner case to fix
  an abort trans error
  Increase dedup rate by enhancing delayed ref handler for both backends.
  Move dedup_add() to run_delayed_ref() time, to fix abort trans error.
  Increase dedup block size upper limit to 8M.
v4:
  Add dedup prop for disabling dedup for given files/dirs.
  Merge inmem_search() and ondisk_search() into generic_search() to save
  some code
  Fix another delayed_ref related bug.
  Use the same mutex for both inmem and ondisk backends.
  Move dedup_add() back to btrfs_finish_ordered_io() to increase dedup
  rate.
v5:
  Reuse compress routine for much simpler dedup function.
  Slightly improved performance due to above modification.
  Fix race between dedup enable/disable
  Fix for false ENOSPC report
v6:
  Further enable/disable race window fix.
  Minor format change according to checkpatch.
v7:
  Fix one concurrency bug with balance.
  Slightly modify return value from -EINVAL to -EOPNOTSUPP for
  btrfs_dedup_ioctl() to allow progs to distinguish unsupported commands
  and wrong parameter.
  Rebased to integration-4.6.
v8:
  Rename 'dedup' to 'dedupe'.
  Add support to allow dedupe and compression work at the same time.
  Fix several balance related bugs. Special thanks to Satoru Takeuchi,
  who exposed most of them.
  Small dedupe hit case performance improvement.
v9:
  Re-order the patchset to completely separate pure in-memory and any
  on-disk format change.
  Fold bug fixes into its original patch.
v10:
  Adding back missing bug fix patch.
  Reduce on-disk item size.
  Hide dedupe ioctl under CONFIG_BTRFS_DEBUG.

Qu Wenruo (9):
  btrfs: delayed-ref: Add support for increasing data ref under spinlock
  btrfs: dedupe: Inband in-memory only de-duplication implement
  btrfs: relocation: Enhance error handling to avoid BUG_ON
  btrfs: dedupe: Add basic tree structure for on-disk dedupe method
  btrfs: dedupe: Introduce interfaces to resume and cleanup dedupe info
  btrfs: dedupe: Add support for on-disk hash search
  btrfs: dedupe: Add support to delete hash for on-disk backend
  btrfs: dedupe: Add support for adding hash for on-disk backend
  btrfs: dedupe: Preparation for compress-dedupe co-work

Wang Xiaoguang (12):
  btrfs: dedupe: Introduce dedupe framework and its header
  btrfs: dedupe: Introduce function to initialize dedupe info
  btrfs: dedupe: Introduce function to add hash into in-memory tree
  btrfs: dedupe: Introduce function to remove hash from in-memory tree
  btrfs: dedupe: Introduce function to search for an existing hash
  btrfs: dedupe: Implement btrfs_dedupe_calc_hash interface
  btrfs: ordered-extent: Add support for dedupe
  btrfs: try more times to alloc metadata reserve space
  btrfs: dedupe: Add ioctl for inband dedupelication
  btrfs: dedupe: add an inode nodedupe flag
  btrfs: dedupe: add a property handler for online dedupe
  btrfs: dedupe: add per-file online dedupe control

 fs/btrfs/Makefile|2 +-
 fs/btrfs/ctree.h |   80 ++-
 fs/btrfs/dedupe.c| 1239 ++
 fs/btrfs/dedupe.h|  181 ++
 fs/btrfs/delayed-ref.c   |   30 +-
 fs/btrfs/delayed-ref.h   |