Re: [survey] sysfs layout for btrfs

2015-08-18 Thread Gabriel de Perthuis
On Sat, 15 Aug 2015 07:40:40 +0800, Anand Jain wrote:

> Hello,
> 
> as of now btrfs sysfs does not include the attributes for the volume
> manager part in its sysfs layout, so it's being developed. There are
> two candidate layouts, shown below, and I have a quick survey to know
> which will be preferred. The contenders are:
> 1. FS and VM (volume manager) attributes[1] merged sysfs layout
> 
> /sys/fs/btrfs/<fsid>/ <-- holds FS attr, VM attr will be added here.
> /sys/fs/btrfs/<fsid>/devices/ <-- btrfs_devices attr here

My vote is for the first one.
Lengthening the UI/API with /pools/ seems unnecessary, and it's
better to get attributes exposed earlier.

> 2. FS and VM attributes separated sysfs layout.
> 
>   /sys/fs/btrfs/ <--- as is, will continue to hold fs attributes.
>   /sys/fs/btrfs/pools/<fsid>/ <-- will hold VM attributes
>   /sys/fs/btrfs/pools/<fsid>/devices/ <-- btrfs_devices attr here

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: btrfs filesystem show _exact_ freaking size?

2014-11-18 Thread Gabriel de Perthuis
On 18/11/2014 11:39, Robert White wrote:
> Howdy,
> 
> How does one get the exact size (in blocks preferably, but bytes okay)
> of the filesystem inside a partition? I know how to get the partition
> size, but that's not useful when shrinking a partition...

dev_item.total_bytes in btrfs-show-super's output is what you're after.
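For instance, the value can be pulled out of the tool's output with awk; the sample below uses made-up numbers standing in for real btrfs-show-super output:

```shell
# Hypothetical excerpt of `btrfs-show-super /dev/sdX1` output; the
# dev_item.total_bytes line carries the filesystem size in bytes.
sample='dev_item.total_bytes	21474836480
dev_item.bytes_used	12884901888'
fs_bytes=$(printf '%s\n' "$sample" | awk '/dev_item\.total_bytes/ {print $2}')
echo "$fs_bytes"            # 21474836480
echo $((fs_bytes / 4096))   # in 4 KiB blocks: 5242880
```

With the real tool you would replace the here-string with `btrfs-show-super "$dev"` and use the resulting byte count as the lower bound when shrinking the partition.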



Re: Q: Why subvolumes?

2013-07-23 Thread Gabriel de Perthuis
On Tue, 23 Jul 2013 21:30:13 CEST, Hugo Mills wrote:
> On Tue, Jul 23, 2013 at 07:47:41PM +0200, Gabriel de Perthuis wrote:
>>>Now... since the snapshot's FS tree is a direct duplicate of the
>>> original FS tree (actually, it's the same tree, but they look like
>>> different things to the outside world), they share everything --
>>> including things like inode numbers. This is OK within a subvolume,
>>> because we have the semantics that subvolumes have their own distinct
>>> inode-number spaces. If we could snapshot arbitrary subsections of the
>>> FS, we'd end up having to fix up inode numbers to ensure that they
>>> were unique -- which can't really be an atomic operation (unless you
>>> want to have the FS locked while the kernel updates the inodes of the
>>> billion files you just snapshotted).
>>
>> I don't think so; I just checked some snapshots and the inos are the same.
>> Btrfs just changes the dev_id of subvolumes (somehow the vfs allows this).
>
>That's what I said. Our current implementation allows different
> subvolumes to have the same inode numbers, which is what makes it
> work. If you threw out the concept of subvolumes, or allowed snapshots
> within subvolumes, then you'd be duplicating inodes within a
> subvolume, which is one reason it doesn't work.

Sorry for misreading you.
Directory snapshots can work by giving a new device number to the snapshot.
There is no need to update inode numbers in that case.
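The point can be modeled in a few lines of shell: the VFS identifies a file by its (st_dev, st_ino) pair, so a snapshot that reports a fresh device number can keep every inode number without collisions. This is a toy model only; the dev/ino values and names are made up.

```shell
# Each key is a "dev:ino" pair -- the identity the VFS actually sees.
declare -A seen
register() { seen["$1:$2"]="$3"; }   # register <dev> <ino> <name>
register 100 7 subvol/a
register 100 8 subvol/b
# The snapshot reuses inode numbers 7 and 8 but reports dev id 101,
# so all four pairs stay distinct and nothing needs rewriting:
register 101 7 snap/a
register 101 8 snap/b
echo "${#seen[@]} distinct (dev,ino) pairs"   # 4 distinct (dev,ino) pairs
```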


Re: Q: Why subvolumes?

2013-07-23 Thread Gabriel de Perthuis
>Now... since the snapshot's FS tree is a direct duplicate of the
> original FS tree (actually, it's the same tree, but they look like
> different things to the outside world), they share everything --
> including things like inode numbers. This is OK within a subvolume,
> because we have the semantics that subvolumes have their own distinct
> inode-number spaces. If we could snapshot arbitrary subsections of the
> FS, we'd end up having to fix up inode numbers to ensure that they
> were unique -- which can't really be an atomic operation (unless you
> want to have the FS locked while the kernel updates the inodes of the
> billion files you just snapshotted).

I don't think so; I just checked some snapshots and the inos are the same.
Btrfs just changes the dev_id of subvolumes (somehow the vfs allows this).

>The other thing to talk about here is that while the FS tree is a
> tree structure, it's not a direct one-to-one map to the directory tree
> structure. In fact, it looks more like a list of inodes, in inode
> order, with some extra info for easily tracking through the list. The
> B-tree structure of the FS tree is just a fast indexing method. So
> snapshotting a directory entry within the FS tree would require
> (somehow) making an atomic copy, or CoW copy, of only the parts of the
> FS tree that fall under the directory in question -- so you'd end up
> trying to take a sequence of records in the FS tree, of arbitrary size
> (proportional roughly to the number of entries in the directory) and
> copying them to somewhere else in the same tree in such a way that you
> can automatically dereference the copies when you modify them. So,
> ultimately, it boils down to being able to do CoW operations at the
> byte level, which is going to introduce huge quantities of extra
> metadata, and it all starts looking really awkward to implement (plus
> having to deal with the long time taken to copy the directory entries
> for the thing you're snapshotting).

Btrfs already does CoW of arbitrarily-large files (extent lists);
doing the same for directories doesn't seem impossible.


Re: Manual deduplication would be useful

2013-07-23 Thread Gabriel de Perthuis
> Hello,
> 
> For over a year now, I've been experimenting with stacked filesystems
> as a way to save on resources.  A basic OS layer is shared among
> Containers, each of which stacks a layer with modifications on top of
> it.  This approach means that Containers share buffer cache and
> loaded executables.  Concrete technology choices aside, the result is
> rock-solid and the efficiency improvements are incredible, as
> documented here:
> 
> http://rickywiki.vanrein.org/doku.php?id=openvz-aufs
> 
> One problem with this setup is updating software.  In lieu of
> stacking-support in package managers, it is necessary to do this on a
> per-Container basis, meaning that each installs their own versions,
> including overwrites of the basic OS layer.  Deduplication could
> remedy this, but the generic mechanism is known from ZFS to be fairly
> inefficient.
> 
> Interestingly however, this particular use case demonstrates that a
> much simpler deduplication mechanism than normally considered could
> be useful.  It would suffice if the filesystem could check on manual
> hints, or stack-specifying hints, to see if overlaid files share the
> same file contents; when they do, deduplication could commence.  This
> saves searching through the entire filesystem for every file or block
> written.  It might also mean that the actual stacking is not needed,
> but instead a basic OS could be cloned to form a new basic install,
> and kept around for this hint processing.
> 
> I'm not sure if this should ideally be implemented inside the
> stacking approach (where it would be
> stacking-implementation-specific) or in the filesystem (for which it
> might be too far off the main purpose) but I thought it wouldn't hurt
> to start a discussion on it, given that (1) filesystems nowadays
> service multiple instances, (2) filesystems like Btrfs are based on
> COW, and (3) deduplication is a goal but the generic mechanism could
> use some efficiency improvements.
> 
> I hope having seen this approach is useful to you!

Have a look at bedup[1] (disclaimer: I wrote it).  The normal mode
does incremental scans, and there's also a subcommand for
deduplicating files that you already know are identical:
  bedup dedup-files
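A minimal sketch of the hint-driven approach described above: given candidate pairs (the same path in a base layer and a container layer), only byte-identical pairs would be handed to `bedup dedup-files`. The directory layout and file names here are invented for illustration:

```shell
# Simulate a base OS layer and one container layer.
base=$(mktemp -d); ct=$(mktemp -d)
printf 'v1.0 binary\n' > "$base/app";  cp "$base/app" "$ct/app"      # unchanged
printf 'orig\n'        > "$base/conf"; printf 'local edit\n' > "$ct/conf"  # diverged
for f in app conf; do
    # The stacking hint names the pair; cmp confirms identical contents.
    # Identical pairs would then be passed to: bedup dedup-files ...
    cmp -s "$base/$f" "$ct/$f" && echo "dedup candidate: $f"
done
rm -rf "$base" "$ct"
```

This checks only the hinted pairs, which is what avoids the whole-filesystem scan the poster is worried about.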

The implementation in master uses a clone ioctl.  Here is Mark
Fasheh's latest patch series to implement a dedup ioctl[2]; it
also comes with a command to work on listed files
(btrfs-extent-same in [3]).

[1] https://github.com/g2p/bedup
[2] http://comments.gmane.org/gmane.comp.file-systems.btrfs/26310/
[3] https://github.com/markfasheh/duperemove


Re: Lots of harddrive chatter on after booting with btrfs on root (slow boot)

2013-07-20 Thread Gabriel de Perthuis
On Sat, 20 Jul 2013 17:15:50 +0200, Jason Russell wrote:
> I've also noted that this excessive hdd chatter does not occur
> immediately after a fresh format with Arch on btrfs root.
> 
> I've made some deductions/assumptions:
> This only seems to occur with btrfs roots.
> This only happens after some number of reboots OR after the partition
> fills up a little bit.
> I'm pretty sure I've ruled out everything except for the filesystem.

In my experience (as of 3.8 or so), Btrfs performance degrades on a
filled-up filesystem, even a comparatively new one.  Various
background workers start to eat I/O, according to atop.

> I have just done two clean installs to more thoroughly compare ext4
> and btrfs roots. So far no excessive hdd chatter from btrfs.
> 
> I have also seen what I have described on two other computers
> (different hardware entirely) where there is lots of hdd chatter from
> btrfs root, and nothing from ext4.
> 
> Here are two threads:
> https://bbs.archlinux.org/viewtopic.php?pid=1117932
> https://bbs.archlinux.org/viewtopic.php?pid=1301684



Re: [PATCH 4/4] btrfs: offline dedupe

2013-07-16 Thread Gabriel de Perthuis
On Mon, 15 Jul 2013 13:55:51 -0700, Zach Brown wrote:
> I'd get rid of all this code by only copying each input argument on to
> the stack as it's needed and by getting rid of the writable output
> struct fields.  (more on this later)

> As I said, I'd get rid of the output fields.  Like the other vectored io
> syscalls, the return value can indicate the number of initial
> consecutive bytes that worked.  When no progess was made then it can
> return errors.  Userspace is left to sort out the resulting state and
> figure out the extents to retry in exactly the same way that it found
> the initial extents to attempt to dedupe in the first place.
> 
> (And imagine strace trying to print the inputs and outputs.  Poor, poor,
> strace! :))

The dedup branch that uses this syscall[1] doesn't compare files
before submitting them anymore (the kernel will do it, and ranges
may not fit in cache once I get rid of an unnecessary loop).

I don't have strong opinions on the return style, but it would be
good to have the syscall always make progress by finding at least
one good range before bailing out, and signaling which files were
involved.  With those constraints, the current struct seems like the
cleanest way to pass the data.  The early return you suggest is a
good idea if Mark agrees, but the return condition should be
something like: if one range with bytes_deduped != 0 doesn't get
bytes_deduped incremented by iteration_len in this iteration, bail
out.
That's sufficient to guarantee progress and to know which ranges
were involved.

> I hope this helps!
> 
> - z

Thank you and everyone involved for the progress on this.

[1] https://github.com/g2p/bedup/tree/wip/dedup-syscall




[XFSTESTS PATCH] btrfs: Test deduplication

2013-06-26 Thread Gabriel de Perthuis
---
The matching kernel patch is here:
https://github.com/g2p/linux/tree/v3.10%2Bextent-same (rebased on 3.10, fixing 
a small conflict)
Requires the btrfs-extent-same command:

- http://permalink.gmane.org/gmane.comp.file-systems.btrfs/26579
- https://github.com/markfasheh/duperemove


 tests/btrfs/313 | 93 +
 tests/btrfs/313.out | 25 ++
 tests/btrfs/group   |  1 +
 3 files changed, 119 insertions(+)
 create mode 100755 tests/btrfs/313
 create mode 100644 tests/btrfs/313.out

diff --git a/tests/btrfs/313 b/tests/btrfs/313
new file mode 100755
index 000..04e4ccb
--- /dev/null
+++ b/tests/btrfs/313
@@ -0,0 +1,93 @@
+#! /bin/bash
+# FS QA Test No. 313
+#
+# Test the deduplication syscall
+#
+#-----------------------------------------------------------------------
+# Copyright (c) 2013 Red Hat, Inc.  All Rights Reserved.
+#
+# This program is free software; you can redistribute it and/or
+# modify it under the terms of the GNU General Public License as
+# published by the Free Software Foundation.
+#
+# This program is distributed in the hope that it would be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program; if not, write the Free Software Foundation,
+# Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
+#-----------------------------------------------------------------------
+#
+
+seq=`basename $0`
+seqres=$RESULT_DIR/$seq
+echo "QA output created by $seq"
+
+here=`pwd`
+tmp=/tmp/$$
+status=1   # failure is the default!
+trap "_cleanup; exit \$status" 0 1 2 3 15
+
+_cleanup()
+{
+cd /
+rm -f $tmp.*
+}
+
+. ./common/rc
+. ./common/filter
+
+ESAME=`set_prog_path btrfs-extent-same`
+
+_need_to_be_root
+_supported_fs btrfs
+_supported_os Linux
+_require_command $ESAME
+_require_command $XFS_IO_PROG
+_require_scratch
+
+_scratch_mkfs &>/dev/null
+_scratch_mount >>$seqres.full 2>&1
+
+fiemap() {
+xfs_io -r -c fiemap $1 |tail -n+2
+}
+
+dedup() {
+! diff -q <(fiemap $1) <(fiemap $2)
+$ESAME $(stat -c %s $1) $1 0 $2 0
+diff -u <(fiemap $1) <(fiemap $2)
+}
+
+echo "Silence is golden"
+set -e
+
+v1=$SCRATCH_MNT/v1
+v2=$SCRATCH_MNT/v2
+v3=$SCRATCH_MNT/v3
+
+$BTRFS_UTIL_PROG subvolume create $v1
+$BTRFS_UTIL_PROG subvolume create $v2
+
+dd bs=1M status=none if=/dev/urandom of=$v1/file1 count=1
+dd bs=1M status=none if=/dev/urandom of=$v1/file2 count=1
+dd bs=1M status=none if=$v1/file1 of=$v2/file3
+dd bs=1M status=none if=$v1/file1 of=$v2/file4
+
+$BTRFS_UTIL_PROG subvolume snapshot -r $v2 $v3
+
+# identical, multiple volumes
+dedup $v1/file1 $v2/file3
+
+# not identical, same volume
+! $ESAME $((2**20)) $v1/file1 0 $v1/file2 0
+
+# identical, second file on a frozen volume
+dedup $v1/file1 $v3/file4
+
+_scratch_unmount
+_check_scratch_fs
+status=0
+exit
diff --git a/tests/btrfs/313.out b/tests/btrfs/313.out
new file mode 100644
index 000..eabe6be
--- /dev/null
+++ b/tests/btrfs/313.out
@@ -0,0 +1,25 @@
+QA output created by 313
+Silence is golden
+Create subvolume 'sdir/v1'
+Create subvolume 'sdir/v2'
+Create a readonly snapshot of 'sdir/v2' in 'sdir/v3'
+Files /dev/fd/63 and /dev/fd/62 differ
+Deduping 2 total files
+(0, 1048576): sdir/v1/file1
+(0, 1048576): sdir/v2/file3
+1 files asked to be deduped
+i: 0, status: 0, bytes_deduped: 1048576
+1048576 total bytes deduped in this operation
+Deduping 2 total files
+(0, 1048576): sdir/v1/file1
+(0, 1048576): sdir/v1/file2
+1 files asked to be deduped
+i: 0, status: 1, bytes_deduped: 0
+0 total bytes deduped in this operation
+Files /dev/fd/63 and /dev/fd/62 differ
+Deduping 2 total files
+(0, 1048576): sdir/v1/file1
+(0, 1048576): sdir/v3/file4
+1 files asked to be deduped
+i: 0, status: 0, bytes_deduped: 1048576
+1048576 total bytes deduped in this operation
diff --git a/tests/btrfs/group b/tests/btrfs/group
index bc6c256..4c868c8 100644
--- a/tests/btrfs/group
+++ b/tests/btrfs/group
@@ -7,5 +7,6 @@
 264 auto
 265 auto
 276 auto rw metadata
 284 auto
 307 auto quick
+313 auto
-- 
1.8.3.1.588.gb04834f



[PROGS PATCH] Import btrfs-extent-same

2013-06-26 Thread Gabriel de Perthuis
Originally from
https://github.com/markfasheh/duperemove/blob/master/btrfs-extent-same.c

Signed-off-by: Gabriel de Perthuis 
---
 .gitignore  |   1 +
 Makefile|   2 +-
 btrfs-extent-same.c | 145 
 3 files changed, 147 insertions(+), 1 deletion(-)
 create mode 100644 btrfs-extent-same.c

diff --git a/.gitignore b/.gitignore
index a7e1b19..8f0ec54 100644
--- a/.gitignore
+++ b/.gitignore
@@ -5,10 +5,11 @@ version.h
 version
 man/*.gz
 btrfs
 btrfs.static
 btrfs-debug-tree
+btrfs-extent-same
 btrfs-map-logical
 btrfs-fragments
 btrfsck
 calc-size
 ioctl-test
diff --git a/Makefile b/Makefile
index da7438e..35cc502 100644
--- a/Makefile
+++ b/Makefile
@@ -43,11 +43,11 @@ endif
 
 MAKEOPTS = --no-print-directory Q=$(Q)
 
 progs = mkfs.btrfs btrfs-debug-tree btrfsck \
btrfs btrfs-map-logical btrfs-image btrfs-zero-log btrfs-convert \
-   btrfs-find-root btrfstune btrfs-show-super
+   btrfs-find-root btrfstune btrfs-show-super btrfs-extent-same
 
 # external libs required by various binaries; for btrfs-foo,
 # specify btrfs_foo_libs = ; see $($(subst...)) rules below
 btrfs_convert_libs = -lext2fs -lcom_err
 btrfs_image_libs = -lpthread
diff --git a/btrfs-extent-same.c b/btrfs-extent-same.c
new file mode 100644
index 000..acf46b7
--- /dev/null
+++ b/btrfs-extent-same.c
@@ -0,0 +1,145 @@
+/*
+ * btrfs-extent-same.c
+ *
+ * Copyright (C) 2013 SUSE.  All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public
+ * License version 2 as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#define BTRFS_IOCTL_MAGIC 0x94
+
+#define BTRFS_IOC_FILE_EXTENT_SAME _IOWR(BTRFS_IOCTL_MAGIC, 54, \
+struct btrfs_ioctl_same_args)
+
+#define BTRFS_SAME_DATA_DIFFERS 1
+/* For extent-same ioctl */
+struct btrfs_ioctl_same_extent_info {
+   int64_t fd; /* in - destination file */
+   uint64_t logical_offset;/* in - start of extent in destination */
+   uint64_t bytes_deduped; /* out - total # of bytes we
+* were able to dedupe from
+* this file */
+   /* status of this dedupe operation:
+* 0 if dedup succeeds
+* < 0 for error
+* == BTRFS_SAME_DATA_DIFFERS if data differs
+*/
+   int32_t status; /* out - see above description */
+   uint32_t reserved;
+};
+
+struct btrfs_ioctl_same_args {
+   uint64_t logical_offset;/* in - start of extent in source */
+   uint64_t length;/* in - length of extent */
+   uint16_t dest_count;/* in - total elements in info array */
+   uint16_t reserved1; /* out - number of files that got deduped */
+   uint32_t reserved2;
+   struct btrfs_ioctl_same_extent_info info[0];
+};
+
+static void usage(const char *prog)
+{
+   printf("Usage: %s len file1 loff1 file2 loff2\n", prog);
+}
+
+int main(int argc, char **argv)
+{
+   int ret, src_fd, i, numfiles;
+   char *srcf, *destf;
+   struct btrfs_ioctl_same_args *same;
+   struct btrfs_ioctl_same_extent_info *info;
+   unsigned long long bytes = 0ULL;
+
+   if (argc < 6 || (argc % 2)) {
+   usage(argv[0]);
+   return 1;
+   }
+
+   numfiles = (argc / 2) - 2;
+
+   printf("Deduping %d total files\n", numfiles + 1);
+
+   same = calloc(1,
+ sizeof(struct btrfs_ioctl_same_args) +
+ sizeof(struct btrfs_ioctl_same_extent_info) * numfiles);
+   if (!same)
+   return -ENOMEM;
+
+   srcf = argv[2];
+   same->length = atoll(argv[1]);
+   same->logical_offset = atoll(argv[3]);
+   same->dest_count = numfiles;
+
+   ret = open(srcf, O_RDONLY);
+   if (ret < 0) {
+   ret = errno;
+   fprintf(stderr, "Could not open file %s: (%d) %s\n", srcf, ret,
+   strerror(ret));
+   return -ret;
+   }
+   src_fd = ret;
+
+   printf("(%llu, %llu): %s\n", (unsigned long long)same->logical_offset,
+  (unsigned long long)same->length, srcf);
+
+   for (i = 0; i < same->dest_count; i++) {
+   destf = argv[4 + (i * 2)];
+
+   ret = open(destf, O_RDONLY);
+   if (ret < 0) {
+   ret = errno;
+

Re: Two identical copies of an image mounted result in changes to both images if only one is modified

2013-06-20 Thread Gabriel de Perthuis
>> Instead of redirecting to a different block device, Btrfs could and
>> should refuse to mount an already-mounted superblock when the block
>> device doesn't match, somewhere in or below btrfs_mount.  Registering
>> extra, distinct superblocks for an already mounted raid is a different
>> matter, but that isn't done through the mount syscall anyway.
> 
>The problem here is that you could quite legitimately mount
> /dev/sda (with UUID=AA1234) on, say, /mnt/fs-a, and /dev/sdb (with
> UUID=AA1234) on /mnt/fs-b -- _provided_ that /dev/sda and /dev/sdb are
> both part of the same filesystem. So you can't simply prevent mounting
> based on the device that the mount's being done with.

Okay.  The check should rely on a list of known block devices
for a given filesystem uuid.
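The proposed check can be sketched as a toy model in shell: before reusing an in-memory superblock for a uuid, verify the mounting block device is already on that uuid's known device list, so a copied image on a new device gets rejected instead of silently aliased. The uuid and device names are invented:

```shell
# Known member devices per filesystem uuid (what the kernel's device
# scan would have registered).
declare -A devs_of_uuid
devs_of_uuid[AA1234]="/dev/sda /dev/sdb"   # one fs spanning two devices

try_mount() {   # try_mount <uuid> <dev>
    case " ${devs_of_uuid[$1]} " in
        *" $2 "*) echo "mount $2: ok, member of $1" ;;
        *)        echo "mount $2: refused, duplicate uuid $1" ;;
    esac
}

try_mount AA1234 /dev/sdb     # legitimate second member of the same fs
try_mount AA1234 /dev/loop0   # copied image carrying a cloned uuid
```

This keeps Hugo's multi-device case working (any registered member may be named in the mount) while catching the duplicate-image case.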




Re: Two identical copies of an image mounted result in changes to both images if only one is modified

2013-06-20 Thread Gabriel de Perthuis
On Thu, 20 Jun 2013 10:16:22 +0100, Hugo Mills wrote:
> On Thu, Jun 20, 2013 at 10:47:53AM +0200, Clemens Eisserer wrote:
>> Hi,
>> 
>> I've observed a rather strange behaviour while trying to mount two
>> identical copies of the same image to different mount points.
>> Each modification to one image is also performed in the second one.

>> touch m2/hello
>> ls -la m1  //will now also include a file calles "hello"
>> 
>> Is this behaviour intentional and known or should I create a bug-report?
> 
>It's known, and not desired behaviour. The problem is that you've
> ended up with two filesystems with the same UUID, and the FS code gets
> rather confused about that. The same problem exists with LVM snapshots
> (or other block-device-layer copies).
> 
>The solution is a combination of a tool to scan an image and change
> the UUID (offline), and of some code in the kernel that detects when
> it's being told about a duplicate image (rather than an additional
> device in the same FS). Neither of these has been written yet, I'm
> afraid.

To clarify, the loop devices are properly distinct, but the first
device ends up mounted twice.

I've had a look at the vfs code, and it doesn't seem to be uuid-aware,
which makes sense because the uuid is a property of the superblock and
the fs structure doesn't expose it.  It's a Btrfs problem.

Instead of redirecting to a different block device, Btrfs could and
should refuse to mount an already-mounted superblock when the block
device doesn't match, somewhere in or below btrfs_mount.  Registering
extra, distinct superblocks for an already mounted raid is a different
matter, but that isn't done through the mount syscall anyway.

>> I've deleted quite a bunch of files on my production system because of 
>> this...
> 
>Oops. I'm sorry to hear that. :(
> 
>Hugo.




Re: [PATCH 0/4] btrfs: offline dedupe v2

2013-06-11 Thread Gabriel de Perthuis
On 11/06/2013 23:04, Mark Fasheh wrote:
> On Tue, Jun 11, 2013 at 10:56:59PM +0200, Gabriel de Perthuis wrote:
>>> What I found however is that neither of these is a great idea ;)
>>>
>>> - We want to require that the inode be open for writing so that an
>>>   unprivileged user can't do things like run dedupe on a performance
>>>   sensitive file that they might only have read access to.  In addition I
>>>   could see it as kind of a surprise (non-standard behavior) to an
>>>   administrator that users could alter the layout of files they are only
>>>   allowed to read.
>>>
>>> - Readonly snapshots won't let you open for write anyway (unsuprisingly,
>>>   open() returns -EROFS).  So that kind of kills the idea of them being able
>>>   to open those files for write which we want to dedupe.
>>>
>>> That said, I still think being able to run this against a set of readonly
>>> snapshots makes sense especially if those snapshots are taken for backup
>>> purposes. I'm just not sure how we can sanely enable it.
>>
>> The check could be: if (fmode_write || cap_sys_admin).
>>
>> This isn't incompatible with mnt_want_write, that check is at the
>> level of the superblocks and vfsmount and not the subvolume fsid.
> 
> Oh ok that's certainly better. I think we still have a problem though - how
> does a process gets write access to a file from a ro-snapshot? If I open a
> file (as root) on a ro-snapshot on my test machine here I'll get -EROFS.

Your first series did work in that case.
The process does get a read-only fd, but that's no obstacle for the ioctl.

> I'm a bit confused - how does mnt_want_write factor in here? I think that's
> for a totally seperate kind of accounting, right?

It doesn't, it's just that I had spent a few minutes checking anyway.




Re: [PATCH 0/4] btrfs: offline dedupe v2

2013-06-11 Thread Gabriel de Perthuis
On 11/06/2013 22:31, Mark Fasheh wrote:
> Perhaps this isn't a limiation per-se but extent-same requires read/write
> access to the files we want to dedupe.  During my last series I had a
> conversation with Gabriel de Perthuis about access checking where we tried
> to maintain the ability for a user to run extent-same against a readonly
> snapshot. In addition, I reasoned that since the underlying data won't
> change (at least to the user) that we ought only require the files to be
> open for read.
> 
> What I found however is that neither of these is a great idea ;)
> 
> - We want to require that the inode be open for writing so that an
>   unprivileged user can't do things like run dedupe on a performance
>   sensitive file that they might only have read access to.  In addition I
>   could see it as kind of a surprise (non-standard behavior) to an
>   administrator that users could alter the layout of files they are only
>   allowed to read.
> 
> - Readonly snapshots won't let you open for write anyway (unsuprisingly,
>   open() returns -EROFS).  So that kind of kills the idea of them being able
>   to open those files for write which we want to dedupe.
> 
> That said, I still think being able to run this against a set of readonly
> snapshots makes sense especially if those snapshots are taken for backup
> purposes. I'm just not sure how we can sanely enable it.

The check could be: if (fmode_write || cap_sys_admin).

This isn't incompatible with mnt_want_write, that check is at the
level of the superblocks and vfsmount and not the subvolume fsid.

> Code review is very much appreciated. Thanks,
>  --Mark
> 
> 
> ChangeLog
> 
> - check that we have appropriate access to each file before deduping. For
>   the source, we only check that it is opened for read. Target files have to
>   be open for write.
> 
> - don't dedupe on readonly submounts (this is to maintain 
> 
> - check that we don't dedupe files with different checksumming states
>  (compare BTRFS_INODE_NODATASUM flags)
> 
> - get and maintain write access to the mount during the extent same
>   operation (mount_want_write())
> 
> - allocate our read buffers up front in btrfs_ioctl_file_extent_same() and
>   pass them through for re-use on every call to btrfs_extent_same(). (thanks
>   to David Sterba  for reporting this
> 
> - As the read buffers could possibly be up to 1MB (depending on user
>   request), we now conditionally vmalloc them.
> 
> - removed redundant check for same inode. btrfs_extent_same() catches it now
>   and bubbles the error up.
> 
> - remove some unnecessary printks
> 
> Changes from RFC to v1:
> 
> - don't error on large length value in btrfs exent-same, instead we just
>   dedupe the maximum allowed.  That way userspace doesn't have to worry
>   about an arbitrary length limit.
> 
> - btrfs_extent_same will now loop over the dedupe range at 1MB increments (for
>   a total of 16MB per request)
> 
> - cleaned up poorly coded while loop in __extent_read_full_page() (thanks to
>   David Sterba  for reporting this)
> 
> - included two fixes from Gabriel de Perthuis :
>- allow dedupe across subvolumes
>- don't lock compressed pages twice when deduplicating
> 
> - removed some unused / poorly designed fields in btrfs_ioctl_same_args.
>   This should also give us a bit more reserved bytes.
> 
> - return -E2BIG instead of -ENOMEM when arg list is too large (thanks to
>   David Sterba  for reporting this)
> 
> - Some more reserved bytes are now included as a result of some of my
>   cleanups. Quite possibly we could add a couple more.
> 



Re: [PATCH 4/4] btrfs: offline dedupe

2013-05-24 Thread Gabriel de Perthuis
On Sat, 25 May 2013 00:38:27 CEST, Mark Fasheh wrote:
> On Fri, May 24, 2013 at 09:50:14PM +0200, Gabriel de Perthuis wrote:
>>> Sure. Actually, you got me thinking about some sanity checks... I need to
>>> add at least this check:
>>>
>>> if (btrfs_root_readonly(root))
>>> return -EROFS;
>>>
>>> which isn't in there as of now.
>>
>> It's not needed and I'd rather do without, read-only snapshots and 
>> deduplication go together well for backups.
>> Data and metadata are guaranteed to be immutable, extent storage isn't.  
>> This is also the case with raid.
>
> You're absolutely right - I miswrote the check I meant.
> Specifically, I was thinking about when the entire fs is readonly due to
> either some error or the user mounted with -oro. So something more like:
>
>   if (root->fs_info->sb->s_flags & MS_RDONLY)
>   return -EROFS;
>
> I think that should be reasonable and wouldn't affect most use cases,
> right?

That's all right.

>>> Also I don't really check the open mode (read, write, etc) on files passed
>>> in. We do this in the clone ioctl and it makes sense there since data (to
>>> the user) can change. With this ioctl though data won't ever change (even if
>>> the underlying extent does). So I left the checks out. A part of me is
>>> thinking we might want to be conservative to start with though and just add
>>> those type of checks in. Basically, I figure the source should be open for
>>> read at least and target files need write access.
>>
>> I don't know of any privileged files that one would be able to open(2),
>> but if this is available to unprivileged users the files all need to be
>> open for reading so that it can't be used to guess at their contents. As
>> long as root gets to bypass the checks (no btrfs_root_readonly) it doesn't
>> hurt my use case.
>
> Oh ok so this seems to make sense. How does this logic sound:
>
> We're not going to worry about write access since it would be entirely
> reasonable for the user to want to do this on a readonly submount
> (specifically for the purpose of deduplicating backups).
>
> Read access needs to be provided however so we know that the user has access
> to the file data.
>
> So basically, if a user can open any files for read, they can check their
> contents and dedupe them.
>
> Letting users dedupe files in say, /etc seems kind of weird to me but I'm
> struggling to come up with a good explanation of why that should mean we
> limit this ioctl to root.
>   --Mark

I agree with that model.  Most of the code is shared with clone (and the
copy_range RFC) which are unprivileged, so it doesn't increase the potential
surface for bugs much.
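
To make the agreed model concrete, here is a small userspace sketch of the policy (the names and struct are illustrative, not the merged kernel code): only a read-only *mount* refuses the ioctl, every fd must be readable, and write access is never required, so read-only snapshots stay dedupable.

```c
#include <errno.h>
#include <stdbool.h>

/* Illustrative stand-in for the state the kernel would consult. */
struct fd_model {
    bool readable;
};

static int dedupe_permitted(bool fs_mounted_ro, bool subvol_ro,
                            const struct fd_model *fds, int nfds)
{
    if (fs_mounted_ro)
        return -EROFS;          /* whole fs mounted -o ro */
    (void)subvol_ro;            /* deliberately ignored: frozen snapshots
                                 * remain eligible for dedupe */
    for (int i = 0; i < nfds; i++)
        if (!fds[i].readable)
            return -EACCES;     /* caller may not read this file's data */
    return 0;                   /* write access never required */
}
```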
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 4/4] btrfs: offline dedupe

2013-05-24 Thread Gabriel de Perthuis
>>> +#define BTRFS_MAX_DEDUPE_LEN   (16 * 1024 * 1024)
>>> +#define BTRFS_ONE_DEDUPE_LEN   (1 * 1024 * 1024)
>>> +
>>> +static long btrfs_ioctl_file_extent_same(struct file *file,
>>> +void __user *argp)
>>> +{
>>> +   struct btrfs_ioctl_same_args *args;
>>> +   struct btrfs_ioctl_same_args tmp;
>>> +   struct btrfs_ioctl_same_extent_info *info;
>>> +   struct inode *src = file->f_dentry->d_inode;
>>> +   struct file *dst_file = NULL;
>>> +   struct inode *dst;
>>> +   u64 off;
>>> +   u64 len;
>>> +   int args_size;
>>> +   int i;
>>> +   int ret;
>>> +   u64 bs = BTRFS_I(src)->root->fs_info->sb->s_blocksize;
>>
>> The ioctl is available to non-root, so an extra care should be taken to
>> potentail overflows etc. I haven't spotted anything so far.
> 
> 
> Sure. Actually, you got me thinking about some sanity checks... I need to
> add at least this check:
> 
>   if (btrfs_root_readonly(root))
>   return -EROFS;
> 
> which isn't in there as of now.

It's not needed and I'd rather do without, read-only snapshots and 
deduplication go together well for backups.
Data and metadata are guaranteed to be immutable, extent storage isn't.  This 
is also the case with raid.


> Also I don't really check the open mode (read, write, etc) on files passed
> in. We do this in the clone ioctl and it makes sense there since data (to
> the user) can change. With this ioctl though data won't ever change (even if
> the underlying extent does). So I left the checks out. A part of me is
> thinking we might want to be conservative to start with though and just add
> those type of checks in. Basically, I figure the source should be open for
> read at least and target files need write access.

I don't know of any privileged files that one would be able to open(2), but if 
this is available to unprivileged users the files all need to be open for 
reading so that it can't be used to guess at their contents.
As long as root gets to bypass the checks (no btrfs_root_readonly) it doesn't 
hurt my use case.



Re: [PATCH 0/4] a structure for the disks scan for btrfs

2013-05-17 Thread Gabriel de Perthuis
> Just scan /dev/block/*.  That contains all block devices.

Oh, this is about finding nicer names.  Never mind.




Re: [PATCH 0/4] a structure for the disks scan for btrfs

2013-05-17 Thread Gabriel de Perthuis
On Fri, 17 May 2013 18:54:38 +0800, Anand Jain wrote:
> The idea was to introduce /dev/mapper to find for btrfs disk, 
> However I found first we need to congregate the disk scan
> procedure at a function so it would help to consistently tune
> it across the btrfs-progs. As of now both fi show and
> dev scan use the disks scan they do it on their own.
> 
> So here it would congregate btrfs-disk scans at the function
> scan_devs_for_btrfs, adds /dev/mapper to be used to scan
> for btrfs, and updates its calling functions and few bug fixes.

Just scan /dev/block/*.  That contains all block devices.




Re: subvol copying

2013-05-15 Thread Gabriel de Perthuis
> A user of a workstation has a home directory /home/john as a subvolume.  I 
> wrote a cron job to make read-only snapshots of it under /home/john/backup 
> which was fortunate as they just ran a script that did something like
> "rm -rf ~".
> 
> Apart from copying dozens of gigs of data back, is there a good way of 
> recovering it all?  Whatever you suggest isn't going to work for this time 
> (the copy is almost done) but will be useful for next time.
> 
> Should I have put the backups under /backup instead so that I could just 
> delete the corrupted subvol and make a read-write snapshot of the last good 
> one?

You can move subvolumes at any time, as if they were regular directories.

For example: move the backups to an external location, move what's left
of the home to another location out of the way, and make a snapshot to
restore.




Re: [PATCH 3/4] btrfs: Introduce extent_read_full_page_nolock()

2013-05-09 Thread Gabriel de Perthuis
>> We want this for btrfs_extent_same. Basically readpage and friends do their
>> own extent locking but for the purposes of dedupe, we want to have both
>> files locked down across a set of readpage operations (so that we can
>> compare data). Introduce this variant and a flag which can be set for
>> extent_read_full_page() to indicate that we are already locked.
> 
> This one can get stuck in TASK_UNINTERRUPTIBLE:
> 
> [32129.522257] SysRq : Show Blocked State
> [32129.524337]   task                        PC stack   pid father
> [32129.526515] python  D 88021f394280 0 16281  1 0x0004
> [32129.528656]  88020e079a48 0082 88013d3cdd40 88020e079fd8
> [32129.530840]  88020e079fd8 88020e079fd8 8802138dc5f0 88013d3cdd40
> [32129.533044]   1fff 88015286f440 0008
> [32129.535285] Call Trace:
> [32129.537522]  [] schedule+0x29/0x70
> [32129.539829]  [] wait_extent_bit+0xf8/0x150 [btrfs]
> [32129.542130]  [] ? finish_wait+0x80/0x80
> [32129.544463]  [] lock_extent_bits+0x44/0xa0 [btrfs]
> [32129.546824]  [] lock_extent+0x13/0x20 [btrfs]
> [32129.549198]  [] add_ra_bio_pages.isra.8+0x17f/0x2d0 [btrfs]
> [32129.551602]  [] btrfs_submit_compressed_read+0x25c/0x4c0 [btrfs]
> [32129.554028]  [] btrfs_submit_bio_hook+0x1d1/0x1e0 [btrfs]
> [32129.556457]  [] submit_one_bio+0x67/0xa0 [btrfs]
> [32129.558899]  [] extent_read_full_page_nolock+0x4d/0x60 [btrfs]
> [32129.561290]  [] fill_data+0xb2/0x230 [btrfs]
> [32129.563623]  [] btrfs_ioctl+0x1f7e/0x2560 [btrfs]
> [32129.565924]  [] ? _raw_spin_lock+0xe/0x20
> [32129.568207]  [] ? inode_get_bytes+0x47/0x60
> [32129.570472]  [] do_vfs_ioctl+0x97/0x560
> [32129.572700]  [] ? sys_newfstat+0x2a/0x40
> [32129.574882]  [] sys_ioctl+0x91/0xb0
> [32129.577008]  [] system_call_fastpath+0x1a/0x1f

For anyone trying those patches, there's a fix here:
https://github.com/g2p/linux/tree/v3.9%2Bbtrfs-extent-same



Re: [RFC 0/5] BTRFS hot relocation support

2013-05-09 Thread Gabriel de Perthuis
On Thu, 09 May 2013 07:13:56 +0800, Zhi Yong Wu wrote:
> HI, all
> 
>I saw that bcache will be merged into kernel upstream soon, so i
> want to know if btrfs hot relocation support is still meaningful, if no,
> i will not continue to work on it. can anyone let me know this?
> thanks.

bcache performance would be poor if Btrfs-raid1 is used:
the hits will be spread across mirrors at random, so bcache will have
to cache twice as much and convergence could be slow.

Hot tracking also has much more precise info about which files are hot, 
whereas bcache can't know anything about files that are in the
page cache and risks evicting some of them.

It all comes down to benchmarks, but these two situations could be easy
wins for hot-reloc.



Re: Possible to deduplicate read-only snapshots for space-efficient backups

2013-05-07 Thread Gabriel de Perthuis
On Wed, 08 May 2013 01:04:38 +0200, Kai Krakow wrote:
> Gabriel de Perthuis wrote:
>> It sounds simple, and was sort-of prompted by the new syscall taking
>> short ranges, but it is tricky figuring out a sane heuristic (when to
>> hash, when to bail, when to submit without comparing, what should be the
>> source in the last case), and it's not something I have an immediate
>> need for.  It is also possible to use 9p (with standard cow and/or
>> small-file dedup) and trade a bit of configuration for much more
>> space-efficient VMs.
>> 
>> Finer-grained tracking of which ranges have changed, and maybe some
>> caching of range hashes, would be a good first step before doing any
>> crazy large-file heuristics.  The hash caching would actually benefit
>> all use cases.
> 
> Looking back to good old peer-2-peer days (I think we all got in touch with 
> that the one or the other way), one title pops back into my mind: tiger-
> tree-hash...
> 
> I'm not really into it, but would it be possible to use tiger-tree-hashes to 
> find identical blocks? Even across different sized files...

Possible, but bedup is all about doing as little io as it can get away
with, doing streaming reads only when it has sampled that the files are
likely duplicates and not spending a ton of disk space for indexing.

Hashing everything in the hope that there are identical blocks at
unrelated places on the disk is a much more resource-intensive approach;
Liu Bo is working on that, following ZFS's design choices.




Re: [RFC 0/5] BTRFS hot relocation support

2013-05-07 Thread Gabriel de Perthuis
On Tue, 07 May 2013 23:58:08 +0200, Kai Krakow wrote:
> Gabriel de Perthuis wrote:
>>>   On the side note: dm-cache, which is already in-kernel, do not need to
>>> reformat backing storage.
>> 
>> On the other hand dm-cache is somewhat complex to assemble, and letting
>> the system automount the unsynchronised backing device is a recipe for
>> data loss.
> 
> Yes, that was my first impression, too, after reading of how it works. How 
> safe is bcache on that matter?

The bcache superblock is there just to prevent the naked backing device
from becoming available.  So it's safe in that respect.  LVM has
something similar with hidden volumes.

>> Anyway, here's a shameless plug for a tool that converts to bcache
>> in-place:  https://github.com/g2p/blocks#bcache-conversion
> 
> Did I say: I love your shameless plugs? ;-)
> 
> I've read the docs for this tool with interest. Still I do not feel very 
> comfortable with converting my storage for some unknown outcome. Sure, I can 
> take backups (and by any means: I will). But it takes time: backup, try, 
> restore, try again, maybe restore... I don't want to find out that it was 
> all useless because it's just not ready to boot a multi-device btrfs through 
> dracut. So you see, the point is: Will that work? I didn't see any docs 
> answering my questions.

Try it with a throwaway filesystem inside a VM.  The bcache list will
appreciate the feedback on Dracut, even if you don't make the switch
for real.

> Of course, if it would work I'd happily contribute documentation to your 
> project.

That would be very welcome.




Re: Possible to deduplicate read-only snapshots for space-efficient backups

2013-05-07 Thread Gabriel de Perthuis
> Do you plan to support deduplication on a finer grained basis than file 
> level? As an example, in the end it could be interesting to deduplicate 1M 
> blocks of huge files. Backups of VM images come to my mind as a good 
> candidate. While my current backup script[1] takes care of this by using 
> "rsync --inplace" it won't consider files moved between two backup cycles. 
> This is the main purpose I'm using bedup for on my backup drive.
> 
> Maybe you could define another cutoff value to consider huge files for 
> block-level deduplication?

I'm considering deduplicating aligned blocks of large files sharing the
same size (VMs with the same baseline.  Those would ideally come
pre-cowed, but rsync or scp could have broken that).

It sounds simple, and was sort-of prompted by the new syscall taking
short ranges, but it is tricky figuring out a sane heuristic (when to
hash, when to bail, when to submit without comparing, what should be the
source in the last case), and it's not something I have an immediate
need for.  It is also possible to use 9p (with standard cow and/or
small-file dedup) and trade a bit of configuration for much more
space-efficient VMs.

Finer-grained tracking of which ranges have changed, and maybe some
caching of range hashes, would be a good first step before doing any
crazy large-file heuristics.  The hash caching would actually benefit
all use cases.
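
The aligned-block idea can be sketched like this (purely illustrative, not bedup code; the 1M granularity stands in for whatever cutoff is chosen): walk two same-sized files in aligned steps and count the blocks that compare equal, each hit being a candidate range for the dedupe ioctl.

```c
#include <stddef.h>
#include <string.h>

#define BLOCK (1024 * 1024)     /* candidate dedupe granularity */

/* Count aligned blocks that are bytewise identical between two
 * equally-sized buffers; each hit would become one submitted range.
 * A real scanner would compare hashes of samples first, to avoid
 * streaming reads of files that are unlikely duplicates. */
static size_t count_dup_blocks(const unsigned char *a,
                               const unsigned char *b, size_t len)
{
    size_t dups = 0;
    for (size_t off = 0; off + BLOCK <= len; off += BLOCK)
        if (memcmp(a + off, b + off, BLOCK) == 0)
            dups++;
    return dups;
}
```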

> Regards,
> Kai
> 
> [1]: https://gist.github.com/kakra/5520370




Re: [RFC 0/5] BTRFS hot relocation support

2013-05-07 Thread Gabriel de Perthuis
>> How will it compare to bcache? I'm currently thinking about buying an SSD 
>> but bcache requires some efforts in migrating the storage to use. And after 
>> all those hassles I am even not sure if it would work easily with a dracut 
>> generated initramfs.

>   On the side note: dm-cache, which is already in-kernel, do not need to
> reformat backing storage.

On the other hand dm-cache is somewhat complex to assemble, and letting
the system automount the unsynchronised backing device is a recipe for
data loss.

It will need lvm integration to become really convenient to use.

Anyway, here's a shameless plug for a tool that converts to bcache
in-place:  https://github.com/g2p/blocks#bcache-conversion



Re: [PATCH 3/4] btrfs: Introduce extent_read_full_page_nolock()

2013-05-07 Thread Gabriel de Perthuis
> We want this for btrfs_extent_same. Basically readpage and friends do their
> own extent locking but for the purposes of dedupe, we want to have both
> files locked down across a set of readpage operations (so that we can
> compare data). Introduce this variant and a flag which can be set for
> extent_read_full_page() to indicate that we are already locked.

This one can get stuck in TASK_UNINTERRUPTIBLE:

[32129.522257] SysRq : Show Blocked State
[32129.524337]   task                        PC stack   pid father
[32129.526515] python  D 88021f394280 0 16281  1 0x0004
[32129.528656]  88020e079a48 0082 88013d3cdd40 88020e079fd8
[32129.530840]  88020e079fd8 88020e079fd8 8802138dc5f0 88013d3cdd40
[32129.533044]   1fff 88015286f440 0008
[32129.535285] Call Trace:
[32129.537522]  [] schedule+0x29/0x70
[32129.539829]  [] wait_extent_bit+0xf8/0x150 [btrfs]
[32129.542130]  [] ? finish_wait+0x80/0x80
[32129.544463]  [] lock_extent_bits+0x44/0xa0 [btrfs]
[32129.546824]  [] lock_extent+0x13/0x20 [btrfs]
[32129.549198]  [] add_ra_bio_pages.isra.8+0x17f/0x2d0 [btrfs]
[32129.551602]  [] btrfs_submit_compressed_read+0x25c/0x4c0 [btrfs]
[32129.554028]  [] btrfs_submit_bio_hook+0x1d1/0x1e0 [btrfs]
[32129.556457]  [] submit_one_bio+0x67/0xa0 [btrfs]
[32129.558899]  [] extent_read_full_page_nolock+0x4d/0x60 [btrfs]
[32129.561290]  [] fill_data+0xb2/0x230 [btrfs]
[32129.563623]  [] btrfs_ioctl+0x1f7e/0x2560 [btrfs]
[32129.565924]  [] ? _raw_spin_lock+0xe/0x20
[32129.568207]  [] ? inode_get_bytes+0x47/0x60
[32129.570472]  [] do_vfs_ioctl+0x97/0x560
[32129.572700]  [] ? sys_newfstat+0x2a/0x40
[32129.574882]  [] sys_ioctl+0x91/0xb0
[32129.577008]  [] system_call_fastpath+0x1a/0x1f

Side note, I wish btrfs used TASK_KILLABLE[1] instead.

[1]: https://lwn.net/Articles/288056/



[PATCH] btrfs: don't stop searching after encountering the wrong item

2013-05-06 Thread Gabriel de Perthuis
The search ioctl skips items that are too large for a result buffer, but
inline items of a certain size occurring before any search result is
found would trigger an overflow and stop the search entirely.

Cc: sta...@vger.kernel.org
Bug: https://bugzilla.kernel.org/show_bug.cgi?id=57641

Signed-off-by: Gabriel de Perthuis 
---
(resent, with the correct header to have stable copied)

 fs/btrfs/ioctl.c | 10 +-
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 2c02310..f49b62f 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -1794,23 +1794,23 @@ static noinline int copy_to_sk(struct btrfs_root *root,
 
for (i = slot; i < nritems; i++) {
item_off = btrfs_item_ptr_offset(leaf, i);
item_len = btrfs_item_size_nr(leaf, i);
 
-   if (item_len > BTRFS_SEARCH_ARGS_BUFSIZE)
+   btrfs_item_key_to_cpu(leaf, key, i);
+   if (!key_in_sk(key, sk))
+   continue;
+
+   if (sizeof(sh) + item_len > BTRFS_SEARCH_ARGS_BUFSIZE)
item_len = 0;
 
if (sizeof(sh) + item_len + *sk_offset >
BTRFS_SEARCH_ARGS_BUFSIZE) {
ret = 1;
goto overflow;
}
 
-   btrfs_item_key_to_cpu(leaf, key, i);
-   if (!key_in_sk(key, sk))
-   continue;
-
sh.objectid = key->objectid;
sh.offset = key->offset;
sh.type = key->type;
sh.len = item_len;
sh.transid = found_transid;
-- 
1.8.2.1.419.ga0b97c6
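
The effect of reordering the checks can be modelled in userspace (BUFSIZE and HDR are illustrative stand-ins for BTRFS_SEARCH_ARGS_BUFSIZE and the search header size): before the patch, one large item that does not even match the search key hits the overflow path and aborts the copy loop; after it, the key filter runs first and only a matching item's body is dropped.

```c
#define BUFSIZE 4096    /* stand-in for BTRFS_SEARCH_ARGS_BUFSIZE */
#define HDR     48      /* stand-in for sizeof(struct btrfs_ioctl_search_header) */

/* Model of copy_to_sk before the patch: the size checks run before
 * the key filter, so a large non-matching item triggers the overflow
 * exit and ends the search with nothing copied. */
static int copy_old(const int *item_len, const int *key_matches, int n)
{
    int copied = 0, used = 0;
    for (int i = 0; i < n; i++) {
        int len = item_len[i];
        if (len > BUFSIZE)
            len = 0;
        if (HDR + len + used > BUFSIZE)
            return copied;          /* overflow: search stops here */
        if (!key_matches[i])
            continue;
        used += HDR + len;
        copied++;
    }
    return copied;
}

/* Model with the patch applied: filter on the key first, and zero
 * only the body of an item that alone would overflow the buffer. */
static int copy_new(const int *item_len, const int *key_matches, int n)
{
    int copied = 0, used = 0;
    for (int i = 0; i < n; i++) {
        int len = item_len[i];
        if (!key_matches[i])
            continue;
        if (HDR + len > BUFSIZE)
            len = 0;
        if (HDR + len + used > BUFSIZE)
            return copied;
        used += HDR + len;
        copied++;
    }
    return copied;
}
```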



[PATCH] btrfs: don't stop searching after encountering the wrong item

2013-05-06 Thread Gabriel de Perthuis
The search ioctl skips items that are too large for a result buffer, but
inline items of a certain size occurring before any search result is
found would trigger an overflow and stop the search entirely.

Bug: https://bugzilla.kernel.org/show_bug.cgi?id=57641

Signed-off-by: Gabriel de Perthuis 
---
 fs/btrfs/ioctl.c | 10 +-
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 95d46cc..b3f0276 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -1797,23 +1797,23 @@ static noinline int copy_to_sk(struct btrfs_root *root,
 
for (i = slot; i < nritems; i++) {
item_off = btrfs_item_ptr_offset(leaf, i);
item_len = btrfs_item_size_nr(leaf, i);
 
-   if (item_len > BTRFS_SEARCH_ARGS_BUFSIZE)
+   btrfs_item_key_to_cpu(leaf, key, i);
+   if (!key_in_sk(key, sk))
+   continue;
+
+   if (sizeof(sh) + item_len > BTRFS_SEARCH_ARGS_BUFSIZE)
item_len = 0;
 
if (sizeof(sh) + item_len + *sk_offset >
BTRFS_SEARCH_ARGS_BUFSIZE) {
ret = 1;
goto overflow;
}
 
-   btrfs_item_key_to_cpu(leaf, key, i);
-   if (!key_in_sk(key, sk))
-   continue;
-
sh.objectid = key->objectid;
sh.offset = key->offset;
sh.type = key->type;
sh.len = item_len;
sh.transid = found_transid;
-- 
1.8.2.1.419.ga0b97c6



Re: Possible to deduplicate read-only snapshots for space-efficient backups

2013-05-05 Thread Gabriel de Perthuis
On Sun, 05 May 2013 12:07:17 +0200, Kai Krakow wrote:
> Hey list,
> 
> I wonder if it is possible to deduplicate read-only snapshots.
> 
> Background:
> 
> I'm using a bash/rsync script[1] to backup my whole system on a nightly 
> basis to an attached USB3 drive into a scratch area, then take a snapshot of 
> this area. I'd like to have these snapshots immutable, so they should be 
> read-only.
> 
> Since rsync won't discover moved files but instead place a new copy of that 
> in the backup, I'm running the wonderful bedup application[2] to deduplicate 
> my backup drive from time to time and it almost always gains back a good 
> pile of gigabytes. The rest of storage space issues is taken care of by 
> using rsync's inplace option (although this won't cover the case of files 
> moved and changed between backup runs) and using compress-force=gzip.

> I've read about ongoing work to integrate offline (and even online) 
> deduplication into the kernel so that this process can be made atomic (and 
> even block-based instead of file-based). This would - to my understandings - 
> result in the immutable attribute no longer needed. So, given the fact above 
> and for the case read-only snapshots cannot be used for this application 
> currently, will these patches address the problem and read-only snapshots 
> could be deduplicated? Or are read-only snapshots meant to be what the name 
> suggests: Immutable, even for deduplication?

There's no deep reason read-only snapshots should keep their storage
immutable, they can be affected by raid rebalancing for example.

The current bedup restriction comes from the clone call; Mark Fasheh's
dedup ioctl[3] appears to be fine with snapshots.  The bedup integration
(in a branch) is a work in progress at the moment.  I need to fix a scan
bug, tweak parameters for the latest kernel dedup patch, remove a lot of
logic that is now unnecessary, and figure out the compatibility story.

> Regards,
> Kai
> 
> [1]: https://gist.github.com/kakra/5520370
> [2]: https://github.com/g2p/bedup

[3]: http://comments.gmane.org/gmane.comp.file-systems.btrfs/25062




Re: [PATCH v3 2/2] Btrfs: online data deduplication

2013-05-01 Thread Gabriel de Perthuis
>  #define BTRFS_IOC_DEV_REPLACE _IOWR(BTRFS_IOCTL_MAGIC, 53, \
>   struct btrfs_ioctl_dev_replace_args)
> +#define BTRFS_IOC_DEDUP_REGISTER _IO(BTRFS_IOCTL_MAGIC, 54)

This number has already been used by the offline dedup patches.



Re: Best Practice - Partition, or not?

2013-05-01 Thread Gabriel de Perthuis
> Hello
> 
> If I want to manage a complete disk with btrfs, what's the "Best
> Practice"? Would it be best to create the btrfs filesystem on
> "/dev/sdb", or would it be better to create just one partition from
> start to end and then do "mkfs.btrfs /dev/sdb1"?

Partitions (GPT) are always more flexible and future-proof.  If you ever
need to shrink the btrfs filesystem and give the space to another
partition, or do a conversion to lvm/bcache/luks (shameless plug:
https://github.com/g2p/blocks ), it'd be stupid to be locked into your
current setup for want of a few megabytes of space before your
filesystem.

> Would the same recomendation hold true, if we're talking about huge
> disks, like 4TB or so?

More so, since it can be infeasible to move this much data.




Re: [PATCH 0/4] [RFC] btrfs: offline dedupe

2013-04-22 Thread Gabriel de Perthuis

On Sat, Apr 20, 2013 at 05:49:25PM +0200, Gabriel de Perthuis wrote:

Hi,

The following series of patches implements in btrfs an ioctl to do
offline deduplication of file extents.


I am a fan of this patch, the API is just right.  I just have a few tweaks
to suggest to the argument checking.


Awesome, thanks for looking this over!


At first the 1M limitation on the length was a bit inconvenient, but making
repeated calls in userspace is okay and allows for better error recovery
(for example, repeating the calls when only a portion of the ranges is
identical).  The destination file appears to be fragmented into 1M extents,
but these are contiguous so it's not really a problem.


Yeah I agree it's a bit inconvenient. To that end, I fixed things up so that
instead of erroring, we just limit the dedupe to 1M. If you want to see what
I'm talking about, the patch is at the top of my tree now:

https://github.com/markfasheh/btrfs-extent-same/commit/b39f93c2e78385ceea850b59edbd759120543a8b

This way userspace doesn't have to guess at what size is the max, and we can
change it in the future, etc.

Furthermore, I'm thinking it might even be better for us to just internally
loop on the entire range asked for. That won't necessarily fix the issue
where we fragment into 1M extents, but it would ease the interface even
more.

My only concern with looping over a large range would be the (almost)
unbounded nature of the operation... For example, if someone passes in a 4
Gig range and 100 files to do that on, we could be in the ioctl for some
time.

The middle ground would be to loop like I was talking about but limit the
maximum length (by just truncating the value, as above). The limit in this
case would obviously be much larger than 1 megabyte but not so large that we
can go off for an extreme amount of time. I'm thinking maybe 16 megabytes or
so to start?


A cursor-style API could work here: make the offset and length 
parameters in/out, exit early in case of error or after the read quota 
has been used up.


The caller can retry as long as the length is nonzero (and at least one 
block), and the syscall will return frequently enough that it won't 
block an unmount operation or concurrent access to the ranges.
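
A sketch of the caller side of such a cursor loop, under the per-call cap floated above (the stub, names, and 16M constant are illustrative, not a real interface):

```c
#include <stdint.h>

#define CHUNK ((uint64_t)16 * 1024 * 1024)  /* per-call cap discussed above */

/* Stand-in for the ioctl: here it just reports the whole chunk as
 * deduped.  A real call would return less when data differs, on
 * error, or when the kernel yields early under the read quota. */
static uint64_t submit_extent_same(uint64_t off, uint64_t len)
{
    (void)off;
    return len;
}

/* Cursor loop: resubmit the remainder until no progress is made. */
static uint64_t dedupe_range(uint64_t off, uint64_t len)
{
    uint64_t done = 0;
    while (len > 0) {
        uint64_t ask = len < CHUNK ? len : CHUNK;
        uint64_t got = submit_extent_same(off + done, ask);
        if (got == 0)
            break;              /* differing data or error: stop */
        done += got;
        len  -= got;
    }
    return done;
}
```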



Requiring the offset or the length to align is spurious however; it doesn't
translate to any alignment in the extent tree (especially with
compression).  Requiring a minimum length of a few blocks or dropping the
alignment condition entirely would make more sense.


I'll take a look at this. Some of those checks are there for my own sanity
at the moment.

I really like that the start offset should align but there's no reason that
length can't be aligned to blocksize internally.

Are you sure that extents don't have to start at block boundaries? If that's
the case and we never have to change the start offset (to align it) then we
could drop the check entirely.


I've had a look, and btrfs fiemap only sets FIEMAP_EXTENT_NOT_ALIGNED 
for inline extents, so the alignment requirement makes sense.  The 
caller should do the alignment and decide if it wants to extend a bit 
and accept a not-same status or shrink a bit, so just keep it as is and 
maybe add explanatory comments.
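
The inward alignment a caller could do looks roughly like this (assuming 4K blocks; illustrative only):

```c
#include <stdint.h>

#define BLOCK 4096u             /* assumed filesystem block size */

/* Shrink a byte range to the block boundaries inside it, since the
 * ioctl rejects unaligned offsets; the caller decides separately
 * whether to extend instead and accept a not-same status. */
static int align_range_inward(uint64_t *off, uint64_t *len)
{
    uint64_t start = (*off + BLOCK - 1) & ~(uint64_t)(BLOCK - 1);
    uint64_t end   = (*off + *len) & ~(uint64_t)(BLOCK - 1);

    if (end <= start)
        return -1;              /* no whole block inside the range */
    *off = start;
    *len = end - start;
    return 0;
}
```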



I notice there is a restriction on cross-subvolume deduplication. Hopefully
it can be lifted like it was done in 3.6 for the clone ioctl.


Ok if it was removed from clone then this might be a spurious check on my
part. Most of the real extent work in btrfs-same is done by the code from
the clone ioctl.


Good to have this code shared (compression support is another win). 
bedup will need feature parity to switch from one ioctl to the other.



Deduplicating frozen subvolumes works fine, which is great for backups.

Basic integration with bedup, my offline deduplication tool, is in an
experimental branch:

   https://github.com/g2p/bedup/tree/wip/dedup-syscall

Thanks to this, I look forward to shedding most of the caveats given in the
bedup readme and some unnecessary subtleties in the code.


Again, I'm really glad this is working out for you :)

I'll check out your bedup patch early this week. It will be instructive to
see how another engineer uses the ioctl.


See ranges_same and dedup_fileset.  The ImmutableFDs stuff can be 
removed and the fact that dedup can be partially successful over a range 
will ripple through.



I've made significant updates and changes from the original. In
particular the structure passed is more fleshed out, this series has a
high degree of code sharing between itself and the clone code, and the
locking has been updated.

The ioctl accepts a struct:

struct btrfs_ioctl_same_args {
__u64 logical_offset;   /* in - start of extent in source */
__u64 length;   /* in - length of extent */
__u16 total_files;  /* in - total elements in info array */


Nit: total_files sounds like it would count the source file.
dest_count would be better.

Re: [PATCH 0/4] [RFC] btrfs: offline dedupe

2013-04-20 Thread Gabriel de Perthuis

Hi,

The following series of patches implements in btrfs an ioctl to do
offline deduplication of file extents.


I am a fan of this patch, the API is just right.  I just have a few 
tweaks to suggest to the argument checking.


At first the 1M limitation on the length was a bit inconvenient, but 
making repeated calls in userspace is okay and allows for better error 
recovery (for example, repeating the calls when only a portion of the 
ranges is identical).  The destination file appears to be fragmented 
into 1M extents, but these are contiguous so it's not really a problem.


Requiring the offset or the length to align is spurious however; it 
doesn't translate to any alignment in the extent tree (especially with 
compression).  Requiring a minimum length of a few blocks or dropping 
the alignment condition entirely would make more sense.


I notice there is a restriction on cross-subvolume deduplication. 
Hopefully it can be lifted like it was done in 3.6 for the clone ioctl.


Deduplicating frozen subvolumes works fine, which is great for backups.

Basic integration with bedup, my offline deduplication tool, is in an 
experimental branch:


  https://github.com/g2p/bedup/tree/wip/dedup-syscall

Thanks to this, I look forward to shedding most of the caveats given in 
the bedup readme and some unnecessary subtleties in the code.



I've made significant updates and changes from the original. In
particular the structure passed is more fleshed out, this series has a
high degree of code sharing between itself and the clone code, and the
locking has been updated.

The ioctl accepts a struct:

struct btrfs_ioctl_same_args {
__u64 logical_offset;   /* in - start of extent in source */
__u64 length;   /* in - length of extent */
__u16 total_files;  /* in - total elements in info array */


Nit: total_files sounds like it would count the source file.
dest_count would be better.

By the way, extent-same might be better named range-same, since there is 
no need for the input to fall on extent boundaries.



__u16 files_deduped;/* out - number of files that got deduped */
__u32 reserved;
struct btrfs_ioctl_same_extent_info info[0];
};

Userspace puts each duplicate extent (other than the source) in an
item in the info array. As there can be multiple dedupes in one
operation, each info item has it's own status and 'bytes_deduped'
member. This provides a number of benefits:

- We don't have to fail the entire ioctl because one of the dedupes failed.

- Userspace will always know how much progress was made on a file as we always
   return the number of bytes deduped.


#define BTRFS_SAME_DATA_DIFFERS	1
/* For extent-same ioctl */
struct btrfs_ioctl_same_extent_info {
	__s64 fd;		/* in - destination file */
	__u64 logical_offset;	/* in - start of extent in destination */
	__u64 bytes_deduped;	/* out - total # of bytes we were able
				 * to dedupe from this file */
	/* status of this dedupe operation:
	 * 0 if dedup succeeds
	 * < 0 for error
	 * == BTRFS_SAME_DATA_DIFFERS if data differs
	 */
	__s32 status;		/* out - see above description */
	__u32 reserved;
};


--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Problem with building instructions for btrfs-tools in https://btrfs.wiki.kernel.org/index.php/Btrfs_source_repositories

2013-03-20 Thread Gabriel de Perthuis
> There is a missing dependency: liblzo2-dev

> I suggest to make amendment to the wiki and add a "liblzo2-dev" to the 
> apt-get line for Ubuntu/Debian.

Added. Other distros may need some additions too.

Anyone can edit the wiki, as the spambots will attest; a ConfirmEdit 
captcha at signup would be nice.




Re: Permanent uncancellable balance

2013-03-02 Thread Gabriel de Perthuis
On Sat, 02 Mar 2013 17:12:37 +0600, Roman Mamedov wrote:
> Mount with the "skip_balance" option
> https://btrfs.wiki.kernel.org/index.php/Mount_options then you can issue
> "btrfs fi balance cancel" and it will succeed.

Excellent, thank you.
I had just thought of doing the same thing with ro and it worked.



Permanent uncancellable balance

2013-03-02 Thread Gabriel de Perthuis
Hello,
I have a filesystem that has become unusable because of a balance I can't 
stop. It is very close to full, and the balance is preventing me from 
growing it.

It was started like this:
sudo btrfs filesystem balance start -v -musage=60 -dusage=60 /srv/backups

It has been stuck at 0% across reboots and kernel upgrades (currently on 
3.8.1), and cancelling it had no effect:

Balance on '/srv/backups' is running
0 out of about 5 chunks balanced (95 considered), 100% left

According to atop it is writing but not reading anything.
Unmounting never terminates, and neither does remounting read-only; the 
only way to temporarily stop it is to reboot. SIGKILL has no effect 
either. Is there *any* way I can get rid of it?



[PATCH] Fix a sign bug causing invalid memory access in the ino_paths ioctl.

2012-10-10 Thread Gabriel de Perthuis
To see the problem, create many hardlinks to the same file (120 should do it),
then look up paths by inode with:

  ls -i
  btrfs inspect inode-resolve -v $ino /mnt/btrfs

I noticed the memory layout of the fspath->val data had some irregularities
(some unnecessary gaps that stop appearing about halfway),
so I'm not sure there aren't any bugs left in it.

---
 fs/btrfs/backref.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/btrfs/backref.c b/fs/btrfs/backref.c
index 868cf5b..29d05c6 100644
--- a/fs/btrfs/backref.c
+++ b/fs/btrfs/backref.c
@@ -1131,7 +1131,7 @@ char *btrfs_iref_to_path(struct btrfs_root *fs_root, struct btrfs_path *path,
int slot;
u64 next_inum;
int ret;
-   s64 bytes_left = size - 1;
+   s64 bytes_left = ((s64)size) - 1;
struct extent_buffer *eb = eb_in;
struct btrfs_key found_key;
int leave_spinning = path->leave_spinning;
-- 
1.7.12.117.gdc24c27



[PATCH] warn when skipping snapshots created with older

2012-09-05 Thread Gabriel de Perthuis
Thanks, I fixed the objectid test.
Apply with --scissors.

-- >8 --
Subject: [PATCH] btrfs send: warn when skipping snapshots created with older
 kernels.

This message is more explicit than "ERROR: could not resolve root_id",
the message that will be shown immediately before `btrfs send` bails.

Also skip invalid high OIDs, to prevent spurious warnings.

Signed-off-by: Gabriel de Perthuis 
---
 send-utils.c | 7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/send-utils.c b/send-utils.c
index a43d47e..03ca72a 100644
--- a/send-utils.c
+++ b/send-utils.c
@@ -224,13 +224,18 @@ int subvol_uuid_search_init(int mnt_fd, struct subvol_uuid_search *s)
 
if ((sh->objectid != 5 &&
sh->objectid < BTRFS_FIRST_FREE_OBJECTID) ||
-   sh->objectid == BTRFS_FREE_INO_OBJECTID)
+   sh->objectid > BTRFS_LAST_FREE_OBJECTID)
goto skip;
 
if (sh->type == BTRFS_ROOT_ITEM_KEY) {
/* older kernels don't have uuids+times */
if (sh->len < sizeof(root_item)) {
root_item_valid = 0;
+   fprintf(stderr,
+   "Ignoring subvolume id %llu, "
+   "btrfs send needs snapshots "
+   "created with kernel 3.6+\n",
+   sh->objectid);
goto skip;
}
root_item_ptr = (struct btrfs_root_item *)
-- 
1.7.12.117.gdc24c27



[PATCH] Warn when skipping snapshots created with older kernels.

2012-09-04 Thread Gabriel de Perthuis
This message is more explicit than "ERROR: could not resolve root_id",
the message that will be shown immediately before `btrfs send` bails.

Also skip invalid high OIDs.

Signed-off-by: Gabriel de Perthuis 
---
 send-utils.c | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/send-utils.c b/send-utils.c
index a43d47e..386aeb3 100644
--- a/send-utils.c
+++ b/send-utils.c
@@ -224,6 +224,7 @@ int subvol_uuid_search_init(int mnt_fd, struct subvol_uuid_search *s)
 
if ((sh->objectid != 5 &&
sh->objectid < BTRFS_FIRST_FREE_OBJECTID) ||
+   sh->objectid >= BTRFS_LAST_FREE_OBJECTID ||
sh->objectid == BTRFS_FREE_INO_OBJECTID)
goto skip;
 
@@ -231,6 +232,11 @@ int subvol_uuid_search_init(int mnt_fd, struct subvol_uuid_search *s)
/* older kernels don't have uuids+times */
if (sh->len < sizeof(root_item)) {
root_item_valid = 0;
+   fprintf(stderr,
+   "Ignoring subvolume id %llu, "
+   "btrfs send needs snapshots "
+   "created with kernel 3.6+\n",
+   sh->objectid);
goto skip;
}
root_item_ptr = (struct btrfs_root_item *)
-- 
1.7.12.117.gdc24c27
