Re: Unexpected: send/receive much slower than rsync ?

2017-04-12 Thread Hermann Schwärzler
Hi,

I am not an expert, just a btrfs user who uses send/receive quite
frequently, but I am pretty sure your problem is not on the receive side
but on the sending end.
Can you check with e.g. iotop if receive is writing anything to the disk
or if it's just waiting for send?
How much is send reading from disk and how much memory is it allocating?
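
Something along these lines should show it quickly (a rough sketch; pidstat is
part of the sysstat package, and matching on the command name "btrfs" catches
both the send and the receive process if they run on the same machine):

  # iotop -o -P            <- only processes that are currently doing I/O
  # pidstat -d -C btrfs 1  <- per-second read/write bytes of send and receive
  # pidstat -r -C btrfs 1  <- per-second memory (RSS) changes of the same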

I am asking this because I reckon your problem is caused by the way
clone detection is done in send. There is a proposed patch
https://patchwork.kernel.org/patch/9245287/
that addresses the problem. This did indeed help me when I had a similar
problem when trying to send a previously deduplicated filesystem!

Greetings
Hermann

On 04/11/2017 05:11 PM, J. Hart wrote:
> I'm trying to update from an old snapshot of a directory to a new one
> using send/receive. It seems a great deal slower than I was expecting,
> perhaps much slower than rsync and has been running for hours.
> Everything looks ok with how I set up the snapshots, and there are no
> error messages, but I don't think it should be running this long. The
> directory structure is rather complex, so that may have something to do
> with it. It contains reflinked incremental backups of root file systems
> from a number of machines. It should not actually be very large due to
> the reflinks.
> 
> Sending the old version of the snapshot for the directory did not seem
> to take this long, and I expected the "send -p  " to be much
> faster than that.
> 
> I tried running the "send" and "receive" with "-vv" to get more detail
> on what was happening.
> 
> I had thought that btrfs send/receive purely dealt with block/extent
> level changes.
> 
> I could be mistaken, but it seems that btrfs receive actually does a
> great deal of manipulation at the level of individual files, and rather
> less efficiently than rsync at that. I am not sure whether it is using
> system calls to do this, or actual shell commands themselves. I see
> quite a bit of what looks like file level manipulation in the verbose
> output. It is indeed very fast for simple directory trees even with very
> large files. However, it seems to be far slower than rsync with
> moderately complex directory trees, even if no large files are present.
> 
> I hope I'm overlooking something, and that this is not actually the
> case. Any ideas on this ?
> 
> J. Hart
> 


Re: BTRFS as a GlusterFS storage back-end, and what I've learned from using it as such.

2017-04-12 Thread Sargun Dhillon
Not to change the topic too much, but is there a suite of tracing
scripts that one can attach to their BtrFS installation to gather
metrics about tree locking performance? We see an awful lot of
machines with a task waiting on btrfs_tree_lock, and a bunch of other
tasks that are also in disk sleep waiting on BtrFS. We also see a
bunch of hung-task timeouts around btrfs_destroy_inode. We're running
kernel 4.8, so we can pretty easily plug BPF-based probes into the
kernel to get this information and aggregate it.

Rather than doing this work ourselves, I'm wondering if anyone else
has a good set of tools to collect perf data about BtrFS performance
and lock contention?
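
For the tree locking piece specifically, the stock bcc tools already get part
of the way there (a rough sketch, assuming bcc is installed; the path below is
the common one but is distro-dependent):

  # /usr/share/bcc/tools/funccount -d 10 btrfs_tree_lock
  # /usr/share/bcc/tools/funclatency -m -d 10 btrfs_tree_lock

funccount counts how often the function is hit over 10 seconds, and funclatency
prints a histogram (in milliseconds with -m) of the time spent inside
btrfs_tree_lock, which includes the time spent waiting for the lock, so the
contention should show up directly.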

On Tue, Apr 11, 2017 at 10:49 PM, Qu Wenruo  wrote:
>
>
> At 04/11/2017 11:40 PM, Austin S. Hemmelgarn wrote:
>>
>> About a year ago now, I decided to set up a small storage cluster to store
>> backups (and partially replace Dropbox for my usage, but that's a separate
>> story).  I ended up using GlusterFS as the clustering software itself, and
>> BTRFS as the back-end storage.
>>
>> GlusterFS itself is actually a pretty easy workload as far as cluster
>> software goes.  It does some processing prior to actually storing the data
>> (a significant amount in fact), but the actual on-device storage on any
>> given node is pretty simple.  You have the full directory structure for the
>> whole volume, and whatever files happen to be on that node are located
>> within that tree exactly like they are in the GlusterFS volume. Beyond the
>> basic data, gluster only stores 2-4 xattrs per-file (which are used to track
>> synchronization, and also for its internal data scrubbing), and a directory
>> called .glusterfs in the top of the back-end storage location for the volume
>> which contains the data required to figure out which node a file is on.
>> Overall, the access patterns mostly mirror whatever is using the Gluster
>> volume, or are reduced to slow streaming writes (when writing files and the
>> back-end nodes are computationally limited instead of I/O limited), with the
>> addition of some serious metadata operations in the .glusterfs directory
>> (lots of stat calls there, together with large numbers of small files).
>
>
> Any real world experience is welcomed to share.
>
>>
>> As far as overall performance, BTRFS is actually on par for this usage
>> with both ext4 and XFS (at least, on my hardware it is), and I actually see
>> more SSD friendly access patterns when using BTRFS in this case than any
>> other FS I tried.
>
>
> We also find that, for pure buffered read/write, btrfs is no worse than
> traditional fs.
>
> In our PostgreSQL test, btrfs can even get a little better performance than
> ext4/xfs when handling DB files.
>
> But if using btrfs for PostgreSQL Write Ahead Log (WAL), then it's
> completely another thing.
> Btrfs falls far behind ext4/xfs on HDD, only half of the TPC performance for
> low concurrency load.
>
> Due to btrfs CoW, btrfs causes extra IO for fsync.
> For example, fsyncing just 4K of data can cause 64K of metadata writes
> with the default mkfs options.
> (One tree block for the log root tree plus one for the log tree, multiplied
> by 2 for the default DUP profile: 2 x 16K x 2 = 64K with the default 16K
> nodesize.)
>
>>
>> After some serious experimentation with various configurations for this
>> during the past few months, I've noticed a handful of other things:
>>
>> 1. The 'ssd' mount option does not actually improve performance on these
>> SSD's.  To a certain extent, this actually surprised me at first, but having
>> seen Hans' e-mail and what he found about this option, it actually makes
>> sense, since erase-blocks on these devices are 4MB, not 2MB, and the drives
>> have a very good FTL (so they will aggregate all the little writes
>> properly).
>>
>> Given this, I'm beginning to wonder if it actually makes sense to not
>> automatically enable this on mount when dealing with certain types of
>> storage (for example, most SATA and SAS SSD's have reasonably good FTL's, so
>> I would expect them to have similar behavior).  Extrapolating further, it
>> might instead make sense to just never automatically enable this, and expose
>> the value this option is manipulating as a mount option as there are other
>> circumstances where setting specific values could improve performance (for
>> example, if you're on hardware RAID6, setting this to the stripe size would
>> probably improve performance on many cheaper controllers).
>>
>> 2. Up to a certain point, running a single larger BTRFS volume with
>> multiple sub-volumes is more computationally efficient than running multiple
>> smaller BTRFS volumes.  More specifically, there is lower load on the system
>> and lower CPU utilization by BTRFS itself without much noticeable difference
>> in performance (in my tests it was about 0.5-1% performance difference,
>> YMMV).  To a certain extent this makes some sense, but the turnover point
>> was actually a lot higher than I expected (with this workload, the turnover
>> point was a

Re: [PATCH 04/25] fs: Provide infrastructure for dynamic BDIs in filesystems

2017-04-12 Thread Christoph Hellwig
> + if (sb->s_iflags & SB_I_DYNBDI) {
> + bdi_put(sb->s_bdi);
> + sb->s_bdi = &noop_backing_dev_info;

At some point I'd really like to get rid of noop_backing_dev_info and
have a NULL here..

Otherwise this looks fine..

Reviewed-by: Christoph Hellwig 


Re: [PATCH 08/25] btrfs: Convert to separately allocated bdi

2017-04-12 Thread Christoph Hellwig
Looks fine,

Reviewed-by: Christoph Hellwig 


Re: [PATCH 1/9] Use RWF_* flags for AIO operations

2017-04-12 Thread Christoph Hellwig
>  
> + if (unlikely(iocb->aio_rw_flags & ~(RWF_HIPRI | RWF_DSYNC | RWF_SYNC))) 
> {
> + pr_debug("EINVAL: aio_rw_flags set with incompatible flags\n");
> + return -EINVAL;
> + }

> + if (iocb->aio_rw_flags & RWF_HIPRI)
> + req->common.ki_flags |= IOCB_HIPRI;
> + if (iocb->aio_rw_flags & RWF_DSYNC)
> + req->common.ki_flags |= IOCB_DSYNC;
> + if (iocb->aio_rw_flags & RWF_SYNC)
> + req->common.ki_flags |= (IOCB_DSYNC | IOCB_SYNC);

Please introduce a common helper to share this code between the
synchronous and the aio paths.



Re: [PATCH 5/9] nowait aio: return on congested block device

2017-04-12 Thread Christoph Hellwig
As mentioned last time around, this should be a REQ_NOWAIT flag so
that it can be easily passed down to the request layer.

> +static inline void bio_wouldblock_error(struct bio *bio)
> +{
> + bio->bi_error = -EAGAIN;
> + bio_endio(bio);
> +}

Please skip this helper..

> +#define QUEUE_FLAG_NOWAIT  28/* queue supports BIO_NOWAIT */

Please make the flag name a little more descriptive, this sounds like
it will never wait.


Re: [PATCH 9/9] nowait aio: Return -EOPNOTSUPP if filesystem does not support

2017-04-12 Thread Christoph Hellwig
This should go into the patch that introduces IOCB_NOWAIT.


Re: [PATCH] fstests: regression test for btrfs dio read repair

2017-04-12 Thread Filipe Manana
On Wed, Apr 12, 2017 at 2:27 AM, Liu Bo  wrote:
> This case tests whether dio read can repair the bad copy if we have
> a good copy.

Regardless of being a test we should have always had (thanks for
this!), it would be useful to mention we had a regression (as the test
description in the btrfs/140 file says) and which patch fixed it (and
possibly which kernel version or patch/commit introduced the
regression).

Just a comment/question below.

>
> Signed-off-by: Liu Bo 
> ---
>  tests/btrfs/140 | 152 
> 
>  tests/btrfs/140.out |  39 ++
>  tests/btrfs/group   |   1 +
>  3 files changed, 192 insertions(+)
>  create mode 100755 tests/btrfs/140
>  create mode 100644 tests/btrfs/140.out
>
> diff --git a/tests/btrfs/140 b/tests/btrfs/140
> new file mode 100755
> index 000..db56123
> --- /dev/null
> +++ b/tests/btrfs/140
> @@ -0,0 +1,152 @@
> +#! /bin/bash
> +# FS QA Test 140
> +#
> +# Regression test for btrfs DIO read's repair during read.
> +#
> +#---
> +# Copyright (c) 2017 Liu Bo.  All Rights Reserved.
> +#
> +# This program is free software; you can redistribute it and/or
> +# modify it under the terms of the GNU General Public License as
> +# published by the Free Software Foundation.
> +#
> +# This program is distributed in the hope that it would be useful,
> +# but WITHOUT ANY WARRANTY; without even the implied warranty of
> +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> +# GNU General Public License for more details.
> +#
> +# You should have received a copy of the GNU General Public License
> +# along with this program; if not, write the Free Software Foundation,
> +# Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
> +#---
> +#
> +
> +seq=`basename $0`
> +seqres=$RESULT_DIR/$seq
> +echo "QA output created by $seq"
> +
> +here=`pwd`
> +tmp=/tmp/$$
> +status=1   # failure is the default!
> +trap "_cleanup; exit \$status" 0 1 2 3 15
> +
> +_cleanup()
> +{
> +   cd /
> +   rm -f $tmp.*
> +}
> +
> +# get standard environment, filters and checks
> +. ./common/rc
> +. ./common/filter
> +
> +# remove previous $seqres.full before test
> +rm -f $seqres.full
> +
> +# real QA test starts here
> +
> +# Modify as appropriate.
> +_supported_fs btrfs
> +_supported_os Linux
> +_require_scratch_dev_pool 2
> +_require_command "$BTRFS_MAP_LOGICAL_PROG" btrfs-map-logical
> +_require_command "$FILEFRAG_PROG" filefrag
> +_require_odirect
> +
> +# helper to convert 'file offset' to btrfs logical offset
> +FILEFRAG_FILTER='
> +   if (/blocks? of (\d+) bytes/) {
> +   $blocksize = $1;
> +   next
> +   }
> +   ($ext, $logical, $physical, $length) =
> +   (/^\s*(\d+):\s+(\d+)..\s+\d+:\s+(\d+)..\s+\d+:\s+(\d+):/)
> +   or next;
> +   ($flags) = /.*:\s*(\S*)$/;
> +   print $physical * $blocksize, "#",
> + $length * $blocksize, "#",
> + $logical * $blocksize, "#",
> + $flags, " "'
> +
> +# this makes filefrag output script readable by using a perl helper.
> +# output is one extent per line, with three numbers separated by '#'
> +# the numbers are: physical, length, logical (all in bytes)
> +# sample output: "1234#10#5678" -> physical 1234, length 10, logical 5678
> +_filter_extents()
> +{
> +   tee -a $seqres.full | $PERL_PROG -ne "$FILEFRAG_FILTER"
> +}
> +
> +_check_file_extents()
> +{
> +   cmd="filefrag -v $1"
> +   echo "# $cmd" >> $seqres.full
> +   out=`$cmd | _filter_extents`
> +   if [ -z "$out" ]; then
> +   return 1
> +   fi
> +   echo "after filter: $out" >> $seqres.full
> +   echo $out
> +   return 0
> +}
> +
> +_check_repair()
> +{
> +   filter=${1:-cat}
> +   dmesg | tac | sed -ne "0,\#run fstests $seqnum at $date_time#p" | tac 
> | $filter | grep -q -e "csum failed"
> +   if [ $? -eq 0 ]; then
> +   echo 1
> +   else
> +   echo 0
> +   fi
> +}
> +
> +_scratch_dev_pool_get 2
> +# step 1, create a raid1 btrfs which contains one 128k file.
> +echo "step 1..mkfs.btrfs" >>$seqres.full
> +
> +mkfs_opts="-d raid1"
> +_scratch_pool_mkfs $mkfs_opts >>$seqres.full 2>&1
> +
> +_scratch_mount -o nospace_cache

Why do we need to mount without space cache?
I don't see why, nor do I think it's obvious. A comment in the test
mentioning why would be useful for everyone.


> +
> +$XFS_IO_PROG -f -d -c "pwrite -S 0xaa -b 128K 0 128K" "$SCRATCH_MNT/foobar" 
> | _filter_xfs_io
> +
> +sync
> +
> +# step 2, corrupt the first 64k of one copy (on SCRATCH_DEV which is the 
> first
> +# one in $SCRATCH_DEV_POOL
> +echo "step 2..corrupt file extent" >>$seqres.full
> +
> +extents=`_check_file_extents $SCRATCH_MNT/foobar`
> +logical_in_btrfs=`echo ${extents} | cut -d '#' -f 1`
> +physical_on_scratch=`$BTRFS_M

Re: [PATCH] fstests: regression test for btrfs buffered read's repair

2017-04-12 Thread Filipe Manana
On Wed, Apr 12, 2017 at 2:27 AM, Liu Bo  wrote:
> This case tests whether buffered read can repair the bad copy if we
> have a good copy.

Regardless of being a test we should have always had (thanks for
this!), it would be useful to mention we had a regression (as the test
description in the btrfs/141 file says) and which patch fixed it (and
possibly which kernel version or patch/commit introduced the
regression).

Just a couple comments/questions below.

>
> Signed-off-by: Liu Bo 
> ---
>  tests/btrfs/141 | 152 
> 
>  tests/btrfs/141.out |  39 ++
>  tests/btrfs/group   |   1 +
>  3 files changed, 192 insertions(+)
>  create mode 100755 tests/btrfs/141
>  create mode 100644 tests/btrfs/141.out
>
> diff --git a/tests/btrfs/141 b/tests/btrfs/141
> new file mode 100755
> index 000..53fd75c
> --- /dev/null
> +++ b/tests/btrfs/141
> @@ -0,0 +1,152 @@
> +#! /bin/bash
> +# FS QA Test 141
> +#
> +# Regression test for btrfs buffered read's repair during read.
> +#
> +#---
> +# Copyright (c) 2017 Liu Bo.  All Rights Reserved.
> +#
> +# This program is free software; you can redistribute it and/or
> +# modify it under the terms of the GNU General Public License as
> +# published by the Free Software Foundation.
> +#
> +# This program is distributed in the hope that it would be useful,
> +# but WITHOUT ANY WARRANTY; without even the implied warranty of
> +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> +# GNU General Public License for more details.
> +#
> +# You should have received a copy of the GNU General Public License
> +# along with this program; if not, write the Free Software Foundation,
> +# Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
> +#---
> +#
> +
> +seq=`basename $0`
> +seqres=$RESULT_DIR/$seq
> +echo "QA output created by $seq"
> +
> +here=`pwd`
> +tmp=/tmp/$$
> +status=1   # failure is the default!
> +trap "_cleanup; exit \$status" 0 1 2 3 15
> +
> +_cleanup()
> +{
> +   cd /
> +   rm -f $tmp.*
> +}
> +
> +# get standard environment, filters and checks
> +. ./common/rc
> +. ./common/filter
> +
> +# remove previous $seqres.full before test
> +rm -f $seqres.full
> +
> +# real QA test starts here
> +
> +# Modify as appropriate.
> +_supported_fs btrfs
> +_supported_os Linux
> +_require_scratch_dev_pool 2
> +_require_command "$BTRFS_MAP_LOGICAL_PROG" btrfs-map-logical
> +_require_command "$FILEFRAG_PROG" filefrag
> +
> +# helper to convert 'file offset' to btrfs logical offset
> +FILEFRAG_FILTER='
> +   if (/blocks? of (\d+) bytes/) {
> +   $blocksize = $1;
> +   next
> +   }
> +   ($ext, $logical, $physical, $length) =
> +   (/^\s*(\d+):\s+(\d+)..\s+\d+:\s+(\d+)..\s+\d+:\s+(\d+):/)
> +   or next;
> +   ($flags) = /.*:\s*(\S*)$/;
> +   print $physical * $blocksize, "#",
> + $length * $blocksize, "#",
> + $logical * $blocksize, "#",
> + $flags, " "'
> +
> +# this makes filefrag output script readable by using a perl helper.
> +# output is one extent per line, with three numbers separated by '#'
> +# the numbers are: physical, length, logical (all in bytes)
> +# sample output: "1234#10#5678" -> physical 1234, length 10, logical 5678
> +_filter_extents()
> +{
> +   tee -a $seqres.full | $PERL_PROG -ne "$FILEFRAG_FILTER"
> +}
> +
> +_check_file_extents()
> +{
> +   cmd="filefrag -v $1"
> +   echo "# $cmd" >> $seqres.full
> +   out=`$cmd | _filter_extents`
> +   if [ -z "$out" ]; then
> +   return 1
> +   fi
> +   echo "after filter: $out" >> $seqres.full
> +   echo $out
> +   return 0
> +}
> +
> +_check_repair()
> +{
> +   filter=${1:-cat}
> +   dmesg | tac | sed -ne "0,\#run fstests $seqnum at $date_time#p" | tac 
> | $filter | grep -q -e "csum failed"
> +   if [ $? -eq 0 ]; then
> +   echo 1
> +   else
> +   echo 0
> +   fi
> +}
> +
> +_scratch_dev_pool_get 2
> +# step 1, create a raid1 btrfs which contains one 128k file.
> +echo "step 1..mkfs.btrfs" >>$seqres.full
> +
> +mkfs_opts="-d raid1"
> +_scratch_pool_mkfs $mkfs_opts >>$seqres.full 2>&1
> +
> +_scratch_mount -o nospace_cache

Same as the other test, why do we need to mount without space cache?
It isn't obvious whether it's needed, nor why - a comment in the test
explaining why would be useful for everyone.

> +
> +$XFS_IO_PROG -f -d -c "pwrite -S 0xaa -b 128K 0 128K" "$SCRATCH_MNT/foobar" 
> | _filter_xfs_io
> +
> +sync
> +
> +# step 2, corrupt the first 64k of one copy (on SCRATCH_DEV which is the 
> first
> +# one in $SCRATCH_DEV_POOL
> +echo "step 2..corrupt file extent" >>$seqres.full
> +
> +extents=`_check_file_extents $SCRATCH_MNT/foobar`
> +logical_in_btrfs=`echo ${extents} | cut -d '#' -f 1`
> +physica

[PATCH 0/25 v3] fs: Convert all embedded bdis into separate ones

2017-04-12 Thread Jan Kara
Hello,

this is the third revision of the patch series which converts all embedded
occurrences of struct backing_dev_info to use standalone dynamically allocated
structures. This makes bdi handling unified across all bdi users and generally
removes some boilerplate code from filesystems setting up their own bdi. It
also allows us to remove some code from generic bdi implementation.

The patches were only compile-tested for most filesystems (I've tested
mounting only for NFS & btrfs) so fs maintainers please have a look whether
the changes look sound to you.

This series is based on top of bdi fixes that were merged into linux-block
git tree into for-next branch. I have pushed out the result as a branch to

git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs.git bdi

Since all patches got reviewed by Christoph, can you please pick them up Jens?
Thanks!

Changes since v2:
* Added Reviewed-by tags from Christoph

Changes since v1:
* Added some acks
* Added further FUSE cleanup patch
* Added removal of unused argument to bdi_register()
* Fixed up some compilation failures spotted by 0-day testing

Honza


[PATCH 08/25] btrfs: Convert to separately allocated bdi

2017-04-12 Thread Jan Kara
Allocate struct backing_dev_info separately instead of embedding it
inside superblock. This unifies handling of bdi among users.

CC: Chris Mason 
CC: Josef Bacik 
CC: David Sterba 
CC: linux-btrfs@vger.kernel.org
Reviewed-by: Liu Bo 
Reviewed-by: David Sterba 
Reviewed-by: Christoph Hellwig 
Signed-off-by: Jan Kara 
---
 fs/btrfs/ctree.h   |  1 -
 fs/btrfs/disk-io.c | 36 +++-
 fs/btrfs/super.c   |  7 +++
 3 files changed, 14 insertions(+), 30 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 29b7fc28c607..f6019ce20035 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -810,7 +810,6 @@ struct btrfs_fs_info {
struct btrfs_super_block *super_for_commit;
struct super_block *sb;
struct inode *btree_inode;
-   struct backing_dev_info bdi;
struct mutex tree_log_mutex;
struct mutex transaction_kthread_mutex;
struct mutex cleaner_mutex;
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 08b74daf35d0..a7d8c342f604 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1808,21 +1808,6 @@ static int btrfs_congested_fn(void *congested_data, int 
bdi_bits)
return ret;
 }
 
-static int setup_bdi(struct btrfs_fs_info *info, struct backing_dev_info *bdi)
-{
-   int err;
-
-   err = bdi_setup_and_register(bdi, "btrfs");
-   if (err)
-   return err;
-
-   bdi->ra_pages = VM_MAX_READAHEAD * 1024 / PAGE_SIZE;
-   bdi->congested_fn   = btrfs_congested_fn;
-   bdi->congested_data = info;
-   bdi->capabilities |= BDI_CAP_CGROUP_WRITEBACK;
-   return 0;
-}
-
 /*
  * called by the kthread helper functions to finally call the bio end_io
  * functions.  This is where read checksum verification actually happens
@@ -2601,16 +2586,10 @@ int open_ctree(struct super_block *sb,
goto fail;
}
 
-   ret = setup_bdi(fs_info, &fs_info->bdi);
-   if (ret) {
-   err = ret;
-   goto fail_srcu;
-   }
-
ret = percpu_counter_init(&fs_info->dirty_metadata_bytes, 0, 
GFP_KERNEL);
if (ret) {
err = ret;
-   goto fail_bdi;
+   goto fail_srcu;
}
fs_info->dirty_metadata_batch = PAGE_SIZE *
(1 + ilog2(nr_cpu_ids));
@@ -2718,7 +2697,6 @@ int open_ctree(struct super_block *sb,
 
sb->s_blocksize = 4096;
sb->s_blocksize_bits = blksize_bits(4096);
-   sb->s_bdi = &fs_info->bdi;
 
btrfs_init_btree_inode(fs_info);
 
@@ -2915,9 +2893,12 @@ int open_ctree(struct super_block *sb,
goto fail_sb_buffer;
}
 
-   fs_info->bdi.ra_pages *= btrfs_super_num_devices(disk_super);
-   fs_info->bdi.ra_pages = max(fs_info->bdi.ra_pages,
-   SZ_4M / PAGE_SIZE);
+   sb->s_bdi->congested_fn = btrfs_congested_fn;
+   sb->s_bdi->congested_data = fs_info;
+   sb->s_bdi->capabilities |= BDI_CAP_CGROUP_WRITEBACK;
+   sb->s_bdi->ra_pages = VM_MAX_READAHEAD * 1024 / PAGE_SIZE;
+   sb->s_bdi->ra_pages *= btrfs_super_num_devices(disk_super);
+   sb->s_bdi->ra_pages = max(sb->s_bdi->ra_pages, SZ_4M / PAGE_SIZE);
 
sb->s_blocksize = sectorsize;
sb->s_blocksize_bits = blksize_bits(sectorsize);
@@ -3285,8 +3266,6 @@ int open_ctree(struct super_block *sb,
percpu_counter_destroy(&fs_info->delalloc_bytes);
 fail_dirty_metadata_bytes:
percpu_counter_destroy(&fs_info->dirty_metadata_bytes);
-fail_bdi:
-   bdi_destroy(&fs_info->bdi);
 fail_srcu:
cleanup_srcu_struct(&fs_info->subvol_srcu);
 fail:
@@ -4007,7 +3986,6 @@ void close_ctree(struct btrfs_fs_info *fs_info)
percpu_counter_destroy(&fs_info->dirty_metadata_bytes);
percpu_counter_destroy(&fs_info->delalloc_bytes);
percpu_counter_destroy(&fs_info->bio_counter);
-   bdi_destroy(&fs_info->bdi);
cleanup_srcu_struct(&fs_info->subvol_srcu);
 
btrfs_free_stripe_hash_table(fs_info);
diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index da687dc79cce..e0a7503ab31e 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -1133,6 +1133,13 @@ static int btrfs_fill_super(struct super_block *sb,
 #endif
sb->s_flags |= MS_I_VERSION;
sb->s_iflags |= SB_I_CGROUPWB;
+
+   err = super_setup_bdi(sb);
+   if (err) {
+   btrfs_err(fs_info, "super_setup_bdi failed");
+   return err;
+   }
+
err = open_ctree(sb, fs_devices, (char *)data);
if (err) {
btrfs_err(fs_info, "open_ctree failed");
-- 
2.12.0



[PATCH 04/25] fs: Provide infrastructure for dynamic BDIs in filesystems

2017-04-12 Thread Jan Kara
Provide helper functions for setting up dynamically allocated
backing_dev_info structures for filesystems and cleaning them up on
superblock destruction.

CC: linux-...@lists.infradead.org
CC: linux-...@vger.kernel.org
CC: Petr Vandrovec 
CC: linux-ni...@vger.kernel.org
CC: cluster-de...@redhat.com
CC: osd-...@open-osd.org
CC: codal...@coda.cs.cmu.edu
CC: linux-...@lists.infradead.org
CC: ecryp...@vger.kernel.org
CC: linux-c...@vger.kernel.org
CC: ceph-de...@vger.kernel.org
CC: linux-btrfs@vger.kernel.org
CC: v9fs-develo...@lists.sourceforge.net
CC: lustre-de...@lists.lustre.org
Reviewed-by: Christoph Hellwig 
Signed-off-by: Jan Kara 
---
 fs/super.c   | 49 
 include/linux/backing-dev-defs.h |  2 +-
 include/linux/fs.h   |  6 +
 3 files changed, 56 insertions(+), 1 deletion(-)

diff --git a/fs/super.c b/fs/super.c
index b8b6a086c03b..0f51a437c269 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -446,6 +446,11 @@ void generic_shutdown_super(struct super_block *sb)
hlist_del_init(&sb->s_instances);
spin_unlock(&sb_lock);
up_write(&sb->s_umount);
+   if (sb->s_iflags & SB_I_DYNBDI) {
+   bdi_put(sb->s_bdi);
+   sb->s_bdi = &noop_backing_dev_info;
+   sb->s_iflags &= ~SB_I_DYNBDI;
+   }
 }
 
 EXPORT_SYMBOL(generic_shutdown_super);
@@ -1256,6 +1261,50 @@ mount_fs(struct file_system_type *type, int flags, const 
char *name, void *data)
 }
 
 /*
+ * Setup private BDI for given superblock. It gets automatically cleaned up
+ * in generic_shutdown_super().
+ */
+int super_setup_bdi_name(struct super_block *sb, char *fmt, ...)
+{
+   struct backing_dev_info *bdi;
+   int err;
+   va_list args;
+
+   bdi = bdi_alloc(GFP_KERNEL);
+   if (!bdi)
+   return -ENOMEM;
+
+   bdi->name = sb->s_type->name;
+
+   va_start(args, fmt);
+   err = bdi_register_va(bdi, NULL, fmt, args);
+   va_end(args);
+   if (err) {
+   bdi_put(bdi);
+   return err;
+   }
+   WARN_ON(sb->s_bdi != &noop_backing_dev_info);
+   sb->s_bdi = bdi;
+   sb->s_iflags |= SB_I_DYNBDI;
+
+   return 0;
+}
+EXPORT_SYMBOL(super_setup_bdi_name);
+
+/*
+ * Setup private BDI for given superblock. It gets automatically cleaned up
+ * in generic_shutdown_super().
+ */
+int super_setup_bdi(struct super_block *sb)
+{
+   static atomic_long_t bdi_seq = ATOMIC_LONG_INIT(0);
+
+   return super_setup_bdi_name(sb, "%.28s-%ld", sb->s_type->name,
+   atomic_long_inc_return(&bdi_seq));
+}
+EXPORT_SYMBOL(super_setup_bdi);
+
+/*
  * This is an internal function, please use sb_end_{write,pagefault,intwrite}
  * instead.
  */
diff --git a/include/linux/backing-dev-defs.h b/include/linux/backing-dev-defs.h
index e66d4722db8e..866c433e7d32 100644
--- a/include/linux/backing-dev-defs.h
+++ b/include/linux/backing-dev-defs.h
@@ -146,7 +146,7 @@ struct backing_dev_info {
congested_fn *congested_fn; /* Function pointer if device is md/dm */
void *congested_data;   /* Pointer to aux data for congested func */
 
-   char *name;
+   const char *name;
 
struct kref refcnt; /* Reference counter for the structure */
unsigned int capabilities; /* Device capabilities */
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 7251f7bb45e8..98cf14ea78c0 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1272,6 +1272,9 @@ struct mm_struct;
 /* sb->s_iflags to limit user namespace mounts */
 #define SB_I_USERNS_VISIBLE0x0010 /* fstype already mounted */
 
+/* Temporary flag until all filesystems are converted to dynamic bdis */
+#define SB_I_DYNBDI0x0100
+
 /* Possible states of 'frozen' field */
 enum {
SB_UNFROZEN = 0,/* FS is unfrozen */
@@ -2121,6 +2124,9 @@ extern int vfs_ustat(dev_t, struct kstatfs *);
 extern int freeze_super(struct super_block *super);
 extern int thaw_super(struct super_block *super);
 extern bool our_mnt(struct vfsmount *mnt);
+extern __printf(2, 3)
+int super_setup_bdi_name(struct super_block *sb, char *fmt, ...);
+extern int super_setup_bdi(struct super_block *sb);
 
 extern int current_umask(void);
 
-- 
2.12.0



Re: BTRFS as a GlusterFS storage back-end, and what I've learned from using it as such.

2017-04-12 Thread Austin S. Hemmelgarn

On 2017-04-12 01:49, Qu Wenruo wrote:



At 04/11/2017 11:40 PM, Austin S. Hemmelgarn wrote:

About a year ago now, I decided to set up a small storage cluster to
store backups (and partially replace Dropbox for my usage, but that's
a separate story).  I ended up using GlusterFS as the clustering
software itself, and BTRFS as the back-end storage.

GlusterFS itself is actually a pretty easy workload as far as cluster
software goes.  It does some processing prior to actually storing the
data (a significant amount in fact), but the actual on-device storage
on any given node is pretty simple.  You have the full directory
structure for the whole volume, and whatever files happen to be on
that node are located within that tree exactly like they are in the
GlusterFS volume. Beyond the basic data, gluster only stores 2-4
xattrs per-file (which are used to track synchronization, and also for
its internal data scrubbing), and a directory called .glusterfs in
the top of the back-end storage location for the volume which contains
the data required to figure out which node a file is on.  Overall, the
access patterns mostly mirror whatever is using the Gluster volume, or
are reduced to slow streaming writes (when writing files and the
back-end nodes are computationally limited instead of I/O limited),
with the addition of some serious metadata operations in the
.glusterfs directory (lots of stat calls there, together with large
numbers of small files).


Any real world experience is welcomed to share.



As far as overall performance, BTRFS is actually on par for this usage
with both ext4 and XFS (at least, on my hardware it is), and I
actually see more SSD friendly access patterns when using BTRFS in
this case than any other FS I tried.


We also find that, for pure buffered read/write, btrfs is no worse than
traditional fs.

In our PostgreSQL test, btrfs can even get a little better performance
than ext4/xfs when handling DB files.

But if using btrfs for PostgreSQL Write Ahead Log (WAL), then it's
completely another thing.
Btrfs falls far behind ext4/xfs on HDD, only half of the TPC performance
for low concurrency load.

Due to btrfs CoW, btrfs causes extra IO for fsync.
For example, fsyncing just 4K of data can cause 64K of metadata
writes with the default mkfs options.
(One tree block for the log root tree plus one for the log tree, multiplied
by 2 for the default DUP profile: 2 x 16K x 2 = 64K with the default 16K
nodesize.)



After some serious experimentation with various configurations for
this during the past few months, I've noticed a handful of other things:

1. The 'ssd' mount option does not actually improve performance on
these SSD's.  To a certain extent, this actually surprised me at
first, but having seen Hans' e-mail and what he found about this
option, it actually makes sense, since erase-blocks on these devices
are 4MB, not 2MB, and the drives have a very good FTL (so they will
aggregate all the little writes properly).

Given this, I'm beginning to wonder if it actually makes sense to not
automatically enable this on mount when dealing with certain types of
storage (for example, most SATA and SAS SSD's have reasonably good
FTL's, so I would expect them to have similar behavior).
Extrapolating further, it might instead make sense to just never
automatically enable this, and expose the value this option is
manipulating as a mount option as there are other circumstances where
setting specific values could improve performance (for example, if
you're on hardware RAID6, setting this to the stripe size would
probably improve performance on many cheaper controllers).

2. Up to a certain point, running a single larger BTRFS volume with
multiple sub-volumes is more computationally efficient than running
multiple smaller BTRFS volumes.  More specifically, there is lower
load on the system and lower CPU utilization by BTRFS itself without
much noticeable difference in performance (in my tests it was about
0.5-1% performance difference, YMMV).  To a certain extent this makes
some sense, but the turnover point was actually a lot higher than I
expected (with this workload, the turnover point was around half a
terabyte).


This seems to be related to tree locking overhead.
My thought too, although I find it interesting that the benefit starts
to disappear as the FS gets bigger beyond a certain point (on my system
it was about half a terabyte). I would expect it to be different on
systems with different numbers of CPU cores (differing levels of lock
contention) or different workloads (probably inversely proportional to
the amount of metadata work the workload produces).


The most obvious solution is, just as you stated, to use many small
subvolumes rather than one large subvolume.

Another less obvious solution is to reduce tree block size at mkfs time.

Btrfs is not that good at handling metadata workloads, limited by
both the overhead of mandatory metadata CoW and the current tree locking
algorithm.
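
(For reference, the knob meant here is the nodesize, which can only be chosen
at mkfs time; a sketch, with the caveat that smaller nodes mean deeper trees:

  # mkfs.btrfs -n 4k <devices...>

The default nodesize has been 16k for a while now.)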



I believe this to be a side-effect of how we use per-filesystem

Re: Btrfs disk layout question

2017-04-12 Thread Austin S. Hemmelgarn

On 2017-04-12 00:18, Chris Murphy wrote:

On Tue, Apr 11, 2017 at 3:00 PM, Adam Borowski  wrote:

On Tue, Apr 11, 2017 at 12:15:32PM -0700, Amin Hassani wrote:

I am working on a project with Btrfs and I was wondering if there is
any way to see the disk layout of the btrfs image. Let's assume I have
a read-only btrfs image with compression on and only using one disk
(no raid or anything). Is it possible to get a set of offset-lengths
for each file


While btrfs-specific ioctls give more information, you might want to look at
FIEMAP (Documentation/filesystems/fiemap.txt) as it works on most
filesystems, not just btrfs.  One interface to FIEMAP is provided in
"/usr/sbin/filefrag -v".


Good idea. Although, on Btrfs I'm pretty sure it reports the Btrfs
(internal) logical addressing; not the actual physical sector address
on the drive. So it depends on what the original poster is trying to
discover.

That said, there is a tool to translate that back, and depending on how 
detailed you want to get, that may be more efficient than debug tree.



Re: [PATCH] fstests: introduce btrfs-map-logical

2017-04-12 Thread David Sterba
On Wed, Apr 12, 2017 at 09:35:00AM +0800, Qu Wenruo wrote:
> 
> 
> At 04/12/2017 09:27 AM, Liu Bo wrote:
> > A typical use case of 'btrfs-map-logical' is to translate btrfs logical
> > address to physical address on each disk.
> 
> Could we avoid usage of btrfs-map-logical here?

Agreed.

> I understand that we need to do corruption so that we can test if the 
> repair works, but I'm not sure if the output format will change, or if 
> the program will get replaced by the "btrfs inspect-internal" group.

In the long term it will be replaced, but there's no ETA.


Re: [PATCH] fstests: remove snapshot aware defrag test

2017-04-12 Thread David Sterba
On Tue, Apr 11, 2017 at 06:27:18PM -0700, Liu Bo wrote:
> Since snapshot-aware defrag has been disabled in the kernel, and we all have
> learned to ignore the failure of btrfs/010, let's just remove it.
> 
> Signed-off-by: Liu Bo 

Reviewed-by: David Sterba 


Re: [PATCH] fstests: introduce btrfs-map-logical

2017-04-12 Thread David Sterba
On Wed, Apr 12, 2017 at 02:32:02PM +0200, David Sterba wrote:
> On Wed, Apr 12, 2017 at 09:35:00AM +0800, Qu Wenruo wrote:
> > 
> > 
> > At 04/12/2017 09:27 AM, Liu Bo wrote:
> > > A typical use case of 'btrfs-map-logical' is to translate btrfs logical
> > > address to physical address on each disk.
> > 
> > Could we avoid usage of btrfs-map-logical here?
> 
> Agreed.
> 
> > I understand that we need to do corruption so that we can test if the 
> > repair works, but I'm not sure if the output format will change, or if 
> > the program will get replaced by the "btrfs inspect-internal" group.
> 
> In the long term it will be replaced, but there's no ETA.

Possibly, if the fstests maintainer agrees, we can add btrfs-map-logical to
fstests. It's small and uses headers from libbtrfs, so this would become
a new dependency, but I believe it is still bearable.

I'm not sure if we should export all debugging functionality in 'btrfs',
as this is typically something that a user will never want, not even in
emergency environments. There's an overlap in the information to be
exported but I'd be more inclined to satisfy user needs than testsuite
needs. So an independent tool would give us more freedom on both sides.


Re: [PATCH] Btrfs: remove some dead code

2017-04-12 Thread David Sterba
On Tue, Apr 11, 2017 at 11:57:15AM +0300, Dan Carpenter wrote:
> btrfs_get_extent() never returns NULL pointers, so this code introduces
> a static checker warning.
> 
> The btrfs_get_extent() is a bit complex, but trust me that it doesn't
> return NULLs and also if it did we would trigger the BUG_ON(!em) before
> the last return statement.
> 
> Signed-off-by: Dan Carpenter 

Added to 4.12, thanks. I've updated the subject line so it reflects what
the patch does.


Re: [PATCH 07/12] fs: btrfs: Use ktime_get_real_ts for root ctime

2017-04-12 Thread David Sterba
On Fri, Apr 07, 2017 at 05:57:05PM -0700, Deepa Dinamani wrote:
> btrfs_root_item maintains the ctime for root updates.
> This is not part of vfs_inode.
> 
> Since current_time() uses struct inode* as an argument
> as Linus suggested, this cannot be used to update root
> times unless, we modify the signature to use inode.
> 
> Since btrfs uses nanosecond time granularity, it can also
> use ktime_get_real_ts directly to obtain timestamp for
> the root. It is necessary to use the timespec time api
> here because the same btrfs_set_stack_timespec_*() apis
> are used for vfs inode times as well. These can be
> transitioned to using timespec64 when btrfs internally
> changes to use timespec64 as well.
> 
> Signed-off-by: Deepa Dinamani 
> Acked-by: David Sterba 
> Reviewed-by: Arnd Bergmann 

I'm going to add the patch to my 4.12 queue and will let Andrew know.


Re: [PATCH v3] btrfs: fiemap: Cache and merge fiemap extent before submit it to user

2017-04-12 Thread David Sterba
On Fri, Apr 07, 2017 at 10:43:15AM +0800, Qu Wenruo wrote:
> [BUG]
> Cycle mount btrfs can cause fiemap to return different result.
> Like:
>  # mount /dev/vdb5 /mnt/btrfs
>  # dd if=/dev/zero bs=16K count=4 oflag=dsync of=/mnt/btrfs/file
>  # xfs_io -c "fiemap -v" /mnt/btrfs/file
>  /mnt/test/file:
>  EXT: FILE-OFFSET  BLOCK-RANGE  TOTAL FLAGS
>0: [0..127]:25088..25215   128   0x1
>  # umount /mnt/btrfs
>  # mount /dev/vdb5 /mnt/btrfs
>  # xfs_io -c "fiemap -v" /mnt/btrfs/file
>  /mnt/test/file:
>  EXT: FILE-OFFSET  BLOCK-RANGE  TOTAL FLAGS
>0: [0..31]: 25088..2511932   0x0
>1: [32..63]:25120..2515132   0x0
>2: [64..95]:25152..2518332   0x0
>3: [96..127]:   25184..2521532   0x1
> But after above fiemap, we get correct merged result if we call fiemap
> again.
>  # xfs_io -c "fiemap -v" /mnt/btrfs/file
>  /mnt/test/file:
>  EXT: FILE-OFFSET  BLOCK-RANGE  TOTAL FLAGS
>0: [0..127]:25088..25215   128   0x1
> 
> [REASON]
> Btrfs will try to merge extent map when inserting new extent map.
> 
> btrfs_fiemap(start=0 len=(u64)-1)
> |- extent_fiemap(start=0 len=(u64)-1)
>|- get_extent_skip_holes(start=0 len=64k)
>|  |- btrfs_get_extent_fiemap(start=0 len=64k)
>| |- btrfs_get_extent(start=0 len=64k)
>||  Found on-disk (ino, EXTENT_DATA, 0)
>||- add_extent_mapping()
>||- Return (em->start=0, len=16k)
>|
>|- fiemap_fill_next_extent(logic=0 phys=X len=16k)
>|
>|- get_extent_skip_holes(start=0 len=64k)
>|  |- btrfs_get_extent_fiemap(start=0 len=64k)
>| |- btrfs_get_extent(start=16k len=48k)
>||  Found on-disk (ino, EXTENT_DATA, 16k)
>||- add_extent_mapping()
>||  |- try_merge_map()
>|| Merge with previous em start=0 len=16k
>|| resulting em start=0 len=32k
>||- Return (em->start=0, len=32K)<< Merged result
>|- Strip off the unrelated range (0~16K) of the returned em
>|- fiemap_fill_next_extent(logic=16K phys=X+16K len=16K)
>   ^^^ Causing split fiemap extent.
> 
> And since in add_extent_mapping(), em is already merged, in next
> fiemap() call, we will get merged result.
> 
> [FIX]
> Here we introduce a new structure, fiemap_cache, which records previous
> fiemap extent.
> 
> And will always try to merge current fiemap_cache result before calling
> fiemap_fill_next_extent().
> Only when we failed to merge current fiemap extent with cached one, we
> will call fiemap_fill_next_extent() to submit cached one.
> 
> So by this method, we can merge all fiemap extents.

The cache gets reset on each call to extent_fiemap, so if fi_extents_max
is 1, the cache will be always unset and we'll never merge anything. The
same can happen if the number of extents reaches the limit
(FIEMAP_MAX_EXTENTS or any other depending on the ioctl caller). And
this leads to the unmerged extents.

> It can also be done in fs/ioctl.c, however the problem is if
> fieinfo->fi_extents_max == 0, we have no space to cache previous fiemap
> extent.

I don't see why, it's the same code path, no?

> So I choose to merge it in btrfs.

Lifting that to the vfs interface is probably not the right approach.
The ioctl has never done any postprocessing of the data returned by
filesystems, it's really up to the filesystem to prepare the data.

> Signed-off-by: Qu Wenruo 
> ---
> v2:
>   Since fiemap_extent_info has a limit for number of fiemap_extent, it's 
> possible
>   that fiemap_fill_next_extent() return 1 halfway. Remove the WARN_ON() which 
> can
>   cause kernel warning if we fiemap is called on large compressed file.
> v3:
>   Rename finish_fiemap_extent() to check_fiemap_extent(), as in v3 we ensured
>   submit_fiemap_extent() to submit fiemap cache, so it just acts as a
>   sanity check.
>   Remove BTRFS_MAX_EXTENT_SIZE limit in submit_fiemap_extent(), as
>   extent map can be larger than BTRFS_MAX_EXTENT_SIZE.
>   Don't do backward jump, suggested by David.
>   Better sanity check and recoverable fix.
> 
> To David:
>   What about adding a btrfs_debug_warn(), which will only call WARN_ON(1) if
>   BTRFS_CONFIG_DEBUG is specified for recoverable bug?
> 
>   And modify ASSERT() to always WARN_ON() and exit error code?

That's for a separate discussion.

> ---
>  fs/btrfs/extent_io.c | 124 
> ++-
>  1 file changed, 122 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index 28e8192..c4cb65d 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -4353,6 +4353,123 @@ static struct extent_map 
> *get_extent_skip_holes(struct inode *inode,
>   return NULL;
>  }
>  
> +/*
> + * To cache previous fiemap extent
> + *
> + * Will be used for merging fiemap extent
> + */
> +struct fiemap_cache {
> + u64 offset;
> + u64 phys;
> + u64 len;
> + u32 f

Re: Btrfs disk layout question

2017-04-12 Thread Andrei Borzenkov
12.04.2017 14:20, Austin S. Hemmelgarn writes:
> On 2017-04-12 00:18, Chris Murphy wrote:
>> On Tue, Apr 11, 2017 at 3:00 PM, Adam Borowski 
>> wrote:
>>> On Tue, Apr 11, 2017 at 12:15:32PM -0700, Amin Hassani wrote:
 I am working on a project with Btrfs and I was wondering if there is
 any way to see the disk layout of the btrfs image. Let's assume I have
 a read-only btrfs image with compression on and only using one disk
 (no raid or anything). Is it possible to get a set of offset-lengths
 for each file
>>>
>>> While btrfs-specific ioctls give more information, you might want to
>>> look at
>>> FIEMAP (Documentation/filesystems/fiemap.txt) as it works on most
>>> filesystems, not just btrfs.  One interface to FIEMAP is provided in
>>> "/usr/sbin/filefrag -v".
>>
>> Good idea. Although, on Btrfs I'm pretty sure it reports the Btrfs
>> (internal) logical addressing; not the actual physical sector address
>> on the drive. So it depends on what the original poster is trying to
>> discover.
>>
> That said, there is a tool to translate that back, and depending on how
> detailed you want to get, that may be more efficient than debug tree.

Could you give pointer to this tool? I use filefrag on bootinfoscript to
display physical disk offset of files of interest to bootloader. I was
not aware it shows logical offset which makes it kinda pointless.


Re: Btrfs disk layout question

2017-04-12 Thread Amin Hassani
Hi, thanks for the responses.

I actually need the physical addresses. FIEMAP, I believe (and I
tested), gives the logical address, which, as Andrei mentioned, is useless.
I'm assuming btrfs-debug-tree gives the physical addresses, right? I
also need to know the compression method used on each extent, and I
don't think I can get that with the fiemap stuff. It seems that the fiemap
capability is not implemented in Btrfs, judging from the Btrfs
implementation and the documentation of fiemap:

"File systems wishing to support fiemap must implement a ->fiemap
callback on their inode_operations structure. The fs ->fiemap call is
responsible for
defining its set of supported fiemap flags, and calling a helper function on
each discovered extent:"

Thanks,
Amin.



On Wed, Apr 12, 2017 at 9:44 AM, Andrei Borzenkov  wrote:
> 12.04.2017 14:20, Austin S. Hemmelgarn пишет:
>> On 2017-04-12 00:18, Chris Murphy wrote:
>>> On Tue, Apr 11, 2017 at 3:00 PM, Adam Borowski 
>>> wrote:
 On Tue, Apr 11, 2017 at 12:15:32PM -0700, Amin Hassani wrote:
> I am working on a project with Btrfs and I was wondering if there is
> any way to see the disk layout of the btrfs image. Let's assume I have
> a read-only btrfs image with compression on and only using one disk
> (no raid or anything). Is it possible to get a set of offset-lengths
> for each file

 While btrfs-specific ioctls give more information, you might want to
 look at
 FIEMAP (Documentation/filesystems/fiemap.txt) as it works on most
 filesystems, not just btrfs.  One interface to FIEMAP is provided in
 "/usr/sbin/filefrag -v".
>>>
>>> Good idea. Although, on Btrfs I'm pretty sure it reports the Btrfs
>>> (internal) logical addressing; not the actual physical sector address
>>> on the drive. So it depends on what the original poster is trying to
>>> discover.
>>>
>> That said, there is a tool to translate that back, and depending on how
>> detailed you want to get, that may be more efficient than debug tree.
>
> Could you give pointer to this tool? I use filefrag on bootinfoscript to
> display physical disk offset of files of interest to bootloader. I was
> not aware it shows logical offset which makes it kinda pointless.



-- 
Amin Hassani.


Re: Btrfs disk layout question

2017-04-12 Thread Austin S. Hemmelgarn

On 2017-04-12 12:44, Andrei Borzenkov wrote:

12.04.2017 14:20, Austin S. Hemmelgarn writes:

On 2017-04-12 00:18, Chris Murphy wrote:

On Tue, Apr 11, 2017 at 3:00 PM, Adam Borowski 
wrote:

On Tue, Apr 11, 2017 at 12:15:32PM -0700, Amin Hassani wrote:

I am working on a project with Btrfs and I was wondering if there is
any way to see the disk layout of the btrfs image. Let's assume I have
a read-only btrfs image with compression on and only using one disk
(no raid or anything). Is it possible to get a set of offset-lengths
for each file


While btrfs-specific ioctls give more information, you might want to
look at
FIEMAP (Documentation/filesystems/fiemap.txt) as it works on most
filesystems, not just btrfs.  One interface to FIEMAP is provided in
"/usr/sbin/filefrag -v".


Good idea. Although, on Btrfs I'm pretty sure it reports the Btrfs
(internal) logical addressing; not the actual physical sector address
on the drive. So it depends on what the original poster is trying to
discover.


That said, there is a tool to translate that back, and depending on how
detailed you want to get, that may be more efficient than debug tree.


Could you give pointer to this tool? I use filefrag on bootinfoscript to
display physical disk offset of files of interest to bootloader. I was
not aware it shows logical offset which makes it kinda pointless.

Looking again, I think I was thinking of `btrfs inspect-internal 
logical-resolve`, which actually is more like a reverse fiemap (you give 
it a logical address, and it spits out paths to all the files that 
include that logical address), so such a tool may not actually exist (at 
least, not in the standard tools).
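
For completeness, that one looks roughly like this (a sketch; the address and
the paths are made up):

  # btrfs inspect-internal logical-resolve 298844160 /mnt
  /mnt/some/file
  /mnt/snapshots/2017-04-01/some/file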



Re: Btrfs disk layout question

2017-04-12 Thread Chris Murphy
btrfs-map-logical is the tool that will convert logical to physical
and also tell you what device it's on; but the device notation is copy 1
and copy 2, so you have to infer which device that is; it's not
explicit.
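
Typical usage looks something like this (a rough sketch; the logical address
would come from e.g. filefrag -v, the numbers are made up, and the exact output
format differs a bit between btrfs-progs versions):

  # btrfs-map-logical -l 298844160 /dev/sdb
  mirror 1 logical 298844160 physical 316669952 device /dev/sdb
  mirror 2 logical 298844160 physical 315621376 device /dev/sdc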


Chris Murphy


Re: Btrfs disk layout question

2017-04-12 Thread Hans van Kranenburg
On 04/11/2017 09:15 PM, Amin Hassani wrote:
> 
> I am working on a project with Btrfs and I was wondering if there is
> any way to see the disk layout of the btrfs image. Let's assume I have
> a read-only btrfs image with compression on and only using one disk
> (no raid or anything).

> Is it possible to get a set of offset-lengths
> for each file or metadata parts of the image.

These are two very different things, and it's unclear to me what you
actually want.

Do you want:

1. a layout of physical disk space, and then for each range see if it's
used for data, metadata or not used?

2. a list of files and how they're split up (or not) in one or multiple
extents, and how long those are?

Remember that multiple files can reuse part of each other's data in
btrfs. So if you follow the files, and you have reflinked copies or
subvolume snapshots, you will see the same actual disk usage multiple times.

> I know there is an
> unfinished documentation for the On-disk Format here:
> https://btrfs.wiki.kernel.org/index.php/On-disk_Format
> But it is not complete and does not show what I am looking for. Is
> there any other documentation on this? Is there any public API that I
> can use to get this information.

...

> For example can I iterate on all
> files starting from the root node and get all offset-lengths? This way
> any part that doesn't come can be assumed as metadata. I don't really
> care what is inside the metadata, I just want to know their
> offset-lengths in the file system.

No, that's not how it works.

To learn more about how btrfs organizes data internally, you need a good
understanding of these concepts:

* how btrfs allocates "chunks" (often 256MiB or 1GiB size) of raw disk
space and dedicates them to either data or metadata.
* how btrfs uses a "virtual address space" and how that maps back from
(dev tree) and forth (chunk tree) to raw physical disk space on either
of the disks that is attached to the filesystem.
* how btrfs stores the administration of exactly which part of that
virtual address space is in use (extent tree).
* how btrfs stores files and directories, and how it does so for
multiple directory trees (subvolumes), (the fs tree and all 256 <= trees
<= -256).
* how files in these file trees reference data from data extents.
* how extents reference back to which (can be multiple!) files they're
used in.

IOW, there are likely multiple levels of indirection that you need to
follow to find things out.

Currently there's no perfect tutorial that explains exactly all those
things in a nice way.

The btrfs wiki can help with this, and the btrfs-heatmap tool which was
already mentioned is nice to play around with to get a better
understanding of the address space and its usage.

If you know exactly what the end result would be, then it's probably
possible to build something that uses the SEARCH IOCTL with which you
can search in all metadata (containing info of above mentioned trees) of
a live filesystem. At least for C and for python there's enough example
code around to do so.
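
If you just want to eyeball those trees on an existing filesystem, dump-tree is
the quickest way to see the indirection (a sketch; this needs a btrfs-progs
recent enough to have dump-tree, older versions ship the same functionality as
the standalone btrfs-debug-tree tool):

  # btrfs inspect-internal dump-tree -t 3 <dev>  <- chunk tree: virtual -> physical
  # btrfs inspect-internal dump-tree -t 4 <dev>  <- dev tree: physical -> virtual
  # btrfs inspect-internal dump-tree -t 2 <dev>  <- extent tree: which virtual ranges are in use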

-- 
Hans van Kranenburg


Re: [PATCH 5/9] nowait aio: return on congested block device

2017-04-12 Thread Goldwyn Rodrigues


On 04/12/2017 03:36 AM, Christoph Hellwig wrote:
> As mentioned last time around, this should be a REQ_NOWAIT flag so
> that it can be easily passed down to the request layer.
> 
>> +static inline void bio_wouldblock_error(struct bio *bio)
>> +{
>> +bio->bi_error = -EAGAIN;
>> +bio_endio(bio);
>> +}
> 
> Please skip this helper..

Why? It is being called three times?
I am incorporating all the rest of the comments, besides this one. Thanks.

> 
>> +#define QUEUE_FLAG_NOWAIT  28   /* queue supports BIO_NOWAIT */
> 
> Please make the flag name a little more descriptive, this sounds like
> it will never wait.
> 

-- 
Goldwyn


Re: [PATCH] fstests: regression test for btrfs dio read repair

2017-04-12 Thread Liu Bo
On Wed, Apr 12, 2017 at 10:42:47AM +0100, Filipe Manana wrote:
> On Wed, Apr 12, 2017 at 2:27 AM, Liu Bo  wrote:
> > This case tests whether dio read can repair the bad copy if we have
> > a good copy.
> 
> Regardless of being a test we should have always had (thanks for
> this!), it would be useful to mention we had a regression (as the test
> description in the btrfs/140 file says) and which patch fixed it (and
> possibly which kernel version or patch/commit introduced the
> regression).
>

Sure, thanks for the review.

> Just a comment/question below.
> 
> >
> > Signed-off-by: Liu Bo 
> > ---
> >  tests/btrfs/140 | 152 ++++++++++++++++++++++++++++++++++++++++++++++++++++
> >  tests/btrfs/140.out |  39 ++
> >  tests/btrfs/group   |   1 +
> >  3 files changed, 192 insertions(+)
> >  create mode 100755 tests/btrfs/140
> >  create mode 100644 tests/btrfs/140.out
> >
> > diff --git a/tests/btrfs/140 b/tests/btrfs/140
> > new file mode 100755
> > index 000..db56123
> > --- /dev/null
> > +++ b/tests/btrfs/140
> > @@ -0,0 +1,152 @@
> > +#! /bin/bash
> > +# FS QA Test 140
> > +#
> > +# Regression test for btrfs DIO read's repair during read.
> > +#
> > +#---
> > +# Copyright (c) 2017 Liu Bo.  All Rights Reserved.
> > +#
> > +# This program is free software; you can redistribute it and/or
> > +# modify it under the terms of the GNU General Public License as
> > +# published by the Free Software Foundation.
> > +#
> > +# This program is distributed in the hope that it would be useful,
> > +# but WITHOUT ANY WARRANTY; without even the implied warranty of
> > +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> > +# GNU General Public License for more details.
> > +#
> > +# You should have received a copy of the GNU General Public License
> > +# along with this program; if not, write the Free Software Foundation,
> > +# Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
> > +#---
> > +#
> > +
> > +seq=`basename $0`
> > +seqres=$RESULT_DIR/$seq
> > +echo "QA output created by $seq"
> > +
> > +here=`pwd`
> > +tmp=/tmp/$$
> > +status=1   # failure is the default!
> > +trap "_cleanup; exit \$status" 0 1 2 3 15
> > +
> > +_cleanup()
> > +{
> > +   cd /
> > +   rm -f $tmp.*
> > +}
> > +
> > +# get standard environment, filters and checks
> > +. ./common/rc
> > +. ./common/filter
> > +
> > +# remove previous $seqres.full before test
> > +rm -f $seqres.full
> > +
> > +# real QA test starts here
> > +
> > +# Modify as appropriate.
> > +_supported_fs btrfs
> > +_supported_os Linux
> > +_require_scratch_dev_pool 2
> > +_require_command "$BTRFS_MAP_LOGICAL_PROG" btrfs-map-logical
> > +_require_command "$FILEFRAG_PROG" filefrag
> > +_require_odirect
> > +
> > +# helper to convert 'file offset' to btrfs logical offset
> > +FILEFRAG_FILTER='
> > +   if (/blocks? of (\d+) bytes/) {
> > +   $blocksize = $1;
> > +   next
> > +   }
> > +   ($ext, $logical, $physical, $length) =
> > +   (/^\s*(\d+):\s+(\d+)..\s+\d+:\s+(\d+)..\s+\d+:\s+(\d+):/)
> > +   or next;
> > +   ($flags) = /.*:\s*(\S*)$/;
> > +   print $physical * $blocksize, "#",
> > + $length * $blocksize, "#",
> > + $logical * $blocksize, "#",
> > + $flags, " "'
> > +
> > +# this makes filefrag output script readable by using a perl helper.
> > +# output is one extent per line, with three numbers separated by '#'
> > +# the numbers are: physical, length, logical (all in bytes)
> > +# sample output: "1234#10#5678" -> physical 1234, length 10, logical 5678
> > +_filter_extents()
> > +{
> > +   tee -a $seqres.full | $PERL_PROG -ne "$FILEFRAG_FILTER"
> > +}
> > +
> > +_check_file_extents()
> > +{
> > +   cmd="filefrag -v $1"
> > +   echo "# $cmd" >> $seqres.full
> > +   out=`$cmd | _filter_extents`
> > +   if [ -z "$out" ]; then
> > +   return 1
> > +   fi
> > +   echo "after filter: $out" >> $seqres.full
> > +   echo $out
> > +   return 0
> > +}
> > +
> > +_check_repair()
> > +{
> > +   filter=${1:-cat}
> > +   dmesg | tac | sed -ne "0,\#run fstests $seqnum at $date_time#p" | tac | $filter | grep -q -e "csum failed"
> > +   if [ $? -eq 0 ]; then
> > +   echo 1
> > +   else
> > +   echo 0
> > +   fi
> > +}
> > +
> > +_scratch_dev_pool_get 2
> > +# step 1, create a raid1 btrfs which contains one 128k file.
> > +echo "step 1..mkfs.btrfs" >>$seqres.full
> > +
> > +mkfs_opts="-d raid1"
> > +_scratch_pool_mkfs $mkfs_opts >>$seqres.full 2>&1
> > +
> > +_scratch_mount -o nospace_cache
> 
> Why do we need to mount without space cache?
> I don't see why nor I think it's obvious. A comment in the test
> mentioning why would be useful for everyone.
>

Thanks for spotting it, we can safely get rid of it.

Re: BTRFS as a GlusterFS storage back-end, and what I've learned from using it as such.

2017-04-12 Thread Duncan
Austin S. Hemmelgarn posted on Wed, 12 Apr 2017 07:18:44 -0400 as
excerpted:

> On 2017-04-12 01:49, Qu Wenruo wrote:
>>
>> At 04/11/2017 11:40 PM, Austin S. Hemmelgarn wrote:
>>>
>>> 4. Depending on other factors, compression can actually slow you down
>>> pretty significantly.  In the particular case I saw this happen (all
>>> cores completely utilized by userspace software), LZO compression
>>> actually caused around 5-10% performance degradation compared to no
>>> compression.  This is somewhat obvious once it's explained, but it's
>>> not exactly intuitive  and as such it's probably worth documenting in
>>> the man pages that compression won't always make things better.  I may
>>> send a patch to add this at some point in the near future.
>>
>> This seems interesting.
>> Maybe it's CPU limiting the performance?

> In this case, I'm pretty certain that that's the cause.  I've only ever
> seen this happen though when the CPU was under either full or more than
> full load (so pretty much full utilization of all the cores), and it
> gets worse as the CPU load increases.

This seems blatantly obvious to me, no explanation needed, at least 
assuming people understand what compression is and does.  It certainly 
doesn't seem btrfs specific to me.

Which makes me wonder if I'm missing something that would seem to 
counteract the obvious, but doesn't in this case.

Compression at its most basic can be described as a tradeoff of CPU 
cycles to decrease data size (by tracking and eliminating internal 
redundancy), and thus transfer time of the data.

In conditions where the bottleneck is (seek and) transfer time, as on hdds 
with mostly idle CPUs, compression therefore tends to be a pretty big 
performance boost because the lower size of the compressed data means 
fewer seeks and lower transfer time, and because that's where the 
bottleneck is, making it more efficient increases the performance of the 
entire thing.

But the context here is SSDs, with 0 seek time and fast transfer speeds, 
and already 100% utilized CPUs, so the bottleneck is the 100% utilized 
CPUs and the increased CPU cycles necessary for the compression/
decompression simply increases the CPU bottleneck.
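
To put rough numbers on that tradeoff, here's a tiny back-of-the-envelope
model. Every figure below is invented purely for illustration, not a
measurement from this thread: effective throughput with compression is
roughly the minimum of what the spare CPU can (de)compress and what the
device can move, divided by the compression ratio.

# Toy model of the compression tradeoff; all numbers are made up.
def effective_throughput(device_mb_s, codec_mb_s, ratio, spare_cpu):
    # device_mb_s: raw device throughput (MB/s of on-disk bytes)
    # codec_mb_s:  how fast one core (de)compresses logical data (MB/s)
    # ratio:       compressed size / logical size (0.5 means 2:1)
    # spare_cpu:   fraction of a core left over for the codec
    cpu_bound = codec_mb_s * spare_cpu     # limited by CPU time
    io_bound = device_mb_s / ratio         # limited by the device
    return min(cpu_bound, io_bound)

# HDD, mostly idle CPU: ~300 MB/s of logical data vs 150 MB/s uncompressed.
print(effective_throughput(150, 400, 0.5, 0.9))
# Fast SSD, CPU already saturated by userspace: ~40 MB/s vs 500 MB/s raw.
print(effective_throughput(500, 400, 0.5, 0.1))

Same codec, same ratio; only the device speed and the spare CPU change,
and the winner flips.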

So far from a mystery, this seems so basic to me that the simplest 
dunderhead should get it, at least as long as they aren't /so/ simple 
they can't understand the tradeoff inherent in the simplest compression 
basics.

But that's not the implication of the discussion quoted above, and the 
participants are both what I'd consider far more qualified to understand 
and deal with this sort of thing than I, so I /gotta/ be missing 
something that despite my correct ultimate conclusion, means I haven't 
reached it using a correct logic train, and that there /must/ be some 
logic steps in there that I've left out that would intuitively switch the 
logic, making this a rather less intuitive conclusion than I'm thinking.

So what am I missing?

Or is it simply that the tradeoff between CPU usage and data size and 
minimum transit time isn't as simple and basic for most people as I'm 
assuming here, such that it isn't obviously giving more work to an 
already bottlenecked CPU, reducing the performance when it /is/ the CPU 
that's bottlenecked?

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2] btrfs-progs: send-dump: always print a space after path

2017-04-12 Thread Duncan
Evan Danaher posted on Tue, 11 Apr 2017 12:33:40 -0400 as excerpted:

> I was shocked to discover that 'btrfs receive --dump' doesn't print a
> space after long filenames, so it runs together into the metadata; for
> example:
> 
> truncate./20-00-03/this-name-is-32-characters-longsize=0
> 
> This is a trivial patch to add a single space unconditionally, so the
> result is the following:
> 
> truncate./20-00-03/this-name-is-32-characters-long size=0
> 
> I suppose this is technically a breaking change, but it seems unlikely
> to me that anyone would depend on the existing behavior given how
> unfriendly it is.
> 
> Signed-off-by: Evan Danaher 
> ---

I'm not a dev so won't attempt to comment on the patch itself, but it's 
worth noting that according to kernel patch submission guidelines (which 
btrfs-progs use as well) on V2+ patch postings, there should be a short, 
often one-line per version, summary of what changed between versions.  
This helps both reviewers and would-be patch-using admins such as myself 
understand how a patch is evolving, as well as for reviewers preventing 
unnecessary work when re-reviewing a new version of a patch previously 
reviewed in an earlier version.

On patch series this summary is generally found in the 0/N post, while on 
individual patches without a 0/N, it's normally found below the first --- 
delimiter, so as to avoid including the patch history in the final merged 
version comment.

See pretty much any other multi-version posted patch for examples.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v3] btrfs: fiemap: Cache and merge fiemap extent before submit it to user

2017-04-12 Thread Qu Wenruo



At 04/12/2017 11:05 PM, David Sterba wrote:

On Fri, Apr 07, 2017 at 10:43:15AM +0800, Qu Wenruo wrote:

[BUG]
Cycle mounting btrfs can cause fiemap to return different results.
Like:
  # mount /dev/vdb5 /mnt/btrfs
  # dd if=/dev/zero bs=16K count=4 oflag=dsync of=/mnt/btrfs/file
  # xfs_io -c "fiemap -v" /mnt/btrfs/file
  /mnt/test/file:
  EXT: FILE-OFFSET      BLOCK-RANGE        TOTAL FLAGS
    0: [0..127]:        25088..25215         128   0x1
  # umount /mnt/btrfs
  # mount /dev/vdb5 /mnt/btrfs
  # xfs_io -c "fiemap -v" /mnt/btrfs/file
  /mnt/test/file:
  EXT: FILE-OFFSET      BLOCK-RANGE        TOTAL FLAGS
    0: [0..31]:         25088..25119          32   0x0
    1: [32..63]:        25120..25151          32   0x0
    2: [64..95]:        25152..25183          32   0x0
    3: [96..127]:       25184..25215          32   0x1
But after the above fiemap, we get the correct merged result if we call
fiemap again.
  # xfs_io -c "fiemap -v" /mnt/btrfs/file
  /mnt/test/file:
  EXT: FILE-OFFSET      BLOCK-RANGE        TOTAL FLAGS
    0: [0..127]:        25088..25215         128   0x1

[REASON]
Btrfs will try to merge extent map when inserting new extent map.

btrfs_fiemap(start=0 len=(u64)-1)
|- extent_fiemap(start=0 len=(u64)-1)
|- get_extent_skip_holes(start=0 len=64k)
|  |- btrfs_get_extent_fiemap(start=0 len=64k)
| |- btrfs_get_extent(start=0 len=64k)
||  Found on-disk (ino, EXTENT_DATA, 0)
||- add_extent_mapping()
||- Return (em->start=0, len=16k)
|
|- fiemap_fill_next_extent(logic=0 phys=X len=16k)
|
|- get_extent_skip_holes(start=0 len=64k)
|  |- btrfs_get_extent_fiemap(start=0 len=64k)
| |- btrfs_get_extent(start=16k len=48k)
||  Found on-disk (ino, EXTENT_DATA, 16k)
||- add_extent_mapping()
||  |- try_merge_map()
|| Merge with previous em start=0 len=16k
|| resulting em start=0 len=32k
||- Return (em->start=0, len=32K)<< Merged result
|- Strip off the unrelated range (0~16K) of the returned em
|- fiemap_fill_next_extent(logic=16K phys=X+16K len=16K)
   ^^^ Causing split fiemap extent.

And since in add_extent_mapping() the em is already merged, in the next
fiemap() call we will get the merged result.

[FIX]
Here we introduce a new structure, fiemap_cache, which records previous
fiemap extent.

And it will always try to merge the current fiemap extent into the cached
fiemap_cache result before calling fiemap_fill_next_extent().
Only when we fail to merge the current fiemap extent with the cached one
do we call fiemap_fill_next_extent() to submit the cached one.

So by this method, we can merge all fiemap extents.


The cache gets reset on each call to extent_fiemap, so if fi_extents_max
is 1, the cache will always be unset and we'll never merge anything. The
same can happen if the number of extents reaches the limit
(FIEMAP_MAX_EXTENTS or any other depending on the ioctl caller). And
this leads to unmerged extents.


Nope, extents will still be merged, as long as they can be merged.

The fiemap extent is only submitted if we found an unmergeable extent.

Even if fi_extents_max is 1, it is still possible for us to merge extents.

File A:
Extent 1: offset=0 len=4k phys=X
Extent 2: offset=4k len=4k phys=X+4
Extent 3: offset=8k len=4k phys=Y

1) Found Extent 1
   Cache it, not submitted yet.
2) Found Extent 2
   Merge it with cached one, not submitted yet.
3) Found Extent 3
   Can't merge, submit the cached one first.
   The submitted one reaches fi_extents_max, so exit the current extent_fiemap.

4) Next fiemap call starts from offset 8K,
   Extent 3 is the last extent, no need to cache, just submit it.

So we still got merged fiemap extent, without anything wrong.

The point is, fi_extents_max or any other limit only comes into play when
we submit a fiemap extent, and in that case either we found an unmergeable
extent, or we already hit the last extent.
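
A rough userspace model of that cache-and-merge logic (not the kernel
patch itself, just the idea, with fiemap_fill_next_extent() replaced by a
callback) might look like this:

# Sketch of the merge-before-submit idea; extents is an iterable of
# (logical, physical, length, flags) tuples in the order btrfs finds them,
# fill_next stands in for fiemap_fill_next_extent() and returns 1 when the
# caller's buffer is full.
def fiemap_with_cache(extents, fill_next):
    cached = None
    for logical, physical, length, flags in extents:
        if cached is not None:
            c_log, c_phys, c_len, c_flags = cached
            if logical == c_log + c_len and physical == c_phys + c_len:
                # Contiguous in both file offset and disk address: merge.
                cached = (c_log, c_phys, c_len + length, c_flags | flags)
                continue
            if fill_next(*cached):     # unmergeable: submit the cached one first
                return                 # buffer full, stop like extent_fiemap
        cached = (logical, physical, length, flags)
    if cached is not None:
        fill_next(*cached)             # flush the last cached extent

# The walkthrough above: extents 1 and 2 merge, extent 3 does not.
out = []
fiemap_with_cache(
    [(0, 0x1000, 4096, 0), (4096, 0x2000, 4096, 0), (8192, 0x9000, 4096, 1)],
    lambda *ext: (out.append(ext), 0)[1])
print(out)   # [(0, 4096, 8192, 0), (8192, 36864, 4096, 1)]

Whether fi_extents_max is 1 or larger only changes when fill_next reports
a full buffer, not whether contiguous extents get merged before they are
submitted.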





It can also be done in fs/ioctl.c, however the problem is if
fieinfo->fi_extents_max == 0, we have no space to cache previous fiemap
extent.


I don't see why, it's the same code path, no?


My original design in VFS was to check if we can merge the current fiemap
extent with the last one in fiemap_info.


But for the fi_extents_max == 0 case, fiemap_info doesn't store any
extents, so that's not possible.



So for the fi_extents_max == 0 case, either do it in each fs like what we
are doing, or introduce a new function like fiemap_cache_next_extent()
with a reference to the cached structure.





So I choose to merge it in btrfs.


Lifting that to the vfs interface is probably not the right approach.
The ioctl has never done any postprocessing of the data returned by
filesystems, it's really up to the filesystem to prepare the data.


OK, let's keep it in btrfs.




Signed-off-by: Qu Wenruo 
---
v2:
   Since fiemap_extent_info has a limit for the number of fiemap extents, it's
   possible that fiemap_fill_next_extent() returns 1 halfway. Remove the
   WARN_ON() which can cause a kernel warning if fiemap is called on large c

Re: [PATCH] fstests: introduce btrfs-map-logical

2017-04-12 Thread Qu Wenruo



At 04/12/2017 08:52 PM, David Sterba wrote:

On Wed, Apr 12, 2017 at 02:32:02PM +0200, David Sterba wrote:

On Wed, Apr 12, 2017 at 09:35:00AM +0800, Qu Wenruo wrote:



At 04/12/2017 09:27 AM, Liu Bo wrote:

A typical use case of 'btrfs-map-logical' is to translate btrfs logical
address to physical address on each disk.


Could we avoid usage of btrfs-map-logical here?


Agreed.


I understand that we need to do corruption so that we can test if the
repair works, but I'm not sure if the output format will change, or if
the program will get replaced by the "btrfs inspect-internal" group.


In the long term it will be replaced, but there's no ETA.


Possibly, if the fstests maintainer agrees, we can add btrfs-map-logical to
fstests. It's small and uses headers from libbtrfs, so this would become
a new dependency, but I believe it is still bearable.

I'm not sure if we should export all debugging functionality in 'btrfs',
as this is typically something that a user will never want, not even in
emergency environments. There's an overlap in the information to be
exported, but I'd be more inclined to satisfy user needs than testsuite
needs. So an independent tool would give us more freedom on both sides.

I'm working on the new btrfs-corrupt-block equivalent; considering the
demand to corrupt on-disk data for recovery tests, I could provide a tool
with fundamental corruption support.


It could corrupt on-disk data, specified either by (root, inode,
offset, length) or just by (logical address, length).

And support to corrupt given mirror or even P/Q for RAID56.
(With btrfs_map_block_v2 from offline scrub)

I'm not sure if I should just replace btrfs-corrupt-block, add a new
individual program, or add a btrfs subcommand group which is disabled by
default.


Thanks,
Qu


--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: send snapshot from snapshot incremental

2017-04-12 Thread Jakob Schürz
Am 2017-03-26 um 22:07 schrieb Peter Grandi:
> [ ... ]
>> BUT if i take a snapshot from the system, and want to transfer
>> it to the external HD, i can not set a parent subvolume,
>> because there isn't any.
> 
> Questions like this are based on incomplete understanding of
> 'send' and 'receive', and on IRC user "darkling" explained it
> fairly well:
> 
>> When you use -c, you're telling the FS that it can expect to
>> find a sent copy of that subvol on the receiving side, and
>> that anything shared with it can be sent by reference. OK, so
>> with -c on its own, you're telling the FS that "all the data
>> in this subvol already exists on the remote".
> 
>> So, when you send your subvol, *all* of the subvol's metadata
>> is sent, and where that metadata refers to an extent that's
>> shared with the -c subvol, the extent data isn't sent, because
>> it's known to be on the other end already, and can be shared
>> directly from there.
> 
>> OK. So, with -p, there's a "base" subvol. The send subvol and
>> the -p reference subvol are both snapshots of that base (at
>> different times). The -p reference subvol, as with -c, is
>> assumed to be on the remote FS. However, because it's known to
>> be an earlier version of the same data, you can be more
>> efficient in the sending by saying "start from the earlier
>> version, and modify it in this way to get the new version"
> 
>> So, with -p, not all of the metadata is sent, because you know
>> you've already got most of it on the remote in the form of the
>> earlier version.
> 
>> So -p is "take this thing and apply these differences to it"
>> and -c is "build this thing from scratch, but you can share
>> some of the data with these sources"
> 

For now, I think I got it... (maybe).

I put the following logic into my script:
1) Search for all subvolumes on the local and the remote side, where the
Received-UUID on the remote side is the same as the UUID on the local side
2) Take the parent-UUID from the snapshot I want to transfer and search
in the list from 1) which snapshot (from the local side) has the same
parent UUID.
3) Take the youngest snapshot from 2) and set it as parent for the btrfs
send command
4) Search for snapshots, local and remote, which have the same name|path
and "basename" as the snapshot I want to transfer
basename means: my system subvolume is called @debian
it contains one subvolume @debian/var/spool
the snapshot names are @debian_$TIMESTAMP and
@debian_$TIMESTAMP/var/spool
The basename is @debian and @debian/var/spool
5) Set all of the snapshots with the same basename as the snapshot to be
transferred as clones for btrfs send.

The final command involves the youngest "sister" of the snapshot I want
to transfer, which exists on both sides, set as "parent", and a bunch
of snapshots which are older (or even younger - is this a problem???)
than the snapshot I want to transfer and which contain modified as well
as unchanged data, set as clones.
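
For what it's worth, here is a condensed and hypothetical Python sketch of
steps 1) to 3) and of assembling the final command. The Snapshot fields and
helper names are made up, and the subvolume data is assumed to have been
parsed beforehand from 'btrfs subvolume list' / 'btrfs subvolume show'
output; it only illustrates the selection logic, not my actual script.

from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Snapshot:
    path: str                      # e.g. '/backup/@debian_20170412'
    uuid: str
    parent_uuid: Optional[str]
    received_uuid: Optional[str]   # only set on the receiving side
    otime: str                     # creation time, sortable string

def pick_parent(local: List[Snapshot], remote: List[Snapshot],
                to_send: Snapshot) -> Optional[Snapshot]:
    # step 1: local snapshots that already exist remotely (a remote subvol
    # whose received_uuid equals the local uuid came from a previous send)
    received = {r.received_uuid for r in remote if r.received_uuid}
    candidates = [s for s in local if s.uuid in received]
    # step 2: keep the ones sharing the parent of the snapshot to transfer
    siblings = [s for s in candidates if s.parent_uuid == to_send.parent_uuid]
    # step 3: the youngest such sibling becomes the -p parent, if any
    return max(siblings, key=lambda s: s.otime, default=None)

def build_send_cmd(to_send: Snapshot, parent: Optional[Snapshot],
                   clones: List[Snapshot]) -> List[str]:
    cmd = ['btrfs', 'send']
    if parent is not None:
        cmd += ['-p', parent.path]
    for clone in clones:           # steps 4/5: same-basename snapshots
        cmd += ['-c', clone.path]
    return cmd + [to_send.path]    # pipe the result into 'btrfs receive'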

If there is no parent (in the case of transferring a snapshot of a
snapshot...) then there are clones of this snapshot, so not all of the
data has to be sent again (and it doesn't consume double the space on the
backup media).
If there are no parents AND no clones (similar snapshots), the subvolume
seems to be totally new, and the whole thing must be transferred.
If there is a parent and clones, both of them are used to minimize the
data for the transfer, and to reuse as much as possible of the existing
data/metadata on the backup media to build the new snapshot there.

Using all of the similar snapshots (found by the snapshot name) as clones
seems to speed up the transfer compared to only using the parent (which
seems slower). Could this be, or is it only a "feeling"?

Thanks for all your advices. This helped me a lot!!

regards Jakob

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Btrfs disk layout question

2017-04-12 Thread Andrei Borzenkov
12.04.2017 20:21, Chris Murphy wrote:
> btrfs-map-logical is the tool that will convert logical to physical
> and also give what device it's on; but the device notation is copy 1
> and copy 2, so you have to infer what device that is, it's not
> explicit.
> 

Quickly checking the output - for my purposes it looks OK, as BIS just tries
to warn if a file is too far away to be accessible by the BIOS, so I am not
even interested in the specific device, just the max physical offset. Thank you!
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] fstests: introduce btrfs-map-logical

2017-04-12 Thread Eryu Guan
On Wed, Apr 12, 2017 at 02:52:23PM +0200, David Sterba wrote:
> > > I understand that we need to do corruption so that we can test if the 
> > > repair works, but I'm not sure if the output format will change, or if 
> > > the program will get replaced by the "btrfs inspect-internal" group.
> > 
> > In the long term it will be replaced, but there's no ETA.
> 
> Possibly, if fstests maintainer agrees, we can add btrfs-map-logical to
> fstests. It's small and uses headers from libbtrfs, so this would become
> a new dependency but I believe is still bearable.

IMHO, I think the ability to poke at btrfs internals really should be
provided by the btrfs-progs package and maintained by the btrfs community.
fstests provides some fs-independent C helpers to assist testing, but it
doesn't necessarily need to "understand" filesystem internals.

For historical reasons, building fstests requires xfsprogs development
headers; we'd better not introduce new fs-specific dependencies.

Thanks,
Eryu
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html