qcow2 becomes 37P in size while qemu crashes

2016-07-22 Thread Chris Murphy
Here is the bug write up so far, which contains most of the relevant details.
https://bugzilla.redhat.com/show_bug.cgi?id=1359325

Here are three teasers to get you to look at the bug:

1.
[root@f24m ~]# ls -lsh /var/lib/libvirt/images
total 57G
1.5G -rw-r-. 1 qemu qemu 1.5G Jul 21 10:54
Fedora-Workstation-Live-x86_64-24-1.2.iso
1.4G -rw-r--r--. 1 qemu qemu 1.4G Jul 20 13:28
Fedora-Workstation-Live-x86_64-Rawhide-20160718.n.0.iso
4.4G -rw-r-. 1 qemu qemu 4.4G Jul 22 10:43
openSUSE-Leap-42.2-DVD-x86_64-Build0109-Media.iso
 50G -rw-r--r--. 1 root root  37P Jul 22 13:23 uefi_opensuseleap42.2a3-1.qcow2
196K -rw-r--r--. 1 root root 193K Jul 22 08:46 uefi_opensuseleap42.2a3-2.qcow2
[root@f24m ~]#

Yes, it's using 50G worth of sectors on the drive. But then it's 37
Petabytes?! That's really weird.

[root@f24m ~]# df -h
Filesystem  Size  Used Avail Use% Mounted on
/dev/sda5   104G   67G   36G  65% /

2.
Btrfs mounts, reads, and writes just fine, with no messages in dmesg other
than the usual mount messages, before, during, and after the qemu crash
and the subsequent reboot. I rebooted to do an offline btrfs check, which
has no complaints. Scrub has no complaints. The qcow2 does have the +C
(nodatacow) attribute set, so there are no data checksums and thus no
independent way to determine if/how it's corrupt. But qemu-img does say
it's corrupt, and libvirt will not start the VM anymore with this qcow2
attached.

3.
I've attached to the bug a filefrag -v output from the 37 P file,
which has ~900 extents. There's only one thing that's a bit out of the
ordinary, which is mentioned in the bug.

Pretty weird. To try to reproduce this I kinda need to delete that
qcow2 file. So if anyone has suggestions on what other information to
put in that bug report before I change the state of the system, lemme
know.
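
For reference, a few read-only checks that could be captured before the
file is deleted; the paths are just the ones from the listing above, so
adjust as needed:

# apparent size vs. blocks actually allocated on disk
stat -c '%n: %s bytes apparent, %b blocks of %B bytes allocated' \
    /var/lib/libvirt/images/uefi_opensuseleap42.2a3-1.qcow2
du -h --apparent-size /var/lib/libvirt/images/uefi_opensuseleap42.2a3-1.qcow2
du -h /var/lib/libvirt/images/uefi_opensuseleap42.2a3-1.qcow2

# confirm the nodatacow attribute; lsattr prints 'C' if it is set
lsattr /var/lib/libvirt/images/uefi_opensuseleap42.2a3-1.qcow2

# qcow2-level view of the image; both are read-only operations
qemu-img info /var/lib/libvirt/images/uefi_opensuseleap42.2a3-1.qcow2
qemu-img check /var/lib/libvirt/images/uefi_opensuseleap42.2a3-1.qcow2

# extent layout (already attached to the bug; listed here for completeness)
filefrag -v /var/lib/libvirt/images/uefi_opensuseleap42.2a3-1.qcow2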

-- 
Chris Murphy


Re: Send-recieve performance

2016-07-22 Thread Libor Klepáč
Hello,

On Friday, 22 July 2016 14:59:43 CEST, Henk Slager wrote:
> On Wed, Jul 20, 2016 at 11:15 AM, Libor Klepáč  wrote:
> > Hello,
> > we use backuppc to backup our hosting machines.
> > 
> > I have recently migrated it to btrfs, so we can use send-recieve for
> > offsite backups of our backups.
> > 
> > I have several btrfs volumes, each hosts nspawn container, which runs in
> > /system subvolume and has backuppc data in /backuppc subvolume .
> > I use btrbk to do snapshots and transfer.
> > Local side is set to keep 5 daily snapshots, remote side to hold some
> > history. (not much yet, i'm using it this way for few weeks).
> > 
> > If you know backuppc behaviour: for every backup (even incremental), it
> > creates full directory tree of each backed up machine even if it has no
> > modified files and places one small file in each, which holds some info
> > for backuppc. So after few days i ran into ENOSPACE on one volume,
> > because my metadata grow, because of inlineing. I switched from mdata=DUP
> > to mdata=single (now I see it's possible to change inline file size,
> > right?).
> I would try mounting both send and receive volumes with max_inline=0
> So then for all small new- and changed files, the filedata will be
> stored in data chunks and not inline in the metadata chunks.

OK, I will try. Is there a way to move existing files from metadata to data
chunks? Something like btrfs balance with a convert filter?
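
For reference, a rough sketch of one way to do that, assuming the paths from
the mount listing further down; whether a recursive defragment actually
rewrites already-inlined extents depends on the kernel version, so it is
worth spot-checking a file with filefrag afterwards:

# stop inlining new and changed small files (assumes max_inline is honored
# on remount; otherwise set it in fstab and remount/reboot)
mount -o remount,max_inline=0 /mnt/btrfs/hosting

# rewriting existing files may push their data out of the metadata;
# verify the effect before relying on it
btrfs filesystem defragment -r /var/lib/container/hosting/var/lib/backuppc

# a file whose data moved out of line shows a regular data extent here
# (the path below is only a placeholder)
filefrag -v /var/lib/container/hosting/var/lib/backuppc/path/to/small/file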

> That you changed metadata profile from dup to single is unrelated in
> principle. single for metadata instead of dup is half the write I/O
> for the harddisks, so in that sense it might speed up send actions a
> bit. I guess almost all time is spend in seeks.

Yes, I just didn't realize that so many files would end up in metadata
structures, and it caught me by surprise.

> 
> > My problem is, that on some volumes, send-recieve is relatively fast (rate
> > in MB/s or hundreds of kB/s) but on biggest volume (biggest in space and
> > biggest in contained filesystem trees) rate is just 5-30kB/s.
> > 
> > Here is btrbk progress copyed
> > 785MiB 47:52:00 [12.9KiB/s] [4.67KiB/s]
> > 
> > ie. 758MB in 48 hours.
> > 
> > Reciever has high IO/wait - 90-100%, when i push data using btrbk.
> > When I run dd over ssh it can do 50-75MB/s.
> 
> The send part is the speed bottleneck as it looks like, you can test
> and isolate it by doing a dummy send and pipe it to  | mbuffer >
> /dev/null  and see what speed you get.

I tried it already; I did an incremental send to a file:
#btrfs send -v -p ./backuppc.20160712/ ./backuppc.20160720_1/ | pv > /mnt/data1/send
At subvol ./backuppc.20160720_1/
joining genl thread
18.9GiB 21:14:45 [ 259KiB/s]

Copied it over scp to the receiver at 50.9MB/s.
Now I will try receive.
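
For reference, the receive side can be timed in isolation by replaying that
stream from local disk on the receiver, so neither send nor the network is
involved (paths are illustrative):

# replay the saved stream into the receiving filesystem and watch the rate
pv /mnt/data1/send | btrfs receive /mnt/btrfs/hosting/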


> > Sending machine is debian jessie with kernel 4.5.0-0.bpo.2-amd64 (upstream
> > 4.5.3) , btrfsprogs 4.4.1. It is virtual machine running on volume
> > exported from MD3420, 4 SAS disks in RAID10.
> > 
> > Recieving machine is debian jessie on Dell T20 with 4x3TB disks in MD
> > RAID5 , kernel is 4.4.0-0.bpo.1-amd64 (upstream 4.4.6), btrfsprgos 4.4.1
> > 
> > BTRFS volumes were created using those listed versions.
> > 
> > Sender:
> > -
> > #mount | grep hosting
> > /dev/sdg on /mnt/btrfs/hosting type btrfs
> > (rw,noatime,space_cache,subvolid=5,subvol=/) /dev/sdg on
> > /var/lib/container/hosting type btrfs
> > (rw,noatime,space_cache,subvolid=259,subvol=/system) /dev/sdg on
> > /var/lib/container/hosting/var/lib/backuppc type btrfs
> > (rw,noatime,space_cache,subvolid=260,subvol=/backuppc)
> > 
> > #btrfs filesystem usage /mnt/btrfs/hosting
> > 
> > Overall:
> > Device size: 840.00GiB
> > Device allocated:815.03GiB
> > Device unallocated:   24.97GiB
> > Device missing:  0.00B
> > Used:522.76GiB
> > Free (estimated):283.66GiB  (min: 271.18GiB)
> > Data ratio:   1.00
> > Metadata ratio:   1.00
> > Global reserve:  512.00MiB  (used: 0.00B)
> > 
> > Data,single: Size:710.98GiB, Used:452.29GiB
> > 
> >/dev/sdg  710.98GiB
> > 
> > Metadata,single: Size:103.98GiB, Used:70.46GiB
> > 
> >/dev/sdg  103.98GiB
> 
> This is a very large ratio metadata/data. Large and scattered
> metadata, even on fast rotational media, will result in slow send
> operation is my experience ( incremental send, about 10G metadata). So
> hopefully, when all the small files and many directories from backuppc
> are in data chunks and metadata is significantly smaller, send will be
> faster. However, maybe it is just the huge amount of files and not
> inlining of small files that makes metadata so big.
BackupPC says:
"Pool is 462.30GB comprising 5140707 files and 4369 directories"
and that is only the pool storage of file contents, not counting all the
per-server directory trees.
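
As a rough way to gauge how much of that is even a candidate for inlining,
counting directories and small files gives an upper bound. The threshold
below is approximate; the size that can actually be inlined depends on the
max_inline mount option and the node size, so a few KiB is only a cut-off
for counting. Paths are the sender's, from the mount listing quoted above:

# directories recreated for every backup, and files small enough to inline
find /var/lib/container/hosting/var/lib/backuppc -xdev -type d | wc -l
find /var/lib/container/hosting/var/lib/backuppc -xdev -type f -size -4k | wc -l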

> 
> I assume incremental send of snapshots is done.

Yes, it was incremental

Is a

Re: [PATCH 1/2] Btrfs: be more precise on errors when getting an inode from disk

2016-07-22 Thread J. Bruce Fields
On Fri, Jul 22, 2016 at 12:40:26PM +1000, NeilBrown wrote:
> On Fri, Jul 22 2016, J. Bruce Fields wrote:
> 
> > On Fri, Jul 22, 2016 at 11:08:17AM +1000, NeilBrown wrote:
> >> On Fri, Jun 10 2016, fdman...@kernel.org wrote:
> >> 
> >> > From: Filipe Manana 
> >> >
> >> > When we attempt to read an inode from disk, we end up always returning an
> >> > -ESTALE error to the caller regardless of the actual failure reason, 
> >> > which
> >> > can be an out of memory problem (when allocating a path), some error 
> >> > found
> >> > when reading from the fs/subvolume btree (like a genuine IO error) or the
> >> > inode does not exists. So lets start returning the real error code to the
> >> > callers so that they don't treat all -ESTALE errors as meaning that the
> >> > inode does not exists (such as during orphan cleanup). This will also be
> >> > needed for a subsequent patch in the same series dealing with a special
> >> > fsync case.
> >> >
> >> > Signed-off-by: Filipe Manana 
> >> 
> >> SNIP
> >> 
> >> > @@ -5594,7 +5602,8 @@ struct inode *btrfs_iget(struct super_block *s, 
> >> > struct btrfs_key *location,
> >> >  } else {
> >> >  unlock_new_inode(inode);
> >> >  iput(inode);
> >> > -inode = ERR_PTR(-ESTALE);
> >> > +ASSERT(ret < 0);
> >> > +inode = ERR_PTR(ret < 0 ? ret : -ESTALE);
> >> >  }
> >> 
> >> Just a heads-up.  This change breaks NFS :-(
> >> 
> >> The change in error code percolates up the call chain:
> >> 
> >>  nfsd4_putfh->fh_verify->nfsd_set_fh_dentry->exportfs_decode_fh
> >> ->btrfs_fh_to_dentry->btrfs_get_dentry->btrfs_iget
> >> 
> >> and nfsd returns NFS4ERR_NOENT to the client instead of NFS4ERR_STALE,
> >> and the client doesn't handle that quite the same way.
> >> 
> >> This doesn't mean that the change is wrong, but it could mean we need to
> >> fix something else in the path to sanitize the error code.
> >> 
> >> nfsd_set_fh_dentry already has
> >> 
> >>error = nfserr_stale;
> >>if (PTR_ERR(exp) == -ENOENT)
> >>return error;
> >> 
> >>if (IS_ERR(exp))
> >>return nfserrno(PTR_ERR(exp));
> >> 
> >> for a different error case, so duplicating that would work, but I doubt
> >> it is best.  At the very least we should check for valid errors, not
> >> specific invalid ones.
> >> 
> >> Bruce: do you have an opinion where we should make sure that PUTFH (and
> >> various other requests) returns a valid error code?
> >
> > Uh, I guess not.  Maybe exportfs_decode_fh?
> >
> > Though my kneejerk reaction is to be cranky and wonder why btrfs
> > suddenly needs a different convention for decode_fh().
> >
> 
> I can certainly agree with that perspective, though it would be
> appropriate in that case to make sure we document the requirements for
> fh_to_dentry (the current spelling for 'decode_fh').  So I went looking
> for documentation and found:
> 
>  * fh_to_dentry:
>  *@fh_to_dentry is given a &struct super_block (@sb) and a file handle
>  *fragment (@fh, @fh_len). It should return a &struct dentry which refers
>  *to the same file that the file handle fragment refers to.  If it cannot,
>  *it should return a %NULL pointer if the file was found but no acceptable
>  *&dentries were available, or an %ERR_PTR error code indicating why it
>  *couldn't be found (e.g. %ENOENT or %ENOMEM).  Any suitable dentry can be
>  *returned including, if necessary, a new dentry created with 
> d_alloc_root.
>  *The caller can then find any other extant dentries by following the
>  *d_alias links.
>  *
> 
> So the new btrfs code is actually conformant!!
> That documentation dates back to 2002 when I wrote it
> And it looks like ENOENT wasn't handled correctly then :-(
> 
> I suspect anything that isn't ENOMEM should be converted to ESTALE.
> ENOMEM causes the client to be asked to retry the request later.
> 
> Does this look reasonable to you?
> (Adding Christof as he as contributed a lot to exportfs)
> 
> If there is agreement I'll test and post a proper patch.

I can live with it.  It bothers me that we're losing potentially useful
information about what went wrong in the filesystem.  Maybe this is a
place a dprintk could be handy?

--b.

> 
> Thanks,
> NeilBrown
> 
> 
> diff --git a/fs/exportfs/expfs.c b/fs/exportfs/expfs.c
> index 207ba8d627ca..3527b58cd5bc 100644
> --- a/fs/exportfs/expfs.c
> +++ b/fs/exportfs/expfs.c
> @@ -428,10 +428,10 @@ struct dentry *exportfs_decode_fh(struct vfsmount *mnt, 
> struct fid *fid,
>   if (!nop || !nop->fh_to_dentry)
>   return ERR_PTR(-ESTALE);
>   result = nop->fh_to_dentry(mnt->mnt_sb, fid, fh_len, fileid_type);
> - if (!result)
> - result = ERR_PTR(-ESTALE);
> - if (IS_ERR(result))
> - return result;
> + if (PTR_ERR(result) == -ENOMEM)
> + return ERR_CAST(result)
> + if (IS_E

Re: [PATCH] btrfs-progs: Make RAID stripesize configurable

2016-07-22 Thread Austin S. Hemmelgarn

On 2016-07-22 12:06, Sanidhya Solanki wrote:

> On Fri, 22 Jul 2016 10:58:59 -0400
> "Austin S. Hemmelgarn"  wrote:
>
>>
>> On 2016-07-22 09:42, Sanidhya Solanki wrote:
>>
>>> +*stripesize=*;;
>>> +Specifies the new stripe size for a filesystem instance. Multiple BTrFS
>>> +filesystems mounted in parallel with varying stripe size are supported, the
>>> only
>>> +limitation being that the stripe size provided to balance in this option must
>>> +be a multiple of 512 bytes, and greater than 512 bytes, but not larger than
>>> +16 KiBytes. These limitations exist in the user's best interest. due to sizes
>>> too
>>> +large or too small leading to performance degradations on modern devices.
>>> +
>>> +It is recommended that the user try various sizes to find one that best suit
>>> the
>>> +performance requirements of the system. This option renders the RAID instance
>>> as
>>> +in-compatible with previous kernel versions, due to the basis for this
>>> operation
>>> +being implemented through FS metadata.
>>> +

>> I'm actually somewhat curious to see numbers for sizes larger than 16k.
>> In most cases, that probably will be either higher or lower than the
>> point at which performance starts suffering.  On an set of fast SSD's,
>> that's almost certainly lower than the turnover point (I can't give an
>> opinion on BTRFS, but for DM-RAID, the point at which performance starts
>> degrading significantly is actually 64k on the SSD's I use), while on a
>> set of traditional hard drives, it may be as low as 4k (yes, I have
>> actually seen systems where this is the case).  I think that we should
>> warn about sizes larger than 16k, not refuse to use them, especially
>> because the point of optimal performance will shift when we get proper
>> I/O parallelization.  Or, better yet, warn about changing this at all,
>> and assume that if the user continues they know what they're doing.


> I agree with you from a limited point of view. Your considerations are
> relevant for a more broad, but general, set of circumstances.
>
> My consideration is worst case scenario, particularly on SSDs, where,
> say, you pick 8KiB or 16 KiB, write out all your data, then delete a
> block, which will have to be read-erase-written on a multi-page level,
> usually 4KiB in size.
I don't know what SSD's you've been looking at, but the erase block size 
on all of the modern NAND MLC based SSD's I've seen is between 1 and 8 
megabytes, so it would lead to at most a single erase block being 
rewritten.  Even most of the NAND SLC based SSD's I've seen have at 
least a 64k erase block.  Overall, the only case this is reasonably 
going to lead to a multi-page rewrite is if the filesystem isn't 
properly aligned, which is not a likely situation for most people.


> On HDDs, this will make the problem of fragmenting even worse. On HDDs,
> I would only recommend setting stripe block size to the block level
> (usually 4KiB native, 512B emulated), but this just me focusing on the
> worst case scenario.
And yet, software RAID implementations do fine with larger stripe sizes. 
 On my home server, I'm using BTRFS in RAID1 mode on top of LVM managed 
DM-RAID0 volumes, and I actually have gone through testing every power 
of 2 stripe size in this configuration for the DM-RAID volumes from 1k 
up to 64k.  I get peak performance using a 16k stripe size, and the 
performance actually falls off faster at lower sizes than it does at 
higher ones (at least, within the range I checked).  I've seen similar 
results on all the server systems I manage for work as well, so it's not 
just consumer hard drives that behave like this.
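
For anyone who wants to repeat that kind of sweep, a rough sketch using an
LVM striped LV and fio (volume group, LV name, and sizes are made up; the
test writes to the LV, so do not point it at a device holding data):

# create a 2-disk striped LV with a 16 KiB stripe size
lvcreate -i 2 -I 16k -L 20G -n stripetest vg0

# sequential write and random read passes against the raw LV
fio --name=seqwrite --filename=/dev/vg0/stripetest --rw=write --bs=1M \
    --size=10G --direct=1 --ioengine=libaio --iodepth=16
fio --name=randread --filename=/dev/vg0/stripetest --rw=randread --bs=4k \
    --runtime=60 --time_based --direct=1 --ioengine=libaio --iodepth=16

# tear down and repeat with -I 1k, 2k, 4k ... 64k, then compare results
lvremove -y vg0/stripetest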


> Maybe I will add these warnings in a follow-on patch, if others agree
> with these statements and concerns.
The other part of my issue with this, which I forgot to state, is that two
types of people are likely to use this feature:
1. Those who actually care about performance and are willing to test 
multiple configurations to find an optimal one.
2. Those who claim to care about performance, but either just twiddle 
things randomly or blindly follow advice from others without really 
knowing for certain what they're doing.
The only people settings like this actually help to a reasonable degree 
are in the first group.  Putting an upper limit on the stripe size caters
to protecting the second group (who shouldn't be using this to begin 
with) at the expense of the first group.  This doesn't affect data 
safety (or at least, it shouldn't), it only impacts performance, the 
system is still usable even if this is set poorly, so the value of 
trying to make it resistant to stupid users is not all that great.


Additionally, unless you have numbers to back up 16k being the practical 
maximum on most devices, then it's really just an arbitrary number, 
which is something that should be avoided in management tools.



Re: [PATCH] btrfs-progs: Make RAID stripesize configurable

2016-07-22 Thread Sanidhya Solanki
On Fri, 22 Jul 2016 10:58:59 -0400
"Austin S. Hemmelgarn"  wrote:

> On 2016-07-22 09:42, Sanidhya Solanki wrote:
> > +*stripesize=*;;
> > +Specifies the new stripe size for a filesystem instance. Multiple BTrFS
> > +filesystems mounted in parallel with varying stripe size are supported, 
> > the only
> > +limitation being that the stripe size provided to balance in this option 
> > must
> > +be a multiple of 512 bytes, and greater than 512 bytes, but not larger than
> > +16 KiBytes. These limitations exist in the user's best interest. due to 
> > sizes too
> > +large or too small leading to performance degradations on modern devices.
> > +
> > +It is recommended that the user try various sizes to find one that best 
> > suit the
> > +performance requirements of the system. This option renders the RAID 
> > instance as
> > +in-compatible with previous kernel versions, due to the basis for this 
> > operation
> > +being implemented through FS metadata.
> > +  
> I'm actually somewhat curious to see numbers for sizes larger than 16k. 
> In most cases, that probably will be either higher or lower than the 
> point at which performance starts suffering.  On an set of fast SSD's, 
> that's almost certainly lower than the turnover point (I can't give an 
> opinion on BTRFS, but for DM-RAID, the point at which performance starts 
> degrading significantly is actually 64k on the SSD's I use), while on a 
> set of traditional hard drives, it may be as low as 4k (yes, I have 
> actually seen systems where this is the case).  I think that we should 
> warn about sizes larger than 16k, not refuse to use them, especially 
> because the point of optimal performance will shift when we get proper 
> I/O parallelization.  Or, better yet, warn about changing this at all, 
> and assume that if the user continues they know what they're doing.

I agree with you from a limited point of view. Your considerations are
relevant for a broader, more general set of circumstances.

My consideration is worst case scenario, particularly on SSDs, where,
say, you pick 8KiB or 16 KiB, write out all your data, then delete a
block, which will have to be read-erase-written on a multi-page level,
usually 4KiB in size.

On HDDs, this will make the fragmentation problem even worse. There, I
would only recommend setting the stripe block size to the device block
size (usually 4KiB native, 512B emulated), but this is just me focusing
on the worst-case scenario.
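
One way to sanity-check that worst-case reasoning against a concrete device
is to look at what the kernel reports for it; these values are often zero or
misleading on consumer SSDs, so treat them as a hint at best (replace sda
with the device in question):

cat /sys/block/sda/queue/logical_block_size
cat /sys/block/sda/queue/physical_block_size
cat /sys/block/sda/queue/minimum_io_size
cat /sys/block/sda/queue/optimal_io_size
cat /sys/block/sda/queue/discard_granularity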

Maybe I will add these warnings in a follow-on patch, if others agree
with these statements and concerns.

Thanks
Sanidhya


Re: [PATCH] btrfs-progs: Make RAID stripesize configurable

2016-07-22 Thread Austin S. Hemmelgarn

On 2016-07-22 09:42, Sanidhya Solanki wrote:

Adds the user-space component of making the RAID stripesize user configurable.
Updates the btrfs-documentation to provide the information to users.
Adds parsing capabilities for the new options.
Adds the means of transfering the data to kernel space.
Updates the kernel ioctl interface to account for new options.
Updates the user-space component of RAID stripesize management.
Updates the TODO list for future tasks.

Patch applies to the v4.6.1 release branch.

Signed-off-by: Sanidhya Solanki 
---
 Documentation/btrfs-balance.asciidoc | 14 +
 btrfs-convert.c  | 59 +++-
 btrfs-image.c|  4 ++-
 btrfsck.h|  2 +-
 chunk-recover.c  |  8 +++--
 cmds-balance.c   | 45 +--
 cmds-check.c |  4 ++-
 disk-io.c| 10 --
 extent-tree.c|  4 ++-
 ioctl.h  | 10 --
 mkfs.c   | 18 +++
 raid6.c  |  3 ++
 utils.c  |  4 ++-
 volumes.c| 18 ---
 volumes.h| 12 +---
 15 files changed, 170 insertions(+), 45 deletions(-)

diff --git a/Documentation/btrfs-balance.asciidoc 
b/Documentation/btrfs-balance.asciidoc
index 7df40b9..fd61523 100644
--- a/Documentation/btrfs-balance.asciidoc
+++ b/Documentation/btrfs-balance.asciidoc
@@ -32,6 +32,7 @@ The filters can be used to perform following actions:
 - convert block group profiles (filter 'convert')
 - make block group usage more compact  (filter 'usage')
 - perform actions only on a given device (filters 'devid', 'drange')
+- perform an operation that changes the stripe size for a RAID instance

 The filters can be applied to a combination of block group types (data,
 metadata, system). Note that changing 'system' needs the force option.
@@ -157,6 +158,19 @@ is a range specified as 'start..end'. Makes sense for 
block group profiles that
 utilize striping, ie. RAID0/10/5/6.  The range minimum and maximum are
 inclusive.

+*stripesize=*;;
+Specifies the new stripe size for a filesystem instance. Multiple BTrFS
+filesystems mounted in parallel with varying stripe size are supported, the 
only
+limitation being that the stripe size provided to balance in this option must
+be a multiple of 512 bytes, and greater than 512 bytes, but not larger than
+16 KiBytes. These limitations exist in the user's best interest. due to sizes 
too
+large or too small leading to performance degradations on modern devices.
+
+It is recommended that the user try various sizes to find one that best suit 
the
+performance requirements of the system. This option renders the RAID instance 
as
+in-compatible with previous kernel versions, due to the basis for this 
operation
+being implemented through FS metadata.
+
I'm actually somewhat curious to see numbers for sizes larger than 16k. 
In most cases, that probably will be either higher or lower than the 
point at which performance starts suffering.  On a set of fast SSDs,
that's almost certainly lower than the turnover point (I can't give an 
opinion on BTRFS, but for DM-RAID, the point at which performance starts 
degrading significantly is actually 64k on the SSD's I use), while on a 
set of traditional hard drives, it may be as low as 4k (yes, I have 
actually seen systems where this is the case).  I think that we should 
warn about sizes larger than 16k, not refuse to use them, especially 
because the point of optimal performance will shift when we get proper 
I/O parallelization.  Or, better yet, warn about changing this at all, 
and assume that if the user continues they know what they're doing.

 *soft*::
 Takes no parameters. Only has meaning when converting between profiles.
 When doing convert from one profile to another and soft mode is on,
diff --git a/btrfs-convert.c b/btrfs-convert.c
index b18de59..dc796d0 100644
--- a/btrfs-convert.c
+++ b/btrfs-convert.c
@@ -278,12 +278,14 @@ static int intersect_with_sb(u64 bytenr, u64 num_bytes)
 {
int i;
u64 offset;
+   extern u32 sz_stripe;
+   extern u32 stripe_width;

for (i = 0; i < BTRFS_SUPER_MIRROR_MAX; i++) {
offset = btrfs_sb_offset(i);
-   offset &= ~((u64)BTRFS_STRIPE_LEN - 1);
+   offset &= ~((u64)((sz_stripe) * (stripe_width)) - 1);

-   if (bytenr < offset + BTRFS_STRIPE_LEN &&
+   if (bytenr < offset + ((sz_stripe) * (stripe_width)) &&
bytenr + num_bytes > offset)
return 1;
}
@@ -603,6 +605,8 @@ static int block_iterate_proc(u64 disk_block, u64 
file_block,
int ret = 0;
int sb_region;
int do_barrier;
+   extern u32 sz_stripe;
+   extern u

Re: Send-recieve performance

2016-07-22 Thread Martin Raiber
On 20.07.2016 11:15 Libor Klepáč wrote:
> Hello,
> we use backuppc to backup our hosting machines.
>
> I have recently migrated it to btrfs, so we can use send-recieve for offsite 
> backups of our backups.
>
> I have several btrfs volumes, each hosts nspawn container, which runs in 
> /system subvolume and has backuppc data in /backuppc subvolume
> .
> I use btrbk to do snapshots and transfer.
> Local side is set to keep 5 daily snapshots, remote side to hold some 
> history. (not much yet, i'm using it this way for few weeks).
>
> If you know backuppc behaviour: for every backup (even incremental), it 
> creates full directory tree of each backed up machine even if it has no 
> modified files and places one small file in each, which holds some info for 
> backuppc. 
> So after few days i ran into ENOSPACE on one volume, because my metadata 
> grow, because of inlineing.
> I switched from mdata=DUP to mdata=single (now I see it's possible to change 
> inline file size, right?).
I am biased, but UrBackup works like BackupPC, except that it has a client
and, like btrbk, it puts every backup into a separate btrfs subvolume, with
snapshotting reducing the metadata workload. Then you could create read-only
snapshots from the UrBackup subvolumes and use e.g. buttersink to copy
those to another btrfs.

So maybe try that?
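
For reference, a minimal sketch of that flow using plain send/receive
(buttersink would replace the last step; the subvolume paths and host name
are only placeholders):

# make a read-only snapshot of a finished backup, then ship it incrementally
btrfs subvolume snapshot -r /backups/clientA/160722 /backups/ro/clientA-160722
btrfs send -p /backups/ro/clientA-160721 /backups/ro/clientA-160722 \
    | ssh offsitehost btrfs receive /mnt/offsite/clientA/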

Regards,
Martin





Re: [PATCH] btrfs: Change RAID stripesize to a user-configurable option

2016-07-22 Thread Sanidhya Solanki
Applies to v4.7rc7 release kernel.

Sanidhya


[PATCH] btrfs: Change RAID stripesize to a user-configurable option

2016-07-22 Thread Sanidhya Solanki
Adds the kernel component of making the RAID stripesize user configurable.
Updates the kernel ioctl interface to account for new options.
Updates the existing implementations of RAID stripesize in metadata.
Make the stripesize a user-configurable option.
Convert the existing metadata option of stripesize into the basis for
this option.
Updates the kernel component of RAID stripesize management.
Update the RAID stripe block management.

Signed-off-by: Sanidhya Solanki 
---
 fs/btrfs/ctree.h| 21 ++--
 fs/btrfs/disk-io.c  | 12 ++-
 fs/btrfs/extent-tree.c  |  2 ++
 fs/btrfs/ioctl.c|  2 ++
 fs/btrfs/raid56.c   | 19 ++
 fs/btrfs/scrub.c|  6 --
 fs/btrfs/super.c| 12 ++-
 fs/btrfs/volumes.c  | 44 ++---
 fs/btrfs/volumes.h  |  3 +--
 include/trace/events/btrfs.h|  2 ++
 include/uapi/linux/btrfs.h  | 13 ++--
 include/uapi/linux/btrfs_tree.h | 10 --
 12 files changed, 119 insertions(+), 27 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 4274a7b..3fa4723 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -2139,6 +2139,25 @@ static inline void btrfs_set_balance_data(struct 
extent_buffer *eb,
write_eb_member(eb, bi, struct btrfs_balance_item, data, ba);
 }
 
+static inline void btrfs_balance_raid(struct extent_buffer *eb,
+ struct btrfs_balance_item *bi,
+ struct btrfs_disk_balance_args *ba)
+{
+   extern u32 sz_stripe;
+   extern u32 stripe_width;
+
+   sz_stripe = ba->sz_stripe;
+   stripe_width = ((64 * 1024) / sz_stripe);
+   read_eb_member(eb, bi, struct btrfs_balance_item, data, ba);
+}
+
+static inline void btrfs_set_balance_raid(struct extent_buffer *eb,
+   struct btrfs_balance_item *bi,
+   struct btrfs_disk_balance_args *ba)
+{
+   write_eb_member(eb, bi, struct btrfs_balance_item, data, ba);
+}
+
 static inline void btrfs_balance_meta(struct extent_buffer *eb,
  struct btrfs_balance_item *bi,
  struct btrfs_disk_balance_args *ba)
@@ -2233,8 +2252,6 @@ BTRFS_SETGET_STACK_FUNCS(super_sectorsize, struct 
btrfs_super_block,
 sectorsize, 32);
 BTRFS_SETGET_STACK_FUNCS(super_nodesize, struct btrfs_super_block,
 nodesize, 32);
-BTRFS_SETGET_STACK_FUNCS(super_stripesize, struct btrfs_super_block,
-stripesize, 32);
 BTRFS_SETGET_STACK_FUNCS(super_root_dir, struct btrfs_super_block,
 root_dir_objectid, 64);
 BTRFS_SETGET_STACK_FUNCS(super_num_devices, struct btrfs_super_block,
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 60ce119..45344ed 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -2523,6 +2523,8 @@ int open_ctree(struct super_block *sb,
struct btrfs_root *tree_root;
struct btrfs_root *chunk_root;
int ret;
+   extern u32 sz_stripe;
+   extern u32 stripe_width;
int err = -EINVAL;
int num_backups_tried = 0;
int backup_index = 0;
@@ -2704,7 +2706,7 @@ int open_ctree(struct super_block *sb,
goto fail_alloc;
}
 
-   __setup_root(4096, 4096, 4096, tree_root,
+   __setup_root(4096, 4096, sz_stripe, tree_root,
 fs_info, BTRFS_ROOT_TREE_OBJECTID);
 
invalidate_bdev(fs_devices->latest_bdev);
@@ -2806,7 +2808,7 @@ int open_ctree(struct super_block *sb,
 
nodesize = btrfs_super_nodesize(disk_super);
sectorsize = btrfs_super_sectorsize(disk_super);
-   stripesize = sectorsize;
+   stripesize = sz_stripe;
fs_info->dirty_metadata_batch = nodesize * (1 + ilog2(nr_cpu_ids));
fs_info->delalloc_batch = sectorsize * 512 * (1 + ilog2(nr_cpu_ids));
 
@@ -4050,6 +4052,7 @@ static int btrfs_check_super_valid(struct btrfs_fs_info 
*fs_info,
u64 nodesize = btrfs_super_nodesize(sb);
u64 sectorsize = btrfs_super_sectorsize(sb);
int ret = 0;
+   extern u32 sz_stripe;
 
if (btrfs_super_magic(sb) != BTRFS_MAGIC) {
printk(KERN_ERR "BTRFS: no valid FS found\n");
@@ -4133,9 +4136,8 @@ static int btrfs_check_super_valid(struct btrfs_fs_info 
*fs_info,
   btrfs_super_bytes_used(sb));
ret = -EINVAL;
}
-   if (!is_power_of_2(btrfs_super_stripesize(sb))) {
-   btrfs_err(fs_info, "invalid stripesize %u",
-  btrfs_super_stripesize(sb));
+   if (!is_power_of_2(sz_stripe)) {
+   btrfs_err(fs_info, "invalid stripesize %u", sz_stripe);
ret = -EINVAL;
}
if (btrfs_super_num_devices(sb) > (1UL << 31))
diff --git a

[PATCH] btrfs-progs: Make RAID stripesize configurable

2016-07-22 Thread Sanidhya Solanki
Adds the user-space component of making the RAID stripesize user configurable.
Updates the btrfs-documentation to provide the information to users.
Adds parsing capabilities for the new options.
Adds the means of transferring the data to kernel space.
Updates the kernel ioctl interface to account for new options.
Updates the user-space component of RAID stripesize management.
Updates the TODO list for future tasks.

Patch applies to the v4.6.1 release branch.

Signed-off-by: Sanidhya Solanki 
---
 Documentation/btrfs-balance.asciidoc | 14 +
 btrfs-convert.c  | 59 +++-
 btrfs-image.c|  4 ++-
 btrfsck.h|  2 +-
 chunk-recover.c  |  8 +++--
 cmds-balance.c   | 45 +--
 cmds-check.c |  4 ++-
 disk-io.c| 10 --
 extent-tree.c|  4 ++-
 ioctl.h  | 10 --
 mkfs.c   | 18 +++
 raid6.c  |  3 ++
 utils.c  |  4 ++-
 volumes.c| 18 ---
 volumes.h| 12 +---
 15 files changed, 170 insertions(+), 45 deletions(-)

diff --git a/Documentation/btrfs-balance.asciidoc 
b/Documentation/btrfs-balance.asciidoc
index 7df40b9..fd61523 100644
--- a/Documentation/btrfs-balance.asciidoc
+++ b/Documentation/btrfs-balance.asciidoc
@@ -32,6 +32,7 @@ The filters can be used to perform following actions:
 - convert block group profiles (filter 'convert')
 - make block group usage more compact  (filter 'usage')
 - perform actions only on a given device (filters 'devid', 'drange')
+- perform an operation that changes the stripe size for a RAID instance
 
 The filters can be applied to a combination of block group types (data,
 metadata, system). Note that changing 'system' needs the force option.
@@ -157,6 +158,19 @@ is a range specified as 'start..end'. Makes sense for 
block group profiles that
 utilize striping, ie. RAID0/10/5/6.  The range minimum and maximum are
 inclusive.
 
+*stripesize=*;;
+Specifies the new stripe size for a filesystem instance. Multiple BTrFS
+filesystems mounted in parallel with varying stripe size are supported, the 
only
+limitation being that the stripe size provided to balance in this option must
+be a multiple of 512 bytes, and greater than 512 bytes, but not larger than
+16 KiBytes. These limitations exist in the user's best interest. due to sizes 
too
+large or too small leading to performance degradations on modern devices.
+
+It is recommended that the user try various sizes to find one that best suit 
the
+performance requirements of the system. This option renders the RAID instance 
as
+in-compatible with previous kernel versions, due to the basis for this 
operation
+being implemented through FS metadata.
+
 *soft*::
 Takes no parameters. Only has meaning when converting between profiles.
 When doing convert from one profile to another and soft mode is on,
diff --git a/btrfs-convert.c b/btrfs-convert.c
index b18de59..dc796d0 100644
--- a/btrfs-convert.c
+++ b/btrfs-convert.c
@@ -278,12 +278,14 @@ static int intersect_with_sb(u64 bytenr, u64 num_bytes)
 {
int i;
u64 offset;
+   extern u32 sz_stripe;
+   extern u32 stripe_width;
 
for (i = 0; i < BTRFS_SUPER_MIRROR_MAX; i++) {
offset = btrfs_sb_offset(i);
-   offset &= ~((u64)BTRFS_STRIPE_LEN - 1);
+   offset &= ~((u64)((sz_stripe) * (stripe_width)) - 1);
 
-   if (bytenr < offset + BTRFS_STRIPE_LEN &&
+   if (bytenr < offset + ((sz_stripe) * (stripe_width)) &&
bytenr + num_bytes > offset)
return 1;
}
@@ -603,6 +605,8 @@ static int block_iterate_proc(u64 disk_block, u64 
file_block,
int ret = 0;
int sb_region;
int do_barrier;
+   extern u32 sz_stripe;
+   extern u32 stripe_width;
struct btrfs_root *root = idata->root;
struct btrfs_block_group_cache *cache;
u64 bytenr = disk_block * root->sectorsize;
@@ -629,8 +633,8 @@ static int block_iterate_proc(u64 disk_block, u64 
file_block,
}
 
if (sb_region) {
-   bytenr += BTRFS_STRIPE_LEN - 1;
-   bytenr &= ~((u64)BTRFS_STRIPE_LEN - 1);
+   bytenr += ((sz_stripe) * (stripe_width)) - 1;
+   bytenr &= ~((u64)((sz_stripe) * (stripe_width)) - 1);
} else {
cache = btrfs_lookup_block_group(root->fs_info, bytenr);
BUG_ON(!cache);
@@ -1269,6 +1273,8 @@ static int create_image_file_range(struct 
btrfs_trans_handle *trans,
u64 disk_bytenr;
int i;
int ret;
+   extern u32 sz_stripe;
+   e

Re: Send-recieve performance

2016-07-22 Thread Henk Slager
On Wed, Jul 20, 2016 at 11:15 AM, Libor Klepáč  wrote:
> Hello,
> we use backuppc to backup our hosting machines.
>
> I have recently migrated it to btrfs, so we can use send-recieve for offsite 
> backups of our backups.
>
> I have several btrfs volumes, each hosts nspawn container, which runs in 
> /system subvolume and has backuppc data in /backuppc subvolume
> .
> I use btrbk to do snapshots and transfer.
> Local side is set to keep 5 daily snapshots, remote side to hold some 
> history. (not much yet, i'm using it this way for few weeks).
>
> If you know backuppc behaviour: for every backup (even incremental), it 
> creates full directory tree of each backed up machine even if it has no 
> modified files and places one small file in each, which holds some info for 
> backuppc.
> So after few days i ran into ENOSPACE on one volume, because my metadata 
> grow, because of inlineing.
> I switched from mdata=DUP to mdata=single (now I see it's possible to change 
> inline file size, right?).

I would try mounting both send and receive volumes with max_inline=0.
Then, for all small new and changed files, the file data will be
stored in data chunks and not inline in the metadata chunks.

That you changed the metadata profile from DUP to single is unrelated in
principle. Using single for metadata instead of DUP halves the write I/O
on the hard disks, so in that sense it might speed up send actions a
bit. I guess almost all the time is spent in seeks.
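
A quick way to check whether the sender really is seek-bound, assuming
/dev/sdg as in the mount output quoted below: while a send is running, a
device sitting near 100% utilization with small average request sizes and
throughput in the tens of KB/s is the classic seek-bound signature.

# per-device utilization and request sizes, refreshed every 5 seconds
iostat -dx sdg 5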

> My problem is, that on some volumes, send-recieve is relatively fast (rate in 
> MB/s or hundreds of kB/s) but on biggest volume (biggest in space and biggest 
> in contained filesystem trees) rate is just 5-30kB/s.
>
> Here is btrbk progress copyed
> 785MiB 47:52:00 [12.9KiB/s] [4.67KiB/s]
>
> ie. 758MB in 48 hours.
>
> Reciever has high IO/wait - 90-100%, when i push data using btrbk.
> When I run dd over ssh it can do 50-75MB/s.

It looks like the send part is the speed bottleneck. You can test and
isolate it by doing a dummy send, piping it to | mbuffer > /dev/null,
and seeing what speed you get.
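
Concretely, something along these lines (snapshot names are placeholders;
mbuffer's -m only provides a RAM buffer so the pipe itself is not the
limiting factor):

# measure pure send throughput with no receiver and no network involved
btrfs send -p /backuppc.20160721 /backuppc.20160722 | mbuffer -m 1G > /dev/null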

> Sending machine is debian jessie with kernel 4.5.0-0.bpo.2-amd64 (upstream 
> 4.5.3) , btrfsprogs 4.4.1. It is virtual machine running on volume exported 
> from MD3420, 4 SAS disks in RAID10.
>
> Recieving machine is debian jessie on Dell T20 with 4x3TB disks in MD RAID5 , 
> kernel is 4.4.0-0.bpo.1-amd64 (upstream 4.4.6), btrfsprgos 4.4.1
>
> BTRFS volumes were created using those listed versions.
>
> Sender:
> -
> #mount | grep hosting
> /dev/sdg on /mnt/btrfs/hosting type btrfs 
> (rw,noatime,space_cache,subvolid=5,subvol=/)
> /dev/sdg on /var/lib/container/hosting type btrfs 
> (rw,noatime,space_cache,subvolid=259,subvol=/system)
> /dev/sdg on /var/lib/container/hosting/var/lib/backuppc type btrfs 
> (rw,noatime,space_cache,subvolid=260,subvol=/backuppc)
>
> #btrfs filesystem usage /mnt/btrfs/hosting
> Overall:
> Device size: 840.00GiB
> Device allocated:815.03GiB
> Device unallocated:   24.97GiB
> Device missing:  0.00B
> Used:522.76GiB
> Free (estimated):283.66GiB  (min: 271.18GiB)
> Data ratio:   1.00
> Metadata ratio:   1.00
> Global reserve:  512.00MiB  (used: 0.00B)
>
> Data,single: Size:710.98GiB, Used:452.29GiB
>/dev/sdg  710.98GiB
>
> Metadata,single: Size:103.98GiB, Used:70.46GiB
>/dev/sdg  103.98GiB

This is a very large metadata-to-data ratio. In my experience, large and
scattered metadata, even on fast rotational media, results in slow send
operations (incremental send, about 10G of metadata). So hopefully, when
all the small files and many directories from backuppc are in data chunks
and the metadata is significantly smaller, send will be faster. However,
maybe it is just the huge number of files, and not the inlining of small
files, that makes the metadata so big.

I assume incremental send of snapshots is done.

> System,DUP: Size:32.00MiB, Used:112.00KiB
>/dev/sdg   64.00MiB
>
> Unallocated:
>/dev/sdg   24.97GiB
>
> # btrfs filesystem show /mnt/btrfs/hosting
> Label: 'BackupPC-BcomHosting'  uuid: edecc92a-646a-4585-91a0-9cbb556303e9
> Total devices 1 FS bytes used 522.75GiB
> devid1 size 840.00GiB used 815.03GiB path /dev/sdg
>
> #Reciever:
> #mount | grep hosting
> /dev/mapper/vgPecDisk2-lvHostingBackupBtrfs on /mnt/btrfs/hosting type btrfs 
> (rw,noatime,space_cache,subvolid=5,subvol=/)
>
> #btrfs filesystem usage /mnt/btrfs/hosting/
> Overall:
> Device size: 896.00GiB
> Device allocated:604.07GiB
> Device unallocated:  291.93GiB
> Device missing:  0.00B
> Used:565.98GiB
> Free (estimated):313.62GiB  (min: 167.65GiB)
> Data ratio:   1.00
> Metadata ratio:   1.00
> Gl

[PATCH] btrfs: do not background blkdev_put()

2016-07-22 Thread Anand Jain
From: Anand Jain 

At the end of unmount/dev-delete, if the device's exclusive open is not
actually closed yet, there can be a race with another userland program
that is trying to open the device in exclusive mode, and that open may
fail, for example:
  unmount /btrfs; fsck /dev/x
  btrfs dev del /dev/x /btrfs; fsck /dev/x
so deferring blkdev_put() to the background is not an option here.

---
This patch depends on the patch 2/2 as below,
  [PATCH 1/2] btrfs: reorg btrfs_close_one_device()
  [PATCH v3 2/2] btrfs: make sure device is synced before return

RFC->PATCH:
 Collect the sync and put calls into a helper function and use it

 fs/btrfs/volumes.c | 27 +++
 1 file changed, 19 insertions(+), 8 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index f741ade130a4..1ce584893d1b 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -834,10 +834,6 @@ static void __free_device(struct work_struct *work)
struct btrfs_device *device;
 
device = container_of(work, struct btrfs_device, rcu_work);
-
-   if (device->bdev)
-   blkdev_put(device->bdev, device->mode);
-
rcu_string_free(device->name);
kfree(device);
 }
@@ -852,6 +848,17 @@ static void free_device(struct rcu_head *head)
schedule_work(&device->rcu_work);
 }
 
+static void btrfs_close_bdev(struct btrfs_device *device)
+{
+   if (device->bdev && device->writeable) {
+   sync_blockdev(device->bdev);
+   invalidate_bdev(device->bdev);
+   }
+
+   if (device->bdev)
+   blkdev_put(device->bdev, device->mode);
+}
+
 static void btrfs_close_one_device(struct btrfs_device *device)
 {
struct btrfs_fs_devices *fs_devices = device->fs_devices;
@@ -870,10 +877,7 @@ static void btrfs_close_one_device(struct btrfs_device 
*device)
if (device->missing)
fs_devices->missing_devices--;
 
-   if (device->bdev && device->writeable) {
-   sync_blockdev(device->bdev);
-   invalidate_bdev(device->bdev);
-   }
+   btrfs_close_bdev(device);
 
new_device = btrfs_alloc_device(NULL, &device->devid,
device->uuid);
@@ -1932,6 +1936,8 @@ int btrfs_rm_device(struct btrfs_root *root, char 
*device_path, u64 devid)
btrfs_sysfs_rm_device_link(root->fs_info->fs_devices, device);
}
 
+   btrfs_close_bdev(device);
+
call_rcu(&device->rcu, free_device);
 
num_devices = btrfs_super_num_devices(root->fs_info->super_copy) - 1;
@@ -2025,6 +2031,9 @@ void btrfs_rm_dev_replace_free_srcdev(struct 
btrfs_fs_info *fs_info,
/* zero out the old super if it is writable */
btrfs_scratch_superblocks(srcdev->bdev, srcdev->name->str);
}
+
+   btrfs_close_bdev(srcdev);
+
call_rcu(&srcdev->rcu, free_device);
 
/*
@@ -2080,6 +2089,8 @@ void btrfs_destroy_dev_replace_tgtdev(struct 
btrfs_fs_info *fs_info,
 * the device_list_mutex lock.
 */
btrfs_scratch_superblocks(tgtdev->bdev, tgtdev->name->str);
+
+   btrfs_close_bdev(tgtdev);
call_rcu(&tgtdev->rcu, free_device);
 }
 
-- 
2.7.0
