Re: Pointers to mirroring partitions (w/ encryption?) help?

2016-06-03 Thread B. S.



On 06/03/2016 09:39 PM, Justin Brown wrote:

Here's some thoughts:


Assume a CD sized (680MB) /boot


Some distros carry patches for grub that allow booting from Btrfs,
so no separate /boot file system is required. (Fedora does not;
Ubuntu -- and therefore probably all Debians -- does.)


OTOH, a separate /boot keeps all possible future options open, and
reduces complexity. e.g. unwinding a /boot from within /, later.
Regardless, no harm in having separate /boot. (Assuming not worried
about partition presence detection.)


perhaps a 200MB (?) sized EFI partition


Way bigger than necessary. It should only be 1-2MiB, and IIRC 2MiB
might be the max UEFI allows.


Thanks for that. https://en.wikipedia.org/wiki/EFI_system_partition,
which I stupidly didn't think to look at at the time, doesn't speak to
size, but does note, for Gummiboot, "Configuration file fragments,
kernel images and initrd images are required to reside on the EFI System
partition, as Gummiboot does not provide support for accessing files on
other partitions or file systems." So I'm not sure that 2MB is large
enough, and I suspect exceeding 2MB, reasonably, should do no harm
except waste some space.


then creates another partition for mirroring, later. IIUC, btrfs
add device /dev/sda4 / is appropriate, then. Then running a balance
seems recommended.


Don't do this. It's not going to provide any additional protection
that you can't do in a smarter way. If you only have one device and
want data duplication, just use the `dup` data profile (settable via
`balance`). In fact, by default Btrfs uses the `dup` profile for
metadata (and `single` for data). You'll get all the data integrity
benefits with `dup`.


Thank you for that. So a data-dup'ed fs will overwrite a
checksum-failing copy of a file with the (checksum-passing) 2nd copy,
and the weekly scrub will ensure the reverse won't happen (likely). Cool!


I wonder if a separate physical partition brings anything to the party.
OTOH, a botched partition, duplicated, is still botched. Hmm.

Having a btrfs partition currently, with / even, can the partition be
grown and dup added after the fact?
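
(For what it's worth, a minimal sketch of what that after-the-fact
conversion might look like -- the mount point is an assumption, and dup
data on a single-device filesystem wants btrfs-progs/kernel 4.5.1 or
newer:

  btrfs filesystem resize max /        # after growing the partition itself
  btrfs balance start -dconvert=dup /  # convert existing data chunks to dup
  btrfs filesystem df /                # confirm Data is now DUP
  btrfs scrub start /                  # let a scrub verify both copies
)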


One of the best features and initially confusing things about Btrfs
is how much is done "within" a file system. (There is a certain "the
Btrfs way" to it.)


Yep. Thus the questions. And thank you, list, for being here.




Confusing, however, is having those (both) partitions encrypted.
Seems some work is needed beforehand. But I've never done
encryption.


(This is moot if you go with `dup`.) It's actually quite easy with
every major distro. If we're talking about a fresh install, the
distro installer probably has full support


... don't we wish. Just tried a Kubuntu 16.04 LTS install ... passphrase 
request hidden and broken. Some googling suggests staying away from 
K/Ubuntu at the moment for crypt installs. Installer broken.


So switched to Debian 8, which is bringing its own problems. e.g. 
network can ping locally but not outside. Set static address and it's 
fine - go figure. Broken video and updates, and more. This, I expect, 
has more to do with getting back into the Debian way.



for passphrase-based
dm-crypt LUKS encryption, including multiple volumes sharing a
passphrase.


... and you're back to why I posted the OP. Just sinking into such, and
the water is murky. No doubt, like so many other things Linux, in a few
years it will be old hat. Not there yet, though.



An existing install should be convertible without much
trouble. It's usually just a matter of setting up the container with
`cryptsetup`, populating `/etc/crypttab`, possibly adding crypto
modules to your initrd and/or updating settings, and rebuilding the
initrd. (I have first-hand experience doing this on a Fedora install
recently, and it took about half an hour and I knew nothing about
Fedora's `dracut` initrd generator tool.)


Hmmm. Interesting thought. Perhaps I should clone a current install, and 
go through the exercise. Then trying to do it all at once on a new 
install should have a lower learning curve / botch risk.



If you do need multiple encrypted file systems, simply use the same
passphrase for all volumes (but never do this by cloning the LUKS
headers). You'll only need to enter it once at boot.


Good to know, thank you. That's not obvious / made readily apparent when 
googling.


Let alone, if trying to reduce complexity by ignoring LVM, it isn't 
readily apparent that dmcrypt involves LUKS. Too many terms and 
technologies flying by, cross-pollinating, even.



The additional problem is most articles reference FDE (Full Disk
Encryption) - but that doesn't seem to be prudent. e.g. Unencrypted
/boot. So having problems finding concise links on the topics, -FDE
-"Full Disk Encryption".


Yeah, when it comes to FDE, you either have to make your peace with
trusting the manufacturer, or you can't. If you are going to boot
your system with a traditional boot loader, an unencrypted partition
is mandatory. 

Re: btrfs

2016-06-03 Thread Chris Murphy
On Fri, Jun 3, 2016 at 8:13 PM, Christoph Anton Mitterer
 wrote:

> If there were, e.g., a kept-up-to-date wiki page about the status
> and current perils of e.g. RAID5/6, people (like me) wouldn't ask every
> week, saving the devs' time.

Well up until 4.6, there was a rather clear "Btrfs is under heavy
development, and is not suitable for any uses other than benchmarking
and review." statement in kernel documentation.

https://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/diff/Documentation/filesystems/btrfs.txt?id=v4.6=v4.5

There's no longer such a strongly worded caution in that document, nor
in the wiki.

The wiki has stale information still, but it's a volunteer effort like
everything else Btrfs related.


-- 
Chris Murphy


Re: btrfs

2016-06-03 Thread Christoph Anton Mitterer
On Sat, 2016-06-04 at 00:22 +0200, Brendan Hide wrote:
> > - RAID5/6 seems far from being stable or even usable,... not to talk
> >   about higher parity levels, whose earlier posted patches (e.g.
> >   http://thread.gmane.org/gmane.linux.kernel/1654735) seem to have
> >   been given up.
> I'm not certain why that patch didn't get any replies, though it
> should also be noted that it was sent to three mailing lists - and
> that btrfs was simply an implementation example. See previous thread
> here: http://thread.gmane.org/gmane.linux.kernel/1622485
Ah... I remembered that one, but just couldn't find it anymore... so
even two efforts already, both seem dead :-(

> I recall reading it and thinking 6 parities is madness - but I
> certainly see how it would be good for future-proofing.
Well I can imagine that scenarios exist in which more than two parities
may be highly desirable...


> > - a number of important core features not fully working in many
> >   situations (e.g. the issues with defrag, not being ref-link aware,...
> >   and I vaguely remember similar things with compression).
> True also. There are various features and situations where btrfs
> does not work as intelligently as expected.

And even worse: Some of these are totally impossible to know for the
average user. => the documentation issue (though the defrag issue, at
least, is documented now in btrfs-filesystem(8)).



>  I class these under the "you're doing it wrong" theme. The vast
> majority of popular database engines have been designed without CoW
> in mind and, unfortunately, one *cannot* simply dump it onto a CoW
> system and expect it to perform well. There is no easy answer here.
Well, the easy answer is: nodatacow.
At least in the sense that it's technically possible; whether it's easy
for the end user is another matter (the average admin may possibly at
some point read that nodatacow should be used for VMs and DBs, but what
about all the smallish DBs like Firefox's sqlite files, or simply any
other scenario where such IO patterns happen?).

But the problem with nodatacow is the implication of checksumming loss.




> > - other earlier anticipated features like newer/better compression or
> >   checksum algos seem to be dead either
>  Re alternative compression:
> https://btrfs.wiki.kernel.org/index.php/FAQ#Will_btrfs_support_LZ4.3F
> My short version: This is a premature optimisation.
> 
> IMO, alternative checksums is also a premature optimisation. An RFC
> for alternative checksums was last looked at by Liu Bo in November
> 2014. A different strategy was proposed as the code didn't make use
> of a pre-existing crypto code in the kernel.



> > - still no real RAID 1
>  This depends on what you mean by "real" - and I'm guessing you're
> misled by mdraid's feature to have multiple copies in RAID1 rather
> than just the two. RAID1 by definition is exactly two mirrored
> copies. No more. No less.
See my answer to Austin about the same claim.
Actually I have no idea where that definition comes from,... even the
more down-to-earth sources like Wikipedia all speak about "mirroring of
all disks", as does the original paper about RAID.


> > - no end-user/admin grade management/analysis tools, that tell non-
> >   experts about the state/health of their fs, and whether things like
> >   balance etc.pp. are necessary
> > 
> > - the still problematic documentation situation
>  Simple answer: RAID5/6 is not yet recommended for storing data you
> don't mind losing. Btrfs is *also* not yet ready for install-and-
> forget-style system administration.

Well, the problem with writing good documentation in the "we do it once
it's finished" style is often that it will never happen... or that the
devs themselves no longer recall all the details.

Also, in the meantime so much (often outdated) 3rd-party documentation
and so many myths come alive that it takes ages to clean all that up.


> I personally recommend against using btrfs for people who aren't
> familiar with it.
I think it *is* pretty important that many people try/test/play with
it, because that helps stabilisation... but even during that phase,
documentation would be quite important.

If there were, e.g., a kept-up-to-date wiki page about the status and
current perils of e.g. RAID5/6, people (like me) wouldn't ask every
week, saving the devs' time.
Plus people wouldn't end up simply trying it, believing it already
works, and then face data loss.


Cheers,
Chris.



Re: btrfs

2016-06-03 Thread Christoph Anton Mitterer
On Fri, 2016-06-03 at 15:50 -0400, Austin S Hemmelgarn wrote:
> There's no point in trying to do higher parity levels if we can't get
> regular parity working correctly.  Given the current state of things,
> it might be better to break even and just rewrite the whole parity
> raid thing from scratch, but I doubt that anybody is willing to do
> that.

Well... as I've said, things are pretty worrying. Obviously I cannot
really judge, since I'm not into btrfs' development... maybe there's a
lack of manpower? Since btrfs seems to be a very important piece (i.e.
the next-gen fs), wouldn't it be possible either to get some additional
funding from the Linux Foundation, or for some of the core developers
to make an open call for funding by companies?
Having some additional people, perhaps working full-time on it, may be
a big help.

As for the RAID... given how much time/effort is being spent now on
5/6,... it really seems one should have considered multi-parity from
the beginning.
It kinda feels like either this whole instability phase would start
again with multi-parity, or it will simply never happen.


> > - Serious show-stoppers and security deficiencies like the UUID
> >   collision corruptions/attacks that have been extensively discussed
> >   earlier, are still open
> The UUID issue is not a BTRFS specific one, it just happens to be
> easier to cause issues with it on BTRFS

Uhm, this had been discussed extensively before, as I've said... AFAICS
btrfs is the only system we have that can possibly cause data
corruption or even a security breach via UUID collisions.
I'm not aware of other filesystems, or LVM, being affected; these just
continue to use those devices already "online"... and I think LVM
refuses to activate VGs if conflicting UUIDs are found.


> There is no way to solve it sanely given the requirement that
> userspace not be broken.
No, this is not true. Back when this was discussed, I and others
described how it could/should be done, i.e. how userspace/kernel should
behave; in short:
- continue using those devices that are already active
- refuse to (auto)assemble by UUID if there are conflicts, or require
  the devices to be specified explicitly (with some
  --override-yes-i-know-what-i-do option or so)
- in the case of assembling/rebuilding/similar... never do this
  automatically

I think there were some more corner cases, I basically had them all
discussed in the thread back then (search for "attacking btrfs
filesystems via UUID collisions?" and IIRC some different titled parent
or child threads).


>   Properly fixing this would likely make us more dependent
> on hardware configuration than even mounting by device name.
Sure, if there are colliding UUIDs and one still wants to mount (using
some --override-yes-i-know-what-i-do option), it would need to be by
specifying the device name...
But where's the problem?
This would anyway only happen if someone is attacking or someone made a
clone, and it's far better to refuse automatic assembly in cases where
accidental corruption can happen or where attacks may be possible,
requiring the user/admin to manually take action, than to end up with
corruption or a security breach.

Imagine the simple case: degraded RAID1 on a PC; if btrfs were to do
some auto-rebuild based on UUID, then an attacker who knows that would
just need to plug in a USB disk with a fitting UUID... and easily get a
copy of everything on disk: gpg keys, ssh keys, etc.



> > - a number of important core features not fully working in many
> >   situations (e.g. the issues with defrag, not being ref-link aware,...
> >   and I vaguely remember similar things with compression).
> OK, how then should defrag handle reflinks?  Preserving them prevents
> it from being able to completely defragment data.
Didn't that even work in the past, with just some performance issues?


> > - OTOH, defrag seems to be viable for important use cases (VM images,
> >   DBs,... everything where large files are internally re-written
> >   randomly).
> >   Sure there is nodatacow, but with that one effectively completely
> >   loses one of the core features/promises of btrfs (integrity by
> >   checksumming)... and as I've shown in an earlier large discussion,
> >   none of the typical use cases for nodatacow has any high-level
> >   checksumming, and even if, it's not used per default, or doesn't give
> >   the same benefits as it would on the fs level, like using it for RAID
> >   recovery).
> The argument of nodatacow being viable for anything is a pretty
> significant secondary discussion that is itself entirely orthogonal to
> the point you appear to be trying to make here.

Well the point here was:
- many people (including myself) like btrfs and its
  (promised/future/current) features
- it's intended as a general purpose fs
- this includes the case of having such file/IO patterns as e.g. for VM
  images or DBs
- this is currently not really doable without losing one of the

Re: Recommended why to use btrfs for production?

2016-06-03 Thread Chris Murphy
On Fri, Jun 3, 2016 at 6:48 PM, Nicholas D Steeves  wrote:
> On 3 June 2016 at 11:33, Austin S. Hemmelgarn  wrote:
>> On 2016-06-03 10:11, Martin wrote:

 Make certain the kernel command timer value is greater than the driver
 error recovery timeout. The former is found in sysfs, per block
 device, the latter can be read and set with smartctl. Wrong
 configuration is common (it's actually the default) when using
 consumer drives, and inevitably leads to problems, even the loss of
 the entire array. It really is a terrible default.
>>>
>>>
>>> Are nearline SAS drives considered consumer drives?
>>>
>> If it's a SAS drive, then no, especially when you start talking about things
>> marketed as 'nearline'.  Additionally, SCT ERC is entirely a SATA thing, I
>> forget what the equivalent in SCSI (and by extension SAS) terms is, but I'm
>> pretty sure that the kernel handles things differently there.
>
> For the purposes of BTRFS RAID1: For drives that ship with SCT ERC of
> 7sec, is the default kernel command timeout of 30sec appropriate, or
> should it be reduced?

It's fine. But it depends on your use case: if it can tolerate a rare
greater-than-7-second but less-than-30-second hang, and you're prepared
to start investigating the cause, then I'd leave it alone. If the use
case prefers resetting the drive when it stops responding, then you'd
go with something shorter.

I'm fairly certain SAS's command queue doesn't get obliterated with
such a link reset, just the hung command; whereas with SATA drives all
information in the queue is lost. So resets on SATA are a much bigger
penalty, if I have the correct understanding.


>  For SATA drives that do not support SCT ERC, is
> it true that 120sec is a sane value?  I forget where I got this value
> of 120sec;

It's a good question. It's not well documented and is not defined in
the SATA spec, so it's probably make/model specific. The linux-raid@
list probably has the most information on this, just because their
users get nailed by this problem often. And the recommendation does
seem to vary around 120 to 180. That is of course a maximum; the drive
could give up much sooner. But what you don't want is for the drive to
be in recovery for a bad sector while the command timer does a link
reset, losing all of what the drive was doing: all of which is
replaceable except really one thing, which is which sector was having
the problem. And right now there's no reporting from the drive for slow
sectors. It only reports failed reads, and it's that failed read error
that includes the sector, so that the raid mechanism can figure out
what data is missing, reconstruct it from mirror or parity, and then
fix the bad sector by writing to it.
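
A rough sketch of checking and adjusting both values (the device name
is hypothetical, and smartctl's scterc values are in deciseconds):

  smartctl -l scterc /dev/sdX               # query the drive's error recovery timeout
  smartctl -l scterc,70,70 /dev/sdX         # set 7.0s read/write ERC, if the drive supports it
  cat /sys/block/sdX/device/timeout         # kernel command timer, in seconds (default 30)
  echo 180 > /sys/block/sdX/device/timeout  # raise it for drives without working SCT ERC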

> it might have been this list, it might have been an mdadm
> bug report.  Also, in terms of tuning, I've been unable to find
> whether the ideal kernel timeout value changes depending on RAID
> type...is that a factor in selecting a sane kernel timeout value?

No. It's strictly a value to make certain you get read errors from the
drive rather than link resets.

And that's why I think it's a bad default, because it totally thwarts
attempts by manufacturers to recover marginal sectors, even in the
single disk case.


-- 
Chris Murphy


Re: Pointers to mirroring partitions (w/ encryption?) help?

2016-06-03 Thread Justin Brown
Here's some thoughts:

> Assume a CD sized (680MB) /boot

Some distros carry patches for grub that allow booting from Btrfs, so
no separate /boot file system is required. (Fedora does not; Ubuntu --
and therefore probably all Debians -- does.)

> perhaps a 200MB (?) sized EFI partition

Way bigger than necessary. It should only be 1-2MiB, and IIRC 2MiB
might be the max UEFI allows.

>  then creates another partition for mirroring, later. IIUC, btrfs add device 
> /dev/sda4 / is appropriate, then. Then running a balance seems recommended.

Don't do this. It's not going to provide any additional protection
that you can't do in a smarter way. If you only have one device and
want data duplication, just use the `dup` data profile (settable via
`balance`). In fact, by default Btrfs uses the `dup` profile for
metadata (and `single` for data). You'll get all the data integrity
benefits with `dup`.

One of the best features and initially confusing things about Btrfs is
how much is done "within" a file system. (There is a certain "the
Btrfs way" to it.)

> Confusing, however, is having those (both) partitions encrypted. Seems some 
> work is needed beforehand. But I've never done encryption.

(This is moot if you go with `dup`.) It's actually quite easy with
every major distro. If we're talking about a fresh install, the distro
installer probably has full support for passphrase-based dm-crypt LUKS
encryption, including multiple volumes sharing a passphrase. An
existing install should be convertible without much trouble. It's
usually just a matter of setting up the container with `cryptsetup`,
populating `/etc/crypttab`, possibly adding crypto modules to your
initrd and/or updating settings, and rebuilding the initrd. (I have
first-hand experience doing this on a Fedora install recently, and it
took about half an hour and I knew nothing about Fedora's `dracut`
initrd generator tool.)

If you do need multiple encrypted file systems, simply use the same
passphrase for all volumes (but never do this by cloning the LUKS
headers). You'll only need to enter it once at boot.
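
A very rough sketch of the moving parts (device names, mapper names,
the UUID placeholders, and the initrd command are assumptions, not a
tested recipe):

  cryptsetup luksFormat /dev/sdXn       # create the LUKS container
  cryptsetup open /dev/sdXn cryptroot   # unlock it; the fs then lives on /dev/mapper/cryptroot

  # /etc/crypttab -- two volumes sharing one passphrase, prompted once at boot
  cryptroot  UUID=<uuid-of-sdXn>  none  luks
  crypthome  UUID=<uuid-of-sdXm>  none  luks

  dracut -f      # rebuild the initrd on Fedora; update-initramfs -u on Debian/Ubuntu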

> The additional problem is most articles reference FDE (Full Disk Encryption) 
> - but that doesn't seem to be prudent. e.g. Unencrypted /boot. So having 
> problems finding concise links on the topics, -FDE -"Full Disk Encryption".

Yeah, when it comes to FDE, you either have to make your peace with
trusting the manufacturer, or you can't. If you are going to boot your
system with a traditional boot loader, an unencrypted partition is
mandatory. That being said, we live in a world with UEFI Secure Boot.
While your EFI partition must be unencrypted vfat, you can sign the
kernels (or shims), and the UEFI can be configured to only boot signed
executables, including only those signed by your own key. Some distros
already provide this feature, including using keys probably already
trusted by the default keystore.
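
For instance, with the sbsigntools package one can sign a kernel
against one's own key (the file names here are made up for
illustration):

  sbsign --key db.key --cert db.crt --output vmlinuz.signed vmlinuz
  sbverify --cert db.crt vmlinuz.signed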

> mirror subvolumes (or it inherently comes along for the ride?)

Yes, that is correct. Just to give some more background: the data and
metadata profiles control "mirroring," and they are set at the file
system level. Subvolumes live entirely within one file system, so
whatever profile is set in the FS applies to subvolumes.

> So, I could take an HD, create partitions as above (how? e.g. Set up 
> encryption / btrfs mirror volumes), then clonezilla (?) partitions from a 
> current machine in.

Are you currently using Btrfs? If so, use Btrfs' `send` and `receive`
commands. That should be a lot friendlier to your SSD. (I'll take this
opportunity to say that you need to consider the `discard` mount *and*
`/etc/crypttab` options. Discard -- or scheduling `fstrim` -- is
extremely important to maintain optimal performance of a SSD, but
there are some privacy trade-offs on encrypted systems.) If not, then
`cp -a` or similar will work. Obviously, you'll have to get your boot
mechanism and file system identifiers updated in addition to
`/etc/crypttab` described above.
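
As a sketch (paths purely illustrative), the migration and the trim
scheduling might look like:

  btrfs subvolume snapshot -r / /migrate-ro   # send needs a read-only snapshot
  btrfs send /migrate-ro | btrfs receive /mnt/newfs/
  systemctl enable --now fstrim.timer         # periodic trim instead of the discard mount option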

Lastly, strongly consider `autodefrag` and possibly setting some
highly volatile -- but *unimportant* -- directories to `nodatacow`
via purging and `chattr +C`. (I do this for ~/.cache and /var/cache.)
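
Something along these lines (the directory is just an example, and note
that +C only affects files created after the flag is set):

  mv /var/cache /var/cache.old
  mkdir /var/cache && chattr +C /var/cache
  cp -a /var/cache.old/. /var/cache/ && rm -rf /var/cache.old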

> Yet not looking to put in a 2nd HD

If you change your mind and decide on a backup device, or even if you
just want local backup snapshots, one of the best snapshot managers is
btrfs-sxbackup (no association with the FS project).

On Fri, Jun 3, 2016 at 3:30 PM, B. S.  wrote:
> Hallo. I'm continuing on sinking in to btrfs, so pointers to concise help
> articles appreciated. I've got a couple new home systems, so perhaps it's
> time to investigate encryption, and given the bit rot I've seen here,
> perhaps time to mirror volumes so the wonderful btrfs self-healing
> facilities can be taken advantage of.
>
> Problem with today's hard drives, a quick look at Canada Computer shows the
> smallest drives 500GB, 120GB SSDs, far more than the 20GB or so an OS needs.
> Yet not 

Re: Recommended why to use btrfs for production?

2016-06-03 Thread Chris Murphy
On Fri, Jun 3, 2016 at 8:11 AM, Martin  wrote:
>> Make certain the kernel command timer value is greater than the driver
>> error recovery timeout. The former is found in sysfs, per block
>> device, the latter can be read and set with smartctl. Wrong
>> configuration is common (it's actually the default) when using
>> consumer drives, and inevitably leads to problems, even the loss of
>> the entire array. It really is a terrible default.
>
> Are nearline SAS drives considered consumer drives?

No, they should have a configurable SCT ERC setting, adjustable with
smartctl. Many, possibly most, consumer drives now do not support it,
so often the only workable way to use them in any kind of
multiple-device scenario other than linear/concat or raid0 is to
significantly increase the scsi command timer - upwards of 2 or 3
minutes. So if your use case cannot tolerate such delays, then the
drives must be disqualified.
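
One common way to make such an increase persistent is a udev rule along
these lines (an untested sketch; the 180-second value is only an
example):

  # /etc/udev/rules.d/60-disk-timeout.rules
  ACTION=="add", SUBSYSTEM=="block", KERNEL=="sd[a-z]", RUN+="/bin/sh -c 'echo 180 > /sys%p/device/timeout'"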



-- 
Chris Murphy


Re: [PATCH] Btrfs: clear uptodate flags of pages in sys_array eb

2016-06-03 Thread Liu Bo
On Fri, Jun 03, 2016 at 05:41:42PM -0700, Liu Bo wrote:
> We set uptodate flag to pages in the temporary sys_array eb,
> but do not clear the flag after free eb.  As the special
> btree inode may still hold a reference on those pages, the
> uptodate flag can remain alive in them.
> 
> If btrfs_super_chunk_root has been intentionally changed to the
> offset of this sys_array eb, reading chunk_root will read content
> of sys_array and it will pass our beautiful checks in

s/pass/skip/

My mistake, sorry.

Thanks,

-liubo

> btree_readpage_end_io_hook() because of
> "pages of eb are uptodate => eb is uptodate"
> 
> This adds the 'clear uptodate' part to force it to read from disk.
> 
> Signed-off-by: Liu Bo 
> ---
>  fs/btrfs/volumes.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index 7a169de..d2ca03b 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -6681,12 +6681,14 @@ int btrfs_read_sys_array(struct btrfs_root *root)
>   sb_array_offset += len;
>   cur_offset += len;
>   }
> + clear_extent_buffer_uptodate(sb);
>   free_extent_buffer_stale(sb);
>   return ret;
>  
>  out_short_read:
>   printk(KERN_ERR "BTRFS: sys_array too short to read %u bytes at offset 
> %u\n",
>   len, cur_offset);
> + clear_extent_buffer_uptodate(sb);
>   free_extent_buffer_stale(sb);
>   return -EIO;
>  }
> -- 
> 2.5.5
> 


Re: [PATCH] Btrfs: clear uptodate flags of pages in sys_array eb

2016-06-03 Thread Josef Bacik

On 06/03/2016 08:41 PM, Liu Bo wrote:

We set uptodate flag to pages in the temporary sys_array eb,
but do not clear the flag after free eb.  As the special
btree inode may still hold a reference on those pages, the
uptodate flag can remain alive in them.

If btrfs_super_chunk_root has been intentionally changed to the
offset of this sys_array eb, reading chunk_root will read content
of sys_array and it will pass our beautiful checks in
btree_readpage_end_io_hook() because of
"pages of eb are uptodate => eb is uptodate"

This adds the 'clear uptodate' part to force it to read from disk.

Signed-off-by: Liu Bo 


Reviewed-by: Josef Bacik 

Thanks,

Josef


Re: Recommended why to use btrfs for production?

2016-06-03 Thread Nicholas D Steeves
On 3 June 2016 at 11:33, Austin S. Hemmelgarn  wrote:
> On 2016-06-03 10:11, Martin wrote:
>>>
>>> Make certain the kernel command timer value is greater than the driver
>>> error recovery timeout. The former is found in sysfs, per block
>>> device, the latter can be read and set with smartctl. Wrong
>>> configuration is common (it's actually the default) when using
>>> consumer drives, and inevitably leads to problems, even the loss of
>>> the entire array. It really is a terrible default.
>>
>>
>> Are nearline SAS drives considered consumer drives?
>>
> If it's a SAS drive, then no, especially when you start talking about things
> marketed as 'nearline'.  Additionally, SCT ERC is entirely a SATA thing, I
> forget what the equivalent in SCSI (and by extension SAS) terms is, but I'm
> pretty sure that the kernel handles things differently there.

For the purposes of BTRFS RAID1: For drives that ship with SCT ERC of
7sec, is the default kernel command timeout of 30sec appropriate, or
should it be reduced?  For SATA drives that do not support SCT ERC, is
it true that 120sec is a sane value?  I forget where I got this value
of 120sec; it might have been this list, it might have been an mdadm
bug report.  Also, in terms of tuning, I've been unable to find
whether the ideal kernel timeout value changes depending on RAID
type...is that a factor in selecting a sane kernel timeout value?

Kind regards,
Nicholas


[PATCH] Btrfs: clear uptodate flags of pages in sys_array eb

2016-06-03 Thread Liu Bo
We set uptodate flag to pages in the temporary sys_array eb,
but do not clear the flag after free eb.  As the special
btree inode may still hold a reference on those pages, the
uptodate flag can remain alive in them.

If btrfs_super_chunk_root has been intentionally changed to the
offset of this sys_array eb, reading chunk_root will read content
of sys_array and it will pass our beautiful checks in
btree_readpage_end_io_hook() because of
"pages of eb are uptodate => eb is uptodate"

This adds the 'clear uptodate' part to force it to read from disk.

Signed-off-by: Liu Bo 
---
 fs/btrfs/volumes.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 7a169de..d2ca03b 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -6681,12 +6681,14 @@ int btrfs_read_sys_array(struct btrfs_root *root)
sb_array_offset += len;
cur_offset += len;
}
+   clear_extent_buffer_uptodate(sb);
free_extent_buffer_stale(sb);
return ret;
 
 out_short_read:
printk(KERN_ERR "BTRFS: sys_array too short to read %u bytes at offset 
%u\n",
len, cur_offset);
+   clear_extent_buffer_uptodate(sb);
free_extent_buffer_stale(sb);
return -EIO;
 }
-- 
2.5.5



Re: Debian BTRFS/UEFI Documentation

2016-06-03 Thread Nicholas D Steeves
Hi David,

Sorry for the delay.  Yes, at this point I feel it would be best to
continue this discussion off-list, or perhaps to shift it to the
debian-doc list.  Apologies to linux-btrfs if this should have been
shifted sooner!  I'll follow-up with a PM reply momentarily.

Cheers,
Nicholas

On 3 May 2016 at 03:37, David Alcorn  wrote:
> "Honestly, did you read the Debian wiki pages for btrfs and EFI?  If
> you read them, could you please let me know where they were deficient
> so I can fix them?"
>
> I did not use the Debian wiki pages for BTRFS and UEFI as a resource
> in my attempts to answer my questions because I read them in the past
> and they did not address my specific needs.  Technically, I lack the
> skill set required for my perspectives to merit credulity but I am
> willing to give it a shot.  I do not want to take the list off focus:
> if this discussion belongs elsewhere, let me know.
>
> My question about how to recover/replace a failed boot where "/" is
> located in a BTRFS subvolume located on a BTRFS RAID56 array presents
> challenges but it is reasonable to provide sufficient infrastructure
> in the wiki's to let a portion of the readers answer this question
> themselves rather than bother this list.  Am I correct that (i) there
> is no reasonable tool to permit a screen shot of the Grub menu being
> edited using the "e" key as the O/S has not yet loaded?, and (ii) do
> USB flash drives (unlike some SSDs) respect the "dup" data profile?
>
> It is easy to answer my question whether "/boot" may be located on a
> BTRFS RAID56 array somewhere in the UEFI wiki.  I am more comfortable
> with a more comprehensive revision to the wiki as suggested in the
> below draft.  If the editorial comments are excessive or offend
> community standards, scrap em.
>
> Replace the "RAID for the EFI System Partition" section with:
>
> "DRAFT: RAID and LVM for the EFI and /Boot Partitions".  The UEFI
> firmware specification supports several alternative boot strategies
> including PXE boot and boot from an EFI System Partition ("ESP") which
> might be located on a MBR, GPT or El Torito volume on an optical disk.
> The ESP must be partitioned using a supported FAT partition (such as
> FAT32).  A mdadm RAID array (other than perhaps a RAID 1 array
> formatted as FAT32), a LVM partition and a BTRFS RAID array are not
> FAT and can not hold a functional ESP.  Once Grub loads the ESP
> payload, Grub has enhanced abilities to recognize file systems which
> it uses to acquire required information from "/boot".  The Grub
> Manual, which may be viewed with the command "info grub", reports Grub
> (unlike grub-legacy stage 1.5) has some ability to use advanced file
> systems such as LVM and RAID once the ESP payload is loaded.  This
> support appears to exclude BTRFS RAID 56.  Other than the possible
> mdadm RAID 1 exception noted above, ESP always goes in a separate, non
> array, non LVM FAT partition.  For BTRFS RAID56 arrays,  "/boot" also
> requires a separate, non array partition.
>
> Because LVM does not favor a whole disk Physical Volume ("PV") over a
> partition based PV, it is trivial to create a petite ESP on a disk and
> assign the balance of the disk to a LVM PV.  Array capacity of both
> MDADM and BTRFS RAID 56 arrays may be disproportionately reduced when
> the size of a single disk is reduced by, say an ESP.  For
> administrative simplicity and to maximize array capacity, equal sized
> whole disk arrays are favored.
>
> Both the ESP and "/boot" partitions present limited, read dominated
> workloads.  USB flash drives are cheap and tolerate light, read
> dominated workloads well.  For a stand alone server, it is common to
> locate the ESP on a USB flash device.  If you use a BTRFS RAID56
> array, "/boot" and perhaps "/swap" may also go to separate partitions
> on the flash drive.  This permits assignment of whole disks to the
> array.  If you are working with a large number of servers, it may be
> cheaper, more energy efficient, and more reliable to replace whatever
> is on the flash drive with PXE boot.  Frequently, SATA (or IDE) drives
> that are not wholly allocated to the RAID array are scarce.  If you
> have one, the ESP (and "/boot") partitions may be located there.
> Similar concerns affect LILO.


[GIT PULL] Btrfs

2016-06-03 Thread Chris Mason
Hi Linus,

My for-linus-4.7 branch has some fixes:

git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git 
for-linus-4.7

I realized as I was prepping this pull that my tip commit still had
Facebook task numbers and other internal metadata in it.  So I had to
reword the description, which is why it is only a few hours old.  Only
the description changed since testing.

The important part of this pull is Filipe's set of fixes for btrfs device
replacement.  Filipe fixed a few issues seen on the list and a number
he found on his own.

Filipe Manana (8) commits (+93/-19):
Btrfs: fix race setting block group back to RW mode during device replace 
(+5/-5)
Btrfs: fix unprotected assignment of the left cursor for device replace 
(+4/-0)
Btrfs: fix race setting block group readonly during device replace (+46/-2)
Btrfs: fix race between device replace and block group removal (+11/-0)
Btrfs: fix race between device replace and chunk allocation (+9/-12)
Btrfs: fix race between readahead and device replace/removal (+2/-0)
Btrfs: fix race between device replace and read repair (+10/-0)
Btrfs: fix race between device replace and discard (+6/-0)

Chris Mason (1) commits (+12/-1):
Btrfs: deal with duplciates during extent_map insertion in btrfs_get_extent

Total: (9) commits (+105/-20)

 fs/btrfs/extent-tree.c  |  6 ++
 fs/btrfs/extent_io.c| 10 ++
 fs/btrfs/inode.c| 13 -
 fs/btrfs/ordered-data.c |  6 +-
 fs/btrfs/ordered-data.h |  2 +-
 fs/btrfs/reada.c|  2 ++
 fs/btrfs/scrub.c| 50 ++---
 fs/btrfs/volumes.c  | 32 +++
 8 files changed, 103 insertions(+), 18 deletions(-)


Pointers to mirroring partitions (w/ encryption?) help?

2016-06-03 Thread B. S.
Hallo. I'm continuing on sinking in to btrfs, so pointers to concise 
help articles appreciated. I've got a couple new home systems, so 
perhaps it's time to investigate encryption, and given the bit rot I've 
seen here, perhaps time to mirror volumes so the wonderful btrfs 
self-healing facilities can be taken advantage of.


Problem with today's hard drives, a quick look at Canada Computer shows 
the smallest drives 500GB, 120GB SSDs, far more than the 20GB or so an 
OS needs. Yet not looking to put in a 2nd HD, either. It feels like 
mirroring volumes makes sense.


(EFI [partitions] also seem to be sticking their fingers in here.)

Assume a CD sized (680MB) /boot, and perhaps a 200MB (?) sized EFI 
partition, it seems to me one sets up / as usual (less complex install), 
then creates another partition for mirroring, later. IIUC, btrfs add 
device /dev/sda4 / is appropriate, then. Then running a balance seems 
recommended.
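
(If I understand the commands right, that would be roughly the
following -- partition names purely illustrative, and noting that two
partitions on the same disk only guard against bad sectors, not a
dying disk:

  btrfs device add /dev/sda4 /
  btrfs balance start -dconvert=raid1 -mconvert=raid1 /
)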


Confusing, however, is having those (both) partitions encrypted. Seems 
some work is needed beforehand. But I've never done encryption. I have 
come across https://github.com/gebi/keyctl_keyscript, so I understand 
there will be gotchas to deal with - later. But not there yet, and not 
real sure how to start.


The additional problem is most articles reference FDE (Full Disk 
Encryption) - but that doesn't seem to be prudent. e.g. Unencrypted 
/boot. So having problems finding concise links on the topics, -FDE 
-"Full Disk Encryption".


Any good links to concise instructions on building / establishing
encrypted btrfs mirror volumes? dm-crypt seems to be the basis, and I'm
not looking to add LVM - it seems an unnecessary extra layer of complexity.


It also feels like I could mkfs.btrfs /dev/sda3 /dev/sda4, then mirror 
subvolumes (or it inherently comes along for the ride?) - so my 
confusion level increases. Especially if encryption is added to the mix.


So, I could take an HD, create partitions as above (how? e.g. Set up 
encryption / btrfs mirror volumes), then clonezilla (?) partitions from 
a current machine in. I assume mounting a live cd then cp -a from old 
disk partition to new disk partition won't 'just work'. (?)


Article suggestions?


Re: RAID1 vs RAID10 and best way to set up 6 disks

2016-06-03 Thread Justin Brown
> Mitchell wrote:
> With RAID10, there's still only 1 other copy, but the entire "original"
> disk is mirrored to another one, right?

No, full disks are never mirrored in any configuration.

Here's how I understand Btrfs' non-parity redundancy profiles:

single: only a single instance of a file exists across the file system
dup: two instances of a file exist across the file system, and they
may reside on the same physical disk (4.5.1+ required to use dup
profile on multi-disk file system)
raid1: same as dup but the instances are guaranteed to be on different disks
raid0: like single, but data can be striped between multiple disks
raid10: data is guaranteed to exist on two separate devices but if n>2
the data is load balanced between disks*

Even though my explanation is imperfect, I hope that it illustrates that
Btrfs RAID is different than traditional RAID. Btrfs provides the same
physical redundancy as RAID, but the implementation mechanisms are
quite a bit different. This has wonderful consequences for
flexibility, and it's what allowed me to run a 5x2TB RAID10 array for
nearly two years and essentially allow complete allocation. The
downside is that since allocations aren't enforced from the start (e.g.
MD requiring a certain number of disks and identical sizes), it's
possible to get weird allocations over time, but the resolution is
simple: run a balance from time to time.
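
For instance, a filtered balance that only rewrites mostly-empty chunks
is usually enough (mount point hypothetical):

  btrfs balance start -dusage=50 -musage=50 /mnt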

> Christoph wrote:
> Especially, when you have an odd number of devices (or devices with
> different sizes), it's not clear to me, personally, at all how far that
> redundancy actually goes, respectively what btrfs actually does... could
> be that you have your 2 copies, but maybe on the same device then?

RAID1 (and transitively RAID10) guarantees two copies on different
disks, always. Only dup allows the copies to reside on the same disk.
This is guaranteed is preserved, even when n=2k+1 and mixed-capacity
disks. If disks run out of available chunks to satisfy the redundancy
profile, the result is ENOSPC and requires the administrator to
balance the file system before new allocations can succeed. The
question essentially is asking if Btrfs will spontaneously degrade
into "dup" if chunks cannot be allocated on some devices. That will
never happen.


On Fri, Jun 3, 2016 at 1:42 PM, Mitchell Fossen  wrote:
> Thanks for pointing that out, so if I'm thinking correctly, with RAID1
> it's just that there is a copy of the data somewhere on some other
> drive.
>
> With RAID10, there's still only 1 other copy, but the entire "original"
> disk is mirrored to another one, right?
>
> On Fri, 2016-06-03 at 20:13 +0200, Christoph Anton Mitterer wrote:
>> On Fri, 2016-06-03 at 13:10 -0500, Mitchell Fossen wrote:
>> >
>> > Is there any caveats between RAID1 on all 6 vs RAID10?
>> Just to be safe: RAID1 in btrfs means not what RAID1 means in any
>> other
>> terminology about RAID.
>>
>> The former has only two duplicates, the later means full mirroring of
>> all devices.
>>
>>
>> Cheers,
>> Chris.


Re: btrfs

2016-06-03 Thread Austin S Hemmelgarn
On 2016-06-03 13:38, Christoph Anton Mitterer wrote:
> Hey..
> 
> Hm... so the overall btrfs state seems to be still pretty worrying,
> doesn't it?
> 
> - RAID5/6 seems far from being stable or even usable,... not to talk
>   about higher parity levels, whose earlier posted patches (e.g.
>   http://thread.gmane.org/gmane.linux.kernel/1654735) seem to have
>   been given up.
There's no point in trying to do higher parity levels if we can't get
regular parity working correctly.  Given the current state of things, it
might be better to break even and just rewrite the whole parity raid
thing from scratch, but I doubt that anybody is willing to do that.
> 
> - Serious show-stoppers and security deficiencies like the UUID
>   collision corruptions/attacks that have been extensively discussed
>   earlier, are still open
The UUID issue is not a BTRFS specific one, it just happens to be easier
to cause issues with it on BTRFS, it causes problems with all Linux
native filesystems, as well as LVM, and is also an issue on Windows.
There is no way to solve it sanely given the requirement that userspace
not be broken.  Properly fixing this would likely make us more dependent
on hardware configuration than even mounting by device name.
> 
> - a number of important core features not fully working in many
>   situations (e.g. the issues with defrag, not being ref-link aware,...
>   and I vaguely remember similar things with compression).
OK, how then should defrag handle reflinks?  Preserving them prevents it
from being able to completely defragment data.  It's worth pointing out
that it is generally pointless to defragment snapshots, as they are
typically infrequently accessed in most use cases.
> 
> - OTOH, defrag seems to be viable for important use cases (VM images,
>   DBs,... everything where large files are internally re-written
>   randomly).
>   Sure there is nodatacow, but with that one effectively completely
>   loses one of the core features/promises of btrfs (integrity by
>   checksumming)... and as I've shown in an earlier large discussion,
>   none of the typical use cases for nodatacow has any high-level
>   checksumming, and even if, it's not used per default, or doesn't give
>   the same benefits as it would on the fs level, like using it for RAID
>   recovery).
The argument of nodatacow being viable for anything is a pretty
significant secondary discussion that is itself entirely orthogonal to
the point you appear to be trying to make here.
> 
> - other earlier anticipated features like newer/better compression or
>   checksum algos seem to be dead either
This one I entirely agree about.  The arguments against adding other
compression algorithms and new checksums are entirely bogus.  Ideally
we'd switch to just encoding API info from the CryptoAPI and let people
use whatever they want from there.
> 
> - still no real RAID 1
No, you mean still no higher order replication.  I know I'm being
stubborn about this, but RAID-1 is offici8ally defined in the standards
as 2-way replication.  The only extant systems that support higher
levels of replication and call it RAID-1 are entirely based on MD RAID
and it's poor choice of naming.

Overall, between this and the insanity that is raid5/6, somebody with
significantly more skill than me, and significantly more time than most
of the developers, needs to just take a step back and rewrite the whole
multi-device profile support from scratch.
> 
> - no end-user/admin grade management/analysis tools, that tell non-
>   experts about the state/health of their fs, and whether things like
>   balance etc.pp. are necessary
I don't see anyone forthcoming with such tools either.  As far as basic
monitoring, it's trivial to do with simple scripts from tools like monit
or nagios.  As far as complex things like determining whether a fs needs
to be balanced, that's really non-trivial to figure out.  Even with a
person looking at it, it's still not easy to know whether or not a
balance will actually help.
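
As a trivial sketch of the basic-monitoring side (the mount point and
the zero threshold are assumptions), something like this could feed a
cron job or nagios check:

  #!/bin/sh
  # warn if any btrfs device error counter is non-zero
  MNT=/srv/data
  btrfs device stats "$MNT" | awk '$NF != 0 { print "WARN:", $0; bad=1 } END { exit bad }'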
> 
> - the still problematic documentation situation
Not trying to rationalize this, but go take a look at a majority of
other projects, most of them that aren't backed by some huge corporation
throwing insane amounts of money at them have at best mediocre end-user
documentation.  The fact that more effort is being put into development
than documentation is generally a good thing, especially for something
that is not yet feature complete like BTRFS.


Re: Recommended why to use btrfs for production?

2016-06-03 Thread Christoph Anton Mitterer
Hey.

Does anyone know whether the write hole issues have been fixed already?
https://btrfs.wiki.kernel.org/index.php/RAID56 still mentions it.

Cheers,
Chris.



[PATCH v2] Btrfs: fix eb memory leak due to readpage failure

2016-06-03 Thread Liu Bo
eb->io_pages is set in read_extent_buffer_pages().

In case of readpage failure, for pages that have been added to bio,
it calls bio_endio and later readpage_io_failed_hook() does the work.

When this eb's page (which cannot be the 1st page) fails to add itself
to the bio due to a failure in merge_bio(), it cannot decrease
eb->io_pages via bio_endio, and ends up with a memory leak eventually.

This lets __do_readpage propagate errors to callers and adds the
'atomic_dec(&eb->io_pages)'.

Signed-off-by: Liu Bo 
---
v2:
  - Move 'dec io_pages' to the caller so that we're consistent with
write_one_eb()

 fs/btrfs/extent_io.c | 16 +---
 1 file changed, 13 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index d247fc0..0309388 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2869,6 +2869,7 @@ __get_extent_map(struct inode *inode, struct page *page, 
size_t pg_offset,
  * into the tree that are removed when the IO is done (by the end_io
  * handlers)
  * XXX JDM: This needs looking at to ensure proper page locking
+ * return 0 on success, otherwise return error
  */
 static int __do_readpage(struct extent_io_tree *tree,
 struct page *page,
@@ -2890,7 +2891,7 @@ static int __do_readpage(struct extent_io_tree *tree,
sector_t sector;
struct extent_map *em;
struct block_device *bdev;
-   int ret;
+   int ret = 0;
int nr = 0;
size_t pg_offset = 0;
size_t iosize;
@@ -3081,7 +3082,7 @@ out:
SetPageUptodate(page);
unlock_page(page);
}
-   return 0;
+   return ret;
 }
 
 static inline void __do_contiguous_readpages(struct extent_io_tree *tree,
@@ -5204,8 +5205,17 @@ int read_extent_buffer_pages(struct extent_io_tree *tree,
  get_extent, &bio,
  mirror_num, &bio_flags,
  READ | REQ_META);
-   if (err)
+   if (err) {
ret = err;
+   /*
+* We use &bio in above __extent_read_full_page,
+* so we ensure that if it returns error, the
+* current page fails to add itself to bio.
+*
+* We must dec io_pages by ourselves.
+*/
+   atomic_dec(&eb->io_pages);
+   }
} else {
unlock_page(page);
}
-- 
2.5.5



[PATCH v2 2/2] Btrfs: add valid checks for chunk loading

2016-06-03 Thread Liu Bo
To prevent fuzz filesystem images from panic the whole system,
we need various validation checks to refuse to mount such an image
if btrfs finds any invalid value during loading chunks, including
both sys_array and regular chunks.

Note that these checks may not be sufficient to cover all corner cases,
feel free to add more checks.

Reported-by: Vegard Nossum 
Reported-by: Quentin Casasnovas 
Signed-off-by: Liu Bo 
---
v2: 
 - Fix several typos.

 fs/btrfs/volumes.c | 81 --
 1 file changed, 66 insertions(+), 15 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index d403ab6..7a169de 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -6250,27 +6250,23 @@ struct btrfs_device *btrfs_alloc_device(struct 
btrfs_fs_info *fs_info,
return dev;
 }
 
-static int read_one_chunk(struct btrfs_root *root, struct btrfs_key *key,
- struct extent_buffer *leaf,
- struct btrfs_chunk *chunk)
+/* Return -EIO if any error, otherwise return 0. */
+static int btrfs_check_chunk_valid(struct btrfs_root *root,
+  struct extent_buffer *leaf,
+  struct btrfs_chunk *chunk, u64 logical)
 {
-   struct btrfs_mapping_tree *map_tree = &root->fs_info->mapping_tree;
-   struct map_lookup *map;
-   struct extent_map *em;
-   u64 logical;
u64 length;
u64 stripe_len;
-   u64 devid;
-   u8 uuid[BTRFS_UUID_SIZE];
-   int num_stripes;
-   int ret;
-   int i;
+   u16 num_stripes;
+   u16 sub_stripes;
+   u64 type;
 
-   logical = key->offset;
length = btrfs_chunk_length(leaf, chunk);
stripe_len = btrfs_chunk_stripe_len(leaf, chunk);
num_stripes = btrfs_chunk_num_stripes(leaf, chunk);
-   /* Validation check */
+   sub_stripes = btrfs_chunk_sub_stripes(leaf, chunk);
+   type = btrfs_chunk_type(leaf, chunk);
+
if (!num_stripes) {
btrfs_err(root->fs_info, "invalid chunk num_stripes: %u",
  num_stripes);
@@ -6281,6 +6277,11 @@ static int read_one_chunk(struct btrfs_root *root, 
struct btrfs_key *key,
  "invalid chunk logical %llu", logical);
return -EIO;
}
+   if (btrfs_chunk_sector_size(leaf, chunk) != root->sectorsize) {
+   btrfs_err(root->fs_info, "invalid chunk sectorsize %u",
+ btrfs_chunk_sector_size(leaf, chunk));
+   return -EIO;
+   }
if (!length || !IS_ALIGNED(length, root->sectorsize)) {
btrfs_err(root->fs_info,
"invalid chunk length %llu", length);
@@ -6292,13 +6293,53 @@ static int read_one_chunk(struct btrfs_root *root, 
struct btrfs_key *key,
return -EIO;
}
if (~(BTRFS_BLOCK_GROUP_TYPE_MASK | BTRFS_BLOCK_GROUP_PROFILE_MASK) &
-   btrfs_chunk_type(leaf, chunk)) {
+   type) {
btrfs_err(root->fs_info, "unrecognized chunk type: %llu",
  ~(BTRFS_BLOCK_GROUP_TYPE_MASK |
BTRFS_BLOCK_GROUP_PROFILE_MASK) &
  btrfs_chunk_type(leaf, chunk));
return -EIO;
}
+   if ((type & BTRFS_BLOCK_GROUP_RAID10 && sub_stripes != 2) ||
+   (type & BTRFS_BLOCK_GROUP_RAID1 && num_stripes < 1) ||
+   (type & BTRFS_BLOCK_GROUP_RAID5 && num_stripes < 2) ||
+   (type & BTRFS_BLOCK_GROUP_RAID6 && num_stripes < 3) ||
+   (type & BTRFS_BLOCK_GROUP_DUP && num_stripes > 2) ||
+   ((type & BTRFS_BLOCK_GROUP_PROFILE_MASK) == 0 &&
+num_stripes != 1)) {
+   btrfs_err(root->fs_info, "invalid num_stripes:sub_stripes %u:%u 
for profile %llu",
+ num_stripes, sub_stripes,
+ type & BTRFS_BLOCK_GROUP_PROFILE_MASK);
+   return -EIO;
+   }
+
+   return 0;
+}
+
+static int read_one_chunk(struct btrfs_root *root, struct btrfs_key *key,
+ struct extent_buffer *leaf,
+ struct btrfs_chunk *chunk)
+{
+   struct btrfs_mapping_tree *map_tree = &root->fs_info->mapping_tree;
+   struct map_lookup *map;
+   struct extent_map *em;
+   u64 logical;
+   u64 length;
+   u64 stripe_len;
+   u64 devid;
+   u8 uuid[BTRFS_UUID_SIZE];
+   int num_stripes;
+   int ret;
+   int i;
+
+   logical = key->offset;
+   length = btrfs_chunk_length(leaf, chunk);
+   stripe_len = btrfs_chunk_stripe_len(leaf, chunk);
+   num_stripes = btrfs_chunk_num_stripes(leaf, chunk);
+
+   ret = btrfs_check_chunk_valid(root, leaf, chunk, logical);
+   if (ret)
+   return ret;
 
	read_lock(&map_tree->map_tree.lock);
em = 

[PATCH v2 1/2] Btrfs: add more valid checks for superblock

2016-06-03 Thread Liu Bo
This adds valid checks for super_total_bytes, super_bytes_used and
super_stripesize, super_num_devices.

Reported-by: Vegard Nossum 
Reported-by: Quentin Casasnovas 
Signed-off-by: Liu Bo 
---
v2:
 - Check super_num_devices and super_total_bytes after loading chunk
   tree.
 - Check super_bytes_used against the minimum space usage of a fresh
   mkfs.btrfs.
 - Fix super_stripesize to be sectorsize instead of 4096

 fs/btrfs/disk-io.c | 11 +++
 fs/btrfs/volumes.c | 24 
 2 files changed, 35 insertions(+)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 6628fca..ea78d77 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -4130,6 +4130,17 @@ static int btrfs_check_super_valid(struct btrfs_fs_info 
*fs_info,
 * Hint to catch really bogus numbers, bitflips or so, more exact 
checks are
 * done later
 */
+   if (btrfs_super_bytes_used(sb) < 6 * btrfs_super_nodesize(sb)) {
+   printk(KERN_ERR "BTRFS: bytes_used is too small %llu\n",
+  btrfs_super_bytes_used(sb));
+   ret = -EINVAL;
+   }
+   if (!is_power_of_2(btrfs_super_stripesize(sb)) ||
+   btrfs_super_stripesize(sb) != sectorsize) {
+   printk(KERN_ERR "BTRFS: invalid stripesize %u\n",
+  btrfs_super_stripesize(sb));
+   ret = -EINVAL;
+   }
if (btrfs_super_num_devices(sb) > (1UL << 31))
printk(KERN_WARNING "BTRFS: suspicious number of devices: 
%llu\n",
btrfs_super_num_devices(sb));
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index bdc6256..d403ab6 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -6648,6 +6648,7 @@ int btrfs_read_chunk_tree(struct btrfs_root *root)
struct btrfs_key found_key;
int ret;
int slot;
+   u64 total_dev = 0;
 
root = root->fs_info->chunk_root;
 
@@ -6689,6 +6690,7 @@ int btrfs_read_chunk_tree(struct btrfs_root *root)
ret = read_one_dev(root, leaf, dev_item);
if (ret)
goto error;
+   total_dev++;
} else if (found_key.type == BTRFS_CHUNK_ITEM_KEY) {
struct btrfs_chunk *chunk;
chunk = btrfs_item_ptr(leaf, slot, struct btrfs_chunk);
@@ -6698,6 +6700,28 @@ int btrfs_read_chunk_tree(struct btrfs_root *root)
}
path->slots[0]++;
}
+
+   /*
+* After loading chunk tree, we've got all device information,
+* do another round of validation check.
+*/
+   if (total_dev != root->fs_info->fs_devices->total_devices) {
+   btrfs_err(root->fs_info,
+  "super_num_devices(%llu) mismatch with num_devices(%llu) found here",
+ btrfs_super_num_devices(root->fs_info->super_copy),
+ total_dev);
+   ret = -EINVAL;
+   goto error;
+   }
+   if (btrfs_super_total_bytes(root->fs_info->super_copy) <
+   root->fs_info->fs_devices->total_rw_bytes) {
+   btrfs_err(root->fs_info,
+   "super_total_bytes(%llu) mismatch with fs_devices total_rw_bytes(%llu)",
+ btrfs_super_total_bytes(root->fs_info->super_copy),
+ root->fs_info->fs_devices->total_rw_bytes);
+   ret = -EINVAL;
+   goto error;
+   }
ret = 0;
 error:
unlock_chunks(root);
-- 
2.5.5

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: RAID1 vs RAID10 and best way to set up 6 disks

2016-06-03 Thread Christoph Anton Mitterer
On Fri, 2016-06-03 at 13:42 -0500, Mitchell Fossen wrote:
> Thanks for pointing that out, so if I'm thinking correctly, with
> RAID1
> it's just that there is a copy of the data somewhere on some other
> drive.
> 
> With RAID10, there's still only 1 other copy, but the entire
> "original"
> disk is mirrored to another one, right?

To be honest, I couldn't tell you for sure :-/ ... IMHO the btrfs
documentation has some "issues".

mkfs.btrfs(8) says: 2 copies for RAID10, so I'd assume it's just the
striped version of what btrfs - for whichever questionable reason -
calls "RAID1".

Especially when you have an odd number of devices (or devices with
different sizes), it's not clear to me, personally, how far that
redundancy actually goes, or what btrfs actually does... could
be that you have your 2 copies, but maybe on the same device then?


Cheers,
Chris.


smime.p7s
Description: S/MIME cryptographic signature


Re: RAID1 vs RAID10 and best way to set up 6 disks

2016-06-03 Thread Mitchell Fossen
Thanks for pointing that out, so if I'm thinking correctly, with RAID1
it's just that there is a copy of the data somewhere on some other
drive.

With RAID10, there's still only 1 other copy, but the entire "original"
disk is mirrored to another one, right?

On Fri, 2016-06-03 at 20:13 +0200, Christoph Anton Mitterer wrote:
> On Fri, 2016-06-03 at 13:10 -0500, Mitchell Fossen wrote:
> > 
> > Is there any caveats between RAID1 on all 6 vs RAID10?
> Just to be safe: RAID1 in btrfs does not mean what RAID1 means in any
> other RAID terminology.
> 
> The former keeps only two copies, the latter means full mirroring
> across all devices.
> 
> 
> Cheers,
> Chris.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: RAID1 vs RAID10 and best way to set up 6 disks

2016-06-03 Thread Christoph Anton Mitterer
On Fri, 2016-06-03 at 13:10 -0500, Mitchell Fossen wrote:
> Is there any caveats between RAID1 on all 6 vs RAID10?

Just to be safe: RAID1 in btrfs does not mean what RAID1 means in any other
RAID terminology.

The former keeps only two copies, the latter means full mirroring across
all devices.


Cheers,
Chris.

smime.p7s
Description: S/MIME cryptographic signature


RAID1 vs RAID10 and best way to set up 6 disks

2016-06-03 Thread Mitchell Fossen
Hello,

I have 6 WD Red Pro drives, each 6TB in space. My question is, what is
the best way to set these up? 

The system drive (and root) are on a 500GB SSD, so these drives will
only be used for /home and file storage.

Is there any caveats between RAID1 on all 6 vs RAID10?

Thanks for the help,

Mitch
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: btrfs (was: raid5/6) production use status (and future)?

2016-06-03 Thread Christoph Anton Mitterer
Hey..

Hm... so the overall btrfs state seems to be still pretty worrying,
doesn't it?

- RAID5/6 seems far from being stable or even usable,... not to mention
  higher parity levels, whose earlier posted patches (e.g.
  http://thread.gmane.org/gmane.linux.kernel/1654735) seem to have
  been given up.

- Serious show-stoppers and security deficiencies like the UUID
  collision corruptions/attacks that have been extensively discussed
  earlier, are still open

- a number of important core features not fully working in many
  situations (e.g. the issues with defrag not being ref-link aware,...
  and I vaguely remember similar things with compression).

- OTOH, defrag seems to be hardly viable for important use cases (VM images,
  DBs,... everything where large files are internally re-written
  randomly).
  Sure there is nodatacow, but with that one effectively completely
  loses one of the core features/promises of btrfs (integrity by
  checksumming)... and as I've shown in an earlier large discussion,
  none of the typical use cases for nodatacow has any high-level
  checksumming, and even if it does, it's not used per default, or doesn't
  give the same benefits as it would on the fs level, like using it for
  RAID recovery.

- other earlier anticipated features like newer/better compression or
  checksum algos seem to be dead as well

- still no real RAID 1

- no end-user/admin grade management/analysis tools that tell non-
  experts about the state/health of their fs, and whether things like
  balance etc. are necessary

- the still problematic documentation situation




smime.p7s
Description: S/MIME cryptographic signature


Re: btrfs ENOSPC "not the usual problem"

2016-06-03 Thread Liu Bo
On Thu, Jun 02, 2016 at 07:45:49PM +, Omari Stephens wrote:
> [Note: not on list; please reply-all]
> 
> I've read everything I can find about running out of space on btrfs, and it
> hasn't helped.  I'm currently dead in the water.
> 
> Everything I do seems to make the problem monotonically worse — I tried
> adding a loopback device to the fs, and now I can't remove it.  Then I tried
> adding a real device (mSATA) to the fs and now I still can't remove the
> loopback device (which is making everything super slow), and I also can't
> remove the mSATA.  I've removed about 100GB from the filesystem and that
> hasn't done anything either.
> 
> Is there anything I can to do even figure out how bad things are, what I
> need to do to make any kind of forward progress?  This is a laptop, so I
> don't want to add an external drive only to find out that I can't remove it
> without corrupting my filesystem.
> 
> ### FILESYSTEM STATE
> 19:23:14> [root{slobol}@/home/xsdg]
> #btrfs fi show /home
> Label: none  uuid: 4776be5b-5058-4248-a1b7-7c213757dfbd
> Total devices 3 FS bytes used 221.02GiB
> devid1 size 418.72GiB used 413.72GiB path /dev/sda3
> devid2 size 10.00GiB used 5.00GiB path /dev/loop0
> devid3 size 14.91GiB used 3.00GiB path /dev/sdb1
> 
> 
> 19:23:33> [root{slobol}@/home/xsdg]
> #btrfs fi usage /home
> Overall:
> Device size: 443.63GiB
> Device allocated: 421.72GiB
> Device unallocated:  21.91GiB
> Device missing: 0.00B
> Used: 221.68GiB
> Free (estimated): 219.24GiB(min: 208.29GiB)
> Data ratio:  1.00
> Metadata ratio:  2.00
> Global reserve: 228.00MiB(used: 36.00KiB)
> 
> Data,single: Size:417.69GiB, Used:220.36GiB
>/dev/loop0   5.00GiB
>/dev/sda3 409.69GiB
>/dev/sdb1   3.00GiB
> 
> Metadata,single: Size:8.00MiB, Used:0.00B
>/dev/sda3   8.00MiB
> 
> Metadata,DUP: Size:2.00GiB, Used:674.45MiB
>/dev/sda3   4.00GiB
> 
> System,single: Size:4.00MiB, Used:0.00B
>/dev/sda3   4.00MiB
> 
> System,DUP: Size:8.00MiB, Used:56.00KiB
>/dev/sda3  16.00MiB
> 
> Unallocated:
>/dev/loop0   5.00GiB
>/dev/sda3   5.00GiB
>/dev/sdb1  11.91GiB
> 
> 
> ### BALANCE FAILS, EVEN WITH -dusage=0
> 19:23:02> [root{slobol}@/home/xsdg]
> #btrfs balance start -v -dusage=0 .
> Dumping filters: flags 0x1, state 0x0, force is off
>   DATA (flags 0x2): balancing, usage=0
> ERROR: error during balancing '.': No space left on device
> There may be more info in syslog - try dmesg | tail

1. Could you please show us your `uname -r`?

2. 
http://git.kernel.org/cgit/linux/kernel/git/kdave/btrfs-progs.git/tree/btrfs-debugfs
We need more information about the block groups in order to do a more
fine-grained balance, so there is a developer tool called
'btrfs-debugfs'. You may download it from the above link; it's a python
script, and as long as you're able to run it, try btrfs-debugfs -b
/your_partition.
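
For example, assuming the script has been saved locally and /home is the
filesystem from your report above, the two items boil down to:

    uname -r
    python btrfs-debugfs -b /home
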

Thanks,

-liubo

> 
> 
> ### CAN'T REMOVE DEVICES -> ENOSPC
> #btrfs device remove /dev/loop0 /home
> ERROR: error removing device '/dev/loop0': No space left on device
> 
> --xsdg
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Recommended way to use btrfs for production?

2016-06-03 Thread Austin S. Hemmelgarn

On 2016-06-03 10:11, Martin wrote:

Make certain the kernel command timer value is greater than the driver
error recovery timeout. The former is found in sysfs, per block
device, the latter can be get and set with smartctl. Wrong
configuration is common (it's actually the default) when using
consumer drives, and inevitably leads to problems, even the loss of
the entire array. It really is a terrible default.


Are nearline SAS drives considered consumer drives?

If it's a SAS drive, then no, especially when you start talking about 
things marketed as 'nearline'.  Additionally, SCT ERC is entirely a SATA 
thing, I forget what the equivalent in SCSI (and by extension SAS) terms 
is, but I'm pretty sure that the kernel handles things differently there.

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v10 00/21] Btrfs dedupe framework

2016-06-03 Thread Josef Bacik

On 04/01/2016 02:34 AM, Qu Wenruo wrote:

This patchset can be fetched from github:
https://github.com/adam900710/linux.git wang_dedupe_20160401

In this patchset, we're proud to bring a completely new storage backend:
Khala backend.

With Khala backend, all dedupe hash will be restored in the Khala,
shared with every Kalai protoss, with unlimited storage and almost zero
search latency.
A perfect backend for any Kalai protoss. "My life for Aiur!"

Unfortunately, such backend is not available for human.


OK, except the super-fancy and date-related backend, the patchset is
still a serious patchset.
In this patchset, we mostly addressed the on-disk format change comment from
Chris:
1) Reduced dedupe hash item and bytenr item.
   Now dedupe hash item structure size is reduced from 41 bytes
   (9 bytes hash_item + 32 bytes hash)
   to 29 bytes (5 bytes hash_item + 24 bytes hash)
   Without the last patch, it's even less with only 24 bytes
   (24 bytes hash only).
   And dedupe bytenr item structure size is reduced from 32 bytes (full
   hash) to 0.

2) Hide dedupe ioctls into CONFIG_BTRFS_DEBUG
   Advised by David, to make btrfs dedupe an experimental feature for
   advanced users.
   This is used to allow the patchset to be merged while still allowing us
   to change the ioctls in the future.

3) Add back missing bug fix patches
   I just missed 2 bug fix patches in previous iteration.
   Adding them back.

Now patches 1~11 provide the full backward-compatible in-memory backend.
And patches 12~14 provide the per-file dedupe flag feature.
Patches 15~20 provide the on-disk dedupe backend with persistent dedupe
state for the in-memory backend.
The last patch is just preparation for possible dedupe-compress co-work.



You can add

Reviewed-by: Josef Bacik 

to everything I didn't comment on (and not the ENOSPC one either, but I 
commented on that one last time).


But just because I've reviewed it doesn't mean it's ready to go in. 
Before we are going to take this I want to see the following


1) fsck support for dedupe that verifies the hashes with what is on disk 
so any xfstests we write are sure to catch problems.
2) xfstests.  They need to do the following things for both in memory 
and ondisk

a) targeted verification.  So write one pattern, write the same
   pattern to a different file and use fiemap to verify they are the
   same (a rough sketch follows after this list).
b) modify fsstress to have an option to always write the same
   pattern and then run a stress test while balancing.

Once the issues I've highlighted in the other patches are resolved and the 
above xfstests things are merged and the fsck patches are 
reviewed/accepted then we can move forward with including dedup.  Thanks,


Josef
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v10 20/21] btrfs: dedupe: Add support for adding hash for on-disk backend

2016-06-03 Thread Josef Bacik

On 04/01/2016 02:35 AM, Qu Wenruo wrote:

Now the on-disk backend can add hashes.

Since all needed on-disk backend functions are added, also allow the
on-disk backend to be used, by changing DEDUPE_BACKEND_COUNT from 1
(inmemory only) to 2 (inmemory + ondisk).

Signed-off-by: Wang Xiaoguang 
Signed-off-by: Qu Wenruo 
---
 fs/btrfs/dedupe.c | 83 +++
 fs/btrfs/dedupe.h |  3 +-
 2 files changed, 84 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/dedupe.c b/fs/btrfs/dedupe.c
index 7c5d58a..1f0178e 100644
--- a/fs/btrfs/dedupe.c
+++ b/fs/btrfs/dedupe.c
@@ -437,6 +437,87 @@ out:
return 0;
 }

+static int ondisk_search_bytenr(struct btrfs_trans_handle *trans,
+   struct btrfs_dedupe_info *dedupe_info,
+   struct btrfs_path *path, u64 bytenr,
+   int prepare_del);
+static int ondisk_search_hash(struct btrfs_dedupe_info *dedupe_info, u8 *hash,
+ u64 *bytenr_ret, u32 *num_bytes_ret);
+static int ondisk_add(struct btrfs_trans_handle *trans,
+ struct btrfs_dedupe_info *dedupe_info,
+ struct btrfs_dedupe_hash *hash)
+{
+   struct btrfs_path *path;
+   struct btrfs_root *dedupe_root = dedupe_info->dedupe_root;
+   struct btrfs_key key;
+   u64 hash_offset;
+   u64 bytenr;
+   u32 num_bytes;
+   int hash_len = btrfs_dedupe_sizes[dedupe_info->hash_type];
+   int ret;
+
+   if (WARN_ON(hash_len <= 8 ||
+   !IS_ALIGNED(hash->bytenr, dedupe_root->sectorsize)))
+   return -EINVAL;
+
+   path = btrfs_alloc_path();
+   if (!path)
+   return -ENOMEM;
+
+   mutex_lock(_info->lock);
+
+   ret = ondisk_search_bytenr(NULL, dedupe_info, path, hash->bytenr, 0);
+   if (ret < 0)
+   goto out;
+   if (ret > 0) {
+   ret = 0;
+   goto out;
+   }
+   btrfs_release_path(path);
+
+   ret = ondisk_search_hash(dedupe_info, hash->hash, &bytenr, &num_bytes);
+   if (ret < 0)
+   goto out;
+   /* Same hash found, don't re-add to save dedupe tree space */
+   if (ret > 0) {
+   ret = 0;
+   goto out;
+   }
+
+   /* Insert hash->bytenr item */
+   memcpy(&hash_offset, hash->hash + hash_len - 8, 8);


No magic numbers please.  Thanks,

Josef
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v10 18/21] btrfs: dedupe: Add support for on-disk hash search

2016-06-03 Thread Josef Bacik

On 04/01/2016 02:35 AM, Qu Wenruo wrote:

Now the on-disk backend is able to search hashes.

Signed-off-by: Wang Xiaoguang 
Signed-off-by: Qu Wenruo 
---
 fs/btrfs/dedupe.c | 167 --
 fs/btrfs/dedupe.h |   1 +
 2 files changed, 151 insertions(+), 17 deletions(-)

diff --git a/fs/btrfs/dedupe.c b/fs/btrfs/dedupe.c
index a274c1c..00f2a01 100644
--- a/fs/btrfs/dedupe.c
+++ b/fs/btrfs/dedupe.c
@@ -652,6 +652,112 @@ int btrfs_dedupe_disable(struct btrfs_fs_info *fs_info)
 }

 /*
+ * Compare ondisk hash with src.
+ * Return 0 if hash matches.
+ * Return non-zero for hash mismatch
+ *
+ * Caller should ensure the slot contains a valid hash item.
+ */
+static int memcmp_ondisk_hash(const struct btrfs_key *key,
+ struct extent_buffer *node, int slot,
+ int hash_len, const u8 *src)
+{
+   u64 offset;
+   int ret;
+
+   /* Return value doesn't make sense in this case though */
+   if (WARN_ON(hash_len <= 8 || key->type != BTRFS_DEDUPE_HASH_ITEM_KEY))


No magic numbers please.


+   return -EINVAL;
+
+   /* compare the hash excluding the last 64 bits */
+   offset = btrfs_item_ptr_offset(node, slot);
+   ret = memcmp_extent_buffer(node, src, offset, hash_len - 8);
+   if (ret)
+   return ret;
+   return memcmp(&key->objectid, src + hash_len - 8, 8);
+}
+
+ /*
+ * Return 0 for not found
+ * Return >0 for found and set bytenr_ret
+ * Return <0 for error
+ */
+static int ondisk_search_hash(struct btrfs_dedupe_info *dedupe_info, u8 *hash,
+ u64 *bytenr_ret, u32 *num_bytes_ret)
+{
+   struct btrfs_path *path;
+   struct btrfs_key key;
+   struct btrfs_root *dedupe_root = dedupe_info->dedupe_root;
+   u8 *buf = NULL;
+   u64 hash_key;
+   int hash_len = btrfs_dedupe_sizes[dedupe_info->hash_type];
+   int ret;
+
+   path = btrfs_alloc_path();
+   if (!path)
+   return -ENOMEM;
+
+   buf = kmalloc(hash_len, GFP_NOFS);
+   if (!buf) {
+   ret = -ENOMEM;
+   goto out;
+   }
+
+   memcpy(&hash_key, hash + hash_len - 8, 8);
+   key.objectid = hash_key;
+   key.type = BTRFS_DEDUPE_HASH_ITEM_KEY;
+   key.offset = (u64)-1;
+
+   ret = btrfs_search_slot(NULL, dedupe_root, &key, path, 0, 0);
+   if (ret < 0)
+   goto out;
+   WARN_ON(ret == 0);
+   while (1) {
+   struct extent_buffer *node;
+   struct btrfs_dedupe_hash_item *hash_item;
+   int slot;
+
+   ret = btrfs_previous_item(dedupe_root, path, hash_key,
+ BTRFS_DEDUPE_HASH_ITEM_KEY);
+   if (ret < 0)
+   break;
+   if (ret > 0) {
+   ret = 0;
+   break;
+   }
+
+   node = path->nodes[0];
+   slot = path->slots[0];
+   btrfs_item_key_to_cpu(node, &key, slot);
+
+   /*
+* Type or objectid mismatch means no previous item may
+* hit, exit searching
+*/
+   if (key.type != BTRFS_DEDUPE_HASH_ITEM_KEY ||
+   memcmp(&key.objectid, &hash_key, 8))
+   break;
+   hash_item = btrfs_item_ptr(node, slot,
+   struct btrfs_dedupe_hash_item);
+   /*
+* If the hash mismatch, it's still possible that previous item
+* has the desired hash.
+*/
+   if (memcmp_ondisk_hash(&key, node, slot, hash_len, hash))
+   continue;
+   /* Found */
+   ret = 1;
+   *bytenr_ret = key.offset;
+   *num_bytes_ret = dedupe_info->blocksize;
+   break;
+   }
+out:
+   kfree(buf);
+   btrfs_free_path(path);
+   return ret;
+}
+
+/*
  * Caller must ensure the corresponding ref head is not being run.
  */
 static struct inmem_hash *
@@ -681,9 +787,36 @@ inmem_search_hash(struct btrfs_dedupe_info *dedupe_info, 
u8 *hash)
return NULL;
 }

-static int inmem_search(struct btrfs_dedupe_info *dedupe_info,
-   struct inode *inode, u64 file_pos,
-   struct btrfs_dedupe_hash *hash)
+/* Wrapper for different backends, caller needs to hold dedupe_info->lock */
+static inline int generic_search_hash(struct btrfs_dedupe_info *dedupe_info,
+ u8 *hash, u64 *bytenr_ret,
+ u32 *num_bytes_ret)
+{
+   if (dedupe_info->backend == BTRFS_DEDUPE_BACKEND_INMEMORY) {
+   struct inmem_hash *found_hash;
+   int ret;
+
+   found_hash = inmem_search_hash(dedupe_info, hash);
+   if (found_hash) {
+   

Re: [PATCH v10 17/21] btrfs: dedupe: Introduce interfaces to resume and cleanup dedupe info

2016-06-03 Thread Josef Bacik

On 04/01/2016 02:35 AM, Qu Wenruo wrote:

Since we will introduce a new on-disk based dedupe method, introduce new
interfaces to resume previous dedupe setup.

And since we introduce a new tree for status, also add disable handler
for it.

Signed-off-by: Wang Xiaoguang 
Signed-off-by: Qu Wenruo 
---
 fs/btrfs/dedupe.c  | 197 -
 fs/btrfs/dedupe.h  |  13 
 fs/btrfs/disk-io.c |  25 ++-
 fs/btrfs/disk-io.h |   1 +
 4 files changed, 232 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/dedupe.c b/fs/btrfs/dedupe.c
index cfb7fea..a274c1c 100644
--- a/fs/btrfs/dedupe.c
+++ b/fs/btrfs/dedupe.c
@@ -21,6 +21,8 @@
 #include "transaction.h"
 #include "delayed-ref.h"
 #include "qgroup.h"
+#include "disk-io.h"
+#include "locking.h"

 struct inmem_hash {
struct rb_node hash_node;
@@ -102,10 +104,69 @@ static int init_dedupe_info(struct btrfs_dedupe_info 
**ret_info, u16 type,
return 0;
 }

+static int init_dedupe_tree(struct btrfs_fs_info *fs_info,
+   struct btrfs_dedupe_info *dedupe_info)
+{
+   struct btrfs_root *dedupe_root;
+   struct btrfs_key key;
+   struct btrfs_path *path;
+   struct btrfs_dedupe_status_item *status;
+   struct btrfs_trans_handle *trans;
+   int ret;
+
+   path = btrfs_alloc_path();
+   if (!path)
+   return -ENOMEM;
+
+   trans = btrfs_start_transaction(fs_info->tree_root, 2);
+   if (IS_ERR(trans)) {
+   ret = PTR_ERR(trans);
+   goto out;
+   }
+   dedupe_root = btrfs_create_tree(trans, fs_info,
+  BTRFS_DEDUPE_TREE_OBJECTID);
+   if (IS_ERR(dedupe_root)) {
+   ret = PTR_ERR(dedupe_root);
+   btrfs_abort_transaction(trans, fs_info->tree_root, ret);
+   goto out;
+   }
+   dedupe_info->dedupe_root = dedupe_root;
+
+   key.objectid = 0;
+   key.type = BTRFS_DEDUPE_STATUS_ITEM_KEY;
+   key.offset = 0;
+
+   ret = btrfs_insert_empty_item(trans, dedupe_root, path, &key,
+ sizeof(*status));
+   if (ret < 0) {
+   btrfs_abort_transaction(trans, fs_info->tree_root, ret);
+   goto out;
+   }
+
+   status = btrfs_item_ptr(path->nodes[0], path->slots[0],
+   struct btrfs_dedupe_status_item);
+   btrfs_set_dedupe_status_blocksize(path->nodes[0], status,
+dedupe_info->blocksize);
+   btrfs_set_dedupe_status_limit(path->nodes[0], status,
+   dedupe_info->limit_nr);
+   btrfs_set_dedupe_status_hash_type(path->nodes[0], status,
+   dedupe_info->hash_type);
+   btrfs_set_dedupe_status_backend(path->nodes[0], status,
+   dedupe_info->backend);
+   btrfs_mark_buffer_dirty(path->nodes[0]);
+out:
+   btrfs_free_path(path);
+   if (ret == 0)
+   btrfs_commit_transaction(trans, fs_info->tree_root);


Still need to call btrfs_end_transaction() if we aborted to clean things 
up.  Thanks,


Josef
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v10 09/21] btrfs: dedupe: Inband in-memory only de-duplication implement

2016-06-03 Thread Josef Bacik

On 04/01/2016 02:35 AM, Qu Wenruo wrote:

Core implementation for inband de-duplication.
It reuses the async_cow_start() facility to calculate the dedupe hash,
And use dedupe hash to do inband de-duplication at extent level.

The work flow is as below:
1) Run delalloc range for an inode
2) Calculate hash for the delalloc range at the unit of dedupe_bs
3) For hash match(duplicated) case, just increase source extent ref
   and insert file extent.
   For hash mismatch case, go through the normal cow_file_range()
   fallback, and add hash into dedupe_tree.
   Compress for hash miss case is not supported yet.

The current implementation stores all dedupe hashes in an in-memory
rb-tree, with LRU behavior to control the limit.

Signed-off-by: Wang Xiaoguang 
Signed-off-by: Qu Wenruo 
---
 fs/btrfs/extent-tree.c |  18 
 fs/btrfs/inode.c   | 235 ++---
 fs/btrfs/relocation.c  |  16 
 3 files changed, 236 insertions(+), 33 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 53e1297..dabd721 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c






@@ -1076,6 +1135,68 @@ out_unlock:
goto out;
 }

+static int hash_file_ranges(struct inode *inode, u64 start, u64 end,
+   struct async_cow *async_cow, int *num_added)
+{
+   struct btrfs_root *root = BTRFS_I(inode)->root;
+   struct btrfs_fs_info *fs_info = root->fs_info;
+   struct btrfs_dedupe_info *dedupe_info = fs_info->dedupe_info;
+   struct page *locked_page = async_cow->locked_page;
+   u16 hash_algo;
+   u64 actual_end;
+   u64 isize = i_size_read(inode);
+   u64 dedupe_bs;
+   u64 cur_offset = start;
+   int ret = 0;
+
+   actual_end = min_t(u64, isize, end + 1);
+   /* If dedupe is not enabled, don't split extent into dedupe_bs */
+   if (fs_info->dedupe_enabled && dedupe_info) {
+   dedupe_bs = dedupe_info->blocksize;
+   hash_algo = dedupe_info->hash_type;
+   } else {
+   dedupe_bs = SZ_128M;
+   /* Just dummy, to avoid access NULL pointer */
+   hash_algo = BTRFS_DEDUPE_HASH_SHA256;
+   }
+
+   while (cur_offset < end) {
+   struct btrfs_dedupe_hash *hash = NULL;
+   u64 len;
+
+   len = min(end + 1 - cur_offset, dedupe_bs);
+   if (len < dedupe_bs)
+   goto next;
+
+   hash = btrfs_dedupe_alloc_hash(hash_algo);
+   if (!hash) {
+   ret = -ENOMEM;
+   goto out;
+   }
+   ret = btrfs_dedupe_calc_hash(fs_info, inode, cur_offset, hash);
+   if (ret < 0)
+   goto out;
+
+   ret = btrfs_dedupe_search(fs_info, inode, cur_offset, hash);
+   if (ret < 0)
+   goto out;


You leak hash in both of these cases.  Also if btrfs_dedup_search




+   if (ret < 0)
+   goto out_qgroup;
+
+   /*
+* Hash hit won't create a new data extent, so its reserved quota
+* space won't be freed by new delayed_ref_head.
+* Need to free it here.
+*/
+   if (btrfs_dedupe_hash_hit(hash))
+   btrfs_qgroup_free_data(inode, file_pos, ram_bytes);
+
+   /* Add missed hash into dedupe tree */
+   if (hash && hash->bytenr == 0) {
+   hash->bytenr = ins.objectid;
+   hash->num_bytes = ins.offset;
+   ret = btrfs_dedupe_add(trans, root->fs_info, hash);


I don't want to flip read only if we fail this in the in-memory mode. 
Thanks,


Josef
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Recommended way to use btrfs for production?

2016-06-03 Thread Martin
> I would say it is, but I also don't have quite as much experience with it as
> with BTRFS raid1 mode.  The one thing I do know for certain about it is that
> even if it theoretically could recover from two failed disks (ie, if they're
> from different positions in the striping of each mirror), there is no code
> to actually do so, so make sure you replace any failed disks as soon as
> possible (or at least balance the array so that you don't have a missing
> device anymore).

Ok, so that really speaks for raid1...

> Most of my systems where I would run raid10 mode are set up as BTRFS raid1
> on top of two LVM based RAID0 volumes, as this gets measurably better
> performance than BTRFS raid10 mode at the moment (I see roughly a 10-20%
> difference on my home server system), and provides the same data safety
> guarantees as well.  It's worth noting for such a setup that the current
> default block size in BTRFS is 16k except on very small filesystems, so you
> may want a larger stripe size than you would on a traditional filesystem.
>
> As far as BTRFS raid10 mode in general, there are a few things that are
> important to remember about it:
> 1. It stores exactly two copies of everything, any extra disks just add to
> the stripe length on each copy.
> 2. Because each stripe has the same number of disks as its mirrored
> partner, the total number of disks in any chunk allocation will always be
> even, which means that if you're using an odd number of disks, there will
> always be one left out of every chunk.  This has limited impact on actual
> performance usually, but can cause confusing results if you have differently
> sized disks.
> 3. BTRFS (whether using raid10, raid0, or even raid5/6) will always try to
> use as many devices as possible for a stripe.  As a result of this, the
> moment you add a new disk, the total length of all new stripes will adjust
> to fit the new configuration.  If you want maximal performance when adding
> new disks, make sure to balance the rest of the filesystem afterwards,
> otherwise any existing stripes will just stay the same size.

Those are very good things to know!
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v10 09/21] btrfs: dedupe: Inband in-memory only de-duplication implement

2016-06-03 Thread Josef Bacik

On 06/01/2016 09:12 PM, Qu Wenruo wrote:



At 06/02/2016 06:08 AM, Mark Fasheh wrote:

On Fri, Apr 01, 2016 at 02:35:00PM +0800, Qu Wenruo wrote:

Core implementation for inband de-duplication.
It reuses the async_cow_start() facility to calculate the dedupe hash,
And use dedupe hash to do inband de-duplication at extent level.

The work flow is as below:
1) Run delalloc range for an inode
2) Calculate hash for the delalloc range at the unit of dedupe_bs
3) For hash match(duplicated) case, just increase source extent ref
   and insert file extent.
   For hash mismatch case, go through the normal cow_file_range()
   fallback, and add hash into dedupe_tree.
   Compress for hash miss case is not supported yet.

The current implementation stores all dedupe hashes in an in-memory
rb-tree, with LRU behavior to control the limit.

Signed-off-by: Wang Xiaoguang 
Signed-off-by: Qu Wenruo 
---
 fs/btrfs/extent-tree.c |  18 
 fs/btrfs/inode.c   | 235
++---
 fs/btrfs/relocation.c  |  16 
 3 files changed, 236 insertions(+), 33 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 53e1297..dabd721 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -37,6 +37,7 @@
 #include "math.h"
 #include "sysfs.h"
 #include "qgroup.h"
+#include "dedupe.h"

 #undef SCRAMBLE_DELAYED_REFS

@@ -2399,6 +2400,8 @@ static int run_one_delayed_ref(struct
btrfs_trans_handle *trans,

 if (btrfs_delayed_ref_is_head(node)) {
 struct btrfs_delayed_ref_head *head;
+struct btrfs_fs_info *fs_info = root->fs_info;
+
 /*
  * we've hit the end of the chain and we were supposed
  * to insert this extent into the tree.  But, it got
@@ -2413,6 +2416,15 @@ static int run_one_delayed_ref(struct
btrfs_trans_handle *trans,
 btrfs_pin_extent(root, node->bytenr,
  node->num_bytes, 1);
 if (head->is_data) {
+/*
+ * If insert_reserved is given, it means
+ * a new extent is reserved, then deleted
+ * in one transaction, and inc/dec get merged to 0.
+ *
+ * In this case, we need to remove its dedup
+ * hash.
+ */
+btrfs_dedupe_del(trans, fs_info, node->bytenr);
 ret = btrfs_del_csums(trans, root,
   node->bytenr,
   node->num_bytes);
@@ -6713,6 +6725,12 @@ static int __btrfs_free_extent(struct
btrfs_trans_handle *trans,
 btrfs_release_path(path);

 if (is_data) {
+ret = btrfs_dedupe_del(trans, info, bytenr);
+if (ret < 0) {
+btrfs_abort_transaction(trans, extent_root,
+ret);


I don't see why an error here should lead to a readonly fs.
--Mark



Because such a deletion error can lead to corruption.

For example, extent A is already in the hash pool.
When freeing extent A, we need to delete its hash, of course.

But if that deletion fails, the hash is still in the pool even though
extent A no longer exists in the extent tree.


Except if we're in in-memory mode only it doesn't matter, so don't abort 
if we're in in-memory mode.  Thanks,


Josef

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Recommended way to use btrfs for production?

2016-06-03 Thread Austin S. Hemmelgarn

On 2016-06-03 09:31, Martin wrote:

In general, avoid Ubuntu LTS versions when dealing with BTRFS, as well as
most enterprise distros, they all tend to back-port patches instead of using
newer kernels, which means it's functionally impossible to provide good
support for them here (because we can't know for sure what exactly they've
back-ported).  I'd suggest building your own kernel if possible, with Arch
Linux being a close second (they follow upstream very closely), followed by
Fedora and non-LTS Ubuntu.


Then I would build my own, if that is the preferred option.
If you do go this route, make sure to keep an eye on the mailing list, 
as this is usually where any bugs get reported.  New bugs have 
thankfully been decreasing in number each release, but they do still 
happen, and it's important to know what to avoid and what to look out 
for when dealing with something under such active development.



Do not use BTRFS raid6 mode in production, it has at least 2 known serious
bugs that may cause complete loss of the array due to a disk failure.  Both
of these issues have as of yet unknown trigger conditions, although they do
seem to occur more frequently with larger arrays.


Ok. No raid6.


That said, there are other options.  If you have enough disks, you can run
BTRFS raid1 on top of LVM or MD RAID5 or RAID6, which provides you with the
benefits of both.

Alternatively, you could use BTRFS raid1 on top of LVM or MD RAID1, which
actually gets relatively decent performance and can provide even better
guarantees than RAID6 would (depending on how you set it up, you can lose a
lot more disks safely).  If you go this way, I'd suggest setting up disks in
pairs at the lower level, and then just let BTRFS handle spanning the data
across disks (BTRFS raid1 mode keeps exactly two copies of each block).
While this is not quite as efficient as just doing LVM based RAID6 with a
traditional FS on top, it's also a lot easier to handle reshaping the array
on-line because of the device management in BTRFS itself.


Right now I only have 10TB of backup data, but this will grow when
urbackup is rolled out. So maybe I could get away with plain btrfs
raid10 for the first year, and then re-balance to raid6 when the two
bugs have been found...

is the failed disk handling in btrfs raid10 considered stable?

I would say it is, but I also don't have quite as much experience with 
it as with BTRFS raid1 mode.  The one thing I do know for certain about 
it is that even if it theoretically could recover from two failed disks 
(ie, if they're from different positions in the striping of each 
mirror), there is no code to actually do so, so make sure you replace 
any failed disks as soon as possible (or at least balance the array so 
that you don't have a missing device anymore).
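
Just as a rough sketch of what handling a failed disk looks like in
practice (device names and the devid are made up, /mnt is the array):

    mount -o degraded /dev/sda /mnt            # if the failed disk is already gone
    btrfs replace start 3 /dev/sde /mnt        # preferred: replace devid 3 with a spare
    # or, with no spare at hand:
    btrfs device remove missing /mnt           # re-replicates chunks onto the remaining disks
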


Most of my systems where I would run raid10 mode are set up as BTRFS 
raid1 on top of two LVM based RAID0 volumes, as this gets measurably 
better performance than BTRFS raid10 mode at the moment (I see roughly a 
10-20% difference on my home server system), and provides the same data 
safety guarantees as well.  It's worth noting for such a setup that the 
current default block size in BTRFS is 16k except on very small 
filesystems, so you may want a larger stripe size than you would on a 
traditional filesystem.
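
If it helps, a bare-bones sketch of that layout (VG name, devices, sizes
and the 64K stripe size are all just example values):

    lvcreate -i 2 -I 64 -L 1T -n r0a vg0 /dev/sda1 /dev/sdb1   # striped LV, pair 1
    lvcreate -i 2 -I 64 -L 1T -n r0b vg0 /dev/sdc1 /dev/sdd1   # striped LV, pair 2
    mkfs.btrfs -d raid1 -m raid1 /dev/vg0/r0a /dev/vg0/r0b     # BTRFS raid1 across the two
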


As far as BTRFS raid10 mode in general, there are a few things that are 
important to remember about it:
1. It stores exactly two copies of everything, any extra disks just add 
to the stripe length on each copy.
2. Because each stripe has the same number of disks as its mirrored 
partner, the total number of disks in any chunk allocation will always 
be even, which means that if you're using an odd number of disks, there 
will always be one left out of every chunk.  This has limited impact on 
actual performance usually, but can cause confusing results if you have 
differently sized disks.
3. BTRFS (whether using raid10, raid0, or even raid5/6) will always try 
to use as many devices as possible for a stripe.  As a result of this, 
the moment you add a new disk, the total length of all new stripes will 
adjust to fit the new configuration.  If you want maximal performance 
when adding new disks, make sure to balance the rest of the filesystem 
afterwards, otherwise any existing stripes will just stay the same size.

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Recommended way to use btrfs for production?

2016-06-03 Thread Martin
> Make certain the kernel command timer value is greater than the driver
> error recovery timeout. The former is found in sysfs, per block
> device, the latter can be get and set with smartctl. Wrong
> configuration is common (it's actually the default) when using
> consumer drives, and inevitably leads to problems, even the loss of
> the entire array. It really is a terrible default.

Are nearline SAS drives considered consumer drives?
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Recommended way to use btrfs for production?

2016-06-03 Thread Chris Murphy
On Fri, Jun 3, 2016 at 6:55 AM, Austin S. Hemmelgarn
 wrote:

>
> That said, there are other options.  If you have enough disks, you can run
> BTRFS raid1 on top of LVM or MD RAID5 or RAID6, which provides you with the
> benefits of both.

There is a trade-off. mdadm or lvm raid5/raid6 are more mature and
stable, but it's more maintenance. You have a btrfs scrub as well as
the md scrub. Btrfs on md/lvm raid56 will detect mismatches but won't
be able to fix them, because from its perspective there's no
redundancy, except possibly metadata. So the repair has to happen on
the mdadm/lvm side.

Make certain the kernel command timer value is greater than the driver
error recovery timeout. The former is found in sysfs, per block
device, the latter can be get and set with smartctl. Wrong
configuration is common (it's actually the default) when using
consumer drives, and inevitably leads to problems, even the loss of
the entire array. It really is a terrible default.
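
As a concrete example (assuming a SATA drive at /dev/sda; the numbers are
common choices, not recommendations):

    smartctl -l scterc /dev/sda                 # show the drive's error recovery timeout
    cat /sys/block/sda/device/timeout           # show the kernel command timer, in seconds
    smartctl -l scterc,70,70 /dev/sda           # cap read/write recovery at 7.0 seconds
    echo 180 > /sys/block/sda/device/timeout    # keep the kernel timer well above that
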


-- 
Chris Murphy
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2] btrfs: fix check_shared for fiemap ioctl

2016-06-03 Thread Josef Bacik

On 06/01/2016 01:48 AM, Lu Fengqi wrote:

Previously, check_shared only identified an extent as shared in the case
of different root_ids or different object_ids. However, if an extent is
referred to by different offsets of the same file, it should also be
identified as shared. In addition, check_shared's loop scales at least as
n^3, so an extent with too many references can even cause a soft hang.

First, add all delayed refs to the ref_tree and calculate the unique_refs;
if unique_refs is greater than one, return BACKREF_FOUND_SHARED. Then
individually add the on-disk references (inline/keyed) to the ref_tree and
check whether the unique_refs of the ref_tree is greater than one. Because
we return SHARED as soon as there are two references, the time complexity
is close to constant.

Reported-by: Tsutomu Itoh 
Signed-off-by: Lu Fengqi 


This is a lot of work for just wanting to know if something is shared. 
Instead lets adjust this slightly.  Instead of passing down a 
root_objectid/inum and noticing this and returned shared, add a new way 
to iterate refs.  Currently we go gather all the refs and then do the 
iterate dance, which is what takes so long.  So instead add another 
helper that calls the provided function every time it has a match, and 
then we can pass in whatever context we want, and we return when 
something matches.  This way we don't have all this extra accounting, 
and we're no longer passing root_objectid/inum around and testing for 
some magic scenario.  Thanks,


Josef

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Recommended way to use btrfs for production?

2016-06-03 Thread Julian Taylor

On 06/03/2016 03:31 PM, Martin wrote:

In general, avoid Ubuntu LTS versions when dealing with BTRFS, as well as
most enterprise distros, they all tend to back-port patches instead of using
newer kernels, which means it's functionally impossible to provide good
support for them here (because we can't know for sure what exactly they've
back-ported).  I'd suggest building your own kernel if possible, with Arch
Linux being a close second (they follow upstream very closely), followed by
Fedora and non-LTS Ubuntu.


Then I would build my own, if that is the preferred option.



Ubuntu also provides newer kernels for their LTS via the Hardware 
Enablement Stack:


https://wiki.ubuntu.com/Kernel/LTSEnablementStack

So if you can live with roughly a 6-month time lag, and with shorter
support for the non-LTS versions of those kernels, that is a good option.
As you can see, 16.04 currently provides 4.4 and the next update will
likely be 4.8.

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Recommended way to use btrfs for production?

2016-06-03 Thread Martin
> In general, avoid Ubuntu LTS versions when dealing with BTRFS, as well as
> most enterprise distros, they all tend to back-port patches instead of using
> newer kernels, which means it's functionally impossible to provide good
> support for them here (because we can't know for sure what exactly they've
> back-ported).  I'd suggest building your own kernel if possible, with Arch
> Linux being a close second (they follow upstream very closely), followed by
> Fedora and non-LTS Ubuntu.

Then I would build my own, if that is the preferred option.

> Do not use BTRFS raid6 mode in production, it has at least 2 known serious
> bugs that may cause complete loss of the array due to a disk failure.  Both
> of these issues have as of yet unknown trigger conditions, although they do
> seem to occur more frequently with larger arrays.

Ok. No raid6.

> That said, there are other options.  If you have enough disks, you can run
> BTRFS raid1 on top of LVM or MD RAID5 or RAID6, which provides you with the
> benefits of both.
>
> Alternatively, you could use BTRFS raid1 on top of LVM or MD RAID1, which
> actually gets relatively decent performance and can provide even better
> guarantees than RAID6 would (depending on how you set it up, you can lose a
> lot more disks safely).  If you go this way, I'd suggest setting up disks in
> pairs at the lower level, and then just let BTRFS handle spanning the data
> across disks (BTRFS raid1 mode keeps exactly two copies of each block).
> While this is not quite as efficient as just doing LVM based RAID6 with a
> traditional FS on top, it's also a lot easier to handle reshaping the array
> on-line because of the device management in BTRFS itself.

Right now I only have 10TB of backup data, but this will grow when
urbackup is rolled out. So maybe I could get away with plain btrfs
raid10 for the first year, and then re-balance to raid6 when the two
bugs have been found...

is the failed disk handling in btrfs raid10 considered stable?
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/5] btrfs-progs: btrfs-crc: fix build error

2016-06-03 Thread David Sterba
On Thu, Jun 02, 2016 at 05:06:37PM +0900, Satoru Takeuchi wrote:
> Remove the following build error.
> 
>
>$ make btrfs-crc
>[CC] btrfs-crc.o
>[LD] btrfs-crc
>btrfs-crc.o: In function `usage':
>/home/sat/src/btrfs-progs/btrfs-crc.c:26: multiple definition of `usage'
>help.o:/home/sat/src/btrfs-progs/help.c:125: first defined here
>collect2: error: ld returned 1 exit status
>Makefile:294: recipe for target 'btrfs-crc' failed
>make: *** [btrfs-crc] Error 1
>=
> 
> Signed-off-by: Satoru Takeuchi 

1-5 applied, thanks.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Recommended way to use btrfs for production?

2016-06-03 Thread Austin S. Hemmelgarn

On 2016-06-03 05:49, Martin wrote:

Hello,

We would like to use urBackup to make laptop backups, and they mention
btrfs as an option.

https://www.urbackup.org/administration_manual.html#x1-8400010.6

So if we go with btrfs and we need 100TB usable space in raid6, and to
have it replicated each night to another btrfs server for "backup" of
the backup, how should we then install btrfs?

E.g. Should we use the latest Fedora, CentOS, Ubuntu, Ubuntu LTS, or
should we compile the kernel ourselves?
In general, avoid Ubuntu LTS versions when dealing with BTRFS, as well 
as most enterprise distros, they all tend to back-port patches instead 
of using newer kernels, which means it's functionally impossible to 
provide good support for them here (because we can't know for sure what 
exactly they've back-ported).  I'd suggest building your own kernel if 
possible, with Arch Linux being a close second (they follow upstream 
very closely), followed by Fedora and non-LTS Ubuntu.
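
If you do build your own, the rough shape of it is something like this
(the version number and config source are only an example):

    wget https://cdn.kernel.org/pub/linux/kernel/v4.x/linux-4.6.tar.xz
    tar xf linux-4.6.tar.xz && cd linux-4.6
    cp /boot/config-$(uname -r) .config   # start from the running kernel's config
    make olddefconfig
    make -j$(nproc)
    make modules_install install
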


And a bonus question: How stable is raid6 and detecting and replacing
failed drives?
Do not use BTRFS raid6 mode in production, it has at least 2 known 
serious bugs that may cause complete loss of the array due to a disk 
failure.  Both of these issues have as of yet unknown trigger 
conditions, although they do seem to occur more frequently with larger 
arrays.


That said, there are other options.  If you have enough disks, you can 
run BTRFS raid1 on top of LVM or MD RAID5 or RAID6, which provides you 
with the benefits of both.


Alternatively, you could use BTRFS raid1 on top of LVM or MD RAID1, 
which actually gets relatively decent performance and can provide even 
better guarantees than RAID6 would (depending on how you set it up, you 
can lose a lot more disks safely).  If you go this way, I'd suggest 
setting up disks in pairs at the lower level, and then just let BTRFS 
handle spanning the data across disks (BTRFS raid1 mode keeps exactly 
two copies of each block).  While this is not quite as efficient as just 
doing LVM based RAID6 with a traditional FS on top, it's also a lot 
easier to handle reshaping the array on-line because of the device 
management in BTRFS itself.
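
A bare-bones sketch of that paired setup (device names are made up):

    mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sda /dev/sdb   # pair 1
    mdadm --create /dev/md1 --level=1 --raid-devices=2 /dev/sdc /dev/sdd   # pair 2
    mkfs.btrfs -d raid1 -m raid1 /dev/md0 /dev/md1    # BTRFS raid1 spanning the pairs
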

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: "No space left on device" and balance doesn't work

2016-06-03 Thread Austin S. Hemmelgarn

On 2016-06-02 18:45, Henk Slager wrote:

On Thu, Jun 2, 2016 at 3:55 PM, MegaBrutal  wrote:

2016-06-02 0:22 GMT+02:00 Henk Slager :

What is the kernel version used?
Is the fs on a mechanical disk or SSD?
What are the mount options?
How old is the fs?


Linux 4.4.0-22-generic (Ubuntu 16.04).
Mechanical disks in LVM.
Mount: /dev/mapper/centrevg-rootlv on / type btrfs
(rw,relatime,space_cache,subvolid=257,subvol=/@)
I don't know how to retrieve the exact FS age, but it was created in
2014 August.

Snapshots (their names encode their creation dates):

ID 908 gen 487349 top level 5 path @-snapshot-2016050301

...

ID 937 gen 521829 top level 5 path @-snapshot-2016060201

Removing old snapshots is the most feasible solution, but I can also
increase the FS size. It's easy since it's in LVM, and there is plenty
of space in the volume group.

Probably I should rewrite my alert script to check btrfs fi show
instead of plain df.


Yes, I think that makes sense, deciding at the chunk level. You can see
how big the chunks are with the linked show_usage.py program; most of
the 33 should be 1GiB, as already very well explained by Austin.

The setup looks pretty normal and btrfs should be able to handle
it, but unfortunately your fs is a typical example of how one currently
needs to monitor/tune a btrfs fs for its 'health' in order to keep it
running long-term. You might want to change the mount option relatime to
noatime, so that you have fewer writes to metadata chunks. It should
lower the scattering inside the metadata chunks.
Also, since you're on a new enough kernel, try 'lazytime' in the mount 
options as well; this defers all on-disk timestamp updates for up to 24 
hours or until the inode gets written out anyway, but keeps the updated 
info in memory.  The only downside to this is that mtimes might not be 
correct after an unclean shutdown, but most software will have no issues 
with this.
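
For example (the fstab line below just mirrors the mount shown earlier in
this thread):

    mount -o remount,noatime,lazytime /
    # or persistently, via /etc/fstab:
    # /dev/mapper/centrevg-rootlv  /  btrfs  noatime,lazytime,subvol=/@  0  0
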


--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Recommended way to use btrfs for production?

2016-06-03 Thread Martin
> Before trying RAID5/6 in production, be sure to read posts like these:
>
> http://www.spinics.net/lists/linux-btrfs/msg55642.html

Very interesting post and very recent even.

If I decide to try raid6 and of course everything is replicated each
day (for a bit of a safety net), and disks begin to fail, how much
help will I likely get from this list to recover?
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Recommended way to use btrfs for production?

2016-06-03 Thread Hans van Kranenburg

Hi Martin,

On 06/03/2016 11:49 AM, Martin wrote:


We would like to use urBackup to make laptop backups, and they mention
btrfs as an option.

[...]

And a bonus question: How stable is raid6 and detecting and replacing
failed drives?


Before trying RAID5/6 in production, be sure to read posts like these:

http://www.spinics.net/lists/linux-btrfs/msg55642.html

o/

Hans van Kranenburg
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Recommended way to use btrfs for production?

2016-06-03 Thread Martin
> Do you plan to use Snapshots? How many of them?

Yes, minimum 7 for each day of the week.

Nice to have would be 4 extra for each week of the month and then 12
for each month of the year.
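
For what it's worth, a rotation like that can be driven by a simple daily
cron job; something like this (paths and naming are made up):

    btrfs subvolume snapshot -r /srv/backups /srv/backups/.snap/daily-$(date +%F)
    # drop the snapshot that just aged out of the 7-day window
    btrfs subvolume delete /srv/backups/.snap/daily-$(date -d '8 days ago' +%F)
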
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Recommended way to use btrfs for production?

2016-06-03 Thread Marc Haber
On Fri, Jun 03, 2016 at 11:49:09AM +0200, Martin wrote:
> We would like to use urBackup to make laptop backups, and they mention
> btrfs as an option.
> 
> https://www.urbackup.org/administration_manual.html#x1-8400010.6
> 
> So if we go with btrfs and we need 100TB usable space in raid6, and to
> have it replicated each night to another btrfs server for "backup" of
> the backup, how should we then install btrfs?

Do you plan to use Snapshots? How many of them?

Greetings
Marc

-- 
-
Marc Haber | "I don't trust Computers. They | Mailadresse im Header
Leimen, Germany|  lose things."Winona Ryder | Fon: *49 6224 1600402
Nordisch by Nature |  How to make an American Quilt | Fax: *49 6224 1600421
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Recommended way to use btrfs for production?

2016-06-03 Thread Martin
Hello,

We would like to use urBackup to make laptop backups, and they mention
btrfs as an option.

https://www.urbackup.org/administration_manual.html#x1-8400010.6

So if we go with btrfs and we need 100TB usable space in raid6, and to
have it replicated each night to another btrfs server for "backup" of
the backup, how should we then install btrfs?

E.g. Should we use the latest Fedora, CentOS, Ubuntu, Ubuntu LTS, or
should we compile the kernel ourselves?

And a bonus question: How stable is raid6 and detecting and replacing
failed drives?

-RC
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html