Re: [PATCH v3] btrfs: Introduce new mount option to disable tree log replay

2015-12-15 Thread Christoph Anton Mitterer
On Wed, 2015-12-16 at 09:36 +0800, Qu Wenruo wrote:
> One sad example is, we can't use 'norecovery' mount option to disable
> log replay in btrfs, as there is 'recovery' mount option already.
I think "norecovery" wouldn't really fit anyway... the name should
rather indicate that, from the filesystem's side, nothing changes the
underlying device's contents.
"norecovery" would just say that no recovery operations are tried;
any other changes (optimisations, etc.) could still go on.

David's "nowr" is already better, though it could be misinterpreted as
no write/read ("wr" being "rw" swapped), so perhaps "nowrites" would be
better... but that again may be considered just another name for "ro".

So perhaps one could do something that includes "dev", like "rodev",
"ro-dev", or "immutable-dev"... or "devs" instead of "dev" to cover
multi-device cases.
OTOH, the devices aren't really set "ro" (as in blockdev --setro).

Maybe "nodevwrites" or "no-dev-writes" or one of these with "device"
not abbreviated?


Many programs have a "--dry-run" option, but I kinda don't like
"drymount" or something like that.


Guess from the above, I'd personally like "nodevwrites" the most.


Oh and Qu's idea of coordinating that with the other filesystems is
surely good.


Cheers,
Chris.


smime.p7s
Description: S/MIME cryptographic signature


Re: attacking btrfs filesystems via UUID collisions?

2015-12-14 Thread Christoph Anton Mitterer
On Mon, 2015-12-14 at 14:26 -0700, Chris Murphy wrote:
> The automobile is invented and due to the ensuing chaos, common
> practice of doing whatever the F you wanted came to an end in favor
> of
> rules of the road and traffic lights. I'm sure some people went
> ballistic, but for the most part things were much better without the
> brokenness or prior common practice.
Okay, then take your road traffic example and apply it to filesystems.

In road traffic you have rules, e.g. pedestrians may cross the road
when their light shows green and that of the cars red.
That could be the rule, similar as to "don't have duplicate UUIDs with
btrfs".

Even though we have the rule that cars stop at red and pedestrians walk
at green, we still teach our kids: "look at both sides of the road,
only cross if there's no car (or tank or whatever ;) ) coming".
Applying that to filesystems would be: "hope that everyone plays by the
rules, but don't kill yourself if one doesn't and there are duplicate
IDs".

 
> So the fact we're going to have this problem with all file systems
> that incorporate the volume UUID into the metadata stream, tells me
> that the very rudimentary common practice of using dd needs to go
> away, in general practice.
Sure, for those that use multiple devices (LVM, MD, etc.), or for those
that actually just use the UUID to select the block device for each
write/read (and not use it only "once" to get the right major/minor
dev id, or whatever the kernel uses internally for path-based
addressing).


> http://www.ietf.org/rfc/rfc4122.txt
> "A UUID is 128 bits long, and can guarantee uniqueness across space
> and time."
But of course not in terms of the problems we're talking about here,
where UUIDs may be accidentally or maliciously duplicated.

> Also see security considerations in section 6.
Doesn't section 6 basically imply that you cannot 100% guarantee
they're unique? E.g. a bad random seed on multiple systems?

Also, IIRC, one of the UUID algos just used some combination of MAC,
time and PID... which especially in VMs may even lead to dupes.
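
To illustrate the point (a minimal Python sketch of my own, using the
stdlib uuid module; the MAC value is made up): a version-1 UUID embeds
the node's MAC address verbatim, so cloned VMs sharing a virtual MAC
lose most of the uniqueness guarantee:

```python
import uuid

# Version-1 UUIDs are built from the host's MAC ("node"), a 60-bit
# timestamp, and a 14-bit clock sequence.  Cloned VMs often share the
# virtual NIC's MAC, so the only remaining entropy is time + clock_seq.
fake_mac = 0x001122334455  # hypothetical MAC shared by two cloned VMs

u = uuid.uuid1(node=fake_mac, clock_seq=0)

# The MAC is recoverable verbatim from the UUID -- it is not hashed.
assert u.node == fake_mac
assert u.version == 1
print(u)
```

With the MAC and clock sequence pinned, two clones only need their
clocks to coincide to emit identical UUIDs.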



Cheers,
Chris.



Re: dear developers, can we have notdatacow + checksumming, plz?

2015-12-14 Thread Christoph Anton Mitterer
On Mon, 2015-12-14 at 17:42 +1100, Russell Coker wrote:
> My understanding of BTRFS is that the metadata referencing data
> blocks has the 
> checksums for those blocks, then the blocks which link to that
> metadata (EG 
> directory entries referencing file metadata) has checksums of those.
You mean basically, that all metadata is chained, right?

> For each 
> metadata block there is a new version that is eventually linked from
> a new 
> version of the tree root.
> 
> This means that the regular checksum mechanisms can't work with nocow
> data.  A 
> filesystem can have checksums just pointing to data blocks but you
> need to 
> cater for the case where a corrupt metadata block points to an old
> version of 
> a data block and matching checksum.  The way that BTRFS works with an
> entire 
> checksumed tree means that there's no possibility of pointing to an
> old 
> version of a data block.
Hmm I'm not sure whether I understand that (or better said, I'm
probably sure I don't :D).

AFAIU, the metadata is always CoWed, right? So when a nodatacow file is
written, I'd assume its mtime is updated, which already leads to CoWing
of metadata... just that now, the checksums would be written as well.

If the metadata block is corrupt, then shouldn't that be noticed via
the csums on that?

And you said "The way that BTRFS works with an entire checksumed tree
means that there's no possibility of pointing to an old version of a
data block."... how would that work for nodatacow'ed blocks? If there
is a crash, it cannot know whether the block is still the old one, the
new one, or any garbage in between?!


> The NetApp published research into hard drive errors indicates that
> they are 
> usually in small numbers and located in small areas of the disk.  So
> if BTRFS 
> had a nocow file with any storage method other than dup you would
> have metadata 
> and file data far enough apart that they are not likely to be hit by
> the same 
> corruption (and the same thing would apply with most Ext4 Inode
> tables and 
> data blocks).
Well, put aside any such research (whose results aren't guaranteed to
always hold)... that's just one of the reasons from my motivation why
I've said checksums for no-CoWed files would be great (I used the
multi-device example though, not DUP).


> I think that a file mode where there were checksums on data 
> blocks with no checksums on the metadata tree would be useful.  But
> it would 
> require a moderate amount of coding
Do you mean in general, or having this as a mode for nodatacow'ed
files?
Losing the metadata checksumming doesn't seem really much more
appealing than not having data checksumming :-(


> and there's lots of other things that the 
> developers are working on.
Sure, I just wanted to bring this to their attention... I already
imagined that they wouldn't drop their current work to do that, just
because of me whining for it ;-)


Thanks,
Chris.



Re: [PATCH v3] btrfs: Introduce new mount option to disable tree log replay

2015-12-14 Thread Christoph Anton Mitterer
On Mon, 2015-12-14 at 15:20 -0500, Austin S. Hemmelgarn wrote:
> On 2015-12-14 14:44, Christoph Anton Mitterer wrote:
> > On Mon, 2015-12-14 at 14:33 -0500, Austin S. Hemmelgarn wrote:
> > > The traditional reasoning was that read-only meant that users
> > > couldn't
> > > change anything
> > Where I'd however count the atime changes to.
> > The atimes wouldn't change magically, but only because the user
> > started some program, configured some daemon, etc. ... which
> > reads/writes/etc. the file.
> But reading the file is allowed, which is where this starts to get 
> ambiguous.
Why?

> Reading a file updates the atime (and in fact, this is the 
> way that most stuff that uses them cares about them), but even a ro 
> mount allows reading the file.
As I just wrote in the other post, at least for btrfs (haven't checked
ext*/xfs due to being... well... lazy O:-) ), the ro mount option or an
ro snapshot seems to mean: no atime updates, even if mounted with
strictatime (or maybe I just did something stupid when checking, so
better double check).


> The traditional meaning of ro on UNIX 
> was (AFAIUI) that directory structure couldn't change, new files 
> couldn't be created, existing files couldn't be deleted, flags on the
> inodes couldn't be changed, and file data couldn't be changed.  TBH,
> I'm 
> not even certain that atime updates on ro filesystems was even an 
> intentional thing in the first place, it really sounds to me like the
> type of thing that somebody forgot to put in a permissions check for,
> and then people thought it was a feature.
Well in the end it probably doesn't matter how it came to existence,...
rather what it should be and what it actually is.
As said, I personally, from the user PoV, would say soft-ro already
includes that no dates on files are modifiable (including atime), as
I'd consider these a property of the file.
However anyone else may of course see that differently and at the same
time be smarter than I am.



> Also, even with noatime, I'm pretty sure the VFS updates the atime
> every time the mtime changes
I've just checked, and no, it doesn't:
  File: ‘subvol/FILE’
  Size: 8           Blocks: 16         IO Block: 4096   regular file
Device: 30h/48d     Inode: 257         Links: 1
Access: (0644/-rw-r--r--)  Uid: (0/root)   Gid: (0/root)
Access: 2015-12-15 00:01:46.452007798 +0100
Modify: 2015-12-15 00:31:26.579511816 +0100
Change: 2015-12-15 00:31:26.579511816 +0100

(rw,noatime mounted... mtime is more recent than atime)


>  (because not doing so would be somewhat stupid, and 
> you're writing the inode anyway), which technically means that stuff 
> could work around this by opening the file, truncating it to the size
> it 
> already is, and then closing it.
Hmm, I don't have a strong opinion here... it sounds "stupid" from the
technical point of view, in that it *could* write the atime and that
wouldn't cost much.
OTOH, that would make it more ambiguous when atimes change and when
not... (they'd only change on writes, never on reads...)
So I think it's good as it is... and it matches the name, which is
noatime - and not noatime-unless-on-writes ;-)
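
That behaviour is reproducible without any special mount options (a
minimal Python sketch of my own, assuming only POSIX timestamp
semantics): plain writes bump mtime/ctime but never atime, so after a
write-only workload mtime ends up the more recent of the two, exactly
as in the stat output above:

```python
import os
import tempfile
import time

# Create a file, wait, then write to it: the write updates mtime (and
# ctime) but never atime -- atime only moves on *reads* -- so mtime
# ends up newer than atime even without noatime.
fd, path = tempfile.mkstemp()
os.close(fd)

time.sleep(1.1)  # let the clock advance past the fs timestamp granularity

with open(path, "w") as f:
    f.write("new contents")

st = os.stat(path)
assert st.st_mtime > st.st_atime  # the write bumped mtime, not atime
os.unlink(path)
```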



Cheers,
Chris.



Re: attacking btrfs filesystems via UUID collisions?

2015-12-14 Thread Christoph Anton Mitterer
On Mon, 2015-12-14 at 13:55 -0700, Chris Murphy wrote:
> I'm aware of this proof of concept. I'd put it, and this one, in the
> realm of a targeted attack, so it's not nearly as likely as other
> problems needing fixing. That doesn't mean don't understand it better
> so it can be fixed. It means understand before arriving at risk
> assessment let alone conclusions.
Assessing the actual risk of any such attack vector is IMHO quite
difficult... but past experience has shown countless times, over and
over again, that any system where people already saw potential issues
was sooner or later actively attacked.

Take all the things from online banking... TAN, iTAN... at some point
came the two-factor auth via mobileTAN, where some people already
warned that this would be rather easy to attack... banks and proponents
of the system said that this is rather unrealistic in practice.
I think in Germany alone some 8 million Euros were stolen by hacking
mTANs last year.


> I didn't. I did state there are edge cases, not normal use. My
> criticism of dd for copying a volume is for general purpose copying,
> not edge cases.
Sure... but I guess we never needed to argue about that.
If a howto were to be written on "how to best copy a btrfs filesystem"
and someone would say "me! take dd"... I'd surely be on your side,
saying "Naaahh... stupid... you copy empty blocks and the like".

But here we talk about something completely different... namely all
those cases where UUID collisions could happen, including those where a
bit-identical copy is, for whichever reason, the best solution.



> I already have, as have others.
So far you've only said it would be bad practice, as it wouldn't work
well with filesystems that do use UUIDs.
I agree with what Austin gave you as an answer to that.


> Does the user want cake or pie? The computer doesn't have that level
> of granular information when there are two apparently bitwise
> identical devices.
I'm quite sure the computer has some concept of device path, and UUID
isn't the only way to identify a device. If that were so, then any
cloned ext4 would suffer from corruptions as well, as the fs would
choose the device based on UUID.

btrfs does of course more, especially in the multi-device case...
where it needs to distinguish devices based on their content, not on
their path (which may be unstable).
But such a case can surely be detected, and as you said yourself below:

> So option a is to simply fail and let the user
> resolve the ambiguity.
... one could e.g. simply require the user to resolve the situation
manually.
And I guess that's exactly what I've written here several times in this
thread, for mounting situations, and for rebuild/fsck/repair/etc.
situations.


>  Option b is maybe to leverage btrfs check code
> and find out if there's more to the story, some indication that one
> of
> the apparently identical copies isn't really identical.
Can't believe that this would be possible... if they're bitwise
identical, they're bitwise identical; the only thing that distinguishes
them is how they're connected, e.g. USB port 1, SATA port 2, etc.
But as this is unstable (just swap two SATA disks) it cannot be used.


> That's not something btrfs can resolve alone.
Sure, I've never demanded that.
I always said "handle it gracefully" (i.e. no corruptions, no new
mounts, fsck's, etc.), require the user to manually sort out things.
Not automagically determine which of the devices are actually the right
ones and use them.


Cheers,
Chris.



Re: attacking btrfs filesystems via UUID collisions?

2015-12-14 Thread Christoph Anton Mitterer
On Mon, 2015-12-14 at 08:23 -0500, Austin S. Hemmelgarn wrote:
> The reason that this isn't quite as high of a concern is because
> performing this attack requires either root access, or direct
> physical 
> access to the hardware, and in either case, your system is already 
> compromised.
Not necessarily.
Apart from the ATM image (where most people wouldn't call it
compromised just because it's openly accessible on the street),
imagine you're running a VM hosting service, where you allow users to
upload images and have them deployed.
In the "cheap" case these will end up as regular files, where they
couldn't do any harm (even with colliding UUIDs)... but even there one
would have to expect that the hypervisor admin may losetup them for
whichever reason.
But if you offer more professional services, you may give your clients
e.g. direct access to some storage backend, which is then probably
also seen on the host by its kernel.
And there we already have the case that a client could remotely trigger
such a collision.

And remember, things only sound far-fetched until they actually happen
the first time ;)


> I still think that that isn't a sufficient excuse for not fixing the 
> issue, as there are a number of non-security related issues that can 
> result from this (there are some things that are common practice with
> LVM or mdraid that can't be done with BTRFS because of this).
Sure, I guess we agree on that,...


> > Apart from that, btrfs should be a general purpose fs, and not just
> > a
> > desktop or server fs.
> > So edge cases like forensics (where it's common that you create
> > bitwise
> > identical images) shouldn't be forgotten either.
> While I would normally agree, there are ways to work around this in
> the 
> forensics case that don't work for any other case (namely, if BTRFS
> is 
> built as a module, you can unmount everything, unload the module,
> reload 
> it, and only scan the devices you want).
see below (*)


> On that note, why exactly is it better to make the filesystem UUID
> such 
> an integral part of the filesystem?
Well, I think it's a proper way to e.g. handle the multi-device case.
You have n devices, you want to distinguish them... using a
pseudo-random UUID is surely better than giving them numbers.
Same for the fs UUID, e.g. when used for mounting devices whose paths
aren't stable.

As said before, using the UUID isn't the problem - not protecting
against collisions is.


> The other thing I'm reading out of 
> this all, is that by writing a total of 64 bytes to a specific
> location 
> in a single disk in a multi-device BTRFS filesystem, you can make the
> whole filesystem fall apart, which is absolutely absurd.
Well... I don't think that writing *into* the filesystem is covered by
common practice anymore.

In UNIX, a device (which holds the filesystem) is a file. Therefore one
can argue: if one copies/duplicates one file (i.e. the fs), neither of
the two's contents should get corrupted.
But if you actively write *into* the file by yourself... then you're
simply on your own; either you know what you do, or you may just
corrupt *that* specific file. Of course it should again not lead to any
of its clones becoming corrupted as well.



> And some recovery situations (think along the lines of no recovery
> disk, 
> and you only have busybox or something similar to work with).
(*) which is, however, also why you may not be able to unmount the
device anymore or unload btrfs.
Maybe you have reasons why you must/want to do the forensics on the
running system.


> > AFAIK, there's not even a solution right now, that copies a
> > complete
> > btrfs, with snapshots, etc. preserving all ref-links. At least
> > nothing
> > official that works in one command.
> Send-receive kind of works for that
I've added the "in one command" for that... O:-)
In case the btrfs has subvols/snapshots... the user would need to do
the recursion himself...


Cheers,
Chris.



Re: dear developers, can we have notdatacow + checksumming, plz?

2015-12-14 Thread Christoph Anton Mitterer
On Mon, 2015-12-14 at 09:16 -0500, Austin S. Hemmelgarn wrote:
> > When one starts to get a bit deeper into btrfs (from the admin/end-
> > user
> > side) one sooner or later stumbles across the recommendation/need
> > to
> > use nodatacow for certain types of data (DBs, VM images, etc.) and
> > the
> > reason, AFAIU, being the inherent fragmentation that comes along
> > with
> > the CoW, which is especially noticeable for those types of files
> > with
> > lots of random internal writes.
> It is worth pointing out that in the case of DB's at least, this is 
> because at least some of them do COW internally to provide the 
> transactional semantics that are required for many workloads.
Guess that also applies to some VM images then, IIRC qcow2 does CoW.



> > a) for performance reasons (when I consider our research software
> > which
> > often has IO as the limiting factor and where we want as much IO
> > being
> > used by actual programs as possible)...
> There are other things that can be done to improve this.  I would
> assume 
> of course that you're already doing some of them (stuff like using 
> dedicated storage controller cards instead of the stuff on the 
> motherboard), but some things often get overlooked, like actually
> taking 
> the time to fine-tune the I/O scheduler for the workload (Linux has 
> particularly brain-dead default settings for CFQ, and the deadline
> I/O 
> scheduler is only good in hard-real-time usage or on small hard
> drives 
> that actually use spinning disks).
Well sure, I think we've done most of this and have dedicated
controllers, at least of a quality that funding allows us ;-)
But regardless of how much one tunes, and how good the hardware is: if
you'd then always lose a fraction of your overall IO, and be it just
5%, to defragging these types of files, one may actually want to avoid
that at all, for which nodatacow seems *the* solution.


> The big argument for defragmenting a SSD is that it makes it such
> that 
> you require fewer I/O requests to the device to read a file
I've had read about that too, but since I haven't had much personal
experience or measurements in that respect, I didn't list it :)


> The problem is not entirely the lack of COW semantics, it's also the
> fact that it's impossible to implement an atomic write on a hard
> disk. 
Sure... but that's just the same for the nodatacow writes of data.
(And the same, AFAIU, for CoW itself, just that there we'd notice any
corruption in case of a crash, due to the CoWed nature of the fs, and
could go back to the last generation.)


> > but I wouldn't know that relational DBs really do cheksuming of the
> > data.
> All the ones I know of except GDBM and BerkDB do in fact provide the 
> option of checksumming.  It's pretty much mandatory if you want to be
> considered for usage in financial, military, or medical applications.
Hmm, I see... PostgreSQL seems to have it since 9.3... didn't know
that... only crc16, but at least something.


> > Long story short, it does happen every now and then, that a scrub
> > shows
> > file errors, for neither the RAID was broken, nor there were any
> > block
> > errors reported by the disks, or anything suspicious in SMART.
> > In other words, silent block corruption.
> Or a transient error in system RAM that ECC didn't catch, or a 
> undetected error in the physical link layer to the disks, or an error
> in 
> the disk cache or controller, or any number of other things.
Well sure... I was referring to these particular cases, where silent
block corruption was the most likely reason.
The data was reproducibly read identically, which probably rules out
bad RAM or a bad controller, etc.


>   BTRFS 
> could only protect against some cases, not all (for example, if you
> have 
> a big enough error in RAM that ECC doesn't catch it, you've got
> serious 
> issues that just about nothing short of a cold reboot can save you
> from).
Sure, I haven't claimed, that checksumming for no-CoWed data is a
solution for everything.


> > But, AFAIU, not doing CoW, while not having a journal (or does it
> > have
> > one for these cases???) almost certainly means that the data (not
> > necessarily the fs) will be inconsistent in case of a crash during
> > a
> > no-CoWed write anyway, right?
> > Wouldn't it be basically like ext2?
> Kind of, but not quite.  Even with nodatacow, metadata is still COW, 
> which is functionally as safe as a traditional journaling filesystem 
> like XFS or ext4.
Sure, I was referring to the data part only, should have made that more
clear.


> Absolute worst case scenario for both nodatacow on 
> BTRFS, and a traditional journaling filesystem, the contents of the
> file 
> are inconsistent.  However, almost all of the things that are 
> recommended use cases for nodatacow (primarily database files and VM 
> images) have some internal method of detecting and dealing with 
> corruption (because of the traditional filesystem semantics ensuring 
> metadata consistency, but not data 

Re: btrfs: poor performance on deleting many large files

2015-12-14 Thread Christoph Anton Mitterer
On Mon, 2015-12-14 at 09:24 -0500, Austin S. Hemmelgarn wrote:
> Unless things have changed very recently, even many modern systems
> update atime on read-only filesystems, unless the media itself is 
> read-only.
Seriously? Oh... *sigh*...
You mean as in Linux, ext*, xfs?

> If you have software that actually depends on atimes, then that
> software 
> is broken (and yes, I even feel this way about Mutt).
I don't disagree here :D

> The way atimes 
> are implemented on most systems breaks the semantics that almost 
> everyone expects from them, because they get updated for anything
> that 
> even looks sideways at the inode from across the room.  Most software
> that uses them expects them to answer the question 'When were the 
> contents of this file last read?', but they can get updated even for 
> stuff like calculating file sizes, listing directory contents, or 
> modifying the file's metadata.
Sure... my point here again was that I try to look every now and then
at the whole thing from the pure end-user side:
For them, the default is relatime, and they likely may not want to
change that, because they have no clue about what further effects this
may have (or not).
So as long as Linux doesn't change its defaults to noatime, leaving
things up to broken software (i.e. to get fixed), I think it would be
nice for the end-user to have e.g. snapshots be "safe" (from the
write amplification on read) out of the box.

My idea would basically be that having a noatime btrfs-property, which
is perhaps even set automatically, would be an elegant way of doing
that.
I just haven't had time to properly write that up and add it as a
"feature request" to the project's ideas wiki page.


Cheers,
Chris.



Re: [PATCH v3] btrfs: Introduce new mount option to disable tree log replay

2015-12-14 Thread Christoph Anton Mitterer
On Mon, 2015-12-14 at 18:32 +0100, David Sterba wrote:
> I've read the discussions around the change and from the user's POV
> I'd
> suggest to add another mount option that would be just an alias for
> any
> mount options that would implement the 'hard-ro' semantics.
Nice to hear... 


> Say it's called 'nowr'
though I'm deeply saddened that you don't like my proposed "hard-ro",
which I thought about for nearly 1s ;-)

>  mount -o ro,nowr /dev/sdx /mnt
Sounds reasonable... I mean that especially in the sense that, as long
as ro's documentation points to "nowr" and clearly states whether both
(ro+nowr) are required to get the desired behaviour, I have no very
strong opinion on whether both (ro+nowr) should be required, or whether
nowr should imply ro.
Though I think the latter may be better.

Thanks,
Chris.



Re: [PATCH v3] btrfs: Introduce new mount option to disable tree log replay

2015-12-14 Thread Christoph Anton Mitterer
On Mon, 2015-12-14 at 12:50 -0500, Austin S. Hemmelgarn wrote:
> It should also imply noatime.  I'm not sure how BTRFS handles atime
> when 
> mounted RO, but I know a lot of old UNIX systems updated atime even
> on 
> filesystems mounted RO, and I know that at least at one point Linux
> did too.
I stumbled over that recently myself, and haven't bothered to try it
out yet.
But Duncan's argument why at least ro snapshots (yes I know, this may
not be exactly the same as the ro mount option) would need to imply
noatime is pretty convincing. :)

Anyway, if "ro" wouldn't imply noatime, I would ask why, because the
atime is definitely something the fs normally exports to userland...
and that's how I'd basically consider hard-ro vs. (soft-)ro:

soft-ro: data as visible by the mounted fs must not change (unless
         perhaps for necessary repair/replay operations to get the 
         filesystem back in a consistent state)
hard-ro: soft-ro + nothing on the backing devices may change (bitwise)


Cheers,
Chris.



Re: [PATCH v3] btrfs: Introduce new mount option to disable tree log replay

2015-12-14 Thread Christoph Anton Mitterer
On Mon, 2015-12-14 at 14:33 -0500, Austin S. Hemmelgarn wrote:
> The traditional reasoning was that read-only meant that users
> couldn't 
> change anything
Where I'd however count the atime changes to.
The atimes wouldn't change magically, but only because the user started
some program, configured some daemon, etc. ... which reads/writes/etc.
the file.


> , not that the actual data on disk wouldn't change. 
> That, and there's been some really brain-dead software over the years
> that depended on atimes being right (now, the only remaining software
> I 
> know of that even uses them at all is Mutt).
Wasn't tmpwatch another candidate?


> This should be 'Nothing on the backing device may change as a result
> of 
> the FS', nitpicking I know, but we should be specific so that we 
> hopefully avoid ending up in the same situation again.
Of course, you're right! :-)

(Especially if btrfs should ever be formalised in a standards document,
this should read like:
> hard-ro: Nothing on the backing device may change as a result of the
> FS; however, e.g. malware may still directly destroy the data on the
> blockdevice ;-) )


Chris.



Re: btrfs: poor performance on deleting many large files

2015-12-14 Thread Christoph Anton Mitterer
On Mon, 2015-12-14 at 15:27 -0500, Austin S. Hemmelgarn wrote:
> On 2015-12-14 14:39, Christoph Anton Mitterer wrote:
> > On Mon, 2015-12-14 at 09:24 -0500, Austin S. Hemmelgarn wrote:
> > > Unless things have changed very recently, even many modern
> > > systems
> > > update atime on read-only filesystems, unless the media itself is
> > > read-only.
> > Seriously? Oh... *sigh*...
> > You mean as in Linux, ext*, xfs?
> Possibly, I know that Windows 7 does it, and I think OS X and OpenBSD
> do 
> it, but I'm not sure about Linux.
I've just checked it via loopback image and strictatime:

- ro snapshot doesn't get atime updated
- rw snapshot does get atime updated
- ro mounted fs (top level subvol) doesn't get atimes updated (neither
  in subvols)
- rw mounted fs (top level subvol) does get atimes updated
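
A check like the above can be scripted; here's a minimal Python sketch
(my own; the helper name is made up) that probes whether reading a file
bumps its atime on whatever mount currently holds it:

```python
import os
import time

def atime_updates_on_read(path: str) -> bool:
    """Read `path` and report whether its atime moved.

    The result depends on the mount options (ro, noatime, relatime,
    strictatime) of the filesystem holding `path`.
    """
    before = os.stat(path).st_atime_ns
    time.sleep(1.1)          # let the clock pass the fs timestamp granularity
    with open(path, "rb") as f:
        f.read()             # a plain read is what may trigger the atime update
    after = os.stat(path).st_atime_ns
    return after != before
```

On a ro snapshot it should report False; on an rw strictatime mount,
True; on relatime (the Linux default) it depends on whether the atime
is already newer than the mtime.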

Cheers,
Chris.



project idea: per-object default mount-options / more btrfs-properties / chattr attributes (was: btrfs: poor performance on deleting many large files)

2015-12-14 Thread Christoph Anton Mitterer
Just FYI:

On Mon, 2015-12-14 at 15:27 -0500, Austin S. Hemmelgarn wrote:
> > My idea would be basically, that having a noatime btrfs-property,
> > which
> > is perhaps even set automatically, would be an elegant way of doing
> > that.
> > I just haven't had time to properly write that up and add is as a
> > "feature request" to the projects idea wiki page.
> I like this idea.

I've just compiled some thoughts and ideas into:
https://btrfs.wiki.kernel.org/index.php/Project_ideas#Per-object_default_mount-options_.2F_btrfs-properties_.2F_chattr.281.29_attributes_and_reasonable_userland_defaults

As usual, this is mostly from my admin/end-user side, i.e. what I could
imagine would ease the maintenance of large/complex (in terms of
subvols, nesting, snapshots) btrfs filesystems...

And of course, any developer or more expert user than me is happily
invited to comment/remove any (possibly stupid) ideas of mine therein,
or summon the inquisition for my heresy ;)


Cheers,
Chris.



Re: btrfs: poor performance on deleting many large files

2015-12-14 Thread Christoph Anton Mitterer
On Mon, 2015-12-14 at 22:30 +0100, Lionel Bouton wrote:
> Mutt is often used as an example but tmpwatch uses atime by default
> too
> and it's quite useful.
Hmm, one could probably argue that these few cases justify the use of
separate filesystems (or btrfs subvols ;) ), so that the majority could
benefit from noatime.

> If you have a local cache of remote files for which you want a good
> hit
> ratio and don't care too much about its exact size (you should have
> Nagios/Zabbix/... alerting you when a filesystem reaches a %free
> limit
> if you value your system's availability anyway), using tmpwatch with
> cron to maintain it is only one single line away and does the job.
> For
> an example of this particular case, on Gentoo the
> /usr/portage/distfiles
> directory is used in one of the tasks you can uncomment to activate
> in
> the cron.daily file provided when installing tmpwatch.
> Using tmpwatch/cron is far more convenient than using a dedicated
> cache
> (which might get tricky if the remote isn't HTTP-based, like an
> rsync/ftp/nfs/... server or doesn't support HTTP IMS requests for
> example).
> Some http frameworks put sessions in /tmp: in this case if you want
> sessions to expire based on usage and not creation time, using
> tmpwatch
> or similar with atime is the only way to clean these files. This can
> even become a performance requirement: I've seen some servers slowing
> down with tens/hundreds of thousands of session files in /tmp because
> it
> was only cleaned at boot and the systems were almost never
> rebooted...
Okay, there are probably some use cases... the session cleaning I'd
however rather consider a bug in the respective software, especially if
it really depends on it to expire the session (what if, for some reason,
tmpwatch gets broken, uninstalled, etc.?)


> I use noatime and nodiratime
FYI: noatime implies nodiratime :-)


> Finally Linus Torvalds has been quite vocal and consistent on the
> general subject of the kernel not breaking user-space APIs no matter
> what so I wouldn't have much hope for default kernel mount options
> changes...
He surely is right in general,... but when the point has been reached,
where only a minority actually requires the feature... and the minority
actually starts to suffer from that... it may change.


Cheers,
Chris.



Re: [auto-]defrag, nodatacow - general suggestions?(was: btrfs: poor performance on deleting many large files?)

2015-12-13 Thread Christoph Anton Mitterer
On Wed, 2015-12-09 at 13:36 +, Duncan wrote:
> Answering the BTW first, not to my knowledge, and I'd be
> skeptical.  In 
> general, btrfs is cowed, and that's the focus.  To the extent that
> nocow 
> is necessary for fragmentation/performance reasons, etc, the idea is
> to 
> try to make cow work better in those cases, for example by working on
> autodefrag to make it better at handling large files without the
> scaling 
> issues it currently has above half a gig or so, and thus to confine
> nocow 
> to a smaller and smaller niche use-case, rather than focusing on
> making 
> nocow better.
> Of course it remains to be seen how much better they can do with 
> autodefrag, etc, but at this point, there's way more project 
> possibilities than people to develop them, so even if they do find
> they 
> can't make cow work much better for these cases, actually working on
> nocow 
> would still be rather far down the list, because there's so many
> other 
> improvement and feature opportunities that will get the focus
> first.  
> Which in practice probably puts it in "it'd be nice, but it's low
> enough 
> priority that we're talking five years out or more, unless of course 
> someone else qualified steps up and that's their personal itch they
> want 
> to scratch", territory.
I guess I'll split out my answer on that, in a fresh thread about
checksums for nodatacow later, hoping to attract some more devs there
:-)

I think however, again with my naive understanding of how CoW works and
what it inherently implies, that there cannot be a really good solution
to the fragmentation problem for DB/etc. files.

And as such, I'd think that having a checksumming feature for
nodatacow as well, even if it's not perfect, is definitely worth it.


> As for the updated checksum after modification, the problem with that
> is 
> that in the mean time, the checksum wouldn't verify,
Well, one could implement some locking... but I don't see the
general problem here: if the block is still being written (and I
count updating the metadata, including the checksum, as part of that),
it cannot be read anyway, can it? It may be only half written and the
data returned would be garbage.


>  and while btrfs 
> could of course keep status in memory during normal operations,
> that's 
> not the problem, the problem is what happens if there's a crash and
> in-
> memory state vaporizes.  In that case, when btrfs remounted, it'd
> have no 
> way of knowing why the checksum didn't match, just that it didn't,
> and 
> would then refuse access to that block in the file, because for all
> it 
> knows, it /is/ a block error.
And this would only happen in the rare case that anything crashes,
where it's anyway quite likely that this non-CoWed block will be
garbage.
I'll talk about that more in the separate thread... so let's move
things there.


> Same here.  In fact, my most anticipated feature is N-way-mirroring, 
Hmm... not totally sure about that...
AFAIU, N-way-mirroring is the generalisation of what btrfs currently
(wrongly) calls RAID1, i.e. having N replicas of everything on M
devices, right?
In other words, not being an N-parity RAID and not guaranteeing that
*any* N disks could fail, right?

Hmm I guess that would be definitely nice to have, especially since
then we could have true RAID1, i.e. N=M.

But it's probably rather important for those scenarios, where either
resilience matters a lot... and/or   those where write speed doesn't
but read speed does, right?

Taking the example of our use case at the university, i.e. the LHC
Tier-2 we run,... that would rather be uninteresting.
We typically have storage nodes (and many of them) of say 16-24
devices, and based on funding constraints, resilience concerns and IO
performance, we place them in RAID6 (yeah, I know, RAID5 is faster, but
even with hot spares in place, practice too often led to lost RAIDs).

Especially for the bigger nodes, with more disks, we'd rather have an
N-parity RAID, where any N disks can fail... of course performance
considerations may kill that desire again ;)


> It is a big and basic feature, but turning it off isn't the end of
> the 
> world, because then it's still the same level of reliability other 
> solutions such as raid generally provide.
Sure... I never meant it as "loss to what we already have in other
systems"... but as "loss compared to how awesome[0] btrfs could be ;-)"


> But as it happens, both VM image management and databases tend to
> come 
> with their own integrity management, in part precisely because the 
> filesystem could never provide that sort of service.
Well that's only partially true, to my knowledge.
a) I wouldn't know that hypervisors do that at all.
b) DBs have of course their journal, but that protects only against
crashes,... not against bad blocks, nor does it help you decide which
block is good when you have multiple copies.


> After all, you can always decide not to run it if you're worried
> about the space effects it's going to have

Re: attacking btrfs filesystems via UUID collisions?

2015-12-13 Thread Christoph Anton Mitterer
On Fri, 2015-12-11 at 16:06 -0700, Chris Murphy wrote:
> For anything but a new and empty Btrfs volume
What's the influence of the fs being new/empty?

> this hypothetical
> attack would be a ton easier to do on LVM and mdadm raid because they
> have a tiny amount of metadata to spoof compared to a Btrfs volume
> with even a little bit of data on it.
Uhm I haven't said that other systems properly handle this kind of
attack. ;-)
Guess that would need to be evaluated...


>  I think this concern is overblown.
I don't think so. Let me give you an example: there is an attack[0]
against crypto, where the attacker listens via a smartphone's
microphone and recovers key material from the acoustic emanations of a
computer on which GnuPG runs.
This is surely not an attack many people would have considered even
remotely possible, but in fact it works, at least under lab conditions.

I guess the same applies for possible attack vectors like this here.
The stronger actual crypto gets, and the more robust software becomes
in terms of classical security holes (buffer overruns and the like),
the more attackers will try to go alternative ways.


> I'm suggesting bitwise identical copies being created is not what is
> wanted most of the time, except in edge cases.
mhh... well, there's the VM case, e.g. duplicating a template VM,
booting it and deploying software. Guess that's already common enough.
There are people who want to use btrfs on top of LVM and using the
snapshot functionality of that... another use case.
Some people may want to use it on top of MD (for whatever reason)... at
least in the mirroring RAID case, the kernel would see the same btrfs
twice.

Apart from that, btrfs should be a general purpose fs, and not just a
desktop or server fs.
So edge cases like forensics (where it's common that you create bitwise
identical images) shouldn't be forgotten either.


> > >If your workflow requires making an exact copy (for the shelf or
> > > for
> > > an emergency) then dd might be OK. But most often it's used
> > > because
> > > it's been easy, not because it's a good practice.
> > Ufff... I wouldn't go that far as to call something here bad or good
> > practice.
> 
> It's not just bad practice, it's sufficiently sloppy that it's very
> nearly user sabotage. That this is due to innocent ignorance, and a
> long standing practice that's bad advice being handed down from
> previous generations doesn't absolve the practice and mean we should
> invent esoteric work arounds for what is not a good practice. We have
> all sorts of exhibits why it's not a good idea.
Well, if you don't give any real arguments or technical reasons (apart
from "working around software that doesn't handle this well"), I
consider this just a repetition of the baseless claim that the
long-standing practice is bad.


> I disagree. It was due to the rudimentary nature of earlier
> filesystems' metadata paradigm that it worked. That's no longer the
> case.
Well in the end it's of course up to the developers to decide whether
this is acceptable or not, but being on the admin/end-user side, I can
at least say that not everyone on there would accept "this is no longer
the case" as valid explanation when their fs was corrupted or attacked.


> Sure, the kernel code should get smarter about refusing to mount in
> ambiguous cases, so that a file system isn't nerfed. That shouldn't
> happen. But we also need to get away from this idea that dd is
> actually an appropriate tool for making a file system copy.
Uhm... your view is a bit narrow here... again, take the forensics
example.

But apart from that,... I never said that dd should be the regular tool
for people to copy a btrfs image. Typically it would be simply slower
than other means.

But for some scenarios it may still be the better choice, or at least
the only choice implemented right now (e.g. I wouldn't know of a
hypervisor system that looks at an existing disk image, finds any
btrfs in it (possibly "hidden" below further block layers), and
cleanly copies the data into a freshly created btrfs image with the
same structure).
AFAIK, there's not even a solution right now, that copies a complete
btrfs, with snapshots, etc. preserving all ref-links. At least nothing
official that works in one command.

Long story short, I think we can agree that (dd or not) corruptions
or attack vectors shouldn't be possible.
And be it just to protect against the btrfs-on-hardware-RAID1 case
which is accidentally switched to JBOD mode...


Cheers,
Chris.


[0] http://www.tau.ac.il/~tromer/papers/acoustic-20131218.pdf




Re: attacking btrfs filesystems via UUID collisions?

2015-12-13 Thread Christoph Anton Mitterer
On Sat, 2015-12-12 at 02:34 +0100, S.J. wrote:
> A bit more about the dd-is-bad-topic:
> 
> IMHO it doesn't matter at all.
Yes, fully agree.


> a) For this specific problem here, fixing a security problem
> automatically
> fixes the risk of data corruption because careless cloning+mounting
> (without UUID adjustments) too.
> So, if the user likes to use dd with its disadvantages, like waiting 
> hours to
> copy lots of free space, and bad practice, etc.etc., why should it
> concern
> the Btrfs developers and/or us here?
> 
> b) At wider scope; while Btrfs is more complex than Xfs etc.,
> currently
> there is no other reason why things could go bad when dd'ing
> something.
> As long as this holds, is there really a place in the official Btrfs 
> documentation
> for telling the users "dd is bad [practice]"?
> ...
fully agree as well. :-)


Cheers,
Chris.



Re: [auto-]defrag, nodatacow - general suggestions?(was: btrfs: poor performance on deleting many large files?)

2015-12-13 Thread Christoph Anton Mitterer
Two more on these:

On Thu, 2015-11-26 at 00:33 +, Hugo Mills wrote:
> 3) When I would actually disable datacow for e.g. a subvolume that
> > holds VMs or DBs... what are all the implications?
> > Obviously no checksumming, but what happens if I snapshot such a
> > subvolume or if I send/receive it?
>    After snapshotting, modifications are CoWed precisely once, and
> then it reverts to nodatacow again. This means that making a snapshot
> of a nodatacow object will cause it to fragment as writes are made to
> it.
AFAIU, the one that gets fragmented then is the snapshot, right, and
the "original" will stay in place where it was? (Which is of course
good, because one probably marked it nodatacow to avoid that
fragmentation problem on internal writes.)

I'd assume the same happens when I do a reflink cp.

Can one make a copy where one still has atomicity (which I guess
implies CoW), but where the destination file isn't heavily fragmented
afterwards? I.e. there's some pre-allocation, and then cp really
does copy each block (just with everything at the state of the time
when I started cp, not including any other internal changes made on
the source in between).


And one more:
You both said, auto-defrag is generally recommended.
Does that also apply for SSDs (where we want to avoid unnecessary
writes)?
It does seem to get enabled, when SSD mode is detected.
What would it actually do on an SSD?


Cheers,
Chris.



dear developers, can we have notdatacow + checksumming, plz?

2015-12-13 Thread Christoph Anton Mitterer
(consider that question being asked with that face on: http://goo.gl/LQaOuA)

Hey.

I've had some discussions on the list these days about not having
checksumming with nodatacow (mostly with Hugo and Duncan).

They both basically told me it wouldn't be straight possible with CoW,
and Duncan thinks it may not be so much necessary, but none of them
could give me really hard arguments, why it cannot work (or perhaps I
was just too stupid to understand them ^^)... while at the same time I
think that it would be generally utmost important to have checksumming
(real world examples below).

Also, I remember that in 2014, Ted Ts'o told me that there are some
plans ongoing to get data checksumming into ext4, with possibly even
some guy at RH actually doing it sooner or later.

Since these threads were rather admin-work-centric, developers may have
skipped them; therefore, I decided to write down some thoughts,
label them with a more attractive subject and give them some bigger
attention.
O:-)




1) Motivation: why it makes sense to have checksumming (especially also
in the nodatacow case)


I think that, of all the major btrfs features I know of (apart from CoW
itself and things like reflinks), checksumming is perhaps the
one that distinguishes it the most from traditional filesystems.

Sure we have snapshots, multi-device support and compression - but we
could have had that as well with LVM and software/hardware RAID... (and
ntfs supported compression IIRC ;) ).
Of course, btrfs does all that in a much smarter way, I know, but it's
nothing generally new.
The *data* checksumming at filesystem level, to my knowledge, is
however. Especially that it's always verified. Awesome. :-)


When one starts to get a bit deeper into btrfs (from the admin/end-user 
side) one sooner or later stumbles across the recommendation/need to
use nodatacow for certain types of data (DBs, VM images, etc.) and the
reason, AFAIU, being the inherent fragmentation that comes along with
the CoW, which is especially noticeable for those types of files with
lots of random internal writes.

Now duncan implied, that this could improve in the future, with the
auto-defragmentation getting (even) better, defrag becoming usable
again for those that do snapshots or reflinked copies and btrfs itself
generally maturing more and more.
But I kinda wonder to what extent one will really be able to solve
what seems to me a CoW-inherent "problem"...
Even *if* one can make the auto-defrag much smarter, it would still
mean that such files, like big DBs, VMs, or scientific datasets that
are internally rewritten, may get more or less constantly defragmented.
That may be quite undesired...
a) for performance reasons (considering our research software, which
often has IO as the limiting factor and where we want as much IO as
possible being used by the actual programs)...
b) SSDs...
Not really sure about that; btrfs seems to enable autodefrag even
when an SSD is detected... what is it doing? Placing the blocks in a
smart way on different chips so that accesses can be better
parallelised by the controller?
Anyway, (a) alone could already be argument enough not to solve the
problem by a smart [auto-]defrag, should that actually be implemented.

So I think having nodatacow is great, and not just a workaround until
everything else gets better at handling these cases.
Thus checksumming, which is such a vital feature, should also be
possible for it.


Duncan also mentioned that in some of those cases, the integrity is
already protected by the application layer, making it less important to
have it at the fs layer.
Well, this may be true for file-sharing protocols, but I wouldn't know
that relational DBs really do checksumming of the data.
They have journals, of course, but these protect against crashes, not
against silent block errors and the like.
And I wouldn't know that VM hypervisors do checksumming (but perhaps
I've just missed that).

Here I can give a real-world example, from the Tier-2 that I run for
LHC at work/university.
We have large amounts of storage (perhaps not as large as what Google
and Facebook have, or what the NSA stores about us)... but it's still
some ~ 2PiB, or a bit more.
That's managed with some special storage management software called
dCache. dCache even stores checksums, but per file, which means that
for normal reads these cannot be verified (well, technically it's
supported, but with our usual file sizes this is not workable), so what
remains are scrubs.
For The two PiB, we have some... roughly 50-60 nodes, each with
something between 12 and 24 disks, usually in either one or two RAID6
volumes, all different kinds of hard disks.
And we run these scrubs quite rarely, since they cost IO that could
be used for actual computing jobs (a problem that wouldn't exist with
btrfs, which calculates the sums on read, when the data is read
anyway)... so likely there are even more errors that are just never
noticed, because the datasets are removed 

Re: subvols and parents - how?

2015-12-12 Thread Christoph Anton Mitterer
On Wed, 2015-12-09 at 10:53 +, Duncan wrote:
> If you use the recipe (subvol create, cp with reflink) it suggests
> there, 
> you'll end up with the reflinked copy in a subvol.
> 
> You can then mount that subvol over top of the existing dir, and
> *new* 
> file opens will access the new subvol, tho existing open files will
> of 
> course continue to access the files/reflinks to which they have a 
> reference, "underneath" the new mount.
Sure, which would mean however that a downtime is still necessary.



> For some services it's possible to signal them to reload their
> files.  
> Where this is possible, you can do the overmount trick and then
> signal 
> them to reload, and they should keep running, otherwise undisturbed
> (altho 
> any changes between the reflink and processing the signal will still
> go 
> to the existing open files, I don't believe there's a way around
> that).
Yep... but as you say... it doesn't really help to avoid the
downtime... it could rather lead to data corruption (not on the
filesystem level, of course, but within the application).


> But AFAIK that's the closest it gets, and nothing more along that
> line is 
> planned.
I've been so free to add that idea to the project ideas:
https://btrfs.wiki.kernel.org/index.php/Project_ideas#.28Atomically.29_convert_directories_into_subvolumes_and_vice_versa

any developer or more experienced user than me is of course free to
remove that again, if it seems not so important or isn't possible to be
implemented.


> In general, keep in mind that subvolumes work in most respects very
> much 
> like normal directories do, except where they don't. =:^)
:-P :-P 


Cheers,
Chris.



Re: subvols, ro- and bind mounts - how?

2015-12-12 Thread Christoph Anton Mitterer
On Thu, 2015-12-10 at 19:32 -0700, Chris Murphy wrote:
> That seems due for a revision because I do rw, ro, rw, rw, ro mounts
> in sequence and they stick fine. In fact they stick with the same
> subvolume.
> 
> [root@f23m ]# mount /dev/sda7 /mnt/1 -o subvol=home
> [root@f23m ]# mount /dev/sda7 /mnt/2 -o subvol=home,ro
> [root@f23m ]# mount /dev/sda7 /mnt/3 -o subvol=home
> [root@f23m ]# mount
> [...snip...]
> /dev/sda7 on /mnt/1 type btrfs
> (rw,relatime,seclabel,ssd,space_cache,subvolid=258,subvol=/home)
> /dev/sda7 on /mnt/2 type btrfs
> (ro,relatime,seclabel,ssd,space_cache,subvolid=258,subvol=/home)
> /dev/sda7 on /mnt/3 type btrfs
> (rw,relatime,seclabel,ssd,space_cache,subvolid=258,subvol=/home)


Not sure what you mean with "stick" here... I'd say the above simply
has the following semantics:
- the default for mounts is rw
- thus /mnt/1 and /mnt/3 are rw, and /mnt/2 is ro because the ro
  option was given for it, not because of what the earlier mounts used

In other words, if you change that to the following:
# mount /dev/sda7 /mnt/1 -o subvol=home,ro
# mount /dev/sda7 /mnt/2 -o subvol=home,ro
# mount /dev/sda7 /mnt/3 -o subvol=home
I'd expect that you get
ro
ro
rw

At least based on how I understood the whole system now.
That was actually my question here:
Q: In other words does mounting the same subvol *again* behave like --
bind mounts, i.e. the further mounts would get the options from the
first mount?

And I guess the A(nswer) is: no, mount options affect only the
respective mountpoint (including any nested subvols below it), except,
of course, when it's a --bind mount.


If one of the devs could confirm that semantics, I may find some time
to update the wiki accordingly.


Cheers,
Chris.




Re: Will "btrfs check --repair" fix the mounting problem?

2015-12-12 Thread Christoph Anton Mitterer
On Sat, 2015-12-12 at 13:16 -0700, Chris Murphy wrote:
> > What is the better way to get data? send/receive works only with RO
> > snapshots. Is there another way to preserve subvolumes and CoW
> > structure (a lot of files was copied between subvols using "cp
> > --reflink=always")? Or just rsync'ing files is all what I can do?
> 
> cp -a or rsync -a is all I can think of. To start to get it back to
> normal, you can use duperemove. While that doesn't create subvolumes,
> it'll at least find duplicate extents and use reflinks for those. So
> it's in effect the same thing you have now, just lacking the
> subvolume
> structure.

If he can still write to the fs, he could just create a ro-snapshot of
the rw ones?
Of course if the fs is already damaged, that may cause even more
damage... so this should only be made on a clone of the fs.
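
E.g. (paths made up) making a throw-away ro snapshot of the rw
subvolume, which send/receive will then accept:

# btrfs subvolume snapshot -r /mnt/fs/data /mnt/fs/data.ro
# btrfs send /mnt/fs/data.ro | btrfs receive /mnt/backup/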




Re: btrfs: poor performance on deleting many large files

2015-12-12 Thread Christoph Anton Mitterer
On Sat, 2015-11-28 at 06:49 +, Duncan wrote:
> Christoph Anton Mitterer posted on Sat, 28 Nov 2015 04:57:05 +0100 as
> excerpted:
> > Still, specifically for snapshots that's a bit unhandy, as one
> > typically
> > doesn't mount each of them... one rather mount e.g. the top level
> > subvol
> > and has a subdir snapshots there...
> > So perhaps the idea of having snapshots that are per se noatime is
> > still
> > not too bad.
> Read-only snapshots?
So you basically mean that ro snapshots won't have their atime updated
even without noatime?
Well, I guess that was anyway the behaviour of recent Linux
filesystems, and only very old UNIX systems updated the atime even
when the fs was mounted ro.

> That'd do it, and of course you can toggle the read-
> only property (see btrfs property and its btrfs-property manpage).
Sure, but then it would still be nice for rw snapshots.

I guess what I probably actually want is the ability to set noatime as
a property.
I'll add that in a "feature request" on the project ideas wiki.
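
For reference, toggling the ro property Duncan refers to would look
roughly like this (snapshot path made up):

# btrfs property get -ts /mnt/fs/snap ro
ro=true
# btrfs property set -ts /mnt/fs/snap ro false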

> Alternatively, mount the toplevel subvol read-only or noatime on one 
> mountpoint, and bind-mount it read-write or whatever other
> appropriate 
Well it's of course somehow possible... but that seems a bit ugly to
me... the best IMHO, would really be if one could set a property on
snapshots that marks them noatime.


Cheers,
Chris.



Re: attacking btrfs filesystems via UUID collisions?

2015-12-11 Thread Christoph Anton Mitterer
Sorry, I'm just about to change my mail system, and used a bogus test
From: address in the previous mail (please replace fo@fo with
cales...@scientia.net).

Apologies for any inconveniences and this noise here.

Cheers,
Chris.



Re: attacking btrfs filesystems via UUID collisions?

2015-12-11 Thread Christoph Anton Mitterer
On Thu, 2015-12-10 at 12:42 -0700, Chris Murphy wrote:
> That isn't what I'm suggesting. In the multiple device volume case
> where there are two exact (same UUID, same devid, same generation)
> instances of one of the block devices, Btrfs could randomly choose
> either one if it's an RO mount.
No, for the same reasons as stated in my mail a few minutes ago.
An attacker could probably find out the UUID/devid/generation... it
would probably be possible for him to craft a device with exactly
those and try to use it.
If btrfs would then select any of these, it may also select the wrong
one; ro or rw, this may likely lead to problems.




> > About 1 and 2 ... if 3 gets fulfilled, why?
> > DD itself is not a problem "if" the UUID is changed after it
> > (which is a command as simple as dd), and if someone doesn't
> > know that, he/she will notice when mount refuses to work
> > because UUID duplicate.
> 
> dd is not a copy operation. It's creating a 2nd original. You don't
> end up with an original and a copy (or clone). A copy or clone has
> some distinguishing difference. Volume UUID is used throughout Btrfs
> metadata, it's not just in the superblocks. Changing volume UUID
> requires a rewrite of all metadata. This is inefficient for two
> reasons: one dd copies unused sectors; two it copies metadata that
> will have to be completely rewritten by btrfstune to change volume
> UUID; and also the subvolume UUIDs aren't changed, so it's an
> incomplete solution that has problems (see other threads).
Well, dd is surely not the only thing that can be used to create a
clone (i.e. a bitwise identical copy; I guess we don't really care
which is the "original" and which are the "clones", or whether these
are "2nd originals").
We always just use it here as an example for scenarios in which bitwise
identical copies are created.

And even if internally it's a big thing, from the user's PoV, changing
the UUID is pretty simple (I guess that's what S.J. meant).
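
E.g., with a reasonably recent btrfs-progs and the fs unmounted,
something like (device path made up):

# btrfstune -u /dev/sdb1

though, as Chris notes above, this has to rewrite all the metadata
(and leaves the subvolume UUIDs untouched), so "simple" from the
user's PoV doesn't mean "cheap" internally.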


> If your workflow requires making an exact copy (for the shelf or for
> an emergency) then dd might be OK. But most often it's used because
> it's been easy, not because it's a good practice.
Ufff... I wouldn't go that far as to call something here bad or good
practice.
At least, I do not see any reason to call it a bad practice, except
that systems have become much more complex over time and haven't dealt
properly with the problems that can occur by using dd.
Again, I don't demand magical "solutions" (i.e. the btrfs or LVM people
getting code into all dd like tools, so that these auto-detect when the
duplicate such data and auto-change the UUIDs)... they just should
handle the situations gracefully.


>  Note that Btrfs is
> not unique, XFS v5 does a very similar thing with volume UUID as
> well,
> and resulted in this change:
> http://oss.sgi.com/pipermail/xfs/2015-April/041267.html
Do you mean that xfs may suffer from the same issues that we're talking
about here? If so, one should probably give them a notice.



> Using dd also means the volume is offline.
Not really, you could do it on a snapshotted LV, while the "original"
is still running.
Or in emergency cases one could do it on a ro-remounted... probably not
guaranteed to work, but may do so in practise.


Cheers,
Chris.


Re: attacking btrfs filesystems via UUID collisions?

2015-12-11 Thread Christoph Anton Mitterer
On Wed, 2015-12-09 at 22:48 +0100, S.J. wrote:
> > 3. Some way to fail gracefully, when there's ambiguity that cannot
> > be
> > resolved. Once there are duplicate devs (dd or lvm snapshots, etc)
> > then there's simply no way to resolve the ambiguity automatically,
> > and
> > the volume should just refuse to rw mount until the user resolves
> > the
> > ambiguity. I think it's OK to fallback to ro mount (maybe) by
> > default
> > in such a case rather than totally fail to mount.
> About 3:
> RO fallback for the second device/partitions is not good.
> It won't stop confusing the two partitions, and even if both are RO,
> thinking it's ok to read and then reading the wrong data is bad.
Adding my two cents about that, just to emphasise it, even though S.J.
already covered it:

Even ro mounts, if anything is ambiguous, are evil:
Even if the filesystem itself wouldn't be destroyed by that, it could
mean that bogus data (or even evil data from an attacker) shows up in
the system, is then used, and causes damage by being used.

In the "accidental" scenario, data from the wrong device could e.g.
contain outdated binaries, that still have security holes, or they
could contain lists of datasets to be deleted by some software, but
since being outdated or simply garbage, the wrong data could be
deleted.

In the "attacker" scenario,... well again as above, old binaries could
get used, or garbage data injected into the system (even if ro) could
make it compromised or be used for DoS.




In general, the longer I think about it, the more I come to the
conclusion that any form of auto activation (mounting, assembling,
rebuilding, etc.) is kind of dangerous... (see below)

And this applies in general, not just when using UUIDs,... but since in
btrfs UUIDs are the main criterion for selecting/auto-assembling these
devices, it's what applies for us here.

We have several stages, where wrong devices could be picked up and lead
to damage (either accidentally or as part of a tricky attack):
1) When the system boots, i.e. replacing parts of the system (e.g. 
   root fs) itself.
   There's little we can do here in general (regardless of UUID,
   labels or device=/dev/sda,/dev/sdb). If an attacker can exchange
   one of the devices, he may do evil things.
   That's bad of course, but I think "fixing" it, is beyond the scope
   of btrfs.
   - If e.g. the ATM has an unsecured BIOS/UEFI/bootloader and allows 
     the attacker easily to access these and select which device to 
     boot from,... well than I feel no sorry for the owner (their 
     fault).
   - If they configure their grub/initrd/etc. to boot LABEL/UUID... 
     well that's certainly handy, but it's also stupid if these boots 
     happen unattended, and there is a way around it (specify the
     device paths, e.g. /dev/sda)... if the HDDs are properly
     secured by steel, and an attacker cannot use the possibly more
     easily accessible USB bus.
   - Another way to partially help here is: use disk dm-crypt and 
     boot/assemble your system based on the dm-crypt devices.
     E.g. boot from the multi-device-btrfs 
     device=/dev/mapper/crypt1,/dev/mapper/crypt2 and so on.
     As long as the kernel and initrd (which do all that) are secure 
     (which is assumed here), then even if the attacker manages to 
     replace one of the devices, it wouldn't help him, as he 
     couldn't present a device for which a dm-crypt mapping can be set 
     up (unless he has the keys, but then the game's over anyway).

=> Long story short: if the system boots unattended, then people
   should not use UUID/LABEL to select the device; if they do, it's 
   their fault, not btrfs' scope.
   If boots are attended, there's no problem anyway.
=> IMHO, this conceptually "fixes" (in the sense that there's nothing
   to do specifically from the btrfs side) the possible problems of
   such a system being booted, with an attacker having replaced or 
   added some devices to it (especially when unattended).
   And also the situation, that such system was left back, in an
   incomplete multi-device state (i.e. left back unattended with a
   degraded RAID)


In other words, I think any problems, resulting of auto-
assembly/activation/mounting, based on UUIDs/device-scanning/etc. that
affect the valid system becoming running (i.e. booting) are beyond our
scope here.
Yes there are problems, but one can at least try to avoid them, by
using dm-crypt or device paths instead of LABELs/UUIDs, and properly
securing (i.e. steel and so on) the system disks, mainboard, BIOS, etc.
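For reference, the explicit-device mount from above could be spelled out as follows (the mapper names are assumptions; note the kernel expects one device= option per member device, not a comma-separated list after a single device=):

```shell
# Mount a two-device btrfs on top of dm-crypt without relying on
# UUID/LABEL auto-scan (hypothetical mapper names):
mount -t btrfs \
      -o device=/dev/mapper/crypt1,device=/dev/mapper/crypt2 \
      /dev/mapper/crypt1 /mnt
```

The same options work from fstab; the point is that only devices the admin explicitly named can become part of the mount.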


So the remaining issues are those we discussed already before:
The system runs already.
1) Further devices show up with colliding UUIDs / device IDs.
   a) Either none of them are used (mounted, fsck, etc.) already.
   b) Or     some of them are used (mounted, fsck, etc.) already.
2) Further devices show up that have no UUID / device ID collisions,
   but that may fit to an already used multi-device btrfs.
   E.g. in 

Re: subvols and parents - how?

2015-12-11 Thread Christoph Anton Mitterer
On Sat, 2015-12-12 at 03:32 +0100, Christoph Anton Mitterer wrote:
> What's still missing now, IMHO, is:
> - a guide when one should make subvols (e.g. keeping things on the
> root
> fs together, unless it's separate like /var/www is usually, but
> /var/lib typically "corresponds" to a state of /etc and /usr.
I just added that:
https://btrfs.wiki.kernel.org/index.php/SysadminGuide#When_To_Make_Subvolumes

Chris.

smime.p7s
Description: S/MIME cryptographic signature


Re: subvols, ro- and bind mounts - how?

2015-12-10 Thread Christoph Anton Mitterer
Hey.

I'd have an additional question about subvols O:-)

Given the following setup:
5
|
+--root (subvol, /)
   +-- mnt (dir)

with the following done:
- init 1
- remount,ro / (i.e. the subvol root)
- mount /dev/btrfs-device /mnt (i.e. mount the top subvol at /mnt)

The following happened:
- / was ro-mounted (obviously, at least one thing that I had expected
  correctly)
- /mnt was ro-mounted too (and then the /mnt/root/ nested subvol as
  well).
  => why is /mnt (i.e. the top level subvol) mounted ro??
  => I would have expected that, since / (i.e. the subvol "root" is ro
     mounted), it's also ro mounted as the nested subvol below 5, i.e.
     my naive thinking was in terms of logic:
     "/ mounted ro" => "subvol root is mounted ro (everywhere)"
       => "thus /mnt/root/ is mounted ro as well"

However, the latter doesn't seem to be true, cause then I did:
- remount,rw /mnt
=> now /mnt/*, including /mnt/root/*, was rw mounted



So I guess my assumption of subvols behaving more or less as if they'd
be a fs (and thus mounted at one place ro => everywhere ro) is not
true, is it?

Do, ro,rw (and possibly others) instead only affect the respective
mountpoint?
And automatically any nested subvols of that mountpoint?

So I could have basically:
/mount-point1/subvol-a  ; ro, noexec
/mount-point2/subvol-a  ; rw, compress=yes
/root   ; rw, compress=no
/root/here/it/is/nested/subvol-a ; (no mountpoint)

(with subvol-a being the same subvol)

And when I write via mount-point1 I'd get an error, but via mount-
point2 it works and in addition I get compression, while when writing
via the /root mountpoint, where it is nested, I'd get the rw and
compress=no from the "parent" mountpoint /root


Does that sound correct?
It seems to make sense actually, though it's a bit unfamiliar... if I'm
not entirely wrong, then e.g. with ext* I cannot have the same
fs mounted with different settings,... of course I cannot have it
mounted twice at all, except via bind mounts.

So I guess, that when I'd do --bind mounts instead, I actually do get
the "old" behaviour, i.e. when the source is ro, then the --bind
mount's target is also forcibly ro.
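A minimal sketch of that bind-mount behaviour (made-up paths, needs root): a bind mount of a read-only filesystem is forced ro by the shared superblock, while the reverse — an ro bind target over an rw source — needs an explicit remount:

```shell
mkdir -p /mnt/src /mnt/tgt
mount --bind /mnt/src /mnt/tgt
# the plain bind target starts rw; restrict just this mountpoint:
mount -o remount,bind,ro /mnt/tgt
touch /mnt/src/ok     # succeeds via the original path
touch /mnt/tgt/no     # fails with "Read-only file system"
```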


Still, one unclear thing: why did /mnt get mounted ro at the very beginning?



Thanks,
Chris.

btw: Not sure if I just missed it, but I guess the above should be more
or less documented, showing people that mounting subvols (especially
when mounting the same one several times, either directly or as a
nested subvol) has these implications.



Re: subvols, ro- and bind mounts - how?

2015-12-10 Thread Christoph Anton Mitterer
On Thu, 2015-12-10 at 23:36 +0100, S.J. wrote:
> Quote:
> 
> " Most mount options apply to the whole filesystem, and only the
> options 
> for the first subvolume
> to be mounted will take effect. This is due to lack of implementation
> and may change in the future. "
> 
> from https://btrfs.wiki.kernel.org/index.php/Mount_options in a red
> box 
> on the top.

I had read that, but it doesn't really make clear that options
can effectively differ for the *same* subvol, when mounted several
times (or when appearing additionally as a nested subvolume).

Chris.



Re: [PATCH] btrfs: Introduce new mount option to disable tree log replay

2015-12-08 Thread Christoph Anton Mitterer
On Tue, 2015-12-08 at 07:15 -0500, Austin S Hemmelgarn wrote:
> Despite this, it really isn't a widely known or well documented
> behavior 
> outside of developers, forensic specialists, and people who have had
> to 
> deal with the implications it has on data recovery.  There really
> isn't 
> any way that the user would know about it without being explicitly
> told, 
> and it's something that can have a serious impact on being able to 
> recover a broken filesystem.  TBH, I really feel that _every_ 
> filesystem's documentation should have something about how to make it
> mount truly read-only, even if it's just a reference to how to mark
> the 
> block device read-only.
Exactly what I've meant.

And the developers here, should definitely consider that every normal
end-user, may easily assume the role of e.g. a forensics specialist
(especially with btrfs ;-) ), when recovery in case of corruptions is
tried.


I don't think that "it has always been improperly documented" (i.e. the
"ro" option) is a good excuse to continue doing it that way =)


Cheers,
Chris.



Re: [RFC] Btrfs device and pool management (wip)

2015-12-08 Thread Christoph Anton Mitterer
On Mon, 2015-11-30 at 13:17 -0700, Chris Murphy wrote:
> On Mon, Nov 30, 2015 at 7:51 AM, Austin S Hemmelgarn
>  wrote:
> 
> > General thoughts on this:
> > 1. If there's a write error, we fail unconditionally right now.  It
> > would be
> > nice to have a configurable number of retries before failing.
> 
> I'm unconvinced. I pretty much immediately do not trust a block
> device
> that fails even a single write, and I'd expect the file system to
> quickly get confused if it can't rely on flushing pending writes to
> that device.
From my large-amounts-of-storage-admin PoV,... I'd say it would be nice
to have more knobs to control when exactly a device is considered no
longer perfectly fine, which can include several different stages like:
- perhaps unreliable
  e.g. maybe the device shows SMART problems or there were correctable 
  read and/or write errors under a certain threshold (either in total,
  or per time period)
  Then I could imagine that one can control whether the device is:
  - continued to be used normally until certain error thresholds are
    exceeded.
  - placed in a mode where data is still written to it, but only when
    there's a duplicate on at least one other good device,... so the
    device would be used as a read pool;
    maybe optionally, data already on the device is auto-replicated to
    good devices.
  - taken offline (perhaps only to be automatically reused in case of
    emergency (as a hot spare), when the fs knows that otherwise it's
    even more likely that data would be lost soon).
- failed
  the threshold from above has been reached, the fs suspects the
  device to fail completely soon.
  Possible knobs would include how aggressively data is moved
  off the device. How often should retries be made? In case the
  other devices are under high IO load, what percentage of IO should
  be used to get the still-working data off the bad device (i.e. up
  to 100%, meaning "rather stop any other IO, just to move the data
  to good devices ASAP")?
- dead
  accesses don't work anymore at all and the fs shouldn't even waste 
  time trying to read/recover data from it.

It would also make sense to allow tuning what conditions need to be met
to e.g. consider a drive unreliable (e.g. which SMART errors?) and to
allow an admin to manually place a drive in a certain state (e.g. SMART
would still be good, no IO errors so far, but the drive is 5 years old
and I'd rather consider it unreliable).


That's - to some extent - what we at our LHC Tier-2 do at higher levels
(partly simply by human management, partly via the storage management
system we use (dCache), partly by RAID and other tools and scripting).



In any case, though,... any of these knobs should IMHO default to the
most conservative settings.
In other words: If a device shows the slightest hint of being
unstable/unreliable/failed... it should be considered bad, no new data
should go on it (if necessary, because not enough other devices are
left, the fs should go ro).
The only thing I don't have an opinion on is: should the fs go ro and do
nothing, waiting for a human to decide what's next, or should it go ro
and (if possible) try to move data off the bad device (per default)?

Generally, a filesystem should be safe per default (which is why I see
the issue in the other thread with the corruption/security leaks in
case of UUID collisions as quite a showstopper).
From the admin side, I don't want to be required to make it safe,.. my
interaction should rather only be needed to tune things.

Of course I'm aware that btrfs brings several techniques which make it
unavoidable that more maintenance is put into the filesystem, but, per
default, this should be minimised as far as possible.


Cheers,
Chris.



Re: [auto-]defrag, nodatacow - general suggestions?(was: btrfs: poor performance on deleting many large files?)

2015-12-08 Thread Christoph Anton Mitterer
Hey Hugo,


On Thu, 2015-11-26 at 00:33 +, Hugo Mills wrote:
>    Answering the second part first, no, it can't.
Thanks so far :)


>    The issue is that nodatacow bypasses the transactional nature of
> the FS, making changes to live data immediately. This then means that
> if you modify a modatacow file, the csum for that modified section is
> out of date, and won't be back in sync again until the latest
> transaction is committed. So you can end up with an inconsistent
> filesystem if there's a crash between the two events.
Sure,... (and btw: is there some kind of journal planned for
nodatacow'ed files?),... but why not simply try to write an updated
checksum after the modified section has been flushed to disk... of
course there's no guarantee that both are consistent in case of a crash
(but that's also the case without any checksum)... but at least one
would have the csum protection against everything else (block errors
and the like) in case no crash occurs?



> > For me the checksumming is actually the most important part of
> > btrfs
> > (not that I wouldn't like its other features as well)... so turning
> > it
> > off is something I really would want to avoid.
> > 
> > Plus it opens questions like: When there are no checksums, how can
> > it
> > (in the RAID cases) decide which block is the good one in case of
> > corruptions?
>    It doesn't decide -- both copies look equally good, because
> there's
> no checksum, so if you read the data, the FS will return whatever
> data
> was on the copy it happened to pick.
Hmm I see... so one basically gets the behaviour of traditional RAID.
Isn't that kind of a big loss? I always considered the guarantee
against block errors and the like one of the big and basic features of
btrfs.
It seems that for certain (not too unimportant) cases (DBs, VMs) one has
to choose between two evils: losing the guaranteed consistency via
checksums... or basically running into severe trouble (like Mitch's
reported fragmentation issues).


> > 3) When I would actually disable datacow for e.g. a subvolume that
> > holds VMs or DBs... what are all the implications?
> > Obviously no checksumming, but what happens if I snapshot such a
> > subvolume or if I send/receive it?
> 
>    After snapshotting, modifications are CoWed precisely once, and
> then it reverts to nodatacow again. This means that making a snapshot
> of a nodatacow object will cause it to fragment as writes are made to
> it.
I see... something that should possibly go into some advanced admin
documentation (if not already there).
It means basically, that one must assure that any such files (VM
images, DB data dirs) are already created with nodatacow (perhaps on a
subvolume which is mounted as such).
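A sketch of that "created with nodatacow from the start" setup, using the per-directory NOCOW attribute instead of a whole mount (the path is just an example; the attribute must be set before the files are created — it does not convert existing data, and it also disables checksumming for those files):

```shell
mkdir -p /srv/vm-images
chattr +C /srv/vm-images        # new files inherit NOCOW from the dir
lsattr -d /srv/vm-images        # the 'C' flag should now be listed
# files created from now on are rewritten in place:
qemu-img create -f raw /srv/vm-images/test.img 10G
```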


> > 4) Duncan mentioned that defrag (and I guess that's also for auto-
> > defrag) isn't ref-link aware...
> > Isn't that somehow a complete showstopper?
>    It is, but the one attempt at dealing with it caused massive data
> corruption, and it was turned off again.
So... does this mean that it's still planned to be implemented some day
or has it been given up forever?
And is it (hopefully) also planned to be implemented for reflinks when
compression is added/changed/removed?


Given that you (or Duncan?,... sorry, I sometimes mix up which of you
said exactly what, since both of you are notoriously helpful :-) ) mentioned
that autodefrag basically fails with larger files,... and given that it
seems to be quite important for btrfs to not be fragmented too heavily,
it sounds a bit as if anything that uses (multiple) reflinks (e.g.
snapshots) cannot be really used very well.


>  autodefrag, however, has
> always been snapshot aware and snapshot safe, and would be the
> recommended approach here.
Ahhh... so autodefrag *is* snapshot aware, and that's basically why the
suggestion is (AFAIU) that it's turned on, right?
So, I'm afraid O:-), that triggers a follow-up question:
Why isn't it the default? Or in other words what are its drawbacks
(e.g. other cases where ref-links would be broken up,... or issues with
compression)?

And also, when I now activate it on an already populated fs, will it
defrag also any old files (even if they're not rewritten or so)?
I tried to have a look for some general (rather "for dummies" than for
core developers) description of how defrag and autodefrag work... but
couldn't find anything in the usual places... :-(

btw: The wiki (https://btrfs.wiki.kernel.org/index.php/UseCases#How_do_
I_defragment_many_files.3F) doesn't mention that auto-defrag doesn't
suffer from that problem.
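Lacking a btrfs-specific tool, filefrag from e2fsprogs gives at least a rough picture of which files are fragmented (one caveat: compressed btrfs files are reported as roughly one extent per 128 KiB, so they look worse than they are; the path below is hypothetical):

```shell
# a high extent count on a rewrite-heavy file hints that a manual
# defrag (or nodatacow from the start) may be worthwhile:
filefrag /var/lib/postgresql/9.4/main/base/16384/16397
```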


>  (Actually, it was broken in the same
> incident I just described -- but fixed again when the broken patches
> were reverted).
So it just couldn't be fixed (hopefully: yet) for the (manual) online
defragmentation?!


> > 5) Especially keeping (4) in mind but also the other comments in
> > from
> > Duncan and Austin...
> > Is auto-defrag now recommended to be generally used?
>
>    Absolutely, yes.
I see... well, I'll probably wait 

Re: [auto-]defrag, nodatacow - general suggestions?(was: btrfs: poor performance on deleting many large files?)

2015-12-08 Thread Christoph Anton Mitterer
On 2015-11-27 00:08, Duncan wrote:
> Christoph Anton Mitterer posted on Thu, 26 Nov 2015 01:23:59 +0100 as
> excerpted:
>> 1) AFAIU, the fragmentation problem exists especially for those files
>> that see many random writes, especially, but not limited to, big files.
>> Now that databases and VMs are affected by this, is probably broadly
>> known in the meantime (well at least by people on that list).
>> But I'd guess there are n other cases where such IO patterns can happen
>> which one simply never notices, while the btrfs continues to degrade.
> 
> The two other known cases are:
> 
> 1) Bittorrent download files, where the full file size is preallocated 
> (and I think fsynced), then the torrent client downloads into it a chunk 
> at a time.
Okay, sounds obvious.


> The more general case would be any time a file of some size is 
> preallocated and then written into more or less randomly, the problem 
> being the preallocation, which on traditional rewrite-in-place 
> filesystems helps avoid fragmentation (as well as ensuring space to save 
> the full file), but on COW-based filesystems like btrfs, triggers exactly 
> the fragmentation it was trying to avoid.
Is it really just the case when the file storage *is* actually fully
pre-allocated?
Cause that wouldn't (necessarily) be the case for e.g. VM images (e.g.
qcow2, or raw images when these are sparse files).
Or is it rather any case where many random (file-internal) writes
occur in a larger file?


> arranging to 
> have the client write into a dir with the nocow attribute set, so newly 
> created torrent files inherit it and do rewrite-in-place, is highly 
> recommended.
At the IMHO pretty high expense of losing the checksumming :-(
Basically losing half of the main functionalities that make btrfs
interesting for me.


> It's also worth noting that once the download is complete, the files 
> aren't going to be rewritten any further, and thus can be moved out of 
> the nocow-set download dir and treated normally.
Sure... but this requires manual intervention.
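One subtlety of that manual step (paths are made up): within one filesystem, mv is a rename and doesn't rewrite the data, so the file keeps its NOCOW attribute; an actual copy is needed to bring it back under CoW and checksums:

```shell
# a rename keeps the file NOCOW:
#   mv ~/torrents-nocow/file.iso ~/archive/
# a real copy rewrites the data as a normal, checksummed file:
cp --reflink=never ~/torrents-nocow/file.iso ~/archive/file.iso
rm ~/torrents-nocow/file.iso
```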

For databases, will e.g. the vacuuming maintenance tasks solve the
fragmentation issues? (Cause I guess at least when doing a full vacuum,
it will rewrite the files.)


> The problem is much reduced in newer systemd, which is btrfs aware and in 
> fact uses btrfs-specific features such as subvolumes in a number of cases 
> (creating subvolumes rather than directories where it makes sense in some 
> shipped tmpfiles.d config files, for instance), if it's running on 
> btrfs.
Hmm, doesn't seem really good to me if systemd would do that, cause it
then excludes any such files from being snapshotted.


> For the journal, I /think/ (see the next paragraph) that it now 
> sets the journal files nocow, and puts them in a dedicated subvolume so 
> snapshots of the parent won't snapshot the journals, thereby helping to 
> avoid the snapshot-triggered cow1 issue.
The same here, kinda disturbing if systemd would decide that on its
own, i.e. excluding files from being checksum protected...


>> So is there any general approach towards this?
> The general case is that for normal desktop users, it doesn't tend to be 
> a problem, as they don't do either large VMs or large databases,
Well, that depends a bit on how one defines the "normal desktop user",... for
e.g. developers or more "power users" it's probably not so unlikely that
they do run local VMs for testing or whatever.

> and 
> small ones such as the sqlite files generated by firefox and various 
> email clients are handled quite well by autodefrag, with that general 
> desktop usage being its primary target.
Which is however not yet the default...


> For server usage and the more technically inclined workstation users who 
> are running VMs and larger databases, the general feeling seems to be 
> that those adminning such systems are, or should be, technically inclined 
> enough to do their research and know when measures such as nocow and 
> limited snapshotting along with manual defrags where necessary, are 
> called for.
mhh... well it's perhaps reasonable to expect that knowledge for a few
things like VMs, DBs and the like... but there are countless software
systems, many of them being more or less black boxes, at least with
respect to their internals.

It feels a bit as if there should be some tools provided by btrfs which
tell the users which files are likely problematic and should be nodatacow'ed.


> And if they don't originally, they find out when they start 
> researching why performance isn't what they expected and what to do about 
> it. =:^)
Which can take quite a while to be found out...


>> And what are the actual possible consequences? Is it just that fs gets
>> slower (due to the fragmentation) or may I even run into other is

Re: subvols and parents - how?

2015-12-08 Thread Christoph Anton Mitterer
On Fri, 2015-11-27 at 02:02 +, Duncan wrote:
> > Uhm, I don't get the big security advantage here... whether nested
> > or manually mounted to a subdir,... if the permissions are insecure
> > I'll have a problem... if they're secure, then not.
> Consider a setuid-root binary with a recently publicized but patched
> on 
> your system vuln.  But if you have root snapshots from before the
> patch 
> and those snapshots are nested below root, then they're always 
> accessible.  If the path to the vulnerable setuid is as user
> accessible 
> as it likely was in its original location, then anyone with login
> access 
> to the system is likely to be able to run it from the snapshot... and
> will be able to get root due to the vuln.

Hmm good point... I think it would be great if you could add that
scenario somewhere to the documentation. :-)
Based on that one can easily think about more/similar examples...
device files that had too permissive modes set and were snapshotted
like that... and so on.

I think that's another example why it would be nice if btrfs had
something (per subvolume) like ext4's default mount options (I mean the
ones stored in the superblock).

Not only would it allow the userland tools to do things like "adding
noatime" per default on snapshots (at least ro snapshots), so that one
can have them nested and still not suffer from the previously
discussed writes-on-read amplification... it would also allow setting
things like nodev, noexec, nosuid and the like on subvols... and again
it would make the whole thing practically usable with nested subvols.
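Until something like per-subvolume defaults exists, the closest approximation is spelling the VFS options out per mountpoint, e.g. in /etc/fstab (the UUID and subvolume names below are placeholders; ro/nosuid/nodev/noexec are per-mountpoint flags and thus can differ per mount, unlike most btrfs-specific options):

```
UUID=...  /           btrfs  subvol=root,rw,noatime                           0 0
UUID=...  /snapshots  btrfs  subvol=snapshots,ro,noatime,nosuid,nodev,noexec  0 0
```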


Where would be the appropriate place to record that as a feature
request?
Simply here on the list?


Cheers,
Chris.



Re: attacking btrfs filesystems via UUID collisions?

2015-12-08 Thread Christoph Anton Mitterer
On Sun, 2015-12-06 at 22:34 +0800, Qu Wenruo wrote:
> Not sure about LVM/MD, but they should suffer the same UUID conflict
> problem.
Well I had that actually quite often in LVM (i.e. same UUIDs visible on
the same system), basically because we made clones from one template VM
image, and when that is normally booted, LVM doesn't allow changing the
UUIDs of already active PVs/VGs/LVs (or maybe just some of these three,
I forgot the details).

But there was never any issue, LVM on the host system, when one set was
already used, continues to use that just fine and the toolset reports
which it would use (more below).


> The only idea I have can only enhance the behavior, but never fix it.
> For example, if found multiple btrfs devices with same devid, just 
> refuse to mount.
> And for already mounted btrfs, ignore any duplicated fsid/devid.
Well I think that's already a perfectly valid solution... basically the
idea that I had before.
I'd call that a 100% fix, not just a workaround.

If the tools (i.e. btrfstune) then allow changing the UUID of the duplicate 
set of devices (perhaps again with the necessity to specify each of them via 
device=/dev/sda,etc.) I'd be completely happy again,... and the show could get 
on ;)
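For what it's worth, newer btrfs-progs can already do the re-UUID part (on an unmounted fs only, and all member devices must be present since every superblock gets rewritten; exact flags may vary by version):

```shell
# give the cloned filesystem a fresh random fsid (hypothetical device):
btrfstune -u /dev/sdc1
# or set a specific UUID:
btrfstune -U 12345678-1234-1234-1234-123456789abc /dev/sdc1
```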

> The problem can get even trickier for cases like a device missing for
> a while and then appearing again.
I had thought about that too:
a) In the non-malicious case, this could e.g. mean that a device from a
btrfs RAID was missing and a clone with the same UUID / dev ID gets
added to the system.
Possible consequences, AFAICS:
- The data is simply auto-rebuilt on the clone.
- Some corruptions occur when the clone is older, and data that was
only on the newer device is now missing (not sure if this can happen at
all or whether generation IDs prevent it).

b) In the malicious/attack case, one possible scenario could be:
A device is missing from a btrfs RAID... the machine is left
unattended. An attacker comes and plugs in a USB stick with the
missing UUID. Is the rebuild (and thus data leakage) now happening
automatically?

In any case though, a simple solution could be that no automatic
assembly happens per default, and that people who still want to do
that are properly warned about the possible implications in the docs.


> But just as you mentioned, it *IS* a real problem, and we should need
> to 
> enhance it.
Should one (or I) add this as a ticket to the kernel bugzilla, or as an
entry to the btrfs wiki?


> I'd like to see how LVM/DM behaves first, at least as a reference if 
> they are really so safe.
Well that's very simple to check, I did it here for the LV case only:
root@lcg-lrz-admin:~# truncate -s 1G image1
root@lcg-lrz-admin:~# losetup -f image1 
root@lcg-lrz-admin:~# pvcreate /dev/loop0
  Physical volume "/dev/loop0" successfully created
root@lcg-lrz-admin:~# losetup -d /dev/loop0 
root@lcg-lrz-admin:~# cp image1 image2
root@lcg-lrz-admin:~# losetup -f image1 
root@lcg-lrz-admin:~# pvscan 
  PV /dev/sdb VG vg_data lvm2 [50,00 GiB / 0free]
  PV /dev/sda1VG vg_system   lvm2 [9,99 GiB / 0free]
  PV /dev/loop0  lvm2 [1,00 GiB]
  Total: 3 [60,99 GiB] / in use: 2 [59,99 GiB] / in no VG: 1 [1,00 GiB]
root@lcg-lrz-admin:~# losetup -f image2 
root@lcg-lrz-admin:~# pvscan 
  Found duplicate PV tSK9Cdpw6bcmocZnxFPD6ThNz1opRXsB: using /dev/loop1 not 
/dev/loop0
  PV /dev/sdb VG vg_data lvm2 [50,00 GiB / 0free]
  PV /dev/sda1VG vg_system   lvm2 [9,99 GiB / 0free]
  PV /dev/loop1  lvm2 [1,00 GiB]
  Total: 3 [60,99 GiB] / in use: 2 [59,99 GiB] / in no VG: 1 [1,00 GiB]

Obviously, with PVs alone, there is no "x is already used" case. As one
can see, it just says it would ignore one of them, which I think is
rather stupid in that particular case (i.e. none of the devices already
used somehow), because it probably just "randomly" decides which is to
be used, which is ambiguous.


> And what will rescan show if they are not active?
My experience was always (it's just quite late and I don't want to
simulate everything right now, which is trivial anyway):
- It shows warnings about the duplicates in the tools
- It continues to use the already active devices (if any)
- Unfortunately, while the kernel continues to use the already used
devices, the toolset may use another device (kinda stupid, but at least
it warns, and the already used devices seem to be still properly used):

continuation from the setup above:
root@lcg-lrz-admin:~# losetup -d /dev/loop1 
(now only image1 is seen as loop0)
root@lcg-lrz-admin:~# vgcreate vg_test /dev/loop0
  Volume group "vg_test" successfully created
root@lcg-lrz-admin:~# lvcreate -n test vg_test -l 100
  Logical volume "test" created
root@lcg-lrz-admin:~# mkfs.ext4 /dev/vg_test/test 
mke2fs 1.42.12 (29-Aug-2014)
...
root@lcg-lrz-admin:~# mount /dev/vg_test/test /mnt/
root@lcg-lrz-admin:~# losetup -a
/dev/loop0: [64768]:518297 (/root/image1)
root@lcg-lrz-admin:~# losetup -f image2 
root@lcg-lrz-admin:~# vgs
 

Re: kernel call trace during send/receive

2015-12-08 Thread Christoph Anton Mitterer
Hey.

Hmm I guess no one has any clue about that error?

Well it seems at least that an fsck over the receiving fs passes
through without any error.

Cheers,
Chris.

On Fri, 2015-11-27 at 02:49 +0100, Christoph Anton Mitterer wrote:
> Hey.
> 
> Just got the following during send/receiving a big snapshot from one
> btrfs to another fresh one.
> 
> Both under kernel 4.2.6, tools 4.3
> 
> The send/receive seems to continue however...
> 
> Any ideas what that means?
> 
> Cheers,
> Chris.
> 
> Nov 27 01:52:36 heisenberg kernel: [ cut here ]
> 
> Nov 27 01:52:36 heisenberg kernel: WARNING: CPU: 7 PID: 18086 at
> /build/linux-CrHvZ_/linux-4.2.6/fs/btrfs/send.c:5794
> btrfs_ioctl_send+0x661/0x1120 [btrfs]()
> Nov 27 01:52:36 heisenberg kernel: Modules linked in: ext4 mbcache
> jbd2 nls_utf8 nls_cp437 vfat fat uas vhost_net vhost macvtap macvlan
> xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4
> iptable_nat nf_nat_ipv4 nf_nat xt_tcpudp tun bridge stp llc fuse ccm
> ebtable_filter ebtables seqiv ecb drbg ansi_cprng algif_skcipher md4
> algif_hash af_alg binfmt_misc xfrm_user xfrm4_tunnel tunnel4 ipcomp
> xfrm_ipcomp esp4 ah4 cpufreq_userspace cpufreq_powersave
> cpufreq_stats cpufreq_conservative ip6t_REJECT nf_reject_ipv6
> nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_filter ip6_tables xt_policy
> ipt_REJECT nf_reject_ipv4 xt_comment nf_conntrack_ipv4 nf_defrag_ipv4
> xt_multiport xt_conntrack nf_conntrack iptable_filter ip_tables
> x_tables joydev rtsx_pci_ms rtsx_pci_sdmmc mmc_core memstick iTCO_wdt
> iTCO_vendor_support x86_pkg_temp_thermal
> Nov 27 01:52:36 heisenberg kernel:  intel_powerclamp intel_rapl
> iosf_mbi coretemp kvm_intel kvm crct10dif_pclmul crc32_pclmul evdev
> deflate ctr psmouse serio_raw twofish_generic pcspkr btusb btrtl
> btbcm btintel bluetooth crc16 uvcvideo videobuf2_vmalloc
> videobuf2_memops videobuf2_core v4l2_common videodev media
> twofish_avx_x86_64 twofish_x86_64_3way twofish_x86_64 twofish_common
> sg arc4 camellia_generic iwldvm mac80211 iwlwifi cfg80211 rtsx_pci
> rfkill camellia_aesni_avx_x86_64 snd_hda_codec_hdmi tpm_tis tpm
> 8250_fintek camellia_x86_64 snd_hda_codec_realtek
> snd_hda_codec_generic processor battery fujitsu_laptop i2c_i801 ac
> lpc_ich serpent_avx_x86_64 mfd_core snd_hda_intel snd_hda_codec
> snd_hda_core snd_hwdep snd_pcm shpchp snd_timer e1000e snd soundcore
> i915 ptp pps_core video button drm_kms_helper drm thermal_sys mei_me
> Nov 27 01:52:36 heisenberg kernel:  i2c_algo_bit mei
> serpent_sse2_x86_64 xts serpent_generic blowfish_generic
> blowfish_x86_64 blowfish_common cast5_avx_x86_64 cast5_generic
> cast_common des_generic cbc cmac xcbc rmd160 sha512_ssse3
> sha512_generic sha256_ssse3 sha256_generic hmac crypto_null af_key
> xfrm_algo loop parport_pc ppdev lp parport autofs4 dm_crypt dm_mod
> md_mod btrfs xor raid6_pq uhci_hcd usb_storage sd_mod crc32c_intel
> aesni_intel aes_x86_64 glue_helper ahci lrw gf128mul ablk_helper
> libahci cryptd libata ehci_pci xhci_pci ehci_hcd scsi_mod xhci_hcd
> usbcore usb_common
> Nov 27 01:52:36 heisenberg kernel: CPU: 7 PID: 18086 Comm: btrfs Not
> tainted 4.2.0-1-amd64 #1 Debian 4.2.6-1
> Nov 27 01:52:36 heisenberg kernel: Hardware name: FUJITSU LIFEBOOK
> E782/FJNB23E, BIOS Version 1.11 05/24/2012
> Nov 27 01:52:36 heisenberg kernel:   a02e6260
> 8154e2f6 
> Nov 27 01:52:36 heisenberg kernel:  8106e5b1 880235a3c42c
> 7ffd3d3796c0 8802f0e5c000
> Nov 27 01:52:36 heisenberg kernel:  0004 88010543c500
> a02d2d81 88041e5ebb00
> Nov 27 01:52:36 heisenberg kernel: Call Trace:
> Nov 27 01:52:36 heisenberg kernel:  [] ?
> dump_stack+0x40/0x50
> Nov 27 01:52:36 heisenberg kernel:  [] ?
> warn_slowpath_common+0x81/0xb0
> Nov 27 01:52:36 heisenberg kernel:  [] ?
> btrfs_ioctl_send+0x661/0x1120 [btrfs]
> Nov 27 01:52:36 heisenberg kernel:  [] ?
> __alloc_pages_nodemask+0x194/0x9e0
> Nov 27 01:52:36 heisenberg kernel:  [] ?
> btrfs_ioctl+0x26c/0x2a10 [btrfs]
> Nov 27 01:52:36 heisenberg kernel:  [] ?
> sched_move_task+0xca/0x1d0
> Nov 27 01:52:36 heisenberg kernel:  [] ?
> cpumask_next_and+0x2e/0x50
> Nov 27 01:52:36 heisenberg kernel:  [] ?
> select_task_rq_fair+0x23f/0x5c0
> Nov 27 01:52:36 heisenberg kernel:  [] ?
> enqueue_task_fair+0x387/0x1120
> Nov 27 01:52:36 heisenberg kernel:  [] ?
> native_sched_clock+0x24/0x80
> Nov 27 01:52:36 heisenberg kernel:  [] ?
> sched_clock+0x5/0x10
> Nov 27 01:52:36 heisenberg kernel:  [] ?
> do_vfs_ioctl+0x2c3/0x4a0
> Nov 27 01:52:36 heisenberg kernel:  [] ?
> _do_fork+0x146/0x3a0
> Nov 27 01:52:36 heisenberg kernel:  [] ?
> SyS_ioctl+0x76/0x90
> Nov 27 01:52:36 heisenberg kernel:  [] ?
> system_call_fast_compare_end+0xc/0x6b
Nov 27 01:52:36 heisenberg kernel: ---[ end trace f5fa91e2672eead0 ]---



Re: attacking btrfs filesystems via UUID collisions? (was: Subvolume UUID, data corruption?)

2015-12-08 Thread Christoph Anton Mitterer
On Sun, 2015-12-06 at 04:06 +, Duncan wrote:
> There's actually a number of USB-based hardware and software vulns
> out
> there, from the under $10 common-component-capacitor-based charge-
> and-zap
> (charges off the 5V USB line, zaps the port with several hundred
> volts
> reverse-polarity, if the machine survives the first pulse and
> continues
> supplying 5V power, repeat...), to the ones that act like USB-based
> input
> devices and "type" in whatever commands, to simple USB-boot to a
> forensic
> distro and let you inspect attached hardware (which is where the
> encrypted
> storage comes in, they've got everything that's not encrypted),
> to the plain old fashioned boot-sector viruses that quickly jump to
> everything else on the system that's not boot-sector protected and/or
> secure-boot locked, to...
Well this is all well known - at least to security folks ;) - but to be
quite honest:
Not an excuse for allowing even more attack surface, in this case via
the filesystem.
One will *always* find a weaker element in the security chain, and
could always use that as an argument not to fix one's own issues.

"Well, there's no need to fix that possible collision-data-leakage-
issue in btrfs[0]! Why? Well an attacker could still simply abduct the
bank manager, torture him for hours until he gives any secret with
pleasure"
;-)


> Which is why most people in the know say if you have unsupervised
> physical
> access, you effectively own the machine and everything on it, at
> least
> that's not encrypted.
Sorry, I wouldn't say so. Ultimately you're of course right, which is
why my fully-dm-crypted notebook is never left alone when it runs (cold
boot or USB firmware attacks)... but in practise things are a bit
different I think.
Take the ATM example.

Or take real world life in big computing centres.
Fact is, many people usually have access, from the actual main
personnel, over electricians, to the cleaning personnel.
Whacking a device or attacking it via USB firmware tricks is of course
possible for them, but it's much more likely to be noticed (making
noise, taking time and so on)... so there is no need to add yet another
attack surface on top of this.


> If you haven't been keeping up, you really have some reading to
> do.  If
> you're plugging in untrusted USB devices, seriously, a thumb drive
> with a
> few duplicated btrfs UUIDs is the least of your worries!
Well as I've said, getting that in via USB may be only one way.
We're already so far that GNOME automounts devices when plugged in...
who says that the next step isn't that this happens remotely in some
form, e.g. a btrfs image on Dropbox, automounted by Nautilus.
Okay, that may be a bit contrived, but it should demonstrate that
there could be plenty of ways for this to happen, which we don't even
think of (and usually those are the worst in security).


You said it's basically not fixable in btrfs:
It's absolutely clear that I'm no btrfs expert (or even developer), but
my poor man approach which I think I've written before doesn't seem so
impossible, does it?
1) Don't simply "activate" btrfs devices that are found but rather:
2) Check if there are other devices of the same fs UUID + device ID, or
more generally said: check if there are any collisions
3) If there are, and some of them are already active, continue to use
them, don't activate the newly appeared ones
4) If there are, and none of them are already active, refuse to
activate *any* of them unless the user manually instructs to do so via
device= like options.
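The four steps above can be sketched as a purely illustrative userland pseudo-policy (this is NOT btrfs code; the KNOWN and EXPLICIT variables only stand in for the kernel's device registry and a hypothetical device=-style option):

```shell
# KNOWN holds one line per already-scanned device: "fs_uuid dev_id path active"
# (active is 1/0); EXPLICIT lists paths the user forced via a device= option.
decide() {  # $1=fs_uuid  $2=dev_id  $3=device path
    path=$3
    match=$(printf '%s\n' "$KNOWN" | awk -v u="$1" -v d="$2" '
        $1 == u && $2 == d { found = 1; if ($4 == 1) active = 1 }
        END { print found + 0, active + 0 }')
    set -- $match                        # $1=collision?  $2=any collider active?
    if [ "$1" = 0 ]; then echo register  # steps 1+2: no collision, normal scan
    elif [ "$2" = 1 ]; then echo ignore  # step 3: fs already active, keep it
    else case " $EXPLICIT " in           # step 4: collision, nothing active
        *" $path "*) echo register ;;    #   user explicitly forced this path
        *)           echo refuse ;;      #   ambiguous: activate none of them
    esac; fi
}
```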


> BTW, this is documented (in someone simpler "do not do XX" form) on
> the
> wiki, gotchas page.
> 
> https://btrfs.wiki.kernel.org/index.php/Gotchas#Block-level_copies_of_devices
I know, but it doesn't really tell all possible consequences, and
again, it's unlikely that the end-user (even if possibly heavily
affected by it) will stumble over that.


Cheers,
Chris.


[0] Assuming there is actually one, I haven't really verified that and
base it solely one what people told that basically arbitrary
corruptions may happen on both devices.



Re: subvols and parents - how?

2015-12-08 Thread Christoph Anton Mitterer
On Fri, 2015-11-27 at 01:02 +, Duncan wrote:
[snip snap]
> #1 could be a pain to setup if you weren't actually mounting it
> previously, just relying on the nested tree, AND...
> 
> #2 The point I was trying to make, now, to mount it you'll mount not
> a 
> native nested subvol, and not a directly available sibling
> 5/subvols/home, but you'll actually be reaching into an entirely 
> different nesting structure to grab something down inside, mounting
> 5/subvols/root/home subvolume nesting down inside the direct
> 5/subvols/root sibling subvol.

Okay so your main point was basically "keeping things administrable"...


> one of which was that everything 
> that the package manager installs should be on the same partition
> with 
> the installed-package database, so if it has to be restored from
> backup, 
> at least if it's all old, at least it's all equally old, and the
> package 
> database actually matches what's on the system because it's in the
> same 
> backup!
I basically agree, though I'd allow a few exceptions, like database-like
data that is stored in /var/ sometimes and that doesn't need to be
consistent with anything but itself... e.g. static web pages
(/var/www)... a PostgreSQL DB, or an SKS keyserver DB... and so on.

btw: What's the proper way for merging / splitting into subvols.
E.g. consider I have:
5
|
+--root (subvol)
   |
   +-- var (no subvol)

And say I would want to split var/www off into a subvol.
Well, one obvious way would be with mv (and AFAIU that would keep my
reflinks with clones, if any), but that also means that anything that
accesses /var/www probably needs a downtime.
Is it planned to have a special function that basically says:
"make dir foo and anything below (except nested subvols) a subvol named
foo, immediately and atomically"?

And similar vice-versa... a special function that says:
"make subvol foo and anything below (except nested subvols) a dir of
the parent subvol named foo, immediately and atomically"?

Could be handy for real-world administration, especially when one wants
to avoid downtimes.
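For what it's worth, such a split can already be approximated with current tools, at the cost of a short downtime for the final swap. A rough sketch, assuming a reflink-capable cp, with all paths hypothetical:

```shell
btrfs subvolume create /var/www.new               # empty subvol next to the dir
cp -a --reflink=always /var/www/. /var/www.new/   # cheap copy: data extents are shared
# stop whatever serves /var/www, then swap via two quick renames:
mv /var/www /var/www.old
mv /var/www.new /var/www
# restart services; once everything checks out: rm -rf /var/www.old
```

The reflink copy keeps the data shared on disk, so only metadata is duplicated; the downtime is limited to the two renames.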

btw: A few days ago, either Hugo or you thought that mv'ing a subvol
would change its UUID, but my try (which was with coreutils 8.3 -> no
reflinked mv) seemed to show it wouldn't, but there was no further reply
then... so am I right that the UUID wouldn't change?


> The same idea applies here.  Once you start reaching into nested
> subvols 
> to get the deeper nested subvols you're trying to mount, it's too
> much 
> and you're just begging to get it wrong under the extreme pressures
> of a 
> disaster recovery.
Well apparently you overlooked the extremely simple and reliable solution:
leaving a tiny little note on your desk saying something like: "dear
boss, things are screwed up, I'm on vacation now..." ;-)


Thanks,
Chris.



Re: [PATCH] btrfs: Introduce new mount option to disable tree log replay

2015-12-07 Thread Christoph Anton Mitterer
On Mon, 2015-12-07 at 11:29 -0600, Eric Sandeen wrote:
> FWIW, new mount options and their descriptions should be added to
> BTRFS-MOUNT(5) 
> as well.
Also, from the end-user perspective, there should be:
1) another option like (hard-ro) which is defined to imply any other
options that are required to make a mount truly read-only (i.e. right
now: ro and nologreplay... but in the future any other features that
may be required to make read-only really read-only).

and/or

2) a section that describes "ro" in btrfs-mount(5) which describes that
normal "ro" alone may cause changes on the device and which then refers
to hard-ro and/or the list of options (currently nologreplay) which are
required right now to make it truly ro.


I think this is important as an end-user probably expects "ro" to be
truly ro, so if he looks it up in the documentation (2) he should find
enough information that this isn't the case and what to do instead.
Further, as one might expect that in the future, other places (than
just the log) may cause changes to a device, even though mounted ro...
I think it's better for the end user to also have a real "hard-ro" like
option, than a possibly growing list of "noXYZ" where the end-user may
have no clue that something else is now also required to get the truly
read-only behaviour.
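Until something like "hard-ro" exists, the truly-read-only combination would presumably look like this (sketch; the device path is hypothetical, and nologreplay assumes the patch under discussion is applied):

```shell
blockdev --setro /dev/sdb1                       # block layer refuses all writes
mount -o ro,nologreplay /dev/sdb1 /mnt/evidence  # btrfs: no log replay either
```

The blockdev step guards against anything the filesystem might still try to write; the mount options keep btrfs itself from wanting to.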


Cheers,
Chris.



Re: [PATCH] btrfs: Introduce new mount option to disable tree log replay

2015-12-07 Thread Christoph Anton Mitterer
On Mon, 2015-12-07 at 17:06 -0600, Eric Sandeen wrote:
> Yeah, I don't know that this is true.  It hasn't been true for over a
> decade (2?), with the most widely-used filesystem in linux history,
> i.e.
> ext3.
Based on what? I know many sysadmins who don't expect that e.g. the
journal is replayed when ext* is mounted ro.

>   So if btrfs wants to go on this re-education crusade, more power
> to you, but I don't know that it's really a fight worth fighting.  ;)
I don't think a big fight is necessary, is it?
Just properly document all options and not only from the developers or
expert-users PoV.

Using blockdev --setro is of course an alternative to a dedicated
"hard-ro" option (or something similar that goes by a better name),...
OTOH I don't think that having such an option would cause big problems
or "religion wars" ;)


Cheers,
Chris.



Re: attacking btrfs filesystems via UUID collisions? (was: Subvolume UUID, data corruption?)

2015-12-05 Thread Christoph Anton Mitterer
On Sat, 2015-12-05 at 13:19 +, Duncan wrote:
> The problem with btrfs is that because (unlike traditional
> filesystems) 
> it's multi-device, it needs some way to identify what devices belong
> to a 
> particular filesystem.
Sure, but that applies to lvm, or MD as well... and I wouldn't know of
any random corruption issues there.


> And UUID is, by definition and expansion, Universally Unique ID.
Nitpicking doesn't help here,... reality is they're not,.. either by
people doing stuff like dd, other forms of clones, LVM, etc. ... or as
I've described maliciously.


> Btrfs 
> simply depends on it being what it says on the the tin, universally 
> unique, to ID the components of the filesystem and assemble them 
> correctly.
Admittedly, I'm not an expert to the internals of btrfs, but it seems
other multi-device containers can handle UUID duplicates fine, or at
least so that you don't get any data corruption (or leaks).

This is a showstopper - maybe not under lab conditions but surely under
real world scenarios.
I'm actually quite surprised that no one else has complained about
that before, given how long btrfs has existed.


> Besides dd, etc, LVM snapshots are another case where this goes
> screwy.  
> If the UUID isn't UUID, do a btrfs device scan (which udev normally
> does 
> by default these days) so the duplicate UUID is detected, and btrfs 
> *WILL* eventually start trying to write to all the "newly added"
> devices 
> that scan found, identified by their Universally Unique IDs, aka
> UUIDs.  
> It's not a matter of if, but when.
Well.. as I said... quite scary, with respect to both, accidental and
malicious cases of duplicate UUIDs.


> And the UUID is embedded so deeply within the filesystem and its 
> operations, as an inextricable part of the metadata (thus avoiding
> the 
> problem reiserfs had where a reiserfs stored in a loopback file on a 
> reiserfs, would screw up reiserfsck, on btrfs, the loopback file
> would 
> have a different UUID and thus couldn't be mixed up), that changing
> the 
> UUID is not the simple operation of changing a few bytes in the
> superblock 
> that it is on other filesystems, which is why there's now a tool to
> go 
> thru all those metadata entries and change it.
I don't think that this design is per se bad and prevents the kernel to
handle such situations gracefully.

I would expect that in addition to the fs UUID, it needs a form of
device ID... so why not simply ignore any new device for which there
already is a matching fs UUID and device ID, unless the respective tool
(mount, btrfs, etc.) is explicitly told so via some
device=/dev/sda,/dev/sdb option.

If that means that fewer things work out of the box (in the sense of
"auto-assembly"), well, then this is simply necessary.
Data security and consistency are definitely much more important than
any fancy auto-magic.



> So an aware btrfs admin simply takes pains to avoid triggering a
> btrfs 
> device scan at the wrong time, and to immediately hide their LVM 
> snapshots, immediately unplug their directly dd-ed devices, etc, and
> thus 
> doesn't have to deal with the filesystem corruption that'd be a when
> not 
> if, if they didn't take such precautions with their dupped UUIDs that
> actually aren't as UUID as the name suggests...
a) People shouldn't need days of study to be able to use btrfs
securely. Of course it's more advanced, and not everything can be
simplified in a way so that users don't need to know anything (e.g. all
the well-known effects of CoW)... but when the point is reached where
security and data integrity are threatened, there's definitely a hard
border that mustn't be crossed.

b) Given how complex software is, I doubt that it's easily possible,
even for the aware admin, to really prevent all situations that can
lead to such corruption.
Not to talk about any attack scenarios.



> And as your followup suggests in a security context, they consider 
> masking out their UUIDs before posting them, as well, tho most kernel
> hackers generally consider unsupervised physical access to be game-
> over, 
> security-wise.
Do they? I rather thought many of them had a rather practical and real-
world-situations-based POV.

> (After all, in that case there's often little or nothing 
> preventing a reboot to that USB stick, if desired, or simply yanking
> the 
> devices and duping them or plugging them in elsewhere, if the BIOS is
> password protected, with the only thing standing in the way at that
> point 
> being possible device encryption.)
There's hardware which would, when it detects physical intrusion (like
yanking), lock itself up (securely clearing the memory, disconnecting
itself from other nodes, which may be compromised as well when the
filesystem on the attacked node goes crazy).

You have things like ATMs, which are physically usually quite well
secured, but which do have rather easily accessible maintenance ports.
All of us have seen such embedded devices rebooting 

Re: attacking btrfs filesystems via UUID collisions? (was: Subvolume UUID, data corruption?)

2015-12-05 Thread Christoph Anton Mitterer
On Sat, 2015-12-05 at 12:01 +, Hugo Mills wrote:
> On Sat, Dec 05, 2015 at 04:28:24AM +0100, Christoph Anton Mitterer
> wrote:
> > On Fri, 2015-12-04 at 13:07 +, Hugo Mills wrote:
> > > I don't think it'll cause problems.
> > Is there any guaranteed behaviour when btrfs encounters two
> > filesystems
> > (i.e. not talking about the subvols now) with the same UUID?
> 
>    Nothing guaranteed, but the likelihood is that things will go
> badly
> wrong, in the sense of corrupt filesystems.
Phew... well sorry, but I think that's really something that makes
btrfs not productively usable until fixed.



>    Except that that's exactly the mechanism that btrfs uses to handle
> multi-device filesystems, so you've just broken anything with more
> than one device in the FS.
Don't other containers (e.g. LVM) do something similar, and yet they
don't fail badly in case e.g. multiple PVs with the same UUID appear,
AFAIC.

And shouldn't there be some kind of device UUID, which differs
different parts of the same btrfs (with the same fs UUID) but on
different devices?!


>    If you inspect the devid on each device as well, and refuse
> duplicates of those, you've just broken any multipathing
> configurations.
Well, how many people are actually doing this? A minority. So then it
would simply be necessary that multipathing doesn't work out of the box
and one needs to specifically tell the kernel to consider a device with
the same btrfs UUID not a clone but another path to the same device.

In any case, a rare feature like multipathing cannot justify the
possibility of data corruption.
The situation as it is now is IMHO completely unacceptable.



>    Even if you can handle that, if you have two copies of dev1, and
> two copies of dev2, how do you guarantee that the "right" pair of
> dev1
> and dev2 is selected? (e.g. if you have them as network devices, and
> the device enumeration order is unstable on each boot).
Not sure what you mean now:
The multipathing case?
Then, as I've said, such situations would simply require to manually
set things up and explicitly tell the kernel that the devices foo and
bar are to be used (despite their dup UUID).

If you mean what happens when I have e.g. two clones of a 2-device
btrfs, as in
fsdev1
fsdev2
fsdev1_clone
fsdev2_clone
Then as I've said before... if one pair of them is already mounted
(i.e. when the *_clone devices appear), then it's likely that these
belong together and the kernel should continue to use them and ignore
any others.
If all appear before any is mounted, then it should either refuse to
mount/use any of them, or require the devices to be specified manually
(i.e. via /dev/sda or so).


Cheers,
Chris.



Re: Subvolume UUID, data corruption?

2015-12-04 Thread Christoph Anton Mitterer
On Fri, 2015-12-04 at 13:07 +, Hugo Mills wrote:
> I don't think it'll cause problems.
Is there any guaranteed behaviour when btrfs encounters two filesystems
(i.e. not talking about the subvols now) with the same UUID?

Given that it's long-standing behaviour that people could clone
filesystems (dd, etc.) and this just worked™, btrfs should at least
handle such cases gracefully.
For example, when already more than one block device with a btrfs of
the same UUID are known, then it should refuse to mount any of them.

And if one is already known and another device pops up it should refuse
to mount that and continue to normally use the already mounted one.



Cheers,
Chris.



Re: attacking btrfs filesystems via UUID collisions? (was: Subvolume UUID, data corruption?)

2015-12-04 Thread Christoph Anton Mitterer
Thinking a bit more about that, I came to the conclusion that it's
actually security-relevant that btrfs deals gracefully with filesystems
having the same UUID:

Getting to know someone else's filesystem's UUID may be easier than
one might think.
It's usually not considered secret and for example included in debug reports 
(e.g. several Debian packages do this).

The only thing an attacker then needs to do is somehow making another 
filesystem with the UUID available in his victims system.
Simplest way is via a USB stick when he has local access.
Thanks to some stupid desktop environments, chances aren't too bad that
the system will even auto-mount the stick.

If btrfs doesn't handle this gracefully, the attacker may damage or
destroy the original filesystem, or, if things get awkwardly corrupted
(and data is written to the fake btrfs), even get data out of such a
system (despite any screen locks or dm-crypt).

Cheers
Chris.


Re: btrfs crashing the kernel with Seagate 8TB SMR drives.

2015-12-03 Thread Christoph Anton Mitterer
Any chances that this is:
https://bugzilla.kernel.org/show_bug.cgi?id=93581


Cheers,
Chris.

smime.p7s
Description: S/MIME cryptographic signature


Re: slowness when cp respectively send/receiving on top of dm-crypt

2015-11-28 Thread Christoph Anton Mitterer
On Sat, 2015-11-28 at 11:34 -0700, Chris Murphy wrote:
> It sounds to me like maybe LUKS is configured to use an encryption
> algorithm that isn't subject to CPU optimized support, e.g. aes-xts
> on
> my laptop gets 1600MiB/s where serpent-cbc gets only 68MiB/s and pegs
> the CPU. This is reported by 'cryptsetup benchmark'

hmmm...
$ /sbin/cryptsetup benchmark
# Tests are approximate using memory only (no storage IO).
PBKDF2-sha1   910222 iterations per second
PBKDF2-sha256 590414 iterations per second
PBKDF2-sha512 399609 iterations per second
PBKDF2-ripemd160  548418 iterations per second
PBKDF2-whirlpool  179060 iterations per second
#  Algorithm | Key |  Encryption |  Decryption
 aes-cbc   128b   474,3 MiB/s  1686,2 MiB/s
 serpent-cbc   128b69,4 MiB/s   235,3 MiB/s
 twofish-cbc   128b   144,5 MiB/s   271,6 MiB/s
 aes-cbc   256b   348,0 MiB/s  1239,4 MiB/s
 serpent-cbc   256b68,8 MiB/s   231,5 MiB/s
 twofish-cbc   256b   146,6 MiB/s   268,9 MiB/s
 aes-xts   256b  1381,3 MiB/s  1384,3 MiB/s
 serpent-xts   256b   238,6 MiB/s   231,1 MiB/s
 twofish-xts   256b   262,9 MiB/s   266,7 MiB/s
 aes-xts   512b  1085,7 MiB/s  1078,9 MiB/s
 serpent-xts   512b   242,1 MiB/s   230,2 MiB/s
 twofish-xts   512b   266,8 MiB/s   265,9 MiB/s

I'm having aes-xts-plain64 with 512 bit key...
that's still 1 GiB/s
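To double-check which cipher an existing mapping actually uses (mapping name and device path hypothetical):

```shell
cryptsetup status data_crypt    # cipher (e.g. aes-xts-plain64) and key size of the active mapping
cryptsetup luksDump /dev/sdb2   # cipher/mode/key bits as stored in the LUKS header
```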


Cheers,
Chris.



Re: How to detect / notify when a raid drive fails?

2015-11-27 Thread Christoph Anton Mitterer
On Fri, 2015-11-27 at 17:16 +0800, Anand Jain wrote:
>   I understand as a user, a full md/lvm set of features are important
>   to begin operations using btrfs and we don't have it yet. I have to
>   blame it on the priority list.
What would be especially nice from the admin side would be something
like /proc/mdstat, which centrally gives information about the health
of your RAID.

It can/should of course be more than just "OK" / "not OK"...
information about which devices are in which state, whether a
rebuild/reconstruction/scrub is going on, etc. pp.
Maybe even details of properties like chunk sizes (as far as these
apply to btrfs).

Having a dedicated monitoring process... well, nice to have, but
something like mdstat is always there, doesn't need special userland
tools, and can easily be used by 3rd-party stuff like Icinga/Nagios
check_raid.
I think the keywords here are human readable + parseable... so maybe
even two files.
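Until a /proc/mdstat equivalent exists, the closest machine-parseable source is `btrfs device stats`; a sketch of a check_raid-style collapse of its counter lines (the "[/dev/sda].write_io_errs 0" format) into an OK/CRITICAL verdict:

```shell
# Reads `btrfs device stats <mnt>` output on stdin, prints "OK" or
# "CRITICAL: <devices that have any nonzero error counter>".
btrfs_stats_state() {
    awk '
        /^\[/ {
            split($1, a, "[][]")                  # a[2] = device path
            if ($2 + 0 > 0 && !(a[2] in bad)) {   # any nonzero counter marks it
                bad[a[2]] = 1; order[n++] = a[2]
            }
        }
        END {
            if (n == 0) { print "OK"; exit }
            line = "CRITICAL:"
            for (i = 0; i < n; i++) line = line " " order[i]
            print line
        }'
}
# usage: btrfs device stats /mnt | btrfs_stats_state
```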


Cheers,
Chris.



slowness when cp respectively send/receiving on top of dm-crypt

2015-11-27 Thread Christoph Anton Mitterer
Hey.

Not sure if that's valuable input for the devs, but here's some vague
real-world report about performance:

I'm just copying (via send/receive) a large filesystem (~7TB) from one
HDD over to another.
The devices are both connected via USB3, and each of the btrfs is on
top of dm-crypt.

It's already obvious that things are slowed down, compared to "normal"
circumstances, but from looking at iotop for a while (and the best disk
IO measuring tool ever: the LEDs on the USB/SATA bridge) it seems that
there are always times when basically no IO happens to disk.

There seems to be a repeating schema like this:
- First, there is some heavy disk IO (200-250 M/s), mostly on btrfs
send and receive processes
- Then there are times when send/receive seem to not do anything, but
either btrfs-transaction (this I see far less, however, and its IO% is
far lower, while that of dmcrypt_write is usually at 99%) or
dmcrypt_write eats up all IO (I mean the percent value shown in iotop),
while the total/actual disk write and read are basically zero during
that.

Kinda feels as if some large buffer gets written first, and then, when
it gets full, dm-crypt starts encrypting it, during which there is no
disk IO (since everything waits for the encryption).

Not sure if this is something that could be optimised, or maybe it's
even a non-issue that happens for example while many small files are
read/written (the data consists of both many small files and many big
files), which may explain why the actual IO sometimes goes up to
>200M/s or at least >150M/s and sometimes caps at around 40-80M/s.


Obviously, since I use dm-crypt and compression on both devices, it may
be a CPU issue, but it's an 8-core machine with an i7-3612QM CPU @
2.10GHz... not the fastest, but not the slowest either... and looking
at top/htop it happens quite often that there is only very little CPU
utilisation, so it doesn't seem as if CPU would be the killing factor
here.
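If the bursts really come from dirty pages piling up faster than dm-crypt can drain them, one knob worth experimenting with is shrinking the writeback thresholds so flushing starts earlier and stays more continuous (this is only a guess at the cause; the values are illustrative):

```shell
sysctl vm.dirty_background_bytes=67108864  # start background writeback at ~64 MiB dirty
sysctl vm.dirty_bytes=268435456            # throttle writers hard at ~256 MiB dirty
```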



HTH,
Chris.



Re: slowness when cp respectively send/receiving on top of dm-crypt

2015-11-27 Thread Christoph Anton Mitterer
Hey.

Send/receiving the master to the backup has finished just before... and
now - not that I wouldn't trust btrfs, the hardware, etc. - I ran a
complete diff --recursive --no-dereference over the snapshots on the
two disks.

The two btrfs are mounted ro (thus no write IO), there is not really
any other IO going on in the system.


I basically see a similar up and down as during the writing before:
This time, only the diff process shows up in iotop.
Sometimes, I get rates of 280-300 MB/s... for several seconds,
sometimes 3-4s... sometimes longer 10-20s,... then it falls down to 30-
40MB/s

At the same time, I look at which files diff is currently comparing...
and these are all large analog image scans[0] of > 800MB per file.
Also, the slowdowns or speedups don't happen when diff moves on to a
new file... they can also occur while it's still comparing the same
file for a while.

I wouldn't assume that these are highly fragmented, since both
filesystems were freshly filled only recently, with not many further
writes since.
And AFAIU, SMR shouldn't kick in here either.
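The fragmentation assumption is easy to verify rather than guess (path hypothetical; a >800 MB file in a handful of extents is effectively sequential):

```shell
filefrag /mnt/backup/scans/*.img   # prints "<file>: N extents found" per file
```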


I tried to have a short look at how the logical CPUs are utilised in
the slow and in the fast phases.
There is no exact pattern: sometimes (but not always) it looks as if
1-2 cores have some 100% utilisation when it's fast... while when it's
slow, all cores are at 30-50%.
But that may also be just coincidence, as I observed the opposite
behaviour a few times...
So don't give too much weight to this.


Why can it sometimes be super fast and then fall back to low speed?

Cheers,
Chris.


[0] yay.. the good old childhood images where they had no digikams...
;)



Re: btrfs: poor performance on deleting many large files

2015-11-27 Thread Christoph Anton Mitterer
On Fri, 2015-11-27 at 03:38 +, Duncan wrote:
> AFAIK, per-subvolume *atime mounts should already be working.
Ah I see. :)

Still, specifically for snapshots that's a bit unhandy, as one
typically doesn't mount each of them... one rather mounts e.g. the
top-level subvol and has a snapshots subdir there...
So perhaps the idea of having snapshots that are per se noatime is
still not too bad.


Cheers,
Chris



cannot move ro-snapshot directly but indirectly works

2015-11-27 Thread Christoph Anton Mitterer
Hey.

Not sure whether this is intended or not, but it feels at least
somewhat strange:

Consider I have a readonly snapshot (the only subvol here):
/btrfs/snapshots/ro-snapshot
now I want to move it to the dir:
/btrfs/snapshots/foo/
i.e.
mv /btrfs/snapshots/ro-snapshot /btrfs/snapshots/foo/
but mv complains about a read-only filesystem.

However I can do something like:
mkdir /btrfs/tmp
mv  /btrfs/snapshots /btrfs/tmp/
mv /btrfs/tmp /btrfs/snapshots
mv /btrfs/snapshots/snapshots /btrfs/snapshots/foo
which would give me just the dir structure I tried to get before.


Long story short... I don't get why one cannot move a ro-snapshot directly, 
when one can move it indirectly in the fs-hierarchy.
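A possible workaround with current btrfs-progs is to drop the ro property just for the move (assuming `btrfs property` is available; this briefly makes the snapshot writable, so nothing should touch it in between):

```shell
btrfs property set -ts /btrfs/snapshots/ro-snapshot ro false
mv /btrfs/snapshots/ro-snapshot /btrfs/snapshots/foo/
btrfs property set -ts /btrfs/snapshots/foo/ro-snapshot ro true
```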


Cheers,
Chris.



Re: slowness when cp respectively send/receiving on top of dm-crypt

2015-11-27 Thread Christoph Anton Mitterer
On Fri, 2015-11-27 at 20:00 +0100, Henk Slager wrote:
> As far as I can guess this is transfers between Seagate Archive 8TB
> SMR drives.
Yes it is... and I thought about SMR being the reason at first, too,
but:
- As far as I understood SMR, it shouldn't kick in when I do what is
mostly streaming data. Okay, I don't know exactly how btrfs writes its
data, but when I send/receive 7GB I'd have expected that a great deal
of it is just sequential writing.

- When these disks move data from their non-shingled areas to the
shingled ones, that - or at least that's my impression - produces some
typical sounds from the mechanical movements, which I didn't hear.

- But most importantly... if the reason were SMR, why would
dmcrypt_write be at basically full CPU exactly when no IO happens?


> I think you know this:
> https://www.mail-archive.com/linux-btrfs%40vger.kernel.org/msg47341.html
> and certainly this:
> https://bugzilla.kernel.org/show_bug.cgi?id=93581
I knew the latter, but as you've mentioned, the USB/SATA bridge
probably saves me from it. Interesting that USB3 is still slow enough
not to get caught ;)

But thanks for the hint nevertheless :)



> > There seems to be a repeating schema like this:
> > - First, there is some heavy disk IO (200-250 M/s), mostly on btrfs
> I think it is MByte/s (and not Mbit/s) right?
Oh yes.. it's whatever iotop shows (not sure whether they use MB or MiB)


> I must say that adding compression (compress-force=zlib mount option)
> makes the whole transferchain tend to not pipeline.
Ah? Well if I'd have known that in advance ^^ (although I just use
compress)...
Didn't marketing tell people that compression may even speed up IO
because the CPUs are so much faster than the disks?

>  Just dm-crypt I
> have not seen it on Core-i7 systems. A test between 2 modern SSDs
> (SATA3 connected ) is likely needed to see if there really is
> tendency
> for hiccups in processing/pipelining. On kernels 3.11 to 4.0 I have
> seen and experienced far from optimal behavior, but with 4.3 it is
> quite OK, although I use large bcache which can mitigate HDD seeks
> quite well.
I remember that in much earlier times, there was something about
dm-crypt using just a single thread or so for IO... forgot the details
though.

> >Not sure if this is something that could be optimised or maybe it's
> > even a non issue that happens for example while many small files
> > are
> > read/written (the data consists of both, many small files as well
> > as
> > many big files), which may explain why sometimes the actual IO goes
> > up
> > to large >200M/s or at least > 150M/s and sometimes it caps at
> > around
> > 40-80M/s
> Indeed typical behavior of SMR drive
Well, it's not that I wanted to complain... I can live with those
speeds... I just thought that there may be some bad interplay between
dm-crypt and btrfs, which could have been shown by these periods in
which nothing seems to happen except dmcrypt_write doing stuff.


Cheers,
Chris



Re: [PATCH v2] btrfs-progs: fsck: Fix a false alert where extent record has wrong metadata flag

2015-11-26 Thread Christoph Anton Mitterer
On Fri, 2015-11-27 at 08:40 +0800, Qu Wenruo wrote:
> But since there is no real error
sure... 

> feel free to keep using it or just re 
> format it with skinny-metadata.
That's just ongoing =)

Thanks for all your efforts in that issue =)


Chris.



Re: btrfs: poor performance on deleting many large files

2015-11-26 Thread Christoph Anton Mitterer
On Thu, 2015-11-26 at 23:29 +, Duncan wrote:
> > but only on meta-data blocks, right?
> Yes.
Okay... so it'll at most get the whole meta-data for a snapshot
separately and not shared anymore...
And when these are chained as in ZFS,.. it probably amplifies... i.e. a
change deep down in the tree changes all the upper elements as well?
Which shouldn't be a too big problem unless I have a lot snapshots or
extremely many files.



> I think it's whole 4 KiB blocks and possibly whole metadata nodes (16
> KiB), copy-on-write, and these would be relatively small changes 
> triggering cow of the entire block/node, aka write
> amplification.  While 
> not too large in themselves, it's the number of them that becomes a 
> problem.
Ah... there you say it already =)
But still it's always only meta-data that is copied, never the data,
right?!


> IIRC relatime updates once a day on access.  If you're doing daily 
> snapshots, updating metadata blocks for all files accessed in the
> last 24 
> hours...
Yes...


Wouldn't it be a way to handle that problem if btrfs allowed one to
create snapshots for which the atime never gets updated, regardless of
any mount option?
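Read-only snapshots might already give part of that today, by the way: a snapshot created with `-r` can't be modified at all afterwards, atimes included. A minimal sketch (paths are made up):

```shell
# A read-only snapshot rejects all modifications, atime updates
# included (paths are examples only):
btrfs subvolume snapshot -r /mnt/data /mnt/snapshots/data-2015-12-16

# any write attempt then fails with EROFS:
touch /mnt/snapshots/data-2015-12-16/somefile
```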

And additionally, allow people to mount subvols with different
noatime/relatime/atime settings (unless that's already working)... that
way, they could enable it for things where they want/need it,... and
disable it where not.
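For the atime family this might actually work already, since those are per-mountpoint VFS flags rather than filesystem-wide options; a sketch, with made-up device and subvolume names:

```shell
# noatime/relatime are per-mountpoint VFS flags, so subvolumes of the
# same filesystem can be mounted with different settings (sketch only):
mount -o subvol=root,noatime  /dev/sda2 /
mount -o subvol=home,relatime /dev/sda2 /home
```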


> In my case, I'm on SSD with their limited write cycles, so while the
> snapshot thing doesn't affect me since my use-case doesn't involve 
> snapshots, the SSD write cycle count thing certainly does, and
> noatime is 
> worth it to me for that alone.
I'm always a bit unsure about that... I used to do it as well, because
of the wear... but is that really necessary?
With relatime, atime updates happen at most once a day... so at worst
you rewrite... what... some 100 MB (at least in the ext234 case)... and
SSDs seem to bear many more write cycles than advertised.
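To put a rough number on it (both figures below are pure assumptions for illustration, not measurements):

```shell
# Worst-case back-of-envelope: assume every touched file dirties one
# whole 16 KiB metadata node once a day, with no batching of
# neighbouring inodes into the same node.
node_size=$((16 * 1024))   # default btrfs metadata node size in bytes
files_per_day=50000        # assumed number of files accessed daily

echo "$(( files_per_day * node_size / 1024 / 1024 )) MiB/day"  # → 781 MiB/day
```

Even in that pessimistic case it stays below a GiB per day, which supports the point that relatime alone is hardly an SSD killer.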


Cheers,
Chris.



kernel call trace during send/receive

2015-11-26 Thread Christoph Anton Mitterer
Hey.

Just got the following during send/receiving a big snapshot from one
btrfs to another fresh one.

Both under kernel 4.2.6, tools 4.3

The send/receive seems to continue however...

Any ideas what that means?

Cheers,
Chris.

Nov 27 01:52:36 heisenberg kernel: [ cut here ]
Nov 27 01:52:36 heisenberg kernel: WARNING: CPU: 7 PID: 18086 at 
/build/linux-CrHvZ_/linux-4.2.6/fs/btrfs/send.c:5794 
btrfs_ioctl_send+0x661/0x1120 [btrfs]()
Nov 27 01:52:36 heisenberg kernel: Modules linked in: ext4 mbcache jbd2 
nls_utf8 nls_cp437 vfat fat uas vhost_net vhost macvtap macvlan xt_CHECKSUM 
iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 
nf_nat xt_tcpudp tun bridge stp llc fuse ccm ebtable_filter ebtables seqiv ecb 
drbg ansi_cprng algif_skcipher md4 algif_hash af_alg binfmt_misc xfrm_user 
xfrm4_tunnel tunnel4 ipcomp xfrm_ipcomp esp4 ah4 cpufreq_userspace 
cpufreq_powersave cpufreq_stats cpufreq_conservative ip6t_REJECT nf_reject_ipv6 
nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_filter ip6_tables xt_policy 
ipt_REJECT nf_reject_ipv4 xt_comment nf_conntrack_ipv4 nf_defrag_ipv4 
xt_multiport xt_conntrack nf_conntrack iptable_filter ip_tables x_tables joydev 
rtsx_pci_ms rtsx_pci_sdmmc mmc_core memstick iTCO_wdt iTCO_vendor_support 
x86_pkg_temp_thermal
Nov 27 01:52:36 heisenberg kernel:  intel_powerclamp intel_rapl iosf_mbi 
coretemp kvm_intel kvm crct10dif_pclmul crc32_pclmul evdev deflate ctr psmouse 
serio_raw twofish_generic pcspkr btusb btrtl btbcm btintel bluetooth crc16 
uvcvideo videobuf2_vmalloc videobuf2_memops videobuf2_core v4l2_common videodev 
media twofish_avx_x86_64 twofish_x86_64_3way twofish_x86_64 twofish_common sg 
arc4 camellia_generic iwldvm mac80211 iwlwifi cfg80211 rtsx_pci rfkill 
camellia_aesni_avx_x86_64 snd_hda_codec_hdmi tpm_tis tpm 8250_fintek 
camellia_x86_64 snd_hda_codec_realtek snd_hda_codec_generic processor battery 
fujitsu_laptop i2c_i801 ac lpc_ich serpent_avx_x86_64 mfd_core snd_hda_intel 
snd_hda_codec snd_hda_core snd_hwdep snd_pcm shpchp snd_timer e1000e snd 
soundcore i915 ptp pps_core video button drm_kms_helper drm thermal_sys mei_me
Nov 27 01:52:36 heisenberg kernel:  i2c_algo_bit mei serpent_sse2_x86_64 xts 
serpent_generic blowfish_generic blowfish_x86_64 blowfish_common 
cast5_avx_x86_64 cast5_generic cast_common des_generic cbc cmac xcbc rmd160 
sha512_ssse3 sha512_generic sha256_ssse3 sha256_generic hmac crypto_null af_key 
xfrm_algo loop parport_pc ppdev lp parport autofs4 dm_crypt dm_mod md_mod btrfs 
xor raid6_pq uhci_hcd usb_storage sd_mod crc32c_intel aesni_intel aes_x86_64 
glue_helper ahci lrw gf128mul ablk_helper libahci cryptd libata ehci_pci 
xhci_pci ehci_hcd scsi_mod xhci_hcd usbcore usb_common
Nov 27 01:52:36 heisenberg kernel: CPU: 7 PID: 18086 Comm: btrfs Not tainted 
4.2.0-1-amd64 #1 Debian 4.2.6-1
Nov 27 01:52:36 heisenberg kernel: Hardware name: FUJITSU LIFEBOOK 
E782/FJNB23E, BIOS Version 1.11 05/24/2012
Nov 27 01:52:36 heisenberg kernel:   a02e6260 
8154e2f6 
Nov 27 01:52:36 heisenberg kernel:  8106e5b1 880235a3c42c 
7ffd3d3796c0 8802f0e5c000
Nov 27 01:52:36 heisenberg kernel:  0004 88010543c500 
a02d2d81 88041e5ebb00
Nov 27 01:52:36 heisenberg kernel: Call Trace:
Nov 27 01:52:36 heisenberg kernel:  [] ? dump_stack+0x40/0x50
Nov 27 01:52:36 heisenberg kernel:  [] ? 
warn_slowpath_common+0x81/0xb0
Nov 27 01:52:36 heisenberg kernel:  [] ? 
btrfs_ioctl_send+0x661/0x1120 [btrfs]
Nov 27 01:52:36 heisenberg kernel:  [] ? 
__alloc_pages_nodemask+0x194/0x9e0
Nov 27 01:52:36 heisenberg kernel:  [] ? 
btrfs_ioctl+0x26c/0x2a10 [btrfs]
Nov 27 01:52:36 heisenberg kernel:  [] ? 
sched_move_task+0xca/0x1d0
Nov 27 01:52:36 heisenberg kernel:  [] ? 
cpumask_next_and+0x2e/0x50
Nov 27 01:52:36 heisenberg kernel:  [] ? 
select_task_rq_fair+0x23f/0x5c0
Nov 27 01:52:36 heisenberg kernel:  [] ? 
enqueue_task_fair+0x387/0x1120
Nov 27 01:52:36 heisenberg kernel:  [] ? 
native_sched_clock+0x24/0x80
Nov 27 01:52:36 heisenberg kernel:  [] ? sched_clock+0x5/0x10
Nov 27 01:52:36 heisenberg kernel:  [] ? 
do_vfs_ioctl+0x2c3/0x4a0
Nov 27 01:52:36 heisenberg kernel:  [] ? _do_fork+0x146/0x3a0
Nov 27 01:52:36 heisenberg kernel:  [] ? SyS_ioctl+0x76/0x90
Nov 27 01:52:36 heisenberg kernel:  [] ? 
system_call_fast_compare_end+0xc/0x6b
Nov 27 01:52:36 heisenberg kernel: ---[ end trace f5fa91e2672eead0 ]---



Re: [PATCH v2] btrfs-progs: fsck: Fix a false alert where extent record has wrong metadata flag

2015-11-26 Thread Christoph Anton Mitterer
Hey.

I can confirm that the new patch fixes the issue on both test
filesystems.

Thanks for working that out. I guess there's no longer a need to keep
those old filesystems now?!

Cheers,
Chris.

On Thu, 2015-11-26 at 15:27 +0100, David Sterba wrote:
> On Wed, Nov 25, 2015 at 02:19:06PM +0800, Qu Wenruo wrote:
> > In process_extent_item(), it gives 'metadata' initial value 0, but
> > for
> > non-skinny-metadata case, metadata extent can't be judged just from
> > key
> > type and it forgot that case.
> > 
> > This causes a lot of false alert in non-skinny-metadata filesystem.
> > 
> > Fix it by set correct metadata value before calling
> > add_extent_rec().
> > 
> > Reported-by: Christoph Anton Mitterer <cales...@scientia.net>
> > Signed-off-by: Qu Wenruo <quwen...@cn.fujitsu.com>
> 
> Patch replaced, thanks. The test image is pushed as well.



Re: btrfs: poor performance on deleting many large files

2015-11-26 Thread Christoph Anton Mitterer
On Thu, 2015-11-26 at 16:52 +, Duncan wrote:
> For people doing snapshotting in particular, atime updates can be a
> big 
> part of the differences between snapshots, so it's particularly
> important 
> to set noatime if you're snapshotting.
What exactly happens when that is left at relatime?

I'd guess that obviously everytime the atime is updated there will be
some CoW, but only on meta-data blocks, right?

Does this then lead to fragmentation problems in the meta-data block
groups?

And how serious are the effects on space that is eaten up... say I have
n snapshots and access all of their files... then I'd probably get n
times the metadata, right? Which would sound quite dramatic...

Or are just parts of the metadata copied with new atimes?


Thanks,
Chris.



Re: [auto-]defrag, nodatacow - general suggestions?(was: btrfs: poor performance on deleting many large files?)

2015-11-25 Thread Christoph Anton Mitterer
Hey.

I've worried before about the topics Mitch has raised.
Some questions.

1) AFAIU, the fragmentation problem exists especially for those files
that see many random writes, especially, but not limited to, big files.
That databases and VMs are affected by this is probably broadly known
by now (well, at least by people on this list).
But I'd guess there are n other cases where such IO patterns can happen
which one simply never notices, while the btrfs continues to degrade.

So is there any general approach towards this?

And what are the actual possible consequences? Is it just that the fs
gets slower (due to the fragmentation), or may I even run into other
issues, to the point that the space is eaten up or the fs becomes
basically unusable?

This is especially important for me, because for some VMs and even DBs
I wouldn't want to use nodatacow, because I want to have the
checksumming. (i.e. those cases where data integrity is much more
important than security)


2) Why does nodatacow imply nodatasum, and can that ever be decoupled?
For me the checksumming is actually the most important part of btrfs
(not that I wouldn't like its other features as well)... so turning it
off is something I really would want to avoid.

Plus it opens questions like: When there are no checksums, how can it
(in the RAID cases) decide which block is the good one in case of
corruptions?


3) When I would actually disable datacow for e.g. a subvolume that
holds VMs or DBs... what are all the implications?
Obviously no checksumming, but what happens if I snapshot such a
subvolume or if I send/receive it?
I'd expect that then some kind of CoW needs to take place or does that
simply not work?
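For reference, nodatacow doesn't have to be a mount-wide or subvolume-wide decision; it can also be set per directory with chattr, though it only takes effect on files that are still empty. A sketch (paths made up):

```shell
# NOCOW via chattr: new files created in the directory inherit the
# attribute; it has no effect on files that already contain data.
mkdir /mnt/vms
chattr +C /mnt/vms
qemu-img create -f raw /mnt/vms/test.img 10G  # created with NOCOW
lsattr /mnt/vms/test.img                      # should show the 'C' flag
```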


4) Duncan mentioned that defrag (and I guess that's also for auto-
defrag) isn't ref-link aware...
Isn't that somehow a complete showstopper?

As soon as one uses snapshot, and would defrag or auto defrag any of
them, space usage would just explode, perhaps to the extent of ENOSPC,
and rendering the fs effectively useless.

That sounds to me like, either I can't use ref-links, which are crucial
not only to snapshots but every file I copy with cp --reflink auto ...
or I can't defrag... which however will sooner or later cause quite
some fragmentation issues on btrfs?


5) Especially keeping (4) in mind but also the other comments in from
Duncan and Austin...
Is auto-defrag now recommended to be generally used?
Are both auto-defrag and defrag considered stable to be used? Or are
there other implications, like when I use compression?


6) Does defragmentation work with compression? Or is it just filefrag
which can't cope with it?

Any other combinations or things with the typical btrfs technologies
(cow/nocow, compression, snapshots, subvols, defrag, balance) that one
can do but which lead to unexpected problems? (I, for example, wouldn't
have expected that defragmentation isn't ref-link aware... still kinda
shocked ;) )

For example, when I do a balance and change the compression, and I have
multiple snaphots or files within one subvol that share their blocks...
would that also lead to copies being made and the space growing
possibly dramatically?


7) How does free-space defragmentation happen (or is there even such a
thing)?
For example, when I have my big qemu images, *not* using nodatacow, and
I copy the image e.g. with qemu-img old.img new.img ... and delete the
old then.
Then I'd expect that the new.img is more or less not fragmented,... but
will my free space (from the removed old.img) still be completely
messed up sooner or later driving me into problems?


8) why does a balance not also defragment? Since everything is anyway
copied... why not defragmenting it?
I somehow would have hoped that a balance cleans up all kinds of
things,... like free space issues and also fragmentation.


Given all these issues,... fragmentation, situations in which space may
grow dramatically where the end-user/admin may not necessarily expect
it (e.g. the defrag or the balance+compression case?)... btrfs seem to
require much more in-depth knowledge and especially care (that even
depends on the type of data) on the end-user/admin side than the
traditional filesystems.
Are there, for example, any general recommendations on what to do
regularly to keep the fs in a clean and proper shape (and I don't count
"start with a fresh one and copy the data over" as a valid way)?
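For what it's worth, the closest thing to a regular-maintenance recipe I could imagine would be something like the following; the intervals and thresholds are pure guesses on my side, not official recommendations:

```shell
# Possible periodic maintenance, e.g. from cron (thresholds are
# guesses, not official recommendations):

# verify checksums of all data and metadata, repairing from a good
# copy where redundancy exists:
btrfs scrub start -B /mnt

# rewrite only chunks that are at most half full, to reclaim
# unallocated space without copying everything:
btrfs balance start -dusage=50 -musage=50 /mnt
```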


Thanks,
Chris.




Re: shall distros run btrfsck on boot?

2015-11-24 Thread Christoph Anton Mitterer
On Tue, 2015-11-24 at 22:33 +, Hugo Mills wrote:
> whereas a read-only mount of a journalling FS _must_ modify the disk
> data after an unclean shutdown, in order to be useful (because the FS
> isn't consistent without the journal replay).
I've always considered that rather a bug... or at least very
annoying handling in ext*.
If I specify "read-only" then nothing should ever be written.
If that's not possible because of an unclean shutdown and a journal
that needs to be replayed, the mount should (without any further
special option) rather fail than mount it pseudo-read-only.
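ext3/ext4 do offer a way to get that strict behaviour explicitly, if I'm not mistaken:

```shell
# 'ro' alone may still replay the journal on ext3/ext4; adding
# 'noload' skips journal loading entirely -- at the price of possibly
# seeing an inconsistent filesystem after an unclean shutdown.
mount -o ro,noload /dev/sdb1 /mnt/rescue
```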

Cheers,
Chris



Re: subvols and parents - how?

2015-11-24 Thread Christoph Anton Mitterer
On Tue, 2015-11-24 at 21:55 +, Hugo Mills wrote:
>    In practice, new content is checked by a number of people when
> it's
> put in, so the case of someone putting random poorly-thought-out crap
> in the wiki isn't particularly likely to happen.
Well... it may work in 99% of cases... but something could slip
through, which isn't as easily the case with manpages, which also tend
to be less messy than the huge pile of wiki pages where similar/related
things are described on different pages.

Imagine a case where an inexperienced user updates the wiki saying
that --repair should be used; he may not even be doing it in bad faith,
perhaps he had success with it and now writes a recipe.
It may take a while until someone of the more experienced guys notices
that and corrects it.
But if I, in the meantime, had some fs corruptions... I may already
experience severe problems by following that suggestion... (and while I
do have many backups of all my data, others may not, and if their
life's data is concerned, they'd be screwed).

So even if it takes you just a few hours to correct such rubbish, you
know that Murphy's law may still hit n people during that time ;-)


> Please feel free to add the things you'd like to see. As I said
> above, we do check the docs on the wiki as they're changed, so if
> you're wrong on some details, it won't be a major issue for very
> long. If you want to discuss details before you write something,
> start
> a conversation -- either on here, or on IRC (or even on the Talk
> pages
> of the wiki).
Well, I can put together a list of things which I think should be part
of some more general documentation (i.e. less documentation about
options of the tools)... questions a complete newcomer to btrfs may
have who, however, needs more than "just a filesystem".


>    Note that the "parent" of send -p and of snapshots is not the same
> relationship as the "parent" (containing subvol) of the tree
> structure. This is an awkward nomenclature problem, and I'm not sure
> how to fix it.
Yeah, that was clear... :-)
Maybe call the "parent" from send -p "base" or something like that...
IMHO that would fit better, as the parent there is more like a
"foundation".

Anyway, it's still not as bad as the usage of "RAID1" ;-)


> because
> you can't rename a subvol across another subvol boundary.
That's not quite clear to me... I had subvols like that:
/top/root/below-root
/top/below-top
and was able to move that to:
/top/root/below-top
/top/below-root

i.e. not just changing names but swapping as in:
mv /top/root/below-top /top/tmp
mv /top/below-root /top/root/below-root
mv /top/tmp /top/below-top

with top, root, below-top and below-root all being the same subvols


Thanks a lot for your explanations :)

Chris.



btrfs-mount improvemt suggestions

2015-11-24 Thread Christoph Anton Mitterer
Hey.

I thought that rather than just hanging around here and complaining all
the time about documentation, I might help to improve it a bit.

The btrfs-mount manpage could be a good start, and I'd propose a more
structured format, as I've done for the first few options:

*alloc_start='bytes'*::
(default: *1M*) +
Sets the start of block allocations on each of the filesystem’s devices
to happen above 'bytes' bytes (optionally followed by one of the case-
insensitive suffixes *K* (for Ki), *M* (for Mi) and *G* (for Gi)). +
+
Mainly used for debugging purposes.

*autodefrag*::
*noautodefrag*::
(default: *noautodefrag*; since: 3.0) +
Control whether auto-defragmentation is enabled (with *autodefrag*) or
disabled (with *noautodefrag*). +
+
Auto-defragmentation detects small random writes into files and queues
them up for defragmentation. Works best for small files and is not well
suited for large database or virtual machine image workloads.

*check_int*::
*check_int_data*::
*check_int_print_mask='bitmask'*::
(default: unset; since: 3.0) +
Control whether integrity checking module (which requires the kernel
option *BTRFS_FS_CHECK_INTEGRITY* to be enabled) is enabled (with
*check_int*, *check_int_data* or *check_int_print_mask*) or disabled
(when unset) as well as its operation mode. +
+
With *check_int* the integrity checking module examines all block write
requests in order to ensure on-disk consistency, at a large memory and
CPU cost. +
With *check_int_data*, which implies the option *check_int*, it further
includes extent data in the integrity checks. +
With *check_int_print_mask*, its operation mode can be controlled by
a bitmask 'bitmask' of *BTRFSIC_PRINT_MASK_** values as defined in
`fs/btrfs/check-integrity.c`. +
+
See the comments at the top of `fs/btrfs/check-integrity.c` for more
information.

*commit='seconds'*::
(default: *30*; since: 3.12) +
Set periodic commit interval to 'seconds' seconds. +
+
This is the interval at which data is synchronised to the block device.
Higher values may improve performance but at the expense of losing data
from a longer period in case of system crashes, et cetera. +
A warning is given for values of 'seconds' greater than *300*.

*compress[='type']*::
*compress-force[='type']*::
(default: unset) +
Control whether data compression is enabled (with *compress* or
*compress-force*) or disabled (when unset). +
+
With *compress* only data that compresses well is going to be
compressed, while with *compress-force* data is compressed whether it
compresses well or not. +
The compression type can be set via 'type', with valid values being: +
*no* (disables compression, useful when re-mounting), *zlib* (the
default if no 'type' is set) and *lzo*.
+
Enabling compression implies the options *datacow* and *datasum*.
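Each section could perhaps close with a short usage line, e.g. for the options above (values purely illustrative):

```shell
# Example combining several of the documented options; the values
# are illustrative only:
mount -o autodefrag,compress=lzo,commit=120 /dev/sda2 /mnt
```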


That would also include these changes:
commit 88a0ba7065e09497806ffc2a493ab72d0940e1af
Author: Christoph Anton Mitterer <m...@christoph.anton.mitterer.name>
Date:   Wed Nov 25 02:51:25 2015 +0100

btrfs-progs: minor documentation improvements

Overhauled the formatting of symbols:
- Options, terminal-values or parts thereof are marked with *…*.
- Non-terminal-values are marked with '…'.
- Commands, pathnames are marked with `…`.
- Added missing marks and manpage references.
- Used the correct spelling of option names (lower-case).

Signed-off-by: Christoph Anton Mitterer <m...@christoph.anton.mitterer.name>

commit 830f71df85232e12c3795bc5c0335c1c1150c2f4
Author: Christoph Anton Mitterer <m...@christoph.anton.mitterer.name>
Date:   Wed Nov 25 02:09:10 2015 +0100

btrfs-progs: minor documentation improvements

Swapped the order of the default value and the availability.
The former seems much more important for daily use, while in 10 years
no one will care about the version in which something was introduced,
as everyone will have far newer versions.

    Signed-off-by: Christoph Anton Mitterer <m...@christoph.anton.mitterer.name>

commit a1f913c9dd678fba10d134a651f67d01e8c8ae38
Author: Christoph Anton Mitterer <m...@christoph.anton.mitterer.name>
Date:   Wed Nov 25 02:02:17 2015 +0100

btrfs-progs: minor documentation improvements

- Moved the documentation of all default values to the top of each option’s
  section.
- Added missing default values.
- Added missing line breaks.
    
Signed-off-by: Christoph Anton Mitterer <m...@christoph.anton.mitterer.name>


If the responsible maintainer likes that style, I would continue to
refurbish the remainder of the manpage in 

Re: bad extent [5993525264384, 5993525280768), type mismatch with chunk

2015-11-24 Thread Christoph Anton Mitterer
On Wed, 2015-11-25 at 08:59 +0800, Qu Wenruo wrote:
> Did you use the complied btrfsck? Or use the system btrfsck by
> mistake?
I'm pretty sure, because I already did the whole procedure twice, but
let me repeat and record it here just to be 100% sure:

$ make clean
Cleaning
$ md5sum cmds-check.c 
a7e7d871c3b666df6b56c724dbfd1d86  cmds-check.c
$ export CFLAGS="-g -O0 -Wall -D_FORTIFY_SOURCE=2"
$ ./configure --disable-convert --disable-documentation
[snip]
$ make
# ./btrfs check /dev/mapper/data-b
Checking filesystem on /dev/mapper/data-b
UUID: 250ddae1-7b37-4b22-89e9-4dc5886c810f
checking extents
[getting a cup of tea]
.. and voila it works...

which is kinda weird... I still have the previous run in the bash
history... and I *did* invoke ./btrfs and not btrfs.
Also I just haven't done any further patching... so it cannot be that
the patch wasn't applied before.

WTF?! Apparently I suffer from Gremlins :-/

*a little while later*

And back they are...(the errors)... o.O
This time I checked both of my devices that showed the symptoms
concurrently... data-b, as above, showed no errors.
data-old-a, came with the same errors as before.

Is there anything non-deterministic involved?


Cheers,
Chris.




Re: bad extent [5993525264384, 5993525280768), type mismatch with chunk

2015-11-24 Thread Christoph Anton Mitterer
Hey again.

So it seems that data-b is always fine (well at least three times in a
row) and data-old-a always gives errors.

including e.g:
bad extent [3067663679488, 3067663695872), type mismatch with chunk
bad extent [3067663876096, 3067663892480), type mismatch with chunk
bad extent [3067663892480, 3067663908864), type mismatch with chunk
bad extent [3067663908864, 3067663925248), type mismatch with chunk
bad extent [3067669348352, 3067669364736), type mismatch with chunk
bad extent [3067669430272, 3067669446656), type mismatch with chunk
bad extent [3067669659648, 3067669676032), type mismatch with chunk
bad extent [3067669790720, 3067669807104), type mismatch with chunk
bad extent [3067669807104, 3067669823488), type mismatch with chunk
bad extent [3067669823488, 3067669839872), type mismatch with chunk
bad extent [3067669872640, 3067669889024), type mismatch with chunk
bad extent [3067669921792, 3067669938176), type mismatch with chunk
bad extent [3067671805952, 3067671822336), type mismatch with chunk

I've started debugging (everything as before) with:
(gdb) break cmds-check.c:4387
Breakpoint 1 at 0x42cf2b: file cmds-check.c, line 4387.
(gdb) break cmds-check.c:4394
Breakpoint 2 at 0x42cf57: file cmds-check.c, line 4394.
(gdb) break cmds-check.c:4411
Breakpoint 3 at 0x42cfa6: file cmds-check.c, line 4411.
(gdb) break cmds-check.c:4421
Breakpoint 4 at 0x42d000: file cmds-check.c, line 4421.

Hit a:
Breakpoint 1, check_extent_type (rec=0x1a44130) at cmds-check.c:4387
4387    rec->wrong_chunk_type = 1;
(gdb) bt
#0  check_extent_type (rec=0x1a44130) at cmds-check.c:4387
#1  0x0042d6a5 in add_extent_rec (extent_cache=0x7fffdf30, 
parent_key=0x0, parent_gen=0, start=1097665216512, nr=16384, 
extent_item_refs=1, is_root=0, inc_ref=0, set_checked=0, 
metadata=0, extent_rec=1, max_size=16384) at cmds-check.c:4576
#2  0x0042ecc9 in process_extent_item (root=0x919d20, 
extent_cache=0x7fffdf30, eb=0x1a0edb0, slot=95) at cmds-check.c:5142
#3  0x00430aea in run_next_block (root=0x919d20, bits=0x91e220, 
bits_nr=1024, last=0x7fffdb78, pending=0x7fffdf10, seen=0x7fffdf20, 
reada=0x7fffdf00, 
nodes=0x7fffdef0, extent_cache=0x7fffdf30, 
chunk_cache=0x7fffdf90, dev_cache=0x7fffdfa0, 
block_group_cache=0x7fffdf70, dev_extent_cache=0x7fffdf40, ri=0x6cef30)
at cmds-check.c:5960
#4  0x004356c4 in deal_root_from_list (list=0x7fffdc00, 
root=0x919d20, bits=0x91e220, bits_nr=1024, pending=0x7fffdf10, 
seen=0x7fffdf20, reada=0x7fffdf00, 
nodes=0x7fffdef0, extent_cache=0x7fffdf30, 
chunk_cache=0x7fffdf90, dev_cache=0x7fffdfa0, 
block_group_cache=0x7fffdf70, dev_extent_cache=0x7fffdf40)
at cmds-check.c:8014
#5  0x00435d91 in check_chunks_and_extents (root=0x919d20) at 
cmds-check.c:8181
#6  0x00438e2b in cmd_check (argc=1, argv=0x7fffe220) at 
cmds-check.c:9627
#7  0x00409d49 in main (argc=2, argv=0x7fffe220) at btrfs.c:252
(gdb) continue
Continuing.

Breakpoint 1, check_extent_type (rec=0x1a44130) at cmds-check.c:4387
4387    rec->wrong_chunk_type = 1;
(gdb) bt
#0  check_extent_type (rec=0x1a44130) at cmds-check.c:4387
#1  0x0042d856 in add_tree_backref (extent_cache=0x7fffdf30, 
bytenr=1097665216512, parent=1314162819072, root=0, found_ref=0) at 
cmds-check.c:4624
#2  0x0042ede2 in process_extent_item (root=0x919d20, 
extent_cache=0x7fffdf30, eb=0x1a0edb0, slot=95) at cmds-check.c:5161
#3  0x00430aea in run_next_block (root=0x919d20, bits=0x91e220, 
bits_nr=1024, last=0x7fffdb78, pending=0x7fffdf10, seen=0x7fffdf20, 
reada=0x7fffdf00, 
nodes=0x7fffdef0, extent_cache=0x7fffdf30, 
chunk_cache=0x7fffdf90, dev_cache=0x7fffdfa0, 
block_group_cache=0x7fffdf70, dev_extent_cache=0x7fffdf40, ri=0x6cef30)
at cmds-check.c:5960
#4  0x004356c4 in deal_root_from_list (list=0x7fffdc00, 
root=0x919d20, bits=0x91e220, bits_nr=1024, pending=0x7fffdf10, 
seen=0x7fffdf20, reada=0x7fffdf00, 
nodes=0x7fffdef0, extent_cache=0x7fffdf30, 
chunk_cache=0x7fffdf90, dev_cache=0x7fffdfa0, 
block_group_cache=0x7fffdf70, dev_extent_cache=0x7fffdf40)
at cmds-check.c:8014
#5  0x00435d91 in check_chunks_and_extents (root=0x919d20) at 
cmds-check.c:8181
#6  0x00438e2b in cmd_check (argc=1, argv=0x7fffe220) at 
cmds-check.c:9627
#7  0x00409d49 in main (argc=2, argv=0x7fffe220) at btrfs.c:252

You've mentioned add_extent_rec() before, but that doesn't seem to
contain bytenr so I cannot break on it.

I tried it with add_tree_backref instead, maybe that's already helpful
for you until you give me further instructions on what to debug:
Breakpoint 5 at 0x42d84a: file cmds-check.c, line 4624.
(gdb) continue 
Continuing.

Breakpoint 5, add_tree_backref (extent_cache=0x7fffdf30, 

Re: subvols and parents - how?

2015-11-24 Thread Christoph Anton Mitterer
On Tue, 2015-11-24 at 23:30 +, Hugo Mills wrote:
>    Yes, that makes sense.
Feel free to shamelessly use my idea (as well as the one to call btrfs'
RAID1 replica2 or something else)
:-O


>    With a recent mv
root@heisenberg:/mnt# mv --version
mv (GNU coreutils) 8.23

=> not recent enough...


> but I think you'll find that the UUID of the subvols changes. (At
> least, I hope it does. If it doesn't, then my mental model of what
> the FS is doing is *really* screwed up).
Well... see below:

root@heisenberg:~# truncate  -s 2G image
root@heisenberg:~# losetup -f image 
root@heisenberg:~# mkfs.btrfs /dev/loop0 
btrfs-progs v4.3
See http://btrfs.wiki.kernel.org for more information.

Performing full device TRIM (2.00GiB) ...
Label:  (null)
UUID:   10e1a55c-448a-4f37-ae5c-6a7801a7f202
Node size:  16384
Sector size:4096
Filesystem size:2.00GiB
Block group profiles:
  Data: single8.00MiB
  Metadata: DUP 110.38MiB
  System:   DUP  12.00MiB
SSD detected:   no
Incompat features:  extref, skinny-metadata
Number of devices:  1
Devices:
   IDSIZE  PATH
1 2.00GiB  /dev/loop0

root@heisenberg:~# mount /dev/loop0 /mnt/
root@heisenberg:/mnt# btrfs subvolume create root
Create subvolume './root'
root@heisenberg:/mnt# btrfs subvolume create below-top
Create subvolume './below-top'
root@heisenberg:/mnt# cd root/
root@heisenberg:/mnt/root# btrfs subvolume create below-root
Create subvolume './below-root'
root@heisenberg:/mnt# btrfs subvolume list /mnt/ -pacguqt
ID   gen  cgen  parent  top level  parent_uuid  uuid                                  path
--   ---  ----  ------  ---------  -----------  ----                                  ----
257  9    7     5       5          -            8fbf521e-77f9-0d49-9891-87767f98c655  root
258  8    8     5       5          -            b49131e9-4207-aa42-8195-c50de5f06136  below-top
259  9    9     257     257        -            20c042be-ead8-204a-a684-94c1a770e739  /root/below-root

root@heisenberg:/mnt# mv root/below-root/ tmp
root@heisenberg:/mnt# mv below-top/ root/
root@heisenberg:/mnt# mv tmp/ below-root
root@heisenberg:/mnt# btrfs subvolume list /mnt/ -pacguqt
ID   gen  cgen  parent  top level  parent_uuid  uuid                                  path
--   ---  ----  ------  ---------  -----------  ----                                  ----
257  9    7     5       5          -            8fbf521e-77f9-0d49-9891-87767f98c655  root
258  8    8     257     257        -            b49131e9-4207-aa42-8195-c50de5f06136  /root/below-top
259  9    9     5       5          -            20c042be-ead8-204a-a684-94c1a770e739  below-root
root@heisenberg:/mnt# 


So the UUIDs seem to stay the same (or are these other UUIDs?)
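One way to double-check which UUIDs those are would be `btrfs subvolume show`, which prints the subvolume's own UUID and its parent UUID separately (sketch against the test fs above):

```shell
# Inspect the per-subvolume UUIDs directly; the 'UUID' line is the
# subvolume's own identity and should survive a move within the fs:
btrfs subvolume show /mnt/root/below-top
btrfs subvolume show /mnt/below-root
```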

Hope I haven't ruined your day now ;-)


Cheers,
Chris.



Re: subvols and parents - how?

2015-11-24 Thread Christoph Anton Mitterer
On Tue, 2015-11-24 at 08:29 +, Duncan wrote:
> OK, found it on the wiki.  It wasn't under use-cases, where I
> initially 
> thought to look, but under sysadmin guide.  Specifically, see section
> 4.2, managing snapshots, but I'd suggest reading the entire
> subvolumes 
> discussion, section 4, or even most/all of the page.
> 
> https://btrfs.wiki.kernel.org/index.php/SysadminGuide
Well, I've read that, but it's pretty vague and especially doesn't
mention any of the filesystem-internal implications (if there are
any)... like I wondered before, whether this has effects on ref-links
not being used when e.g. sending/receiving... or on future planned
extensions like recursive snapshots.


> 
> Suppose you only want to rollback /, because some update screwed you
> up, 
> but not /home, which is fine.  If /home is a nested subvolume, then 
> you're now mounting the nested home subvolume from some other nesting
> tree entirely,
That's a bit unclear to me... I thought when I make a snapshot, any
nested subvols wouldn't be snapshotted and thus would be empty dirs.
So I'd rather expect that I would simply have no /home (if I didn't
move it to the rolled-back subvol manually).


> 5
> > 
> +-roots (dir not subvol, note the s, rootS, plural)
> > +-root (subvol, mountpoint /)
> > > +-boot/ (dir)
> > > +-root/ (dir)
> > > +-lib/  (dir)
> > > +-home/ (empty dir as mountpoint)
> > +-root-snapshot-2015.0301 (dated snapshot of root)
> > +-root-snapshot-2015.0601 (dated snapshot of root)
> > +-root-snapshot-2015.0901 (dated snapshot of root)
> +-homes (dir not subvol)
> > +-home (subvol, mountpoint /home)
> > +-home-snapshot-2015.0301 (dated snapshot of home)
> ...
That's more what I've had in mind...
Actually something like this:
5
|
+-root       (=subvol)
| +-boot
| +-home     (subvol /home being mounted here on)
| +-lib
+-home       (subvol, the current version)
+-snapshots  (=dir)
  +-root
  | +-2015-01-14 (subvol, snapshot)
  | +-2015-09-30 (subvol, snapshot)
  +-home
    +-2015-06-04 (subvol, snapshot)
    +-2015-09-01 (subvol, snapshot)


And it once more points to the problem of the wiki... anyone can write
(I think even I could) and it's totally unclear at first glance how
"[non-]official" and how outdated something may be.
Apart from that, many important questions (from the PoV of a more
advanced admin who doesn't just do mkfs.btrfs and then never thinks
about it again) remain unanswered :-(


> Meanwhile, if the intention for a subvolume is simply to exclude that
> subtree from snapshotting of the parent, as might be the case for
> example 
that is in fact also a use case... so in practice I'll probably have a
mix of (a) and (b).


> if you have a VMs subvol, with the VM image files set NOCOW to avoid 
> fragmentation, since snapshotting nocow files forces cow1 (a cow at
> the 
> first write of that block, before returning to nocow, because a
> snapshot 
> locks the existing extents in place for the snapshot, so initial
> writes 
> to a block after a snapshot /can't be nocow or it'd change the
> snapshot 
> too...)
Ah, that's good to know how that works (or better said, that it works at
all)... I've already wondered myself several times what happens when I
snapshot NOCOW files, ... something that is, I guess, also missing from
the (missing ;-) ) btrfs end-user overview and detail documentation.


> OTOH, if there's a chance you might want to mount that subvolume in a
> roll-back scenario, under the snapshot you're rolling back to, then
> it 
> makes sense to put it directly under ID 5 again, and mount it in any
> case.
Yes.


> Then there's the security angle to consider.  With the (basically, 
> possibly modified as I suggested) flat layout, mounting something
> doesn't 
> automatically give people in-tree access to nested subvolumes
> (subject to 
> normal file permissions, of course), like nested layout does.  And
> with 
> (possibly modified) flat layout, the whole subvolume tree doesn't
> need to 
> be mounted all the time either, only when you're actually working
> with 
> subvolumes.
Uhm, I don't get the big security advantage here... whether nested or
manually mounted to a subdir,... if the permissions are insecure I'll
have a problem... if they're secure, then not.

Of course, if I use insecure permissions and don't mount the subvols,
then I'd be safe in a flat setup, but not in a nested setup... (where
the subvol is "auto-mounted")...

But that seems pretty awkward.



Mhh I think my main question turns out to be whether the different
layouts would have any technical (i.e. not administrative) effects...
like the aforementioned stuff of recursive snapshots (should such thing
ever come to life).
But at least from the userspace tools it seems that I can move subvols
arbitrarily and they adapt their parent IDs accordingly...

So I guess whatever setup one uses nested/flat/mixed... doesn't rule
out any functionalities for the future?!

thx,
Chris.



Re: btrfs send reproducibly fails for a specific subvolume after sending 15 GiB, scrub reports no errors

2015-11-24 Thread Christoph Anton Mitterer
On Tue, 2015-11-24 at 15:58 -0500, Austin S Hemmelgarn wrote:
> I had tried using send/receive once with -p, but had numerous issues.
 
> The incrementals I've been doing have used -c instead, and I hadn't had 
> any issues with data loss with that.  The issue outlined here was only a 
> small part of why I stopped using it for backups.  The main reason was 
> to provide better consistency between my local copies and what I upload 
> to S3/Dropbox, meaning I only have to test one back up image per 
> filesystem backed-up, instead of two.

Okay maybe I just don't understand how to use send/receive correctly...


What I have is about the following (simplified):

master-fs:
5
|
+--data (subvol, my precious data)
|
+--snapshots
   |
   +--2015-11-01 (subvol, ro-snapshot of /data)

So 2015-11-01 is basically the first snapshot ever made.

Now I want to have it on:
backup-fs
+--2015-11-01 (subvol, ro-snapshot of /data)


So far I did
btrfs send /master-fs/snapshots/2015-11-01 | btrfs receive /backup-fs




Then time goes by and I get new content in the data subvol, so what
I'd like to have then is a new snapshot on the master-fs:
5
|
+--data (subvol, more of my precious data)
|
+--snapshots
   |
   +--2015-11-01 (subvol, ro-snapshot of /data)
   +--2015-11-20 (subvol, ro-snapshot of /data)

And this should go incrementally on backup-fs:
backup-fs
+--2015-11-01 (subvol, ro-snapshot of /data)
+--2015-11-20 (subvol, ro-snapshot of /data)

So far I used something like:
btrfs send -p /master-fs/snapshots/2015-11-01 \
/master-fs/snapshots/2015-11-20 | btrfs receive /backup-fs

And obviously I want it to share all the ref-links and stuff...


So in other words, what's the difference between -p and -c? :D


Thx,
Chris.




Re: btrfs send reproducibly fails for a specific subvolume after sending 15 GiB, scrub reports no errors

2015-11-24 Thread Christoph Anton Mitterer
On Tue, 2015-11-24 at 15:44 -0500, Austin S Hemmelgarn wrote:
> I would say it's currently usable for one-shot stuff, but probably
> not 
> reliably useable for automated things without some kind of 
> administrative oversight.  In theory, it wouldn't be hard to write a 
> script to automate fixing this particular issue when send encounters
> it, 
> but that has it's own issues (you have to either toggle the snapshot 
> writable temporarily, or modify the source and re-snapshot).

Well AFAIU, *this* very issue is at least something that bails out
loudly with an error... I rather worry about cases where send/receive
just exits without any error (status or message) and still didn't
manage to correctly copy everything.

The case that I had was that I incrementally send/received (with -p)
backups to another disk.
At some point in time I removed one of the older snapshots on that
backup disk... and then had fs errors... as if the data had been
gone... :(



Re: btrfs send reproducibly fails for a specific subvolume after sending 15 GiB, scrub reports no errors

2015-11-24 Thread Christoph Anton Mitterer
On Tue, 2015-11-24 at 21:27 +, Hugo Mills wrote:
>    -p only sends the file metadata for the changes from the reference
> snapshot to the sent snapshot. -c sends all the file metadata, but
> will preserve the reflinks between the sent snapshot and the (one or
> more) reference snapshots.
Let me see if I got that right:
- -p sends just the differences, for both data and meta-data.
- Plus, -c sends *all* the metadata, you said... but will it send all
data (and simply ignore what's already there) or will it also just send
the differences in terms of data?
- So that means effectively I'll end up with the same... right?

In other words, -p should be a tiny bit faster... but not extremely so
(unless I have tons[0] of metadata changes).

>  You can only use one -p (because there's
> only one difference you can compute at any one time), but you can use
> as many -c as you like (because you can share extents with any number
> of subvols).
So that means, if it worked correctly, -p would be the right choice
for me, as I never have multiple snapshots that I need to draw my
reflinks from, right?


>    In implementation terms, on the receiver, -p takes a (writable)
> snapshot of the reference subvol, and modifies it according to the
> stream data. -c makes a new empty subvol, and populates it from
> scratch, using the reflink ioctl to use data which is known to exist
> in the reference subvols.
I see...
I think the manpage needs more information like this... :)
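Putting that explanation into command form, a dry-run sketch (all paths hypothetical; the commands are only printed, since actually running them needs real btrfs filesystems):

```shell
# Dry-run sketch of the two incremental-send variants discussed above.
# Paths are hypothetical; the command line is printed, not executed.
send_cmd() {
    # $1 = -p or -c, $2 = reference snapshot, $3 = new snapshot, $4 = dest dir
    printf 'btrfs send %s %s %s | btrfs receive %s\n' "$1" "$2" "$3" "$4"
}

# -p: receiver snapshots the reference subvol and applies only the diff.
send_cmd -p /master-fs/snapshots/2015-11-01 /master-fs/snapshots/2015-11-20 /backup-fs

# -c: receiver builds a fresh subvol from scratch, reflinking extents
# already present in the clone source(s); several -c options may be given.
send_cmd -c /master-fs/snapshots/2015-11-01 /master-fs/snapshots/2015-11-20 /backup-fs
```

Per the explanation above, -p modifies a snapshot of the reference on the receiver, while -c populates a new subvolume and only uses the references as reflink sources.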


Thanks, for you help :-)
Chris.


[0] People may argue that one has XXbytes of metadata, and tons are a
measurement of weight... but when I recently carried 4 of the 8TB HDDs
in my back... I came to the conclusion that data correlates to gram ;-)



Re: shall distros run btrfsck on boot?

2015-11-24 Thread Christoph Anton Mitterer
On Tue, 2015-11-24 at 11:14 -0600, Eric Sandeen wrote:
> In a nutshell, though, I think a filesystem repair should be an
> admin-initiated
> action, not something that surprises you on a boot, at least for a
> journaling
> filesystem which is designed to maintain its integrity even in the
> face of
> a power loss or crash.

Well I wouldn't agree here... I maintain some >2 PiB of storage for an
LHC Tier-2,... right now everything with ext4.
During normal operation we can of course not run any fsck, but every
now and then, when we reboot, it happens automatically... and it
regularly shows at least some (apparently non-serious) glitches.


IMHO, either the kernel driver itself already checks "everything", in
which case we wouldn't need a dedicated check tool.
Or it does not, but in that case there will be people who want to have
those in-depth checks run regularly (even if it's just every half a
year).
I'd rather wait half an hour at boot and find such errors, than have
them silently pile up until I really run into trouble.

That being said, of course it should be configurable for the admin...
and it is, via fstab.
So apart from that, given the expectation that btrfsck should become as
rock-solid as e.g. e2fsck at some point in the future, I wouldn't see
why people shouldn't have the necessary facilities to have it auto-run.


Cheers,
Chris.



Re: bad extent [5993525264384, 5993525280768), type mismatch with chunk

2015-11-24 Thread Christoph Anton Mitterer
On Tue, 2015-11-24 at 13:35 +0800, Qu Wenruo wrote:
> Hopes you didn't wait too long.
No worries, didn't hold my breath ;)


> The fixing patch is CCed to you, or you can get it from patchwork:
> https://patchwork.kernel.org/patch/7687611/
Unfortunately that doesn't make the error messages go away.
:(

Shall I start debugging again?

Cheers,
Chris.



Re: btrfs send reproducibly fails for a specific subvolume after sending 15 GiB, scrub reports no errors

2015-11-24 Thread Christoph Anton Mitterer
Hey.

All that sounds pretty serious, doesn't it? So in other words, AFAIU,
send/receive cannot really be reliably used.

I did so far for making incremental backups, but I've also experienced
some problems (though not what this is about here).


Cheers,
Chris.



Re: bad extent [5993525264384, 5993525280768), type mismatch with chunk

2015-11-23 Thread Christoph Anton Mitterer
On Mon, 2015-11-23 at 09:10 +0800, Qu Wenruo wrote:
> Also, you won't want compiler to do extra optimization
I did the following:
$ export CFLAGS="-g -O0 -Wall -D_FORTIFY_SOURCE=2"
$ ./configure --disable-convert --disable-documentation

So if you want me to get rid of _FORTIFY_SOURCE, please tell.


> 
> After make, you won't need to install the btrfs-progs, you can just
> use 
> gdb to debug local ./btrfsck and add new breakpoints to do the trick.
# gdb ./btrfs
GNU gdb (Debian 7.10-1) 7.10
Copyright (C) 2015 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later 
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
.
Find the GDB manual and other documentation resources online at:
.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from ./btrfs...done.
(gdb) break cmds-check.c:4421
Breakpoint 1 at 0x42d000: file cmds-check.c, line 4421.
(gdb) run check /dev/mapper/data-b
Starting program: /home/calestyo/bfsck/btrfs-tools-4.3/btrfs check 
/dev/mapper/data-b
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
...
bad extent [6619620278272, 6619620294656), type mismatch with chunk
bad extent [6619620294656, 6619620311040), type mismatch with chunk
bad extent [6619620311040, 6619620327424), type mismatch with chunk
checking free space cache
checking fs roots

with no breakpoint reached. And I actually did that with both btrfs
filesystems where the problem occurred (the master, and the one that
send/received snapshots incrementally from it).


Hope that helps... anything further to do?

Chris.



Re: shall distros run btrfsck on boot?

2015-11-23 Thread Christoph Anton Mitterer
On Tue, 2015-11-24 at 04:35 +, Duncan wrote:
> I'm a list regular and btrfs user, not a dev, but all the indications
 
> continue to point to _not_ running it automatically at boot, nobody
> even 
> _suggesting_ otherwise.
Sure, I just asked because maybe that would have just been an
anachronism from the days when btrfsck was much more alpha.


>   The btrfs kernel code itself detects and often 
> corrects many problems, and btrfs check is simply not recommended for
> automatic at-boot scheduling -- if the kernel code can't fix it
> without 
> intervention, then the problem is too serious to be fixed without 
> intervention by some scheduled btrfs check run, as well.
I once had an issue with a btrfs, where the kernel didn't show anything
but btrfsck did...(not the one Qu's currently looking into).
And I thought the same is basically the case for other filesystems like
ext.


> In fact, take a look at the shipped fsck.btrfs shell-script, based
> upon 
> the xfs one.  As both the code and the comments suggest, it's 
> specifically designed to simply return success
Sure, but updating that could simply have been forgotten...


Thanks for the update on the status in that matter :)
Cheers,
Chris.



subvols and parents - how?

2015-11-23 Thread Christoph Anton Mitterer
Hey.

I'd have a, mainly administrative, question.

When I use subvolumes, these always have a parent subvolume (except
ID 5), so I can basically decide between two ways:

a) make child subvolumes, e.g.
5
|
+-root   (=subvol, mountpoint /)
  +-boot/
  +-root/
  +-lib/
  +-home/ (=subvolume)
and so on... perhaps the whole thing without the dedicated root
subvolume (although that's probably not so smart, I guess).

b) place at least some of the subvolumes directly below the top-level
and mount them e.g. via /etc/fstab, e.g.
5
|
+-root   (=subvol, mountpoint /)
| +-boot/
| +-root/
| +-lib/
+-home/ (=subvolume, mountpoint /home)

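For layout (b), the extra mounts would typically be wired up in /etc/fstab via the subvol= mount option; a hypothetical sketch (the UUID placeholder and subvolume names are made up for illustration):

```
# /etc/fstab sketch -- <fs-uuid> is a placeholder for the filesystem UUID
UUID=<fs-uuid>  /      btrfs  defaults,subvol=root  0  0
UUID=<fs-uuid>  /home  btrfs  defaults,subvol=home  0  0
```

Both entries point at the same filesystem; only the subvol= option differs.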

Now I wondered whether this has any technical implications, but neither
the wiki, nor the manpages seem to explain a lot here.


The "differences", AFAIU, are as follows:
- When I mount a given subvolume, its children are automatically
  "there".
  Whereas when I don't have them as children (as in (b)) I must of
  course mount them somehow manually.
- Analogously for unmounting.
- I can move existing subvols to higher/lower levels, and the parent
  IDs will change accordingly.

So basically it makes no difference, right? Or is there anything more
technical going on? E.g. with the ref-links or so?
Right now, there are, AFAIK, neither recursive snapshots (and
especially not atomic ones) nor recursive send/receive, right?
If that should ever be implemented, would I perhaps have problems with 
(a) or (b)?


Thanks,
Chris.

btw, for a developer:
$ btrfs subvolume list /mnt/ -pacguqt
ID  gen cgenparent  top level   parent_uuid uuidpath
--  --- --  -   --- 
257 16  8   5   5   -   
64f8a75b-5cb5-6e4d-9f30-d45fe3d9e060/root

There seem to be some column misalignment issues in the output =)
btrfsprogs 4.3



Re: bad extent [5993525264384, 5993525280768), type mismatch with chunk

2015-11-23 Thread Christoph Anton Mitterer
On Tue, 2015-11-24 at 08:46 +0800, Qu Wenruo wrote:
> But there are also some other places like line 4411, 4394 and 4387.
Ah of course, I didn't look for further places:

$ grep -n "rec->wrong_chunk_type = 1" cmds-check.c
4387:   rec->wrong_chunk_type = 1;
4394:   rec->wrong_chunk_type = 1;
4411:   rec->wrong_chunk_type = 1;
4421:   rec->wrong_chunk_type = 1;
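Those grep hits can be turned into a gdb command file mechanically; a small sketch (the temp-file path is arbitrary, and the final gdb invocation is only printed here, since it needs the compiled ./btrfs binary and the device):

```shell
# Build a gdb command file from the four line numbers found above.
# Sketch only: the gdb call at the end is printed, not executed.
breaks_for() {
    for n in "$@"; do
        printf 'break cmds-check.c:%s\n' "$n"
    done
}
breaks_for 4387 4394 4411 4421 > /tmp/breaks.gdb   # arbitrary temp path
echo "gdb -x /tmp/breaks.gdb -ex 'run check /dev/mapper/data-b' ./btrfs"
```

gdb's -x option reads breakpoint commands from the file, which avoids typing the four break commands interactively every run.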



> So there are still 3 breakpoint needs to add.
GNU gdb (Debian 7.10-1) 7.10
Reading symbols from ./btrfs...done.
(gdb) break cmds-check.c:4387
Breakpoint 1 at 0x42cf2b: file cmds-check.c, line 4387.
(gdb) break cmds-check.c:4394
Breakpoint 2 at 0x42cf57: file cmds-check.c, line 4394.
(gdb) break cmds-check.c:4411
Breakpoint 3 at 0x42cfa6: file cmds-check.c, line 4411.
(gdb) break cmds-check.c:4421
Breakpoint 4 at 0x42d000: file cmds-check.c, line 4421.
(gdb) run check /dev/mapper/data-b
Starting program: /home/calestyo/bfsck/btrfs-tools-4.3/btrfs check 
/dev/mapper/data-b
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Checking filesystem on /dev/mapper/data-b
UUID: 250ddae1-7b37-4b22-89e9-4dc5886c810f
checking extents

Breakpoint 1, check_extent_type (rec=0x20a6740) at cmds-check.c:4387
4387rec->wrong_chunk_type = 1;
(gdb) continue 
Continuing.

Breakpoint 1, check_extent_type (rec=0x20a6740) at cmds-check.c:4387
4387rec->wrong_chunk_type = 1;
(gdb) cont 100
Will ignore next 99 crossings of breakpoint 1.  Continuing.

Breakpoint 1, check_extent_type (rec=0x20a9880) at cmds-check.c:4387
4387rec->wrong_chunk_type = 1;
(gdb) cont 1000
Will ignore next 999 crossings of breakpoint 1.  Continuing.

That goes on for a few million crossings... until we get the:
bad extent [6619620016128, 6619620032512), type mismatch with chunk
bad extent [6619620032512, 6619620048896), type mismatch with chunk
again.. and the check exits normally with:
checking free space cache
checking fs roots
checking csums
checking root refs
found 5862373889375 bytes used err is 0
total csum bytes: 5715302800
total tree bytes: 9903816704
total fs tree bytes: 2475769856
total extent tree bytes: 938393600
btree space waste bytes: 1072581913
file data blocks allocated: 9170230497280
 referenced 9281014861824
btrfs-progs v4.3
[Inferior 1 (process 18130) exited normally]


So it's the one in 4387.

Anything further to do? :)

Chris.



Re: bad extent [5993525264384, 5993525280768), type mismatch with chunk

2015-11-23 Thread Christoph Anton Mitterer
On Tue, 2015-11-24 at 10:54 +0800, Qu Wenruo wrote:
> And it would be even better if you want to be a lab mouse for
> incoming fixing patches.
Sure,.. if I get some cheese... and it would be great if you could give
me patches that apply to 4.3.


>  (It won't hurt nor destroy your data)
Wouldn't matter... it's already backed up again and that fs is for
playing now ;-)


Just tell me in case you need to give up so that I can start re-using
the device then.

Cheers,
Chris.



Re: bad extent [5993525264384, 5993525280768), type mismatch with chunk

2015-11-23 Thread Christoph Anton Mitterer
On Tue, 2015-11-24 at 10:09 +0800, Qu Wenruo wrote:
> I'll dig further to see what's causing the problem.
I guess you'd prefer if I keep the fs for later verification?


> Thanks for all the debug info, it really helps a lot!
Well thanks for your efforts as well :)

Chris.



shall distros run btrfsck on boot?

2015-11-23 Thread Christoph Anton Mitterer
Hey.

Short question since that came up on debian-devel.

Now that btrfs check gets more and more useful, are the developers
going to recommend running it periodically on boot (of course that
wouldn't work right now, as it would *always* check)?

Plus... is btrfs check (without any arguments) non-destructive, or are
there corner-cases where it may lead to any changes on the devices?

Thanks,
Chris.



Re: [PATCH v2 3/5] btrfs-progs: kernel based default features for mkfs

2015-11-23 Thread Christoph Anton Mitterer
On Mon, 2015-11-23 at 11:05 -0500, Austin S Hemmelgarn wrote:
> I would find it useful if btrfs gives a warning if it creates a
> > filesystem which (because unsupported in the current kernel) lacks
> > features which are considered default by then.
> It should give a warning if the user requests a feature that is 
> unsupported by the kernel it's being run on, but it should not by 
> default try to enable something that isn't supported by the kernel
> it's 
> running on.
Well that as well, and of course it shouldn't try to enable a feature
that wouldn't work, but what I meant was, e.g. if I create a fs with
btrfs-progs 4.3 (where skinny-extents are default) but on such an old
kernel where this isn't supported yet,... it should tell me "Normally
I'd create the fs with skinny-extents, but I don't as your kernel is
too old".


> It is actually possible to clone a btrfs filesystem, just not in a
> way 
> that people used to stuff like ext4 would recognize.  In essence, you
> need to take the FS mostly off-line, force all subvolumes to read-
> only, 
> then use send-receive to transfer things, and finally make the 
> subvolumes writable again.  I've been considering doing a script to
> do 
> this automatically, but have never gotten around to it as it's not 
> something that is quick to code, and it's not something I do very
> often.
And that would also keep all ref-links, etc.? I.e. the copied fs
wouldn't eat up much more space than the original?
Well then such a script should be part of btrfs-progs :-)
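A rough dry-run of the procedure Austin describes might look like this (all paths hypothetical; the btrfs commands are printed rather than executed, since they need a real source and destination filesystem):

```shell
# Dry-run sketch of cloning one subvolume: flip it read-only, send it,
# flip it back. Commands are printed instead of executed.
clone_subvol() {
    # $1 = source subvolume, $2 = destination directory
    printf 'btrfs property set -ts %s ro true\n'  "$1"
    printf 'btrfs send %s | btrfs receive %s\n'   "$1" "$2"
    printf 'btrfs property set -ts %s ro false\n' "$1"
}
# Would be repeated for every subvolume on the source filesystem:
clone_subvol /mnt/src/root /mnt/dst
clone_subvol /mnt/src/home /mnt/dst
```

In a real script the subvolume list would come from btrfs subvolume list, and error handling would be needed; this only shows the ro-flip / send / receive sequence per subvolume.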

Cheers,
Chris.



Re: [PATCH v2 3/5] btrfs-progs: kernel based default features for mkfs

2015-11-23 Thread Christoph Anton Mitterer
Hey.

On Mon, 2015-11-23 at 20:56 +0800, Anand Jain wrote:
> This patch disables default features based on the running kernel.
Not sure if that's very realistic in practice (most people will have
some distro, whose btrfsprogs version probably matches the kernel), but
purely from the end-user PoV:

I would find it useful if btrfs gives a warning if it creates a
filesystem which (because unsupported in the current kernel) lacks
features which are considered default by then.


AFAIU, really "cloning" (I mean including all snapshots, subvols,
etc.) a btrfs is not possible right now (or is it?), so a btrfs is
something that rather should stay longer (as one cannot easily
copy it to/with a new one)... so for me as an end-user, it may be
easier to switch to a newer kernel, in order to get a feature which is
default, than to migrate the fs later.


Cheers,
Chris.



Re: bad extent [5993525264384, 5993525280768), type mismatch with chunk

2015-11-21 Thread Christoph Anton Mitterer
Hey Qu.


On Sun, 2015-11-22 at 10:04 +0800, Qu Wenruo wrote:
> If any of you can recompile btrfs-progs and use gdb to debug it,
> would 
> anyone please to investigate where did the wrong_chunk_type is set?
> It is in the function check_extent_type():

Not 100% sure what you want... AFAIU, you just want to see whether that
line is reached?

I didn't re-compile but used the btrfs-tools-dbg package, but I guess
that should do.

In the debian version that line seems to be at:
4374 
4375 /* Check if the type of extent matches with its chunk */
4376 static void check_extent_type(struct extent_record *rec)
4377 {
...
4419 bg_type = BTRFS_BLOCK_GROUP_METADATA;
4420 if (!(bg_cache->flags & bg_type))
4421 rec->wrong_chunk_type = 1;
4422 }
4423 }

Running:
# gdb btrfs
(gdb) dir /root/btrfs-tools-4.3
Source directories searched: /root/btrfs-tools-4.3:$cdir:$cwd
(gdb) break cmds-check.c:4421
Breakpoint 1 at 0x41d859: file cmds-check.c, line 4421.
(gdb) run check /dev/mapper/data-b
Starting program: /bin/btrfs check /dev/mapper/data-b
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".

... in fact reaches that breakpoint:
Breakpoint 1, check_extent_type (rec=rec@entry=0x884290) at cmds-check.c:4423
4423}

... but the error message ("bad extent [5993525264384, 5993525280768),
type mismatch with chunk") doesn't seem to be printed at that stage... 


If I continue, it goes for a while:

Breakpoint 1, check_extent_type (rec=rec@entry=0x884290) at cmds-check.c:4423
4423}
(gdb) cont 10
Will ignore next 9 crossings of breakpoint 1.  Continuing.

and so on for at least several million crossings... I then removed the
breakpoint and after a longer while the usual error messages came up.


Does that help?

Chris.



Re: bad extent [5993525264384, 5993525280768), type mismatch with chunk

2015-11-20 Thread Christoph Anton Mitterer
On Fri, 2015-11-20 at 17:23 +0100, Laurent Bonnaud wrote:
> So here is the output of "btrfs-debug-tree -t 2 " in case it may
Gosh... 15M via mail?! o.O

Anyway an update from my side...
I've copied all data from the fs in question to a new btrfs,... done
under Linux 4.2.6 and btrfs-progs v4.3.
No data was lost or anyhow corrupted (also shown by extensive tests
using my XATTR hashsums and other backups I've had).

On the new fs, btrfs doesn't report that error anymore.
No snapshots/etc made so far on the new fs.



Qu, have you had a look at btrfs check already?

And you've explained that the fs was okay and only the check was
wrong... but since the false positive errors don't appear on my copied
fs, does that really mean that the skinny metadata changed so much, or
was anything changed in the kernel so that the same errors don't appear
on the new fs anymore (or was it perhaps because I was using snapshots)?


Cheers,
Chris.



Re: [Enhancement] ... and please rename "raid1" to something better

2015-11-20 Thread Christoph Anton Mitterer
On Fri, 2015-11-20 at 11:05 +, Duncan wrote:
> It's missing raid1. =:^(
speaking of which...

Wouldn't the developers consider renaming raid1 to something more
correct? E.g. replicas2 or dup or whatever.

RAID1 has always meant mirrored devices, and the closest to this in
btrfs would be N replicas on N devices, not two as it is now.
Also I wouldn't know of any other system that doesn't use "RAID1" in
the traditional meaning (MD, basically every HW RAID I came across).

And I'm not against the mode of having 2 duplicates itself... just
against the naming, which I think easily confuses users, who may wake
up to a bad surprise.


Cheers,
Chris.



Re: [PATCH 00/15] btrfs: Hot spare and Auto replace

2015-11-15 Thread Christoph Anton Mitterer
Hey.

You guys may want to update:
https://btrfs.wiki.kernel.org/index.php/Project_ideas#Hot_spare_support

Cheers,
Chris.



Re: bad extent [5993525264384, 5993525280768), type mismatch with chunk

2015-11-14 Thread Christoph Anton Mitterer
On Sun, 2015-11-15 at 09:29 +0800, Qu Wenruo wrote:
> > > If type is wrong, all the extents inside the chunk should be
> > > reported
> > > as
> > > mismatch type with chunk.
> > Isn't that the case? At least there are so many reported extents...
> 
> If you posted all the output
Sure, I posted everything that the dump gave :)

> , that's just a little more than nothing.
> Just tens of error reported, compared to millions of extents.
> And in your case, if a chunk is really bad, it will report about 65K
> errors.
I see..


> I think it's a btrfsck issue, at least from the dump info, your
> extent 
> tree is OK.
> And if there is no other error reported from btrfsck, your filesystem
> should be OK.
Nope.. there were no further errors.



> > In any case, I'll keep the fs in question for a while, so that I
> > can do
> > verifications in case you have patches.
> 
> Nice.
Just tell me if you have something.



btw: I saw these:
Nov 15 02:01:42 heisenberg kernel: INFO: task btrfs-transacti:28379 blocked for 
more than 120 seconds.
Nov 15 02:01:42 heisenberg kernel:   Not tainted 4.2.0-1-amd64 #1
Nov 15 02:01:42 heisenberg kernel: "echo 0 > 
/proc/sys/kernel/hung_task_timeout_secs" disables this message.
Nov 15 02:01:42 heisenberg kernel: btrfs-transacti D 8109a1b0 0 
28379  2 0x
Nov 15 02:01:42 heisenberg kernel:  88016e3e6500 0046 
007a 88040be88f00
Nov 15 02:01:42 heisenberg kernel:  2659 88013807 
88041e355840 7fff
Nov 15 02:01:42 heisenberg kernel:  815508e0 88013806fbb8 
0007 815500ff
Nov 15 02:01:42 heisenberg kernel: Call Trace:
Nov 15 02:01:42 heisenberg kernel:  [] ? 
bit_wait_timeout+0x70/0x70
Nov 15 02:01:42 heisenberg kernel:  [] ? schedule+0x2f/0x70
Nov 15 02:01:42 heisenberg kernel:  [] ? 
schedule_timeout+0x1f7/0x290
Nov 15 02:01:42 heisenberg kernel:  [] ? 
extent_write_cache_pages.isra.28.constprop.43+0x222/0x330 [btrfs]
Nov 15 02:01:42 heisenberg kernel:  [] ? read_tsc+0x5/0x10
Nov 15 02:01:42 heisenberg kernel:  [] ? 
bit_wait_timeout+0x70/0x70
Nov 15 02:01:42 heisenberg kernel:  [] ? 
io_schedule_timeout+0x9d/0x110
Nov 15 02:01:42 heisenberg kernel:  [] ? bit_wait_io+0x35/0x60
Nov 15 02:01:42 heisenberg kernel:  [] ? 
__wait_on_bit+0x5a/0x90
Nov 15 02:01:42 heisenberg kernel:  [] ? 
find_get_pages_tag+0x116/0x150
Nov 15 02:01:42 heisenberg kernel:  [] ? 
wait_on_page_bit+0xb6/0xc0
Nov 15 02:01:42 heisenberg kernel:  [] ? 
autoremove_wake_function+0x40/0x40
Nov 15 02:01:42 heisenberg kernel:  [] ? 
filemap_fdatawait_range+0xc7/0x140
Nov 15 02:01:42 heisenberg kernel:  [] ? 
btrfs_wait_ordered_range+0x73/0x110 [btrfs]
Nov 15 02:01:42 heisenberg kernel:  [] ? 
btrfs_wait_cache_io+0x5d/0x1e0 [btrfs]
Nov 15 02:01:42 heisenberg kernel:  [] ? 
btrfs_start_dirty_block_groups+0x17c/0x3f0 [btrfs]
Nov 15 02:01:42 heisenberg kernel:  [] ? 
btrfs_commit_transaction+0x1b4/0xa90 [btrfs]
Nov 15 02:01:42 heisenberg kernel:  [] ? 
start_transaction+0x90/0x580 [btrfs]
Nov 15 02:01:42 heisenberg kernel:  [] ? 
transaction_kthread+0x224/0x240 [btrfs]
Nov 15 02:01:42 heisenberg kernel:  [] ? 
btrfs_cleanup_transaction+0x510/0x510 [btrfs]
Nov 15 02:01:42 heisenberg kernel:  [] ? kthread+0xc1/0xe0
Nov 15 02:01:42 heisenberg kernel:  [] ? 
kthread_create_on_node+0x170/0x170
Nov 15 02:01:42 heisenberg kernel:  [] ? 
ret_from_fork+0x3f/0x70
Nov 15 02:01:42 heisenberg kernel:  [] ? 
kthread_create_on_node+0x170/0x170
Nov 15 02:03:42 heisenberg kernel: INFO: task btrfs-transacti:28379 blocked for more than 120 seconds.
Nov 15 02:03:42 heisenberg kernel:   Not tainted 4.2.0-1-amd64 #1
Nov 15 02:03:42 heisenberg kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Nov 15 02:03:42 heisenberg kernel: btrfs-transacti D 8109a1b0 0 28379  2 0x
Nov 15 02:03:42 heisenberg kernel:  88016e3e6500 0046 007a 88040be88f00
Nov 15 02:03:42 heisenberg kernel:  2659 88013807 88041e355840 7fff
Nov 15 02:03:42 heisenberg kernel:  815508e0 88013806fbb8 0007 815500ff
Nov 15 02:03:42 heisenberg kernel: Call Trace:
Nov 15 02:03:42 heisenberg kernel:  [] ? bit_wait_timeout+0x70/0x70
Nov 15 02:03:42 heisenberg kernel:  [] ? schedule+0x2f/0x70
Nov 15 02:03:42 heisenberg kernel:  [] ? schedule_timeout+0x1f7/0x290
Nov 15 02:03:42 heisenberg kernel:  [] ? extent_write_cache_pages.isra.28.constprop.43+0x222/0x330 [btrfs]
Nov 15 02:03:42 heisenberg kernel:  [] ? read_tsc+0x5/0x10
Nov 15 02:03:42 heisenberg kernel:  [] ? bit_wait_timeout+0x70/0x70
Nov 15 02:03:42 heisenberg kernel:  [] ? io_schedule_timeout+0x9d/0x110
Nov 15 02:03:42 heisenberg kernel:  [] ? bit_wait_io+0x35/0x60
Nov 15 02:03:42 heisenberg kernel:  [] ? __wait_on_bit+0x5a/0x90
Nov 15 02:03:42 heisenberg kernel:  [] ? find_get_pages_tag+0x116/0x150
Nov 15 02:03:42 heisenberg kernel:  [] ? 

Re: bad extent [5993525264384, 5993525280768), type mismatch with chunk

2015-11-13 Thread Christoph Anton Mitterer
I just got the backup disk back, also btrfs, which was made via
send/receive...
It has the same errors during fsck.

The main disk still hasn't found any file (apart from a few, others for
which none of my hash sums were stored at all) that doesn't verify.


So I guess there's definitely some bug in btrfs that either propagated
via send/receive or was so common that it happens on another fs as
well; though it's unclear whether this was only in older versions.
Perhaps Qu can shed some light on this, when he had a look at my extent
tree dump.


Cheers,
Chris.

smime.p7s
Description: S/MIME cryptographic signature


Re: bad extent [5993525264384, 5993525280768), type mismatch with chunk

2015-11-13 Thread Christoph Anton Mitterer
On Fri, 2015-11-13 at 07:05 +, Duncan wrote:
> 8 TiB disks -- are those the disk-managed SMR "archive" disks I've
> read 
> about on a number of threads?
Yes... but...

> If so, that hardware is almost certainly the cause, as they're known
> to 
> be problematic on current kernels.  While most filesystems (all?)
> will 
> apparently go corrupt on them, it can remain invisible corruption for
> quite some time on many of them, but btrfs with its checksums and etc
> will tend to show up the problems far sooner, and there have been at 
> least 2-3 threads on the problem already, on this list.
I think it's pretty unlikely that this is the reason.

- I never saw any errors popping up from the lower driver levels (i.e.
the ATA errors all those people saw), and I've regularly checked
- I always did the checksum verification based on my own hashes stored
in each file's XATTRS, without any error so far.
- I wrote far more data (the device is nearly full) without any error
(XATTRs/hashes) than those people had written when they noticed severe
corruptions, which seemed to happen already after a few GB.
- I think it's pretty unlikely that all data (in terms of hashsums)
would be okay, and that these corruptions would have just appeared in
some of btrfs meta-data.

I'm not sure why I don't suffer from these issues, probably because I
run them only via USB/SATA bridges, which, while they're USB3.0, are
probably too slow for these errors to pop up.

See my comment:
https://bugzilla.kernel.org/show_bug.cgi?id=93581#c70


Further, a small status update:
As mentioned this night, I've kicked off a full run of verifying all of
my XATTRs-set hashsums... (and these hashsums are basically all
computed when the data is known to be valid, e.g. for DSLR pictures
straight off the SD, etc.).
In terms of number of files, that run is halfway through... so far with
only a handful of errors, all related to files where I'd apparently
forgotten to set the sums. No mismatches (so far).

So unless btrfs completely lost file entries (and I guess that wouldn't
just affect the extent tree?), and I thus wouldn't verify these files
at all or notice them missing,... everything seems fine, as far as I
can tell, (so far).

I'm basically just about to head off to get my backup disk, which I
don't have at home... to see whether it also shows these fsck errors. So
stay tuned.

Cheers,
Chris.



Re: bad extent [5993525264384, 5993525280768), type mismatch with chunk

2015-11-13 Thread Christoph Anton Mitterer
On Sat, 2015-11-14 at 09:22 +0800, Qu Wenruo wrote:
> Manually checked they all.
thanks a lot :-)


> Strangely, they are all OK... although it's a good news for you.
Oh man... you're so mean ;-D


> They are all tree blocks and are all in metadata block group.
and I guess that's... expected/intended?


> It seems to be a btrfsck false alert
that's a relief (for me)

Well I've already started to copy all files from the device to a new
one... unfortunately I'll lose all older snapshots (at least on the
new fs) but instead I get skinny-metadata, which wasn't the default
back then.
(being able to copy a full fs, with all subvols/snapshots is IMHO
really something that should be worked on)


> If type is wrong, all the extents inside the chunk should be reported
> as 
> mismatch type with chunk.
Isn't that the case? At least there are so many reported extents...

> And according to the dump result, the reported ones are not
> continuous 
> even they have adjacent extents but adjacent ones are not reported.
I'm not so deep into btrfs... is this kinda expected and if not how
could all this happen? Or is it really just a check issue and
filesystem-wise fully as it should be?


> Did you have any smaller btrfs with the same false alert?
Uhm... I can check, but I don't think so, especially as all other btrfs
I have are newer and already have skinny-metadata.
The only ones I had without are those two big 8TB HDDs...
Unfortunately they contain sensitive data from work, which I don't
think I can copy; otherwise I could have sent you the device or so...

> Although I'll check the code to find what's wrong, but if you have
> any 
> small enough image, debugging will be much much faster.
In any case, I'll keep the fs in question for a while, so that I can do
verifications in case you have patches.

thanks a lot,
Chris.



Re: bad extent [5993525264384, 5993525280768), type mismatch with chunk

2015-11-12 Thread Christoph Anton Mitterer
I've uploaded the full output of btrfs check on that device to:
http://christoph.anton.mitterer.name/tmp/public/cbec6446-898b-11e5-90a4-502690aa641f.xz

there are nearly 600k of these error lines... WTF?!

Also, the filesystem still mounts (without any errors to dmesg)


Any help would be appreciated,
thx,
Chris.



bad extent [5993525264384, 5993525280768), type mismatch with chunk

2015-11-12 Thread Christoph Anton Mitterer
Hey.

I get these errors on fsck'ing a btrfs:
bad extent [5993525264384, 5993525280768), type mismatch with chunk
bad extent [5993525280768, 5993525297152), type mismatch with chunk
bad extent [5993525297152, 5993525313536), type mismatch with chunk
bad extent [5993529442304, 5993529458688), type mismatch with chunk
bad extent [5993529458688, 5993529475072), type mismatch with chunk
bad extent [5993530015744, 5993530032128), type mismatch with chunk
bad extent [5993530359808, 5993530376192), type mismatch with chunk
bad extent [5993530376192, 5993530392576), type mismatch with chunk
bad extent [5993530392576, 5993530408960), type mismatch with chunk
bad extent [5993530408960, 5993530425344), type mismatch with chunk
bad extent [5993531260928, 5993531277312), type mismatch with chunk
bad extent [5993531310080, 5993531326464), type mismatch with chunk

What do they mean? And how to correct it without data loss (cause this
would be critical/precious data)?

Oddly, I've fsck'ed the very same fs last time I've unmounted it (with
no errors)... and now this.
The only difference would be newer kernel and btrfsprogs.


Thanks,
Chris.



Re: bad extent [5993525264384, 5993525280768), type mismatch with chunk

2015-11-12 Thread Christoph Anton Mitterer
On Fri, 2015-11-13 at 11:23 +0800, Qu Wenruo wrote:
> No, "-t 2" means only dump extent tree, no privacy issues at all.
> Since only numeric inode/snapshot number and offset inside file.
> Or I'll give you a warning on privacy.
> 
> No file name at all, just try it yourself.
I'm preparing it...


> It may happened in older kernels, just recent btrfsck can detect the
> problem.
Okay, and how do "we" find the cause?


> > kernel 4.2.6
> > btrfsprogs 4.3
> > 
> New enough, that's the only good news yet...
Well but I did use the same fs with older kernels... so it may have
happened much earlier... :-(

I do have a SHA512 checksum on every file... stored in the XATTRS...

So as long as the file hierarchy would still be complete I could at
least check whether all files are still really valid.
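For what it's worth, the per-file checksum scheme described above could be sketched roughly like this. The attribute name `user.sha512` and the temp-file setup are my assumptions, not necessarily what was actually used; `setfattr`/`getfattr` come from the attr package and need a filesystem with user-xattr support:

```shell
#!/bin/sh
# Hypothetical sketch: store a file's SHA-512 in a user xattr while the
# data is known-good, and verify it later. "user.sha512" is an assumption.
f=$(mktemp)
printf 'known-good data\n' > "$f"

sum=$(sha512sum "$f" | cut -d' ' -f1)
# Store the checksum (silently skipped if the fs lacks xattr support):
setfattr -n user.sha512 -v "$sum" "$f" 2>/dev/null || true

# Verify: read the stored value back, falling back to the freshly
# computed sum when the xattr could not be stored or read:
stored=$(getfattr --only-values -n user.sha512 "$f" 2>/dev/null || echo "$sum")
[ "$stored" = "$(sha512sum "$f" | cut -d' ' -f1)" ] && echo "OK: $f"
rm -f "$f"
```

A real run would walk the tree with find and treat a missing attribute as an error rather than falling back.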


Cheers,
Chris.



Re: bad extent [5993525264384, 5993525280768), type mismatch with chunk

2015-11-12 Thread Christoph Anton Mitterer
Hey.

On Fri, 2015-11-13 at 10:13 +0800, Qu Wenruo wrote:
> Like this one, if any extent type doesn't match with its chunk, like
> metadata extent in a data chunk, btrfsck will report like that.
So these errors... are they anything serious? I.e. like data
loss/corruption? Or is it more of a "cosmetic" issue?

And would there be a way for btrfs check to repair that thing?

And is there a way to find out to which file these extents belong?


> The filesystem seems to be a converted one from ext*.
No, it was not... are there any other reasons that could cause this?

It's actually a very plain btrfs... no RAID, no qgroups,... the only
thing I really did was creating snapshots and incrementally send'ing
them to other btrfs (i.e. the backups).
I'd have expected that btrfs is more or less stable in these cases.


> But some user, like Roman Mamedov in the maillist, said a balance 
> operation can solve it.
> It's worthy trying but it may also cause unknown bugs.
So what's the safest way? Copying off all data and creating the fs from
scratch?
If so, is there a (safe) way to copy a full fs with the snapshots?

But as I've said, this wasn't an ext converted fs, so in case I do this
or the balance we probably lose any chances to further debug.

And is there any way to tell more certainly whether the balance would
help or whether I'd just get more, possibly even hidden, corruptions? I
mean, alright... it would be painful to recover from the most recent
backup, but not extremely painful.


Right now I'm doing a full read-only scrub, which will take a while as
it's a nearly full 8TB HDD, so far no errors.


Thanks,
Chris.



Re: bad extent [5993525264384, 5993525280768), type mismatch with chunk

2015-11-12 Thread Christoph Anton Mitterer
On Fri, 2015-11-13 at 11:23 +0800, Qu Wenruo wrote:
> No, "-t 2" means only dump extent tree, no privacy issues at all.
> Since only numeric inode/snapshot number and offset inside file.
> Or I'll give you a warning on privacy.

Done...
http://christoph.anton.mitterer.name/tmp/public/856fc21a-89b8-11e5-abaf-502690aa641f.xz


I tried to figure out which kernel I started with when I created the
fs... that was around the 1st of March 2015... but I only had the
Debian unstable kernel from that time (not the current vanilla)... and
I guess that was 3.16.


Thanks,
Chris.



Re: bad extent [5993525264384, 5993525280768), type mismatch with chunk

2015-11-12 Thread Christoph Anton Mitterer
On Fri, 2015-11-13 at 10:52 +0800, Qu Wenruo wrote:
> You can provide the output of "btrfs-debug-tree -t 2 " to help
> further debug.
> It would be quite big, so it's better to zip it.
That would contain all the filenames, right? Hmm that could be
problematic because of privacy issues...


> Although it may not help a lot, but at least I can tell you which
> file 
> extents are affected. (By the hard way, I can only tell you which
> inode 
> in which subvolume is affected, all in numeric form)
> 
> And I could get enough info to determine what's the wrong type.
> (data extent in metadata chunk or vice verse, or even system chunk is
> involved)
sigh... I mean... how can that happen if none of the more recent
features is used... I'd have guessed others would have noticed such a
bug before.


> > And is there any way to tell more certain, whether the balance
> > would
> > help or whether I'd just get more possibly even hidden corruptions?
> > I
> > mean right... well it would be painful to recover from the most
> > recent
> > backup, but not extremely painful.
> 
> When extent and chunk type get wrong, only god knows what will
> happen...
> So no useful help here.
If btrfs check already notices the mismatch, shouldn't it then be
possible to set the correct type?


> If the type mismatch errors are the only error output from fsck, then
> scrub should not help or report anything useful.
I see...


> And at last, what's the kernel and btrfs-progs version?
kernel 4.2.6
btrfsprogs 4.3



Re: bad extent [5993525264384, 5993525280768), type mismatch with chunk

2015-11-12 Thread Christoph Anton Mitterer
And I should perhaps mention one more thing:

As I've said I have these two 8TiB disks... one which is basically the
master with loads of precious data, the other being a backup from the
master, regularly created with incremental btrfs send/receive.

Every time I did this (which is every two months or so), I also did a
complete manual diff of all new/changed files (between the two
devices).
And when I first filled the master in March, copying the data over
from some other ext4 disks, I did so as well.

And there were never differences or missing files.


That's why I kinda wonder if the whole thing is in any way critical or
an "issue" at all.


Cheers,
Chris.



Re: strange "No space left on device"

2015-11-08 Thread Christoph Anton Mitterer
On Sun, 2015-11-08 at 20:39 +, Duncan wrote:
> Wow, yes!  Good catch, Henk! =:^)  Hugo obviously didn't catch it,
> and I 
> wouldn't have either, as the bad size detection behavior is so 
> unexpected, it just wouldn't occur to me to look!
Hmm... all that *may* more likely be my own error from copying and
pasting the terminal output together:

I did actually change the 3rd partition to use 1GiB in later tries at
the expense of the 5th one shrinking, so the part table would have
looked like this in these later tries:
512M
1G
1G
1G
4G

Which would again fit the output of the various mkfs.btrfs.

Sorry if that was the case, apologies for any confusion.


The problem seemed to go away when explicitly using --mixed.


> (Apparently, btrfs-progs-4.3 does away with the default to mixed-
> mode at 1 GiB or under, tho it is still recommended.
Well I still had 4.2 ...

> I'm not exactly sure of
> why, tho I think it had to do with being able to use sub-GiB btrfs
> for 
> testing without having to worry about mixed mode.
Kinda strange... shouldn't it work out of the box for users and not
developers?

To be honest, no one should need to read through the wiki, just to be
able to create a small sized fs.

And even if no mixed D/M block group allocation was used... it
shouldn't just fail out of the box with a few bytes' worth of files on
a 1 GB fs.


Cheers,
Chris.



mkfs.btrfs doesn't detect SSD

2015-11-07 Thread Christoph Anton Mitterer
Hey.

I'm creating a filesystem on Samsung Evo 850 Pro on top of a dm-
crypt/LUKS container (with TRIM not being passed on, for the usual
security reasons):
# mkfs.btrfs --label system /dev/mapper/system 
btrfs-progs v4.2.2
See http://btrfs.wiki.kernel.org for more information.

Label:  system
UUID:   65531196-2e43-4c49-b495-ac9abc57d7d8
Node size:  16384
Sector size:4096
Filesystem size:937.00GiB
Block group profiles:
  Data: single8.00MiB
  Metadata: DUP   1.01GiB
  System:   DUP  12.00MiB
SSD detected:   no
Incompat features:  extref, skinny-metadata
Number of devices:  1
Devices:
   ID        SIZE  PATH
1   937.00GiB  /dev/mapper/system


As you can see it doesn't detect the SSD.
Isn't that kinda problematic, since btrfs does more when it detects an
SSD than just using TRIM (e.g. not DUPing metadata)?

Can I somehow override this to get the SSD "detected"?
Or what is the general suggestion here? Having it handled as SSD or as
non-SSD, as said, when dm-crypt is used below and TRIM is not intended
to be used.


Cheers,
Chris.



Re: mkfs.btrfs doesn't detect SSD

2015-11-07 Thread Christoph Anton Mitterer
Hmm, in fact it seems to be the kernel that wrongly detects the type:
/sys/block/sdb/queue/rotational = 1
or more like the USB/SATA bridge simply reports it wrong.

Anyway, is there a way to override? Or will setting
/sys/block/sdb/queue/rotational = 0 give the expected behaviour?
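For the record, a rough sketch of checking (and, as root, overriding) that flag; the device name is just a placeholder, and note that btrfs also accepts the ssd/nossd mount options to override its own detection:

```shell
#!/bin/sh
# Hypothetical sketch: inspect the kernel's rotational flag for a block
# device ("sdb" here is a placeholder) and show how it would be flipped.
dev=${dev:-sdb}
rot=/sys/block/$dev/queue/rotational
if [ -r "$rot" ]; then
    state=$(cat "$rot")
    echo "$dev rotational=$state"
    # As root, mark the device non-rotational (not persistent across boots):
    # echo 0 > "$rot"
else
    state=missing
    echo "no such device: $dev"
fi
```

The sysfs write is per-boot only; a udev rule would be needed to make it stick.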


Thanks,
Chris.



strange "No space left on device"

2015-11-07 Thread Christoph Anton Mitterer
Hey.

I just did the following twice in a row on a ~8 GB USB stick, under
Debian sid (i.e. kernel 4.2.0-1-amd64, btrfsprogs 4.2.2-1).

First, created some GPT on the stick:
Number  Start (sector)End (sector)  Size   Code  Name
   12048 1048575   511.0 MiB   EF02  BIOS boot partition
   2 1048576 3145727   1024.0 MiB  8300  Linux filesystem
   3 3145728 4194303   512.0 MiB   8300  Linux filesystem
   4 4194304 6291455   1024.0 MiB  8300  Linux filesystem
   5 629145615687644   4.5 GiB 8300  Linux filesystem


Then filesystems:
root@heisenberg:~# mkfs.btrfs --nodiscard --label boot-main /dev/sdb2 
btrfs-progs v4.2.2
See http://btrfs.wiki.kernel.org for more information.

Label:  boot-main
UUID:   150ee9fb-c650-4b5b-8f64-606e589e429a
Node size:  16384
Sector size:4096
Filesystem size:1.00GiB
Block group profiles:
  Data: single8.00MiB
  Metadata: DUP  59.19MiB
  System:   DUP  12.00MiB
SSD detected:   no
Incompat features:  extref, skinny-metadata
Number of devices:  1
Devices:
   ID        SIZE  PATH
1 1.00GiB  /dev/sdb2

root@heisenberg:~# mkfs.btrfs --nodiscard --label boot-data /dev/sdb3
btrfs-progs v4.2.2
See http://btrfs.wiki.kernel.org for more information.

Label:  boot-data
UUID:   b1c1fc77-c965-4f0c-a2b5-44a301fd8772
Node size:  16384
Sector size:4096
Filesystem size:1.00GiB
Block group profiles:
  Data: single8.00MiB
  Metadata: DUP  59.19MiB
  System:   DUP  12.00MiB
SSD detected:   no
Incompat features:  extref, skinny-metadata
Number of devices:  1
Devices:
   ID        SIZE  PATH
1 1.00GiB  /dev/sdb3

root@heisenberg:~# mkfs.btrfs --nodiscard --label boot-archive
/dev/sdb4
btrfs-progs v4.2.2
See http://btrfs.wiki.kernel.org for more information.

Label:  boot-archive
UUID:   a063cf3b-0322-4ec7-a8c1-2dabecad9f57
Node size:  16384
Sector size:4096
Filesystem size:1.00GiB
Block group profiles:
  Data: single8.00MiB
  Metadata: DUP  59.19MiB
  System:   DUP  12.00MiB
SSD detected:   no
Incompat features:  extref, skinny-metadata
Number of devices:  1
Devices:
   ID        SIZE  PATH
1 1.00GiB  /dev/sdb4

root@heisenberg:~# mkfs.btrfs --nodiscard --label boot-rescue /dev/sdb5
btrfs-progs v4.2.2
See http://btrfs.wiki.kernel.org for more information.

Label:  boot-rescue
UUID:   104b7bc3-3b8c-4a08-b0a6-9172ed664e68
Node size:  16384
Sector size:4096
Filesystem size:3.98GiB
Block group profiles:
  Data: single8.00MiB
  Metadata: DUP 211.75MiB
  System:   DUP  12.00MiB
SSD detected:   no
Incompat features:  extref, skinny-metadata
Number of devices:  1
Devices:
   ID        SIZE  PATH
1 3.98GiB  /dev/sdb5



No errors in the kernel log.



Then I got usage info:
root@heisenberg:/data/SSS/boot# btrfs filesystem usage data/
Overall:
Device size:   1.00GiB
Device allocated:    126.38MiB
Device unallocated:  897.62MiB
Device missing:  0.00B
Used:    256.00KiB
Free (estimated):    905.62MiB  (min: 456.81MiB)
Data ratio:   1.00
Metadata ratio:   2.00
Global reserve:   16.00MiB  (used: 0.00B)

Data,single: Size:8.00MiB, Used:0.00B
   /dev/sdb3   8.00MiB

Metadata,DUP: Size:51.19MiB, Used:112.00KiB
   /dev/sdb3 102.38MiB

System,DUP: Size:8.00MiB, Used:16.00KiB
   /dev/sdb3  16.00MiB

Unallocated:
   /dev/sdb3 897.62MiB
root@heisenberg:/data/SSS/boot# btrfs filesystem usage main/
Overall:
Device size:   1.00GiB
Device allocated:    126.38MiB
Device unallocated:  897.62MiB
Device missing:  0.00B
Used:    256.00KiB
Free (estimated):    905.62MiB  (min: 456.81MiB)
Data ratio:   1.00
Metadata ratio:   2.00
Global reserve:   16.00MiB  (used: 0.00B)

Data,single: Size:8.00MiB, Used:0.00B
   /dev/sdb2   8.00MiB

Metadata,DUP: Size:51.19MiB, Used:112.00KiB
   /dev/sdb2 102.38MiB

System,DUP: Size:8.00MiB, Used:16.00KiB
   /dev/sdb2  16.00MiB

Unallocated:
   /dev/sdb2 897.62MiB
root@heisenberg:/data/SSS/boot# btrfs filesystem usage rescue/
Overall:
Device size:   3.98GiB
Device allocated:    431.50MiB
Device unallocated:    3.56GiB
Device missing:  0.00B
Used:    320.00KiB
Free (estimated):    

Re: strange "No space left on device"

2015-11-07 Thread Christoph Anton Mitterer
On Sat, 2015-11-07 at 23:30 +, Hugo Mills wrote:
>    These are all really small.
Well enough for booting =)


>    I would suggest running mkfs with --mixed for all of these
> filesystems and trying again.
I thought btrfs would do that automatically:
https://btrfs.wiki.kernel.org/index.php/FAQ#if_your_device_is_small
"Using mixed block groups is recommended for filesystems of 1GiB or
smaller and mkfs.btrfs will force mixed block groups automatically in
that case."

Anyway, even if that's the reason, then I think something's quite
wrong here... a) it doesn't always happen (I just created a fs 10 times
in the same partition, with no problem)... b) in those cases where I
could reproduce the issue, it went away after a balance... so perhaps
one should just balance the fs right after mkfs o.O
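For completeness, mixed mode can be forced explicitly; a rough sketch against a loopback file, so no real device is touched (requires btrfs-progs; the 512 MiB size and the label are just illustrative):

```shell
#!/bin/sh
# Hypothetical sketch: create a small btrfs with forced mixed
# data+metadata block groups, on a scratch file instead of a device.
img=$(mktemp)
truncate -s 512M "$img"
if command -v mkfs.btrfs >/dev/null 2>&1; then
    # -f overwrites any stale signature on the scratch file
    if mkfs.btrfs -f --mixed --label small-demo "$img" >/dev/null 2>&1; then
        msg="mkfs ok"
    else
        msg="mkfs failed"
    fi
else
    msg="mkfs.btrfs not installed"
fi
echo "$msg"
rm -f "$img"
```

The resulting image can then be loop-mounted to confirm that data and metadata share block groups.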

Cheers,
Chris.


