Re: dear developers, can we have notdatacow + checksumming, plz?

2015-12-22 Thread Duncan
Austin S. Hemmelgarn posted on Mon, 21 Dec 2015 08:36:02 -0500 as
excerpted:

> On 2015-12-16 21:09, Christoph Anton Mitterer wrote:

>> On Tue, 2015-12-15 at 11:00 -0500, Austin S. Hemmelgarn wrote:

>>> nodatacow only [avoids fragmentation] if the file is
>>> pre-allocated, if it isn't, then it still ends up fragmented.

>> Hmm is that "it may end up fragmented" or "it will definitely"? Cause
>> I'd have hoped that, if nothing else had been written in the meantime,
>> btrfs would perhaps try to write next to the already allocated blocks.

> If there are multiple files being written, then there is a relatively
> high probability that they will end up fragmented if they are more than
> about 64k and aren't pre-allocated.

Does the 30-second-by-default commit window (and similarly 30-second-
default dirty-flush-time at the VFS level) modify this at all?  It has 
been my assumption that same-file writes accumulated during this time 
should merge, increasing efficiency and decreasing fragmentation (both 
with and without nocow), tho of course further writes outside this 30-
second window will likely trigger fragmentation, if other files have 
been written in parallel or in the meantime.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: dear developers, can we have notdatacow + checksumming, plz?

2015-12-22 Thread Austin S. Hemmelgarn

On 2015-12-22 04:12, Duncan wrote:

> Austin S. Hemmelgarn posted on Mon, 21 Dec 2015 08:36:02 -0500 as
> excerpted:
>
>> On 2015-12-16 21:09, Christoph Anton Mitterer wrote:
>>
>>> On Tue, 2015-12-15 at 11:00 -0500, Austin S. Hemmelgarn wrote:
>>>
>>>> nodatacow only [avoids fragmentation] if the file is
>>>> pre-allocated, if it isn't, then it still ends up fragmented.
>>>
>>> Hmm is that "it may end up fragmented" or "it will definitely"? Cause
>>> I'd have hoped that, if nothing else had been written in the meantime,
>>> btrfs would perhaps try to write next to the already allocated blocks.
>>
>> If there are multiple files being written, then there is a relatively
>> high probability that they will end up fragmented if they are more than
>> about 64k and aren't pre-allocated.
>
> Does the 30-second-by-default commit window (and similarly 30-second-
> default dirty-flush-time at the VFS level) modify this at all?  It has
> been my assumption that same-file writes accumulated during this time
> should merge, increasing efficiency and decreasing fragmentation (both
> with and without nocow), tho of course further writes outside this 30-
> second window will likely trigger fragmentation, if other files have
> been written in parallel or in the meantime.

I think it does, but not much, and it depends on the workload.  I do 
notice less fragmentation on the filesystems I increase the commit 
window on, and more on the ones I decrease it on, but the difference is 
pretty small as long as you use something reasonable (I've never tested 
anything higher than 300 seconds, and I rarely go above 60).  My guess based on 
what the commit window is for (namely, it's the amount of time the log 
tree gets updated before forcing a transaction to be committed) would be 
that it has less effect if stuff is regularly calling fsync().
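
For anyone who wants to experiment with these two knobs — the btrfs commit
interval and the VFS dirty-writeback expiry — here is a minimal, hedged
sketch of adjusting both from C.  The mount point and values are hypothetical,
and a real tool would also re-apply the filesystem's existing mount flags when
remounting:

/* Sketch only: lengthen the btrfs transaction commit interval and the
 * VFS dirty-page expiry.  Roughly equivalent to
 *   mount -o remount,commit=60 /mnt
 *   sysctl vm.dirty_expire_centisecs=6000
 * Mount point and values are illustrative; needs root. */
#include <stdio.h>
#include <sys/mount.h>

int main(void)
{
    /* Remount with a 60-second commit interval (btrfs default is 30).
     * A real tool should also pass the mount flags already in effect. */
    if (mount(NULL, "/mnt", NULL, MS_REMOUNT, "commit=60") != 0)
        perror("remount with commit=60");

    /* Raise the VFS dirty expiry from the default 30s to 60s. */
    FILE *f = fopen("/proc/sys/vm/dirty_expire_centisecs", "w");
    if (!f || fprintf(f, "6000\n") < 0)
        perror("dirty_expire_centisecs");
    if (f)
        fclose(f);
    return 0;
}

Whether a longer interval actually reduces fragmentation depends on the
workload, as noted above, and frequent fsync() callers will largely bypass it.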



Re: dear developers, can we have notdatacow + checksumming, plz?

2015-12-21 Thread Austin S. Hemmelgarn

On 2015-12-16 21:09, Christoph Anton Mitterer wrote:

> On Tue, 2015-12-15 at 11:00 -0500, Austin S. Hemmelgarn wrote:
>
>>> Well sure, I think we'd done most of this and have dedicated
>>> controllers, at least of a quality that funding allows us ;-)
>>> But regardless how much one tunes, and how good the hardware is. If
>>> you'd then always lose a fraction of your overall IO, and be it just
>>> 5%, to defragging these types of files, one may actually want to
>>> avoid this at all, for which nodatacow seems *the* solution.
>>
>> nodatacow only works for that if the file is pre-allocated, if it
>> isn't, then it still ends up fragmented.
>
> Hmm is that "it may end up fragmented" or "it will definitely"? Cause
> I'd have hoped that, if nothing else had been written in the meantime,
> btrfs would perhaps try to write next to the already allocated blocks.
If there are multiple files being written, then there is a relatively 
high probability that they will end up fragmented if they are more than 
about 64k and aren't pre-allocated.
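
As an illustration of what "pre-allocated" means here, the following is a
minimal sketch (the path and size are hypothetical) of creating a file that
is marked NOCOW while still empty and then fully pre-allocated, so that later
in-place writes land in already-reserved extents:

/* Sketch: create an empty file, mark it NOCOW, then pre-allocate its
 * full size so later in-place writes land in already-reserved extents.
 * Path and size are illustrative only. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/fs.h>      /* FS_IOC_GETFLAGS, FS_IOC_SETFLAGS, FS_NOCOW_FL */
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>

int main(void)
{
    const char *path = "/srv/vm/disk0.img";       /* hypothetical */
    const off_t size = 20LL * 1024 * 1024 * 1024; /* 20 GiB, illustrative */

    int fd = open(path, O_RDWR | O_CREAT | O_EXCL, 0600);
    if (fd < 0) { perror("open"); return 1; }

    /* On btrfs the NOCOW flag only takes effect while the file is still
     * zero-length, so set it before writing or allocating anything. */
    int flags = 0;
    if (ioctl(fd, FS_IOC_GETFLAGS, &flags) == 0) {
        flags |= FS_NOCOW_FL;
        if (ioctl(fd, FS_IOC_SETFLAGS, &flags) != 0)
            perror("FS_IOC_SETFLAGS (nocow)");
    }

    /* Reserve the whole file up front. */
    if (fallocate(fd, 0, 0, size) != 0)
        perror("fallocate");

    close(fd);
    return 0;
}

The same effect is usually achieved with chattr +C on an empty file or
directory plus fallocate(1); the point is simply that the space is reserved
before the random internal writes begin.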




>>>> The problem is not entirely the lack of COW semantics, it's also
>>>> the fact that it's impossible to implement an atomic write on a
>>>> hard disk.
>>>
>>> Sure... but that's just the same for the nodatacow writes of data.
>>> (And the same, AFAIU, for CoW itself, just that we'd notice any
>>> corruption in case of a crash due to the CoWed nature of the fs and
>>> could go back to the last generation).
>>
>> Yes, but it's also the reason that using either COW or a log-
>> structured filesystem (like NILFS2, LogFS, or I think F2FS) is
>> important for consistency.
>
> So then it's no reason why it shouldn't work.
> The meta-data is CoWed, any incomplete writes of checksum data in that
> (be it for CoWed data or no-CoWed data, should the latter be
> implemented), would be protected at that level.
>
> Currently, the no-CoWed data is, AFAIU, completely at risk of being
> corrupted (no checksums, no journal).
>
> Checksums on no-CoWed data would just improve that.
Except that without COW semantics on the data blocks, you can't be sure 
whether the checksum is for the data that is there, the data that was 
going to be written there, or data that had been there previously.  This 
will significantly increase the chances of having false positives, which 
really isn't a viable tradeoff.
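
For contrast, this is roughly how applications that manage their own pages
sidestep the ambiguity described above: the checksum is embedded in the same
page it protects, so data and checksum are written in one request and a torn
or stale page fails verification as a unit.  The layout below is a
hypothetical illustration (zlib's CRC-32 standing in for whatever a real
engine uses), not any particular database's format:

/* Sketch of the application-level approach: the checksum lives inside
 * the page it protects, so a torn or stale page fails verification as a
 * unit.  Hypothetical 8 KiB page layout; link with -lz. */
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>
#include <zlib.h>

#define PAGE_SIZE 8192

struct page {
    uint32_t crc;                     /* covers everything after this field */
    uint8_t  payload[PAGE_SIZE - sizeof(uint32_t)];
};

static void page_seal(struct page *p)
{
    p->crc = (uint32_t)crc32(0L, p->payload, sizeof(p->payload));
}

static int page_ok(const struct page *p)
{
    return p->crc == (uint32_t)crc32(0L, p->payload, sizeof(p->payload));
}

/* Write page number 'no' with a single pwrite(); data and checksum
 * therefore travel in the same request. */
static int page_write(int fd, uint64_t no, struct page *p)
{
    page_seal(p);
    return pwrite(fd, p, sizeof(*p),
                  (off_t)(no * sizeof(*p))) == sizeof(*p) ? 0 : -1;
}

static int page_read(int fd, uint64_t no, struct page *p)
{
    if (pread(fd, p, sizeof(*p), (off_t)(no * sizeof(*p))) != sizeof(*p))
        return -1;
    return page_ok(p) ? 0 : -1;       /* -1: torn/corrupt page detected */
}

Note this only detects a torn or stale page; recovering the previous contents
still requires the engine's own journal/WAL.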




>>> What about VMs? At least a quick google search didn't give me any
>>> results on whether there would be e.g. checksumming support for
>>> qcow2.
>>> For raw images there surely is not.
>>
>> I don't mean that the VMM does checksumming, I mean that the guest OS
>> should be the one to handle the corruption.  No sane OS doesn't run
>> at least some form of consistency checks when mounting a filesystem.
>
> Well but we're not talking about having a filesystem that "looks clean"
> here. For this alone we wouldn't need any checksumming at all.
>
> We talk about data integrity protection, i.e. all files and their
> contents. Nothing which an fsck inside a guest VM would ever notice, if
> there are just some bit flips or things like that.
That really depends on what is being done inside the VM.  If you're 
using BTRFS or even dm-verity, you should have no issues detecting the 
corruption.





>>> And even if DBs do some checksumming now, it may be just a
>>> consequence of that missing in the filesystems.
>>> As I've written somewhere else in the previous mail: it's IMHO much
>>> better if one system takes care of this, where the code is well
>>> tested, than each application doing its own thing.
>>
>> That's really a subjective opinion.  The application knows better
>> than we do what type of data integrity it needs, and can almost
>> certainly do a better job of providing it than we can.
>
> Hmm I don't see that.
> When we, at the filesystem level, provide data integrity, then all data
> is guaranteed to be valid.
> What more should an application be able to provide? At best they can do
> the same thing faster, but even for that I see no immediate reason to
> believe it.
Any number of things.  As of right now, there are no local filesystems 
on Linux that provide:
1. Cryptographic verification of the file data (Technically possible 
with IMA and EVM, or with DM-Verity (if the data is supposed to be 
read-only), but those require extra setup, and aren't part of the FS).
2. Erasure coding other than what is provided by RAID5/6 (At least one 
distributed cluster filesystem provides this (Ceph), but running such a 
FS on a single node is impractical).
3. Efficient transactional logging (for example, the type that is needed 
by most RDBMS software).
4. Easy selective protections (Some applications need only part of their 
data protected).


Item 1 can't really be provided by BTRFS under its current design, it 
would require at least implementing support for cryptographically secure 
hashes in place of CRC32c (and each attempt to do that has been pretty 
much shot down).  Item 2 is possible, and is something I would love to 
see support for, but would require a significant amount of coding, and 
almost certainly wouldn't be anywhere near as flexible as letting 
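
Returning to item 4 (selective protection): a hedged sketch of how an
application can opt individual files into its own integrity checking, for
example by stashing a CRC-32 of the current contents in a user extended
attribute.  The attribute name and the choice of CRC are purely illustrative,
not any existing tool's convention:

/* Sketch: per-file, application-managed integrity tag stored in a user
 * xattr.  The attribute name "user.app.crc32" and the use of CRC-32 are
 * illustrative only.  Link with -lz. */
#include <stdint.h>
#include <sys/xattr.h>
#include <unistd.h>
#include <zlib.h>

static int file_crc(int fd, uint32_t *out)
{
    unsigned char buf[65536];
    uLong crc = crc32(0L, Z_NULL, 0);
    ssize_t n;

    if (lseek(fd, 0, SEEK_SET) < 0)
        return -1;
    while ((n = read(fd, buf, sizeof(buf))) > 0)
        crc = crc32(crc, buf, (uInt)n);
    if (n < 0)
        return -1;
    *out = (uint32_t)crc;
    return 0;
}

/* Record the checksum after a successful update... */
int seal_file(int fd)
{
    uint32_t crc;
    if (file_crc(fd, &crc))
        return -1;
    return fsetxattr(fd, "user.app.crc32", &crc, sizeof(crc), 0);
}

/* ...and verify it before trusting the contents again. */
int verify_file(int fd)
{
    uint32_t stored, now;
    if (fgetxattr(fd, "user.app.crc32", &stored, sizeof(stored)) != sizeof(stored))
        return -1;                    /* never sealed, or xattr lost */
    if (file_crc(fd, &now))
        return -1;
    return stored == now ? 0 : 1;     /* 1: contents don't match the seal */
}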

Re: dear developers, can we have notdatacow + checksumming, plz?

2015-12-16 Thread Christoph Anton Mitterer
On Tue, 2015-12-15 at 11:00 -0500, Austin S. Hemmelgarn wrote:
> > Well sure, I think we'd done most of this and have dedicated
> > controllers, at least of a quality that funding allows us ;-)
> > But regardless how much one tunes, and how good the hardware is. If
> > you'd then always lose a fraction of your overall IO, and be it just
> > 5%, to defragging these types of files, one may actually want to
> > avoid this at all, for which nodatacow seems *the* solution.
> nodatacow only works for that if the file is pre-allocated, if it
> isn't, then it still ends up fragmented.
Hmm is that "it may end up fragmented" or "it will definitely"? Cause
I'd have hoped that, if nothing else had been written in the meantime,
btrfs would perhaps try to write next to the already allocated blocks.


> > > The problem is not entirely the lack of COW semantics, it's also
> > > the fact that it's impossible to implement an atomic write on a
> > > hard disk.
> > Sure... but that's just the same for the nodatacow writes of data.
> > (And the same, AFAIU, for CoW itself, just that we'd notice any
> > corruption in case of a crash due to the CoWed nature of the fs and
> > could go back to the last generation).
> Yes, but it's also the reason that using either COW or a log-
> structured filesystem (like NILFS2, LogFS, or I think F2FS) is
> important for consistency.
So then it's no reason why it shouldn't work.
The meta-data is CoWed, any incomplete writes of checksum data in that
(be it for CoWed data or no-CoWed data, should the latter be
implemented), would be protected at that level.

Currently, the no-CoWed data is, AFAIU, completely at risk of being
corrupted (no checksums, no journal).

Checksums on no-CoWed data would just improve that.


> > What about VMs? At least a quick google search didn't give me any
> > results on whether there would be e.g. checksumming support for
> > qcow2.
> > For raw images there surely is not.
> I don't mean that the VMM does checksumming, I mean that the guest OS
> should be the one to handle the corruption.  No sane OS doesn't run
> at least some form of consistency checks when mounting a filesystem.
Well but we're not talking about having a filesystem that "looks clean"
here. For this alone we wouldn't need any checksumming at all.

We talk about data integrity protection, i.e. all files and their
contents. Nothing which an fsck inside a guest VM would ever notice, if
there are just some bit flips or things like that.


> > 
> > And even if DBs do some checksumming now, it may be just a
> > consequence of that missing in the filesystems.
> > As I've written somewhere else in the previous mail: it's IMHO much
> > better if one system takes care of this, where the code is well
> > tested, than each application doing its own thing.
> That's really a subjective opinion.  The application knows better
> than we do what type of data integrity it needs, and can almost
> certainly do a better job of providing it than we can.
Hmm I don't see that.
When we, at the filesystem level, provide data integrity, then all data
is guaranteed to be valid.
What more should an application be able to provide? At best they can do
the same thing faster, but even for that I see no immediate reason to
believe it.

And in practice it seems far more likely that if countless applications
each handle such a task on their own, it's more error prone (that's why
we have libraries for all kinds of code, trying to reuse code,
minimising the possibility of errors in countless home-brew solutions),
or not done at all.


> > > > - the data was written out correctly, but before the csum was
> > > >   written the system crashed, so the csum would now tell us
> > > >   that the block is bad, while in reality it isn't.
> > > There is another case to consider, the data got written out, but
> > > the crash happened while writing the checksum (so the checksum was
> > > partially written, and is corrupt).  This means we get a false
> > > positive on a disk error that isn't there, even when the data is
> > > correct, and that should be avoided if at all possible.
> > I've had that, and I've left it quoted above.
> > But as I've said before: That's one case out of many? How likely is
> > it that the crash happens exactly after a large data block has been
> > written followed by a relatively tiny amount of checksum data.
> > I'd assume it's far more likely that the crash happens during
> > writing the data.
> Except that the whole metadata block pointing to that data block gets
> rewritten, not just the checksum.
But that's the case anyway, isn't it? With or without checksums.



> > And regarding "reporting data to be in error, which is actually
> > correct"... isn't that what all journaling systems may do?
> No, most of them don't actually do that.  The general design of a 
> journaling filesystem is that 

Re: dear developers, can we have notdatacow + checksumming, plz?

2015-12-16 Thread Duncan
Austin S. Hemmelgarn posted on Tue, 15 Dec 2015 11:00:40 -0500 as
excerpted:

> And in particular, the only
> journaling filesystem that I know of that even allows the option of
> journaling the file contents instead of just metadata is ext4.

IIRC, ext3 was the first to have it in Linux mainline, with data=writeback 
for the speed freaks that don't care about data loss, data=ordered as the 
default normal option (except for that infamous period when Linus lost 
his head and let people talk him into switching to data=writeback, 
despite the risks... he later came back to his senses and reverted that), 
and data=journal for the folks that were willing to trade a bit of 
speed for better data protection (tho it was famous for surprising 
everybody, in that in certain use-cases it was extremely fast, faster 
than data=writeback, something I don't think was ever fully explained).

To my knowledge ext3 still has that, tho I haven't used it in probably a 
decade.

Reiserfs has all three data= options as well, with data=ordered the 
default, tho it only had data=writeback initially.  While I've used 
reiserfs for years, it has always been with the default data=ordered 
since that was introduced, and I'd be surprised if data=journal had the 
same use-case speed advantage that it did on ext3, as it's too 
different.  Meanwhile, that early data=writeback default is where 
reiserfs got its ill repute for data loss, but it had long switched to 
data=ordered by default by the time Linus lost his senses and tried 
data=writeback by default on ext3.  Because I was on reiserfs from the 
data=writeback era, I was rather glad most kernel hackers didn't want to 
touch it by the time Linus let them talk him into data=writeback on ext3, 
and thus left reiserfs (which again had long been data=ordered by default 
by then) well enough alone.

But I did help a few people running ext3 trace down their new ext3 
stability issues to that bad data=writeback experiment, and persuaded 
them to specify data=ordered, which solved their problems, so indeed 
they /were/ data=writeback related.  And happily, Linus did eventually 
regain his senses and return ext3 to data=ordered by default once again.

And based on what you said, ext4 still has all three data= options, 
including data=journal.  But I wasn't sure on that myself (tho I would 
have assumed it inherited it from ext3) and thus am /definitely/ not sure 
whether it inherits ext3's data=journal speed advantages in certain 
corner-cases.

I have no idea whether other journaled filesystems allow choosing the 
journal level or not, tho.  I only know of those three.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: dear developers, can we have notdatacow + checksumming, plz?

2015-12-16 Thread Duncan
Austin S. Hemmelgarn posted on Tue, 15 Dec 2015 11:00:40 -0500 as
excerpted:

> AFAIUI, checksums are stored per-instance for every block.  This is
> important in a multi-device filesystem in case you lose a device, so
> that you still have a checksum for the block.  There should be no
> difference between extent layout and compression between devices
> however.

I don't believe that's quite correct.

What is correct, to the best of my knowledge, is that checksums are 
metadata, and thus have whatever duplication/parity level metadata is 
assigned.

For single devices, metadata is of course dup by default, so 2X the 
metadata and thus 2X the checksums, whether they cover the single data 
(single being effectively the only choice on a single device, at least 
thru 4.3, tho there's a patch adding dup data as an option that I think 
should be in 4.4) or the dup metadata itself.

For multiple devices, it's default raid1 metadata, default single data, 
so the picture doesn't differ much by default from the single-device 
default picture.  It's also possible to do single metadata, raidN data, 
which really doesn't make sense except for raid0 data, and thus I believe 
there's a warning about that sort of layout in newer mkfs.btrfs, or when 
lowering the metadata redundancy using balance filters.

But of course it's possible to do raid1 data and metadata, which would be 
two copies of each, regardless of the number of devices (except that it's 
2+, of course).  But the copies aren't 1:1 assigned.  That is, if they're 
equal generation, btrfs can read either checksum and apply it to either 
data/metadata block.  (Of course if they're not equal generation, btrfs 
will choose the higher one, thus covering the case of writing at the time 
of a crash, since either they will both be the same generation if the 
root block wasn't updated to the new one on either one yet, or one will 
be a higher/newer generation than the other, if it had already finished 
writing one but not the other at the time of the crash.)

This is why it's an extremely good idea if you have a pair of devices in 
raid1, and you mount one of them degraded/writable with the other 
unavailable for some reason, that you don't also mount the other one 
writable and then try to recombine them.  Chances are the generations 
wouldn't match and it'd pick the one with the higher generation, but if 
they did for some reason match, and both checksums were valid on their 
data, but the data differed... either one could be chosen, and a scrub 
might choose either one to fix the other, as well, which could in theory 
result in a file with intermixed blocks from the two different versions!

Just ensure that if one is mounted writable, it's the only one mounted 
writable if there's a chance of recombining, and you'll be fine, as it'll 
be the only one with advancing generations.  And if by some accident both 
are mounted writable separately, the best bet is to wipe one of them and 
then add it as a new device, if you're going to reintroduce it to 
the same filesystem.

Of course this gets a bit more complicated with 3+ device raid1, since 
currently, there's still only two copies of each block and two copies of 
the checksum, meaning there's at least one device without a copy of each 
block, and if the filesystem is mounted degraded writable repeatedly with 
a random device missing...

Similarly, the permutations can be calculated for the other raid types, 
and for mixed raid types like raid6 data (specified) and raid1 metadata 
(unspecified so the default used), but I won't attempt that here.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: dear developers, can we have notdatacow + checksumming, plz?

2015-12-15 Thread Austin S. Hemmelgarn

On 2015-12-14 22:15, Christoph Anton Mitterer wrote:

> On Mon, 2015-12-14 at 09:16 -0500, Austin S. Hemmelgarn wrote:
>
>>> When one starts to get a bit deeper into btrfs (from the admin/end-
>>> user side) one sooner or later stumbles across the
>>> recommendation/need to use nodatacow for certain types of data
>>> (DBs, VM images, etc.) and the reason, AFAIU, being the inherent
>>> fragmentation that comes along with the CoW, which is especially
>>> noticeable for those types of files with lots of random internal
>>> writes.
>>
>> It is worth pointing out that in the case of DBs at least, this is
>> because at least some of them do COW internally to provide the
>> transactional semantics that are required for many workloads.
>
> Guess that also applies to some VM images then, IIRC qcow2 does CoW.

Yep, and I think that VMware's image format does too.

>>> a) for performance reasons (when I consider our research software
>>> which often has IO as the limiting factor and where we want as much
>>> IO being used by actual programs as possible)...
>>
>> There are other things that can be done to improve this.  I would
>> assume of course that you're already doing some of them (stuff like
>> using dedicated storage controller cards instead of the stuff on the
>> motherboard), but some things often get overlooked, like actually
>> taking the time to fine-tune the I/O scheduler for the workload
>> (Linux has particularly brain-dead default settings for CFQ, and the
>> deadline I/O scheduler is only good in hard-real-time usage or on
>> small hard drives that actually use spinning disks).
>
> Well sure, I think we'd done most of this and have dedicated
> controllers, at least of a quality that funding allows us ;-)
> But regardless how much one tunes, and how good the hardware is. If
> you'd then always lose a fraction of your overall IO, and be it just
> 5%, to defragging these types of files, one may actually want to avoid
> this at all, for which nodatacow seems *the* solution.

nodatacow only works for that if the file is pre-allocated, if it isn't, 
then it still ends up fragmented.

>> The big argument for defragmenting a SSD is that it makes it such
>> that you require fewer I/O requests to the device to read a file
>
> I've read about that too, but since I haven't had much personal
> experience or measurements in that respect, I didn't list it :)

I can't give any real numbers, but I've seen noticeable performance 
improvements on good SSDs (Intel, Samsung, and Crucial) when making 
sure that things are defragmented.

>> The problem is not entirely the lack of COW semantics, it's also the
>> fact that it's impossible to implement an atomic write on a hard
>> disk.
>
> Sure... but that's just the same for the nodatacow writes of data.
> (And the same, AFAIU, for CoW itself, just that we'd notice any
> corruption in case of a crash due to the CoWed nature of the fs and
> could go back to the last generation).

Yes, but it's also the reason that using either COW or a log-structured 
filesystem (like NILFS2, LogFS, or I think F2FS) is important for 
consistency.

>>> but I wouldn't know that relational DBs really do checksumming of
>>> the data.
>>
>> All the ones I know of except GDBM and BerkDB do in fact provide the
>> option of checksumming.  It's pretty much mandatory if you want to be
>> considered for usage in financial, military, or medical applications.
>
> Hmm I see... PostgreSQL seems to have it since 9.3 ... didn't know
> that... only crc16 but at least something.

>>> Long story short, it does happen every now and then, that a scrub
>>> shows file errors, for neither the RAID was broken, nor there were
>>> any block errors reported by the disks, or anything suspicious in
>>> SMART.
>>> In other words, silent block corruption.
>>
>> Or a transient error in system RAM that ECC didn't catch, or an
>> undetected error in the physical link layer to the disks, or an error
>> in the disk cache or controller, or any number of other things.
>
> Well sure,... I was referring to these particular cases, where silent
> block corruption was the most likely reason.
> The data was reproducibly read identical, which probably rules out bad
> RAM or controller, etc.

>> BTRFS could only protect against some cases, not all (for example, if
>> you have a big enough error in RAM that ECC doesn't catch it, you've
>> got serious issues that just about nothing short of a cold reboot can
>> save you from).
>
> Sure, I haven't claimed that checksumming for no-CoWed data is a
> solution for everything.

>>> But, AFAIU, not doing CoW, while not having a journal (or does it
>>> have one for these cases???) almost certainly means that the data
>>> (not necessarily the fs) will be inconsistent in case of a crash
>>> during a no-CoWed write anyway, right?
>>> Wouldn't it be basically like ext2?
>>
>> Kind of, but not quite.  Even with nodatacow, metadata is still COW,
>> which is functionally as safe as a traditional journaling filesystem
>> like XFS or ext4.
>
> Sure, I was referring to the data part only, should have made that more
> clear.

>> Absolute worst case scenario for both nodatacow on
>> BTRFS, and a traditional journaling 

Re: dear developers, can we have notdatacow + checksumming, plz?

2015-12-14 Thread Christoph Anton Mitterer
On Mon, 2015-12-14 at 17:42 +1100, Russell Coker wrote:
> My understanding of BTRFS is that the metadata referencing data blocks
> has the checksums for those blocks, then the blocks which link to that
> metadata (e.g. directory entries referencing file metadata) have
> checksums of those.
You mean basically, that all metadata is chained, right?

> For each metadata block there is a new version that is eventually
> linked from a new version of the tree root.
> 
> This means that the regular checksum mechanisms can't work with nocow
> data.  A filesystem can have checksums just pointing to data blocks but
> you need to cater for the case where a corrupt metadata block points to
> an old version of a data block and matching checksum.  The way that
> BTRFS works with an entire checksummed tree means that there's no
> possibility of pointing to an old version of a data block.
Hmm I'm not sure whether I understand that (or better said, I'm
probably sure I don't :D).

AFAIU, the metadata is always CoWed, right? So when a nodatacow file is
written, I'd assume its mtime is updated, which already leads to CoWing
of metadata... just that now, the checksums should be written as well.

If the metadata block is corrupt, then shouldn't that be noticed via the
csums on it?

And you said "The way that BTRFS works with an entire checksummed tree
means that there's no possibility of pointing to an old version of a
data block."... how would that work for nodatacow'ed blocks? If there
is a crash it cannot know whether it was still the old block or the new
one or any garbage in between?!


> NetApp's published research into hard drive errors indicates that they
> are usually in small numbers and located in small areas of the disk.
> So if BTRFS had a nocow file with any storage method other than dup you
> would have metadata and file data far enough apart that they are not
> likely to be hit by the same corruption (and the same thing would apply
> with most Ext4 Inode tables and data blocks).
Well, put aside any such research (whose results aren't guaranteed to
always be the case)... but that's just one of the reasons behind my
motivation for saying checksums for no-CoWed files would be great (I
used the multi-device example though, not DUP).


> I think that a file mode where there were checksums on data blocks
> with no checksums on the metadata tree would be useful.  But it would
> require a moderate amount of coding
Do you mean in general, or having this as a mode for nodatacow'ed
files?
Losing the metadata checksumming doesn't seem really much more
appealing than not having data checksumming :-(


> and there's lots of other things that the 
> developers are working on.
Sure, I just wanted to bring this to their attention... I already
imagined that they wouldn't drop their current work to do that, just
because of me whining for it ;-)


Thanks,
Chris.



Re: dear developers, can we have notdatacow + checksumming, plz?

2015-12-14 Thread Christoph Anton Mitterer
On Mon, 2015-12-14 at 09:16 -0500, Austin S. Hemmelgarn wrote:
> > When one starts to get a bit deeper into btrfs (from the admin/end-
> > user
> > side) one sooner or later stumbles across the recommendation/need
> > to
> > use nodatacow for certain types of data (DBs, VM images, etc.) and
> > the
> > reason, AFAIU, being the inherent fragmentation that comes along
> > with
> > the CoW, which is especially noticeable for those types of files
> > with
> > lots of random internal writes.
> It is worth pointing out that in the case of DBs at least, this is 
> because at least some of them do COW internally to provide the 
> transactional semantics that are required for many workloads.
Guess that also applies to some VM images then, IIRC qcow2 does CoW.



> > a) for performance reasons (when I consider our research software
> > which
> > often has IO as the limiting factor and where we want as much IO
> > being
> > used by actual programs as possible)...
> There are other things that can be done to improve this.  I would
> assume 
> of course that you're already doing some of them (stuff like using 
> dedicated storage controller cards instead of the stuff on the 
> motherboard), but some things often get overlooked, like actually
> taking 
> the time to fine-tune the I/O scheduler for the workload (Linux has 
> particularly brain-dead default settings for CFQ, and the deadline
> I/O 
> scheduler is only good in hard-real-time usage or on small hard
> drives 
> that actually use spinning disks).
Well sure, I think we'd done most of this and have dedicated
controllers, at least of a quality that funding allows us ;-)
But regardless how much one tunes, and how good the hardware is. If
you'd then always lose a fraction of your overall IO, and be it just
5%, to defragging these types of files, one may actually want to avoid
this at all, for which nodatacow seems *the* solution.


> The big argument for defragmenting a SSD is that it makes it such
> that 
> you require fewer I/O requests to the device to read a file
I've read about that too, but since I haven't had much personal
experience or measurements in that respect, I didn't list it :)


> The problem is not entirely the lack of COW semantics, it's also the
> fact that it's impossible to implement an atomic write on a hard
> disk. 
Sure... but that's just the same for the nodatacow writes of data.
(And the same, AFAIU, for CoW itself, just that we'd notice any
corruption in case of a crash due to the CoWed nature of the fs and
could go back to the last generation).


> > but I wouldn't know that relational DBs really do checksumming of the
> > data.
> All the ones I know of except GDBM and BerkDB do in fact provide the 
> option of checksumming.  It's pretty much mandatory if you want to be
> considered for usage in financial, military, or medical applications.
Hmm I see... PostgreSQL seems to have it since 9.3 ... didn't know
that... only crc16 but at least something.


> > Long story short, it does happen every now and then, that a scrub
> > shows
> > file errors, for neither the RAID was broken, nor there were any
> > block
> > errors reported by the disks, or anything suspicious in SMART.
> > In other words, silent block corruption.
> Or a transient error in system RAM that ECC didn't catch, or an 
> undetected error in the physical link layer to the disks, or an error
> in 
> the disk cache or controller, or any number of other things.
Well sure,... I was referring to these particular cases, where silent
block corruption was the most likely reason.
The data was reproducibly read identical, which probably rules out bad
RAM or controller, etc.


>   BTRFS 
> could only protect against some cases, not all (for example, if you
> have 
> a big enough error in RAM that ECC doesn't catch it, you've got
> serious 
> issues that just about nothing short of a cold reboot can save you
> from).
Sure, I haven't claimed that checksumming for no-CoWed data is a
solution for everything.


> > But, AFAIU, not doing CoW, while not having a journal (or does it
> > have
> > one for these cases???) almost certainly means that the data (not
> > necessarily the fs) will be inconsistent in case of a crash during
> > a
> > no-CoWed write anyway, right?
> > Wouldn't it be basically like ext2?
> Kind of, but not quite.  Even with nodatacow, metadata is still COW, 
> which is functionally as safe as a traditional journaling filesystem 
> like XFS or ext4.
Sure, I was referring to the data part only, should have made that more
clear.


> Absolute worst case scenario for both nodatacow on 
> BTRFS, and a traditional journaling filesystem, the contents of the
> file 
> are inconsistent.  However, almost all of the things that are 
> recommended use cases for nodatacow (primarily database files and VM 
> images) have some internal method of detecting and dealing with 
> corruption (because of the traditional filesystem semantics ensuring 
> metadata consistency, but not data 

Re: dear developers, can we have notdatacow + checksumming, plz?

2015-12-14 Thread Austin S. Hemmelgarn

On 2015-12-13 23:59, Christoph Anton Mitterer wrote:

> (consider that question being asked with that face on: http://goo.gl/LQaOuA)
>
> Hey.
>
> I've had some discussions on the list these days about not having
> checksumming with nodatacow (mostly with Hugo and Duncan).
>
> They both basically told me it wouldn't be straight possible with CoW,
> and Duncan thinks it may not be so much necessary, but none of them
> could give me really hard arguments, why it cannot work (or perhaps I
> was just too stupid to understand them ^^)... while at the same time I
> think that it would be generally utmost important to have checksumming
> (real world examples below).
>
> Also, I remember that in 2014, Ted Ts'o told me that there are some
> plans ongoing to get data checksumming into ext4, with possibly even
> some guy at RH actually doing it sooner or later.
>
> Since these threads were rather admin-work-centric, developers may have
> skipped it, therefore, I decided to write down some thoughts, label
> them with a more attractive subject and give it some bigger attention.
> O:-)
>
>
> 1) Motivation why it makes sense to have checksumming (especially also
> in the nodatacow case)
>
> I think of all major btrfs features I know of (apart from the CoW
> itself and having things like reflinks), checksumming is perhaps the
> one that distinguishes it the most from traditional filesystems.
>
> Sure we have snapshots, multi-device support and compression - but we
> could have had that as well with LVM and software/hardware RAID... (and
> ntfs supported compression IIRC ;) ).
> Of course, btrfs does all that in a much smarter way, I know, but it's
> nothing generally new.
> The *data* checksumming at filesystem level, to my knowledge, is
> however. Especially that it's always verified. Awesome. :-)
>
> When one starts to get a bit deeper into btrfs (from the admin/end-user
> side) one sooner or later stumbles across the recommendation/need to
> use nodatacow for certain types of data (DBs, VM images, etc.) and the
> reason, AFAIU, being the inherent fragmentation that comes along with
> the CoW, which is especially noticeable for those types of files with
> lots of random internal writes.

It is worth pointing out that in the case of DBs at least, this is 
because at least some of them do COW internally to provide the 
transactional semantics that are required for many workloads.

> Now Duncan implied, that this could improve in the future, with the
> auto-defragmentation getting (even) better, defrag becoming usable
> again for those that do snapshots or reflinked copies and btrfs itself
> generally maturing more and more.
> But I kinda wonder to what extent one will be really able to solve
> that, what seems to me a CoW-inherent "problem",...
> Even *if* one can make the auto-defrag much smarter, it would still
> mean that such files, like big DBs, VMs, or scientific datasets that
> are internally rewritten, may get more or less constantly defragmented.
> That may be quite undesired...
> a) for performance reasons (when I consider our research software which
> often has IO as the limiting factor and where we want as much IO being
> used by actual programs as possible)...

There are other things that can be done to improve this.  I would assume 
of course that you're already doing some of them (stuff like using 
dedicated storage controller cards instead of the stuff on the 
motherboard), but some things often get overlooked, like actually taking 
the time to fine-tune the I/O scheduler for the workload (Linux has 
particularly brain-dead default settings for CFQ, and the deadline I/O 
scheduler is only good in hard-real-time usage or on small hard drives 
that actually use spinning disks).

> b) SSDs...
> Not really sure about that; btrfs seems to enable the autodefrag even
> when an SSD is detected,... what is it doing? Placing the block in a
> smart way on different chips so that accesses can be better
> parallelised by the controller?

This really isn't possible with an SSD.  Except for NVMe and Open 
Channel SSDs, they use the same interfaces as a regular hard drive, 
which means you get absolutely no information about the data layout on 
the device.

The big argument for defragmenting a SSD is that it makes it such that 
you require fewer I/O requests to the device to read a file, and in most 
cases, the device will outlive its usefulness because of performance 
long before it dies due to wearing out the flash storage.

> Anyway, (a) could already be argument enough not to solve the problem
> by a smart [auto-]defrag, should that actually be implemented.
>
> So I think having nodatacow is great and not just a workaround till
> everything else gets better to handle these cases.
> Thus, checksumming, which is such a vital feature, should also be
> possible for that.
The problem is not entirely the lack of COW semantics, it's also the 
fact that it's impossible to implement an atomic write on a hard disk. 
If we could tell the disk 'ensure that this set of writes either all 
happen, or none of them happen', then we could do 
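
In the absence of such a primitive, the closest applications get today is
whole-file atomic replacement via the classic write-temp-fsync-rename
pattern; a minimal sketch with hypothetical paths:

/* Sketch: atomically replace the *contents* of a file by writing a
 * temporary file, fsync()ing it, and rename()ing it over the original.
 * rename() is atomic on POSIX filesystems, so readers see either the
 * old or the new contents, never a torn mix.  Paths are illustrative. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static int atomic_replace(const char *path, const char *tmp,
                          const void *buf, size_t len)
{
    int fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;
    if (write(fd, buf, len) != (ssize_t)len || fsync(fd) != 0) {
        close(fd);
        unlink(tmp);
        return -1;
    }
    close(fd);
    if (rename(tmp, path) != 0) {     /* the atomic step */
        unlink(tmp);
        return -1;
    }
    /* Strictly, the containing directory should be fsync()ed too so the
     * rename itself survives a crash; omitted here for brevity. */
    return 0;
}

int main(void)
{
    const char *msg = "new file contents\n";
    return atomic_replace("/tmp/example.conf", "/tmp/example.conf.tmp",
                          msg, strlen(msg)) ? 1 : 0;
}

Of course this is exactly the strategy that does not work for the nodatacow
use cases under discussion (large VM images and databases rewritten in
place), which is why those formats carry their own journals and checksums
instead.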

Re: dear developers, can we have notdatacow + checksumming, plz?

2015-12-13 Thread Russell Coker
On Mon, 14 Dec 2015 03:59:18 PM Christoph Anton Mitterer wrote:
> I've had some discussions on the list these days about not having
> checksumming with nodatacow (mostly with Hugo and Duncan).
> 
> They both basically told me it wouldn't be straight possible with CoW,
> and Duncan thinks it may not be so much necessary, but none of them
> could give me really hard arguments, why it cannot work (or perhaps I
> was just too stupid to understand them ^^)... while at the same time I
> think that it would be generally utmost important to have checksumming
> (real world examples below).

My understanding of BTRFS is that the metadata referencing data blocks has the 
checksums for those blocks, then the blocks which link to that metadata (e.g. 
directory entries referencing file metadata) have checksums of those.  For each 
metadata block there is a new version that is eventually linked from a new 
version of the tree root.

This means that the regular checksum mechanisms can't work with nocow data.  A 
filesystem can have checksums just pointing to data blocks but you need to 
cater for the case where a corrupt metadata block points to an old version of 
a data block and matching checksum.  The way that BTRFS works with an entire 
checksummed tree means that there's no possibility of pointing to an old 
version of a data block.
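
As a toy illustration of that last point (this is not btrfs's actual on-disk
format), consider a two-level checksum tree: each leaf checksums a data block
and the root checksums the leaves, so a stale or rewritten-in-place block
fails verification somewhere on the path to the last committed root:

/* Toy model of a checksummed tree (NOT the real btrfs format): leaves
 * carry crc32(data); the root carries crc32(all leaf checksums).  A
 * stale or corrupted data block fails verification somewhere on the
 * path to the committed root.  Link with -lz. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <zlib.h>

#define NBLOCKS 4
#define BLKSZ   16

struct tree {
    uint32_t leaf[NBLOCKS];   /* per-block checksums ("metadata") */
    uint32_t root;            /* checksum over the leaf array      */
};

static void commit_tree(struct tree *t, const unsigned char blocks[NBLOCKS][BLKSZ])
{
    for (int i = 0; i < NBLOCKS; i++)
        t->leaf[i] = (uint32_t)crc32(0L, blocks[i], BLKSZ);
    t->root = (uint32_t)crc32(0L, (const unsigned char *)t->leaf, sizeof(t->leaf));
}

static int verify_tree(const struct tree *t, const unsigned char blocks[NBLOCKS][BLKSZ])
{
    if (t->root != (uint32_t)crc32(0L, (const unsigned char *)t->leaf, sizeof(t->leaf)))
        return -1;                            /* stale/corrupt metadata */
    for (int i = 0; i < NBLOCKS; i++)
        if (t->leaf[i] != (uint32_t)crc32(0L, blocks[i], BLKSZ))
            return -1;                        /* stale/corrupt data */
    return 0;
}

int main(void)
{
    unsigned char blocks[NBLOCKS][BLKSZ] = { "aaaa", "bbbb", "cccc", "dddd" };
    struct tree t;

    commit_tree(&t, blocks);
    printf("clean tree verifies: %s\n", verify_tree(&t, blocks) == 0 ? "yes" : "no");

    /* Simulate an in-place (nocow-style) rewrite that never made it into
     * the committed tree: verification now fails at the leaf level. */
    memcpy(blocks[2], "CCCC", 4);
    printf("after stale block:   %s\n", verify_tree(&t, blocks) == 0 ? "yes" : "no");
    return 0;
}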

NetApp's published research into hard drive errors indicates that they are 
usually in small numbers and located in small areas of the disk.  So if BTRFS 
had a nocow file with any storage method other than dup you would have metadata 
and file data far enough apart that they are not likely to be hit by the same 
corruption (and the same thing would apply with most Ext4 Inode tables and 
data blocks).  I think that a file mode where there were checksums on data 
blocks with no checksums on the metadata tree would be useful.  But it would 
require a moderate amount of coding and there's lots of other things that the 
developers are working on.

-- 
My Main Blog http://etbe.coker.com.au/
My Documents Blog   http://doc.coker.com.au/