(consider that question being asked with that face on: http://goo.gl/LQaOuA)

Hey.

I've had some discussions on the list these days about not having
checksumming with nodatacow (mostly with Hugo and Duncan).

They both basically told me that it wouldn't be straightforwardly possible
without CoW, and Duncan thinks it may not even be all that necessary, but
neither of them could give me really hard arguments as to why it cannot work
(or perhaps I was just too stupid to understand them ^^)... while at the same
time I think it would be of utmost importance to have checksumming
(real-world examples below).

Also, I remember that in 2014, Ted Ts'o told me that there were plans
ongoing to get data checksumming into ext4, possibly even with some guy at
RH actually doing it sooner or later.

Since those threads were rather admin-work-centric, developers may have
skipped them; therefore, I decided to write down some thoughts & ideas,
label them with a more attention-grabbing subject, and give the topic some
bigger attention.
O:-)




1) Motivation: why it makes sense to have checksumming (especially also
in the nodatacow case)


I think that of all the major btrfs features I know of (apart from the CoW
itself and things like reflinks), checksumming is perhaps the one that
distinguishes it the most from traditional filesystems.

Sure, we have snapshots, multi-device support and compression - but we
could have had those with LVM and software/hardware RAID as well... (and
NTFS supported compression, IIRC ;) ).
Of course, btrfs does all that in a much smarter way, I know, but it's
nothing fundamentally new.
The *data* checksumming at the filesystem level, to my knowledge, is.
Especially since it's always verified. Awesome. :-)


When one starts to get a bit deeper into btrfs (from the admin/end-user
side), one sooner or later stumbles across the recommendation/need to use
nodatacow for certain types of data (DBs, VM images, etc.), the reason,
AFAIU, being the inherent fragmentation that comes along with CoW, which is
especially noticeable for files with lots of random internal writes.
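
To illustrate what I mean, here's a toy model only (made-up numbers, nothing
like actual btrfs internals): under CoW, every in-place overwrite gets
redirected to a freshly allocated block, so a randomly rewritten file
collects more and more extents, while a nodatacow file keeps its layout:

  import random

  FILE_BLOCKS = 1024   # logical size of the file, in blocks
  REWRITES    = 4096   # random internal writes (DB/VM-style workload)

  def count_extents(mapping):
      """Count runs of physically contiguous blocks (a rough 'fragment' count)."""
      extents = 1
      for i in range(1, len(mapping)):
          if mapping[i] != mapping[i - 1] + 1:
              extents += 1
      return extents

  # initial layout: one contiguous extent for both files
  cow_map       = list(range(FILE_BLOCKS))
  nodatacow_map = list(range(FILE_BLOCKS))
  next_free     = FILE_BLOCKS          # next unallocated physical block

  for _ in range(REWRITES):
      blk = random.randrange(FILE_BLOCKS)
      cow_map[blk] = next_free         # CoW: the write goes to a new physical block
      next_free += 1
      # nodatacow: overwritten in place, mapping unchanged

  print("CoW extents:      ", count_extents(cow_map))
  print("nodatacow extents:", count_extents(nodatacow_map))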

Now Duncan implied that this could improve in the future, with the
auto-defragmentation getting (even) better, defrag becoming usable again
for those who do snapshots or reflinked copies, and btrfs itself generally
maturing more and more.
But I kinda wonder to what extent one will really be able to solve what
seems to me a CoW-inherent "problem"...
Even *if* one can make the auto-defrag much smarter, it would still mean
that such files - big DBs, VMs, or scientific datasets that are internally
rewritten - may get more or less constantly defragmented.
That may be quite undesirable...
a) for performance reasons (when I consider our research software, which
often has IO as the limiting factor and where we want as much IO as
possible being used by the actual programs)...
b) on SSDs...
Not really sure about that one; btrfs seems to enable autodefrag even when
an SSD is detected... what is it doing there? Placing the blocks in a smart
way on different chips so that accesses can be better parallelised by the
controller?
Anyway, (a) alone could already be argument enough not to solve the problem
via a smart [auto-]defrag, should that actually be implemented.

So I think having nodatacow is great, and not just a workaround until
everything else gets better at handling these cases.
Thus checksumming, which is such a vital feature, should also be possible
for it.


Duncan also mentioned that in some of those cases, the integrity is
already protected by the application layer, making it less important to
have it at the fs layer.
Well, this may be true for file-sharing protocols, but I wouldn't know
that relational DBs really do checksumming of their data.
They have journals, of course, but these protect against crashes, not
against silent block errors and the like.
And I wouldn't know that VM hypervisors do checksumming (but perhaps
I've just missed that).

Here I can give a real-world example, from the Tier-2 that I run for the
LHC at work/university.
We have large amounts of storage (perhaps not as large as what Google and
Facebook have, or what the NSA stores about us)... but it's still some
~2 PiB, or a bit more.
That's managed with a special storage management software called dCache.
dCache even stores checksums, but per file, which means that for normal
reads they cannot be verified (well, technically it's supported, but with
our usual file sizes this doesn't work), so what remains are scrubs.
For the two PiB, we have roughly 50-60 nodes, each with something between
12 and 24 disks, usually in either one or two RAID6 volumes, with all
different kinds of hard disks.
And we run these scrubs quite rarely, since they cost IO that could be
used for actual computing jobs (a problem that wouldn't exist with the way
btrfs calculates the sums on read, since the data is being read anyway)...
so likely there are even more errors that are simply never noticed,
because the datasets are removed again before ever being scrubbed.
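
For illustration, here is roughly the difference I'm getting at (plain
Python with SHA-256 as a stand-in; btrfs uses CRC32C by default IIRC, and
none of this is actual dCache or btrfs code): a per-file checksum can only
be verified by reading the whole file (i.e. a scrub), while per-block
checksums can be verified on every partial read, essentially for free:

  import hashlib

  BLOCK = 4096

  def file_checksum(data):
      return hashlib.sha256(data).hexdigest()

  def block_checksums(data):
      return [hashlib.sha256(data[i:i + BLOCK]).hexdigest()
              for i in range(0, len(data), BLOCK)]

  data      = bytes(100 * BLOCK)       # pretend this is a stored file
  whole     = file_checksum(data)      # dCache-style: one sum per file
  per_block = block_checksums(data)    # btrfs-style: one sum per block

  # The application reads only block 42: with a per-file checksum there is
  # nothing to compare that single read against; with per-block checksums
  # we can verify exactly what was read.
  blk = data[42 * BLOCK:43 * BLOCK]
  assert hashlib.sha256(blk).hexdigest() == per_block[42]
  # Verifying 'whole' would require re-reading all 100 blocks - a scrub.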


Long story short, it does happen every now and then that a scrub shows
file errors for which neither the RAID was broken, nor were there any
block errors reported by the disks, nor anything suspicious in SMART.
In other words: silent block corruption.

One may rely on the applications to do integrity protection, but I think
that's not realistic, and perhaps it shouldn't be their task anyway (at
least not when it comes to storage-device block errors and the like).

I don't think it's on the horizon that things like DBs or large scientific
data formats will do their own integrity protection (i.e. one that protects
against bad blocks, and not just journalling that preserves consistency in
case of crashes).
And handling that at the fs level is quite nice anyway, I think.
It means that countless applications don't each need to handle this at the
application layer, making it configurable whether it should be enabled (for
integrity protection) or disabled (for more speed), with each of them
writing a lot of code for that.
If we can control that at the fs layer, by setting datasum/nodatasum,
everything needed is already there - except that, as of now, nodatacow'ed
stuff is excluded in btrfs.





2) Technical


Okay, the following is obviously based on my naive view of how things
could work, which may not necessarily line up with how an actual fs
developer sees things ;-)

As said in the introduction, I can't quite believe that data checksumming
should in principle be possible for ext4, but not for btrfs' non-CoWed
parts.


Duncan & Hugo said the reason is basically that btrfs cannot do checksums
without CoW, because there's no guarantee that the fs doesn't end up in an
inconsistent state...

But, AFAIU, not doing CoW while not having a journal (or does it have one
for these cases???) almost certainly means that the data (not necessarily
the fs) will be inconsistent in case of a crash during a non-CoWed write
anyway, right?
Wouldn't that be basically like ext2?

Or take the multi-device case, e.g. RAID1: multiple copies of the same
blocks, and a crash happened while writing them (non-CoWed and
non-checksummed)...
Again, it's almost certain that at least one (maybe even both) of the
copies contains garbage, and likely (at least a 50% chance) we get that
one when the actual read happens later (I was told btrfs would behave in
these cases like e.g. MD RAID does: deliver whatever the first readable
block says).
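
A toy simulation of that point (made-up payloads, nothing btrfs-specific):
if the two mirrors disagree after an interrupted write and there are no
checksums, a "first readable copy wins" read returns the wrong one about
half the time, with no way to notice:

  import random

  TRIALS = 10_000
  bad_reads = 0
  for _ in range(TRIALS):
      copies = [b"new data", b"torn/old data"]   # mirrors disagree after the crash
      random.shuffle(copies)                     # which disk happens to answer first
      delivered = copies[0]                      # no checksum -> no way to choose
      if delivered != b"new data":
          bad_reads += 1

  print(f"wrong copy delivered in {bad_reads / TRIALS:.0%} of reads")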

If btrfs calculated checksums and wrote them e.g. after or before the
actual data was written... what would be the worst that could happen (in
my naive understanding, of course ;-) ) in a crash? (A small simulation of
these cases follows below.)
- I'd say either one is lucky, and checksum and data match.
  Yay.
- Or they don't match, which could boil down to the following two
  cases:
  - the data wasn't written out correctly and is actually garbage
    => then we can be happy that the checksum doesn't match and we'd
       get an error
  - the data was written out correctly, but the system crashed before
    the csum was written, so the csum would now tell us that the block
    is bad, while in reality it isn't.
    Or the other way round:
    the csum was written out (completely)... and no data was written
    at all before the system crashed (so the old block would still be
    completely there)
    => in both cases: so what? That particular case happening is
       probably far less likely than csumming actually detecting a bad
       block, or incompletely written data in case of a crash.
       (Not to mention all the cases where nothing crashes, and where
       we simply want to detect block errors, bus errors, etc.)
=> Of course it wouldn't be as nice as with CoW, where btrfs could
   simply take the most recent consistent state of that block, but
   it's still way better than:
   - delivering bogus data to the application in n other cases
   - not being able to decide which of m block copies is valid when a
     RAID is scrubbed
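
Here's the small simulation of those cases I promised (a thought-experiment
model only, not how btrfs actually orders or journals its writes): whatever
point the crash hits, a verifying read either succeeds or flags an error -
it never silently returns garbage:

  import hashlib

  def csum(b):
      return hashlib.sha256(b).hexdigest()

  old_data, new_data = b"old block", b"new block"

  # (data on disk, csum on disk) after a crash at each possible point,
  # covering both write orderings (data first or csum first):
  crash_states = {
      "crash before anything was written":     (old_data, csum(old_data)),
      "data written, crash before csum":       (new_data, csum(old_data)),
      "csum written first, crash before data": (old_data, csum(new_data)),
      "both data and csum written":            (new_data, csum(new_data)),
      "torn/garbage data write":               (b"garb\x00ge", csum(old_data)),
  }

  for name, (data, stored) in crash_states.items():
      ok = csum(data) == stored
      print(f"{name:40s} -> read {'verifies' if ok else 'flags a csum error'}")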

And as said before, AFAIU, nodatacow'ed files have no journal in btrfs as
they would in ext3/4, so such files, when being written during a crash,
may end up in any state anyway, right? Which makes not having a csum sound
even worse, since nothing tells us that such a file is possibly bad.


Not having checksumming seems to be especially bad in the multi-device
case... what happens when one runs a scrub? AFAIU, it simply does what
e.g. MD does: take the first readable block and write it to all others,
thereby possibly destroying the actually good copy?
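
A last sketch of that difference (illustrative only, again not actual
btrfs or MD code): with stored per-block checksums a scrub can tell which
mirror copy is good and repair the other; without them it can only pick a
copy blindly and may propagate the corrupted one:

  import hashlib

  def csum(b):
      return hashlib.sha256(b).hexdigest()

  good, corrupted = b"payload", b"payl\x00ad"   # one mirror got silently corrupted
  stored_csum = csum(good)                      # what the fs recorded at write time

  def scrub_with_csums(copy_a, copy_b):
      for copy in (copy_a, copy_b):
          if csum(copy) == stored_csum:
              return copy                       # rewrite the other mirror from this one
      raise IOError("both copies bad: report it, don't guess")

  def scrub_without_csums(copy_a, copy_b):
      return copy_a                             # "first readable copy" wins, right or wrong

  print(scrub_with_csums(corrupted, good))      # -> b'payload', repaired correctly
  print(scrub_without_csums(corrupted, good))   # -> the bad copy gets propagated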



Not sure whether the following would make any practical sense:
if data checksumming worked for nodatacow, then maybe some people would
even choose to run btrfs in a "CoW1" mode... they could still have most of
the fancy btrfs features (checksumming, snapshots, perhaps even reflinked
copies?), but unless snapshots or reflinked copies are explicitly made,
btrfs wouldn't do CoW.


Well, thanks for spending (hopefully not wasting ;-) ) your time on
reading my X-Mas wish ;)

Cheers,
Chris.
