On Wed, 2015-12-09 at 13:36 +0000, Duncan wrote:
> Answering the BTW first, not to my knowledge, and I'd be skeptical.
> In general, btrfs is cowed, and that's the focus.  To the extent that
> nocow is necessary for fragmentation/performance reasons, etc, the
> idea is to try to make cow work better in those cases, for example by
> working on autodefrag to make it better at handling large files
> without the scaling issues it currently has above half a gig or so,
> and thus to confine nocow to a smaller and smaller niche use-case,
> rather than focusing on making nocow better.
>
> Of course it remains to be seen how much better they can do with
> autodefrag, etc, but at this point, there's way more project
> possibilities than people to develop them, so even if they do find
> they can't make cow work much better for these cases, actually
> working on nocow would still be rather far down the list, because
> there's so many other improvement and feature opportunities that will
> get the focus first.
>
> Which in practice probably puts it in "it'd be nice, but it's low
> enough priority that we're talking five years out or more, unless of
> course someone else qualified steps up and that's their personal itch
> they want to scratch", territory.
I guess I'll split out my answer on that into a fresh thread about
checksums for nodatacow later, hoping to attract some more devs there
:-)

I think, however (again with my naive understanding of how CoW works
and what it inherently implies), that there cannot be a really good
solution to the fragmentation problem for DB/etc. files.

And as such, I'd think that having a checksumming feature for
nodatacow as well, even if it's not perfect, is definitely worth it.


> As for the updated checksum after modification, the problem with
> that is that in the mean time, the checksum wouldn't verify,
Well, one could implement some locking... but I don't see the general
problem here: if the block is still being written (and I count
updating the metadata, including the checksum, as part of that) it
cannot be read anyway, can it? It may be only half written, and the
data returned would be garbage.


> and while btrfs could of course keep status in memory during normal
> operations, that's not the problem, the problem is what happens if
> there's a crash and in-memory state vaporizes.  In that case, when
> btrfs remounted, it'd have no way of knowing why the checksum didn't
> match, just that it didn't, and would then refuse access to that
> block in the file, because for all it knows, it /is/ a block error.
And this would only happen in the rare case that something crashes,
where it's quite likely anyway that this no-CoWed block will be
garbage.
I'll talk about that more in the separate thread... so let's move
things there.


> Same here.  In fact, my most anticipated feature is N-way-mirroring, 
Hmm... not totally sure about that...
AFAIU, N-way mirroring is what the currently wrongly-called RAID1 in
btrfs is, i.e. having N replicas of everything on M devices, right?
In other words, it's not an N-parity RAID and doesn't guarantee that
*any* N disks could fail, right?

Hmm, I guess that would definitely be nice to have, especially since
then we could have true RAID1, i.e. N=M.

But it's probably mostly important for those scenarios where either
resilience matters a lot... and/or those where write speed doesn't
matter much but read speed does, right?

Taking the example of our use case at the university, i.e. the LHC
Tier-2 we run... that would be rather uninteresting.
We typically have storage nodes (and many of them) of, say, 16-24
devices, and based on funding constraints, resilience concerns and IO
performance, we place them in RAID6 (yeah, I know, RAID5 is faster,
but even with hot spares in place, in practice that too often led to
lost RAIDs).

Especially for the bigger nodes, with more disks, we'd rather have an
N-parity RAID (where any N disks can fail)... of course performance
considerations may kill that desire again ;)
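
Just to illustrate what I mean (a rough sketch, with made-up device
names, and only AFAIU of what current btrfs-progs accept):

    # current btrfs "raid1": exactly two copies of each chunk, no
    # matter how many devices are in the filesystem
    mkfs.btrfs -m raid1 -d raid1 /dev/sdb /dev/sdc /dev/sdd

    # roughly the btrfs equivalent of what we do now: dual parity,
    # i.e. any two devices may fail
    mkfs.btrfs -m raid1 -d raid6 /dev/sd[b-q]

True N-way mirroring (up to N=M) or even an N-parity profile would go
beyond both of these.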


> It is a big and basic feature, but turning it off isn't the end of
> the world, because then it's still the same level of reliability
> other solutions such as raid generally provide.
Sure... I never meant it as a "loss compared to what we already have
in other systems"... but as a "loss compared to how awesome[0] btrfs
could be" ;-)


> But as it happens, both VM image management and databases tend to
> come with their own integrity management, in part precisely because
> the filesystem could never provide that sort of service.
Well, that's only partially true, to my knowledge.
a) I'm not aware that hypervisors do that at all.
b) DBs of course have their journal, but that protects only against
crashes... not against bad blocks, nor does it help you decide which
copy of a block is good when you have multiple.


> After all, you can always decide not to run it if you're worried
> about the space effects it's going to have
Hmm, well... the manpage actually does mention that it blows up when
snapshots are used... at least in some technical language...
So... you're possibly right here... though I guess many may just do
btrfs filesystem --help, which says not a word about the possibly
grave effects of defrag.
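
For reference, the command I mean is roughly this (path made up), and
AFAIU, as long as defrag isn't snapshot-aware, it breaks the sharing
with snapshotted extents so the data gets effectively duplicated:

    # recursive manual defrag of a subvolume; with snapshots present
    # this can blow up the space usage accordingly
    btrfs filesystem defragment -r /srv/vm-images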


> But even at that point, while snapshot-aware-defrag is still on the
> list, I'm not sure if it's ever going to be actually viable.  It may
> be that the scaling issues are just too big, and it simply can't be
> made to work both correctly and in anything approaching practical
> time.
Well, I shall hope not :)


> Worst-case, you set nocow and turn off snapshotting, but that's
> exactly the situation you're in anyway with other filesystems, so
> you're no worse off than if you were using them.
> Meanwhile, where those btrfs features *can* be used, which is on
> /most/ files, with only limited exceptions, it's all upside! =:^)
Sure :D ... but that doesn't mean we shouldn't try to maximise the
upside cases where possible :-)


> FWIW, I've seen it asserted that autodefrag is snapshot aware a few
> times now, but I'm not personally sure that is the case and I don't
> see any immediately obvious reason it would be, when (manual) defrag
> isn't, so I've refrained from making that claim, myself.  If I were
> to see multiple devs make that assertion, I'd be more confident, but
> I believe I've only seen it from Hugo, and while I trust him in
> general because in general what he says makes sense, here, as I
> said, it just doesn't make immediate sense to me that the two would
> be so different
Yes, that was my concern as well...


> The biggest downside of autodefrag is its performance on large
> (generally noticeable at between half a gig and a gig)
> random-rewrite-pattern files in actively-being-rewritten use.  For
> all other cases it's generally recommended, but that's why it's not
> the default.
Hmm, that makes it a bit difficult to use when you have mixed use
cases.
Couldn't they just add a feature that allows one to select up to
which file size autodefrag kicks in?
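
Until such a knob exists, I guess the closest workaround would be to
mark just the big random-rewrite files nodatacow per directory,
something like (path made up; the attribute has to be set before the
files get any data):

    # new files created in this directory inherit the NOCOW attribute
    chattr +C /mnt/vms/big-random-rewrite/
    lsattr -d /mnt/vms/big-random-rewrite/

...which of course brings back exactly the "no checksums on
nodatacow" issue from above.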

Interestingly, I've enabled it now, and as I've mentioned before I
run several VMs on that machine (which has an SSD), so far
intentionally without setting nodatacow... however, so far I don't
see any aggressive rewriting, though admittedly I wouldn't know how
to properly tell whether autodefrag was doing heavy IO or not; it
doesn't seem to show up as a kernel thread.
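
The best I could come up with so far to get at least a rough idea is
something like this (file name made up; AFAIU the autodefrag work is
done from existing kernel threads rather than a dedicated one, so I'm
not sure how much it really proves):

    # mount (or remount) with autodefrag enabled
    mount -o remount,autodefrag /mnt/vms

    # watch whether the extent count of a VM image keeps growing, or
    # gets knocked back down again after heavy rewrites
    filefrag /mnt/vms/debian-test.img

    # watch which processes/threads are actually doing IO right now
    iotop -o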

> AFAIK autodefrag only queues up the defrag when it detects
> fragmentation beyond some threshold, and it only checks and thus
> only detects at file (re)write.
Sounds reasonable... especially since I wouldn't want the situation
in which it basically constantly rewrites files, just because of a
few fragments.

Another case, however, could be trickier to detect: files which
continuously and quickly fragment, as a whole or at least in parts.
AFAIU, it would basically make no sense to even try to defrag such
files (because they'd just quickly fragment again).

Also, it would be nice to have some knobs to control in more detail
how much IO it spends on autodefrag, perhaps even on a per-filesystem
basis or with even finer granularity.


> Further, when a filesystem is highly fragmented and autodefrag is
> first turned on, often it actually rather negatively affects
> performance for a few days, because so many files are so fragmented
> that it's queuing up defrags for nearly everything written.
I've read that advice of yours before... so you basically think it
would also queue up files that are fragmented, even when they haven't
been written to since it was turned on?

Interestingly, I turned it on just a few days ago, and so far I
haven't seen much disk activity that would point to autodefrag.



Thanks again for your time and answers :)
Chris.


[0] In Germany there's the term "eierlegende Wollmilchsau"; it
basically describes a pig that gives milk, eggs and wool... perhaps
one could translate it as "jack-of-all-trades device".
(No, I don't want btrfs to include a web browser and a PDF reader ;))
