On Wed, 2015-12-09 at 16:36 +0000, Duncan wrote:
> But... as I've pointed out in other replies, in many cases including
> this 
> specific one (bittorrent), applications have already had to develop
> their 
> own integrity management features
Well, let's move that discussion over to the "dear developers, can we
have notdatacow + checksumming, plz?" thread, where I showed recently
that bittorrent seems to be about the only thing which uses such
integrity features by default... while on the VM image front nothing
seems to support it, and on the DB front some support it, but don't
enable it by default.


> In the bittorrent case specifically, torrent chunks are already 
> checksummed, and if they don't verify upon download, the chunk is
> thrown 
> away and redownloaded.
I'm not a bittorrent expert, because I don't use it, but that sounds
more like the eDonkey model, where - while there are checksums - these
are only used until the download completes. Then you have the complete
file, any checksum info is thrown away, and the file is again
"at risk" (i.e. not checksum protected).


> And after the download is complete and the file isn't being
> constantly 
> rewritten, it's perfectly fine to copy it elsewhere, into a dir where
> nocow doesn't apply.
Sure, but again, that's nothing that happens automatically for the
user, and there's still a gap between the final verification by the bt
software and the time the file is copied over.
Arguably, that gap may be very short, but I see no reason to introduce
any breaks in the everything-verified chain on the btrfs side.


>   With the copy, btrfs will create checksums, and if 
> you're paranoid you can hashcheck the original nocow copy against the
> new 
> checksummed/cow copy, and after that, any on-media changes will be
> caught 
> by the normal checksum verification mechanisms.
As before... of course you're right that one can do this, but it's
nothing that happens by default.
And I think that's just one of the nice things btrfs would/should give
us: that the filesystem ensures the data is valid, at least in terms
of storage device and bus errors (it cannot, of course, protect
against memory errors or the like).
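
Just to spell the manual workflow out (a hedged sketch; the directory
and file names are made up purely for illustration):

# copy the finished download out of the nodatacow directory into a
# normal (COW, checksummed) directory
cp /data/nocow/big.iso /data/cow/big.iso
sync
# paranoia check: hash the nocow original against the new checksummed
# copy before removing the original
sha256sum /data/nocow/big.iso /data/cow/big.iso

Which works, but again only if the user thinks of doing it.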


> > Hmm doesn't seem really good to me if systemd would do that, cause
> > it
> > then excludes any such files from being snapshot.
> 
> Of course if the directories are already present due to systemd
> upgrading 
> from non-btrfs-aware versions, they'll remain as normal dirs, not 
> subvolumes.  This is the case here.
Well, even if they're not (because one starts from a fresh system),
people may not want that.


> And of course you can switch them around to dirs if you like, and/or 
> override the shipped tmpfiles.d config with your own.
... sure, but people may not even notice that.
I don't think such a decision is up to systemd.
Anyway, since we're on the btrfs side here, not systemd, that
shouldn't bother us ;)


> > > and small ones such as the sqlite files generated by firefox and
> > > various email clients are handled quite well by autodefrag, with
> > > that
> > > general desktop usage being its primary target.
> > Which is however not yet the default...
> Distro integration bug! =:^)
Nah,... really not...
I'm quite sure that most distros will generally decide against
diverging from upstream in such choices.



> > It feels a bit, if there should be some tools provided by btrfs,
> > which
> > tell the users which files are likely problematic and should be
> > nodatacow'ed
> And there very well might be such a tool... five or ten years down
> the 
> road when btrfs is much more mature and generally stabilized, well
> beyond 
> the "still maturing and stabilizing" status of the moment.
Hmm, let's hope btrfs isn't finished only by the time the next-gen
default fs arrives ;^)



> But it can be the case that as filesystem fragmentation levels rise,
> free-
> space itself is fragmented, to the point where files that would
> otherwise 
> not be fragmented as they're created once and never touched again,
> end up 
> fragmented, because there's simply no free-space extents big enough
> to 
> create them in unfragmented, so a bunch of smaller free-space extents
> must be used where one larger one would have been used had it
> existed.
I'm kinda curious what free space fragmentation actually means here.

Is it simply like this:
+----------+-----+---+--------+
|     F    |  D  | F |    D   |
+----------+-----+---+--------+
Where D is data (i.e. files/metadata) and F is free space.
In other words, (F)ree space itself is not further subdivided and only
fragmented by the (D)ata extents in between.

Or is it more complex like this:
+-----+----+-----+---+--------+
|  F  |  F |  D  | F |    D   |
+-----+----+-----+---+--------+
Where the (F)ree space itself is subdivided
into "extents" (not necessarily of the same size), and btrfs couldn't
use e.g. the first two F's as one contiguous amount of free space for a
larger (D)ata extent of that size:
+----------+-----+---+--------+
|     D    |  D  | F |    D   |
+----------+-----+---+--------+
but would split that up into two instead:
+-----+----+-----+---+--------+
|  D  |  D |  D  | F |    D   |
+-----+----+-----+---+--------+

?


> In that regard, yes, it can affect other files, but it affects them
> by 
> fragmentation, so no, it doesn't affect unfragmented files... to the 
> extent that there are any unfragmented files left.
I see :)


> > Are there any problems caused by all this with respect to free
> > space
> > fragmentation? And what exactly are the consequences of free space
> > fragmentation? ;)
> I must have intuited the question as I just answered it, above! =:^)
O:-D


> And of course that's assuming the worst case, that autodefrag is
> /not/ 
> snapshot-aware.  If it is, then the problem effectively vaporizes 
> entirely.


> > Hmm and what about mixed-use systems,... which have both, desktop
> > and
> > server like IO patterns?
> 
> Valid question.  And autodefrag, like most btrfs-specific mount
> options, 
> remains filesystem-global at this point, too, so it's not like you
> can 
> mount different subvolumes, some with autodefrag, some without (tho 
> that's a planned future implementation detail).
> 
> But, at least personally, I tend to prefer separate filesystems, not 
> subvols, in any case, primarily because I don't like having my data
> eggs 
> all in the same filesystem basket and then watching its bottom drop
> out 
> when I find it unmountable!
> 
> But the filesystem-global nature of autodefrag and similar mount
> options, 
> tends to encourage the separate filesystem layout as well, as in that
> case you simply don't have to worry, because the server stuff is on
> its 
> own separate btrfs where the autdefrag on the desktop btrfs can't 
> interfere with it, as each separate filesystem can have its own mount
> options. =:^)
> 
> So that'd be /my/ preferred solution, but I can indeed see it being a
> problem for those users (or distros) that prefer one big filesystem
> with 
> subvolumes, which some do, because then it's all in a single storage
> pool 
> and thus easier to manage.
Well, the main problem I see with additional filesystems (while you're
absolutely right with your eggs/basket example ;) ) is that one again
has the problem of partitioning, or of using e.g. LVM, in order not to
allocate a more or less fixed number of bytes to each of the different
filesystems created for the different purposes.
And placing LVM below btrfs is at least conceptually bad, because
btrfs already provides similar/the same features by itself.

So that would be the nice part of just using subvols with different
e.g. autodefrag options: it doesn't matter which of the subvols
eventually eats up more space - they share it.



> > btw: I think documentation (at least the manpage) doesn't tell
> > whether
> > btrfs defragment -c XX will work on files which aren't fragmented.
> 
> It implies it, but I don't believe it's explicit.
> 
> The implication is due to the implication that defrag with the
> compress 
> option is effectively compress, in that it rewrites everything it's
> told 
> to compress in that case, of course defragging in the process due to
> the 
> rewrite, but with the primary purpose being the compress, when used
> in 
> that manner.
Hmm,... I guess it would be better if there were a separate option for
that... or at least some clearer documentation.
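
For reference, the invocation we're talking about would be something
like this (a sketch; as I read it, -c forces a compressing rewrite
even for files that aren't fragmented, which is exactly the point the
manpage leaves implicit):

# rewrite (and thereby compress) everything under /path with zlib,
# recursively, printing each file as it is processed
btrfs filesystem defragment -r -v -czlib /path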

> he obviously didn't
> think thru the fact that compression MUST be a rewrite, thereby
> breaking 
> snapshot reflinks, even were normal non-compression defrag to be
> snapshot 
> aware, because compression substantially changes the way the file is 
> stored), that's _implied_, not explicit.
So you mean, even if reflink-aware defrag returned, it would still
break the reflinks when compressing/uncompressing/recompressing?
I'd have hoped that, in that case, all snapshots respectively other
reflinks would simply also change to being compressed, or at least
that there would be an option that lets one choose: break up the
reflinks, or change them along.

> Yes.  Try it on a file that's large enough (a gig or so should do it
> nicely) to make a difference in the btrfs fi df listing.  Compare
> before 
> and after listings.
Okay... I'll need to think a bit more about how to actually trigger
that.
Because a) one gets no notice, AFAICS, of whether autodefrag ran, b) I
first need to manage to create a fragmented file, and c) I need to
understand what each of fi df's values actually means ;)
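
A rough sketch of what I have in mind for b) and c) (file and
mountpoint names made up; on a COW filesystem, in-place rewrites of
random blocks should leave a visibly fragmented file):

# 1 GiB of incompressible data as the starting point
dd if=/dev/urandom of=/mnt/testfile bs=1M count=1024
sync
# rewrite a few thousand 4 KiB blocks in place; each rewrite goes to a
# new location on a COW fs, so the extent count should climb
for i in $(seq 1 2000); do
    dd if=/dev/urandom of=/mnt/testfile bs=4K count=1 \
       seek=$((RANDOM * 8)) conv=notrunc 2>/dev/null
done
sync
filefrag /mnt/testfile   # extent count before defrag
btrfs fi df /mnt         # data/metadata usage before
btrfs filesystem defragment /mnt/testfile
sync
filefrag /mnt/testfile   # extent count after: ideally far fewer extents
btrfs fi df /mnt         # usage after
# (for the defrag -c vs. fi df comparison, compressible data, e.g. a
# large text file, would be the better test case, since urandom data
# won't compress)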



> As I said to my knowledge it hasn't been tried, but AFAIK, truncate,
> followed by sync (or fsync), doesn't do sparse.  I've seen it used
> for (I 
> believe) similar purposes elsewhere, which is why I suggested its use
> here.
Hmm, at least doing a
truncate --size 10G foo
sync
doesn't seem to cause any disk IO.
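Which would fit the sparse-file suspicion; one can check the apparent
vs. allocated size like this (a small sketch):

truncate --size 10G foo
sync
ls -lh foo   # apparent size: 10G
du -h foo    # allocated size: 0, i.e. nothing was actually written (sparse)
stat -c 'size=%s  allocated blocks=%b' foo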


> Of course if truncate doesn't work, catting from /dev/urandom should
> do 
> the trick, as that should be neither sparse nor compressible.  
Or perhaps a bit faster, /dev/zero ;-P
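Though if the filesystem is mounted with compress=, zero-filled data
would largely compress away, so /dev/urandom is probably the safer bet
when the point is "neither sparse nor compressible" (a quick sketch):

# both produce a non-sparse 1 GiB file, but only the urandom one is
# also incompressible
dd if=/dev/zero    of=zeroes bs=1M count=1024   # fast, but compresses to almost nothing
dd if=/dev/urandom of=random bs=1M count=1024   # slower, incompressible
sync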

 
> > b) How can one find out wheter defragmentation worked well? I guess
> > with
> > filefrag in the compress=no case an not at all in any other?
> 
> I recently found out that filefrag -v actually lists the extent byte 
> addresses, thus making it possible to manually (or potentially via 
> script) whether the 128-KiB compression blocks are contiguous or
> not.  
> Contiguous would mean same extent, even if filefrag doesn't
> understand 
> that yet.
> 
> But certainly, filefrag in the uncompressed case is exactly what I
> had in 
> mind.
I'm a bit unsure how to read filefrag's output... (even in the
uncompressed case).
What would it show me if there was fragmentation?
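
My rough understanding (hedged; the numbers below are made up purely
for illustration): plain filefrag just prints the extent count, and
filefrag -v prints one line per extent, so fragmentation shows up as
more than one extent whose physical offsets don't line up with the
"expected" column:

filefrag testfile
# testfile: 5 extents found          <- more than 1 extent = fragmented

filefrag -v testfile
# Filesystem type is: 9123683e
# ext:   logical_offset:   physical_offset:  length:  expected:  flags:
#   0:       0..     255:   1048576..1048831:    256:
#   1:     256..     511:   2097152..2097407:    256:    1048832:
# ...
# a non-empty "expected" value that differs from the actual physical
# offset marks the discontinuity, i.e. where a new fragment starts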


> Yeah, the LHC is rather a special case.
Well, many fields of science actually reach those ranges now:
astronomy, microbiology, genetics, brain research, other fields of
physics; we've even had contact with some people from the humanities
who apparently think their "research" would need such large amounts of
storage (don't ask me,... I didn't understand it ^^)


> Tho to be fair, were I managing data for them or that sort of data
> set 
> where shear size makes backups impractical, I'd probably be at least
> as 
> conservative about btrfs usage as you're sounding, not necessarily in
> the 
> specifics, but simply because while btrfs is indeed stabilizing, I 
> haven't had any argument on this list against my oft stated opinion
> that 
> it's not fully stable and mature yet, and won't be for some time.
Sure,... I mean right now it's no shame for btrfs that one perhaps
wouldn't recommend it for that usage.
But in the future the goal should be that it can be used,... in our
case that would probably still be simple, as the vast majority of the
data (i.e. the PiBs that are archived) is typically write once, read
many.
But the files that are processed in jobs aren't necessarily; they may
easily do all the things on which, right now, btrfs may still start to
choke sooner or later.



> As such, to date I'd be unlikely to consider btrfs at all for data
> where 
> backups aren't feasible, unless it really is simply throw-away data 
> (which from your description isn't the case there), and would be
> leery as 
> well about btrfs usage where backups are available, but simply 
> impractical to deal with, due to shear size and data transfer time.
Sure, talking about right now,... but at least in 5-10 years, btrfs
will hopefully have matured enough that people don't have to start
making backups on a fresh fs ;)


> > For quite a while I was thinking about productively using btrfs at
> > our
> > local Tier-2 in Munich, but then decided against:
> 
> As should be apparent from the above, I basically agree.
> 
> I did want to mention that I enjoyed seeing your large-scale
> description, 
> however, as well as your own reasoning for the decisions you have
> made.  
> (Of course it's confirming my own opinion so I'm likely to enjoy it,
> but 
> still...)
Well, it was also meant to give the devs some insight into which
problems real-world scenarios might run into.

It would be interesting to hear from Chris (Mason) how things are
going at Facebook (IIRC, they were testing btrfs in production),
especially with regard to maintainability and all these things we were
talking about (fragmentation and the like).

But of course, even if it works out perfectly for them, one may not
immediately generalise... perhaps they don't do snapshots ;-) ... or
their nodes are more multiply-redundant and throw-away (i.e. if one
VM's fs breaks or gets slower, it would be automatically re-deployed
and populated with data).
These, and of course having hired the maintainer of the fs, may not be
things every site can afford (and it would also require cloning Chris,
which he may not be particularly fond of ;) ).


> > Speaking of which:
> > Is there somewhere a good documentation of what exactly all this
> > numbers
> > of show, df, usage and so on tell?
> 
> It's certainly in quite a few on-list posts over the years
Okay,... in other words: no ;-)
List posts scattered over the years don't count as documentation :P


Cheers,
Chris.
