Tomasz Pala posted on Sat, 02 Dec 2017 18:18:19 +0100 as excerpted:

> On Sat, Dec 02, 2017 at 17:28:12 +0100, Tomasz Pala wrote:
> 
>>> Suppose you start with a 100 MiB file (I'm adjusting the sizes down from
>> [...]
>>> Now make various small changes to the file, say under 16 KiB each.  These
>>> will each be COWed elsewhere as one might expect, by default 16 KiB at
>>> a time I believe (might be 4 KiB, as it was back when the default leaf
>> 
>> I've got ~500 small files (100-500 kB) updated partially at regular
>> intervals:
>> 
>> # du -Lc **/*.rrd | tail -n1
>> 105M    total

FWIW, I've no idea what rrd files or rrdcached (from the grandparent post)
are, other than that a quick google suggests it's...
round-robin-database... and the database bit alone sounds bad in this
context, since database-file rewrites are a known worst case for COW-based
filesystems.  But it sounds like you suspect they have this rewrite-most
pattern, which could explain your problem...

>>> But here's the kicker.  Even without a snapshot locking that original 100
>>> MiB extent in place, if even one of the original 16 KiB blocks isn't
>>> rewritten, that entire 100 MiB extent will remain locked in place, as the
>>> original 16 KiB blocks that have been changed and thus COWed elsewhere
>>> aren't freed one at a time, the full 100 MiB extent only gets freed, all
>>> at once, once no references to it remain, which means once that last
>>> block of the extent gets rewritten.
> 
> OTOH - should this happen with nodatacow files? As I mentioned before,
> these files are chattred +C (however, this was not their initial state
> due to https://bugzilla.kernel.org/show_bug.cgi?id=189671 ).
> Am I wrong in thinking that in such a case they should occupy twice
> their size at maximum? Or maybe there is some tool that could show me
> the real space wasted by a file, including extent counts etc.?

Nodatacow... isn't as simple as the name might suggest.

For one thing, snapshots depend on COW and lock the extents they reference
in place.  So while a file might be set nocow and that setting is
retained, the first write to a block after a snapshot *MUST* COW that
block, because the snapshot has the existing version referenced and it
can't change without changing the snapshot as well, which would of course
defeat the purpose of snapshots.

Tho the attribute is retained, and further writes to the same
already-COWed block won't COW it again.

FWIW, on this list that behavior is often referred to as cow1: COW only
the first time a block is written after a snapshot locks the previous
version in place.
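
If you want to see what's actually happening to a given file, something
like the following should show it (a sketch; some.rrd is an example name
and the output shown is illustrative):

# lsattr some.rrd
---------------C---- some.rrd

A capital C among the lsattr flags confirms nocow actually took.  Then
filefrag (from e2fsprogs; it works on btrfs, tho its extent counts for
compressed files are misleading -- not an issue here, since nocow
disables compression) counts the extents, and a heavily cow1-fragmented
file will show that count climbing with each snapshot-plus-rewrite cycle:

# filefrag some.rrd
some.rrd: 1 extent found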

The effect of cow1 depends on the frequency and extent of block rewrites
vs. the frequency of snapshots of the subvolume they're on.  As should be
obvious if you think about it, once a block has been cow1ed, further
rewrites to it before the next snapshot won't COW again, so if only a few
blocks are repeatedly rewritten many times between snapshots, the effect
should be relatively small.  Similarly if snapshots happen far more
frequently than block rewrites, since in that case most snapshots won't
have anything changed (for that file, anyway) since the last one.

However, if most of the file gets rewritten between snapshots and the
snapshot frequency is often enough to be a major factor, the effect can be
practically as bad as if the file weren't nocow in the first place.
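
To put rough, purely illustrative numbers on it: take one 500 KiB rrd
file that's mostly rewritten between snapshots, with hourly snapshots
retained for a week.  Each snapshot pins the previous generation of
extents, so the worst case is roughly:

  500 KiB/file * 168 retained snapshots =~ 82 MiB per file
  82 MiB/file * 500 files =~ 40 GiB pinned, from ~105 MiB of live data

The real figure depends on how much of each file actually changes per
interval, but it shows how cow1 plus frequent snapshots can multiply
apparent usage.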

If I knew a bit more about rrd's rewrite pattern... and your snapshot
pattern...


Second, as you alluded to, for btrfs, files must be set nocow before
anything is written to them.  Quoting the chattr (1) manpage:  "If it is
set on a file which already has data blocks, it is undefined when the
blocks assigned to the file will be fully stable."

Not being a dev, I don't read the code to know what that means in
practice, but it could well be effectively cow1, which would yield the
maximum 2X size you assumed.

But I think it's best to take "undefined" at its word and assume the
worst case, "no effect at all", for size-calculation purposes, unless you
really /did/ set it at file creation, before the file had content.
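
For completeness, the sequence that's known to behave as expected is (a
sketch; new.rrd is just an example name):

# touch new.rrd
# chattr +C new.rrd
(only now write data into it -- the attribute was set while the file was
still empty)

Or set +C on the parent directory first and let newly created files
inherit it, as described next.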

And the easiest way to do /that/, and something that might be worthwhile
doing anyway if you think unreclaimed still-referenced extents are your
problem, is to set the nocow flag on the /directory/, then copy the
files into it, taking care to actually create them new -- that is, use
cp --reflink=never, or copy the files to a different filesystem (perhaps
tmpfs) and back, so they /have/ to be created new.  Of course the
rewriter (rrdcached, apparently) must be shut down for the process.

Then, once the files are safely back in place and the filesystem synced
so the data is actually on disk, you can delete the old copies (which
will continue to serve as backups until then), and sync the filesystem
again.
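
Put together, the whole dance might look something like this (a sketch;
the paths are examples, and rrdcached is stopped first):

# mkdir /data/rrd.new
# chattr +C /data/rrd.new
# cp -a --reflink=never /data/rrd/. /data/rrd.new/
# lsattr /data/rrd.new/*.rrd
(verify the C flag took -- cp doesn't copy inode flags, the new files
inherit +C from the directory)
# sync
# mv /data/rrd /data/rrd.old && mv /data/rrd.new /data/rrd
# rm -r /data/rrd.old
(only once you're satisfied the new copies are good)
# sync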

While snapshots will of course continue to keep the extents they
reference locked, for unsnapshotted files at least, this process should
release any still-referenced-but-partially-unused extents for those
files, fixing the problem if this is indeed it.  After deleting the
original copies to free the space and syncing, you can check and see.
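
As for checking, filefrag's extent counts before and after are one
measure, and if your btrfs-progs is new enough (4.6+), btrfs filesystem
du breaks down per-file usage including shared extents, which gets at
your "real space wasted by file" question (a sketch; the path and the
numbers shown are illustrative):

# btrfs filesystem du -s /data/rrd
     Total   Exclusive  Set shared  Filename
 105.00MiB   105.00MiB       0.00B  /data/rrd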


Meanwhile, /because/ nocow has these complexities along with others
(nocow automatically turns off data checksumming and compression for the
files too), and because those nullify some of the big reasons people
might choose btrfs in the first place, I actually don't recommend
setting nocow at all -- if usage is such that a file needs nocow, my
thinking is that btrfs isn't a particularly good hosting choice for that
file, and a more traditional rewrite-in-place filesystem is likely to be
a better fit.

OTOH, it's also quite possible that people chose btrfs at least partly
for other reasons, say the "storage pool" qualities, and would rather
just shove everything on a single btrfs "pool" and not have to worry
about it, however much that sets off my own "all eggs in one basket"
risk alarms. [shrug]  For them, having to separate all their nocow
stuff onto a different non-btrfs filesystem would defeat their purpose,
and they'd rather just deal with all the complexities of nocow.

For this sort of usage, we actually have reports that first setting up
the nocow dirs and ensuring that files inherit nocow at creation, so
they /are/ actually nocow, then setting up snapshotting on a sane
schedule, /and/ setting up a periodic (perhaps weekly or monthly) defrag
of the nocow files to eliminate the fragmentation caused by
snapshot-triggered cow1, works reasonably well.
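
A minimal sketch of that periodic defrag, assuming the files live under
/data/rrd and a weekly cron job suits you (the -t target extent size is
optional, adjust or drop as desired):

# cat /etc/cron.weekly/defrag-rrd
#!/bin/sh
exec btrfs filesystem defragment -r -t 32M /data/rrd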

Of course if snapshots are being kept effectively "forever", that'll
make space usage even /worse/, because defrag breaks reflinks and
unshares the data.  But arguably, that's doing it wrong, because
snapshots are /not/ backups, and what might be temporarily snapshotted
should eventually make it to a real backup, allowing one to delete those
snapshots and free the space they took.  Of course if you're using btrfs
send/receive to do those backups, keeping around selected "parent"
snapshots to reference with send is useful, but choosing a new "send
parent" at least every quarter, say, and deleting the old ones, at least
puts a reasonable limit on how long such snapshots need to be kept on
the operational filesystem.
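
For reference, the incremental form looks something like this (a sketch;
the snapshot names and mount points are examples, and the parent must
already exist on the receive side):

# btrfs subvolume snapshot -r /data /data/.snap-new
# btrfs send -p /data/.snap-parent /data/.snap-new | btrfs receive /mnt/backup
# btrfs subvolume delete /data/.snap-parent
(.snap-new then becomes the parent for the next send)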

And since anything beyond double-digit snapshot counts of the same
subvolume creates scaling issues for btrfs balance, snapshot deletion,
etc., a regular snapshot-thinning schedule combined with a cap on the
age of the oldest one nicely helps there as well. =:^)

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman
