On Mon, 2015-12-14 at 10:51 +0000, Duncan wrote:
> > AFAIU, the one that gets fragmented then is the snapshot, right, and
> > the "original" will stay in place where it was? (Which is of course
> > good, because one probably marked it nodatacow, to avoid that
> > fragmentation problem on internal writes).
> 
> No.  Or more precisely, keep in mind that from btrfs' perspective, in
> terms of reflinks, once made, there's no "original" in terms of
> special treatment, all references to the extent are treated the same.
Sure... you misunderstood me, I guess...

> 
> What a snapshot actually does is create another reference (reflink)
> to an extent.
[snip snap]
> So in the case of nocow, a cow1 (one-time-cow) exception must be
> made, rewriting the changed data to a new location, as the old
> location continues to be referenced by at least one other reflink.
That's what I meant.


> So (with the fact that writable snapshots are available and thus it
> can be the snapshot that changed if it's what was written to) the one
> that gets the changed fragment written elsewhere, thus getting
> fragmented, is the one that changed, whether that's the working copy
> or the snapshot of that working copy.
Yep... that's what I suspected and asked about.

The "original" file, in the sense of the file that first reflinked the
contiguous blocks,... will continue to point to these continuous
blocks.

While the "new" file, i.e he CoW-1-ed snapshot's file, will partially
reflink blocks form the contiguous range, and it's rewritten blocks
will reflink somewhere else.
Thus the "new" file is the one that gets fragmented.


> > And one more:
> > You both said, auto-defrag is generally recommended.
> > Does that also apply for SSDs (where we want to avoid unnecessary
> > writes)?
> > It does seem to get enabled, when SSD mode is detected.
> > What would it actually do on an SSD?
> Did you mean it does _not_ seem to get (automatically) enabled, when
> SSD mode is detected, or that it _does_ seem to get enabled, when
> specifically included in the mount options, even on SSDs?
It does seem to get enabled when specifically included in the mount
options (the ssd mount option is not used), i.e.:
/dev/mapper/system  /  btrfs  subvol=/root,defaults,noatime,autodefrag  0  1
leads to:
[    5.294205] BTRFS: device label foo devid 1 transid 13 /dev/disk/by-label/foo
[    5.295957] BTRFS info (device sdb3): disk space caching is enabled
[    5.296034] BTRFS: has skinny extents
[   67.082702] BTRFS: device label system devid 1 transid 60710 /dev/mapper/system
[   67.111185] BTRFS info (device dm-0): disk space caching is enabled
[   67.111267] BTRFS: has skinny extents
[   67.305084] BTRFS: detected SSD devices, enabling SSD mode
[   68.562084] BTRFS info (device dm-0): enabling auto defrag
[   68.562150] BTRFS info (device dm-0): disk space caching is enabled
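
As a side note, here's a small sketch of how one could double-check
which options actually ended up active on the mounted filesystem, by
parsing /proc/self/mounts; the mount point is a placeholder, and I'm
assuming btrfs lists ssd/autodefrag there the way it appears to on
recent kernels:

#!/usr/bin/env python3
# Sketch: show which mount options are active for a btrfs mount point,
# by reading /proc/self/mounts.  The mount point is a placeholder.
MOUNT_POINT = "/"

with open("/proc/self/mounts") as mounts:
    for line in mounts:
        device, mountpoint, fstype, options, *_ = line.split()
        if mountpoint == MOUNT_POINT and fstype == "btrfs":
            active = set(options.split(","))
            for opt in ("autodefrag", "ssd", "noatime"):
                print(f"{opt:<10}: {'active' if opt in active else 'not listed'}")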



> Or did you actually mean it the way you wrote it, that it seems to be
> enabled (implying automatically, along with ssd), when ssd mode is
> detected?
No, sorry for being unclear.
I meant it that way: having the ssd detected doesn't auto-disable
autodefrag, though I thought such auto-disabling might make sense,
given that I didn't know exactly what it would do on SSDs...
IIRC, Hugo or Austin mentioned that it makes for better IOPS, but I
hadn't considered that to have enough impact... so I thought it could
have made sense to ignore the "autodefrag" mount option in case an ssd
was detected.



> There are three factors I'm aware of here as well, all favoring
> autodefrag, just as the two above favored leaving it off.
>
> 1) IOPS, Input/Output Operations Per Second.  SSDs typically have
> both an IOPS and a throughput rating.  And unlike spinning rust,
> where raw non-sequential-write IOPS are generally bottlenecked by
> seek times, on SSDs with their zero seek-times, IOPS can actually be
> the bottleneck.
Hmm, it would be really nice if someone found a way to do some sound
analysis/benchmarking of that.
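
In the meantime, only a back-of-the-envelope sketch of why IOPS could
matter; the file size, extent sizes and IOPS rating below are made-up
illustrative numbers, not measurements:

#!/usr/bin/env python3
# Back-of-the-envelope sketch: how fragmentation turns one logical read
# into many I/O operations.  All numbers are illustrative guesses.
FILE_SIZE  = 1 * 1024**3          # 1 GiB file
DRIVE_IOPS = 90_000               # hypothetical SSD random-IOPS rating

for extent_size in (4 * 1024, 128 * 1024, 16 * 1024**2):
    ops = FILE_SIZE // extent_size       # roughly one operation per extent
    seconds = ops / DRIVE_IOPS           # time if IOPS is the bottleneck
    print(f"extent size {extent_size // 1024:>6} KiB -> "
          f"{ops:>8} ops, ~{seconds:.2f} s at {DRIVE_IOPS} IOPS")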


> 2) SSD physical write and erase block sizes as multiples of the
> logical/read block size.  To the extent that extent sizes are
> multiples of the write and/or erase-block size, writing larger
> extents will reduce write amplification due to writing blocks smaller
> than the write or erase block size.
Hmm... okay, I don't know the details of how btrfs does this, but I'd
have expected that all extents are aligned to the underlying physical
devices' block structure.
Thus each extent should start at such a write/erase block, and at most
the fit wouldn't be perfect at the end of the extent.
If the file is fragmented (i.e. more than one extent), I'd have even
hoped that all but the last extent fit perfectly.

So what you basically mean, AFAIU, is that with auto-defrag you get
larger extents (i.e. smaller ones collapsed into one), and thus less is
cut off at the end of extents where these don't exactly match the
underlying write/erase blocks?

I still don't see the advantage here... neighbouring extents would
hopefully still be aligned... and it doesn't seem that one saves write
cycles, but rather incurs more due to the defrag.
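
Trying to put rough numbers on the write-amplification argument anyway:
the following is only a sketch, and the 16 KiB page size is an
assumption about a typical SSD, not a known value for any particular
device.

#!/usr/bin/env python3
# Sketch: physical bytes the SSD may have to program when the same 1 MiB
# of dirty data is written as many small extents vs. one large extent,
# assuming each extent write programs whole (hypothetical) 16 KiB pages.
PAGE  = 16 * 1024                  # assumed NAND page (program unit) size
DIRTY = 1 * 1024**2                # 1 MiB of changed file data

def programmed_bytes(extent_size):
    # Bytes programmed if every extent is rounded up to whole pages.
    extents = DIRTY // extent_size
    pages_per_extent = -(-extent_size // PAGE)   # ceiling division
    return extents * pages_per_extent * PAGE

for extent_size in (4 * 1024, 64 * 1024, 1 * 1024**2):
    phys = programmed_bytes(extent_size)
    print(f"{extent_size // 1024:>5} KiB extents -> "
          f"{phys // 1024:>6} KiB programmed (x{phys / DIRTY:.1f})")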


> While the initial autodefrag rewrite is a second-cycle write after a
> fragmented write, spending a write cycle for the autodefrag,
> consistent use of autodefrag should help keep file fragmentation and
> thus ultimately space fragmentation to a minimum, so initial writes,
> where there's enough data to write an initially large extent, won't
> be forced to be broken into smaller extents because there's simply no
> large free-space extents left due to space fragmentation.
> IOW, autodefrag used consistently should reduce space fragmentation
> as well as file fragmentation, and this reduced space fragmentation
> will lead to the possibility of writing larger extents initially,
> where the amount of data to be written allows it, thereby reducing
> initial file write fragmentation and the need for autodefrag as a
> result.
Okay... but AFAIU, that's more like the effects described in (1) and
has less to do with erase/write block sizes...


> 3) Btrfs metadata management overhead.  While btrfs tracks things
> like checksums at fixed sizes,
btw: over what amount of data is each checksum calculated?


>  other metadata is per extent.  Obviously, the more extents a file
> has, the harder btrfs has to work to track them all.  Maintenance
> tasks such as balance and check already have scaling issues; do we
> really want to make them worse by forcing them to track thousands or
> tens of thousands of extents per (large) file where they could be
> tracking a dozen or two?
Okay, but these effects are IMHO also more similar to (1)... I'd
probably call them "metadata compaction" or something like that...
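
Just to get a feeling for the numbers you refer to (again only a rough
sketch; the per-extent metadata is simplified to "one item per extent",
and the file size is an arbitrary example):

#!/usr/bin/env python3
# Sketch: how many per-extent metadata items btrfs would roughly have to
# track for one large file, depending on how fragmented it is.  This just
# counts "one item per extent"; the real metadata is more involved.
FILE_SIZE = 50 * 1024**3           # e.g. a 50 GiB VM image

for extent_size in (4 * 1024, 256 * 1024, 128 * 1024**2):
    extents = FILE_SIZE // extent_size
    print(f"{extent_size // 1024:>7} KiB extents -> "
          f"{extents:>12,} extents to track")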


> On balance, I was persuaded to use autodefrag on my own btrfs' on
> SSDs, and believe the near-term write-cycle damage may in fact be
> largely counteracted by indirect free-space defrag effect and the
> effect that in turn has on the ability to even find large areas of
> cohesive free space to write into in the first place.  With that
> largely counteracted, the other benefits in my mind again outweigh
> the negatives, so autodefrag continues to be worth it in general,
> even on SSDs.
Intuitively, I'd tend to agree... even though I either didn't fully
understand your (2), or would count it and (3) rather under (1).

Would be interesting to see an actual analysis with measurements from
one of the filesystem/block-device geeks.



> I suppose someone will eventually do that sort of testing, but of
> course even if they did it now, with btrfs code still to be optimized
> and various scaling work still to be done, it's anyone's guess if the
> test results would still apply a few years down the road, after that
> scaling and optimization work.
Sure... :-)

I guess I'll leave it on, and if in 5-10 years, after btrfs has been
stabilised and optimised, someone comes up with rock-solid data proving
it, I can claim that I always knew it...
And if the data disproves it, I can claim it's all Duncan's fault, who
lured me into this
;^-P

Cheers,
Chris.
