Hi Qu,

So sorry for the incredibly delayed reply [it got lost in my drafts
folder]; I sincerely appreciate the time you took to respond.  There
is a lot in your responses that I suspect would benefit readers of the
btrfs wiki, so I've drawn attention to them by replying inline.  I've
omitted the sections David resolved with his merge.

P.S. Even graduate-level native speakers struggle with the
multitude of special cases in English!

On Sun, Oct 22, 2017 at 06:54:16PM +0800, Qu Wenruo wrote:
> Hi Nicholas,
> 
> Thanks for the documentation update.
> Since I'm not a native English speaker, I may not help much to organize
> the sentence, but I can help to explain the question noted in the
> modification.
> 
> On 2017年10月22日 08:00, Nicholas D Steeves wrote:
> > In one big patch, as requested
[...]
> > --- a/Documentation/btrfs-balance.asciidoc
> > +++ b/Documentation/btrfs-balance.asciidoc
> > @@ -21,7 +21,7 @@ filesystem.
> >  The balance operation is cancellable by the user. The on-disk state of the
> >  filesystem is always consistent so an unexpected interruption (eg. system 
> > crash,
> >  reboot) does not corrupt the filesystem. The progress of the balance 
> > operation
> > -is temporarily stored and will be resumed upon mount, unless the mount 
> > option
> > +****is temporarily stored**** (EDIT: where is it stored?) and will be 
> > resumed upon mount, unless the mount option
> 
> To be specific, they are stored in data reloc tree and tree reloc tree.
> 
> Data reloc tree stores the data/metadata written to new location.
> 
> And tree reloc tree is kind of special snapshot for each tree whose tree
> block is get relocated during the relocation.

Is there already a document on btrfs allocation?  This seems like it
might be a nice addition to the wiki.  I'm guessing it would fit
under
https://btrfs.wiki.kernel.org/index.php/Main_Page#Developer_documentation
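
For the wiki it might also be worth showing how that stored progress
surfaces to users.  A rough sketch of what I mean, assuming a
filesystem mounted at /mnt (device and mountpoint are hypothetical):

    # See whether a balance is running, or was interrupted and is
    # pending resume
    btrfs balance status /mnt

    # Skip the automatic resume at mount time, then resume manually
    mount -o skip_balance /dev/sdX /mnt
    btrfs balance resume /mnt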

> > @@ -200,11 +200,11 @@ section 'PROFILES'.
> >  ENOSPC
> >  ------
> >  
> > -The way balance operates, it usually needs to temporarily create a new 
> > block
> > +****The way balance operates, it usually needs to temporarily create a new 
> > block
> >  group and move the old data there. For that it needs work space, otherwise
> >  it fails for ENOSPC reasons.
> >  This is not the same ENOSPC as if the free space is exhausted. This refers 
> > to
> > -the space on the level of block groups.
> > +the space on the level of block groups.**** (EDIT: What is the 
> > relationship between the new block group and the work space?  Is the "old 
> > data" removed from the new block group?  Please say something about block 
> > groups to clarify)
> 
> Here I think we're talking about allocating new block group, so it's
> using unallocated space.
> 
> While for normal space usage, we're allocating from *allocated* block
> group space.
> 
> So there are two levels of space allocation:
> 
> 1) Extent level
>    Always allocated from existing block group (or chunk).
>    Data extent, tree block allocation are all happening at this level.
> 
> 2) Block group (or chunk, which are the same) level
>    Always allocated from free device space.
> 
> I think the original sentence just wants to address this.

Also seems like a good fit for a btrfs allocation document.
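
If it helps, the two levels are already visible in the usage output;
a sketch, with /mnt standing in for any btrfs mountpoint:

    btrfs filesystem usage /mnt
    # "Device unallocated" is what level 2 (block group / chunk
    # allocation) draws from; the per-profile lines such as
    # "Data,single: Size=... Used=..." show level 1 (extent
    # allocation) inside already-allocated block groups.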

> >  
> >  The free work space can be calculated from the output of the *btrfs 
> > filesystem show*
> >  command:
> > @@ -227,7 +227,7 @@ space. After that it might be possible to run other 
> > filters.
> >  
> >  Conversion to profiles based on striping (RAID0, RAID5/6) require the work
> >  space on each device. An interrupted balance may leave partially filled 
> > block
> > -groups that might consume the work space.
> > +groups that ****might**** (EDIT: is this 2nd level of uncertainty 
> > necessary?) consume the work space.
> >  
[...]
> > @@ -3,7 +3,7 @@ btrfs-filesystem(8)
[...]
> >  SYNOPSIS
> >  --------
> > @@ -53,8 +53,8 @@ not total size of filesystem.
> >  when the filesystem is full. Its 'total' size is dynamic based on the
> >  filesystem size, usually not larger than 512MiB, 'used' may fluctuate.
> >  +
> > -The global block reserve is accounted within Metadata. In case the 
> > filesystem
> > -metadata are exhausted, 'GlobalReserve/total + Metadata/used = 
> > Metadata/total'.
> > +The global block reserve is accounted within Metadata. ****In case the 
> > filesystem
> > +metadata are exhausted, 'GlobalReserve/total + Metadata/used = 
> > Metadata/total'.**** (EDIT: s/are/is/? And please write more for clarity. 
> > Is "global block reserve" part of GlobalReserve that is accounted within 
> > Metadata?  Isn't all of GlobalReserve's metadata accounted within Metadata? 
> >  eg: "global block reserve" is the data portion of GlobalReserve, but all 
> > metadata is accounted for in Metadata.)
> 
> GlobalReserve is accounted as Metadata, but most of time it's just as a
> buffer until we really run out of metadata space.
> 
> It's like metadata headroom reserved for really important time.
> 
> So under most situation, the GlobalReserve usage should be 0.
> And it's not accounted as Meta/used. (so, if there is Meta/free, then it
> belongs to Meta/free)
> 
> But when GlobalReserve/used is not 0, the used part is accounted to
> Meta/Used, and the unused part (GlobalReserve/free if exists) belongs to
> Meta/free.
> 
> Not sure how to explain it better.

Thank you, you've explained it wonderfully.  (This also seems like a
good fit for a btrfs allocation document)
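
Based on your explanation, maybe the man page could also point
readers at the df output.  The values below are purely illustrative
and /mnt is hypothetical:

    btrfs filesystem df /mnt
    # Data, single: total=10.00GiB, used=7.51GiB
    # System, DUP: total=32.00MiB, used=16.00KiB
    # Metadata, DUP: total=1.00GiB, used=512.34MiB
    # GlobalReserve, single: total=512.00MiB, used=0.00B
    #
    # GlobalReserve/total is metadata headroom; its 'used' stays 0
    # except under metadata pressure, and any used portion is then
    # accounted in Metadata/used, as you describe.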

> >  +
> >  `Options`
> >  +
> > @@ -93,10 +93,10 @@ You can also turn on compression in defragment 
> > operations.
> >  +
> >  WARNING: Defragmenting with Linux kernel versions < 3.9 or ≥ 3.14-rc2 as 
> > well as
> >  with Linux stable kernel versions ≥ 3.10.31, ≥ 3.12.12 or ≥ 3.13.4 will 
> > break up
> > -the ref-links of COW data (for example files copied with `cp --reflink`,
> > +the reflinks of COW data (for example files copied with `cp --reflink`,
> >  snapshots or de-duplicated data).
> >  This may cause considerable increase of space usage depending on the 
> > broken up
> > -ref-links.
> > +reflinks.
> >  +
> [snip]
> > +broken up reflinks.
> >  
> >  *barrier*::
> >  *nobarrier*::
> >  (default: on)
> >  +
> >  Ensure that all IO write operations make it through the device cache and 
> > are stored
> > -permanently when the filesystem is at it's consistency checkpoint. This
> > +permanently when the filesystem is at ****(EDIT: "its" or "one of its" 
> > consistency checkpoint[s])****. This
> 
> I think it is "one of its", as there are in fact 2 checkpoints for btrfs:
> 1) Normal transaction commitment
> 2) Log tree commitment
>    Which only commits the log trees and log tree root.
> 
> But I'm not really sure if log tree commitment is also under the control
> of barrier.

Is there a document on the topic of "Things btrfs does to keep your
data safe, and things it does to maintain a consistent state"?  This
could go there, with a subsection for "differences during a balance
operation" if necessary.  David merged "its consistency checkpoint",
which I think is fine for general-user-facing documentation, but
because you mentioned log tree commitment I'm also wondering whether
2) is outside the control of the barrier.  Without that barrier,
aren't the log trees more likely to be corrupted and/or out of date
in the event of a sudden loss of power or a crash?
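
If it would help such a document, here is how I would check which
behaviour a given mount is using; the device and mountpoint are
hypothetical, and I am not claiming anything about the log tree
beyond what you wrote:

    # Barriers are on by default; nobarrier turns them off
    # (generally not recommended)
    mount -o nobarrier /dev/sdX /mnt
    # Confirm the effective mount options
    grep ' /mnt ' /proc/mounts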

[...]
> >  
> >  *sync* <path> [subvolid...]::
> > -Wait until given subvolume(s) are completely removed from the filesystem
> > -after deletion. If no subvolume id is given, wait until all current  
> > deletion
> > -requests are completed, but do not wait for subvolumes deleted meanwhile.
> > -The status of subvolume ids is checked periodically.
> > +Wait until given subvolume[s] are completely removed from the filesystem 
> > after
> > +deletion. If no subvolume id is given, wait until all current deletion 
> > requests
> > +are completed, but do not wait for subvolumes deleted in the meantime.  
> > ****The
> > +status of subvolume ids is checked periodically.**** (EDIT: How is the 
> > relevant to sync?  Should it read "the status of all subvolume ids are 
> > periodically synced as a normal background operation"?)
> 
> The background is, subvolume deletion is expensive for btrfs, so
> subvolume deletion is split into 2 stages:
> 1) Unlink the subvolume
>    So no one can access the deleted subvolume
> 
> 2) Delete the subvolume tree blocks and its data in background
>    And for tree blocks, we skip the normal tree balance, to speed up the
>    deletion.
> 
> I think the original sentence means we won't wait for the 2nd stage.

When I started using btrfs with linux-3.16 I regularly ran into
issues if I omitted a btrfs sub sync step between deleting, creating,
and then deleting snapshots, so I started syncing subvolumes
religiously after each operation.  If the btrfs sub sync step is
still a recommended practice, I wonder if this is the place to say
so.  Or maybe it's no longer necessary?
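
Either way, for reference the pattern I adopted looks like this
(paths and the subvolume id are hypothetical):

    btrfs subvolume delete /mnt/snapshots/old-snap
    # Wait until every queued subvolume deletion (your stage 2) has
    # actually finished
    btrfs subvolume sync /mnt
    # Or wait for one specific subvolume id only
    btrfs subvolume sync /mnt 257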

[...]
> >  *-d|--data <profile>*::
> > @@ -79,7 +79,7 @@ default value is 16KiB (16384) or the page size, 
> > whichever is bigger. Must be a
> >  multiple of the sectorsize and a power of 2, but not larger than 64KiB 
> > (65536).
> >  Leafsize always equals nodesize and the options are aliases.
> >  +
> > -Smaller node size increases fragmentation but lead to higher b-trees which 
> > in
> > +Smaller node size increases fragmentation ****but lead to higher 
> > b-trees**** (EDIT: "but leads to taller/deeper/more/increased-usage-of 
> > b-trees"?) which in
> 
> What's the difference between "higher" and "taller"?
> Seems quite similar to me though.

I could be wrong, but I think one of
"taller/deeper/more/increased-usage-of b-trees" is closer to what you
want to say, because "smaller node size...leads to higher b-trees"
sounds like a smaller node size leads to the emergence of some kind
of higher order of b-tree that operates differently from the way
b-trees usually function in btrfs.

[I've deleted my pedantic explanation, because I think googling for
"taller vs higher" will provide the resources you need]

> > @@ -166,7 +166,7 @@ root partition created with RAID1/10/5/6 profiles. The 
> > mount action can happen
> >  before all block devices are discovered. The waiting is usually done on the
> >  initramfs/initrd systems.
> >  
> > -As of kernel 4.9, RAID5/6 is still considered experimental and shouldn't be
> > +As of kernel ****4.9**** (EDIT: 4.14 status?), RAID5/6 is still considered 
> > experimental and shouldn't be
> 
> Well, this changed a lot in v4.14. So definitely need to be modified.
> 
> At least Oracle is considering RAID5/6 stable. Maybe we'd better wait
> for several other releases to see if this is true.

Wow!  If so, congratulations!

Sincerely,
Nicholas
