I don't disagree with the _ideal_ of your patch. I just think that it's
impossible to implement it without lying to the user or making things
just as bad in a different way. I would _like_ you to be right. But my
thing is finding and quantifying failure cases and the entire question
is full of fail.
This is not an attack on you personally; it's a mismatch between the
storage and filesystem paradigms, and we're hitting it first because we
are the first to really blend the two.
Here is a completely legal BTRFS working set. (It's a little extreme.)
/dev/sda :: |Sf|Sf|Sp|0f|1f|0p|0p|Mf|Mf|Mp|1p| 1.25GiB-unallocated |
/dev/sdb :: |0f|1f|0p|0p|Mp|1p| 4.75GiB-unallocated |
Legend
p == partial, about half full.
f == full, or full enough to treat as full.
S == Single allocated chunk
0 == RAID-0 allocated chunk
1 == RAID-1 allocated chunk
M == metadata chunk
History: This filesystem started out on a single drive, then it has
bounced between RAID-0 and RAID-1 at least twice. The owner has _never_
let a conversion finish; he has just changed modes a couple of times.
The current filesystem flag says RAID-1.
But we currently have 0.5GiB of "single" slack, 2GiB of RAID-0 slack,
1GiB of RAID-1 slack, 2GiB of space in which a total of 1GiB more RAID-1
extents can be created, and 3GiB of space on /dev/sdb that _can_ _not_
be allocated. We have room for one more metadata extent on each drive,
but if we allocate two more metadata extents on each drive we will eat
into that 1.25GiB of unallocated space, reducing it to 0.75GiB.
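(If it helps to see the bookkeeping spelled out, here is a rough
userspace sketch of the tally I'm doing above. The chunk list and sizes
are made up for illustration; this is not the kernel's accounting, just
the shape of it.)

    # Rough sketch: tally slack per allocation profile from a list of
    # (profile, chunk_size, bytes_used) tuples.  Numbers are illustrative.
    from collections import defaultdict

    GiB = 1 << 30
    chunks = [
        ("single", 1 * GiB, 1 * GiB),    # an Sf chunk
        ("single", 1 * GiB, GiB // 2),   # the Sp chunk
        ("raid0",  2 * GiB, 2 * GiB),    # a 0f chunk, striped over both drives
        ("raid0",  2 * GiB, 1 * GiB),    # a 0p chunk
        ("raid1",  1 * GiB, GiB // 2),   # the 1p chunk (costs 2GiB of raw disk)
    ]

    slack = defaultdict(int)
    for profile, size, used in chunks:
        slack[profile] += size - used

    for profile, free in sorted(slack.items()):
        print(f"{profile:6s} slack: {free / GiB:.2f} GiB")
    # Which of these buckets a RAID-1-flagged filesystem will actually
    # write into is exactly the question asked below.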
First, a question.
Will a BTRFS in RAID-1 mode add file data to extents that are in other
modes? That is, will the filesystem _use_ the 2.5GiB of available
"single" and "RAID-0" slack? If not, then that's 2.5GiB of "phantom
consumption": space that isn't "used" but also isn't usable.
The raw size of the store is 20GiB. The default you propose, 2x10GiB
reported as RAID-1, would be 10GiB. But how do you identify the 3GiB
that is "missing" because of the lopsided allocation history?
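(Put as arithmetic, the mismatch looks like this. "Stranded" is my
label for the space that can never find a RAID-1 partner, and both
treatments of it below are guesses, not answers.)

    GiB = 1
    raw_total   = 10 * GiB + 10 * GiB   # two 10GiB devices
    naive_raid1 = raw_total / 2         # the proposed "2x10GiB -> 10GiB" default
    stranded    = 3 * GiB               # raw space on /dev/sdb with no partner
                                        # space left on /dev/sda for RAID-1 chunks
    print(naive_raid1)                  # 10GiB
    print(naive_raid1 - stranded)       # subtract it whole?
    print(naive_raid1 - stranded / 2)   # or halve it like everything else?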
Seem unlikely? The rotten cod example I've given is unlikely.
But a more even case is downright common and likely. Say you run a nice
old-fashioned MUTT mail spool. "Most" of your files are small enough to
live in metadata. You start with one drive and allocate 2 single-data
chunks and 10 metadata chunks (5x DUP). Then you add a second drive of
equal size (the metadata just switched to its DUP-as-RAID-1-alike mode),
and then you do a -dconvert=raid0.
That uneven allocation of metadata will be a 2GiB difference between the
two drives forever.
So do you shave 2GiB off of your @size?
Do you shave 2GiB off of your @available?
Do you overreport your @available by 2GiB and end up _still_ having
things "available" when you get your ENOSPC?
How about this ::
/dev/sda == |Sf|Sf|Mf|Mf|Mf|Mf|Sf|Sf|Sp|Mp|Mp| .5GiB free|
/dev/sdb == |10 GiB free |
The operator fills his drive, then adds a second one, then _foolishly_
starts converting to RAID-0, and the power fails mid-conversion. In
order to check the FS he boots with skip_balance. Then his maintenance
window closes and he has to go back into production, at which point he
forgets (or isn't allowed) to do the balance. The flags are set, but now
no more extents can be allocated.
Size is 20GiB, slack is 10.5GiB. The operator is about to get ENOSPC.
Yes a balance would fix it, but that's not the question.
In the meantime what does your patch report?
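(Roughly, the counters versus the allocator in that state. I'm
assuming, purely for the sake of the sketch, that a RAID-0 data chunk
wants about 1GiB of unallocated space on each of at least two devices;
the exact minimum doesn't matter, only that /dev/sda can't supply it.)

    GiB = 1.0
    unallocated = {"sda": 0.5 * GiB, "sdb": 10 * GiB}
    reported_slack = sum(unallocated.values())        # 10.5 GiB, what the counters see

    CHUNK = 1 * GiB                                   # assumed per-device stripe size
    stripes = int(min(unallocated.values()) // CHUNK) # limited by the emptiest device
    usable_raid0 = stripes * CHUNK * len(unallocated) # 0 GiB here

    print(reported_slack, usable_raid0)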
Or...
/dev/sda == |Sf|Sf|Mf|Mf|Mf|Mf|Sf|Sf|Sp|Mp|Mp| .5GiB free|
/dev/sdb == |10 GiB free |
/dev/sdc == |10 GiB free |
He does a -dconvert=raid5 and immediately gets ENOSPC for all the
blocks. According to the flags we've got 10GiB free...
Or we end up with an egregious metadata history from lots of small
files: we've got a perfectly fine RAID-1 with several GiB of slack, but
none of that slack is 1GiB contiguous. All of the slack has come from
reclaiming metadata.
/dev/sda == |Sf|Sf|Mp|Mp|Rx|Rx|Mp|Mp|Rx|Rx|Mp|Mp| N-free slack|
(R == reclaimed, i.e. available to extent-tree.c for allocation)
We have 1.5GiB of "poisoned" space here; it can hold metadata but not
data. So is that 1.5GiB in your @available calculation? How do you mark
it up as used?
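(However you mark it, the split that matters is something like this.
The numbers are invented; the point is that the slack has a type.)

    GiB = 1
    slack_in_data_chunks     = 0           # nothing left in the Sf chunks
    slack_in_metadata_chunks = 1.5 * GiB   # the Mp/Rx space above: metadata-only
    unallocated              = 0           # no room for a new data chunk

    writable_for_file_data = slack_in_data_chunks + unallocated   # 0
    reported_if_summed     = (writable_for_file_data
                              + slack_in_metadata_chunks)         # 1.5 GiB
    print(writable_for_file_data, reported_if_summed)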
...
And I've been ignoring the Mp(s) completely. What if I've got a good two
GiB of partial space in the metadata, but that's all I've got? You write
a file of any size and you'll get ENOSPC, even though you've still got
those GiB. Were they in @size? Are they in @avail?
...
See, you keep giving me these examples where the history of the
filesystem is uniform: it was made a certain way and it stayed that way.
But in real life this sort of thing is going to happen, and your patch
simply reports a _different_ _wrong_ number. A _friendlier_ wrong
number, I'll grant you that, but still wrong.