On 12/16/2014 11:30 AM, Robert White wrote:
On 12/15/2014 01:36 AM, Robert White wrote:
So we don't just hand-wave over statfs(). We include the
dev_item.bytes_excluded in the superblock and we decide once-and-for-all
(with any geometry creation, or completed conversion) how many bytes
just _can't_ be reached but only once we _know_ they can't be reached.
And we memorialize that unreachable data in the superblocks.

Thereafter we report the raw numbers after subtracting anything we know
cannot be reached.

All other "helpful" solutions are NP-complete and insoluble.

This comes after multiple re-readings of my own words and after running off to the POSIX definitions _and_ the kernel sources (which don't agree).

The practical bits first ::

I would add a "-c | --compatible" option to btrfs fi df
that lets it produce /bin/df format-compatible output giving the "real" numbers as defined near the end.


/dev/sda 1TiB
/dev/sdb 2TiB


mkfs.btrfs /dev/sd{a,b} -d raid1

@size=3TiB @used=0TiB @available=2TiB

The above would be ideal. But POSIX says "no". f_blocks is defined (only in the comments) as "total data blocks in the filesystem" and /bin/df pivots on that assumption, so the only usable option left is ::

@size=2TiB @used=0TiB @available=2TiB

After which @used would be the real, raw space consumed. If it takes 2GiB or 4GiB to store 1GiB (q.v. RAID 1 and 10) then @used would go up by that 2 or 4 GiB.
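
A minimal sketch of the arithmetic behind the two-device example above. It assumes a simplified model of RAID1 allocation in which every chunk needs space on two different devices, so any space on the largest device beyond the combined size of the others can never be paired. The function name and the model are illustrative assumptions, not btrfs code; chunk granularity and metadata are ignored.

```python
TIB = 1 << 40

def raid1_unreachable(device_sizes):
    """Bytes that can never hold a mirrored chunk (simplified model)."""
    total = sum(device_sizes)
    largest = max(device_sizes)
    others = total - largest
    # Every chunk needs a partner on another device, so the largest
    # device can contribute at most as much as all the others combined.
    reachable = total if largest <= others else 2 * others
    return total - reachable

sizes = [1 * TIB, 2 * TIB]          # /dev/sda, /dev/sdb
waste = raid1_unreachable(sizes)    # part of /dev/sdb has no mirror partner
size = sum(sizes) - waste           # the value f_blocks would have to carry
print(size // TIB, waste // TIB)
```

Under this model the 1TiB + 2TiB pair yields the @size=2TiB figure used above, with the remaining 1TiB accounted for as unreachable.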

Hi Robert, thanx for your proposal about this.

IMHO, the output of the df command should be more friendly to the user.
Well, I think we have a disagreement on this point. Let's take a look at what zfs is doing.

/dev/sda7- 10G
/dev/sda8- 10G
# zpool create myzpool mirror /dev/sda7 /dev/sda8 -f
# df -h /myzpool/
Filesystem      Size  Used Avail Use% Mounted on
myzpool         9.8G   21K  9.8G   1% /myzpool

That is to say, the df command should tell users the space information they can actually see.
It means the output should be information from the FS level rather than the device level or the _storage_manager level.

Thanx
Yang

Given the not-displayed, not reported, excluded_by_geometry values (e.g. @waste) the equation should always be ::

@size - @waste = @used + @available

The fact that /bin/df doesn't display all four values is just tough. The fact that it calculates one "for us" is really annoying. show-super would be the place to go find the truth.

The @waste value is soft because the 1TiB of /dev/sdb that is not going to be used isn't a _particular_ 1TiB. It could be low blocks or high blocks or randomly distributed blocks that end up not having data.

So keeping with my thought that (ideally) @size should be the "safe dd size" for doing a raw-block transcribe of the devices and filesystem, it is most correct for @size to be real storage size. But sadly, POSIX didn't define that value for that role, so we are forced to munge around (particularly since /bin/df calculates stuff "for us").


Calculation of @waste would have to happen in two phases. At the initiation phase of any convert, @waste would be set to zero. At completion of any _full_ convert, when we know that there are no leftover bits that could lead to rampant mis-report, @waste would be calculated for each device as a dev_item. Then the total would be stored as a global item.
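
The two phases could be sketched as follows. All class, method, and parameter names here are hypothetical (only `bytes_excluded` is borrowed from the dev_item field proposed earlier in the thread), and the per-device unreachable figures are simply passed in rather than derived from a real geometry.

```python
from dataclasses import dataclass

@dataclass
class Device:
    name: str
    size: int
    bytes_excluded: int = 0   # hypothetical per-device @waste (dev_item)

@dataclass
class Filesystem:
    devices: list
    total_waste: int = 0      # hypothetical global @waste item

    def begin_convert(self):
        # Phase 1: while extents of mixed geometry may exist, claim nothing.
        for dev in self.devices:
            dev.bytes_excluded = 0
        self.total_waste = 0

    def finish_full_convert(self, unreachable_by_device):
        # Phase 2: only after a _full_ convert is the geometry known,
        # so only now is @waste recorded per device and then in total.
        for dev in self.devices:
            dev.bytes_excluded = unreachable_by_device.get(dev.name, 0)
        self.total_waste = sum(dev.bytes_excluded for dev in self.devices)

TIB = 1 << 40
fs = Filesystem([Device("sda", 1 * TIB), Device("sdb", 2 * TIB)])
fs.begin_convert()
fs.finish_full_convert({"sdb": 1 * TIB})
print(fs.total_waste // TIB)
```

An interrupted or partial convert would never reach the second call, so @waste would stay at zero rather than report a guess.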

btrfs tools would report all four items.

statfs() would have to report (@size-@waste) and @available, but that's a problem with limits to the assumptions made by statfs() designers two decades ago.

I don't know which numbers we keep on hand and which we derive so...

@available, if calculated dynamically would be
sum(@size, -@waste, -@used).

@used, if calculated dynamically, would be
sum(@size, -@waste, -@available).
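
As a small worked check of the two derivations above, here is a sketch using the 1TiB + 2TiB RAID1 example, with @size taken as the raw 3TiB and @waste as 1TiB. The function names are illustrative only.

```python
GIB = 1 << 30
TIB = 1 << 40

def available(size, waste, used):
    # @available = sum(@size, -@waste, -@used)
    return size - waste - used

def used(size, waste, avail):
    # @used = sum(@size, -@waste, -@available)
    return size - waste - avail

size, waste = 3 * TIB, 1 * TIB
raw_used = 2 * GIB   # 1 GiB of file data stored twice under RAID1
avail = available(size, waste, raw_used)

# The invariant: @size - @waste = @used + @available
assert size - waste == raw_used + avail
assert used(size, waste, avail) == raw_used
```

Whichever of the two values is stored, the other falls out of the same three-term sum, so the invariant holds by construction.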

This would also keep all the determinations of @waste well defined and relegated to specific, infrequently executed blocks of code.

GIVEN ALSO ::

The BTRFS dynamic internal layout allows for completely valid states that are inconsistent with the current filesystem flags. For example, it is legal to set the RAID1 mode for data while RAID0, RAID5, and any manner of other extents are still present. There is no finite solution to every particular layout that exists.

This condition is even _mandatory_ in an evolving system. It may persist if a conversion is interrupted and the balance is then aborted. And it might be purely legal if you supply a convert option and limit the number of blocks to process in the same run.

Each individual extent block is its own master in terms of what "mode the filesystem is actually in" when that extent is being accessed. This fact is _unchangeable_.
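
To illustrate why a single filesystem-wide flag cannot describe raw consumption, here is a sketch that totals raw usage per block group, each with its own profile. The table covers only profiles with a fixed raw-to-data ratio; RAID5 and RAID6 are left out because their ratio depends on the stripe width. The names and the sample layout are assumptions for illustration.

```python
GIB = 1 << 30

# Raw bytes consumed per byte of data, for fixed-ratio profiles only.
RAW_PER_DATA_BYTE = {"single": 1, "dup": 2, "raid0": 1, "raid1": 2}

def raw_used(block_groups):
    """Sum raw consumption over (profile, data_bytes) pairs."""
    return sum(data_bytes * RAW_PER_DATA_BYTE[profile]
               for profile, data_bytes in block_groups)

# A legal mid-conversion layout: the flag says RAID1, the extents disagree.
mixed = [("raid1", 10 * GIB), ("raid0", 4 * GIB), ("single", 1 * GIB)]
print(raw_used(mixed) // GIB)
```

The same 15 GiB of data would cost 30 GiB raw if the RAID1 flag were taken at face value, which is the kind of mis-report the per-extent view avoids.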


STANDARDS REFERENCES and Issues...

The actual standard from POSIX at The Open Group refers to f_blocks as "Total number of blocks on file system in units of f_frsize".

See :: http://pubs.opengroup.org/onlinepubs/009695399/basedefs/sys/statvfs.h.html

The Linux kernel source and man pages say "total data blocks in filesystem".

I don't know where/when/why the "total blocks" got re-qualified as "total data blocks" in the Linux history, but it's probably incorrect on plain reading.

The df command itself suffers a similar problem as the POSIX standard doesn't talk about "data blocks" etc.

Problematically, of course, the statfs() call doesn't really allow for any means to address slack/waste space and the reverse calculation for us becomes impossible.
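
To make the limitation concrete, here is a sketch of roughly what a df-style tool has to work with. It uses Python's `os.statvfs()` and derives "used" as total minus free, which is my understanding of how /bin/df behaves; the helper names are illustrative. Note that there is no field anywhere in the result that could carry @waste.

```python
import os

def df_view(f_frsize, f_blocks, f_bfree, f_bavail):
    """Derive (size, used, avail) in bytes from statvfs-style fields."""
    size = f_blocks * f_frsize
    used = (f_blocks - f_bfree) * f_frsize   # calculated "for us"
    avail = f_bavail * f_frsize              # free space for unprivileged users
    return size, used, avail

def df_for_path(path):
    st = os.statvfs(path)
    return df_view(st.f_frsize, st.f_blocks, st.f_bfree, st.f_bavail)

if __name__ == "__main__":
    print(df_for_path("/"))
```

Because "used" is derived rather than reported, any slack has to be hidden by shrinking f_blocks before the call returns.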

This gets back to the "no right answer in BTRFS" issue.

There is a lot of missing magic here. Back when INODES were just one thing with one size, statfs results were probably either-or, "Everybody Knew" how to turn the inode count into a block count, and history just rolled on.

I think the real answer would be to invent an expanded statfs() call that returned the real numbers for @total_size, @system_overhead_used, @waste_space, @unusable_space, etc -- that is to come up with a generic model for a modern storage system -- and let real calculations take place. But I don't have the "community chops" to start that ball rolling.

CONCLUSIONS ::

Given the inherent assumptions of statfs(), there is _no_ solution that will be correct in all cases.

