Chris,
        More comments below...

        Cheers,
        Don

>Date: Tue, 28 Aug 2007 17:23:58 -0500
>From: Chris Kirby <chris.kirby at sun.com>
>Subject: Re: [zfs-code] statvfs change
>
>Don, thanks for your comments, please see below:
>
>
>Don Cragun wrote:
>>>Date: Mon, 27 Aug 2007 16:04:42 -0500
>>>From: Chris Kirby <chris.kirby at sun.com>
> >
>> The base point is that any time you lie to applications, some
>> application software is going to make wrong decisions based on the lie.
>
>
>Yes, and we certainly don't want to lie. But returning an
>error when we can return valid (albeit less precise) info
>will also cause applications to make wrong decisions.

Yes.  But, if you give them an error, the application writer will soon
be notified that their application isn't working on ZFS due to an
EOVERFLOW error.  If you lie (i.e., give them less precise info), other
applications will make bad assumptions and fail later in strange and
wondrous ways (with no indication that an overflowed value was the
culprit).

>
>In the case of the netBeans installer, it died because
>it thought there wasn't enough free space when in fact,
>there were several TB of space available.
>
>I suspect that most apps that use f_bfree/f_bavail just
>want to know if they have enough space to write their
>data.

Unfortunately, ZFS has no way of knowing when the application calling
statvfs() is not one of those apps.

>
>
>> 
>>>For ZFS, we report f_frsize as 512 regardless of the size of
>>>the fs.  ...
>> 
>> 
>> Why?  Why shouldn't you always set f_frsize to the actual size of an
>> allocation unit on the filesystem?  Is it still true that we don't
>> support disks formatted with 1024 byte sectors?
>
>
>For ZFS, we don't have a fixed allocation block size so in general
>there won't be one true f_frsize across an entire VFS.  So we return
>SPA_MINBLOCKSIZE (512) for f_frsize.
>
>
>> When you cap f_files, f_ffree, and f_favail at UINT32_MAX when the
>> correct values for these fields are larger; you are not returning valid
>> information.
>
>
>I think it's valid in the sense that you will be able to create at
>least UINT32_MAX files.  Of course once you've done so,
>we might still report that you can create UINT32_MAX
>additional files.  :-)

You may also find an app (A) that checks number of free files, removes
a few and then crashes because it was supposed to be running on a quiet
machine and has now detected that some other app (B) is creating files
as fast as A can remove them.  (B doesn't really exist, but since the
number of free files isn't rising, A has to assume that B is active.)

>
>Any application making a decision on an available file count such
>that UINT32_MAX is not enough, but UINT32_MAX+1 would be OK, should
>be using the correct largefile syscalls like statvfs64().

And the only way for that app to detect that this is what is going on
is for statvfs() to fail with an EOVERFLOW error.

>
>> 
>> You may be returning "valid" values for f_frsize, f_blocks, f_bfree,
>> and f_bavail; but you aren't checking to see if that is true or not.
>> (If shifting f_blocks, f_bfree, or f_bavail right throws away a bit
>> that was not a zero bit; the scaled values being returned are not
>> valid.)
>
>
>You're right that we're discarding some bits through the scaling
>process.  However, any non-zero bits that are discarded are effectively
>partial f_frsize blocks.  For any filesystem large enough to get into
>this situation, we're talking about a relatively *very* small
>amount of rounding down.  (e.g. for a 1PB fs, f_frsize is
>only 256K)
>
>Remember that the fs code can be doing delayed writes, delayed
>allocation, background delete processing, etc.  So the statvfs
>values are just rumors anyway.  Most filesystems don't even bother
>to grab a lock when reporting statvfs info.
>
>> 
>> Since the statvfs(2) and statvfs.h(3HEAD) man pages don't state any
>> relationship between f_bsize and f_frsize, applications may well have
>> made their own assumptions.  Is there documentation somewhere that
>> specifies how many bytes should be written at a time (on boundaries
>> that is a multiple of that value) to get the most efficiency out of
>> the underlying hardware?  I would hope that f_bsize would be that
>> value.  If it is, it seems that f_bsize should be an integral multiple
>> of f_frsize.
>
>Aside from the comment in statvfs(2) about f_bsize being the
>"preferred file system block size", I can't find any documentation
>that talks about that.
>
>For filesystems that support direct I/O, f_bsize has traditionally
>provided the most efficient I/O size multiplier.
>
>But the setting of f_bsize is up to the underlying fs.  And at least
>for QFS, UFS, and ZFS, its value is not scaled based on f_frsize.
>That's also why I don't rescale f_bsize.

Correct.  I'm not suggesting that statvfs() should scale f_bsize; I'm
saying that if you scale f_frsize, some application may think its
world has turned upside down because the relationship it thought
existed between f_frsize and f_bsize is no longer true.

I believe statvfs() should be returning an error condition with errno
set to EOVERFLOW and that applications that run into the EOVERFLOW
should be fixed to handle the brave new world of large filesystems.

By the logic you're using, we would not have needed to change the df
utility to be large filesystem aware; we should have just let it
truncate the number of blocks it said were available for all
filesystems to 32-bit values.  For a sysadmin that wants to know if the
ZFS filesystem that was just created came out at the correct size, this
clearly is not sufficient; but for "most" users who just want to know
if there is room to create a file, it will meet their needs perfectly.

For any particular call to statvfs(), the system won't know whether the
discarded low order bits of f_blocks, f_bfree, and f_bavail and the
high order bits of f_files, f_ffree, and f_favail are important or
not.  The only safe thing to do is report the overflows and fix the
applications that get the resulting EOVERFLOW errors.

>
>
>-Chris
