I was just working on this as well, though your implementation is *far*
more complete than mine. (I was only looking at making changes to the
discard implementation in block/raw-posix.c.)

I've got several comments, which I've separated by logical topic...


-------- BLKDISCARD --------

fallocate() only supports files. In my patch, I was fstat()ing the fd
just after opening it and caching an is_device boolean. Then, when doing
discards, if !is_device, call fallocate(), else (i.e. for devices) do
the following (note: untested code):

        __uint64_t range[2];

        range[0] = sector_num << 9;
        range[1] = (__uint64_t)nb_sectors << 9;

        if (ioctl(s->fd, BLKDISCARD, &range) && errno != EOPNOTSUPP) {
             return -errno;
        }
        return 0;

Note that for BLKDISCARD, you probably do NOT want to clear has_discard
if you get EOPNOTSUPP. For example, if you have LVM across multiple
disks where some support discard and some do not, you'll get EOPNOTSUPP
for some discard requests, but not all. ext4 had a behavior where it
would stop issuing discards after one failed, but the LVM guys heavily
criticized that and asked them to get rid of that behavior. (I haven't
checked to make sure it actually happened.)

I'd imagine it's still fine to stop trying fallocate() hole punches
after the first failure; I'm not aware of any filesystem where discards
would be supported in only some ranges of a single file, nor would that
make much sense.


-------- sync requirements --------

I'm not sure if fallocate() and/or BLKDISCARD always guarantee that the
discard has made it to stable storage. If they don't, does O_DIRECT or
O_DSYNC on open() cause them to make any such guarantee? If not, should
you be calling fdatasync() or fsync() when the user has specified
cache=direct or cache=writethrough?

Note that the Illumos implementation (see below) has a flag to ask for
either behavior.


-------- non-Linux Implementations --------

FreeBSD and Illumos have ioctl()s similar to BLKDISCARD and I have
similar untested code for those as well:

/* FreeBSD */
#ifdef DIOCGDELETE
    if (s->is_device) {
        off_t range[2];

        range[0] = sector_num << 9;
        range[1] = (off_t)nb_sectors << 9;

        if (ioctl(s->fd, DIOCGDELETE, &range) != 0 && errno != ENOIOCTL)
        {
            return -errno;
        }
        return 0;
    }
#endif


/* Illumos */
#ifdef DKIOCFREE
    if (s->is_device) {
        dkioc_free_t df;

        /* TODO: Is this correct? Are the other discard ioctl()s sync or async? 
*/
        df.df_flags = (s->open_flags & (O_DIRECT | O_DSYNC)) ?
            DF_WAIT_SYNC : 0;
        df.df_start = sector_num << 9;
        df.df_length = (diskaddr_t)nb_sectors << 9;

        if (ioctl(s->fd, DKIOCFREE, &range)) != 0 && errno != ENOTTY)
        {
            return -errno;
        }
        return 0;
    }
#endif



-------- Thin vs. Thick Provisioning --------

On Fri, 2012-03-09 at 00:35 -0800, Chris Wedgwood wrote:
> Simplest still compare the blocks allocated by the file
> to it's length (ie. stat.st_blocks != stat.st_size>>9).

I thought of this as well. It covers "99%" of the cases, but there's one
case where it breaks down. Imagine I have a sparse file backing my
virtual disk. In the guest, I fill the virtual disk 100%. Then, I
restart QEMU. Now it thinks that sparse file is non-sparse and stops
issuing hole punches.

To be completely correct, I suggest the following behavior:
     1. Add a discard boolean option to the disk layer.
     2. If discard is not specified:
              * For files, detect a true/false value by comparing
                stat.st_blocks != stat.st_size>>9.
              * For devices, assume a fixed value (true?).
     3. If discard is true, issue discards.
     4. If discard is false, do not issue discards.


-------- CONFIG_XFS --------

XFS has implemented the FALLOC_FL_PUNCH_HOLE interface since it was
created. So unless you're concerned about people compiling QEMU on newer
systems with FALLOC_FL_PUNCH_HOLE and then running that binary on older
kernels, there's no case where one needs both the CONFIG_XFS code and
FALLOC_FL_PUNCH_HOLE code.

-- 
Richard

Attachment: signature.asc
Description: This is a digitally signed message part

Reply via email to