I was just working on this as well, though your implementation is *far* more complete than mine. (I was only looking at making changes to the discard implementation in block/raw-posix.c.)
I've got several comments, which I've separated by logical topic... -------- BLKDISCARD -------- fallocate() only supports files. In my patch, I was fstat()ing the fd just after opening it and caching an is_device boolean. Then, when doing discards, if !is_device, call fallocate(), else (i.e. for devices) do the following (note: untested code): __uint64_t range[2]; range[0] = sector_num << 9; range[1] = (__uint64_t)nb_sectors << 9; if (ioctl(s->fd, BLKDISCARD, &range) && errno != EOPNOTSUPP) { return -errno; } return 0; Note that for BLKDISCARD, you probably do NOT want to clear has_discard if you get EOPNOTSUPP. For example, if you have LVM across multiple disks where some support discard and some do not, you'll get EOPNOTSUPP for some discard requests, but not all. ext4 had a behavior where it would stop issuing discards after one failed, but the LVM guys heavily criticized that and asked them to get rid of that behavior. (I haven't checked to make sure it actually happened.) I'd imagine it's still fine to stop trying fallocate() hole punches after the first failure; I'm not aware of any filesystem where discards would be supported in only some ranges of a single file, nor would that make much sense. -------- sync requirements -------- I'm not sure if fallocate() and/or BLKDISCARD always guarantee that the discard has made it to stable storage. If they don't, does O_DIRECT or O_DSYNC on open() cause them to make any such guarantee? If not, should you be calling fdatasync() or fsync() when the user has specified cache=direct or cache=writethrough? Note that the Illumos implementation (see below) has a flag to ask for either behavior. -------- non-Linux Implementations -------- FreeBSD and Illumos have ioctl()s similar to BLKDISCARD and I have similar untested code for those as well: /* FreeBSD */ #ifdef DIOCGDELETE if (s->is_device) { off_t range[2]; range[0] = sector_num << 9; range[1] = (off_t)nb_sectors << 9; if (ioctl(s->fd, DIOCGDELETE, &range) != 0 && errno != ENOIOCTL) { return -errno; } return 0; } #endif /* Illumos */ #ifdef DKIOCFREE if (s->is_device) { dkioc_free_t df; /* TODO: Is this correct? Are the other discard ioctl()s sync or async? */ df.df_flags = (s->open_flags & (O_DIRECT | O_DSYNC)) ? DF_WAIT_SYNC : 0; df.df_start = sector_num << 9; df.df_length = (diskaddr_t)nb_sectors << 9; if (ioctl(s->fd, DKIOCFREE, &range)) != 0 && errno != ENOTTY) { return -errno; } return 0; } #endif -------- Thin vs. Thick Provisioning -------- On Fri, 2012-03-09 at 00:35 -0800, Chris Wedgwood wrote: > Simplest still compare the blocks allocated by the file > to it's length (ie. stat.st_blocks != stat.st_size>>9). I thought of this as well. It covers "99%" of the cases, but there's one case where it breaks down. Imagine I have a sparse file backing my virtual disk. In the guest, I fill the virtual disk 100%. Then, I restart QEMU. Now it thinks that sparse file is non-sparse and stops issuing hole punches. To be completely correct, I suggest the following behavior: 1. Add a discard boolean option to the disk layer. 2. If discard is not specified: * For files, detect a true/false value by comparing stat.st_blocks != stat.st_size>>9. * For devices, assume a fixed value (true?). 3. If discard is true, issue discards. 4. If discard is false, do not issue discards. -------- CONFIG_XFS -------- XFS has implemented the FALLOC_FL_PUNCH_HOLE interface since it was created. So unless you're concerned about people compiling QEMU on newer systems with FALLOC_FL_PUNCH_HOLE and then running that binary on older kernels, there's no case where one needs both the CONFIG_XFS code and FALLOC_FL_PUNCH_HOLE code. -- Richard
signature.asc
Description: This is a digitally signed message part