This patch series adds code to the block layer that allows performing I/O requests at a smaller granularity than the host backend requires (most importantly, O_DIRECT alignment restrictions). It achieves this for reads by rounding the request out to the host-side block boundaries, and for writes by performing a read-modify-write cycle (serialising requests that touch the same block so that the RMW doesn't write back stale data).
Originally I intended to reuse a lot of code from Paolo's previous patch series. However, as I tried to integrate it with pread/pwrite, which already do a very similar thing (except for considering concurrency), and because I wanted to implement zero-copy, most of this series ended up being new code.

Zero-copy is possible in a common case because while XFS defaults to a 4k sector size, and therefore a 4k on-disk O_DIRECT alignment for 512e disks, it still only has a 512-byte memory alignment requirement. (Unfortunately, the XFS_IOC_DIOINFO ioctl claims 4k even for memory, but we know that the value is wrong and can probe it.)

This series does not cover 4k guests on a 512-byte host, and I'm not sure yet what to do with this case. Paolo's series contained a patch to protect against "torn reads" (i.e. reads running in parallel with writes, which return old data for one half of a sector and new data for the other half) by serialising requests if the guest block size was greater than the host block size.

One problem with this approach is that it assumes that a single host block size even exists and can be compared against at the top level. Different backing files can be stored on different storage, though, with different block sizes. Another problem is that block drivers can split requests internally (imagine a qcow2 image with 512-byte clusters), which would have to be detected as well.

Finally, it's unclear what to do with cache modes that use the kernel page cache. Technically, these have a required alignment of 1 byte, which is always smaller than the guest alignment. We always have to expect short writes, so we can't say "it's always the granularity of the request". However, serialising _every_ request certainly doesn't seem reasonable; we've never done it, and we've never got any bug reports. Other non-file protocols may have the same problem. (And all of this is ignoring that with multiple users of the block device - e.g.
guest device, NBD server, block jobs - there isn't even a single guest block size, but it must be passed per request if done properly.)

Anyway, I'm hoping for a review of this series in order to get 512b-on-4k merged soon, and some help/discussion for the 4k-on-512 case.

Kevin Wolf (17):
  qemu_memalign: Allow small alignments
  block: Detect unaligned length in bdrv_qiov_is_aligned()
  block: Don't use guest sector size for qemu_blockalign()
  block: Introduce bdrv_aligned_preadv()
  block: Introduce bdrv_co_do_preadv()
  block: Introduce bdrv_aligned_pwritev()
  block: write: Handle COR dependency after I/O throttling
  block: Introduce bdrv_co_do_pwritev()
  block: Switch BdrvTrackedRequest to byte granularity
  block: Allow waiting for overlapping requests between begin/end
  block: Make zero-after-EOF work with larger alignment
  block: Generalise and optimise COR serialisation
  block: Make overlap range for serialisation dynamic
  block: Align requests in bdrv_co_do_pwritev()
  block: Change coroutine wrapper to byte granularity
  block: Make bdrv_pread() a bdrv_prwv_co() wrapper
  block: Make bdrv_pwrite() a bdrv_prwv_co() wrapper

Paolo Bonzini (2):
  block: rename buffer_alignment to guest_block_size
  raw: Probe required direct I/O alignment

 block.c                   | 572 ++++++++++++++++++++++++++++++----------------
 block/backup.c            |   7 +-
 block/raw-posix.c         | 102 +++++++--
 block/raw-win32.c         |  41 ++++
 hw/block/virtio-blk.c     |   2 +-
 hw/ide/core.c             |   2 +-
 hw/scsi/scsi-disk.c       |   2 +-
 hw/scsi/scsi-generic.c    |   2 +-
 include/block/block.h     |   3 +-
 include/block/block_int.h |  24 +-
 util/oslib-posix.c        |   5 +
 11 files changed, 539 insertions(+), 223 deletions(-)

-- 
1.8.1.4