Running with mismatched host and guest logical block sizes is going to
become more important as 4k-sector disks become more widespread,
because we still need a 512-byte-sector disk to boot from.
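
As a rough illustration of what a mismatch means for a single request,
the sketch below widens a guest byte range to host block boundaries and
reports whether the request would need a read-modify-write cycle (read
the enclosing host blocks, patch in the guest data, write them back).
The names widen_to_host_blocks and struct aligned_range are
hypothetical and do not come from this series; the sketch also assumes
the host block size is a power of two.

```c
#include <stdint.h>

/* Hypothetical illustration only; not code from this series. */
struct aligned_range {
    uint64_t start;    /* first byte of the widened range */
    uint64_t end;      /* one past the last byte of the widened range */
    int needs_rmw;     /* nonzero if the request had to be widened */
};

/*
 * Widen the byte range [offset, offset + len) so that both ends fall on
 * host block boundaries.  Assumes host_block_size is a power of two.
 */
static struct aligned_range widen_to_host_blocks(uint64_t offset,
                                                 uint64_t len,
                                                 uint64_t host_block_size)
{
    struct aligned_range r;
    uint64_t mask = host_block_size - 1;

    r.start = offset & ~mask;                 /* round start down */
    r.end = (offset + len + mask) & ~mask;    /* round end up */
    r.needs_rmw = (r.start != offset) || (r.end != offset + len);
    return r;
}
```

For example, a 512-byte write at byte offset 4608 (guest sector 9) on a
host with 4096-byte blocks widens to the range [4096, 8192) and needs a
read-modify-write cycle, while a 4096-byte write at offset 8192 is
already aligned and does not.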

Mismatched block sizes have two problems:

1) With cache=none or with non-raw protocols, you simply cannot write
   with 512-byte granularity.  You need to do read-modify-write (RMW)
   cycles, like "hybrid" 512b-logical/4k-physical disks do.  (Note that
   only the iSCSI protocol actually supports 4k logical blocks.)

2) When host block size < guest block size, guests issue 4k-aligned I/O
   and expect it to be atomic.  This problem cannot be solved
   completely, because power or I/O failures could leave a
   partially-written block ("torn page").  However, at least you can
   serialize reads against overlapping writes, which guarantees
   correctness as long as shutdown is clean and there are no I/O
   errors.

Read-modify-write cycles are of course slower, and they require
serializing writes, which makes the situation even worse.  However, the
performance impact of emulating 512-byte sectors is within noise when
partitions are aligned.  File system blocks are usually 4k or bigger,
and OSes tend to use 4k-aligned buffers.  So when partitions are
aligned, no misaligned I/O is sent and no bounce buffer is necessary
either.

The situation is very different if partitions are misaligned or if the
guest is using O_DIRECT with a 512-byte aligned buffer.  I benchmarked
only the former, using iozone on a RHEL6 guest (2GB memory, 20GB ext4
partition with the whole 4k-sector disk assigned to the guest).  The
graphs aren't really pretty, but two points are more or less
discernible (and also more or less obvious):

- writes incur a larger overhead than reads, by 5-10%;

- for larger file sizes the penalty is smaller, probably because the
  I/O scheduler can work better (with almost no penalty for reads); for
  smaller file sizes, up to 1M or even more in some scenarios,
  misalignment worsened performance by 10-25%.

The series is structured as follows.  Patches 1 to 6 clean up the
handling of flag bits, so that non-raw protocols can always request
read-modify-write operation (even when cache != none).
Patches 7 to 11 distinguish host and guest block sizes in the
BlockDriverState.  Patches 12 to 15 reuse the request tracking
mechanism to implement RMW and to avoid torn pages.  Patch 16 passes
down the host block size as physical block size, so that hopefully
guest OSes try to align partitions.  Patch 17 adds an option to qemu-io
that lets you test these scenarios even without a 4k-sector disk.

Paolo Bonzini (17):
  block: do not rely on open_flags for bdrv_is_snapshot
  block: store actual flags in bs->open_flags
  block: pass protocol flags up to the format
  block: non-raw protocols never cache
  block: remove enable_write_cache
  block: move flag bits together
  raw: remove the aligned_buf
  block: rename buffer_alignment to guest_block_size
  block: add host_block_size
  raw: probe host_block_size
  iscsi: save host block size
  block: allow waiting only for overlapping writes
  block: allow waiting at arbitrary granularity
  block: protect against "torn reads" for guest_block_size > host_block_size
  block: align and serialize I/O when guest_block_size < host_block_size
  block: default physical block size to host block size
  qemu-io: add blocksize argument to open

 Makefile.objs     |    4 +-
 block.c           |  313 ++++++++++++++++++++++++++++++++++++++++++++++-------
 block.h           |   17 +---
 block/curl.c      |    1 +
 block/iscsi.c     |    2 +
 block/nbd.c       |    1 +
 block/raw-posix.c |   97 ++++++++++-------
 block/raw-win32.c |   42 +++++++
 block/rbd.c       |    1 +
 block/sheepdog.c  |    1 +
 block/vdi.c       |    1 +
 block_int.h       |   25 ++---
 hw/ide/core.c     |    2 +-
 hw/scsi-disk.c    |    2 +-
 hw/scsi-generic.c |    2 +-
 hw/virtio-blk.c   |    2 +-
 qemu-io.c         |   33 +++++-
 trace-events      |    1 +
 18 files changed, 429 insertions(+), 118 deletions(-)

-- 
1.7.7.1