I have used the following program to test:

#define _GNU_SOURCE

#include <stdio.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/types.h>
#include <malloc.h>
#include <string.h>
int main(int argc, char *argv[])
{
    int fd = open(argv[1], O_RDWR | O_CREAT | O_DIRECT, 0644);
    void *buf;
    int i = 0, align = atoi(argv[2]);

    /* keep allocating until the buffer matches the case under test:
       for align < 4096 we want a buffer that is align-aligned but
       NOT 4096-byte aligned */
    do {
        buf = memalign(align, 4096);
        if (align >= 4096)
            break;
        if ((unsigned long)buf & 4095)
            break;
        i++;
    } while (1);
    printf("%d %p\n", i, buf);

    memset(buf, 0x11, 4096);
    for (i = 0; i < 100000; i++) {
        lseek(fd, 4096, SEEK_CUR);
        write(fd, buf, 4096);
    }

    close(fd);
    return 0;
}

Run as follows, with $align set to 512 and then to 4096:

    time for i in `seq 1 30` ; do ./a.out aa $align ; done

The file was placed on an 8 GB partition of the rotational disk listed
below to avoid speed changes caused by different offsets on the disk.
The results are reliable:
- 189 vs 180 seconds on Linux 3.16 (512- vs 4096-byte aligned buffer)

The following setups have been tested:
1) ext4 with 1024-byte block size over a 512/512 physical/logical
   sector size SSD disk
2) ext4 with 4096-byte block size over a 512/512 physical/logical
   sector size SSD disk
3) ext4 with 4096-byte block size over a 512/4096 logical/physical
   sector size rotational disk (WDC WD20EZRX)
4) xfs with 4096-byte block size over a 512/512 physical/logical
   sector size SSD disk

The difference is quite reproducible and amounts to the same 5% in all
setups.

For an image in qcow2 format

    qemu-io -n -c 'write -P 0xaa 0 1G' 1.img

is 1% faster. qemu-img is also affected. For the sequence

    qemu-img create -f qcow2 1.img 64G
    qemu-io -n -c 'write -P 0xaa 0 1G' 1.img
    time for i in `seq 1 30` ; do qemu-img convert 1.img -t none -O raw 2.img ; rm -rf 2.img ; done

the difference is around 126 vs 119 seconds.

The explanation of the performance improvement is quite interesting.
From the kernel's point of view, each request to the disk was split in
two. This can be seen in blktrace output like this:

    9,0   11  1     0.000000000 11151  Q  WS 312737792 + 1023 [qemu-img]
    9,0   11  2     0.000007938 11151  Q  WS 312738815 + 8 [qemu-img]
    9,0   11  3     0.000030735 11151  Q  WS 312738823 + 1016 [qemu-img]
    9,0   11  4     0.000032482 11151  Q  WS 312739839 + 8 [qemu-img]
    9,0   11  5     0.000041379 11151  Q  WS 312739847 + 1016 [qemu-img]
    9,0   11  6     0.000042818 11151  Q  WS 312740863 + 8 [qemu-img]
    9,0   11  7     0.000051236 11151  Q  WS 312740871 + 1017 [qemu-img]
    9,0    5  1     0.169071519 11151  Q  WS 312741888 + 1023 [qemu-img]

After the patch the pattern becomes normal:

    9,0    6  1     0.000000000 12422  Q  WS 314834944 + 1024 [qemu-img]
    9,0    6  2     0.000038527 12422  Q  WS 314835968 + 1024 [qemu-img]
    9,0    6  3     0.000072849 12422  Q  WS 314836992 + 1024 [qemu-img]
    9,0    6  4     0.000106276 12422  Q  WS 314838016 + 1024 [qemu-img]

and the number of requests sent to the disk (obtained by counting the
lines in the blktrace output) is reduced by about a factor of two.

Both qemu-img and qemu-io are affected, while qemu-kvm is not: the
guest does its job well, and real requests arrive properly aligned
(to page size).

Changes from v5:
- found the justification from the kernel point of view
- fixed checkpatch warnings in patch 2

Changes from v4:
- patches reordered
- dropped the conversion from 512 to BDRV_SECTOR_SIZE
- getpagesize() is replaced with MAX(4096, getpagesize()) as suggested
  by Kevin

Changes from v3:
- used a portable way to calculate the system page size
- replaced the 512/4096 values with proper macros/values

Changes from v2:
- opt_mem_alignment is split into opt_mem_alignment for bounce
  buffering and min_mem_alignment to check buffers coming from the
  guest

Changes from v1:
- enforce 4096 alignment in qemu_(try_)blockalign and avoid touching
  the bdrv_qiov_is_aligned path, so as not to force additional bounce
  buffering, as suggested by Paolo
- reduced 10% to 5% in the patch description to better match the
  180 vs 189 difference
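For context, here is the idea of the series in a nutshell, as a minimal
sketch only: it is not the actual QEMU code, and try_blockalign below is
an illustrative stand-in for qemu_(try_)blockalign. The point is that
bounce buffers get an alignment of MAX(4096, getpagesize()), so the
kernel no longer splits the resulting O_DIRECT requests:

#include <stdlib.h>
#include <unistd.h>

/* Illustrative sketch, not the QEMU implementation: allocate a
 * bounce buffer aligned to at least 4096 bytes, or to the system
 * page size when that is larger. */
static void *try_blockalign(size_t size)
{
    size_t align = (size_t)getpagesize();
    void *buf;

    if (align < 4096) {
        align = 4096;
    }
    /* posix_memalign() fails cleanly instead of aborting, which is
     * what the "try" variant needs */
    if (posix_memalign(&buf, align, size) != 0) {
        return NULL;
    }
    return buf;
}

With such an allocation in place, the 4096-byte O_DIRECT writes land on
4096-byte boundaries and the kernel can merge them into full-sized
requests, matching the "after" blktrace pattern above.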
Signed-off-by: Denis V. Lunev <d...@openvz.org>
CC: Paolo Bonzini <pbonz...@redhat.com>
CC: Kevin Wolf <kw...@redhat.com>
CC: Stefan Hajnoczi <stefa...@redhat.com>