On Thu, Aug 22, 2019 at 5:28 PM Max Reitz <mre...@redhat.com> wrote: > On 16.08.19 23:21, Nir Soffer wrote: > > When creating an image with preallocation "off" or "falloc", the first > > block of the image is typically not allocated. When using Gluster > > storage backed by XFS filesystem, reading this block using direct I/O > > succeeds regardless of request length, fooling alignment detection. > > > > In this case we fallback to a safe value (4096) instead of the optimal > > value (512), which may lead to unneeded data copying when aligning > > requests. Allocating the first block avoids the fallback. > > > > When using preallocation=off, we always allocate at least one filesystem > > block: > > > > $ ./qemu-img create -f raw test.raw 1g > > Formatting 'test.raw', fmt=raw size=1073741824 > > > > $ ls -lhs test.raw > > 4.0K -rw-r--r--. 1 nsoffer nsoffer 1.0G Aug 16 23:48 test.raw > > > > I did quick performance tests for these flows: > > - Provisioning a VM with a new raw image. > > - Copying disks with qemu-img convert to new raw target image > > > > I installed Fedora 29 server on raw sparse image, measuring the time > > from clicking "Begin installation" until the "Reboot" button appears: > > > > Before(s) After(s) Diff(%) > > ------------------------------- > > 356 389 +8.4 > > > > I ran this only once, so we cannot tell much from these results. > > So you’d expect it to be fast but it was slower? Well, you only ran it > once and it isn’t really a precise benchmark... > > > The second test was cloning the installation image with qemu-img > > convert, doing 10 runs: > > > > for i in $(seq 10); do > > rm -f dst.raw > > sleep 10 > > time ./qemu-img convert -f raw -O raw -t none -T none src.raw > dst.raw > > done > > > > Here is a table comparing the total time spent: > > > > Type Before(s) After(s) Diff(%) > > --------------------------------------- > > real 530.028 469.123 -11.4 > > user 17.204 10.768 -37.4 > > sys 17.881 7.011 -60.7 > > > > Here we see very clear improvement in CPU usage. > > > > Signed-off-by: Nir Soffer <nsof...@redhat.com> > > --- > > block/file-posix.c | 25 +++++++++++++++++++++++++ > > tests/qemu-iotests/150.out | 1 + > > tests/qemu-iotests/160 | 4 ++++ > > tests/qemu-iotests/175 | 19 +++++++++++++------ > > tests/qemu-iotests/175.out | 8 ++++---- > > tests/qemu-iotests/221.out | 12 ++++++++---- > > tests/qemu-iotests/253.out | 12 ++++++++---- > > 7 files changed, 63 insertions(+), 18 deletions(-) > > > > diff --git a/block/file-posix.c b/block/file-posix.c > > index b9c33c8f6c..3964dd2021 100644 > > --- a/block/file-posix.c > > +++ b/block/file-posix.c > > @@ -1755,6 +1755,27 @@ static int handle_aiocb_discard(void *opaque) > > return ret; > > } > > > > +/* > > + * Help alignment detection by allocating the first block. > > + * > > + * When reading with direct I/O from unallocated area on Gluster backed > by XFS, > > + * reading succeeds regardless of request length. In this case we > fallback to > > + * safe aligment which is not optimal. Allocating the first block > avoids this > > + * fallback. > > + * > > + * Returns: 0 on success, -errno on failure. > > + */ > > +static int allocate_first_block(int fd) > > +{ > > + ssize_t n; > > + > > + do { > > + n = pwrite(fd, "\0", 1, 0); > > This breaks when fd has been opened with O_DIRECT. >
It seems that we always open images without O_DIRECT when creating an image in qemu-img create, or when creating a target image in qemu-img convert. Here is a trace of qemu-img create: $ strace -f -tt -o /tmp/create.trace ./qemu-img create -f raw -o preallocation=falloc /tmp/gv1/src.raw 1g Formatting '/tmp/gv1/src.raw', fmt=raw size=1073741824 preallocation=falloc 1. open #1 17686 18:58:23.681921 openat(AT_FDCWD, "/tmp/gv1/src.raw", O_RDONLY|O_NONBLOCK|O_CLOEXEC) = 9 17686 18:58:23.683753 fstat(9, {st_mode=S_IFREG|0600, st_size=1073741824, ...}) = 0 17686 18:58:23.683829 close(9) = 0 2. open #2 17686 18:58:23.684146 openat(AT_FDCWD, "/tmp/gv1/src.raw", O_RDONLY|O_NONBLOCK|O_CLOEXEC) = 9 17686 18:58:23.684227 fstat(9, {st_mode=S_IFREG|0600, st_size=1073741824, ...}) = 0 17686 18:58:23.684256 close(9) = 0 3. open #3 17686 18:58:23.684339 openat(AT_FDCWD, "/tmp/gv1/src.raw", O_RDWR|O_CREAT|O_CLOEXEC, 0644) = 9 ... 17688 18:58:23.690178 fstat(9, {st_mode=S_IFREG|0600, st_size=1073741824, ...}) = 0 17688 18:58:23.690217 ftruncate(9, 0 <unfinished ...> ... 17688 18:58:23.700266 <... ftruncate resumed>) = 0 ... 17688 18:58:23.700595 fstat(9, <unfinished ...> ... 17688 18:58:23.700619 <... fstat resumed>{st_mode=S_IFREG|0600, st_size=0, ...}) = 0 ... 17688 18:58:23.700651 fallocate(9, 0, 0, 1073741824 <unfinished ...> ... 17688 18:58:23.728141 <... fallocate resumed>) = 0 ... 17688 18:58:23.728196 pwrite64(9, "\0", 1, 0) = 1 ... 17686 18:58:23.738391 close(9) = 0 Here is convert trace: $ strace -f -tt -o /tmp/convert.trace ./qemu-img convert -f raw -O raw -t none -T none /tmp/gv1/src.raw /tmp/gv1/dst.raw 1. open #1 18175 19:07:23.364417 openat(AT_FDCWD, "/tmp/gv1/dst.raw", O_RDONLY|O_NONBLOCK|O_CLOEXEC) = 10 18175 19:07:23.365282 fstat(10, {st_mode=S_IFREG|0600, st_size=1073741824, ...}) = 0 18175 19:07:23.365323 close(10) = 0 2. open #2 18175 19:07:23.365660 openat(AT_FDCWD, "/tmp/gv1/dst.raw", O_RDONLY|O_NONBLOCK|O_CLOEXEC) = 10 18175 19:07:23.365717 fstat(10, {st_mode=S_IFREG|0600, st_size=1073741824, ...}) = 0 18175 19:07:23.365742 close(10) = 0 3. open #3 18175 19:07:23.365839 openat(AT_FDCWD, "/tmp/gv1/dst.raw", O_RDWR|O_CREAT|O_CLOEXEC, 0644) = 10 ... 18177 19:07:23.372112 fstat(10, {st_mode=S_IFREG|0600, st_size=1073741824, ...}) = 0 18177 19:07:23.372138 ftruncate(10, 0) = 0 ... 18177 19:07:23.375760 fstat(10, {st_mode=S_IFREG|0600, st_size=0, ...}) = 0 18177 19:07:23.375788 ftruncate(10, 1073741824) = 0 18177 19:07:23.376383 pwrite64(10, "\0", 1, 0) = 1 ... 18175 19:07:23.390989 close(10) = 0 4. open #4 18175 19:07:23.391429 openat(AT_FDCWD, "/tmp/gv1/dst.raw", O_RDONLY|O_NONBLOCK|O_CLOEXEC) = 10 18175 19:07:23.392433 fstat(10, {st_mode=S_IFREG|0600, st_size=1073741824, ...}) = 0 18175 19:07:23.392483 close(10) = 0 5. open #5 18175 19:07:23.392743 openat(AT_FDCWD, "/tmp/gv1/dst.raw", O_RDWR|O_DIRECT|O_CLOEXEC) = 10 ... 18175 19:07:23.393731 ioctl(10, BLKSSZGET, 0x7ffe75ead334) = -1 ENOSYS (Function not implemented) 18175 19:07:23.393784 pread64(10, 0x558796451000, 1, 0) = -1 EINVAL (Invalid argument) 18175 19:07:23.395362 pread64(10, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 512, 0) = 512 18175 19:07:23.395905 pread64(10, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096, 0) = 4096 ... (Which happens when you open some file with cache.direct=on, and then > use e.g. QMP’s block_resize.) > What would be a command triggering this? I can add a test. It isn’t that bad because eventually you simply ignore the error. But > it still makes me wonder whether we shouldn’t write like the biggest > power of two that does not exceed the new file length or MAX_BLOCKSIZE. > It makes sense if there is a way to cause qemu-img to use O_DIRECT when creating an image. > + } while (n == -1 && errno == EINTR); > > + > > + return (n == -1) ? -errno : 0; > > +} > > + > > static int handle_aiocb_truncate(void *opaque) > > { > > RawPosixAIOData *aiocb = opaque; > > @@ -1794,6 +1815,8 @@ static int handle_aiocb_truncate(void *opaque) > > /* posix_fallocate() doesn't set errno. */ > > error_setg_errno(errp, -result, > > "Could not preallocate new data"); > > + } else if (current_length == 0) { > > + allocate_first_block(fd); > > Should posix_fallocate() not take care of precisely this? > Only if the filesystem does not support fallocate() (e.g. NFS < 4.2). In this case posix_fallocate() is doing: for (offset += (len - 1) % increment; len > 0; offset += increment) { len -= increment; if (offset < st.st_size) { unsigned char c; ssize_t rsize = __pread (fd, &c, 1, offset); if (rsize < 0) return errno; /* If there is a non-zero byte, the block must have been allocated already. */ else if (rsize == 1 && c != 0) continue; } if (__pwrite (fd, "", 1, offset) != 1) return errno; } https://code.woboq.org/userspace/glibc/sysdeps/posix/posix_fallocate.c.html#96 So opening a file with O_DIRECT will break preallocation=falloc on such filesystems, and writing one byte in allocate_first_block() is safe. > > } > > } else { > > result = 0; > > [...] > > > diff --git a/tests/qemu-iotests/160 b/tests/qemu-iotests/160 > > index df89d3864b..ad2d054a47 100755 > > --- a/tests/qemu-iotests/160 > > +++ b/tests/qemu-iotests/160 > > @@ -57,6 +57,10 @@ for skip in $TEST_SKIP_BLOCKS; do > > $QEMU_IMG dd if="$TEST_IMG" of="$TEST_IMG.out" skip="$skip" -O > "$IMGFMT" \ > > 2> /dev/null > > TEST_IMG="$TEST_IMG.out" _check_test_img > > + > > + # We always write the first byte of an image. > > + printf "\0" > "$TEST_IMG.out.dd" > > + > > dd if="$TEST_IMG" of="$TEST_IMG.out.dd" skip="$skip" status=none > > Won’t this dd completely overwrite $TEST_IMG.out.dd (especially given > the lack of conv=notrunc)? > There is an issue only if dd open the file with O_TRUNC. I will test this again. > > > echo > > diff --git a/tests/qemu-iotests/175 b/tests/qemu-iotests/175 > > index 51e62c8276..c6a3a7bb1e 100755 > > --- a/tests/qemu-iotests/175 > > +++ b/tests/qemu-iotests/175 > > @@ -37,14 +37,16 @@ trap "_cleanup; exit \$status" 0 1 2 3 15 > > # the file size. This function hides the resulting difference in the > > # stat -c '%b' output. > > # Parameter 1: Number of blocks an empty file occupies > > -# Parameter 2: Image size in bytes > > +# Parameter 2: Minimal number of blocks in an image > > +# Parameter 3: Image size in bytes > > _filter_blocks() > > { > > extra_blocks=$1 > > - img_size=$2 > > + min_blocks=$2 > > + img_size=$3 > > > > - sed -e "s/blocks=$extra_blocks\\(\$\\|[^0-9]\\)/nothing allocated/" > \ > > - -e "s/blocks=$((extra_blocks + img_size / > 512))\\(\$\\|[^0-9]\\)/everything allocated/" > > + sed -e "s/blocks=$((extra_blocks + > min_blocks))\\(\$\\|[^0-9]\\)/min allocation/" \ > > I don’t think adding extra_blocks to min_blocks makes sense. Just > min_blocks alone should be what we want here. > We had failing tests (in vdsm) showing that filesystem may return more blocs than expected even for non-empty files, so this may be needed. I did not test it yet with a filesystem that show this issue, but I found your instructions how to create it. Thanks for reviewing. Nir