On Thu, Aug 22, 2019 at 9:11 PM Max Reitz <mre...@redhat.com> wrote: > On 22.08.19 18:39, Nir Soffer wrote: > > On Thu, Aug 22, 2019 at 5:28 PM Max Reitz <mre...@redhat.com > > <mailto:mre...@redhat.com>> wrote: > > > > On 16.08.19 23:21, Nir Soffer wrote: > > > When creating an image with preallocation "off" or "falloc", the > first > > > block of the image is typically not allocated. When using Gluster > > > storage backed by XFS filesystem, reading this block using direct > I/O > > > succeeds regardless of request length, fooling alignment detection. > > > > > > In this case we fallback to a safe value (4096) instead of the > optimal > > > value (512), which may lead to unneeded data copying when aligning > > > requests. Allocating the first block avoids the fallback. > > > > > > When using preallocation=off, we always allocate at least one > > filesystem > > > block: > > > > > > $ ./qemu-img create -f raw test.raw 1g > > > Formatting 'test.raw', fmt=raw size=1073741824 > > > > > > $ ls -lhs test.raw > > > 4.0K -rw-r--r--. 1 nsoffer nsoffer 1.0G Aug 16 23:48 test.raw > > > > > > I did quick performance tests for these flows: > > > - Provisioning a VM with a new raw image. > > > - Copying disks with qemu-img convert to new raw target image > > > > > > I installed Fedora 29 server on raw sparse image, measuring the > time > > > from clicking "Begin installation" until the "Reboot" button > appears: > > > > > > Before(s) After(s) Diff(%) > > > ------------------------------- > > > 356 389 +8.4 > > > > > > I ran this only once, so we cannot tell much from these results. > > > > So you’d expect it to be fast but it was slower? Well, you only ran > it > > once and it isn’t really a precise benchmark... > > > > > The second test was cloning the installation image with qemu-img > > > convert, doing 10 runs: > > > > > > for i in $(seq 10); do > > > rm -f dst.raw > > > sleep 10 > > > time ./qemu-img convert -f raw -O raw -t none -T none > > src.raw dst.raw > > > done > > > > > > Here is a table comparing the total time spent: > > > > > > Type Before(s) After(s) Diff(%) > > > --------------------------------------- > > > real 530.028 469.123 -11.4 > > > user 17.204 10.768 -37.4 > > > sys 17.881 7.011 -60.7 > > > > > > Here we see very clear improvement in CPU usage. > > > > > > Signed-off-by: Nir Soffer <nsof...@redhat.com > > <mailto:nsof...@redhat.com>> > > > --- > > > block/file-posix.c | 25 +++++++++++++++++++++++++ > > > tests/qemu-iotests/150.out | 1 + > > > tests/qemu-iotests/160 | 4 ++++ > > > tests/qemu-iotests/175 | 19 +++++++++++++------ > > > tests/qemu-iotests/175.out | 8 ++++---- > > > tests/qemu-iotests/221.out | 12 ++++++++---- > > > tests/qemu-iotests/253.out | 12 ++++++++---- > > > 7 files changed, 63 insertions(+), 18 deletions(-) > > > > > > diff --git a/block/file-posix.c b/block/file-posix.c > > > index b9c33c8f6c..3964dd2021 100644 > > > --- a/block/file-posix.c > > > +++ b/block/file-posix.c > > > @@ -1755,6 +1755,27 @@ static int handle_aiocb_discard(void > *opaque) > > > return ret; > > > } > > > > > > +/* > > > + * Help alignment detection by allocating the first block. > > > + * > > > + * When reading with direct I/O from unallocated area on Gluster > > backed by XFS, > > > + * reading succeeds regardless of request length. In this case we > > fallback to > > > + * safe aligment which is not optimal. Allocating the first block > > avoids this > > > + * fallback. > > > + * > > > + * Returns: 0 on success, -errno on failure. > > > + */ > > > +static int allocate_first_block(int fd) > > > +{ > > > + ssize_t n; > > > + > > > + do { > > > + n = pwrite(fd, "\0", 1, 0); > > > > This breaks when fd has been opened with O_DIRECT. > > > > > > It seems that we always open images without O_DIRECT when creating an > image > > in qemu-img create, or when creating a target image in qemu-img convert. > > Yes. But you don’t call this function directly from image creation code > but instead from the truncation function. (The former also calls the > latter, but truncating is also an operation on its own.) > > [...] > > > (Which happens when you open some file with cache.direct=on, and then > > use e.g. QMP’s block_resize.) > > > > > > What would be a command triggering this? I can add a test. > > block_resize, as I’ve said: > > $ ./qemu-img create -f raw empty.img 0 >
This is extreme edge case - why would someone create such image? > $ x86_64-softmmu/qemu-system-x86_64 \ > -qmp stdio \ > -blockdev file,node-name=file,filename=empty.img,cache.direct=on \ > <<EOF > {'execute':'qmp_capabilities'} > This is probably too late for the allocation, since we already probed the alignment before executing block_resize, and used a safe fallback (4096). It can help if the image is reopened, since we probe alignment again. > {'execute':'block_resize', > 'arguments':{'node-name':'file', > 'size':1048576}} EOF > $ ./qemu-img map empty.img > Offset Length Mapped to File > > (You’d expect a data chunk here.) I suppose you can get the same effect with blockdev-create and some > format that explicitly resizes the file to some target length (LUKS does > this, I think), but this is the most direct route. > I will try to handle -blockdev in the next version. > It isn’t that bad because eventually you simply ignore the error. But > > it still makes me wonder whether we shouldn’t write like the biggest > > power of two that does not exceed the new file length or > MAX_BLOCKSIZE. > > > > > > It makes sense if there is a way to cause qemu-img to use O_DIRECT when > > creating an image. > > > > > + } while (n == -1 && errno == EINTR); > > > + > > > + return (n == -1) ? -errno : 0; > > > +} > > > + > > > static int handle_aiocb_truncate(void *opaque) > > > { > > > RawPosixAIOData *aiocb = opaque; > > > @@ -1794,6 +1815,8 @@ static int handle_aiocb_truncate(void > *opaque) > > > /* posix_fallocate() doesn't set errno. */ > > > error_setg_errno(errp, -result, > > > "Could not preallocate new > data"); > > > + } else if (current_length == 0) { > > > + allocate_first_block(fd); > > > > Should posix_fallocate() not take care of precisely this? > > > > > > Only if the filesystem does not support fallocate() (e.g. NFS < 4.2). > > > > In this case posix_fallocate() is doing: > > > > for (offset += (len - 1) % increment; len > 0; offset += increment) > > { > > len -= increment; > > if (offset < st.st_size) > > { > > unsigned char c; > > ssize_t rsize = __pread (fd, &c, 1, offset); > > if (rsize < 0) > > return errno; > > /* If there is a non-zero byte, the block must have been > > allocated already. */ > > else if (rsize == 1 && c != 0) > > continue; > > } > > if (__pwrite (fd, "", 1, offset) != 1) > > return errno; > > } > > > > > https://code.woboq.org/userspace/glibc/sysdeps/posix/posix_fallocate.c.html#96 > > > > So opening a file with O_DIRECT will break preallocation=falloc on such > > filesystems, > > But won’t the function above just fail with EINVAL? > allocate_first_block() is executed only in case of success. > Sure, but if posix_fallocate() fails, we fail qemu-img create/convert. > and writing one byte in allocate_first_block() is safe. > > > > > } > > > } else { > > > result = 0; > > > > [...] > > > > > diff --git a/tests/qemu-iotests/160 b/tests/qemu-iotests/160 > > > index df89d3864b..ad2d054a47 100755 > > > --- a/tests/qemu-iotests/160 > > > +++ b/tests/qemu-iotests/160 > > > @@ -57,6 +57,10 @@ for skip in $TEST_SKIP_BLOCKS; do > > > $QEMU_IMG dd if="$TEST_IMG" of="$TEST_IMG.out" skip="$skip" > > -O "$IMGFMT" \ > > > 2> /dev/null > > > TEST_IMG="$TEST_IMG.out" _check_test_img > > > + > > > + # We always write the first byte of an image. > > > + printf "\0" > "$TEST_IMG.out.dd" > > > + > > > dd if="$TEST_IMG" of="$TEST_IMG.out.dd" skip="$skip" > status=none > > > > Won’t this dd completely overwrite $TEST_IMG.out.dd (especially given > > the lack of conv=notrunc)? > > > > > > There is an issue only if dd open the file with O_TRUNC. > > It isn’t an issue, I just don’t understand why the printf would be > necessary at all. > > dd should always truncate the output image unless conv=notrunc is > specified. But even if it didn’t do that, in all of these test cases it > should copy some data from $TEST_IMG to the output, and thus should > always overwrite the first byte anyway. > Right, this change is not needed. > I will test > > this again. > > > > > > > > echo > > > diff --git a/tests/qemu-iotests/175 b/tests/qemu-iotests/175 > > > index 51e62c8276..c6a3a7bb1e 100755 > > > --- a/tests/qemu-iotests/175 > > > +++ b/tests/qemu-iotests/175 > > > @@ -37,14 +37,16 @@ trap "_cleanup; exit \$status" 0 1 2 3 15 > > > # the file size. This function hides the resulting difference in > the > > > # stat -c '%b' output. > > > # Parameter 1: Number of blocks an empty file occupies > > > -# Parameter 2: Image size in bytes > > > +# Parameter 2: Minimal number of blocks in an image > > > +# Parameter 3: Image size in bytes > > > _filter_blocks() > > > { > > > extra_blocks=$1 > > > - img_size=$2 > > > + min_blocks=$2 > > > + img_size=$3 > > > > > > - sed -e "s/blocks=$extra_blocks\\(\$\\|[^0-9]\\)/nothing > > allocated/" \ > > > - -e "s/blocks=$((extra_blocks + img_size / > > 512))\\(\$\\|[^0-9]\\)/everything allocated/" > > > + sed -e "s/blocks=$((extra_blocks + > > min_blocks))\\(\$\\|[^0-9]\\)/min allocation/" \ > > > > I don’t think adding extra_blocks to min_blocks makes sense. Just > > min_blocks alone should be what we want here. > > > > > > We had failing tests (in vdsm) showing that filesystem may return more > > blocs than > > expected even for non-empty files, so this may be needed. > > But min_blocks is exactly the number of blocks of a file that has one > allocated block. I don’t see how adding the number of blocks an empty > file occupies makes sense. > You are right. The test fails on filesystem that allocates extra blocks. Nir