On Thu, Aug 22, 2019 at 5:28 PM Max Reitz <mre...@redhat.com> wrote:

> On 16.08.19 23:21, Nir Soffer wrote:
> > When creating an image with preallocation "off" or "falloc", the first
> > block of the image is typically not allocated. When using Gluster
> > storage backed by XFS filesystem, reading this block using direct I/O
> > succeeds regardless of request length, fooling alignment detection.
> >
> > In this case we fallback to a safe value (4096) instead of the optimal
> > value (512), which may lead to unneeded data copying when aligning
> > requests.  Allocating the first block avoids the fallback.
> >
> > When using preallocation=off, we always allocate at least one filesystem
> > block:
> >
> >     $ ./qemu-img create -f raw test.raw 1g
> >     Formatting 'test.raw', fmt=raw size=1073741824
> >
> >     $ ls -lhs test.raw
> >     4.0K -rw-r--r--. 1 nsoffer nsoffer 1.0G Aug 16 23:48 test.raw
> >
> > I did quick performance tests for these flows:
> > - Provisioning a VM with a new raw image.
> > - Copying disks with qemu-img convert to new raw target image
> >
> > I installed Fedora 29 server on raw sparse image, measuring the time
> > from clicking "Begin installation" until the "Reboot" button appears:
> >
> > Before(s)  After(s)     Diff(%)
> > -------------------------------
> >      356        389        +8.4
> >
> > I ran this only once, so we cannot tell much from these results.
>
> So you’d expect it to be fast but it was slower?  Well, you only ran it
> once and it isn’t really a precise benchmark...
>
> > The second test was cloning the installation image with qemu-img
> > convert, doing 10 runs:
> >
> >     for i in $(seq 10); do
> >         rm -f dst.raw
> >         sleep 10
> >         time ./qemu-img convert -f raw -O raw -t none -T none src.raw
> dst.raw
> >     done
> >
> > Here is a table comparing the total time spent:
> >
> > Type    Before(s)   After(s)    Diff(%)
> > ---------------------------------------
> > real      530.028    469.123      -11.4
> > user       17.204     10.768      -37.4
> > sys        17.881      7.011      -60.7
> >
> > Here we see very clear improvement in CPU usage.
> >
> > Signed-off-by: Nir Soffer <nsof...@redhat.com>
> > ---
> >  block/file-posix.c         | 25 +++++++++++++++++++++++++
> >  tests/qemu-iotests/150.out |  1 +
> >  tests/qemu-iotests/160     |  4 ++++
> >  tests/qemu-iotests/175     | 19 +++++++++++++------
> >  tests/qemu-iotests/175.out |  8 ++++----
> >  tests/qemu-iotests/221.out | 12 ++++++++----
> >  tests/qemu-iotests/253.out | 12 ++++++++----
> >  7 files changed, 63 insertions(+), 18 deletions(-)
> >
> > diff --git a/block/file-posix.c b/block/file-posix.c
> > index b9c33c8f6c..3964dd2021 100644
> > --- a/block/file-posix.c
> > +++ b/block/file-posix.c
> > @@ -1755,6 +1755,27 @@ static int handle_aiocb_discard(void *opaque)
> >      return ret;
> >  }
> >
> > +/*
> > + * Help alignment detection by allocating the first block.
> > + *
> > + * When reading with direct I/O from unallocated area on Gluster backed
> by XFS,
> > + * reading succeeds regardless of request length. In this case we
> fallback to
> > + * safe aligment which is not optimal. Allocating the first block
> avoids this
> > + * fallback.
> > + *
> > + * Returns: 0 on success, -errno on failure.
> > + */
> > +static int allocate_first_block(int fd)
> > +{
> > +    ssize_t n;
> > +
> > +    do {
> > +        n = pwrite(fd, "\0", 1, 0);
>
> This breaks when fd has been opened with O_DIRECT.
>

It seems that we always open images without O_DIRECT when creating an image
in qemu-img create, or when creating a target image in qemu-img convert.

Here is a trace of qemu-img create:

$ strace -f -tt -o /tmp/create.trace ./qemu-img create -f raw -o
preallocation=falloc /tmp/gv1/src.raw 1g
Formatting '/tmp/gv1/src.raw', fmt=raw size=1073741824 preallocation=falloc

1. open #1

17686 18:58:23.681921 openat(AT_FDCWD, "/tmp/gv1/src.raw",
O_RDONLY|O_NONBLOCK|O_CLOEXEC) = 9
17686 18:58:23.683753 fstat(9, {st_mode=S_IFREG|0600, st_size=1073741824,
...}) = 0
17686 18:58:23.683829 close(9)          = 0

2. open #2

17686 18:58:23.684146 openat(AT_FDCWD, "/tmp/gv1/src.raw",
O_RDONLY|O_NONBLOCK|O_CLOEXEC) = 9
17686 18:58:23.684227 fstat(9, {st_mode=S_IFREG|0600, st_size=1073741824,
...}) = 0
17686 18:58:23.684256 close(9)          = 0

3. open #3

17686 18:58:23.684339 openat(AT_FDCWD, "/tmp/gv1/src.raw",
O_RDWR|O_CREAT|O_CLOEXEC, 0644) = 9
...
17688 18:58:23.690178 fstat(9, {st_mode=S_IFREG|0600, st_size=1073741824,
...}) = 0
17688 18:58:23.690217 ftruncate(9, 0 <unfinished ...>
...
17688 18:58:23.700266 <... ftruncate resumed>) = 0
...
17688 18:58:23.700595 fstat(9,  <unfinished ...>
...
17688 18:58:23.700619 <... fstat resumed>{st_mode=S_IFREG|0600, st_size=0,
...}) = 0
...
17688 18:58:23.700651 fallocate(9, 0, 0, 1073741824 <unfinished ...>
...
17688 18:58:23.728141 <... fallocate resumed>) = 0
...
17688 18:58:23.728196 pwrite64(9, "\0", 1, 0) = 1
...
17686 18:58:23.738391 close(9)          = 0

Here is convert trace:

$ strace -f -tt -o /tmp/convert.trace ./qemu-img convert -f raw -O raw -t
none -T none /tmp/gv1/src.raw /tmp/gv1/dst.raw

1. open #1

18175 19:07:23.364417 openat(AT_FDCWD, "/tmp/gv1/dst.raw",
O_RDONLY|O_NONBLOCK|O_CLOEXEC) = 10
18175 19:07:23.365282 fstat(10, {st_mode=S_IFREG|0600, st_size=1073741824,
...}) = 0
18175 19:07:23.365323 close(10)         = 0

2. open #2

18175 19:07:23.365660 openat(AT_FDCWD, "/tmp/gv1/dst.raw",
O_RDONLY|O_NONBLOCK|O_CLOEXEC) = 10
18175 19:07:23.365717 fstat(10, {st_mode=S_IFREG|0600, st_size=1073741824,
...}) = 0
18175 19:07:23.365742 close(10)         = 0

3. open #3

18175 19:07:23.365839 openat(AT_FDCWD, "/tmp/gv1/dst.raw",
O_RDWR|O_CREAT|O_CLOEXEC, 0644) = 10
...
18177 19:07:23.372112 fstat(10, {st_mode=S_IFREG|0600, st_size=1073741824,
...}) = 0
18177 19:07:23.372138 ftruncate(10, 0)  = 0
...
18177 19:07:23.375760 fstat(10, {st_mode=S_IFREG|0600, st_size=0, ...}) = 0
18177 19:07:23.375788 ftruncate(10, 1073741824) = 0
18177 19:07:23.376383 pwrite64(10, "\0", 1, 0) = 1
...
18175 19:07:23.390989 close(10)         = 0

4. open #4

18175 19:07:23.391429 openat(AT_FDCWD, "/tmp/gv1/dst.raw",
O_RDONLY|O_NONBLOCK|O_CLOEXEC) = 10
18175 19:07:23.392433 fstat(10, {st_mode=S_IFREG|0600, st_size=1073741824,
...}) = 0
18175 19:07:23.392483 close(10)         = 0

5. open #5

18175 19:07:23.392743 openat(AT_FDCWD, "/tmp/gv1/dst.raw",
O_RDWR|O_DIRECT|O_CLOEXEC) = 10
...
18175 19:07:23.393731 ioctl(10, BLKSSZGET, 0x7ffe75ead334) = -1 ENOSYS
(Function not implemented)
18175 19:07:23.393784 pread64(10, 0x558796451000, 1, 0) = -1 EINVAL
(Invalid argument)
18175 19:07:23.395362 pread64(10,
"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 512,
0) = 512
18175 19:07:23.395905 pread64(10,
"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,
4096, 0) = 4096
...

(Which happens when you open some file with cache.direct=on, and then
> use e.g. QMP’s block_resize.)
>

What would be a command triggering this? I can add a test.

It isn’t that bad because eventually you simply ignore the error.  But
> it still makes me wonder whether we shouldn’t write like the biggest
> power of two that does not exceed the new file length or MAX_BLOCKSIZE.
>

It makes sense if there is a way to cause qemu-img to use O_DIRECT when
creating an image.

> +    } while (n == -1 && errno == EINTR);
> > +
> > +    return (n == -1) ? -errno : 0;
> > +}
> > +
> >  static int handle_aiocb_truncate(void *opaque)
> >  {
> >      RawPosixAIOData *aiocb = opaque;
> > @@ -1794,6 +1815,8 @@ static int handle_aiocb_truncate(void *opaque)
> >                  /* posix_fallocate() doesn't set errno. */
> >                  error_setg_errno(errp, -result,
> >                                   "Could not preallocate new data");
> > +            } else if (current_length == 0) {
> > +                allocate_first_block(fd);
>
> Should posix_fallocate() not take care of precisely this?
>

Only if the filesystem does not support fallocate() (e.g. NFS < 4.2).

In this case posix_fallocate() is doing:

  for (offset += (len - 1) % increment; len > 0; offset += increment)
    {
      len -= increment;
      if (offset < st.st_size)
        {
          unsigned char c;
          ssize_t rsize = __pread (fd, &c, 1, offset);
          if (rsize < 0)
            return errno;
          /* If there is a non-zero byte, the block must have been
             allocated already.  */
          else if (rsize == 1 && c != 0)
            continue;
        }
      if (__pwrite (fd, "", 1, offset) != 1)
        return errno;
    }

https://code.woboq.org/userspace/glibc/sysdeps/posix/posix_fallocate.c.html#96

So opening a file with O_DIRECT will break preallocation=falloc on such
filesystems,
and writing one byte in allocate_first_block() is safe.

> >              }
> >          } else {
> >              result = 0;
>
> [...]
>
> > diff --git a/tests/qemu-iotests/160 b/tests/qemu-iotests/160
> > index df89d3864b..ad2d054a47 100755
> > --- a/tests/qemu-iotests/160
> > +++ b/tests/qemu-iotests/160
> > @@ -57,6 +57,10 @@ for skip in $TEST_SKIP_BLOCKS; do
> >      $QEMU_IMG dd if="$TEST_IMG" of="$TEST_IMG.out" skip="$skip" -O
> "$IMGFMT" \
> >          2> /dev/null
> >      TEST_IMG="$TEST_IMG.out" _check_test_img
> > +
> > +    # We always write the first byte of an image.
> > +    printf "\0" > "$TEST_IMG.out.dd"
> > +
> >      dd if="$TEST_IMG" of="$TEST_IMG.out.dd" skip="$skip" status=none
>
> Won’t this dd completely overwrite $TEST_IMG.out.dd (especially given
> the lack of conv=notrunc)?
>

There is an issue only if dd open the file with O_TRUNC. I will test this
again.

>
> >      echo
> > diff --git a/tests/qemu-iotests/175 b/tests/qemu-iotests/175
> > index 51e62c8276..c6a3a7bb1e 100755
> > --- a/tests/qemu-iotests/175
> > +++ b/tests/qemu-iotests/175
> > @@ -37,14 +37,16 @@ trap "_cleanup; exit \$status" 0 1 2 3 15
> >  # the file size.  This function hides the resulting difference in the
> >  # stat -c '%b' output.
> >  # Parameter 1: Number of blocks an empty file occupies
> > -# Parameter 2: Image size in bytes
> > +# Parameter 2: Minimal number of blocks in an image
> > +# Parameter 3: Image size in bytes
> >  _filter_blocks()
> >  {
> >      extra_blocks=$1
> > -    img_size=$2
> > +    min_blocks=$2
> > +    img_size=$3
> >
> > -    sed -e "s/blocks=$extra_blocks\\(\$\\|[^0-9]\\)/nothing allocated/"
> \
> > -        -e "s/blocks=$((extra_blocks + img_size /
> 512))\\(\$\\|[^0-9]\\)/everything allocated/"
> > +    sed -e "s/blocks=$((extra_blocks +
> min_blocks))\\(\$\\|[^0-9]\\)/min allocation/" \
>
> I don’t think adding extra_blocks to min_blocks makes sense.  Just
> min_blocks alone should be what we want here.
>

We had failing tests (in vdsm) showing that filesystem may return more
blocs than
expected even for non-empty files, so this may be needed. I did not test
it yet with a filesystem that show this issue, but I found your instructions
how to create it.

Thanks for reviewing.

Nir

Reply via email to