Re: Dropping posix_fallocate for -o preallocation falloc
On Tue, Aug 25, 2020 at 2:27 PM Kevin Wolf wrote: > > Am 23.08.2020 um 16:46 hat Nir Soffer geschrieben: > > Using -o preallocation falloc works great on NFS 4.2 and local file system, > > when fallocate() is supported, but when it is not, posix_fallocate falls > > back > > to very inefficient way: > > https://code.woboq.org/userspace/glibc/sysdeps/posix/posix_fallocate.c.html#96 > > > > This will read the last byte for every 4k block, and if the byte is > > null, write one null byte. > > > > This minimizes the amount of data sent over the wire, but is very slow. > > > > In file-posix we optimize this flow by not truncating the file to the > > final size, so this will only write one null byte for every 4k block, > > but this is still very slow. > > > > Except the poor performance, we have a bug showing that for some reason, > > this does not work well with OFD locking: > > https://bugzilla.redhat.com/1851097 > > > > In oVirt 4.4.2 we avoid the issue by not using -o preallocation > > falloc. Instead we use our own fallocate helper: > > https://github.com/oVirt/vdsm/blob/master/helpers/fallocate > > > > (We got feedback that the name of this helper is confusing since it does > > destructive operation when fallocate() is not supported. We will > > change the name) > > > > This helper is similar to posix_fallocate, but instead of falling back > > to writing one byte per 4k block, it falls back to writing zeros in > > large blocks. > > > > Testing shows that this improves fallocation time by 385% for one disk, and > > 468% for 10 concurrent disk preallocation: > > https://bugzilla.redhat.com/1850267#c25 > > > > I think the next step is to move this change into qemu, so all users can > > benefit from this change. > > > > I think the way to do this is to replace posix_fallocate() with fallocate(), > > and fallback to "full" preallocation if fallocate is not supported. > > Yes, this makes sense to me. > > However, what full preallocation uses, is not exactly a large buffer > either (64k). Especially with your desire to use this with O_DIRECT, > we'd better use something like a 2 MB buffer. > > Actually, looking at the code, it's completely broken for O_DIRECT > because it g_malloc()s the (then misaligned) buffer. This needs to > become qemu_try_blockalign0(). It's currently failing: > > $ ./qemu-img create -f raw ~/tmp/test.raw 64M > Formatting '/home/kwolf/tmp/test.raw', fmt=raw size=67108864 > $ ./qemu-img resize --image-opts --preallocation=full > driver=file,filename=/home/kwolf/tmp/test.raw,cache.direct=on 72M > qemu-img: Could not write zeros for preallocation: Invalid argument > > > However with current code, in qemu-img create, we don't have a way to > > force O_DIRECT for the preallocation, and in qemu-img convert the > > preallocation step does not respect the -t none flag. Not using > > O_DIRECT in oVirt is very bad, and likely to cause timeouts in sanlock > > when the kernel flushes the page cache. > > If we're talking about regular files, the easy way without changing QEMU > would be to use the same way as in my example above: Create a 0 byte > image file and then resize it with full preallocation. > > But I can see that passing a cache mode to image creation is desirable. > So are probably some other options for image creation processes where > the image is actually opened with its block driver. > > Possible options I see in the order from an ad-hoc hack to a fully > generic interface: > > 1. Just add a cache mode create option to file-posix. qemu-img create >would mostly ignore this option for qcow2 (it would use O_DIRECT for >creating the raw image file, but not for opening when writing the >qcow2 data, including preallocation) > >QMP blockde-create would allow you to get the desired behaviour >because you open the raw layer explicitly there. It just wouldn't be >available in qemu-img create. > > 2. Add a cache mode/flags parameter to .bdrv_co_create_opts (the legacy >callback used for qemu-img create, but not QMP blockdev-create). >Format drivers would then use that for their bdrv_open() call on the >protocol layer; file-posix would directly use it for opening the >file. > > 3. Add a full BlockdevOptions parameter to .bdrv_co_create_opts() that >is used for the bdrv_open() of the protocol layer. I guess this would >bring in features that are currently not accessible from qemu-img >create because they can't be encoded in a URL but require options >(such as multiple servers for rbd or gluster). > > 4. Have some kind of full blockdev-create support in qemu-img create. >This wouldn't require any changes to the block drivers, but finding a >good syntax for qemu-img create might not be trivial. > > Given that 1-3 are all hacks to the legacy interface, maybe 4 is what we > should be doing to expose the full set of QEMU features in qemu-img > create, even though it poses interface questions that we wouldn't have > to answer with the other
Re: Dropping posix_fallocate for -o preallocation falloc
Am 23.08.2020 um 16:46 hat Nir Soffer geschrieben: > Using -o preallocation falloc works great on NFS 4.2 and local file system, > when fallocate() is supported, but when it is not, posix_fallocate falls back > to very inefficient way: > https://code.woboq.org/userspace/glibc/sysdeps/posix/posix_fallocate.c.html#96 > > This will read the last byte for every 4k block, and if the byte is > null, write one null byte. > > This minimizes the amount of data sent over the wire, but is very slow. > > In file-posix we optimize this flow by not truncating the file to the > final size, so this will only write one null byte for every 4k block, > but this is still very slow. > > Except the poor performance, we have a bug showing that for some reason, > this does not work well with OFD locking: > https://bugzilla.redhat.com/1851097 > > In oVirt 4.4.2 we avoid the issue by not using -o preallocation > falloc. Instead we use our own fallocate helper: > https://github.com/oVirt/vdsm/blob/master/helpers/fallocate > > (We got feedback that the name of this helper is confusing since it does > destructive operation when fallocate() is not supported. We will > change the name) > > This helper is similar to posix_fallocate, but instead of falling back > to writing one byte per 4k block, it falls back to writing zeros in > large blocks. > > Testing shows that this improves fallocation time by 385% for one disk, and > 468% for 10 concurrent disk preallocation: > https://bugzilla.redhat.com/1850267#c25 > > I think the next step is to move this change into qemu, so all users can > benefit from this change. > > I think the way to do this is to replace posix_fallocate() with fallocate(), > and fallback to "full" preallocation if fallocate is not supported. Yes, this makes sense to me. However, what full preallocation uses, is not exactly a large buffer either (64k). Especially with your desire to use this with O_DIRECT, we'd better use something like a 2 MB buffer. Actually, looking at the code, it's completely broken for O_DIRECT because it g_malloc()s the (then misaligned) buffer. This needs to become qemu_try_blockalign0(). It's currently failing: $ ./qemu-img create -f raw ~/tmp/test.raw 64M Formatting '/home/kwolf/tmp/test.raw', fmt=raw size=67108864 $ ./qemu-img resize --image-opts --preallocation=full driver=file,filename=/home/kwolf/tmp/test.raw,cache.direct=on 72M qemu-img: Could not write zeros for preallocation: Invalid argument > However with current code, in qemu-img create, we don't have a way to > force O_DIRECT for the preallocation, and in qemu-img convert the > preallocation step does not respect the -t none flag. Not using > O_DIRECT in oVirt is very bad, and likely to cause timeouts in sanlock > when the kernel flushes the page cache. If we're talking about regular files, the easy way without changing QEMU would be to use the same way as in my example above: Create a 0 byte image file and then resize it with full preallocation. But I can see that passing a cache mode to image creation is desirable. So are probably some other options for image creation processes where the image is actually opened with its block driver. Possible options I see in the order from an ad-hoc hack to a fully generic interface: 1. Just add a cache mode create option to file-posix. qemu-img create would mostly ignore this option for qcow2 (it would use O_DIRECT for creating the raw image file, but not for opening when writing the qcow2 data, including preallocation) QMP blockde-create would allow you to get the desired behaviour because you open the raw layer explicitly there. It just wouldn't be available in qemu-img create. 2. Add a cache mode/flags parameter to .bdrv_co_create_opts (the legacy callback used for qemu-img create, but not QMP blockdev-create). Format drivers would then use that for their bdrv_open() call on the protocol layer; file-posix would directly use it for opening the file. 3. Add a full BlockdevOptions parameter to .bdrv_co_create_opts() that is used for the bdrv_open() of the protocol layer. I guess this would bring in features that are currently not accessible from qemu-img create because they can't be encoded in a URL but require options (such as multiple servers for rbd or gluster). 4. Have some kind of full blockdev-create support in qemu-img create. This wouldn't require any changes to the block drivers, but finding a good syntax for qemu-img create might not be trivial. Given that 1-3 are all hacks to the legacy interface, maybe 4 is what we should be doing to expose the full set of QEMU features in qemu-img create, even though it poses interface questions that we wouldn't have to answer with the other options. > So needed changes are: > > 1. Add a way to control cache in qemu-img create (-t none? -o cache=none?) > 2. Respect -t none in qemu-img convert -o preallocation falloc -o cache=none would be what results from my option 1.
Dropping posix_fallocate for -o preallocation falloc
Using -o preallocation falloc works great on NFS 4.2 and local file system, when fallocate() is supported, but when it is not, posix_fallocate falls back to very inefficient way: https://code.woboq.org/userspace/glibc/sysdeps/posix/posix_fallocate.c.html#96 This will read the last byte for every 4k block, and if the byte is null, write one null byte. This minimizes the amount of data sent over the wire, but is very slow. In file-posix we optimize this flow by not truncating the file to the final size, so this will only write one null byte for every 4k block, but this is still very slow. Except the poor performance, we have a bug showing that for some reason, this does not work well with OFD locking: https://bugzilla.redhat.com/1851097 In oVirt 4.4.2 we avoid the issue by not using -o preallocation falloc. Instead we use our own fallocate helper: https://github.com/oVirt/vdsm/blob/master/helpers/fallocate (We got feedback that the name of this helper is confusing since it does destructive operation when fallocate() is not supported. We will change the name) This helper is similar to posix_fallocate, but instead of falling back to writing one byte per 4k block, it falls back to writing zeros in large blocks. Testing shows that this improves fallocation time by 385% for one disk, and 468% for 10 concurrent disk preallocation: https://bugzilla.redhat.com/1850267#c25 I think the next step is to move this change into qemu, so all users can benefit from this change. I think the way to do this is to replace posix_fallocate() with fallocate(), and fallback to "full" preallocation if fallocate is not supported. However with current code, in qemu-img create, we don't have a way to force O_DIRECT for the preallocation, and in qemu-img convert the preallocation step does not respect the -t none flag. Not using O_DIRECT in oVirt is very bad, and likely to cause timeouts in sanlock when the kernel flushes the page cache. So needed changes are: 1. Add a way to control cache in qemu-img create (-t none? -o cache=none?) 2. Respect -t none in qemu-img convert -o preallocation falloc 3. Replace posix_falloate to fallocate https://github.com/qemu/qemu/blob/152be6de9100e58b5d896272e951d4c910bd735a/block/file-posix.c#L1868 4. Fall back to full zeroing if fallocate is not supported https://github.com/qemu/qemu/blob/152be6de9100e58b5d896272e951d4c910bd735a/block/file-posix.c#L1891 5. Probably use larger zero buffer, 64k is not efficient https://github.com/qemu/qemu/blob/152be6de9100e58b5d896272e951d4c910bd735a/block/file-posix.c#L1907 What do you think? Nir