On Sat, Nov 17, 2018 at 10:59:26PM +0200, Nir Soffer wrote:
> On Fri, Nov 16, 2018 at 5:26 PM Kevin Wolf <kw...@redhat.com> wrote:
>
> > Am 15.11.2018 um 23:27 hat Nir Soffer geschrieben:
> > > On Sun, Nov 11, 2018 at 6:11 PM Nir Soffer <nsof...@redhat.com> wrote:
> > >
> > > > On Wed, Nov 7, 2018 at 7:55 PM Nir Soffer <nsof...@redhat.com> wrote:
> > > >
> > > >> On Wed, Nov 7, 2018 at 7:27 PM Kevin Wolf <kw...@redhat.com> wrote:
> > > >>
> > > >>> Am 07.11.2018 um 15:56 hat Nir Soffer geschrieben:
> > > >>> > Wed, Nov 7, 2018 at 4:36 PM Richard W.M. Jones <rjo...@redhat.com> wrote:
> > > >>> > >
> > > >>> > > Another thing I tried was to change the NBD server (nbdkit) so that it
> > > >>> > > doesn't advertise zero support to the client:
> > > >>> > >
> > > >>> > >   $ nbdkit --filter=log --filter=nozero memory size=6G logfile=/tmp/log \
> > > >>> > >       --run './qemu-img convert ./fedora-28.img -n $nbd'
> > > >>> > >   $ grep '\.\.\.$' /tmp/log | sed 's/.*\([A-Z][a-z]*\).*/\1/' | uniq -c
> > > >>> > >    2154 Write
> > > >>> > >
> > > >>> > > Not surprisingly no zero commands are issued.  The size of the write
> > > >>> > > commands is very uneven -- it appears to send one command per block
> > > >>> > > of zeroes or data.
> > > >>> > >
> > > >>> > > Nir: If we could get information from imageio about whether zeroing is
> > > >>> > > implemented efficiently or not by the backend, we could change
> > > >>> > > virt-v2v / nbdkit to advertise this back to qemu.
> > > >>> >
> > > >>> > There is no way to detect the capability, ioctl(BLKZEROOUT) always
> > > >>> > succeeds, falling back to manual zeroing in the kernel silently.
> > > >>> >
> > > >>> > Even if we could, sending zeroes on the wire from qemu may be even
> > > >>> > slower, and it looks like qemu sends even more requests in this case
> > > >>> > (2154 vs ~1300).
> > > >>> >
> > > >>> > Looks like this optimization on the qemu side leads to worse performance,
> > > >>> > so it should not be enabled by default.
> > > >>>
> > > >>> Well, that's overgeneralising your case a bit. If the backend does
> > > >>> support efficient zero writes (which file systems, the most common case,
> > > >>> generally do), doing one big write_zeroes request at the start can
> > > >>> improve performance quite a bit.
> > > >>>
> > > >>> It seems the problem is that we can't really know whether the operation
> > > >>> will be efficient because the backends generally don't tell us. Maybe
> > > >>> NBD could introduce a flag for this, but in the general case it appears
> > > >>> to me that we'll have to have a command line option.
> > > >>>
> > > >>> However, I'm curious what your exact use case and the backend used in it
> > > >>> is? Can something be improved there to actually get efficient zero
> > > >>> writes and get even better performance than by just disabling the big
> > > >>> zero write?
> > > >>
> > > >> The backend is some NetApp storage connected via FC. I don't have
> > > >> more info on this. We get a zero rate of about 1G/s on this storage,
> > > >> which is quite slow compared with other storage we tested.
> > > >>
> > > >> One option we are checking now is whether this is the kernel's silent
> > > >> fallback to manual zeroing when the server advertises a wrong value of
> > > >> write_same_max_bytes.
> > > >
> > > > We eliminated this using blkdiscard.
> > > > This is what we get with this storage when zeroing a 100G LV:
> > > >
> > > > for i in 1 2 4 8 16 32; do time blkdiscard -z -p ${i}m
> > > > /dev/6e1d84f9-f939-46e9-b108-0427a08c280c/2d5c06ce-6536-4b3c-a7b6-13c6d8e55ade;
> > > > done
> > > >
> > > > real 4m50.851s
> > > > user 0m0.065s
> > > > sys 0m1.482s
> > > >
> > > > real 4m30.504s
> > > > user 0m0.047s
> > > > sys 0m0.870s
> > > >
> > > > real 4m19.443s
> > > > user 0m0.029s
> > > > sys 0m0.508s
> > > >
> > > > real 4m13.016s
> > > > user 0m0.020s
> > > > sys 0m0.284s
> > > >
> > > > real 2m45.888s
> > > > user 0m0.011s
> > > > sys 0m0.162s
> > > >
> > > > real 2m10.153s
> > > > user 0m0.003s
> > > > sys 0m0.100s
> > > >
> > > > We are investigating why we get low throughput on this server, and we
> > > > will also check several other servers.
> > > >
> > > >> Having a command line option to control this behavior sounds good. I don't
> > > >> have enough data to tell what should be the default, but I think the safe
> > > >> way would be to keep the old behavior.
> > > >
> > > > We filed this bug:
> > > > https://bugzilla.redhat.com/1648622
> > >
> > > More data from even slower storage - zeroing a 10G LV on Kaminario K2:
> > >
> > > # time blkdiscard -z -p 32m /dev/test_vg/test_lv2
> > >
> > > real 50m12.425s
> > > user 0m0.018s
> > > sys 2m6.785s
> > >
> > > Maybe something is wrong with this storage, since we see this:
> > >
> > > # grep -s "" /sys/block/dm-29/queue/* | grep write_same_max_bytes
> > > /sys/block/dm-29/queue/write_same_max_bytes:512
> > >
> > > Since BLKZEROOUT always falls back to manual slow zeroing silently,
> > > maybe we can disable the aggressive pre-zero of the entire device
> > > for block devices, and keep this optimization for files when fallocate()
> > > is supported?
> >
> > I'm not sure what the detour through NBD changes, but qemu-img directly
> > on a block device doesn't use BLKZEROOUT first, but
> > FALLOC_FL_PUNCH_HOLE.
>
> Looking at block/file-posix.c (83c496599cc04926ecbc3e47a37debaa3e38b686)
> we don't use PUNCH_HOLE for block devices:
>
> 1472     if (aiocb->aio_type & QEMU_AIO_BLKDEV) {
> 1473         return handle_aiocb_write_zeroes_block(aiocb);
> 1474     }
>
> qemu uses BLKZEROOUT, which is not guaranteed to be fast on the storage
> side, and even worse, falls back silently to manual zeroing if the storage
> does not support WRITE_SAME.
>
> > Maybe we can add a flag that avoids anything that
> > could be slow, such as BLKZEROOUT, as a fallback (and also the slow
> > emulation that QEMU itself would do if all kernel calls fail).
>
> But the issue here is not how qemu-img handles this case, but how the NBD
> server can handle it. NBD may support zeroing, but there is no way to tell
> whether zeroing is going to be fast, since the backend writing zeroes to
> storage has the same limits as qemu-img.
>
> So I think we need to fix the performance regression in 2.12 by enabling
> the pre-zero of the entire disk only if FALLOC_FL_PUNCH_HOLE can be used,
> and only if it can be used without a fallback to a slow zero method.
>
> Enabling this optimization for anything else requires changing the entire
> stack (storage, kernel, NBD protocol) to support reporting a fast zero
> capability, or limiting zeroing to fast operations.
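As an aside, the "pre-zero only if punching holes actually works" check
described above could look roughly like the sketch below.  This is only an
illustration, assuming Linux/glibc; the helper name, the probe policy and
the comments are invented here and are not qemu or imageio code.

#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/falloc.h>
#include <stdbool.h>

/* Probe whether fallocate() can punch holes on this fd.  The probe itself
 * zeroes the first probe_len bytes, which is acceptable when the whole
 * device is about to be overwritten anyway (as in qemu-img convert -n). */
static bool can_punch_hole(int fd, off_t probe_len)
{
    if (fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                  0, probe_len) == 0) {
        return true;
    }
    /* Punching holes is unsupported (or failed): skip the whole-device
     * pre-zero and rely on sparse writes, rather than risking a slow
     * zeroing path such as a silent BLKZEROOUT fallback. */
    return false;
}

A caller would run this probe once and only then decide whether to issue the
big write_zeroes over the whole device.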
I may be missing something here, but doesn't imageio know if the backing
block device starts out as all zeroes?  If so couldn't it maintain a bitmap
and simply ignore zero requests sent for unwritten disk blocks?

Rich.

--
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-builder quickly builds VMs from scratch
http://libguestfs.org/virt-builder.1.html
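For what it's worth, the "bitmap of written blocks" idea above could be
sketched as below.  Everything here (written_map, CLUSTER_SIZE, the
granularity) is invented for illustration, assumes the target really does
start out all zeroes, and is not imageio's actual implementation.

#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

#define CLUSTER_SIZE (64 * 1024)   /* tracking granularity, one bit per cluster */

struct written_map {
    uint8_t *bits;
    uint64_t nb_clusters;
};

static struct written_map *map_new(uint64_t device_size)
{
    struct written_map *m = calloc(1, sizeof(*m));
    if (!m)
        return NULL;
    m->nb_clusters = (device_size + CLUSTER_SIZE - 1) / CLUSTER_SIZE;
    m->bits = calloc((m->nb_clusters + 7) / 8, 1);
    if (!m->bits) {
        free(m);
        return NULL;
    }
    return m;
}

/* Mark every cluster touched by a WRITE request. */
static void map_mark_written(struct written_map *m, uint64_t off, uint64_t len)
{
    if (len == 0)
        return;
    for (uint64_t c = off / CLUSTER_SIZE; c <= (off + len - 1) / CLUSTER_SIZE; c++)
        m->bits[c / 8] |= 1u << (c % 8);
}

/* A ZERO request needs real work only if some cluster in the range was
 * written before; otherwise the device still holds zeroes there and the
 * request can be acknowledged without touching storage. */
static bool map_needs_zeroing(struct written_map *m, uint64_t off, uint64_t len)
{
    if (len == 0)
        return false;
    for (uint64_t c = off / CLUSTER_SIZE; c <= (off + len - 1) / CLUSTER_SIZE; c++)
        if (m->bits[c / 8] & (1u << (c % 8)))
            return true;
    return false;
}

A server using this would call map_mark_written() on every WRITE and consult
map_needs_zeroing() on every ZERO request before deciding to hit the slow
path.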