Re: [PATCH v2 0/4] virtio: Improve boot time of virtio-scsi-pci and virtio-blk-pci

2021-05-17 Thread Greg Kurz
On Wed, 12 May 2021 17:05:53 +0100
Stefan Hajnoczi wrote:

> On Fri, May 07, 2021 at 06:59:01PM +0200, Greg Kurz wrote:
> > Now that virtio-scsi-pci and virtio-blk-pci map 1 virtqueue per vCPU,
> > a serious slowdown may be observed on setups with a large enough
> > number of vCPUs.
> > 
> > Example with a pseries guest on a dual-socket POWER9 system (128 HW threads):
> > 
> > vCPUs  virtio-scsi  virtio-blk
> > 
> > 1      0m20.922s    0m21.346s
> > 2      0m21.230s    0m20.350s
> > 4      0m21.761s    0m20.997s
> > 8      0m22.770s    0m20.051s
> > 16     0m22.038s    0m19.994s
> > 32     0m22.928s    0m20.803s
> > 64     0m26.583s    0m22.953s
> > 128    0m41.273s    0m32.333s
> > 256    2m4.727s     1m16.924s
> > 384    6m5.563s     3m26.186s
> > 
> > Both perf and gprof indicate that QEMU is hogging CPUs when setting up
> > the ioeventfds:
> > 
> >  67.88%  swapper         [kernel.kallsyms]  [k] power_pmu_enable
> >   9.47%  qemu-kvm        [kernel.kallsyms]  [k] smp_call_function_single
> >   8.64%  qemu-kvm        [kernel.kallsyms]  [k] power_pmu_enable
> > =>2.79%  qemu-kvm        qemu-kvm           [.] memory_region_ioeventfd_before
> > =>2.12%  qemu-kvm        qemu-kvm           [.] address_space_update_ioeventfds
> >   0.56%  kworker/8:0-mm  [kernel.kallsyms]  [k] smp_call_function_single
> > 
> > address_space_update_ioeventfds() is called when committing an MR
> > transaction, i.e. once per ioeventfd with the current code base,
> > and it internally loops over all ioeventfds:
> > 
> > static void address_space_update_ioeventfds(AddressSpace *as)
> > {
> >     [...]
> >     FOR_EACH_FLAT_RANGE(fr, view) {
> >         for (i = 0; i < fr->mr->ioeventfd_nb; ++i) {
> > 
> > This means that setting up the ioeventfds for these devices takes
> > time quadratic in the number of virtqueues.
> > 
> > This series simply changes the device models to extend the transaction
> > to cover all virtqueues, as was already done in the generic code by
> > commit 710fccf80d78 ("virtio: improve virtio devices initialization
> > time").
> > 
> > Only virtio-scsi and virtio-blk are covered here, but a similar change
> > might also be beneficial to other device types such as vhost-scsi-pci,
> > vhost-user-scsi-pci and vhost-user-blk-pci.
> > 
> > vCPUs  virtio-scsi  virtio-blk
> > 
> > 1      0m21.271s    0m22.076s
> > 2      0m20.912s    0m19.716s
> > 4      0m20.508s    0m19.310s
> > 8      0m21.374s    0m20.273s
> > 16     0m21.559s    0m21.374s
> > 32     0m22.532s    0m21.271s
> > 64     0m26.550s    0m22.007s
> > 128    0m29.115s    0m27.446s
> > 256    0m44.752s    0m41.004s
> > 384    1m2.884s     0m58.023s
> > 
> > This should fix https://bugzilla.redhat.com/show_bug.cgi?id=1927108
> > which reported the issue for virtio-scsi-pci.
> > 
> > Changes since v1:
> > - Add some comments (Stefan)
> > - Drop optimization on the error path in patch 2 (Stefan)
> > 
> > Changes since RFC:
> > 
> > As suggested by Stefan, simplify the code by directly beginning and
> > committing the memory transaction from the device model, without all
> > the virtio-specific proxying code and with no changes needed in the
> > memory subsystem.
> > 
> > Greg Kurz (4):
> >   virtio-blk: Fix rollback path in virtio_blk_data_plane_start()
> >   virtio-blk: Configure all host notifiers in a single MR transaction
> >   virtio-scsi: Set host notifiers and callbacks separately
> >   virtio-scsi: Configure all host notifiers in a single MR transaction
> > 
> >  hw/block/dataplane/virtio-blk.c | 45 -
> >  hw/scsi/virtio-scsi-dataplane.c | 72 -
> >  2 files changed, 97 insertions(+), 20 deletions(-)
> > 
> > -- 
> > 2.26.3
> > 
> 
> Thanks, applied to my block tree:
> https://gitlab.com/stefanha/qemu/commits/block
> 

Hi Stefan,

It seems that Michael already merged the previous version of this
patch set with his latest PR.

https://gitlab.com/qemu-project/qemu/-/commit/6005ee07c380cbde44292f5f6c96e7daa70f4f7d

It is thus missing the v1->v2 changes: basically, some comments
clarifying the optimization we're doing with the MR transaction,
and the removal of the optimization on an error path.

The optimization on the error path isn't actually needed, but it
doesn't hurt either. No need to change that now that the patches
are upstream.

I can post a follow-up patch to add the missing comments, though.
While at it, I'd also add these comments to the generic
virtio_device_*_ioeventfd_impl() calls, since they already use the
very same optimization.
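
For instance, something along these lines above each begin/commit
pair (rough wording, not the actual follow-up patch):

    /*
     * Batch all host notifier updates into a single MR transaction:
     * address_space_update_ioeventfds() then runs once at commit
     * time rather than once per ioeventfd, avoiding quadratic setup
     * time.
     */
    memory_region_transaction_begin();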

Anyway, I guess you can drop the patches from your tree.

Cheers,

--
Greg

> Stefan





Re: [PATCH v2 0/4] virtio: Improve boot time of virtio-scsi-pci and virtio-blk-pci

2021-05-12 Thread Stefan Hajnoczi
On Fri, May 07, 2021 at 06:59:01PM +0200, Greg Kurz wrote:
> Now that virtio-scsi-pci and virtio-blk-pci map 1 virtqueue per vCPU,
> a serious slowdown may be observed on setups with a large enough
> number of vCPUs.
> 
> Example with a pseries guest on a dual-socket POWER9 system (128 HW threads):
> 
> vCPUs  virtio-scsi  virtio-blk
> 
> 1      0m20.922s    0m21.346s
> 2      0m21.230s    0m20.350s
> 4      0m21.761s    0m20.997s
> 8      0m22.770s    0m20.051s
> 16     0m22.038s    0m19.994s
> 32     0m22.928s    0m20.803s
> 64     0m26.583s    0m22.953s
> 128    0m41.273s    0m32.333s
> 256    2m4.727s     1m16.924s
> 384    6m5.563s     3m26.186s
> 
> Both perf and gprof indicate that QEMU is hogging CPUs when setting up
> the ioeventfds:
> 
>  67.88%  swapper         [kernel.kallsyms]  [k] power_pmu_enable
>   9.47%  qemu-kvm        [kernel.kallsyms]  [k] smp_call_function_single
>   8.64%  qemu-kvm        [kernel.kallsyms]  [k] power_pmu_enable
> =>2.79%  qemu-kvm        qemu-kvm           [.] memory_region_ioeventfd_before
> =>2.12%  qemu-kvm        qemu-kvm           [.] address_space_update_ioeventfds
>   0.56%  kworker/8:0-mm  [kernel.kallsyms]  [k] smp_call_function_single
> 
> address_space_update_ioeventfds() is called when committing an MR
> transaction, i.e. once per ioeventfd with the current code base,
> and it internally loops over all ioeventfds:
> 
> static void address_space_update_ioeventfds(AddressSpace *as)
> {
>     [...]
>     FOR_EACH_FLAT_RANGE(fr, view) {
>         for (i = 0; i < fr->mr->ioeventfd_nb; ++i) {
> 
> This means that setting up the ioeventfds for these devices takes
> time quadratic in the number of virtqueues.
> 
> This series simply changes the device models to extend the transaction
> to cover all virtqueues, as was already done in the generic code by
> commit 710fccf80d78 ("virtio: improve virtio devices initialization
> time").
> 
> Only virtio-scsi and virtio-blk are covered here, but a similar change
> might also be beneficial to other device types such as vhost-scsi-pci,
> vhost-user-scsi-pci and vhost-user-blk-pci.
> 
> vCPUs  virtio-scsi  virtio-blk
> 
> 1      0m21.271s    0m22.076s
> 2      0m20.912s    0m19.716s
> 4      0m20.508s    0m19.310s
> 8      0m21.374s    0m20.273s
> 16     0m21.559s    0m21.374s
> 32     0m22.532s    0m21.271s
> 64     0m26.550s    0m22.007s
> 128    0m29.115s    0m27.446s
> 256    0m44.752s    0m41.004s
> 384    1m2.884s     0m58.023s
> 
> This should fix https://bugzilla.redhat.com/show_bug.cgi?id=1927108
> which reported the issue for virtio-scsi-pci.
> 
> Changes since v1:
> - Add some comments (Stefan)
> - Drop optimization on the error path in patch 2 (Stefan)
> 
> Changes since RFC:
> 
> As suggested by Stefan, simplify the code by directly beginning and
> committing the memory transaction from the device model, without all
> the virtio-specific proxying code and with no changes needed in the
> memory subsystem.
> 
> Greg Kurz (4):
>   virtio-blk: Fix rollback path in virtio_blk_data_plane_start()
>   virtio-blk: Configure all host notifiers in a single MR transaction
>   virtio-scsi: Set host notifiers and callbacks separately
>   virtio-scsi: Configure all host notifiers in a single MR transaction
> 
>  hw/block/dataplane/virtio-blk.c | 45 -
>  hw/scsi/virtio-scsi-dataplane.c | 72 -
>  2 files changed, 97 insertions(+), 20 deletions(-)
> 
> -- 
> 2.26.3
> 

Thanks, applied to my block tree:
https://gitlab.com/stefanha/qemu/commits/block

Stefan

