Re: [PATCH v3 2/2] vhost: Perform memory section dirty scans once per iteration

2024-03-14 Thread Eugenio Perez Martin
On Thu, Mar 14, 2024 at 7:35 PM Si-Wei Liu  wrote:
>
>
>
> On 3/14/2024 8:34 AM, Eugenio Perez Martin wrote:
> > On Thu, Mar 14, 2024 at 9:38 AM Si-Wei Liu  wrote:
> >> On setups with one or more virtio-net devices with vhost on,
> >> dirty tracking iteration increases cost the bigger the number
> >> amount of queues are set up e.g. on idle guests migration the
> >> following is observed with virtio-net with vhost=on:
> >>
> >> 48 queues -> 78.11%  [.] vhost_dev_sync_region.isra.13
> >> 8 queues -> 40.50%   [.] vhost_dev_sync_region.isra.13
> >> 1 queue -> 6.89% [.] vhost_dev_sync_region.isra.13
> >> 2 devices, 1 queue -> 18.60%  [.] vhost_dev_sync_region.isra.14
> >>
> >> With high memory rates the symptom is lack of convergence as soon
> >> as it has a vhost device with a sufficiently high number of queues,
> >> the sufficient number of vhost devices.
> >>
> >> On every migration iteration (every 100msecs) it will redundantly
> >> query the *shared log* the number of queues configured with vhost
> >> that exist in the guest. For the virtqueue data, this is necessary,
> >> but not for the memory sections which are the same. So essentially
> >> we end up scanning the dirty log too often.
> >>
> >> To fix that, select a vhost device responsible for scanning the
> >> log with regards to memory sections dirty tracking. It is selected
> >> when we enable the logger (during migration) and cleared when we
> >> disable the logger. If the vhost logger device goes away for some
> >> reason, the logger will be re-selected from the rest of vhost
> >> devices.
> >>
> >> After making mem-section logger a singleton instance, constant cost
> >> of 7%-9% (like the 1 queue report) will be seen, no matter how many
> >> queues or how many vhost devices are configured:
> >>
> >> 48 queues -> 8.71%[.] vhost_dev_sync_region.isra.13
> >> 2 devices, 8 queues -> 7.97%   [.] vhost_dev_sync_region.isra.14
> >>
> >> Co-developed-by: Joao Martins 
> >> Signed-off-by: Joao Martins 
> >> Signed-off-by: Si-Wei Liu 
> >> ---
> >> v2 -> v3:
> >>- add after-fix benchmark to commit log
> >>- rename vhost_log_dev_enabled to vhost_dev_should_log
> >>- remove unneeded comparisons for backend_type
> >>- use QLIST array instead of single flat list to store vhost
> >>  logger devices
> >>- simplify logger election logic
> >>
> >> ---
> >>   hw/virtio/vhost.c | 63 
> >> ++-
> >>   include/hw/virtio/vhost.h |  1 +
> >>   2 files changed, 58 insertions(+), 6 deletions(-)
> >>
> >> diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
> >> index efe2f74..d91858b 100644
> >> --- a/hw/virtio/vhost.c
> >> +++ b/hw/virtio/vhost.c
> >> @@ -45,6 +45,7 @@
> >>
> >>   static struct vhost_log *vhost_log[VHOST_BACKEND_TYPE_MAX];
> >>   static struct vhost_log *vhost_log_shm[VHOST_BACKEND_TYPE_MAX];
> >> +static QLIST_HEAD(, vhost_dev) vhost_log_devs[VHOST_BACKEND_TYPE_MAX];
> >>
> >>   /* Memslots used by backends that support private memslots (without an 
> >> fd). */
> >>   static unsigned int used_memslots;
> >> @@ -149,6 +150,43 @@ bool vhost_dev_has_iommu(struct vhost_dev *dev)
> >>   }
> >>   }
> >>
> >> +static inline bool vhost_dev_should_log(struct vhost_dev *dev)
> >> +{
> >> +assert(dev->vhost_ops);
> >> +assert(dev->vhost_ops->backend_type > VHOST_BACKEND_TYPE_NONE);
> >> +assert(dev->vhost_ops->backend_type < VHOST_BACKEND_TYPE_MAX);
> >> +
> >> +return dev == 
> >> QLIST_FIRST(_log_devs[dev->vhost_ops->backend_type]);
> >> +}
> >> +
> >> +static inline void vhost_dev_elect_mem_logger(struct vhost_dev *hdev, 
> >> bool add)
> >> +{
> >> +VhostBackendType backend_type;
> >> +
> >> +assert(hdev->vhost_ops);
> >> +
> >> +backend_type = hdev->vhost_ops->backend_type;
> >> +assert(backend_type > VHOST_BACKEND_TYPE_NONE);
> >> +assert(backend_type < VHOST_BACKEND_TYPE_MAX);
> >> +
> >> +if (add && !QLIST_IS_INSERTED(hdev, logdev_entry)) {
> >> +if (QLIST_EMPTY(_log_devs[backend_type])) {
> >> +QLIST_INSERT_HEAD(_log_devs[backend_type],
> >> +  hdev, logdev_entry);
> >> +} else {
> >> +/*
> >> + * The first vhost_device in the list is selected as the 
> >> shared
> >> + * logger to scan memory sections. Put new entry next to the 
> >> head
> >> + * to avoid inadvertent change to the underlying logger 
> >> device.
> >> + */
> > Why is changing the logger device a problem? All the code paths are
> > either changing the QLIST or logging, isn't it?
> Changing logger device doesn't affect functionality for sure, but may
> have inadvertent effect on cache locality, particularly it's relevant to
> the log scanning process in the hot path. The code makes sure there's no
> churn on the leading logger selection as a result of adding new vhost
> device, unless the selected logger device will be gone and a re-election
> of 

Re: [PATCH v3 2/2] vhost: Perform memory section dirty scans once per iteration

2024-03-14 Thread Si-Wei Liu




On 3/14/2024 8:34 AM, Eugenio Perez Martin wrote:

On Thu, Mar 14, 2024 at 9:38 AM Si-Wei Liu  wrote:

On setups with one or more virtio-net devices with vhost on,
dirty tracking iteration increases cost the bigger the number
amount of queues are set up e.g. on idle guests migration the
following is observed with virtio-net with vhost=on:

48 queues -> 78.11%  [.] vhost_dev_sync_region.isra.13
8 queues -> 40.50%   [.] vhost_dev_sync_region.isra.13
1 queue -> 6.89% [.] vhost_dev_sync_region.isra.13
2 devices, 1 queue -> 18.60%  [.] vhost_dev_sync_region.isra.14

With high memory rates the symptom is lack of convergence as soon
as it has a vhost device with a sufficiently high number of queues,
the sufficient number of vhost devices.

On every migration iteration (every 100msecs) it will redundantly
query the *shared log* the number of queues configured with vhost
that exist in the guest. For the virtqueue data, this is necessary,
but not for the memory sections which are the same. So essentially
we end up scanning the dirty log too often.

To fix that, select a vhost device responsible for scanning the
log with regards to memory sections dirty tracking. It is selected
when we enable the logger (during migration) and cleared when we
disable the logger. If the vhost logger device goes away for some
reason, the logger will be re-selected from the rest of vhost
devices.

After making mem-section logger a singleton instance, constant cost
of 7%-9% (like the 1 queue report) will be seen, no matter how many
queues or how many vhost devices are configured:

48 queues -> 8.71%[.] vhost_dev_sync_region.isra.13
2 devices, 8 queues -> 7.97%   [.] vhost_dev_sync_region.isra.14

Co-developed-by: Joao Martins 
Signed-off-by: Joao Martins 
Signed-off-by: Si-Wei Liu 
---
v2 -> v3:
   - add after-fix benchmark to commit log
   - rename vhost_log_dev_enabled to vhost_dev_should_log
   - remove unneeded comparisons for backend_type
   - use QLIST array instead of single flat list to store vhost
 logger devices
   - simplify logger election logic

---
  hw/virtio/vhost.c | 63 ++-
  include/hw/virtio/vhost.h |  1 +
  2 files changed, 58 insertions(+), 6 deletions(-)

diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
index efe2f74..d91858b 100644
--- a/hw/virtio/vhost.c
+++ b/hw/virtio/vhost.c
@@ -45,6 +45,7 @@

  static struct vhost_log *vhost_log[VHOST_BACKEND_TYPE_MAX];
  static struct vhost_log *vhost_log_shm[VHOST_BACKEND_TYPE_MAX];
+static QLIST_HEAD(, vhost_dev) vhost_log_devs[VHOST_BACKEND_TYPE_MAX];

  /* Memslots used by backends that support private memslots (without an fd). */
  static unsigned int used_memslots;
@@ -149,6 +150,43 @@ bool vhost_dev_has_iommu(struct vhost_dev *dev)
  }
  }

+static inline bool vhost_dev_should_log(struct vhost_dev *dev)
+{
+assert(dev->vhost_ops);
+assert(dev->vhost_ops->backend_type > VHOST_BACKEND_TYPE_NONE);
+assert(dev->vhost_ops->backend_type < VHOST_BACKEND_TYPE_MAX);
+
+return dev == QLIST_FIRST(_log_devs[dev->vhost_ops->backend_type]);
+}
+
+static inline void vhost_dev_elect_mem_logger(struct vhost_dev *hdev, bool add)
+{
+VhostBackendType backend_type;
+
+assert(hdev->vhost_ops);
+
+backend_type = hdev->vhost_ops->backend_type;
+assert(backend_type > VHOST_BACKEND_TYPE_NONE);
+assert(backend_type < VHOST_BACKEND_TYPE_MAX);
+
+if (add && !QLIST_IS_INSERTED(hdev, logdev_entry)) {
+if (QLIST_EMPTY(_log_devs[backend_type])) {
+QLIST_INSERT_HEAD(_log_devs[backend_type],
+  hdev, logdev_entry);
+} else {
+/*
+ * The first vhost_device in the list is selected as the shared
+ * logger to scan memory sections. Put new entry next to the head
+ * to avoid inadvertent change to the underlying logger device.
+ */

Why is changing the logger device a problem? All the code paths are
either changing the QLIST or logging, isn't it?
Changing logger device doesn't affect functionality for sure, but may 
have inadvertent effect on cache locality, particularly it's relevant to 
the log scanning process in the hot path. The code makes sure there's no 
churn on the leading logger selection as a result of adding new vhost 
device, unless the selected logger device will be gone and a re-election 
of another logger is needed.


-Siwei




+QLIST_INSERT_AFTER(QLIST_FIRST(_log_devs[backend_type]),
+   hdev, logdev_entry);
+}
+} else if (!add && QLIST_IS_INSERTED(hdev, logdev_entry)) {
+QLIST_REMOVE(hdev, logdev_entry);
+}
+}
+
  static int vhost_sync_dirty_bitmap(struct vhost_dev *dev,
 MemoryRegionSection *section,
 hwaddr first,
@@ -166,12 +204,14 @@ static int vhost_sync_dirty_bitmap(struct vhost_dev *dev,
  start_addr = MAX(first, 

Re: [PATCH v3 2/2] vhost: Perform memory section dirty scans once per iteration

2024-03-14 Thread Eugenio Perez Martin
On Thu, Mar 14, 2024 at 9:38 AM Si-Wei Liu  wrote:
>
> On setups with one or more virtio-net devices with vhost on,
> dirty tracking iteration increases cost the bigger the number
> amount of queues are set up e.g. on idle guests migration the
> following is observed with virtio-net with vhost=on:
>
> 48 queues -> 78.11%  [.] vhost_dev_sync_region.isra.13
> 8 queues -> 40.50%   [.] vhost_dev_sync_region.isra.13
> 1 queue -> 6.89% [.] vhost_dev_sync_region.isra.13
> 2 devices, 1 queue -> 18.60%  [.] vhost_dev_sync_region.isra.14
>
> With high memory rates the symptom is lack of convergence as soon
> as it has a vhost device with a sufficiently high number of queues,
> the sufficient number of vhost devices.
>
> On every migration iteration (every 100msecs) it will redundantly
> query the *shared log* the number of queues configured with vhost
> that exist in the guest. For the virtqueue data, this is necessary,
> but not for the memory sections which are the same. So essentially
> we end up scanning the dirty log too often.
>
> To fix that, select a vhost device responsible for scanning the
> log with regards to memory sections dirty tracking. It is selected
> when we enable the logger (during migration) and cleared when we
> disable the logger. If the vhost logger device goes away for some
> reason, the logger will be re-selected from the rest of vhost
> devices.
>
> After making mem-section logger a singleton instance, constant cost
> of 7%-9% (like the 1 queue report) will be seen, no matter how many
> queues or how many vhost devices are configured:
>
> 48 queues -> 8.71%[.] vhost_dev_sync_region.isra.13
> 2 devices, 8 queues -> 7.97%   [.] vhost_dev_sync_region.isra.14
>
> Co-developed-by: Joao Martins 
> Signed-off-by: Joao Martins 
> Signed-off-by: Si-Wei Liu 
> ---
> v2 -> v3:
>   - add after-fix benchmark to commit log
>   - rename vhost_log_dev_enabled to vhost_dev_should_log
>   - remove unneeded comparisons for backend_type
>   - use QLIST array instead of single flat list to store vhost
> logger devices
>   - simplify logger election logic
>
> ---
>  hw/virtio/vhost.c | 63 
> ++-
>  include/hw/virtio/vhost.h |  1 +
>  2 files changed, 58 insertions(+), 6 deletions(-)
>
> diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
> index efe2f74..d91858b 100644
> --- a/hw/virtio/vhost.c
> +++ b/hw/virtio/vhost.c
> @@ -45,6 +45,7 @@
>
>  static struct vhost_log *vhost_log[VHOST_BACKEND_TYPE_MAX];
>  static struct vhost_log *vhost_log_shm[VHOST_BACKEND_TYPE_MAX];
> +static QLIST_HEAD(, vhost_dev) vhost_log_devs[VHOST_BACKEND_TYPE_MAX];
>
>  /* Memslots used by backends that support private memslots (without an fd). 
> */
>  static unsigned int used_memslots;
> @@ -149,6 +150,43 @@ bool vhost_dev_has_iommu(struct vhost_dev *dev)
>  }
>  }
>
> +static inline bool vhost_dev_should_log(struct vhost_dev *dev)
> +{
> +assert(dev->vhost_ops);
> +assert(dev->vhost_ops->backend_type > VHOST_BACKEND_TYPE_NONE);
> +assert(dev->vhost_ops->backend_type < VHOST_BACKEND_TYPE_MAX);
> +
> +return dev == QLIST_FIRST(_log_devs[dev->vhost_ops->backend_type]);
> +}
> +
> +static inline void vhost_dev_elect_mem_logger(struct vhost_dev *hdev, bool 
> add)
> +{
> +VhostBackendType backend_type;
> +
> +assert(hdev->vhost_ops);
> +
> +backend_type = hdev->vhost_ops->backend_type;
> +assert(backend_type > VHOST_BACKEND_TYPE_NONE);
> +assert(backend_type < VHOST_BACKEND_TYPE_MAX);
> +
> +if (add && !QLIST_IS_INSERTED(hdev, logdev_entry)) {
> +if (QLIST_EMPTY(_log_devs[backend_type])) {
> +QLIST_INSERT_HEAD(_log_devs[backend_type],
> +  hdev, logdev_entry);
> +} else {
> +/*
> + * The first vhost_device in the list is selected as the shared
> + * logger to scan memory sections. Put new entry next to the head
> + * to avoid inadvertent change to the underlying logger device.
> + */

Why is changing the logger device a problem? All the code paths are
either changing the QLIST or logging, isn't it?

> +QLIST_INSERT_AFTER(QLIST_FIRST(_log_devs[backend_type]),
> +   hdev, logdev_entry);
> +}
> +} else if (!add && QLIST_IS_INSERTED(hdev, logdev_entry)) {
> +QLIST_REMOVE(hdev, logdev_entry);
> +}
> +}
> +
>  static int vhost_sync_dirty_bitmap(struct vhost_dev *dev,
> MemoryRegionSection *section,
> hwaddr first,
> @@ -166,12 +204,14 @@ static int vhost_sync_dirty_bitmap(struct vhost_dev 
> *dev,
>  start_addr = MAX(first, start_addr);
>  end_addr = MIN(last, end_addr);
>
> -for (i = 0; i < dev->mem->nregions; ++i) {
> -struct vhost_memory_region *reg = dev->mem->regions + i;
> -vhost_dev_sync_region(dev, section, start_addr, end_addr,
> -  

[PATCH v3 2/2] vhost: Perform memory section dirty scans once per iteration

2024-03-14 Thread Si-Wei Liu
On setups with one or more virtio-net devices with vhost on,
dirty tracking iteration increases cost the bigger the number
amount of queues are set up e.g. on idle guests migration the
following is observed with virtio-net with vhost=on:

48 queues -> 78.11%  [.] vhost_dev_sync_region.isra.13
8 queues -> 40.50%   [.] vhost_dev_sync_region.isra.13
1 queue -> 6.89% [.] vhost_dev_sync_region.isra.13
2 devices, 1 queue -> 18.60%  [.] vhost_dev_sync_region.isra.14

With high memory rates the symptom is lack of convergence as soon
as it has a vhost device with a sufficiently high number of queues,
the sufficient number of vhost devices.

On every migration iteration (every 100msecs) it will redundantly
query the *shared log* the number of queues configured with vhost
that exist in the guest. For the virtqueue data, this is necessary,
but not for the memory sections which are the same. So essentially
we end up scanning the dirty log too often.

To fix that, select a vhost device responsible for scanning the
log with regards to memory sections dirty tracking. It is selected
when we enable the logger (during migration) and cleared when we
disable the logger. If the vhost logger device goes away for some
reason, the logger will be re-selected from the rest of vhost
devices.

After making mem-section logger a singleton instance, constant cost
of 7%-9% (like the 1 queue report) will be seen, no matter how many
queues or how many vhost devices are configured:

48 queues -> 8.71%[.] vhost_dev_sync_region.isra.13
2 devices, 8 queues -> 7.97%   [.] vhost_dev_sync_region.isra.14

Co-developed-by: Joao Martins 
Signed-off-by: Joao Martins 
Signed-off-by: Si-Wei Liu 
---
v2 -> v3:
  - add after-fix benchmark to commit log
  - rename vhost_log_dev_enabled to vhost_dev_should_log
  - remove unneeded comparisons for backend_type
  - use QLIST array instead of single flat list to store vhost
logger devices
  - simplify logger election logic

---
 hw/virtio/vhost.c | 63 ++-
 include/hw/virtio/vhost.h |  1 +
 2 files changed, 58 insertions(+), 6 deletions(-)

diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
index efe2f74..d91858b 100644
--- a/hw/virtio/vhost.c
+++ b/hw/virtio/vhost.c
@@ -45,6 +45,7 @@
 
 static struct vhost_log *vhost_log[VHOST_BACKEND_TYPE_MAX];
 static struct vhost_log *vhost_log_shm[VHOST_BACKEND_TYPE_MAX];
+static QLIST_HEAD(, vhost_dev) vhost_log_devs[VHOST_BACKEND_TYPE_MAX];
 
 /* Memslots used by backends that support private memslots (without an fd). */
 static unsigned int used_memslots;
@@ -149,6 +150,43 @@ bool vhost_dev_has_iommu(struct vhost_dev *dev)
 }
 }
 
+static inline bool vhost_dev_should_log(struct vhost_dev *dev)
+{
+assert(dev->vhost_ops);
+assert(dev->vhost_ops->backend_type > VHOST_BACKEND_TYPE_NONE);
+assert(dev->vhost_ops->backend_type < VHOST_BACKEND_TYPE_MAX);
+
+return dev == QLIST_FIRST(_log_devs[dev->vhost_ops->backend_type]);
+}
+
+static inline void vhost_dev_elect_mem_logger(struct vhost_dev *hdev, bool add)
+{
+VhostBackendType backend_type;
+
+assert(hdev->vhost_ops);
+
+backend_type = hdev->vhost_ops->backend_type;
+assert(backend_type > VHOST_BACKEND_TYPE_NONE);
+assert(backend_type < VHOST_BACKEND_TYPE_MAX);
+
+if (add && !QLIST_IS_INSERTED(hdev, logdev_entry)) {
+if (QLIST_EMPTY(_log_devs[backend_type])) {
+QLIST_INSERT_HEAD(_log_devs[backend_type],
+  hdev, logdev_entry);
+} else {
+/*
+ * The first vhost_device in the list is selected as the shared
+ * logger to scan memory sections. Put new entry next to the head
+ * to avoid inadvertent change to the underlying logger device.
+ */
+QLIST_INSERT_AFTER(QLIST_FIRST(_log_devs[backend_type]),
+   hdev, logdev_entry);
+}
+} else if (!add && QLIST_IS_INSERTED(hdev, logdev_entry)) {
+QLIST_REMOVE(hdev, logdev_entry);
+}
+}
+
 static int vhost_sync_dirty_bitmap(struct vhost_dev *dev,
MemoryRegionSection *section,
hwaddr first,
@@ -166,12 +204,14 @@ static int vhost_sync_dirty_bitmap(struct vhost_dev *dev,
 start_addr = MAX(first, start_addr);
 end_addr = MIN(last, end_addr);
 
-for (i = 0; i < dev->mem->nregions; ++i) {
-struct vhost_memory_region *reg = dev->mem->regions + i;
-vhost_dev_sync_region(dev, section, start_addr, end_addr,
-  reg->guest_phys_addr,
-  range_get_last(reg->guest_phys_addr,
- reg->memory_size));
+if (vhost_dev_should_log(dev)) {
+for (i = 0; i < dev->mem->nregions; ++i) {
+struct vhost_memory_region *reg = dev->mem->regions + i;
+vhost_dev_sync_region(dev, section, start_addr, end_addr,