Re: [PATCH] coroutine: resize pool periodically instead of limiting size

2021-09-13 Thread Stefan Hajnoczi
On Thu, Sep 09, 2021 at 05:08:08PM +0100, Daniel P. Berrangé wrote:
> On Thu, Sep 09, 2021 at 04:51:44PM +0100, Stefan Hajnoczi wrote:
> > On Thu, Sep 09, 2021 at 09:37:20AM +0100, Daniel P. Berrangé wrote:
> > > On Wed, Sep 01, 2021 at 05:09:23PM +0100, Stefan Hajnoczi wrote:
> > > > It was reported that enabling SafeStack reduces IOPS significantly
> > > > (>25%) with the following fio benchmark on virtio-blk using an NVMe host
> > > > block device:
> > > > 
> > > >   # fio --rw=randrw --bs=4k --iodepth=64 --runtime=1m --direct=1 \
> > > >       --filename=/dev/vdb --name=job1 --ioengine=libaio --thread \
> > > >       --group_reporting --numjobs=16 --time_based \
> > > >       --output=/tmp/fio_result
> > > > 
> > > > Serge Guelton and I found that SafeStack is not really at fault, it just
> > > > increases the cost of coroutine creation. This fio workload exhausts the
> > > > coroutine pool and coroutine creation becomes a bottleneck. Previous
> > > > work by Honghao Wang also pointed to excessive coroutine creation.
> > > > 
> > > > Creating new coroutines is expensive due to allocating new stacks with
> > > > mmap(2) and mprotect(2). Currently there are thread-local and global
> > > > pools that recycle old Coroutine objects and their stacks but the
> > > > hardcoded size limit of 64 for thread-local pools and 128 for the global
> > > > pool is insufficient for the fio benchmark shown above.
> > > 
> > > Rather than keeping around a thread local pool of coroutine
> > > instances, did you ever consider keeping around a pool of
> > > allocated stacks? Essentially it seems like you're saying
> > > the stack allocation is the problem due to it using mmap()
> > > instead of malloc() and thus not benefiting from any of the
> > > performance tricks malloc() impls use to avoid repeated
> > > syscalls on every allocation.  If 'qemu_alloc_stack' and
> > > qemu_free_stack could be made more intelligent by caching
> > > stacks, then perhaps the coroutine side can be left "dumb" ?
> > 
> > What is the advantage of doing that? Then the Coroutine struct needs to
> > be malloced each time. Coroutines are the only users of
> > qemu_alloc_stack(), so I think pooling the Coroutines is optimal.
> 
> I mostly thought it might lend itself to a cleaner implementation if the
> pooling logic is separate from the main coroutine logic. It could be
> easier to experiment with different allocation strategies if the code
> related to pooling is well isolated.

There is an advantage to pooling the Coroutine struct in addition to the
stack, so I'm not sure if it's worth reducing the scope of the pool.

Stefan




Re: [PATCH] coroutine: resize pool periodically instead of limiting size

2021-09-09 Thread Daniel P. Berrangé
On Thu, Sep 09, 2021 at 04:51:44PM +0100, Stefan Hajnoczi wrote:
> On Thu, Sep 09, 2021 at 09:37:20AM +0100, Daniel P. Berrangé wrote:
> > On Wed, Sep 01, 2021 at 05:09:23PM +0100, Stefan Hajnoczi wrote:
> > > It was reported that enabling SafeStack reduces IOPS significantly
> > > (>25%) with the following fio benchmark on virtio-blk using an NVMe host
> > > block device:
> > > 
> > >   # fio --rw=randrw --bs=4k --iodepth=64 --runtime=1m --direct=1 \
> > >       --filename=/dev/vdb --name=job1 --ioengine=libaio --thread \
> > >       --group_reporting --numjobs=16 --time_based \
> > >       --output=/tmp/fio_result
> > > 
> > > Serge Guelton and I found that SafeStack is not really at fault, it just
> > > increases the cost of coroutine creation. This fio workload exhausts the
> > > coroutine pool and coroutine creation becomes a bottleneck. Previous
> > > work by Honghao Wang also pointed to excessive coroutine creation.
> > > 
> > > Creating new coroutines is expensive due to allocating new stacks with
> > > mmap(2) and mprotect(2). Currently there are thread-local and global
> > > pools that recycle old Coroutine objects and their stacks but the
> > > hardcoded size limit of 64 for thread-local pools and 128 for the global
> > > pool is insufficient for the fio benchmark shown above.
> > 
> > Rather than keeping around a thread local pool of coroutine
> > instances, did you ever consider keeping around a pool of
> > allocated stacks? Essentially it seems like you're saying
> > the stack allocation is the problem due to it using mmap()
> > instead of malloc() and thus not benefiting from any of the
> > performance tricks malloc() impls use to avoid repeated
> > syscalls on every allocation.  If 'qemu_alloc_stack' and
> > qemu_free_stack could be made more intelligent by caching
> > stacks, then perhaps the coroutine side can be left "dumb" ?
> 
> What is the advantage of doing that? Then the Coroutine struct needs to
> be malloced each time. Coroutines are the only users of
> qemu_alloc_stack(), so I think pooling the Coroutines is optimal.

I mostly thought it might lend itself to a cleaner implementation if the
pooling logic is separate from the main coroutine logic. It could be
easier to experiment with different allocation strategies if the code
related to pooling is well isolated.

Regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|




Re: [PATCH] coroutine: resize pool periodically instead of limiting size

2021-09-09 Thread Stefan Hajnoczi
On Wed, Sep 08, 2021 at 10:30:09PM -0400, Daniele Buono wrote:
> Stefan, the patch looks great.
> Thank you for debugging the performance issue that was happening with
> SafeStack.
> 
> On 9/2/2021 4:10 AM, Stefan Hajnoczi wrote:
> > On Wed, Sep 01, 2021 at 05:09:23PM +0100, Stefan Hajnoczi wrote:
> > > It was reported that enabling SafeStack reduces IOPS significantly
> > > (>25%) with the following fio benchmark on virtio-blk using an NVMe host
> > > block device:
> > > 
> > >   # fio --rw=randrw --bs=4k --iodepth=64 --runtime=1m --direct=1 \
> > >       --filename=/dev/vdb --name=job1 --ioengine=libaio --thread \
> > >       --group_reporting --numjobs=16 --time_based \
> > >       --output=/tmp/fio_result
> > > 
> > > Serge Guelton and I found that SafeStack is not really at fault, it just
> > > increases the cost of coroutine creation. This fio workload exhausts the
> > > coroutine pool and coroutine creation becomes a bottleneck. Previous
> > > work by Honghao Wang also pointed to excessive coroutine creation.
> > > 
> > > Creating new coroutines is expensive due to allocating new stacks with
> > > mmap(2) and mprotect(2). Currently there are thread-local and global
> > > pools that recycle old Coroutine objects and their stacks but the
> > > hardcoded size limit of 64 for thread-local pools and 128 for the global
> > > pool is insufficient for the fio benchmark shown above.
> > > 
> > > This patch changes the coroutine pool algorithm to a simple thread-local
> > > pool without a size limit. Threads periodically shrink the pool down to
> > > a size sufficient for the maximum observed number of coroutines.
> > > 
> > > This is a very simple algorithm. Fancier things could be done like
> > > keeping a minimum number of coroutines around to avoid latency when a
> > > new coroutine is created after a long period of inactivity. Another
> > > thought is to stop the timer when the pool size is zero for power saving
> > > on threads that aren't using coroutines. However, I'd rather not add
> > > bells and whistles unless they are really necessary.
> 
> I agree we should aim at something that is as simple as possible.
> 
> However keeping a minimum number of coroutines could be useful to avoid
> performance regressions from the previous version.
> 
> I wouldn't say restore the global pool - that's way too much trouble -
> but keeping the old "pool size" configuration parameter as the minimum
> pool size could be a good idea to maintain the previous performance in
> cases where the dynamic pool shrinks too much.

Good point. We're taking a risk by freeing all coroutines when the timer
expires. It's safer to stick to the old pool size limit to avoid
regressions.

I'll send another revision.

Stefan




Re: [PATCH] coroutine: resize pool periodically instead of limiting size

2021-09-09 Thread Stefan Hajnoczi
On Thu, Sep 09, 2021 at 09:37:20AM +0100, Daniel P. Berrangé wrote:
> On Wed, Sep 01, 2021 at 05:09:23PM +0100, Stefan Hajnoczi wrote:
> > It was reported that enabling SafeStack reduces IOPS significantly
> > (>25%) with the following fio benchmark on virtio-blk using an NVMe host
> > block device:
> > 
> >   # fio --rw=randrw --bs=4k --iodepth=64 --runtime=1m --direct=1 \
> >       --filename=/dev/vdb --name=job1 --ioengine=libaio --thread \
> >       --group_reporting --numjobs=16 --time_based \
> >       --output=/tmp/fio_result
> > 
> > Serge Guelton and I found that SafeStack is not really at fault, it just
> > increases the cost of coroutine creation. This fio workload exhausts the
> > coroutine pool and coroutine creation becomes a bottleneck. Previous
> > work by Honghao Wang also pointed to excessive coroutine creation.
> > 
> > Creating new coroutines is expensive due to allocating new stacks with
> > mmap(2) and mprotect(2). Currently there are thread-local and global
> > pools that recycle old Coroutine objects and their stacks but the
> > hardcoded size limit of 64 for thread-local pools and 128 for the global
> > pool is insufficient for the fio benchmark shown above.
> 
> Rather than keeping around a thread local pool of coroutine
> instances, did you ever consider keeping around a pool of
> allocated stacks? Essentially it seems like you're saying
> the stack allocation is the problem due to it using mmap()
> instead of malloc() and thus not benefiting from any of the
> performance tricks malloc() impls use to avoid repeated
> syscalls on every allocation.  If 'qemu_alloc_stack' and
> qemu_free_stack could be made more intelligent by caching
> stacks, then perhaps the coroutine side can be left "dumb" ?

What is the advantage of doing that? Then the Coroutine struct needs to
be malloced each time. Coroutines are the only users of
qemu_alloc_stack(), so I think pooling the Coroutines is optimal.

> > 
> > This patch changes the coroutine pool algorithm to a simple thread-local
> > pool without a size limit. Threads periodically shrink the pool down to
> > a size sufficient for the maximum observed number of coroutines.
> > 
> > This is a very simple algorithm. Fancier things could be done like
> > keeping a minimum number of coroutines around to avoid latency when a
> > new coroutine is created after a long period of inactivity. Another
> > thought is to stop the timer when the pool size is zero for power saving
> > on threads that aren't using coroutines. However, I'd rather not add
> > bells and whistles unless they are really necessary.
> > 
> > The global pool is removed by this patch. It can help to hide the fact
> > that local pools are easily exhausted, but it doesn't fix the root
> > cause. I don't think there is a need for a global pool because QEMU's
> > threads are long-lived, so let's keep things simple.
> 
> QEMU's threads may be long-lived, but are the workloads they service
> expected to be consistent over time?
> 
> e.g. can we ever get a situation where Thread A has a peak of activity
> causing caching of many coroutines, and then goes idle for a long time,
> while Thread B then has a peak and is unable to take advantage of the
> cache from Thread A that is now unused?

It's possible to trigger that case, but it's probably not a real-world
scenario. It requires a storage controller that is emulated in the vCPU
thread (like AHCI) and a workload that alternates between vCPUs.
However, that configuration isn't expected to perform well anyway
(virtio-blk and virtio-scsi use IOThreads instead of running emulation
in vCPU threads because this reduces the cost of the vmexit).

Stefan




Re: [PATCH] coroutine: resize pool periodically instead of limiting size

2021-09-09 Thread Daniel P. Berrangé
On Wed, Sep 01, 2021 at 05:09:23PM +0100, Stefan Hajnoczi wrote:
> It was reported that enabling SafeStack reduces IOPS significantly
> (>25%) with the following fio benchmark on virtio-blk using an NVMe host
> block device:
> 
>   # fio --rw=randrw --bs=4k --iodepth=64 --runtime=1m --direct=1 \
>       --filename=/dev/vdb --name=job1 --ioengine=libaio --thread \
>       --group_reporting --numjobs=16 --time_based \
>       --output=/tmp/fio_result
> 
> Serge Guelton and I found that SafeStack is not really at fault, it just
> increases the cost of coroutine creation. This fio workload exhausts the
> coroutine pool and coroutine creation becomes a bottleneck. Previous
> work by Honghao Wang also pointed to excessive coroutine creation.
> 
> Creating new coroutines is expensive due to allocating new stacks with
> mmap(2) and mprotect(2). Currently there are thread-local and global
> pools that recycle old Coroutine objects and their stacks but the
> hardcoded size limit of 64 for thread-local pools and 128 for the global
> pool is insufficient for the fio benchmark shown above.

Rather than keeping around a thread local pool of coroutine
instances, did you ever consider keeping around a pool of
allocated stacks? Essentially it seems like you're saying
the stack allocation is the problem due to it using mmap()
instead of malloc() and thus not benefiting from any of the
performance tricks malloc() impls use to avoid repeated
syscalls on every allocation.  If 'qemu_alloc_stack' and
qemu_free_stack could be made more intelligent by caching
stacks, then perhaps the coroutine side can be left "dumb" ?

> 
> This patch changes the coroutine pool algorithm to a simple thread-local
> pool without a size limit. Threads periodically shrink the pool down to
> a size sufficient for the maximum observed number of coroutines.
> 
> This is a very simple algorithm. Fancier things could be done like
> keeping a minimum number of coroutines around to avoid latency when a
> new coroutine is created after a long period of inactivity. Another
> thought is to stop the timer when the pool size is zero for power saving
> on threads that aren't using coroutines. However, I'd rather not add
> bells and whistles unless they are really necessary.
> 
> The global pool is removed by this patch. It can help to hide the fact
> that local pools are easily exhausted, but it doesn't fix the root
> cause. I don't think there is a need for a global pool because QEMU's
> threads are long-lived, so let's keep things simple.

QEMU's threads may be long-lived, but are the workloads they service
expected to be consistent over time?

e.g. can we ever get a situation where Thread A has a peak of activity
causing caching of many coroutines, and then goes idle for a long time,
while Thread B then has a peak and is unable to take advantage of the
cache from Thread A that is now unused?


Regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|




Re: [PATCH] coroutine: resize pool periodically instead of limiting size

2021-09-08 Thread Daniele Buono

Stefan, the patch looks great.
Thank you for debugging the performance issue that was happening with
SafeStack.

On 9/2/2021 4:10 AM, Stefan Hajnoczi wrote:

On Wed, Sep 01, 2021 at 05:09:23PM +0100, Stefan Hajnoczi wrote:

It was reported that enabling SafeStack reduces IOPS significantly
(>25%) with the following fio benchmark on virtio-blk using an NVMe host
block device:

   # fio --rw=randrw --bs=4k --iodepth=64 --runtime=1m --direct=1 \
--filename=/dev/vdb --name=job1 --ioengine=libaio --thread \
--group_reporting --numjobs=16 --time_based \
 --output=/tmp/fio_result

Serge Guelton and I found that SafeStack is not really at fault, it just
increases the cost of coroutine creation. This fio workload exhausts the
coroutine pool and coroutine creation becomes a bottleneck. Previous
work by Honghao Wang also pointed to excessive coroutine creation.

Creating new coroutines is expensive due to allocating new stacks with
mmap(2) and mprotect(2). Currently there are thread-local and global
pools that recycle old Coroutine objects and their stacks but the
hardcoded size limit of 64 for thread-local pools and 128 for the global
pool is insufficient for the fio benchmark shown above.

This patch changes the coroutine pool algorithm to a simple thread-local
pool without a size limit. Threads periodically shrink the pool down to
a size sufficient for the maximum observed number of coroutines.

This is a very simple algorithm. Fancier things could be done like
keeping a minimum number of coroutines around to avoid latency when a
new coroutine is created after a long period of inactivity. Another
thought is to stop the timer when the pool size is zero for power saving
on threads that aren't using coroutines. However, I'd rather not add
bells and whistles unless they are really necessary.


I agree we should aim at something that is as simple as possible.

However keeping a minimum number of coroutines could be useful to avoid
performance regressions from the previous version.

I wouldn't say restore the global pool - that's way too much trouble -
but keeping the old "pool size" configuration parameter as the minimum
pool size could be a good idea to maintain the previous performance in
cases where the dynamic pool shrinks too much.



The global pool is removed by this patch. It can help to hide the fact
that local pools are easily exhausted, but it doesn't fix the root
cause. I don't think there is a need for a global pool because QEMU's
threads are long-lived, so let's keep things simple.

Performance of the above fio benchmark is as follows:

          Before   After
   IOPS   60k      97k

Memory usage varies over time as needed by the workload:

              VSZ (KB)             RSS (KB)
 Before fio   4705248              843128
 During fio   5747668 (+ ~100 MB)  849280
 After fio    4694996 (- ~100 MB)  845184

This confirms that coroutines are indeed being freed when no longer
needed.

Thanks to Serge Guelton for working on identifying the bottleneck with
me!

Reported-by: Tingting Mao 
Cc: Serge Guelton 
Cc: Honghao Wang 
Cc: Paolo Bonzini 
Cc: Daniele Buono 
Signed-off-by: Stefan Hajnoczi 
---
  include/qemu/coroutine-pool-timer.h | 36 +
  include/qemu/coroutine.h            |  7 
  iothread.c                          |  6 +++
  util/coroutine-pool-timer.c         | 35 
  util/main-loop.c                    |  5 +++
  util/qemu-coroutine.c               | 62 ++---
  util/meson.build                    |  1 +
  7 files changed, 119 insertions(+), 33 deletions(-)
  create mode 100644 include/qemu/coroutine-pool-timer.h
  create mode 100644 util/coroutine-pool-timer.c


Adding Andrew and Jenifer in case they have thoughts on improving QEMU's
coroutine pool algorithm.



diff --git a/include/qemu/coroutine-pool-timer.h b/include/qemu/coroutine-pool-timer.h
new file mode 100644
index 00..c0b520ce99
--- /dev/null
+++ b/include/qemu/coroutine-pool-timer.h
@@ -0,0 +1,36 @@
+/*
+ * QEMU coroutine pool timer
+ *
+ * Copyright (c) 2021 Red Hat, Inc.
+ *
+ * SPDX-License-Identifier: LGPL-2.1-or-later
+ *
+ * This work is licensed under the terms of the GNU LGPL, version 2.1 or later.
+ * See the COPYING.LIB file in the top-level directory.
+ *
+ */
+#ifndef COROUTINE_POOL_TIMER_H
+#define COROUTINE_POOL_TIMER_H
+
+#include "qemu/osdep.h"
+#include "block/aio.h"
+
+/**
+ * A timer that periodically resizes this thread's coroutine pool, freeing
+ * memory if there are too many unused coroutines.
+ *
+ * Threads that make heavy use of coroutines should use this. Failure to resize
+ * the coroutine pool can lead to large amounts of memory sitting idle and
+ * never being used after the first time.
+ */
+typedef struct {
+QEMUTimer *timer;
+} CoroutinePoolTimer;
+
+/* Call this before the thread runs the AioContext */
+void coroutine_pool_timer_init(CoroutinePoolTimer *pt, AioContext *ctx);
+
+/* Call this before the AioContext from the init function is destroyed */
+void coroutine_pool_timer_cleanup(CoroutinePoolTimer *pt);
+
+#endif /* COROUTINE_POOL_TIMER_H */

Re: [PATCH] coroutine: resize pool periodically instead of limiting size

2021-09-02 Thread Philippe Mathieu-Daudé
On 9/1/21 6:09 PM, Stefan Hajnoczi wrote:
> It was reported that enabling SafeStack reduces IOPS significantly
> (>25%) with the following fio benchmark on virtio-blk using an NVMe host
> block device:
> 
>   # fio --rw=randrw --bs=4k --iodepth=64 --runtime=1m --direct=1 \
>       --filename=/dev/vdb --name=job1 --ioengine=libaio --thread \
>       --group_reporting --numjobs=16 --time_based \
>       --output=/tmp/fio_result
> 
> Serge Guelton and I found that SafeStack is not really at fault, it just
> increases the cost of coroutine creation. This fio workload exhausts the
> coroutine pool and coroutine creation becomes a bottleneck. Previous
> work by Honghao Wang also pointed to excessive coroutine creation.
> 
> Creating new coroutines is expensive due to allocating new stacks with
> mmap(2) and mprotect(2). Currently there are thread-local and global
> pools that recycle old Coroutine objects and their stacks but the
> hardcoded size limit of 64 for thread-local pools and 128 for the global
> pool is insufficient for the fio benchmark shown above.
> 
> This patch changes the coroutine pool algorithm to a simple thread-local
> pool without a size limit. Threads periodically shrink the pool down to
> a size sufficient for the maximum observed number of coroutines.
> 
> This is a very simple algorithm. Fancier things could be done like
> keeping a minimum number of coroutines around to avoid latency when a
> new coroutine is created after a long period of inactivity. Another
> thought is to stop the timer when the pool size is zero for power saving
> on threads that aren't using coroutines. However, I'd rather not add
> bells and whistles unless they are really necessary.
> 
> The global pool is removed by this patch. It can help to hide the fact
> that local pools are easily exhausted, but it doesn't fix the root
> cause. I don't think there is a need for a global pool because QEMU's
> threads are long-lived, so let's keep things simple.
> 
> Performance of the above fio benchmark is as follows:
> 
>           Before   After
>    IOPS   60k      97k
> 
> Memory usage varies over time as needed by the workload:
> 
>               VSZ (KB)             RSS (KB)
>  Before fio   4705248              843128
>  During fio   5747668 (+ ~100 MB)  849280
>  After fio    4694996 (- ~100 MB)  845184
> 
> This confirms that coroutines are indeed being freed when no longer
> needed.
> 
> Thanks to Serge Guelton for working on identifying the bottleneck with
> me!
> 
> Reported-by: Tingting Mao 
> Cc: Serge Guelton 
> Cc: Honghao Wang 
> Cc: Paolo Bonzini 
> Cc: Daniele Buono 
> Signed-off-by: Stefan Hajnoczi 
> ---
>  include/qemu/coroutine-pool-timer.h | 36 +
>  include/qemu/coroutine.h            |  7 
>  iothread.c                          |  6 +++
>  util/coroutine-pool-timer.c         | 35 
>  util/main-loop.c                    |  5 +++
>  util/qemu-coroutine.c               | 62 ++---
>  util/meson.build                    |  1 +
>  7 files changed, 119 insertions(+), 33 deletions(-)
>  create mode 100644 include/qemu/coroutine-pool-timer.h
>  create mode 100644 util/coroutine-pool-timer.c

Reviewed-by: Philippe Mathieu-Daudé 




Re: [PATCH] coroutine: resize pool periodically instead of limiting size

2021-09-02 Thread Stefan Hajnoczi
On Wed, Sep 01, 2021 at 05:09:23PM +0100, Stefan Hajnoczi wrote:
> It was reported that enabling SafeStack reduces IOPS significantly
> (>25%) with the following fio benchmark on virtio-blk using an NVMe host
> block device:
> 
>   # fio --rw=randrw --bs=4k --iodepth=64 --runtime=1m --direct=1 \
>       --filename=/dev/vdb --name=job1 --ioengine=libaio --thread \
>       --group_reporting --numjobs=16 --time_based \
>       --output=/tmp/fio_result
> 
> Serge Guelton and I found that SafeStack is not really at fault, it just
> increases the cost of coroutine creation. This fio workload exhausts the
> coroutine pool and coroutine creation becomes a bottleneck. Previous
> work by Honghao Wang also pointed to excessive coroutine creation.
> 
> Creating new coroutines is expensive due to allocating new stacks with
> mmap(2) and mprotect(2). Currently there are thread-local and global
> pools that recycle old Coroutine objects and their stacks but the
> hardcoded size limit of 64 for thread-local pools and 128 for the global
> pool is insufficient for the fio benchmark shown above.
> 
> This patch changes the coroutine pool algorithm to a simple thread-local
> pool without a size limit. Threads periodically shrink the pool down to
> a size sufficient for the maximum observed number of coroutines.
> 
> This is a very simple algorithm. Fancier things could be done like
> keeping a minimum number of coroutines around to avoid latency when a
> new coroutine is created after a long period of inactivity. Another
> thought is to stop the timer when the pool size is zero for power saving
> on threads that aren't using coroutines. However, I'd rather not add
> bells and whistles unless they are really necessary.
> 
> The global pool is removed by this patch. It can help to hide the fact
> that local pools are easily exhausted, but it doesn't fix the root
> cause. I don't think there is a need for a global pool because QEMU's
> threads are long-lived, so let's keep things simple.
> 
> Performance of the above fio benchmark is as follows:
> 
>           Before   After
>    IOPS   60k      97k
> 
> Memory usage varies over time as needed by the workload:
> 
>               VSZ (KB)             RSS (KB)
>  Before fio   4705248              843128
>  During fio   5747668 (+ ~100 MB)  849280
>  After fio    4694996 (- ~100 MB)  845184
> 
> This confirms that coroutines are indeed being freed when no longer
> needed.
> 
> Thanks to Serge Guelton for working on identifying the bottleneck with
> me!
> 
> Reported-by: Tingting Mao 
> Cc: Serge Guelton 
> Cc: Honghao Wang 
> Cc: Paolo Bonzini 
> Cc: Daniele Buono 
> Signed-off-by: Stefan Hajnoczi 
> ---
>  include/qemu/coroutine-pool-timer.h | 36 +
>  include/qemu/coroutine.h            |  7 
>  iothread.c                          |  6 +++
>  util/coroutine-pool-timer.c         | 35 
>  util/main-loop.c                    |  5 +++
>  util/qemu-coroutine.c               | 62 ++---
>  util/meson.build                    |  1 +
>  7 files changed, 119 insertions(+), 33 deletions(-)
>  create mode 100644 include/qemu/coroutine-pool-timer.h
>  create mode 100644 util/coroutine-pool-timer.c

Adding Andrew and Jenifer in case they have thoughts on improving QEMU's
coroutine pool algorithm.

> 
> diff --git a/include/qemu/coroutine-pool-timer.h b/include/qemu/coroutine-pool-timer.h
> new file mode 100644
> index 00..c0b520ce99
> --- /dev/null
> +++ b/include/qemu/coroutine-pool-timer.h
> @@ -0,0 +1,36 @@
> +/*
> + * QEMU coroutine pool timer
> + *
> + * Copyright (c) 2021 Red Hat, Inc.
> + *
> + * SPDX-License-Identifier: LGPL-2.1-or-later
> + *
> + * This work is licensed under the terms of the GNU LGPL, version 2.1 or later.
> + * See the COPYING.LIB file in the top-level directory.
> + *
> + */
> +#ifndef COROUTINE_POOL_TIMER_H
> +#define COROUTINE_POOL_TIMER_H
> +
> +#include "qemu/osdep.h"
> +#include "block/aio.h"
> +
> +/**
> + * A timer that periodically resizes this thread's coroutine pool, freeing
> + * memory if there are too many unused coroutines.
> + *
> + * Threads that make heavy use of coroutines should use this. Failure to resize
> + * the coroutine pool can lead to large amounts of memory sitting idle and
> + * never being used after the first time.
> + */
> +typedef struct {
> +QEMUTimer *timer;
> +} CoroutinePoolTimer;
> +
> +/* Call this before the thread runs the AioContext */
> +void coroutine_pool_timer_init(CoroutinePoolTimer *pt, AioContext *ctx);
> +
> +/* Call this before the AioContext from the init function is destroyed */
> +void coroutine_pool_timer_cleanup(CoroutinePoolTimer *pt);
> +
> +#endif /* COROUTINE_POOL_TIMER_H */
> diff --git a/include/qemu/coroutine.h b/include/qemu/coroutine.h
> index 4829ff373d..fdb2955ff9 100644
> --- a/include/qemu/coroutine.h
> +++ b/include/qemu/coroutine.h
> @@ -122,6 +122,13 @@ bool qemu_in_coroutine(void);
>   */
>  bool qemu_coroutine_entered(Coroutine *co);

[PATCH] coroutine: resize pool periodically instead of limiting size

2021-09-01 Thread Stefan Hajnoczi
It was reported that enabling SafeStack reduces IOPS significantly
(>25%) with the following fio benchmark on virtio-blk using an NVMe host
block device:

  # fio --rw=randrw --bs=4k --iodepth=64 --runtime=1m --direct=1 \
      --filename=/dev/vdb --name=job1 --ioengine=libaio --thread \
      --group_reporting --numjobs=16 --time_based \
      --output=/tmp/fio_result

Serge Guelton and I found that SafeStack is not really at fault, it just
increases the cost of coroutine creation. This fio workload exhausts the
coroutine pool and coroutine creation becomes a bottleneck. Previous
work by Honghao Wang also pointed to excessive coroutine creation.

Creating new coroutines is expensive due to allocating new stacks with
mmap(2) and mprotect(2). Currently there are thread-local and global
pools that recycle old Coroutine objects and their stacks but the
hardcoded size limit of 64 for thread-local pools and 128 for the global
pool is insufficient for the fio benchmark shown above.

This patch changes the coroutine pool algorithm to a simple thread-local
pool without a size limit. Threads periodically shrink the pool down to
a size sufficient for the maximum observed number of coroutines.

This is a very simple algorithm. Fancier things could be done like
keeping a minimum number of coroutines around to avoid latency when a
new coroutine is created after a long period of inactivity. Another
thought is to stop the timer when the pool size is zero for power saving
on threads that aren't using coroutines. However, I'd rather not add
bells and whistles unless they are really necessary.

The global pool is removed by this patch. It can help to hide the fact
that local pools are easily exhausted, but it doesn't fix the root
cause. I don't think there is a need for a global pool because QEMU's
threads are long-lived, so let's keep things simple.

Performance of the above fio benchmark is as follows:

          Before   After
   IOPS   60k      97k

Memory usage varies over time as needed by the workload:

              VSZ (KB)             RSS (KB)
 Before fio   4705248              843128
 During fio   5747668 (+ ~100 MB)  849280
 After fio    4694996 (- ~100 MB)  845184

This confirms that coroutines are indeed being freed when no longer
needed.

Thanks to Serge Guelton for working on identifying the bottleneck with
me!

Reported-by: Tingting Mao 
Cc: Serge Guelton 
Cc: Honghao Wang 
Cc: Paolo Bonzini 
Cc: Daniele Buono 
Signed-off-by: Stefan Hajnoczi 
---
 include/qemu/coroutine-pool-timer.h | 36 +
 include/qemu/coroutine.h            |  7 
 iothread.c                          |  6 +++
 util/coroutine-pool-timer.c         | 35 
 util/main-loop.c                    |  5 +++
 util/qemu-coroutine.c               | 62 ++---
 util/meson.build                    |  1 +
 7 files changed, 119 insertions(+), 33 deletions(-)
 create mode 100644 include/qemu/coroutine-pool-timer.h
 create mode 100644 util/coroutine-pool-timer.c

diff --git a/include/qemu/coroutine-pool-timer.h b/include/qemu/coroutine-pool-timer.h
new file mode 100644
index 00..c0b520ce99
--- /dev/null
+++ b/include/qemu/coroutine-pool-timer.h
@@ -0,0 +1,36 @@
+/*
+ * QEMU coroutine pool timer
+ *
+ * Copyright (c) 2021 Red Hat, Inc.
+ *
+ * SPDX-License-Identifier: LGPL-2.1-or-later
+ *
+ * This work is licensed under the terms of the GNU LGPL, version 2.1 or later.
+ * See the COPYING.LIB file in the top-level directory.
+ *
+ */
+#ifndef COROUTINE_POOL_TIMER_H
+#define COROUTINE_POOL_TIMER_H
+
+#include "qemu/osdep.h"
+#include "block/aio.h"
+
+/**
+ * A timer that periodically resizes this thread's coroutine pool, freeing
+ * memory if there are too many unused coroutines.
+ *
+ * Threads that make heavy use of coroutines should use this. Failure to resize
+ * the coroutine pool can lead to large amounts of memory sitting idle and
+ * never being used after the first time.
+ */
+typedef struct {
+QEMUTimer *timer;
+} CoroutinePoolTimer;
+
+/* Call this before the thread runs the AioContext */
+void coroutine_pool_timer_init(CoroutinePoolTimer *pt, AioContext *ctx);
+
+/* Call this before the AioContext from the init function is destroyed */
+void coroutine_pool_timer_cleanup(CoroutinePoolTimer *pt);
+
+#endif /* COROUTINE_POOL_TIMER_H */
diff --git a/include/qemu/coroutine.h b/include/qemu/coroutine.h
index 4829ff373d..fdb2955ff9 100644
--- a/include/qemu/coroutine.h
+++ b/include/qemu/coroutine.h
@@ -122,6 +122,13 @@ bool qemu_in_coroutine(void);
  */
 bool qemu_coroutine_entered(Coroutine *co);
 
+/**
+ * Optionally call this function periodically to shrink the thread-local pool
+ * down. Spiky workloads can create many coroutines and then never reach that
+ * level again. Shrinking the pool reclaims memory in this case.
+ */
+void qemu_coroutine_pool_periodic_resize(void);
+
 /**
  * Provides a mutex that can be used to synchronise coroutines
  */
diff --git a/iothread.c b/iothread.c
inde