Re: [PULL 1/1] coroutine: cap per-thread local pool size
On Tue, Mar 19, 2024 at 03:14:07PM +0000, Daniel P. Berrangé wrote:
> Sending this PULL feels a little rushed, as I still have
> unanswered questions on the initial patch posting just
> a few hours ago

Sorry, I hadn't seen your email. I'll update this email thread once the
discussion has finished.

Stefan

> On Tue, Mar 19, 2024 at 11:09:38AM -0400, Stefan Hajnoczi wrote:
> > The coroutine pool implementation can hit the Linux vm.max_map_count
> > limit, causing QEMU to abort with "failed to allocate memory for stack"
> > or "failed to set up stack guard page" during coroutine creation.
> >
> > This happens because per-thread pools can grow to tens of thousands of
> > coroutines. Each coroutine causes 2 virtual memory areas to be created.
> > Eventually vm.max_map_count is reached and memory-related syscalls fail.
> > The per-thread pool sizes are non-uniform and depend on past coroutine
> > usage in each thread, so it's possible for one thread to have a large
> > pool while another thread's pool is empty.
> >
> > Switch to a new coroutine pool implementation with a global pool that
> > grows to a maximum number of coroutines and per-thread local pools that
> > are capped at a hardcoded small number of coroutines.
> >
> > This approach does not leave large numbers of coroutines pooled in a
> > thread that may not use them again. In order to perform well it
> > amortizes the cost of global pool accesses by working in batches of
> > coroutines instead of individual coroutines.
> >
> > The global pool is a list. Threads donate batches of coroutines to it
> > when they have too many and take batches from it when they have too few:
> >
> > .-----------------------------------.
> > | Batch 1 | Batch 2 | Batch 3 | ... | global_pool
> > `-----------------------------------'
> >
> > Each thread has up to 2 batches of coroutines:
> >
> > .-------------------.
> > | Batch 1 | Batch 2 | per-thread local_pool (maximum 2 batches)
> > `-------------------'
> >
> > The goal of this change is to reduce the excessive number of pooled
> > coroutines that cause QEMU to abort when vm.max_map_count is reached
> > without losing the performance of an adequately sized coroutine pool.
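(As an aside for readers: the batching scheme above can be sketched in
plain C roughly as follows. The names Obj, Batch, pool_get() and
pool_put() are made up for illustration; the real implementation in
util/qemu-coroutine.c uses Coroutine, QSLIST and QEMU's thread/locking
primitives, and also enforces the global pool's maximum size, which
this sketch omits.)

/*
 * Sketch only: made-up names and plain pthreads instead of QEMU's
 * primitives. The point is that threads touch the global pool a whole
 * batch at a time, so lock traffic is amortized.
 */
#include <pthread.h>
#include <stdlib.h>

enum { BATCH_MAX = 128 };          /* cf. COROUTINE_POOL_BATCH_MAX_SIZE */

typedef struct Obj { struct Obj *next; } Obj;  /* stand-in for Coroutine */

typedef struct Batch {
    struct Batch *next;
    unsigned size;
    Obj *objs[BATCH_MAX];
} Batch;

static Batch *global_pool;         /* list of donated (full) batches */
static pthread_mutex_t global_lock = PTHREAD_MUTEX_INITIALIZER;

static __thread Batch *local_pool[2];   /* per-thread, at most 2 batches */

static Batch *batch_new(void)
{
    Batch *b = calloc(1, sizeof(*b));
    if (!b) {
        abort();                   /* keep the sketch simple on OOM */
    }
    return b;
}

static Obj *pool_get(void)
{
    /* Fast path: pop from a local batch without taking any lock. */
    for (int i = 0; i < 2; i++) {
        if (local_pool[i] && local_pool[i]->size > 0) {
            return local_pool[i]->objs[--local_pool[i]->size];
        }
    }

    /* Slow path: take one whole batch from the global pool so the lock
     * cost is amortized over up to BATCH_MAX objects. */
    pthread_mutex_lock(&global_lock);
    Batch *b = global_pool;
    if (b) {
        global_pool = b->next;
    }
    pthread_mutex_unlock(&global_lock);

    if (!b) {
        return NULL;               /* caller creates a fresh object */
    }
    free(local_pool[0]);           /* slot 0 is empty: fast path failed */
    local_pool[0] = b;
    return b->objs[--b->size];
}

static void pool_put(Obj *obj)
{
    /* Fill a local batch first; no lock on this path either. */
    for (int i = 0; i < 2; i++) {
        if (!local_pool[i]) {
            local_pool[i] = batch_new();
        }
        if (local_pool[i]->size < BATCH_MAX) {
            local_pool[i]->objs[local_pool[i]->size++] = obj;
            return;
        }
    }

    /* Both local batches full: donate one full batch to the global pool.
     * The patch additionally caps the global pool at the maximum
     * controlled by qemu_coroutine_inc_pool_size(); omitted here. */
    pthread_mutex_lock(&global_lock);
    local_pool[1]->next = global_pool;
    global_pool = local_pool[1];
    pthread_mutex_unlock(&global_lock);

    local_pool[1] = batch_new();
    local_pool[1]->objs[local_pool[1]->size++] = obj;
}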
> >
> > Here are virtio-blk disk I/O benchmark results:
> >
> > RW         BLKSIZE IODEPTH    OLD    NEW CHANGE
> > randread        4k       1 113725 117451 +3.3%
> > randread        4k       8 192968 198510 +2.9%
> > randread        4k      16 207138 209429 +1.1%
> > randread        4k      32 212399 215145 +1.3%
> > randread        4k      64 218319 221277 +1.4%
> > randread      128k       1  17587  17535 -0.3%
> > randread      128k       8  17614  17616 +0.0%
> > randread      128k      16  17608  17609 +0.0%
> > randread      128k      32  17552  17553 +0.0%
> > randread      128k      64  17484  17484 +0.0%
> >
> > See files/{fio.sh,test.xml.j2} for the benchmark configuration:
> > https://gitlab.com/stefanha/virt-playbooks/-/tree/coroutine-pool-fix-sizing
> >
> > Buglink: https://issues.redhat.com/browse/RHEL-28947
> > Reported-by: Sanjay Rao
> > Reported-by: Boaz Ben Shabat
> > Reported-by: Joe Mario
> > Reviewed-by: Kevin Wolf
> > Signed-off-by: Stefan Hajnoczi
> > Message-ID: <20240318183429.1039340-1-stefa...@redhat.com>
> > ---
> >  util/qemu-coroutine.c | 282 +-
> >  1 file changed, 223 insertions(+), 59 deletions(-)
> >
> > diff --git a/util/qemu-coroutine.c b/util/qemu-coroutine.c
> > index 5fd2dbaf8b..2790959eaf 100644
> > --- a/util/qemu-coroutine.c
> > +++ b/util/qemu-coroutine.c
> > @@ -18,39 +18,200 @@
> >  #include "qemu/atomic.h"
> >  #include "qemu/coroutine_int.h"
> >  #include "qemu/coroutine-tls.h"
> > +#include "qemu/cutils.h"
> >  #include "block/aio.h"
> >
> > -/**
> > - * The minimal batch size is always 64, coroutines from the release_pool are
> > - * reused as soon as there are 64 coroutines in it. The maximum pool size starts
> > - * with 64 and is increased on demand so that coroutines are not deleted even if
> > - * they are not immediately reused.
> > - */
> >  enum {
> > -    POOL_MIN_BATCH_SIZE = 64,
> > -    POOL_INITIAL_MAX_SIZE = 64,
> > +    COROUTINE_POOL_BATCH_MAX_SIZE = 128,
> >  };
> >
> > -/** Free list to speed up creation */
> > -static QSLIST_HEAD(, Coroutine) release_pool = QSLIST_HEAD_INITIALIZER(pool);
> > -static unsigned int pool_max_size = POOL_INITIAL_MAX_SIZE;
> > -static unsigned int release_pool_size;
> > +/*
> > + * Coroutine creation and deletion is expensive so a pool of unused coroutines
> > + * is kept as a cache. When the pool has coroutines available, they are
> > + * recycled instead of creating new ones from scratch. Coroutines are added to
> > + * the pool upon termination.
> > + *
> > + * The pool is global but each thread maintains a small local pool to avoid
> > + * global pool contention. Threads fetch and return batches of coroutines from
> > + * the global pool to maintain their local pool. The local
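(A note on the vm.max_map_count failure mode described in the commit
message: each pooled coroutine holds two VMAs, its stack mapping and
its guard page, so a rough upper bound is max_map_count / 2 coroutines
before mapping-related syscalls start failing. A minimal illustration,
assuming the common Linux default of 65530; the live value is in
/proc/sys/vm/max_map_count:)

/* Back-of-the-envelope bound; 65530 is the usual kernel default, not a
 * value taken from this thread. */
#include <stdio.h>

int main(void)
{
    const unsigned vmas_per_coroutine = 2;  /* stack + guard page */
    const unsigned max_map_count = 65530;   /* typical vm.max_map_count */

    printf("~%u pooled coroutines can exhaust the mapping limit\n",
           max_map_count / vmas_per_coroutine);
    return 0;
}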
Re: [PULL 1/1] coroutine: cap per-thread local pool size
Sending this PULL feels a little rushed, as I still have
unanswered questions on the initial patch posting just
a few hours ago

On Tue, Mar 19, 2024 at 11:09:38AM -0400, Stefan Hajnoczi wrote:
> The coroutine pool implementation can hit the Linux vm.max_map_count
> limit, causing QEMU to abort with "failed to allocate memory for stack"
> or "failed to set up stack guard page" during coroutine creation.
>
> This happens because per-thread pools can grow to tens of thousands of
> coroutines. Each coroutine causes 2 virtual memory areas to be created.
> Eventually vm.max_map_count is reached and memory-related syscalls fail.
> The per-thread pool sizes are non-uniform and depend on past coroutine
> usage in each thread, so it's possible for one thread to have a large
> pool while another thread's pool is empty.
>
> Switch to a new coroutine pool implementation with a global pool that
> grows to a maximum number of coroutines and per-thread local pools that
> are capped at a hardcoded small number of coroutines.
>
> This approach does not leave large numbers of coroutines pooled in a
> thread that may not use them again. In order to perform well it
> amortizes the cost of global pool accesses by working in batches of
> coroutines instead of individual coroutines.
>
> The global pool is a list. Threads donate batches of coroutines to it
> when they have too many and take batches from it when they have too few:
>
> .-----------------------------------.
> | Batch 1 | Batch 2 | Batch 3 | ... | global_pool
> `-----------------------------------'
>
> Each thread has up to 2 batches of coroutines:
>
> .-------------------.
> | Batch 1 | Batch 2 | per-thread local_pool (maximum 2 batches)
> `-------------------'
>
> The goal of this change is to reduce the excessive number of pooled
> coroutines that cause QEMU to abort when vm.max_map_count is reached
> without losing the performance of an adequately sized coroutine pool.
>
> Here are virtio-blk disk I/O benchmark results:
>
> RW         BLKSIZE IODEPTH    OLD    NEW CHANGE
> randread        4k       1 113725 117451 +3.3%
> randread        4k       8 192968 198510 +2.9%
> randread        4k      16 207138 209429 +1.1%
> randread        4k      32 212399 215145 +1.3%
> randread        4k      64 218319 221277 +1.4%
> randread      128k       1  17587  17535 -0.3%
> randread      128k       8  17614  17616 +0.0%
> randread      128k      16  17608  17609 +0.0%
> randread      128k      32  17552  17553 +0.0%
> randread      128k      64  17484  17484 +0.0%
>
> See files/{fio.sh,test.xml.j2} for the benchmark configuration:
> https://gitlab.com/stefanha/virt-playbooks/-/tree/coroutine-pool-fix-sizing
>
> Buglink: https://issues.redhat.com/browse/RHEL-28947
> Reported-by: Sanjay Rao
> Reported-by: Boaz Ben Shabat
> Reported-by: Joe Mario
> Reviewed-by: Kevin Wolf
> Signed-off-by: Stefan Hajnoczi
> Message-ID: <20240318183429.1039340-1-stefa...@redhat.com>
> ---
>  util/qemu-coroutine.c | 282 +-
>  1 file changed, 223 insertions(+), 59 deletions(-)
>
> diff --git a/util/qemu-coroutine.c b/util/qemu-coroutine.c
> index 5fd2dbaf8b..2790959eaf 100644
> --- a/util/qemu-coroutine.c
> +++ b/util/qemu-coroutine.c
> @@ -18,39 +18,200 @@
>  #include "qemu/atomic.h"
>  #include "qemu/coroutine_int.h"
>  #include "qemu/coroutine-tls.h"
> +#include "qemu/cutils.h"
>  #include "block/aio.h"
>
> -/**
> - * The minimal batch size is always 64, coroutines from the release_pool are
> - * reused as soon as there are 64 coroutines in it. The maximum pool size starts
> - * with 64 and is increased on demand so that coroutines are not deleted even if
> - * they are not immediately reused.
> - */
>  enum {
> -    POOL_MIN_BATCH_SIZE = 64,
> -    POOL_INITIAL_MAX_SIZE = 64,
> +    COROUTINE_POOL_BATCH_MAX_SIZE = 128,
>  };
>
> -/** Free list to speed up creation */
> -static QSLIST_HEAD(, Coroutine) release_pool = QSLIST_HEAD_INITIALIZER(pool);
> -static unsigned int pool_max_size = POOL_INITIAL_MAX_SIZE;
> -static unsigned int release_pool_size;
> +/*
> + * Coroutine creation and deletion is expensive so a pool of unused coroutines
> + * is kept as a cache. When the pool has coroutines available, they are
> + * recycled instead of creating new ones from scratch. Coroutines are added to
> + * the pool upon termination.
> + *
> + * The pool is global but each thread maintains a small local pool to avoid
> + * global pool contention. Threads fetch and return batches of coroutines from
> + * the global pool to maintain their local pool. The local pool holds up to two
> + * batches whereas the maximum size of the global pool is controlled by the
> + * qemu_coroutine_inc_pool_size() API.
> + *
> + * .-----------------------------------.
> + * | Batch 1 | Batch 2 | Batch 3 | ... | global_pool
> + * `-----------------------------------'
> + *
> + * .-------------------.
> + * | Batch 1 | Batch 2 | per-thread local_pool (maximum 2 batches)
> + *
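(For reference, qemu_coroutine_inc_pool_size(), named in the new comment
above, is the pre-existing API that devices use to reserve coroutine
pool capacity; virtio-blk, for example, grows the pool when the device
is realized. A rough sketch of that usage pattern, with a hypothetical
device and made-up queue sizes, not taken from this patch:)

/* Hypothetical device using the pool-sizing API; MY_DEV_* and
 * my_dev_*() are illustrative names only. */
#include "qemu/osdep.h"
#include "qemu/coroutine.h"

#define MY_DEV_QUEUES     4
#define MY_DEV_QUEUE_SIZE 256

static void my_dev_realize(void)
{
    /* Reserve capacity for the requests this device can keep in flight;
     * with this patch the same value also bounds the global pool. */
    qemu_coroutine_inc_pool_size(MY_DEV_QUEUES * MY_DEV_QUEUE_SIZE / 2);
}

static void my_dev_unrealize(void)
{
    /* Return the reserved capacity so the pool can shrink again. */
    qemu_coroutine_dec_pool_size(MY_DEV_QUEUES * MY_DEV_QUEUE_SIZE / 2);
}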