Re: [PULL 1/1] coroutine: cap per-thread local pool size

2024-03-19 Thread Stefan Hajnoczi
On Tue, Mar 19, 2024 at 03:14:07PM +, Daniel P. Berrangé wrote:
> Sending this PULL feels a little rushed, as I still have
> unanswered questions on the initial patch posting just
> a few hours ago

Sorry, I hadn't seen your email. I'll update this email thread once the
discussion has finished.

Stefan

> 
> On Tue, Mar 19, 2024 at 11:09:38AM -0400, Stefan Hajnoczi wrote:
> > The coroutine pool implementation can hit the Linux vm.max_map_count
> > limit, causing QEMU to abort with "failed to allocate memory for stack"
> > or "failed to set up stack guard page" during coroutine creation.
> > 
> > This happens because per-thread pools can grow to tens of thousands of
> > coroutines. Each coroutine causes 2 virtual memory areas to be created.
> > Eventually vm.max_map_count is reached and memory-related syscalls fail.
> > The per-thread pool sizes are non-uniform and depend on past coroutine
> > usage in each thread, so it's possible for one thread to have a large
> > pool while another thread's pool is empty.
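> >
> > (As a rough worked example, not from the patch: the Linux default
> > vm.max_map_count is 65530, so at 2 mappings per coroutine a little over
> > 32,000 pooled coroutines are enough to hit the limit, before counting
> > QEMU's other memory mappings.)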
> > 
> > Switch to a new coroutine pool implementation with a global pool that
> > grows to a maximum number of coroutines and per-thread local pools that
> > are capped at a small, hardcoded number of coroutines.
> > 
> > This approach does not leave large numbers of coroutines pooled in a
> > thread that may not use them again. To perform well, it
> > amortizes the cost of global pool accesses by working in batches of
> > coroutines instead of individual coroutines.
> > 
> > The global pool is a list. Threads donate batches of coroutines to it when
> > they have too many and take batches from it when they have too few:
> > 
> > .-----------------------------------.
> > | Batch 1 | Batch 2 | Batch 3 | ... | global_pool
> > `-----------------------------------'
> > 
> > Each thread has up to 2 batches of coroutines:
> > 
> > .-------------------.
> > | Batch 1 | Batch 2 | per-thread local_pool (maximum 2 batches)
> > `-------------------'
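> >
> > To make the batching concrete, here is a minimal, illustrative C sketch of
> > the scheme. It is not the patch code: BATCH_SIZE, FreeCo, Batch, pool_get()
> > and pool_put() are invented names, and the cap on the global pool's size is
> > omitted for brevity.
> >
> >     #include <stdlib.h>
> >     #include <pthread.h>
> >
> >     enum { BATCH_SIZE = 128 };  /* cf. COROUTINE_POOL_BATCH_MAX_SIZE */
> >
> >     /* a free coroutine and a batch of free coroutines */
> >     typedef struct FreeCo { struct FreeCo *next; } FreeCo;
> >     typedef struct Batch { struct Batch *next; FreeCo *cos; } Batch;
> >
> >     static Batch *global_pool;  /* list of full batches */
> >     static pthread_mutex_t global_lock = PTHREAD_MUTEX_INITIALIZER;
> >
> >     static __thread FreeCo *local_list;  /* per-thread cache... */
> >     static __thread unsigned local_size; /* ...capped at 2 * BATCH_SIZE */
> >
> >     /* Take a coroutine, refilling the cache one whole batch at a time. */
> >     static FreeCo *pool_get(void)
> >     {
> >         if (!local_list) {
> >             Batch *b = NULL;
> >
> >             pthread_mutex_lock(&global_lock);
> >             if (global_pool) {
> >                 b = global_pool;
> >                 global_pool = b->next;
> >             }
> >             pthread_mutex_unlock(&global_lock);
> >
> >             if (b) {
> >                 local_list = b->cos;
> >                 local_size = BATCH_SIZE;
> >                 free(b);
> >             }
> >         }
> >         if (!local_list) {
> >             return NULL;  /* caller allocates a fresh coroutine */
> >         }
> >
> >         FreeCo *co = local_list;
> >         local_list = co->next;
> >         local_size--;
> >         return co;
> >     }
> >
> >     /* Return a coroutine; once the cache exceeds two batches, donate one
> >      * full batch to the global pool instead of hoarding it locally. */
> >     static void pool_put(FreeCo *co)
> >     {
> >         co->next = local_list;
> >         local_list = co;
> >         if (++local_size <= 2 * BATCH_SIZE) {
> >             return;
> >         }
> >
> >         Batch *b = malloc(sizeof(*b));
> >         if (!b) {
> >             return;  /* sketch: simply keep the coroutines local */
> >         }
> >
> >         /* detach BATCH_SIZE coroutines from the front of the local list */
> >         FreeCo *head = local_list, *tail = head;
> >         for (unsigned i = 1; i < BATCH_SIZE; i++) {
> >             tail = tail->next;
> >         }
> >         local_list = tail->next;
> >         tail->next = NULL;
> >         local_size -= BATCH_SIZE;
> >
> >         b->cos = head;
> >         pthread_mutex_lock(&global_lock);
> >         b->next = global_pool;
> >         global_pool = b;
> >         pthread_mutex_unlock(&global_lock);
> >     }
> >
> > Working in whole batches like this means roughly one global pool access per
> > BATCH_SIZE coroutine creations or terminations rather than one per
> > coroutine, which is where the amortization described above comes from.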
> > 
> > The goal of this change is to reduce the excessive number of pooled
> > coroutines that cause QEMU to abort when vm.max_map_count is reached
> > without losing the performance of an adequately sized coroutine pool.
> > 
> > Here are virtio-blk disk I/O benchmark results:
> > 
> >        RW  BLKSIZE  IODEPTH     OLD     NEW  CHANGE
> >  randread       4k        1  113725  117451   +3.3%
> >  randread       4k        8  192968  198510   +2.9%
> >  randread       4k       16  207138  209429   +1.1%
> >  randread       4k       32  212399  215145   +1.3%
> >  randread       4k       64  218319  221277   +1.4%
> >  randread     128k        1   17587   17535   -0.3%
> >  randread     128k        8   17614   17616   +0.0%
> >  randread     128k       16   17608   17609   +0.0%
> >  randread     128k       32   17552   17553   +0.0%
> >  randread     128k       64   17484   17484   +0.0%
> > 
> > See files/{fio.sh,test.xml.j2} for the benchmark configuration:
> > https://gitlab.com/stefanha/virt-playbooks/-/tree/coroutine-pool-fix-sizing
> > 
> > Buglink: https://issues.redhat.com/browse/RHEL-28947
> > Reported-by: Sanjay Rao 
> > Reported-by: Boaz Ben Shabat 
> > Reported-by: Joe Mario 
> > Reviewed-by: Kevin Wolf 
> > Signed-off-by: Stefan Hajnoczi 
> > Message-ID: <20240318183429.1039340-1-stefa...@redhat.com>
> > ---
> >  util/qemu-coroutine.c | 282 +-
> >  1 file changed, 223 insertions(+), 59 deletions(-)
> > 
> > diff --git a/util/qemu-coroutine.c b/util/qemu-coroutine.c
> > index 5fd2dbaf8b..2790959eaf 100644
> > --- a/util/qemu-coroutine.c
> > +++ b/util/qemu-coroutine.c
> > @@ -18,39 +18,200 @@
> >  #include "qemu/atomic.h"
> >  #include "qemu/coroutine_int.h"
> >  #include "qemu/coroutine-tls.h"
> > +#include "qemu/cutils.h"
> >  #include "block/aio.h"
> >  
> > -/**
> > - * The minimal batch size is always 64, coroutines from the release_pool are
> > - * reused as soon as there are 64 coroutines in it. The maximum pool size starts
> > - * with 64 and is increased on demand so that coroutines are not deleted even if
> > - * they are not immediately reused.
> > - */
> >  enum {
> > -POOL_MIN_BATCH_SIZE = 64,
> > -POOL_INITIAL_MAX_SIZE = 64,
> > +COROUTINE_POOL_BATCH_MAX_SIZE = 128,
> >  };
> >  
> > -/** Free list to speed up creation */
> > -static QSLIST_HEAD(, Coroutine) release_pool = QSLIST_HEAD_INITIALIZER(pool);
> > -static unsigned int pool_max_size = POOL_INITIAL_MAX_SIZE;
> > -static unsigned int release_pool_size;
> > +/*
> > + * Coroutine creation and deletion is expensive so a pool of unused coroutines
> > + * is kept as a cache. When the pool has coroutines available, they are
> > + * recycled instead of creating new ones from scratch. Coroutines are added to
> > + * the pool upon termination.
> > + *
> > + * The pool is global but each thread maintains a small local pool to avoid
> > + * global pool contention. Threads fetch and return batches of coroutines from
> > + * the global pool to maintain their local pool. The local

Re: [PULL 1/1] coroutine: cap per-thread local pool size

2024-03-19 Thread Daniel P. Berrangé
Sending this PULL feels a little rushed, as I still have
unanswered questions on the initial patch posting just
a few hours ago

On Tue, Mar 19, 2024 at 11:09:38AM -0400, Stefan Hajnoczi wrote:
> The coroutine pool implementation can hit the Linux vm.max_map_count
> limit, causing QEMU to abort with "failed to allocate memory for stack"
> or "failed to set up stack guard page" during coroutine creation.
> 
> This happens because per-thread pools can grow to tens of thousands of
> coroutines. Each coroutine causes 2 virtual memory areas to be created.
> Eventually vm.max_map_count is reached and memory-related syscalls fail.
> The per-thread pool sizes are non-uniform and depend on past coroutine
> usage in each thread, so it's possible for one thread to have a large
> pool while another thread's pool is empty.
> 
> Switch to a new coroutine pool implementation with a global pool that
> grows to a maximum number of coroutines and per-thread local pools that
> are capped at a small, hardcoded number of coroutines.
> 
> This approach does not leave large numbers of coroutines pooled in a
> thread that may not use them again. To perform well, it
> amortizes the cost of global pool accesses by working in batches of
> coroutines instead of individual coroutines.
> 
> The global pool is a list. Threads donate batches of coroutines to it when
> they have too many and take batches from it when they have too few:
> 
> .-----------------------------------.
> | Batch 1 | Batch 2 | Batch 3 | ... | global_pool
> `-----------------------------------'
> 
> Each thread has up to 2 batches of coroutines:
> 
> .-------------------.
> | Batch 1 | Batch 2 | per-thread local_pool (maximum 2 batches)
> `-------------------'
> 
> The goal of this change is to reduce the excessive number of pooled
> coroutines that cause QEMU to abort when vm.max_map_count is reached
> without losing the performance of an adequately sized coroutine pool.
> 
> Here are virtio-blk disk I/O benchmark results:
> 
>        RW  BLKSIZE  IODEPTH     OLD     NEW  CHANGE
>  randread       4k        1  113725  117451   +3.3%
>  randread       4k        8  192968  198510   +2.9%
>  randread       4k       16  207138  209429   +1.1%
>  randread       4k       32  212399  215145   +1.3%
>  randread       4k       64  218319  221277   +1.4%
>  randread     128k        1   17587   17535   -0.3%
>  randread     128k        8   17614   17616   +0.0%
>  randread     128k       16   17608   17609   +0.0%
>  randread     128k       32   17552   17553   +0.0%
>  randread     128k       64   17484   17484   +0.0%
> 
> See files/{fio.sh,test.xml.j2} for the benchmark configuration:
> https://gitlab.com/stefanha/virt-playbooks/-/tree/coroutine-pool-fix-sizing
> 
> Buglink: https://issues.redhat.com/browse/RHEL-28947
> Reported-by: Sanjay Rao 
> Reported-by: Boaz Ben Shabat 
> Reported-by: Joe Mario 
> Reviewed-by: Kevin Wolf 
> Signed-off-by: Stefan Hajnoczi 
> Message-ID: <20240318183429.1039340-1-stefa...@redhat.com>
> ---
>  util/qemu-coroutine.c | 282 +-
>  1 file changed, 223 insertions(+), 59 deletions(-)
> 
> diff --git a/util/qemu-coroutine.c b/util/qemu-coroutine.c
> index 5fd2dbaf8b..2790959eaf 100644
> --- a/util/qemu-coroutine.c
> +++ b/util/qemu-coroutine.c
> @@ -18,39 +18,200 @@
>  #include "qemu/atomic.h"
>  #include "qemu/coroutine_int.h"
>  #include "qemu/coroutine-tls.h"
> +#include "qemu/cutils.h"
>  #include "block/aio.h"
>  
> -/**
> - * The minimal batch size is always 64, coroutines from the release_pool are
> - * reused as soon as there are 64 coroutines in it. The maximum pool size starts
> - * with 64 and is increased on demand so that coroutines are not deleted even if
> - * they are not immediately reused.
> - */
>  enum {
> -POOL_MIN_BATCH_SIZE = 64,
> -POOL_INITIAL_MAX_SIZE = 64,
> +COROUTINE_POOL_BATCH_MAX_SIZE = 128,
>  };
>  
> -/** Free list to speed up creation */
> -static QSLIST_HEAD(, Coroutine) release_pool = QSLIST_HEAD_INITIALIZER(pool);
> -static unsigned int pool_max_size = POOL_INITIAL_MAX_SIZE;
> -static unsigned int release_pool_size;
> +/*
> + * Coroutine creation and deletion is expensive so a pool of unused coroutines
> + * is kept as a cache. When the pool has coroutines available, they are
> + * recycled instead of creating new ones from scratch. Coroutines are added to
> + * the pool upon termination.
> + *
> + * The pool is global but each thread maintains a small local pool to avoid
> + * global pool contention. Threads fetch and return batches of coroutines from
> + * the global pool to maintain their local pool. The local pool holds up to two
> + * batches whereas the maximum size of the global pool is controlled by the
> + * qemu_coroutine_inc_pool_size() API.
> + *
> + * .-----------------------------------.
> + * | Batch 1 | Batch 2 | Batch 3 | ... | global_pool
> + * `-----------------------------------'
> + *
> + * .-------------------.
> + * | Batch 1 | Batch 2 | per-thread local_pool (maximum 2 batches)
> + * `-------------------'
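
For reference, qemu_coroutine_inc_pool_size(), mentioned in the comment above,
is how devices reserve capacity in the global pool. A rough sketch of such a
caller (the device and its sizing are illustrative, loosely modelled on how
virtio devices account for their in-flight requests):

    #include "qemu/osdep.h"
    #include "qemu/coroutine.h"

    /* Reserve one pooled coroutine per potential in-flight request so the
     * global pool may grow enough to recycle them, and release the
     * reservation again when the device goes away. */
    static void my_device_realize(unsigned int num_queues, unsigned int queue_size)
    {
        qemu_coroutine_inc_pool_size(num_queues * queue_size);
    }

    static void my_device_unrealize(unsigned int num_queues, unsigned int queue_size)
    {
        qemu_coroutine_dec_pool_size(num_queues * queue_size);
    }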