On Tue, Mar 19, 2024 at 9:32 AM Kevin Wolf <kw...@redhat.com> wrote:
> On 18.03.2024 at 19:34, Stefan Hajnoczi wrote:
> > The coroutine pool implementation can hit the Linux vm.max_map_count
> > limit, causing QEMU to abort with "failed to allocate memory for stack"
> > or "failed to set up stack guard page" during coroutine creation.
> >
> > This happens because per-thread pools can grow to tens of thousands of
> > coroutines. Each coroutine causes 2 virtual memory areas to be created.
> > Eventually vm.max_map_count is reached and memory-related syscalls fail.
> > The per-thread pool sizes are non-uniform and depend on past coroutine
> > usage in each thread, so it's possible for one thread to have a large
> > pool while another thread's pool is empty.
> >
> > Switch to a new coroutine pool implementation with a global pool that
> > grows to a maximum number of coroutines and per-thread local pools that
> > are capped at a hardcoded small number of coroutines.
> >
> > This approach does not leave large numbers of coroutines pooled in a
> > thread that may not use them again. In order to perform well, it
> > amortizes the cost of global pool accesses by working in batches of
> > coroutines instead of individual coroutines.
> >
> > The global pool is a list. Threads donate batches of coroutines to it
> > when they have too many and take batches from it when they have too few:
> >
> >    .-----------------------------------.
> >    | Batch 1 | Batch 2 | Batch 3 | ... | global_pool
> >    `-----------------------------------'
> >
> > Each thread has up to 2 batches of coroutines:
> >
> >    .-------------------.
> >    | Batch 1 | Batch 2 | per-thread local_pool (maximum 2 batches)
> >    `-------------------'
> >
> > The goal of this change is to reduce the excessive number of pooled
> > coroutines that cause QEMU to abort when vm.max_map_count is reached,
> > without losing the performance of an adequately sized coroutine pool.
> >
> > Here are virtio-blk disk I/O benchmark results:
> >
> > RW        BLKSIZE  IODEPTH     OLD     NEW  CHANGE
> > randread  4k       1        113725  117451   +3.3%
> > randread  4k       8        192968  198510   +2.9%
> > randread  4k       16       207138  209429   +1.1%
> > randread  4k       32       212399  215145   +1.3%
> > randread  4k       64       218319  221277   +1.4%
> > randread  128k     1         17587   17535   -0.3%
> > randread  128k     8         17614   17616   +0.0%
> > randread  128k     16        17608   17609   +0.0%
> > randread  128k     32        17552   17553   +0.0%
> > randread  128k     64        17484   17484   +0.0%
> >
> > See files/{fio.sh,test.xml.j2} for the benchmark configuration:
> > https://gitlab.com/stefanha/virt-playbooks/-/tree/coroutine-pool-fix-sizing
> >
> > Buglink: https://issues.redhat.com/browse/RHEL-28947
> > Reported-by: Sanjay Rao <s...@redhat.com>
> > Reported-by: Boaz Ben Shabat <bbens...@redhat.com>
> > Reported-by: Joe Mario <jma...@redhat.com>
> > Signed-off-by: Stefan Hajnoczi <stefa...@redhat.com>
>
> Reviewed-by: Kevin Wolf <kw...@redhat.com>
>
> Though I do wonder if we can do something about the slight performance
> degradation that Sanjay reported. We seem to stay well under the hard
> limit, so the reduced global pool size shouldn't be the issue. Maybe
> it's the locking?

We are only seeing a slight fall-off from our much improved numbers with
the addition of iothreads. I am not very concerned. With database
workloads, there is always run-to-run variation, especially when there
are a lot of idle CPUs on the host. To reduce the run-to-run variation,
we use CPU/NUMA pinning and other methods like PCI passthrough. If I get
a chance, I will do some runs with CPU pinning to see what the numbers
look like.
> Either way, even though it could be called a fix, I don't think this is
> for 9.0, right?
>
> Kevin

--
Sanjay Rao
Sr. Principal Performance Engineer
Red Hat, Inc.
314 Littleton Road
Westford, MA 01886
Phone: 978-392-2479
FAX: 978-392-1001
Email: s...@redhat.com
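
To make the donate/take scheme in the quoted commit message concrete, here
is a minimal C sketch of the idea: a mutex-protected global list of
fixed-size batches, with each thread keeping at most two batches locally.
This is an illustration only, not the actual QEMU patch; the names
(POOL_BATCH_SIZE, GLOBAL_POOL_MAX, coroutine_pool_get/put) and the constants
are assumptions made for the sketch, and a real coroutine would carry a
stack (the two virtual memory areas per coroutine mentioned above are
typically the stack mapping plus its guard page) rather than an empty
struct.

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

#define POOL_BATCH_SIZE  128  /* coroutines per batch (assumed value) */
#define GLOBAL_POOL_MAX    8  /* max batches in the global pool (assumed) */

typedef struct Coroutine {
    struct Coroutine *next;   /* free-list link; a real one carries a stack */
} Coroutine;

typedef struct Batch {
    struct Batch *next;
    Coroutine *list;          /* exactly POOL_BATCH_SIZE coroutines */
} Batch;

/* Global pool: a mutex-protected list of full batches. */
static pthread_mutex_t global_lock = PTHREAD_MUTEX_INITIALIZER;
static Batch *global_pool;
static unsigned global_pool_size;

/* Per-thread local pool, capped at 2 batches' worth of coroutines. */
static __thread Coroutine *local_pool;
static __thread unsigned local_pool_size;

static Coroutine *coroutine_create(void)
{
    return calloc(1, sizeof(Coroutine)); /* stands in for stack allocation */
}

Coroutine *coroutine_pool_get(void)
{
    if (!local_pool) {
        /* Local pool empty: try to take one batch from the global pool. */
        pthread_mutex_lock(&global_lock);
        Batch *b = global_pool;
        if (b) {
            global_pool = b->next;
            global_pool_size--;
        }
        pthread_mutex_unlock(&global_lock);
        if (b) {
            local_pool = b->list;
            local_pool_size = POOL_BATCH_SIZE;
            free(b);
        }
    }
    if (local_pool) {
        Coroutine *co = local_pool;
        local_pool = co->next;
        local_pool_size--;
        return co;
    }
    return coroutine_create(); /* both pools empty: allocate a fresh one */
}

void coroutine_pool_put(Coroutine *co)
{
    co->next = local_pool;
    local_pool = co;
    if (++local_pool_size < 2 * POOL_BATCH_SIZE) {
        return;               /* under the local cap: nothing else to do */
    }

    /* Local cap hit: detach one full batch and donate it globally. */
    Coroutine *tail = local_pool;
    for (unsigned i = 1; i < POOL_BATCH_SIZE; i++) {
        tail = tail->next;
    }
    Batch *b = malloc(sizeof(*b));
    b->list = local_pool;
    local_pool = tail->next;
    tail->next = NULL;
    local_pool_size -= POOL_BATCH_SIZE;

    pthread_mutex_lock(&global_lock);
    if (global_pool_size < GLOBAL_POOL_MAX) {
        b->next = global_pool;
        global_pool = b;
        global_pool_size++;
        b = NULL;
    }
    pthread_mutex_unlock(&global_lock);

    if (b) {
        /* Global pool is full: free the batch rather than hoarding it. */
        for (Coroutine *c = b->list; c; ) {
            Coroutine *n = c->next;
            free(c);
            c = n;
        }
        free(b);
    }
}

int main(void)
{
    Coroutine *co = coroutine_pool_get();
    coroutine_pool_put(co);
    printf("local pool now holds %u coroutine(s)\n", local_pool_size);
    return 0;
}

Note how batching amortizes the locking that Kevin asks about: the global
mutex is taken at most once per POOL_BATCH_SIZE get or put operations,
while the common path touches only thread-local state.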