On 28.11.2014 at 13:45, Paolo Bonzini wrote:
>
> On 28/11/2014 13:39, Peter Lieven wrote:
>> On 28.11.2014 at 13:26, Paolo Bonzini wrote:
>>> On 28/11/2014 12:46, Peter Lieven wrote:
>>>>> I get:
>>>>> Run operation 40000000 iterations 9.883958 s, 4046K operations/s,
>>>>> 247ns per coroutine
>>>> Ok, understood, it "steals" the whole pool, right? Isn't that bad if
>>>> we have more than one thread in need of a lot of coroutines?
>>> Overall the algorithm is expected to adapt. The N threads contribute to
>>> the global release pool, so the pool will fill up N times faster than
>>> if you had only one thread. There can be some variance, which is why
>>> the maximum size of the pool is twice the threshold (and it probably
>>> could be tuned better).
>>>
>>> Benchmarks are needed on real I/O too, of course, especially with high
>>> queue depth.
>> Yes, cool. The atomic operations are a bit tricky at first glance ;-)
>>
>> Question:
>> Why is the pool_size increment atomic but the set to zero not?
> Because the set to zero is not a read-modify-write operation, so it is
> always atomic. It's just not sequentially consistent (see
> docs/atomics.txt for some info on what that means).
>
>> Idea:
>> If the release_pool is full, why not put the coroutine in the thread's
>> alloc_pool instead of throwing it away? :-)
> Because you can only waste 64 coroutines per thread. But those numbers
> cannot be sneezed at, so it's worth doing it as a separate patch.
>
>> Run operation 40000000 iterations 9.057805 s, 4416K operations/s,
>> 226ns per coroutine
>>
>> diff --git a/qemu-coroutine.c b/qemu-coroutine.c
>> index 6bee354..edea162 100644
>> --- a/qemu-coroutine.c
>> +++ b/qemu-coroutine.c
>> @@ -25,8 +25,9 @@ enum {
>>
>>  /** Free list to speed up creation */
>>  static QSLIST_HEAD(, Coroutine) release_pool = QSLIST_HEAD_INITIALIZER(pool);
>> -static unsigned int pool_size;
>> +static unsigned int release_pool_size;
>>  static __thread QSLIST_HEAD(, Coroutine) alloc_pool = QSLIST_HEAD_INITIALIZER(pool);
>> +static __thread unsigned int alloc_pool_size;
>>
>>  /* The GPrivate is only used to invoke coroutine_pool_cleanup. */
>>  static void coroutine_pool_cleanup(void *value);
>> @@ -39,12 +40,12 @@ Coroutine *qemu_coroutine_create(CoroutineEntry *entry)
>>      if (CONFIG_COROUTINE_POOL) {
>>          co = QSLIST_FIRST(&alloc_pool);
>>          if (!co) {
>> -            if (pool_size > POOL_BATCH_SIZE) {
>> -                /* This is not exact; there could be a little skew between pool_size
>> +            if (release_pool_size > POOL_BATCH_SIZE) {
>> +                /* This is not exact; there could be a little skew between release_pool_size
>>                   * and the actual size of alloc_pool. But it is just a heuristic,
>>                   * it does not need to be perfect.
>>                   */
>> -                pool_size = 0;
>> +                alloc_pool_size = atomic_fetch_and(&release_pool_size, 0);
>>                  QSLIST_MOVE_ATOMIC(&alloc_pool, &release_pool);
>>                  co = QSLIST_FIRST(&alloc_pool);
>>
>> @@ -53,6 +54,8 @@ Coroutine *qemu_coroutine_create(CoroutineEntry *entry)
>>                   */
>>                  g_private_set(&dummy_key, &dummy_key);
>>              }
>> +        } else {
>> +            alloc_pool_size--;
>>          }
>>          if (co) {
>>              QSLIST_REMOVE_HEAD(&alloc_pool, pool_next);
>> @@ -71,10 +74,15 @@ Coroutine *qemu_coroutine_create(CoroutineEntry *entry)
>>  static void coroutine_delete(Coroutine *co)
>>  {
>>      if (CONFIG_COROUTINE_POOL) {
>> -        if (pool_size < POOL_BATCH_SIZE * 2) {
>> +        if (release_pool_size < POOL_BATCH_SIZE * 2) {
>>              co->caller = NULL;
>>              QSLIST_INSERT_HEAD_ATOMIC(&release_pool, co, pool_next);
>> -            atomic_inc(&pool_size);
>> +            atomic_inc(&release_pool_size);
>> +            return;
>> +        } else if (alloc_pool_size < POOL_BATCH_SIZE) {
>> +            co->caller = NULL;
>> +            QSLIST_INSERT_HEAD(&alloc_pool, co, pool_next);
>> +            alloc_pool_size++;
>>              return;
>>          }
>>      }
>>
>>
>> Bug?:
>> The release_pool is not cleaned up on termination, I think.
> That's not necessary, it is global.
I don't see where you iterate over release_pool and destroy all
coroutines. Maybe just add back the old destructor, with
s/pool/release_pool/g?

Peter