Re: [Qemu-devel] [PATCH] thread-pool: fix deadlock when callbacks depend on each other
>> I'll test it tomorrow. I assume you want to avoid calling
>> event_notifier_set() until the function is reentered via aio_poll?
>
> Yes. But actually, I need to check whether it's possible to fix
> bdrv_drain_all. If you're in coroutine context, you can defer the
> draining to a safe point using a bottom half. If you're not in
> coroutine context, perhaps bdrv_drain_all has to be made illegal.
> Which means a bunch of code auditing...

For what it's worth, your solution also works fine; I couldn't
reproduce the hang with it. Updated patch proposal posted earlier
today.

--
mg
Re: [Qemu-devel] [PATCH] thread-pool: fix deadlock when callbacks depend on each other
On 01/06/2014 21:02, Marcin Gibuła wrote:
>> Good catch! The main problem with the patch is that you need to use
>> atomic_inc/atomic_dec to increment and decrement
>> pool->pending_completions.
>
> Ok.
>
>> Secondarily, event_notifier_set is pretty heavy-weight, does it work
>> if you wrap the loop like this?
>>
>>     restart:
>>     QLIST_FOREACH_SAFE(elem, &pool->head, all, next) {
>>         ...
>>     }
>>     if (pool->pending_completions) {
>>         goto restart;
>>     }
>>     event_notifier_test_and_clear(notifier);
>>     if (pool->pending_completions) {
>>         event_notifier_set(notifier);
>>         goto restart;
>>     }
>
> I'll test it tomorrow. I assume you want to avoid calling
> event_notifier_set() until the function is reentered via aio_poll?

Yes. But actually, I need to check whether it's possible to fix
bdrv_drain_all. If you're in coroutine context, you can defer the
draining to a safe point using a bottom half. If you're not in
coroutine context, perhaps bdrv_drain_all has to be made illegal.
Which means a bunch of code auditing...

Paolo

>> Finally, the same bug is also in block/linux-aio.c and
>> block/win32-aio.c.
>
> I can try with linux-aio, but my knowledge of the Windows API is
> zero...
Re: [Qemu-devel] [PATCH] thread-pool: fix deadlock when callbacks depend on each other
> Good catch! The main problem with the patch is that you need to use
> atomic_inc/atomic_dec to increment and decrement
> pool->pending_completions.

Ok.

> Secondarily, event_notifier_set is pretty heavy-weight, does it work
> if you wrap the loop like this?
>
>     restart:
>     QLIST_FOREACH_SAFE(elem, &pool->head, all, next) {
>         ...
>     }
>     if (pool->pending_completions) {
>         goto restart;
>     }
>     event_notifier_test_and_clear(notifier);
>     if (pool->pending_completions) {
>         event_notifier_set(notifier);
>         goto restart;
>     }

I'll test it tomorrow. I assume you want to avoid calling
event_notifier_set() until the function is reentered via aio_poll?

> Finally, the same bug is also in block/linux-aio.c and
> block/win32-aio.c.

I can try with linux-aio, but my knowledge of the Windows API is
zero...

--
mg
Re: [Qemu-devel] [PATCH] thread-pool: fix deadlock when callbacks depend on each other
On 31/05/2014 20:29, Marcin Gibuła wrote:
> @@ -185,6 +191,14 @@ restart:
>          }
>          if (elem->state == THREAD_DONE && elem->common.cb) {
>              QLIST_REMOVE(elem, all);
> +            /* If more completed requests are waiting, notifier needs
> +             * to be rearmed so callback can progress with aio_poll().
> +             */
> +            pool->pending_completions--;
> +            if (pool->pending_completions) {
> +                event_notifier_set(notifier);
> +            }
> +
>              /* Read state before ret. */
>              smp_rmb();
>              elem->common.cb(elem->common.opaque, elem->ret);

Good catch! The main problem with the patch is that you need to use
atomic_inc/atomic_dec to increment and decrement
pool->pending_completions.

Secondarily, event_notifier_set is pretty heavy-weight, does it work
if you wrap the loop like this?

    restart:
    QLIST_FOREACH_SAFE(elem, &pool->head, all, next) {
        ...
    }
    if (pool->pending_completions) {
        goto restart;
    }
    event_notifier_test_and_clear(notifier);
    if (pool->pending_completions) {
        event_notifier_set(notifier);
        goto restart;
    }

Finally, the same bug is also in block/linux-aio.c and
block/win32-aio.c.

Paolo
[Qemu-devel] [PATCH] thread-pool: fix deadlock when callbacks depend on each other
When two coroutines submit I/O and the first coroutine depends on the
second one completing (by calling bdrv_drain_all), a deadlock may
occur.

This is because both requests may have completed before the thread
pool notifier got called. Then, when the notifier gets executed and
the first coroutine calls aio_poll() to make progress, it hangs
forever, as the notifier's descriptor has already been marked clear.

This patch fixes that by rearming the thread pool notifier if there is
more than one completed request on the list.

Without this patch, I could reproduce the bug with snapshot-commit in
about 1 out of 10 tries. With this patch, I couldn't reproduce it any
more.

Signed-off-by: Marcin Gibula
---
--- thread-pool.c	2014-04-17 15:44:45.0 +0200
+++ thread-pool.c	2014-05-31 20:20:26.083011514 +0200
@@ -76,6 +76,8 @@ struct ThreadPool {
     int new_threads;     /* backlog of threads we need to create */
     int pending_threads; /* threads created but not running yet */
     int pending_cancellations; /* whether we need a cond_broadcast */
+    int pending_completions; /* whether we need to rearm notifier when
+                              * executing callback */
     bool stopping;
 };

@@ -110,6 +112,10 @@ static void *worker_thread(void *opaque)
         ret = req->func(req->arg);

         req->ret = ret;
+        if (req->common.cb) {
+            pool->pending_completions++;
+        }
+
         /* Write ret before state. */
         smp_wmb();
         req->state = THREAD_DONE;

@@ -185,6 +191,14 @@ restart:
         }
         if (elem->state == THREAD_DONE && elem->common.cb) {
             QLIST_REMOVE(elem, all);
+            /* If more completed requests are waiting, notifier needs
+             * to be rearmed so callback can progress with aio_poll().
+             */
+            pool->pending_completions--;
+            if (pool->pending_completions) {
+                event_notifier_set(notifier);
+            }
+
             /* Read state before ret. */
             smp_rmb();
             elem->common.cb(elem->common.opaque, elem->ret);