Francesco Romani has posted comments on this change.

Change subject: threadpool: docs: add design notes
......................................................................
Patch Set 4:

(5 comments)

Thanks for your comments. Will expand these docs.

http://gerrit.ovirt.org/#/c/29190/4/lib/threadpool/README.rst
File lib/threadpool/README.rst:

Line 49: 
Line 50: The countermeasures implemented in WatchedThreadPool are the following.
Line 51: 
Line 52: 1. each worker takes a task exclusively.
Line 53: each task must be taken exactly by one worker and exactly once for each
> Why do you care about the task used only once? the thread pool should care

Here I'm trying to address a different problem. It can happen (actually, it has already happened many times) that a specific libvirt call gets stuck because it accesses a QEMU monitor whose process is in D state (hello, virDomainGetBlockInfo). Detecting and reacting to a blocked QEMU is a different problem (and of course a much more serious one) which needs to be handled elsewhere, so let's just consider an unresponsive call.

In this case, there is no point in retrying from a different thread: the call is likely to fail again over and over, so a retry will likely just block another thread, and so on, in a cascading effect. If the pool ensures the above (exactly one worker sees a task, exactly once, and no one else sees the task while it is being processed), then the blocked operation cannot be attempted more than once. This is bad, but in my opinion we are just accepting and adapting to a bad situation outside our control, not making it worse.

Another way to go could be to cancel the stuck task (which is surely a nice feature to have anyway) and reattempt it. But if the previous call is still blocked inside libvirt, or inside any other lower layer outside our control, what benefit would reattempting give us? I fail to see the gain right now.
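For what it's worth, the "taken exactly once" guarantee I'm relying on here falls out of a plain queue.Queue. A toy sketch (the ExclusiveTakePool name is invented for this example, it is not vdsm code):

```python
import queue
import threading
import time

class ExclusiveTakePool:
    """Toy pool illustrating point 1: queue.Queue.get() hands each
    queued task to exactly one worker, exactly once, so an unresponsive
    task can freeze at most one worker thread."""

    def __init__(self, workers=3):
        self._tasks = queue.Queue()
        for _ in range(workers):
            threading.Thread(target=self._run, daemon=True).start()

    def submit(self, func):
        self._tasks.put(func)

    def _run(self):
        while True:
            task = self._tasks.get()  # consumed exactly once; never retried
            task()

done = []
blocker = threading.Event()

pool = ExclusiveTakePool(workers=3)
pool.submit(blocker.wait)             # stands in for a call stuck in libvirt
for i in range(5):
    pool.submit(lambda i=i: done.append(i))

time.sleep(0.5)                       # give the remaining workers time
print(sorted(done))                   # prints [0, 1, 2, 3, 4]
blocker.set()
```

The stuck task freezes exactly one of the three workers; the other two keep draining the queue, which is the behavior the doc describes.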
To summarize, my belief is that in the presence of blocked calls the best course of action is to wait for them to unblock (of course without impacting other well-behaving calls) and not to retry, because according to my observations, and the reports I've seen, retrying will just make things worse. The code reflects this belief.

Line 54: request. Each task is consumed by the first worker thread which takes it.
Line 55: Thus, at any given time at most one worker thread can get stuck
Line 56: on an unresponsive task.
Line 57: Periodic task are implemented by letting a worker thread to produce a clone
Line 58: of the task that was consumed and reinjecting it into the queue.
> We don't need periodic tasks. Lets simplify by removing implementing only w

Well, I agree we do not need the concept of periodic tasks in the pool, but in sampling we need to gather stats periodically. Probably this is better handled elsewhere (e.g. a separate scheduler thread); I need to think about this.

Line 59: 
Line 60: 2. each worker thread is made capable to answer to a couple of simple yet
Line 61: important queries:
Line 62: - what are you doing?
Line 63: - how long is this task taking?
> Nice, but the time it takes to run a task say nothing about task state. Som

I agree that just using wall-clock time is a poor man's way to guess that a task is stuck. So I think it is better if I make the tasks capable of detecting whether they are stuck or not. Maybe a per-task timeout is good enough already (or a per-task-class one, e.g. storage, cpu, etc.):
instead of worker.is_stuck(), something like worker.task.is_stuck().

Line 64: 
Line 65: 3. a watcher thread is added to the pool.
Line 66: the watcher thread does not do any real work, but instead periodically
Line 67: checks the health of each active worker thread, and detects long running
Line 68: tasks. Then the watcher can implement any stuck-detection policy.
Line 69: The simplest one is to consider a thread 'stuck' if any task elaboration
Line 70: time exceeds a given threshold.
> The watcher thread should not be part of the pool, it is unrelated to the p

Currently the supervisor/watcher is part of the pool because it also replaces the stuck threads. If, instead of replacing threads, we cancel tasks, then yes, the supervisor can be simplified and moved outside the pool.

Line 71: 
Line 72: 4. the worker threads detected as 'stuck' are transparently detached by the
Line 73: pool and replaced.
Line 74: This is transparently done by the watcher thread, which ensures the pool
Line 75: has a constant active workforce. The threads detected as stuck are taken out
Line 76: the pool and put into a limbo until they eventually unblock.
Line 77: The task detected as blocked are *not* automatically retried. This is
Line 78: very important because, coupled with point #1, avoids the lemmings effect
Line 79: of more and more worker threads frozen attempting to do a blocking task.
> I don't think we should remove stuck threads but instead cancel stuck tasks

See the answer to the first comment for remove vs. cancel. If we want to preserve the capability of replacing sick worker threads, then I feel that having a supervisor inside (or somehow attached to) the pool is cleaner.
Line 80: 
Line 81: With all in place, WatchedThreadPool can effectively minimize the waste
Line 82: of resources, and can deal with unresponsive tasks gracefully.
Line 83: 

-- 
To view, visit http://gerrit.ovirt.org/29190
To unsubscribe, visit http://gerrit.ovirt.org/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I400799e300f5d012dae5d158c4379fe57db1bd37
Gerrit-PatchSet: 4
Gerrit-Project: vdsm
Gerrit-Branch: master
Gerrit-Owner: Francesco Romani <[email protected]>
Gerrit-Reviewer: Antoni Segura Puimedon <[email protected]>
Gerrit-Reviewer: Dan Kenigsberg <[email protected]>
Gerrit-Reviewer: Dima Kuznetsov <[email protected]>
Gerrit-Reviewer: Federico Simoncelli <[email protected]>
Gerrit-Reviewer: Francesco Romani <[email protected]>
Gerrit-Reviewer: Michal Skrivanek <[email protected]>
Gerrit-Reviewer: Nir Soffer <[email protected]>
Gerrit-Reviewer: Saggi Mizrahi <[email protected]>
Gerrit-Reviewer: Vinzenz Feenstra <[email protected]>
Gerrit-Reviewer: Yaniv Bronhaim <[email protected]>
Gerrit-Reviewer: [email protected]
Gerrit-Reviewer: oVirt Jenkins CI Server
Gerrit-HasComments: Yes

_______________________________________________
vdsm-patches mailing list
[email protected]
https://lists.fedorahosted.org/mailman/listinfo/vdsm-patches
