Francesco Romani has posted comments on this change.

Change subject: threadpool: docs: add design notes
......................................................................


Patch Set 4:

(5 comments)

Thanks for your comments. I will expand these docs.

http://gerrit.ovirt.org/#/c/29190/4/lib/threadpool/README.rst
File lib/threadpool/README.rst:

Line 49: 
Line 50: The countermeasures implemented in WatchedThreadPool are the following.
Line 51: 
Line 52: 1. each worker takes a task exclusively.
Line 53:    each task must be taken by exactly one worker, and exactly once for each
> Why do you care about the task used only once? the thread pool should care 
Here I'm trying to address a different problem.
It can happen (actually it has already happened many times) that a specific 
libvirt call gets stuck because the QEMU monitor is blocked in D state (hello, 
virDomainGetBlockInfo).
Detecting and reacting to a blocked QEMU is a different (and of course much 
more serious) problem which needs to be handled elsewhere, so let's just 
consider an unresponsive call.

In this case, there is no point in retrying from a different thread: this will 
likely lead to another blocked thread, and so on in a cascading effect, because 
this call is likely to fail again over and over.

If the pool ensures the above (only one worker sees a task, exactly once, and 
no one else sees the task while it is being processed), then the blocked 
operation cannot be attempted more than once.

This is bad, but in my opinion we are just accepting and adapting to a bad 
situation outside our control, and not making it worse.

Another way to go could be to cancel the stuck task (which is surely a nice 
feature to have anyway) and reattempt it.
But if the previous call is still blocked inside libvirt, or in any other of 
the lower layers outside our control, what benefit would reattempting give us? 
I fail to see the gain right now.

To summarize, my belief is that in the presence of blocked calls the best 
course of action is to wait for them to unblock (of course without impacting 
other well-behaving calls), and not to retry, because according to my 
observations and the reports I've seen, retrying will just make things worse. 
The code reflects this belief.
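The take-exactly-once property described above can be sketched with a plain
Queue-based pool; all names here are hypothetical, not the actual
WatchedThreadPool API:

```python
import queue
import threading

def worker(tasks, done):
    # queue.Queue.get() hands each task to exactly one worker, exactly
    # once; if a task blocks, only this thread gets stuck, never a sibling.
    while True:
        task = tasks.get()          # exclusive take: no other worker sees it
        if task is None:            # sentinel: shut this worker down
            break
        done.append(task())         # the task is never re-queued or retried

tasks, done = queue.Queue(), []
threads = [threading.Thread(target=worker, args=(tasks, done))
           for _ in range(4)]
for t in threads:
    t.start()
for n in range(10):
    tasks.put(lambda n=n: n)
for t in threads:
    tasks.put(None)                 # one sentinel per worker
for t in threads:
    t.join()
assert sorted(done) == list(range(10))   # each task ran exactly once
```

A task that blocks inside `task()` simply pins its one worker; the queue never
hands that task out again, which is exactly the no-retry behavior argued for
above.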
Line 54:    request. Each task is consumed by the first worker thread which takes it.
Line 55:    Thus, at any given time at most one worker thread can get stuck
Line 56:    on an unresponsive task.
Line 57:    Periodic tasks are implemented by letting a worker thread produce a clone


Line 54:    request. Each task is consumed by the first worker thread which takes it.
Line 55:    Thus, at any given time at most one worker thread can get stuck
Line 56:    on an unresponsive task.
Line 57:    Periodic tasks are implemented by letting a worker thread produce a clone
Line 58:    of the task that was consumed and reinjecting it into the queue.
> We don't need periodic tasks. Lets simplify by removing implementing only w
I agree we do not need the concept of periodic tasks in the pool, but for 
sampling we do need to gather stats periodically. This is probably better 
handled elsewhere (e.g. by a separate scheduler thread); I need to think about 
this.
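For instance, the periodic reinjection could live in a tiny scheduler thread
outside the pool; a minimal sketch, with all names hypothetical:

```python
import queue
import threading
import time

def scheduler(tasks, interval, make_task, stop):
    # Instead of a worker cloning and reinjecting its own task, a small
    # scheduler thread enqueues a fresh instance every `interval` seconds;
    # the pool itself stays free of any notion of periodicity.
    while not stop.wait(interval):
        tasks.put(make_task())

tasks = queue.Queue()
stop = threading.Event()
t = threading.Thread(target=scheduler,
                     args=(tasks, 0.01, lambda: "sample-stats", stop))
t.start()
time.sleep(0.1)                      # let a few periods elapse
stop.set()
t.join()
assert tasks.qsize() >= 3            # several periodic tasks were enqueued
assert tasks.get() == "sample-stats"
```

This keeps the pool unaware of periodic work: if a "sample-stats" task gets
stuck, the scheduler keeps enqueueing fresh ones only if that is the policy we
want, and the decision lives in one place.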
Line 59: 
Line 60: 2. each worker thread is made capable of answering a couple of simple yet
Line 61:    important queries:
Line 62:    - what are you doing?


Line 59: 
Line 60: 2. each worker thread is made capable of answering a couple of simple yet
Line 61:    important queries:
Line 62:    - what are you doing?
Line 63:    - how long is this task taking?
> Nice, but the time it takes to run a task say nothing about task state. Som
I agree that just using wall-clock time is a poor man's way to guess that a 
task is stuck.

So I think it is better to make the tasks themselves capable of detecting 
whether they are stuck. Maybe a per-task timeout is already good enough (or a 
per-task-class one: e.g. storage, cpu, and so on).

instead of

worker.is_stuck()

something like

worker.task.is_stuck()
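A minimal sketch of that idea, assuming each task carries its own timeout (the
Task class here is hypothetical, not existing vdsm code):

```python
import time

class Task:
    """Each task carries its own budget (possibly shared per task class,
    e.g. storage vs cpu), so the watcher asks the task, not the raw clock,
    whether it is stuck."""

    def __init__(self, func, timeout):
        self.func = func
        self.timeout = timeout      # per-task (or per-task-class) budget
        self.started = None

    def __call__(self):
        self.started = time.monotonic()
        return self.func()

    def is_stuck(self):
        # Not started yet -> cannot be stuck; otherwise compare elapsed
        # time against this task's own budget.
        if self.started is None:
            return False
        return time.monotonic() - self.started > self.timeout

task = Task(lambda: 42, timeout=10.0)
assert not task.is_stuck()          # not started yet
assert task() == 42
assert not task.is_stuck()          # well within its 10s budget
```

With this shape, `worker.task.is_stuck()` falls out naturally: the worker just
exposes its current task, and the policy lives in the task (or task class).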
Line 64: 
Line 65: 3. a watcher thread is added to the pool.
Line 66:    the watcher thread does not do any real work, but instead periodically
Line 67:    checks the health of each active worker thread, and detects long running


Line 66:    the watcher thread does not do any real work, but instead periodically
Line 67:    checks the health of each active worker thread, and detects long running
Line 68:    tasks. Then the watcher can implement any stuck-detection policy.
Line 69:    The simplest one is to consider a thread 'stuck' if any task's processing
Line 70:    time exceeds a given threshold.
> The watcher thread should not be part of the pool, it is unrelated to the p
Currently the supervisor/watcher is part of the pool because it also replaces 
the stuck threads. If instead of replacing threads we cancel tasks, then yes, 
the supervisor can be simplified and moved outside the pool.
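As a sketch of what that supervisor loop could look like, assuming a
hypothetical pool API (`workers()`, `replace()`) and a per-task `is_stuck()`
check; the stub classes exist only to make the sketch self-contained:

```python
import threading
import time

class StubTask:
    def __init__(self, stuck):
        self._stuck = stuck
    def is_stuck(self):
        return self._stuck

class StubWorker:
    def __init__(self, task):
        self.current_task = task

class StubPool:
    def __init__(self, workers):
        self._workers = workers
        self.replaced = []
    def workers(self):
        return list(self._workers)
    def replace(self, worker):
        # In the real design: detach the sick worker into "limbo" and
        # spawn a fresh one, keeping the active workforce constant.
        self.replaced.append(worker)

def watcher(pool, check_interval, stop):
    # Supervisor loop: periodically ask each worker's current task
    # whether it is stuck, and have the pool replace sick workers.
    while not stop.wait(check_interval):
        for worker in pool.workers():
            task = worker.current_task
            if task is not None and task.is_stuck():
                pool.replace(worker)

pool = StubPool([StubWorker(StubTask(False)), StubWorker(StubTask(True))])
stop = threading.Event()
t = threading.Thread(target=watcher, args=(pool, 0.01, stop))
t.start()
time.sleep(0.05)
stop.set()
t.join()
assert pool.replaced                            # the stuck worker was caught
assert pool.replaced[0].current_task.is_stuck()
```

If the design shifts from replacing threads to cancelling tasks, only the body
of `replace()` changes, and this loop can indeed live outside the pool.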
Line 71: 
Line 72: 4. the worker threads detected as 'stuck' are transparently detached by the
Line 73:    pool and replaced.
Line 74:    This is transparently done by the watcher thread, which ensures the pool


Line 75:    has a constant active workforce. The threads detected as stuck are taken out
Line 76:    of the pool and put into a limbo until they eventually unblock.
Line 77:    The tasks detected as blocked are *not* automatically retried. This is
Line 78:    very important because, coupled with point #1, it avoids the lemmings effect
Line 79:    of more and more worker threads freezing while attempting a blocking task.
> I don't think we should remove stuck threads but instead cancel stuck tasks
See the answer to the first comment for remove vs cancel.

If we want to preserve the capability of replacing sick worker threads, then I 
feel that having a supervisor inside (or somehow attached to) the pool is 
cleaner.
Line 80: 
Line 81: With all of this in place, WatchedThreadPool can effectively minimize the waste
Line 82: of resources, and can deal with unresponsive tasks gracefully.
Line 83: 


-- 
To view, visit http://gerrit.ovirt.org/29190
To unsubscribe, visit http://gerrit.ovirt.org/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I400799e300f5d012dae5d158c4379fe57db1bd37
Gerrit-PatchSet: 4
Gerrit-Project: vdsm
Gerrit-Branch: master
Gerrit-Owner: Francesco Romani <[email protected]>
Gerrit-Reviewer: Antoni Segura Puimedon <[email protected]>
Gerrit-Reviewer: Dan Kenigsberg <[email protected]>
Gerrit-Reviewer: Dima Kuznetsov <[email protected]>
Gerrit-Reviewer: Federico Simoncelli <[email protected]>
Gerrit-Reviewer: Francesco Romani <[email protected]>
Gerrit-Reviewer: Michal Skrivanek <[email protected]>
Gerrit-Reviewer: Nir Soffer <[email protected]>
Gerrit-Reviewer: Saggi Mizrahi <[email protected]>
Gerrit-Reviewer: Vinzenz Feenstra <[email protected]>
Gerrit-Reviewer: Yaniv Bronhaim <[email protected]>
Gerrit-Reviewer: [email protected]
Gerrit-Reviewer: oVirt Jenkins CI Server
Gerrit-HasComments: Yes
_______________________________________________
vdsm-patches mailing list
[email protected]
https://lists.fedorahosted.org/mailman/listinfo/vdsm-patches