Nir Soffer has posted comments on this change.

Change subject: threadpool: docs: add design notes
......................................................................


Patch Set 4:

(5 comments)

http://gerrit.ovirt.org/#/c/29190/4/lib/threadpool/README.rst
File lib/threadpool/README.rst:

Line 49: 
Line 50: The countermeasures implemented in WatchedThreadPool are the following.
Line 51: 
Line 52: 1. each worker takes a task exclusively.
Line 53:    each task must be taken exactly by one worker and exactly once for 
each
Why do you care about the task used only once? the thread pool should care 
about task reusing. The task should simply provide the interface that allow the 
worker to run it. If a task can run multiple times, thats the task creator 
concern.
Line 54:    request. Each task is consumed by the first worker thread which 
takes it.
Line 55:    Thus, at any given time at most one worker thread can get stuck
Line 56:    on an unresponsive task.
Line 57:    Periodic task are implemented by letting a worker thread to produce 
a clone


Line 54:    request. Each task is consumed by the first worker thread which 
takes it.
Line 55:    Thus, at any given time at most one worker thread can get stuck
Line 56:    on an unresponsive task.
Line 57:    Periodic task are implemented by letting a worker thread to produce 
a clone
Line 58:    of the task that was consumed and reinjecting it into the queue.
We don't need periodic tasks. Lets simplify by removing implementing only what 
we need.
Line 59: 
Line 60: 2. each worker thread is made capable to answer to a couple of simple 
yet
Line 61:    important queries:
Line 62:    - what are you doing?


Line 59: 
Line 60: 2. each worker thread is made capable to answer to a couple of simple 
yet
Line 61:    important queries:
Line 62:    - what are you doing?
Line 63:    - how long is this task taking?
Nice, but the time it takes to run a task say nothing about task state. Some 
task takes lot of time (moving storage etc.). So the real questions is "are you 
stuck?" and only the task can answer this.
Line 64: 
Line 65: 3. a watcher thread is added to the pool.
Line 66:    the watcher thread does not do any real work, but instead 
periodically
Line 67:    checks the health of each active worker thread, and detects long 
running


Line 66:    the watcher thread does not do any real work, but instead 
periodically
Line 67:    checks the health of each active worker thread, and detects long 
running
Line 68:    tasks. Then the watcher can implement any stuck-detection policy.
Line 69:    The simplest one is to consider a thread 'stuck' if any task 
elaboration
Line 70:    time exceeds a given threshold.
The watcher thread should not be part of the pool, it is unrelated to the pool. 
It should detect sick tasks, not long running tasks. Only the task can detect 
this so the watcher thread does not have to implement a detection policy.
Line 71: 
Line 72: 4. the worker threads detected as 'stuck' are transparently detached 
by the
Line 73:    pool and replaced.
Line 74:    This is transparently done by the watcher thread, which ensures the 
pool


Line 75:    has a constant active workforce. The threads detected as stuck are 
taken out
Line 76:    the pool and put into a limbo until they eventually unblock.
Line 77:    The task detected as blocked are *not* automatically retried. This 
is
Line 78:    very important because, coupled with point #1, avoids the lemmings 
effect
Line 79:    of more and more worker threads frozen attempting to do a blocking 
task.
I don't think we should remove stuck threads but instead cancel stuck tasks.

Any task doing io can be killed by closing the file descriptor it uses. Task 
doing cpu bound stuff should check if it was canceled from time to time.

We need to check if closing a file descriptor does wake up a thread reading 
from it when the thread enter D state.

Removing stuck thread should be last resort if we cannot cancel the task in 
another way.
Line 80: 
Line 81: With all in place, WatchedThreadPool can effectively minimize the waste
Line 82: of resources, and can deal with unresponsive tasks gracefully.
Line 83: 


-- 
To view, visit http://gerrit.ovirt.org/29190
To unsubscribe, visit http://gerrit.ovirt.org/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I400799e300f5d012dae5d158c4379fe57db1bd37
Gerrit-PatchSet: 4
Gerrit-Project: vdsm
Gerrit-Branch: master
Gerrit-Owner: Francesco Romani <[email protected]>
Gerrit-Reviewer: Antoni Segura Puimedon <[email protected]>
Gerrit-Reviewer: Dan Kenigsberg <[email protected]>
Gerrit-Reviewer: Dima Kuznetsov <[email protected]>
Gerrit-Reviewer: Federico Simoncelli <[email protected]>
Gerrit-Reviewer: Francesco Romani <[email protected]>
Gerrit-Reviewer: Michal Skrivanek <[email protected]>
Gerrit-Reviewer: Nir Soffer <[email protected]>
Gerrit-Reviewer: Saggi Mizrahi <[email protected]>
Gerrit-Reviewer: Vinzenz Feenstra <[email protected]>
Gerrit-Reviewer: Yaniv Bronhaim <[email protected]>
Gerrit-Reviewer: [email protected]
Gerrit-Reviewer: oVirt Jenkins CI Server
Gerrit-HasComments: Yes
_______________________________________________
vdsm-patches mailing list
[email protected]
https://lists.fedorahosted.org/mailman/listinfo/vdsm-patches

Reply via email to