On Jul 27, 2006, at 10:16 AM, Phil Carns wrote:
Hmm...I had been thinking about a flow implementation that used
the new concurrent state machine code...it sounds like that's a
bad idea because the testing and restarting would take too long
to switch between bmi and trove? We use the post/test model
through pvfs2 though, so maybe I don't understand the issue.
I don't think that is a bad idea. There were really two separate but
related problems in one of the older flow protocol implementations;
I can try to describe them a little more here if I can remember:
- explicitly tracking and testing each trove and bmi operation: It
basically kept arrays that listed pending trove and bmi ops, and
would call testsome() to service them. This was a problem because
of the time it took to keep running up and down those arrays (when
building them at the flow level, or when testing them at the trove/
bmi level). The solution is to just use testcontext() and let
trove/bmi tell you when something finishes, without managing extra
state.
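The difference between the two models can be sketched with a toy completion queue (all names here are hypothetical stand-ins, not the real BMI/Trove API): a testsome()-style caller scans its own array of pending ids on every call, while a testcontext()-style caller just drains whatever the interface says has finished, keeping no per-op state of its own.

```c
#include <assert.h>

#define MAX_OPS 64

static int pending[MAX_OPS];     /* testsome-style: caller-managed id array */
static int npending = 0;

static int done_queue[MAX_OPS];  /* completions recorded by the interface */
static int ndone = 0;

/* post an operation; the testsome model also appends to the caller's array */
static void post_op(int id)
{
    pending[npending++] = id;
}

/* the interface completes an op internally */
static void complete_op(int id)
{
    done_queue[ndone++] = id;
}

/* testsome-style: a scan over the pending array on every call, plus
 * compaction work whenever something finishes */
static int testsome_scan(int *out_ids, int max)
{
    int i, j, found = 0;
    for (i = 0; i < npending && found < max; i++) {
        for (j = 0; j < ndone; j++) {
            if (pending[i] == done_queue[j]) {
                out_ids[found++] = pending[i];
                pending[i--] = pending[--npending];    /* compact caller array */
                done_queue[j] = done_queue[--ndone];   /* compact done queue */
                break;
            }
        }
    }
    return found;
}

/* testcontext-style: the interface hands back whatever finished; the
 * caller keeps no per-op arrays at all */
static int testcontext_drain(int *out_ids, int max)
{
    int n = 0;
    while (ndone > 0 && n < max)
        out_ids[n++] = done_queue[--ndone];
    return n;
}
```

The point of the sketch is the per-call cost: testsome_scan does work proportional to everything pending, while testcontext_drain only touches what actually completed.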
- thread switch time: the architecture here was set up at one time
to have one thread pushing the test functions for bmi, another
thread pushing the test functions for trove, while another thread
was processing the flow and posting new operations. The problem
here is that it (at the time) took too long to jump between the
"pushing" threads and the "processing" thread when an operation
finished that should trigger progress on the flow. This led to the
thread-mgr.c code and associated callbacks. The callbacks actually
drive the flow progress and post new operations. That means that
the same thread that pushes testcontext() gets to trigger the next
post, without waiting on the latency of waking up a different
thread to do something (using condition variable etc.). I managed
to reuse the thread-mgr for the job code as well, so that one
testcontext() call triggers callbacks to both the job and flow
interfaces.
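A minimal sketch of the callback idea (hypothetical types; the real thread-mgr.c is more involved): the thread that sweeps completions invokes each callback in-line, and the callback itself posts the next operation, so no condition-variable handoff to a second thread is needed.

```c
#include <assert.h>

typedef void (*completion_cb)(void *user_ptr);

struct op {
    completion_cb cb;
    void *user_ptr;
};

#define MAX 16
static struct op completed[MAX];
static int ncompleted = 0;

static int posts_made = 0;

/* the interface records a completion along with the callback to fire */
static void op_complete(completion_cb cb, void *user_ptr)
{
    completed[ncompleted].cb = cb;
    completed[ncompleted].user_ptr = user_ptr;
    ncompleted++;
}

/* a flow callback: drives progress and posts the next op in-line */
static void flow_callback(void *user_ptr)
{
    int *remaining = user_ptr;
    if (--(*remaining) > 0)
        posts_made++;     /* stand-in for posting the next bmi/trove op */
}

/* the single "pushing" thread: one testcontext-style sweep fires every
 * callback directly, with no condition-variable wakeup of another thread */
static void push_progress(void)
{
    int i, n = ncompleted;
    ncompleted = 0;
    for (i = 0; i < n; i++)
        completed[i].cb(completed[i].user_ptr);
}
```

The design trade Phil describes shows up here too: the posting logic now runs in the pushing thread's context, which avoids the wakeup latency but makes the call path harder to follow when debugging.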
I don't think either of the above issues precludes different flow
protocol implementations, and they are really kind of orthogonal to
whether state machines are used or not. The first issue is solved
just by using testcontext() rather than manually tracking operations.
The second issue could be solved in a variety of ways, some of
which may be better than what we have now. The callback approach
is efficient enough, but is hard to debug. Of course it is also
possible that the thread switch (i.e. condition signal) latency is
low enough nowadays that you don't even need to worry about it
anymore. I last looked at this problem before NPTL arrived on the
scene.
At any rate I think a state machine based flow protocol could dodge
issue #2 by either:
- lucking out with a faster modern thread implementation
- being smarter about how thread work is divided up
- using callbacks as we do now, and making the state machine
mechanism thread safe so that it can be driven directly from those
callbacks rather than from a testcontext() work loop
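The third option above might look something like this (hypothetical names, POSIX threads assumed): the transition path is guarded by a mutex so that a completion callback can safely drive the machine from whichever thread happens to be running the testcontext() sweep.

```c
#include <assert.h>
#include <pthread.h>

enum flow_state { FLOW_SETUP, FLOW_XFER, FLOW_DONE };

struct flow_sm {
    pthread_mutex_t lock;   /* serializes transitions across threads */
    enum flow_state state;
};

static void flow_sm_init(struct flow_sm *sm)
{
    pthread_mutex_init(&sm->lock, NULL);
    sm->state = FLOW_SETUP;
}

/* safe to call directly from a completion callback: only one thread at
 * a time advances the machine, so state actions stay isolated */
static void flow_sm_transition(struct flow_sm *sm)
{
    pthread_mutex_lock(&sm->lock);
    switch (sm->state) {
    case FLOW_SETUP: sm->state = FLOW_XFER; break;
    case FLOW_XFER:  sm->state = FLOW_DONE; break;
    case FLOW_DONE:  break;   /* terminal state: no further transitions */
    }
    pthread_mutex_unlock(&sm->lock);
}
```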
On a related note, it is important to remember that trove has its
own internal thread as well, so on the trove push side (depending on
your design) you could have to worry about a chain of two threads
that have to be woken up to get something done at completion time.
The trove part of that chain can't be avoided without changing the
API.
Sorry about the tangent here, but I figured I may as well share
some warnings about things to look out for here. I think it would
be good to have a cleaner flow protocol implementation.
Thanks for the detailed explanation Phil. I hadn't thought about the
context switches that might slow down flow. I was primarily thinking
of something that would be cleaner, and easier to modify and test for
different scenarios. If at some point I get around to playing with a
flow impl that uses the concurrent state machine framework, I'll open
up the discussion again to avoid any of the pitfalls you described.
-sam
I think I'm lost now. What do you mean by replace? The states
are still isolated, jobs trigger the transitions, only one state
action gets executed at a time, there still may be a time gap
between completion of any given child and when the parent picks
up processing again, and there are still frames. I think both
approaches will look the same when running unless I missed
something. If Walt puts a longjmp() in there we can both hit
him over the head.
Heh. Don't give him ideas! ;-)
I was operating under the constraint that a state machine can
only post a job for itself. If I understand the current plan
correctly, using job_null in the child state machine to post a
job for the parent breaks that constraint, and so in some sense
is a replace (the job_null actually takes the parent smcb
pointer). I think you're probably right that it's not a big
difference either way; it's just cleaner in my head to only have
state machines posting jobs for themselves.
I see what you are saying. I guess it depends on how you look at
it. I had kind of started thinking of the jobs as a signalling
mechanism, since they are the construct that "signals" a state
machine to make its next transition. The job_null() approach just
makes it so that a child state machine is what triggers this
particular signal, rather than a bmi/trove/dev/req_sched/flow
component. I know this is a change in the model and adds a
dependency that wasn't previously there, but at least job_null() is
just a few dozen lines of code. If someone reuses the SM code
elsewhere, I would guess that is one of the more minor worries
considering that they would need a whole new mechanism (other than
the job api) to motivate all of the transitions anyway.
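A toy sketch of the job_null() signalling idea (hypothetical structures; not the real smcb layout or job API): the child's final action posts a no-op job carrying the parent's smcb, and the ordinary job completion sweep then drives the parent's next transition, just as a bmi/trove/dev completion would.

```c
#include <assert.h>

struct smcb {
    const char *name;
    int transitions;   /* count of transitions driven by job completions */
};

#define MAX_JOBS 8
static struct smcb *job_queue[MAX_JOBS];
static int njobs = 0;

/* job_null: a job that completes immediately; its only effect is to
 * signal the state machine named in user_ptr */
static void job_null(struct smcb *user_ptr)
{
    job_queue[njobs++] = user_ptr;
}

/* job completion sweep: each completed job triggers one transition on
 * the smcb it carries, whoever posted it */
static void job_testcontext(void)
{
    int i, n = njobs;
    njobs = 0;
    for (i = 0; i < n; i++)
        job_queue[i]->transitions++;
}

/* a child machine's final action: signal the parent, not itself */
static void child_final_action(struct smcb *parent)
{
    job_null(parent);
}
```

This is the dependency Phil concedes: the child now reaches through the job interface to touch the parent's smcb, but the signalling path is otherwise identical to any other job completion.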
Walt probably got more discussion than he bargained for, but at
least lively discussion keeps me awake in the afternoon ;-).
Heh- same here :)
-Phil
_______________________________________________
Pvfs2-developers mailing list
Pvfs2-developers@beowulf-underground.org
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-developers