On Jul 27, 2006, at 10:16 AM, Phil Carns wrote:


Hmm...I had been thinking about a flow implementation that used the new concurrent state machine code...it sounds like that's a bad idea because the testing and restarting would take too long to switch between bmi and trove? We use the post/test model through pvfs2 though, so maybe I don't understand the issue.

I don't think that is a bad idea. There were really two separate but related problems in one of the older flow protocol implementations. I can try to describe them a little more here if I can remember:

- explicitly tracking and testing each trove and bmi operation: It basically kept arrays that listed pending trove and bmi ops, and would call testsome() to service them. This was a problem because of the time it took to keep running up and down those arrays (when building them at the flow level, or when testing them at the trove/bmi level). The solution is to just use testcontext() and let trove/bmi tell you when something finishes without managing extra state.

- thread switch time: the architecture here was set up at one time to have one thread pushing the test functions for bmi, another thread pushing the test functions for trove, and a third thread processing the flow and posting new operations. The problem here is that (at the time) it took too long to jump between the "pushing" threads and the "processing" thread when an operation finished that should trigger progress on the flow. This led to the thread-mgr.c code and associated callbacks. The callbacks actually drive the flow progress and post new operations. That means that the same thread that pushes testcontext() gets to trigger the next post, without waiting on the latency of waking up a different thread to do something (using a condition variable, etc.). I managed to reuse the thread-mgr for the job code as well, so that one testcontext() call triggers callbacks to both the job and flow interfaces.

I don't think either of the above issues precludes different flow protocol implementations, and they are really kind of orthogonal to whether state machines are used or not. The first issue is solved just by using testcontext() rather than manually tracking operations.
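To make the first issue concrete, here is a rough sketch of the difference in C. Everything in it (toy_context, toy_testsome(), toy_testcontext(), and so on) is a hypothetical stand-in written for illustration and is not the real BMI or Trove interface; it only assumes that a testsome-style call scans a caller-supplied list of ids while a testcontext-style call drains whatever the lower layer has queued as finished.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for illustration; not the real BMI/Trove API. */
#define MAX_OPS 64

typedef struct {
    int done[MAX_OPS];       /* completion flag per op id */
    int completed[MAX_OPS];  /* finished op ids, in completion order */
    size_t head, tail;       /* drain position and fill position */
} toy_context;

/* Called by the "lower layer" when an operation finishes. */
static void toy_complete(toy_context *ctx, int op_id)
{
    assert(op_id >= 0 && op_id < MAX_OPS);
    assert(ctx->tail < MAX_OPS);
    ctx->done[op_id] = 1;
    ctx->completed[ctx->tail++] = op_id;
}

/* testsome-style: the caller keeps an array of pending ids, and every
 * call walks the whole array. *scanned counts the entries examined. */
static size_t toy_testsome(const toy_context *ctx, const int *pending,
                           size_t npending, int *out, size_t *scanned)
{
    size_t n = 0;
    for (size_t i = 0; i < npending; i++) {
        (*scanned)++;
        if (ctx->done[pending[i]])
            out[n++] = pending[i];
    }
    return n;
}

/* testcontext-style: just drain what has finished; the caller keeps
 * no list of its own. */
static size_t toy_testcontext(toy_context *ctx, int *out, size_t max_out)
{
    size_t n = 0;
    while (ctx->head < ctx->tail && n < max_out)
        out[n++] = ctx->completed[ctx->head++];
    return n;
}

/* 32 operations pending, exactly one of them finishes. Returns how many
 * array entries the testsome-style call had to look at. */
static size_t demo_scan_cost(void)
{
    toy_context ctx = {0};
    int pending[32], out[32];
    size_t scanned = 0;

    for (int i = 0; i < 32; i++)
        pending[i] = i;
    toy_complete(&ctx, 17);

    size_t n = toy_testsome(&ctx, pending, 32, out, &scanned);
    assert(n == 1 && out[0] == 17);

    n = toy_testcontext(&ctx, out, 32);
    assert(n == 1 && out[0] == 17);
    return scanned;
}
```

In this toy, the testsome-style call examines every pending entry to find the one that finished, while the testcontext-style call does work proportional only to the number of completions.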

The second issue could be solved in a variety of ways, some of which may be better than what we have now. The callback approach is efficient enough, but is hard to debug. Of course it is also possible that the thread switch (i.e. condition signal) latency is low enough nowadays that you don't even need to worry about it anymore. I last looked at this problem before NPTL arrived on the scene.

At any rate I think a state machine based flow protocol could dodge issue #2 by either:
- lucking out with a faster modern thread implementation
- being smarter about how thread work is divided up
- using callbacks as we do now, and making the state machine mechanism thread safe so that it can be driven directly from those callbacks rather than from a testcontext() work loop
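As a rough illustration of the third option, the sketch below shows a toy state machine whose transitions are guarded by a mutex so that a completion callback can advance it directly from whichever thread noticed the completion. The names (toy_flow_sm, toy_flow_on_complete(), and so on) are made up for this example and are not the PVFS2 state machine or thread-mgr interfaces, and "posting" an operation is reduced to incrementing a counter. The driver at the bottom is single-threaded; it only shows the callback path, not a real multi-threaded run.

```c
#include <assert.h>
#include <pthread.h>

/* Hypothetical toy types for illustration; not the PVFS2 SM API. */
enum toy_state { TOY_READ, TOY_SEND, TOY_DONE };

typedef struct {
    pthread_mutex_t lock;  /* serializes transitions across threads */
    enum toy_state state;  /* operation currently in flight */
    int chunks_left;       /* chunks that still have to be moved */
    int posts;             /* operations "posted" so far */
} toy_flow_sm;

static void toy_flow_init(toy_flow_sm *sm, int chunks)
{
    pthread_mutex_init(&sm->lock, NULL);
    sm->state = chunks > 0 ? TOY_READ : TOY_DONE;
    sm->chunks_left = chunks;
    sm->posts = chunks > 0 ? 1 : 0;  /* the first read is posted up front */
}

/* Completion callback: runs in the thread that detected the completion,
 * so the next post does not wait for another thread to be woken. */
static void toy_flow_on_complete(void *user_ptr)
{
    toy_flow_sm *sm = user_ptr;

    pthread_mutex_lock(&sm->lock);
    switch (sm->state) {
    case TOY_READ:               /* read finished: post the send */
        sm->state = TOY_SEND;
        sm->posts++;
        break;
    case TOY_SEND:               /* send finished: next chunk, or done */
        if (--sm->chunks_left > 0) {
            sm->state = TOY_READ;
            sm->posts++;
        } else {
            sm->state = TOY_DONE;
        }
        break;
    case TOY_DONE:               /* stray completion: nothing to do */
        break;
    }
    pthread_mutex_unlock(&sm->lock);
}

/* Single-threaded driver: pretend each posted operation completes at
 * once and fires its callback. Returns the number of operations posted. */
static int toy_flow_run(int chunks)
{
    toy_flow_sm sm;
    int callbacks = 0;

    toy_flow_init(&sm, chunks);
    while (sm.state != TOY_DONE) {
        toy_flow_on_complete(&sm);
        callbacks++;
    }
    assert(callbacks == sm.posts);  /* one callback per posted operation */
    pthread_mutex_destroy(&sm.lock);
    return sm.posts;
}
```

The point of the lock is only that two "pushing" threads could both call toy_flow_on_complete() for the same flow; whether a single mutex per state machine would be cheap enough in practice is an open question this sketch does not answer.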

On a related note, it is important to remember that trove has its own internal thread as well, so on the trove push side (depending on your design) you could have to worry about a chain of 2 threads that have to be woken up to get something done at completion time. The trove part of that chain can't be avoided without changing the API.

Sorry about the tangent here, but I figured I may as well share some warnings about things to look out for here. I think it would be good to have a cleaner flow protocol implementation.


Thanks for the detailed explanation Phil. I hadn't thought about the context switches that might slow down flow. I was primarily thinking of something that would be cleaner, and easier to modify and test for different scenarios. If at some point I get around to playing with a flow impl that uses the concurrent state machine framework, I'll open up the discussion again to avoid any of the pitfalls you described.

-sam

I think I'm lost now. What do you mean by replace? The states are still isolated, jobs trigger the transitions, only one state action gets executed at a time, there still may be a time gap between completion of any given child and when the parent picks up processing again, and there are still frames. I think both approaches will look the same when running unless I missed something. If Walt puts a longjmp() in there we can both hit him over the head.

Heh.  Don't give him ideas! ;-)
I was operating under the constraint that a state machine can only post a job for itself. If I understand the current plan correctly, using job_null in the child state machine to post a job for the parent breaks that constraint, and so in some sense is a replace (the job_null actually takes the parent smcb pointer). I think you're probably right that it's not a big difference either way; it's just cleaner in my head to only have state machines posting jobs for themselves.

I see what you are saying. I guess it depends on how you look at it. I had kind of started thinking of the jobs as a signalling mechanism since they are the construct that "signals" a state machine to make its next transition. The job_null() approach just makes it so that a child state machine is what triggers this particular signal, rather than a bmi/trove/dev/req_sched/flow component. I know this is a change in the model and adds a dependency that wasn't previously there, but at least job_null() is just a few dozen lines of code. If someone reuses the SM code elsewhere, I would guess that is one of the more minor worries considering that they would need a whole new mechanism (other than the job api) to motivate all of the transitions anyway.
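A very rough sketch of the "job as signal" view, in C. The types and functions here (toy_smcb, toy_job_null(), toy_drive()) are invented for illustration and do not reflect the actual job_null() signature or the real smcb layout; the only thing taken from the discussion is that the child posts a null job that carries the parent's smcb pointer, and that the parent advances later when the work loop picks that job up.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical toy types for illustration; not the PVFS2 job/SM API. */
#define TOY_QUEUE_MAX 16

typedef struct toy_smcb {
    struct toy_smcb *parent;  /* NULL for a top-level state machine */
    int children_running;     /* children this machine is waiting on */
    int finished;             /* set once the machine has completed */
} toy_smcb;

typedef struct {
    toy_smcb *target[TOY_QUEUE_MAX];  /* machines to signal, in order */
    size_t head, tail;
} toy_job_queue;

/* A job with no I/O behind it: it is complete as soon as it is posted
 * and only records which state machine should be signalled. */
static void toy_job_null(toy_job_queue *q, toy_smcb *target)
{
    assert(q->tail < TOY_QUEUE_MAX);
    q->target[q->tail++] = target;
}

/* Final state action of a child: instead of a bmi/trove component
 * producing the signal, the child posts the job for its parent. */
static void toy_child_finish(toy_job_queue *q, toy_smcb *child)
{
    child->finished = 1;
    if (child->parent)
        toy_job_null(q, child->parent);
}

/* Work loop: each completed job triggers one transition on its target,
 * one at a time. Returns the number of transitions that ran. */
static int toy_drive(toy_job_queue *q)
{
    int transitions = 0;
    while (q->head < q->tail) {
        toy_smcb *sm = q->target[q->head++];
        if (--sm->children_running == 0)
            sm->finished = 1;  /* last child reported: parent moves on */
        transitions++;
    }
    return transitions;
}

/* Parent with two children. Returns 1 if the parent finished. */
static int toy_demo_parent_resumes(void)
{
    toy_job_queue q = { {NULL}, 0, 0 };
    toy_smcb parent = { NULL, 2, 0 };
    toy_smcb a = { &parent, 0, 0 };
    toy_smcb b = { &parent, 0, 0 };

    toy_child_finish(&q, &a);
    assert(!parent.finished);  /* gap: child is done, parent not yet run */
    toy_child_finish(&q, &b);

    int n = toy_drive(&q);
    assert(n == 2);
    return parent.finished;
}
```

Note that the parent is never touched directly by the child in this toy; the child only enqueues a signal, and the parent's state changes when the work loop processes it.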

Walt probably got more discussion than he bargained for, but at the least, lively discussion keeps me awake in the afternoon ;-).

Heh- same here :)

-Phil


_______________________________________________
Pvfs2-developers mailing list
Pvfs2-developers@beowulf-underground.org
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-developers
