On Jan 8, 2009, at 10:48 AM, Rob Ross wrote:
Hey,
For the CALLBACK option, you would use that to have the individual
methods filling in things at the "generic BMI" layer (for lack of
the right terminology), but the overall user API would be the same?
I was thinking that the callback would get passed all the way down to
the method, and it would call the callback on completion of an
operation. We could keep the callback at the generic BMI level, and
call the callback for completed operations on return from a method's
testcontext. That does still avoid the multiplexing issues we see at
present, and it's less of a change to the BMI code overall, so maybe
that's the way to go.
The user API would have to change fairly significantly for callbacks,
because if a "callback context" were specified, completion would be
notified via the callback instead of as a list of completed
operations. For example, in our job code, instead of copying
completed BMI operations to the job completion list with each call to
BMI_testcontext, we would copy completed BMI operations to the job
completion list whenever the callback was called. This still doesn't
fix the issue for our metadata operations though, because completed
operations are just going to sit in the job completion queue while
we're calling BMI_testcontext (it still takes just as long to iterate
through all the methods). So we would need to modify the job
interfaces to take callbacks as well, and define a callback that
starts up the associated state machine. For our metadata operations,
this ends up being fairly invasive.
For I/O, it actually ends up being a win, because flow already uses
callbacks to bounce between BMI and trove operations.
A potential drawback to the callback idea is that synchronization
occurs on a per-operation basis, instead of once for potentially many
operations. A way around that would be to allow a callback that
accepts many completed operations instead of just one, although I
don't know if mutex locks are a real bottleneck for us anymore.
I don't think that the CONTEXT option is appropriate. I don't want
to expose the specifics of the underlying networks any more than we
have already.
There should be relevant research in the MPI space related to the
POLL_PLAN option.
Do we consider this to be a problem for both clients and servers, or
is it really a server-specific issue? If this is something we think
will be solely (or mostly) a server thing, we could consider throwing a
thread at the issue. One option might be to kick off a thread to
wait on the TCP side of things, since the kernel is doing most of
the work for us anyway, and put completed TCP events into the
completion list asynchronously (for servers only)?
I think the problem has been raised only on clients, but it exists on
both the server and clients.
Maybe I'm just missing some details, but I don't think a tcp thread
will help us on its own; at the least it needs to be combined with the
POLL_PLAN or CALLBACK option. The tcp testcontext call will sleep
(epoll_wait) up to the timeout passed in if there are no completed
operations and no
work to be done. With a thread, we would just have tcp testcontext
return immediately even if nothing was in the completion list. But
that means that a tcp-only scenario will cause the BMI_testcontext
calls on the client to spin and peg the CPU. We could add in a
condition variable, but then we're right back where we started. I
think an appropriate POLL_PLAN option could adjust timeouts to the tcp
testcontext call, but it requires a lot more smarts in the code to get
that right in general, whereas the callback option just allows you to
get completion right away.
-sam
Rob
On Jan 7, 2009, at 4:06 PM, Sam Lang wrote:
Hi All,
Right now if multiple methods are enabled in BMI, we tend to get
poor performance from the "fast" network, because BMI_testcontext
iterates through all the active methods calling testcontext for
each one. It tries to be smart about which methods get
scheduled ;-) to prevent starvation, but it treats all the methods
fairly, which tends to make tcp (the slow one) hog the time spent
in testcontext. I have a few ideas for this, so I'll go ahead and
propose them and let you all shoot them down or propose others.
Option CALLBACK: Instead of returning completion as a list in
testcontext, we allow a BMI context to be constructed with a
callback, and on completion of operations, the callback is called.
This allows each method to drive its own operations, and notify the
consumer of completion immediately. There would still need to be a
testcontext call for methods that only service operations during
that call. The changes might not be that significant: the
BMI_open_context call could just take an extra parameter for the
callback function. If the parameter is NULL, we just use the
completion list as before.
Option CONTEXT: Require separate contexts for separate methods.
This pushes the problem up to the application, probably not where
it belongs, since active methods are opaque from the BMI api.
Option POLL_PLAN: Modify the construct_poll_plan function in bmi
that already tries to be fair, so that it's aware of the performance
discrepancy between methods. Maybe it could just skip tcp every
other time, for example. This is probably the easiest option, since
it doesn't require API changes and the like.
-sam
_______________________________________________
Pvfs2-developers mailing list
Pvfs2-developers@beowulf-underground.org
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-developers