On Jan 8, 2009, at 10:48 AM, Rob Ross wrote:

Hey,

For the CALLBACK option, you would use that to have the individual methods fill things in at the "generic BMI" layer (for lack of the right terminology), but the overall user API would be the same?

I was thinking that the callback would get passed all the way down to the method, and the method would call the callback on completion of an operation. We could keep the callback at the generic BMI level instead, and call it for completed operations on return from a method's testcontext. That still avoids the multiplexing issues we see at present, and it's less of a change to the BMI code overall, so maybe that's the way to go.
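Concretely, something like this sketch is what I have in mind at the generic layer (types from bmi.h; the function and struct names are made up):

#include "bmi.h"   /* bmi_op_id_t, bmi_error_code_t, bmi_size_t */

/* Sketch only: on return from a method's testcontext, hand each
 * completed operation straight to the consumer's callback instead of
 * queuing it on a completion list.  bmi_callback_fn, completed_op,
 * and dispatch_completions are invented names. */
typedef void (*bmi_callback_fn)(bmi_op_id_t op_id,
                                bmi_error_code_t error,
                                bmi_size_t actual_size,
                                void *user_ptr);

struct completed_op
{
    bmi_op_id_t op_id;
    bmi_error_code_t error;
    bmi_size_t actual_size;
    void *user_ptr;
};

static void dispatch_completions(struct completed_op *done, int count,
                                 bmi_callback_fn cb)
{
    int i;
    for (i = 0; i < count; i++)
    {
        cb(done[i].op_id, done[i].error,
           done[i].actual_size, done[i].user_ptr);
    }
}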

The user API would have to change fairly significantly for callbacks, because if a "callback context" were specified, completion would be notified via the callback instead of as a list of completed operations. For example, in our job code, instead of copying completed BMI operations to the job completion list with each call to BMI_testcontext, we would copy them to the job completion list whenever the callback was called. This still doesn't fix the issue for our metadata operations, though, because completed operations would just sit in the job completion queue while we're calling BMI_testcontext (it still takes just as long to iterate through all the methods). So we would need to modify the job interfaces to take callbacks as well, and define a callback that starts up the associated state machine. For our metadata operations, this ends up being fairly invasive.
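For the job layer, the callback we'd register might look roughly like this (a sketch; job_desc and state_machine_continue are placeholder names, not our actual symbols):

#include "bmi.h"   /* bmi_op_id_t, bmi_error_code_t, bmi_size_t */

/* Sketch only: a job-layer BMI callback that records the result and
 * kicks the waiting state machine directly, rather than waiting for
 * the next BMI_testcontext sweep.  job_desc and
 * state_machine_continue are invented names. */
struct job_desc
{
    bmi_error_code_t error;
    bmi_size_t actual_size;
};

extern void state_machine_continue(struct job_desc *jd);

static void job_bmi_callback(bmi_op_id_t op_id,
                             bmi_error_code_t error,
                             bmi_size_t actual_size,
                             void *user_ptr)
{
    /* the job descriptor was stashed as user_ptr when the op was posted */
    struct job_desc *jd = user_ptr;
    (void)op_id;
    jd->error = error;
    jd->actual_size = actual_size;
    state_machine_continue(jd);
}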

For I/O, it actually ends up being a win, because flow already uses callbacks to bounce between BMI and trove operations.

A potential drawback to the callback idea is that synchronization occurs on a per-operation basis instead of for potentially many operations at once. A way around that would be to require a callback that could take many completed operations instead of just one, although I don't know if mutex locks are a real bottleneck for us anymore.
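For example, a batched callback would let the consumer take its lock once per sweep instead of once per operation (a sketch; every name below is invented):

#include <pthread.h>
#include "bmi.h"   /* bmi_op_id_t, bmi_error_code_t, bmi_size_t */

/* Sketch only: a batched completion callback, invoked once per
 * testcontext sweep so the consumer synchronizes once for the whole
 * batch.  All names here are invented. */
typedef void (*bmi_batch_callback_fn)(int count,
                                      bmi_op_id_t *op_ids,
                                      bmi_error_code_t *errors,
                                      bmi_size_t *sizes,
                                      void **user_ptrs);

extern void enqueue_completion(bmi_op_id_t op_id, bmi_error_code_t error,
                               bmi_size_t size, void *user_ptr);

static pthread_mutex_t completion_mutex = PTHREAD_MUTEX_INITIALIZER;

static void job_batch_callback(int count, bmi_op_id_t *op_ids,
                               bmi_error_code_t *errors,
                               bmi_size_t *sizes, void **user_ptrs)
{
    int i;
    pthread_mutex_lock(&completion_mutex);   /* one lock per batch */
    for (i = 0; i < count; i++)
    {
        enqueue_completion(op_ids[i], errors[i], sizes[i], user_ptrs[i]);
    }
    pthread_mutex_unlock(&completion_mutex);
}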



I don't think that the CONTEXT option is appropriate. I don't want to expose the specifics of the underlying networks any more than we have already.

There should be relevant research in the MPI space related to the POLL_PLAN option.

Do we consider this to be a problem for both clients and servers, or is it really a server-specific issue? If this is something we think will be solely (or mostly) a server thing, we could consider throwing a thread at the issue. One option might be to kick off a thread to wait on the TCP side of things, since the kernel is doing most of the work for us anyway, and put completed TCP events into the completion list asynchronously (for servers only)?

I think the problem has only been raised on clients, but it exists on both clients and servers.

Maybe I'm just missing some details, but I don't think a tcp thread will help us, or at least it needs to be combined with the POLL_PLAN or CALLBACK option. The tcp testcontext call sleeps (in epoll_wait) for up to the timeout passed in if there are no completed operations and no work to be done. With a thread, we would just have tcp testcontext return immediately even if nothing was in the completion list, which means a tcp-only scenario will cause the BMI_testcontext calls on the client to spin and peg the cpu. We could add in a condition variable, but then we're right back where we started. I think an appropriate POLL_PLAN option could adjust the timeouts passed to the tcp testcontext call, but getting that right in general requires a lot more smarts in the code, whereas the callback option just lets you get completion notification right away.
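To spell out the condition variable point (a sketch; all names invented):

#include <pthread.h>

/* Sketch only: if the tcp thread posts completions and signals a
 * condition variable, the generic testcontext path still has to
 * block on that condvar, which is the same sleep we had inside
 * epoll_wait to begin with. */
static pthread_mutex_t comp_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t comp_cond = PTHREAD_COND_INITIALIZER;
static int comp_count = 0;

/* tcp thread, after epoll_wait reports n completed operations: */
static void post_tcp_completions(int n)
{
    pthread_mutex_lock(&comp_lock);
    comp_count += n;
    pthread_cond_broadcast(&comp_cond);
    pthread_mutex_unlock(&comp_lock);
}

/* generic BMI_testcontext path: */
static int wait_for_completions(void)
{
    int n;
    pthread_mutex_lock(&comp_lock);
    while (comp_count == 0)
    {
        /* blocking here is equivalent to sleeping in epoll_wait */
        pthread_cond_wait(&comp_cond, &comp_lock);
    }
    n = comp_count;
    comp_count = 0;
    pthread_mutex_unlock(&comp_lock);
    return n;
}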

-sam



Rob

On Jan 7, 2009, at 4:06 PM, Sam Lang wrote:


Hi All,

Right now, if multiple methods are enabled in BMI, we tend to get poor performance from the "fast" network, because BMI_testcontext iterates through all the active methods, calling testcontext for each one. It tries to be smart about which methods get scheduled ;-) to prevent starvation, but it treats all the methods fairly, which tends to let tcp (the slow one) hog the time spent in testcontext. I have a few ideas for this, so I'll go ahead and propose them and let you all shoot them down or propose others.
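Roughly the shape of what happens today (a sketch, not the actual code; method_table, active_method_count, and test_all_methods are stand-ins for the real internals):

/* Sketch only: BMI_testcontext sweeps every active method, and a
 * slow method like tcp can eat most of each sweep's time budget. */
struct bmi_method
{
    int (*testcontext)(int max_idle_time_ms);
};

extern struct bmi_method *method_table[];
extern int active_method_count;

static int test_all_methods(int max_idle_time_ms)
{
    int total = 0;
    int i;
    for (i = 0; i < active_method_count; i++)
    {
        /* every method gets an equal share of the idle time,
         * whether or not it has imminent completions */
        total += method_table[i]->testcontext(
            max_idle_time_ms / active_method_count);
    }
    return total;
}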

Option CALLBACK: Instead of returning completion as a list in testcontext, we allow a BMI context to be constructed with a callback, and on completion of operations, the callback is called. This allows each method to drive its own operations and notify the consumer of completion immediately. There would still need to be a testcontext call for methods that only service operations during that call. The changes might not be that significant; the BMI_open_context call could just take an extra parameter for the callback function. If the parameter is NULL, we just use the completion list as before.
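One possible shape for that change (the callback type and the extra parameter are proposals, not the current API; the types come from bmi.h):

/* Sketch only: a proposed signature for BMI_open_context with an
 * optional completion callback.  Passing NULL keeps the current
 * completion-list behavior.  bmi_callback_fn is an invented type. */
typedef void (*bmi_callback_fn)(bmi_op_id_t op_id,
                                bmi_error_code_t error,
                                bmi_size_t actual_size,
                                void *user_ptr);

int BMI_open_context(bmi_context_id *context_id,
                     bmi_callback_fn callback /* NULL => list mode */);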

Option CONTEXT: Require separate contexts for separate methods. This pushes the problem up to the application, which is probably not where it belongs, since the set of active methods is opaque from the BMI API.

Option POLL_PLAN: Modify the construct_poll_plan function in BMI that already tries to be fair, so that it's aware of the performance discrepancy between methods. Maybe it could just skip tcp every other time, for example. This is probably the easiest option, since it doesn't require API changes and the like.
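The simplest version might be a per-method skip count inside construct_poll_plan (a sketch; the struct and values are invented):

/* Sketch only: weight the poll plan so a slow method like tcp is
 * only polled every Nth sweep. */
struct method_poll_state
{
    int skip_every;    /* 1 = poll every sweep, 2 = every other, ... */
    int sweeps_since;  /* sweeps since this method was last polled */
};

static int should_poll_method(struct method_poll_state *m)
{
    m->sweeps_since++;
    if (m->sweeps_since >= m->skip_every)
    {
        m->sweeps_since = 0;
        return 1;      /* include this method in the poll plan */
    }
    return 0;          /* skip it this time around */
}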

-sam
