On Jan 8, 2009, at 10:48 AM, Rob Ross wrote:
Hey,
For the CALLBACK option, you would use that to have the individual
methods filling in things at the "generic BMI" layer (for lack of
the right terminology), but the overall user API would be the same?
I was thinking that the callback would get passed all the way down to
the method, and it would call the callback on completion of an
operation. We could keep the callback at the generic BMI level, and
call the callback for completed operations on return from a method's
testcontext. That does still avoid the multiplexing issues we see at
present, and it's less of a change to the BMI code overall, so maybe
that's the way to go.
The user API would have to change fairly significantly for callbacks,
because if a "callback context" were specified, completion would be
notified via the callback instead of as a list of completed
operations. For example, in our job code, instead of copying
completed BMI operations to the job completion list with each call to
BMI_testcontext, we would copy completed BMI operations to the job
completion list whenever the callback was called. This still doesn't
fix the issue for our metadata operations though, because completed
operations are just going to sit in the job completion queue while
we're calling BMI_testcontext (it still takes just as long to iterate
through all the methods). So we would need to modify the job
interfaces to take callbacks as well, and define a callback that
starts up the associated state machine. For our metadata operations,
this ends up being fairly invasive.
For I/O, it actually ends up being a win, because flow already uses
callbacks to bounce between BMI and trove operations.
A potential drawback to the callback idea is that synchronization
occurs on a per-operation basis, instead of once for potentially many
operations. A way around that would be to allow a callback that
accepts many completed operations instead of just one, although I
don't know if mutex locks are a real bottleneck for us anymore.
I don't think that the CONTEXT option is appropriate. I don't want
to expose the specifics of the underlying networks any more than we
have already.
There should be relevant research in the MPI space related to the
POLL_PLAN option.
Do we consider this to be a problem for both clients and servers, or
is it really a server-specific issue? If this is something we think
will be solely (or mostly) a server thing, we could consider throwing a
thread at the issue. One option might be to kick off a thread to
wait on the TCP side of things, since the kernel is doing most of
the work for us anyway, and put completed TCP events into the
completion list asynchronously (for servers only)?
I think the problem has been raised only on clients, but it exists on
both the server and clients.
Maybe I'm just missing some details, but I don't think a tcp thread
will help us on its own; at the least it needs to be combined with the
POLL_PLAN or CALLBACK option. The tcp testcontext call will sleep
(epoll_wait) up to the timeout passed in if there are no completed
operations and no
work to be done. With a thread, we would just have tcp testcontext
return immediately even if nothing was in the completion list. But
that means that a tcp-only scenario will cause the BMI_testcontext
calls on the client to spin and peg the CPU. We could add in a
condition variable, but then we're right back where we started. I
think an appropriate POLL_PLAN option could adjust timeouts to the tcp
testcontext call, but it requires a lot more smarts in the code to get
that right in general, whereas the callback option just allows you to
get completion right away.
-sam
Rob
On Jan 7, 2009, at 4:06 PM, Sam Lang wrote:
Hi All,
Right now if multiple methods are enabled in BMI, we tend to get
poor performance from the "fast" network, because BMI_testcontext
iterates through all the active methods calling testcontext for
each one. It tries to be smart about which methods get
scheduled ;-) to prevent starvation, but it treats all the methods
fairly, which tends to make tcp (the slow one) hog the time spent
in testcontext. I have a few ideas for this, so I'll go ahead and
propose them and let you all shoot them down or propose others.
Option CALLBACK: Instead of returning completion as a list in
testcontext, we allow a BMI context to be constructed with a
callback, and on completion of operations, the callback is called.
This allows each method to drive its own operations, and notify the
consumer of completion immediately. There would still need to be a
testcontext call for methods that only service operations during
that call. The changes might not be that significant: the
BMI_open_context call could just take an extra parameter for the
callback function. If the parameter is NULL, we just use the
completion list as before.
Option CONTEXT: Require separate contexts for separate methods.
This pushes the problem up to the application, probably not where
it belongs, since active methods are opaque from the BMI api.
Option POLL_PLAN: Modify the construct_poll_plan function in bmi
that already tries to be fair, so that it's aware of the performance
discrepancy between methods. Maybe it could just skip tcp every
other time, for example. This is probably the easiest option, since
it doesn't require API changes and the like.
-sam
_______________________________________________
Pvfs2-developers mailing list
Pvfs2-developers@beowulf-underground.org
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-developers