Re: [Pvfs2-developers] BMI send implemtations

Scott Atchley Tue, 22 Aug 2006 11:21:58 -0700

On Aug 22, 2006, at 1:35 PM, Pete Wyckoff wrote:

[EMAIL PROTECTED] wrote on Tue, 22 Aug 2006 11:40 -0400:

What should a BMI method do if the receive is not posted? Wait?
Cancel with failure?


Something polite, just like MPI more or less.  You can complete the
send at the sender but buffer in the sender or receiver.  Or you can
keep the send pending until the receiver shows up.  We do the latter
for large messages in IB.  For small messages (< 8 kB currently),
expected or not, they queue up in the preposted receiver buffers (up
to 20).  Too many small messages and the sender will wait.

Ok. MX does this underneath the covers for me. If my send is small (< 32 KB by default), it buffers the data and the send can complete immediately. Larger sends are held on the sender until a matching receive is posted. If the receive is not posted before the sender cancels the send, there is nothing else to clean up.
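
A rough model of the send behaviour described above, as a sketch only. The names send_state, post_send, cancel_send and EAGER_LIMIT are made up for illustration and are not MX or BMI calls; the 32 KB threshold is simply the default quoted above.

```c
#include <stddef.h>
#include <stdbool.h>

/* Hypothetical names throughout: this models the behaviour described
 * in the thread and is not the MX API. */
#define EAGER_LIMIT (32 * 1024)   /* "small" threshold, 32 KB by default */

enum send_state {
    SEND_COMPLETE,    /* sender-side completion has been (or can be) reported */
    SEND_PENDING,     /* large send held until a matching receive is posted */
    SEND_CANCELLED    /* sender gave up before a receive was posted */
};

/* State of a send right after it is posted. */
enum send_state post_send(size_t len, bool recv_posted)
{
    if (len < EAGER_LIMIT)
        return SEND_COMPLETE;   /* small: buffered, completes immediately */
    /* Large: the transfer can only proceed once the receiver has matched it.
     * A matched large send is treated as complete in this simplified model. */
    return recv_posted ? SEND_COMPLETE : SEND_PENDING;
}

/* Cancelling only affects a send that is still pending; in this model
 * there is no other state to clean up. */
enum send_state cancel_send(enum send_state s)
{
    return s == SEND_PENDING ? SEND_CANCELLED : s;
}
```

The point of the model is that only a large, unmatched send leaves anything outstanding on the sender.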

This brings up a possible case. What if peer A sends what it thinks is an expected small message to B and B never posts a receive for it? Since it is small, A's MX will buffer the message and send it to B's MX lib. The sender will get an immediate completion. If B never posts a matching receive, the message will sit in the MX unexpected queue indefinitely. How long should I let a message sit in the unexpected queue before deleting it? I do not want to delete it immediately in case the local BMI is about to post a matching receive for it.
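
One possible policy, sketched under assumptions: give each unexpected message a grace period and drop it only after that has elapsed. The struct, the function name and the idea of tracking arrival times myself are all hypothetical; the right length for the grace period is exactly the open question here.

```c
#include <stddef.h>

/* Hypothetical bookkeeping; nothing here is an MX or BMI call. */
struct unexp_msg {
    unsigned long long match;   /* match bits of the buffered message */
    double arrival;             /* arrival time, in seconds */
};

/* Drop every entry older than max_age seconds, compacting the array
 * in place. Returns the number of entries that remain. */
size_t reap_unexpected(struct unexp_msg *q, size_t n,
                       double now, double max_age)
{
    size_t kept = 0;
    for (size_t i = 0; i < n; i++) {
        if (now - q[i].arrival <= max_age)
            q[kept++] = q[i];   /* still within the grace period */
    }
    return kept;
}
```

Calling this from the periodic test path would leave a late-posted receive a chance to match before the message is discarded.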

Also, MX does not limit the number of unexpected messages, but it does place a limit on the memory used for unexpected messages (~2 MB by default? It is tunable). There are no queue pairs for limiting per-peer messages. If rate limiting (flow control) is a requirement, let me know.

I am still trying to have a clear understanding of how a send varies
from an unexpected send. The receiver will have some number of pre-
posted, "generic" receives (i.e. can receive from any peer) to catch
unexpected sends. If the send is _not_ unexpected (i.e. expected),
then it implies that the receiver will post a receive for the
expected send (either pre-posted or slightly late due to clock
drift), no? If not, what am I missing?


In a BMI implementation, very little difference.  A "test" on the
receive side only looks at existing connections, while a
"testunexpected" also does the moral equivalent of accept() on the
listening socket.

MX is connection-less. I will pre-post a bunch of receives for unexpected messages with a special bit mask. For expected receives, I will post using the BMI tag as well as an identifier for the remote peer. I can then test on a specific, per-peer receive (mx_recv() followed by mx_test()) or on any available (expected or not) receive (mx_recv() followed by mx_test_any()).
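
To make the matching scheme concrete, here is a sketch of how the match bits could be laid out. The bit layout, the macro names and both helper functions are assumptions for illustration (including the 32-bit width of the BMI tag); the only thing taken as given is that a receive is posted with match bits plus a mask.

```c
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical layout of the 64 match bits:
 *   bit 63      : 1 = unexpected message
 *   bits 32..62 : peer identifier
 *   bits 0..31  : BMI tag
 */
#define UNEXPECTED_BIT (UINT64_C(1) << 63)
#define MASK_ALL       UINT64_MAX       /* expected: every bit must match */
#define MASK_UNEXP     UNEXPECTED_BIT   /* unexpected: only the flag bit */

uint64_t make_match(bool unexpected, uint32_t peer, uint32_t tag)
{
    uint64_t m = ((uint64_t)(peer & 0x7fffffffu) << 32) | tag;
    return unexpected ? (m | UNEXPECTED_BIT) : m;
}

/* A receive posted with (recv_bits, mask) matches an incoming message
 * when both agree on every bit selected by the mask. */
bool matches(uint64_t recv_bits, uint64_t mask, uint64_t msg_bits)
{
    return (msg_bits & mask) == (recv_bits & mask);
}
```

With this layout a "generic" receive posted with MASK_UNEXP catches an unexpected message from any peer with any tag, while an expected receive posted with MASK_ALL matches only one (peer, tag) pair.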

To a BMI user, an unexpected message always signals the start of a
new transaction, while an expected message continues an existing
transaction for which you've got outstanding state.  Phil said in
his CAC paper, "This reduces complexity on the server side because
the server does not have to anticipate buffer use in advance."

Can you clarify the "you" in "while an expected message continues an existing transaction for which you've got outstanding state"? Does "you" mean the BMI/MX method or BMI/PVFS? From what it looks like, I have no state (unless I need to rate limit sends).

Is the purpose of the RTS/CTS messages then to stall the sending
until the receiver has posted the receive?


Yes.   (I'm just talking about IB again.  GM may be similar.)

This is unnecessary then for MX. If the send is expected, I can simply call mx_isend() (same semantics as MPI_Isend()). If it is small, MX will buffer and send it. If large, it will wait until the peer posts a receive. If the peer fails to post a receive or if it is gone (crashed, etc.), can I assume that BMI or higher will manage the timeout and call BMI_method_cancel() on the send?


If so, I do not need either RTS or CTS messages.

If so, would the receiver
ever send a CTS to indicate that a match is not forthcoming?


No.  Not sure how such information would help the sender.

This is a possible case in Lustre. If the receiver cannot find a matching buffer (either bad request or lack of resources), it will let me know to NAK the send request (send a CTS that indicates failure).

Or does
the receiver only send a CTS when the receive is posted?


Yes.  With buffer information to enable RDMA Write.

In MX, there is no registration. I simply pass in a list of user-space buffers. MX handles registration under the covers.

In the
latter case, the sender may time out waiting for a CTS and thus
cancel the send?


Yes.  The sender will get a completion with -PVFS_ETIMEDOUT.  The
receiver, if it is still around, will see the connection get closed.
A good receiver implementation would forget any outstanding RTS
messages for which the user never posted a receive.  (If it happened
to have sent a CTS, it should rewind that state back to "just
posted, never saw RTS" too.)  I confess to never having tested this
situation in IB.  You might be able to crib from MX's MPID_Cancel,
assuming somebody implemented that well.  :)

                -- Pete
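
The receiver-side cleanup described above can be sketched as a small state machine. This is only my reading of the description; the state and function names are hypothetical and do not come from the IB method's code.

```c
/* Hypothetical receiver-side states for one RTS/CTS exchange. */
enum rx_state {
    RX_IDLE,       /* no receive posted, no RTS seen */
    RX_POSTED,     /* user posted a receive, no RTS yet */
    RX_RTS_SEEN,   /* RTS arrived, user has not posted a receive */
    RX_CTS_SENT    /* both present: CTS sent, waiting for the data */
};

/* User posts a receive: if an RTS is already waiting, answer with CTS. */
enum rx_state on_post(enum rx_state s)
{
    if (s == RX_IDLE)     return RX_POSTED;
    if (s == RX_RTS_SEEN) return RX_CTS_SENT;
    return s;
}

/* RTS arrives: if a receive is already posted, answer with CTS. */
enum rx_state on_rts(enum rx_state s)
{
    if (s == RX_IDLE)   return RX_RTS_SEEN;
    if (s == RX_POSTED) return RX_CTS_SENT;
    return s;
}

/* Sender timed out and the connection closed: forget its RTS, and if a
 * CTS was already sent, rewind to "just posted, never saw RTS". */
enum rx_state on_sender_gone(enum rx_state s)
{
    if (s == RX_RTS_SEEN) return RX_IDLE;
    if (s == RX_CTS_SENT) return RX_POSTED;
    return s;
}
```

In this model the user's posted receive survives the sender's timeout and can still match a later RTS.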

Who manages the timeout? The BMI method (i.e. me) or something higher in BMI/PVFS? Can I assume all sends and receives are subject to timeout?


Thanks for all the input! This is a huge help. :-)

Scott
_______________________________________________
Pvfs2-developers mailing list
Pvfs2-developers@beowulf-underground.org
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-developers
