On Aug 22, 2006, at 1:35 PM, Pete Wyckoff wrote:

[EMAIL PROTECTED] wrote on Tue, 22 Aug 2006 11:40 -0400:
What should a BMI method do if the receive is not posted? Wait?
Cancel with failure?

Something polite, just like MPI more or less.  You can complete the
send at the sender but buffer in the sender or receiver.  Or you can
keep the send pending until the receiver shows up.  We do the latter
for large messages in IB.  For small messages (< 8 kB currently),
expected or not, they queue up in the preposted receiver buffers (up
to 20).  Too many small messages and the sender will wait.

Ok. MX does this underneath the covers for me. If my send is small (<32 KB by default), it buffers the data and the send can complete immediately. Larger sends are held on the sender until a matching receive is posted. If the receive is not posted before the sender cancels the send, there is nothing else to clean up.
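
To make the small/large split concrete, here is a minimal sketch of the decision as I understand it. It does not call MX; the names (EAGER_THRESHOLD_BYTES, choose_send_path, send_completes_locally) are made up for illustration, and the 32 KB cutoff is just the default mentioned above.

```c
#include <stdbool.h>
#include <stddef.h>

/* Assumption: the default cutoff described above (sends under 32 KB are buffered). */
#define EAGER_THRESHOLD_BYTES ((size_t)32 * 1024)

/* Hypothetical names; these are not MX API calls. */
enum send_path { SEND_EAGER, SEND_RENDEZVOUS };

/* Small sends are buffered and shipped right away; large ones wait for a match. */
static enum send_path choose_send_path(size_t len)
{
    return len < EAGER_THRESHOLD_BYTES ? SEND_EAGER : SEND_RENDEZVOUS;
}

/* An eager send completes at the sender immediately. A rendezvous send
 * completes only after the peer has posted a matching receive. */
static bool send_completes_locally(size_t len, bool receive_posted)
{
    return choose_send_path(len) == SEND_EAGER || receive_posted;
}
```

With this model, a 1 KB send completes whether or not the peer has posted anything, while a 64 KB send stays pending until the receive shows up or the send is cancelled.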

This brings up a possible case. What if peer A sends what it thinks is an expected small message to B, and B never posts a receive for it? Since it is small, A's MX will buffer the message and send it to B's MX lib. The sender will get an immediate completion. If B never posts a matching receive, the message will sit in the MX unexpected queue indefinitely. How long should I let a message sit in the unexpected queue before deleting it? I do not want to delete it immediately in case the local BMI is about to post a matching receive for it.

Also, MX does not limit the number of unexpected messages, but it does limit the memory used for them (about 2 MB by default, I believe, and tunable). There are no queue pairs for limiting per-peer messages. If rate limiting (flow control) is a requirement, let me know.

I am still trying to have a clear understanding of how a send varies
from an unexpected send. The receiver will have some number of pre-
posted, "generic" receives (i.e. can receive from any peer) to catch
unexpected sends. If the send is _not_ unexpected (i.e. expected),
then it implies that the receiver will post a receive for the
expected send (either pre-posted or slightly late due to clock
drift), no? If not, what am I missing?

In a BMI implementation, very little difference.  A "test" on the
receive side only looks at existing connections, while a
"testunexpected" also does the moral equivalent of accept() on the
listening socket.

MX is connection-less. I will pre-post a bunch of receives for unexpected messages with a special bit mask. For expected receives, I will post using the BMI tag as well as an identifier for the remote peer. I can then test on a specific, per-peer receive (mx_recv() followed by mx_test()) or on any available (expected or not) receive (mx_recv() followed by mx_test_any()).
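
A rough sketch of the matching scheme I have in mind follows. The bit layout is purely an assumption for illustration (top bit marks unexpected, 31 bits of peer id, 32 bits of BMI tag), and the helpers are hypothetical. It models mask-and-compare matching on a 64-bit value and does not call into MX.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed layout: [63] unexpected flag | [62:32] peer id | [31:0] BMI tag. */
#define MATCH_UNEXPECTED_BIT (UINT64_C(1) << 63)
#define MATCH_PEER_SHIFT     32
#define MATCH_PEER_MASK      (UINT64_C(0x7FFFFFFF) << MATCH_PEER_SHIFT)
#define MATCH_ALL_BITS       UINT64_MAX

/* Match value carried by an expected send from peer_id with bmi_tag. */
static uint64_t make_expected_match(uint32_t peer_id, uint32_t bmi_tag)
{
    return (((uint64_t)peer_id << MATCH_PEER_SHIFT) & MATCH_PEER_MASK) | bmi_tag;
}

/* Same value with the unexpected flag set. */
static uint64_t make_unexpected_match(uint32_t peer_id, uint32_t bmi_tag)
{
    return MATCH_UNEXPECTED_BIT | make_expected_match(peer_id, bmi_tag);
}

/* A posted receive matches when the masked incoming bits equal its match value. */
static bool recv_matches(uint64_t incoming, uint64_t recv_info, uint64_t recv_mask)
{
    return (incoming & recv_mask) == recv_info;
}
```

A generic pre-posted receive would use MATCH_UNEXPECTED_BIT as both match value and mask, so it catches unexpected messages from any peer with any tag. An expected receive would use make_expected_match(peer, tag) with MATCH_ALL_BITS, so it matches only that peer and tag and never swallows an unexpected message.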

To a BMI user, an unexpected message always signals the start of a
new transaction, while an expected message continues an existing
transaction for which you've got outstanding state.  Phil said in
his CAC paper, "This reduces complexity on the server side because
the server does not have to anticipate buffer use in advance."

Can you clarify the "you" in the "while an expected message continues an existing transaction for which you've got outstanding state"? Does "you" mean the BMI/MX method or BMI/PVFS? From what it looks like, I have no state (unless I need to rate limit sends).

Is the purpose of the RTS/CTS messages then to stall the sending
until the receiver has posted the receive?

Yes.   (I'm just talking about IB again.  GM may be similar.)

This is unnecessary then for MX. If the send is expected, I can simply call mx_isend() (same semantics as MPI_Isend()). If it is small, MX will buffer and send it. If large, it will wait until the peer posts a receive. If the peer fails to post a receive or if it is gone (crashed, etc.), can I assume that BMI or higher will manage the timeout and call BMI_method_cancel() on the send?

If so, I do not need either RTS or CTS messages.

If so, would the receiver
ever send a CTS to indicate that a match is not forthcoming?

No.  Not sure how such information would help the sender.

This is a possible case in Lustre. If the receiver cannot find a matching buffer (either bad request or lack of resources), it will let me know to NAK the send request (send a CTS that indicates failure).

Or does
the receiver only send a CTS when the receive is posted?

Yes.  With buffer information to enable RDMA Write.

In MX, there is no registration. I simply pass in a list of user-space buffers. MX handles registration under the covers.

In the
latter case, the sender may time out waiting for a CTS and thus
cancel the send?

Yes.  The sender will get a completion with -PVFS_ETIMEDOUT.  The
receiver, if it is still around, will see the connection get closed.
A good receiver implementation would forget any outstanding RTS
messages for which the user never posted a receive.  (If it happened
to have sent a CTS, it should rewind that state back to "just
posted, never saw RTS" too.)  I confess to never having tested this
situation in IB.  You might be able to crib from MX's MPID_Cancel,
assuming somebody implemented that well.  :)
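
The cleanup described above could be sketched as a small receiver-side state machine. This is only an illustration of the idea, not the IB method's actual code; the state names and on_sender_gone() are invented here.

```c
/* Hypothetical receiver-side states for one RTS/CTS exchange. */
enum recv_state {
    RECV_NONE,        /* no receive posted, no RTS seen */
    RECV_POSTED,      /* receive posted, never saw RTS */
    RECV_RTS_PENDING, /* RTS arrived, user has not posted a receive */
    RECV_CTS_SENT     /* receive posted and CTS sent, waiting for data */
};

/* Called when the sender timed out and the connection was closed. */
static enum recv_state on_sender_gone(enum recv_state s)
{
    switch (s) {
    case RECV_RTS_PENDING:
        return RECV_NONE;   /* forget the RTS nobody posted a receive for */
    case RECV_CTS_SENT:
        return RECV_POSTED; /* rewind; the user's posted receive stays valid */
    default:
        return s;           /* nothing in flight from this sender */
    }
}
```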

                -- Pete

Who manages the timeout? The BMI method (i.e. me) or something higher in BMI/PVFS? Can I assume all sends and receives are subject to timeout?

Thanks for all the input! This is a huge help. :-)

Scott
_______________________________________________
Pvfs2-developers mailing list
Pvfs2-developers@beowulf-underground.org
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-developers
