Re: [openib-general] Re: Mellanox HCAs: outstanding RDMAs
Whether iWARP or IB, there is a fixed number of RDMA Read requests allowed to be
outstanding at any given time. If one posts more RDMA Read requests than that
fixed number, the transmit queue stalls. This is documented in both technology
specifications. It is something all ULPs should be aware of, and some go so far
as to communicate the limit as part of the Hello / login exchange. This allows
the ULP implementation to decide whether it wants to stall, or wants to wait
until Read Responses complete before sending another request. This isn't
something silent; this isn't something new; it is for the ULP implementation to
decide how to deal with the issue.

BTW, this is part of the hardware and the associated specifications, so it is
up to software to deal with the limited hardware resources and the associated
consequences. Please keep in mind that there is a limited number of RDMA Read /
Atomic resource slots at the receiving HCA / RNIC. These are kept in hardware,
thus one must know the exact limit to avoid creating protocol problems. A ULP
transmitter may post more than the allotted slots to the transmit queue, but
the transmitting (source) HCA / RNIC must not issue them to the remote. Such
requests do cause the source to stall. This is a well understood problem, and
if people give the iSCSI / iSER and DA specs, or SDP, a good read, they can see
that this issue is comprehended.

I agree that ULP designers / implementers must pay close attention to this
constraint. It is in the iWARP / IB specifications for a very good reason, and
these semantics must be preserved to maintain the ordering requirements that
are used by the overall RDMA protocols themselves.

Mike

At 05:24 AM 6/6/2006, Talpey, Thomas wrote:
>At 03:43 AM 6/6/2006, Michael S. Tsirkin wrote:
>>Quoting r. Talpey, Thomas [EMAIL PROTECTED]:
>>>Semantically, the provider is not required to provide any such flow
>>>control behavior, by the way. The Mellanox one apparently does, but
>>>it is not a requirement of the verbs; it's a requirement on the upper
>>>layer. If more RDMA Reads are posted than the remote peer supports,
>>>the connection may break.
>>
>>This does not sound right. Isn't this the meaning of this field:
>>Initiator Depth: Number of RDMA Read / atomic operations outstanding
>>at any time? Shouldn't any provider enforce this limit?
>
>The core spec does not require it. An implementation *may* enforce it,
>but is not *required* to do so. And as pointed out in the other
>message, there are repercussions of doing so.
>
>I believe the silent queue stalling is a bit of a time bomb for upper
>layers, whose implementers are quite likely unaware of the danger. I
>greatly prefer an implementation which simply sends the RDMA Read
>request, resulting in a failed (but unblocked!) connection. Silence is
>a very dangerous thing, no matter how helpful the intent.
>
>Tom.
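The fixed limits Mike describes are discoverable before the Hello / login
exchange. A minimal sketch of the query, assuming the gen2 libibverbs API;
error handling is trimmed, and the reading of the two fields in the comments
is the conventional one:

#include <stdio.h>
#include <infiniband/verbs.h>

int print_rd_atomic_limits(struct ibv_context *ctx)
{
        struct ibv_device_attr attr;

        if (ibv_query_device(ctx, &attr))
                return -1;

        /* Depth this HCA supports per QP when initiating RDMA Reads /
         * atomics, and the per-QP limit conventionally used to size
         * the responder-side resource slots. */
        printf("initiator depth per QP:     %d\n", attr.max_qp_init_rd_atom);
        printf("responder resources per QP: %d\n", attr.max_qp_rd_atom);
        return 0;
}

These are the values a ULP would advertise in its Hello / login (or CM REQ)
exchange rather than assuming the peer's limits.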
Re: [openib-general] Re: Mellanox HCAs: outstanding RDMAs
At 10:40 AM 6/6/2006, Roland Dreier wrote:
>    Thomas> This is the difference between may and must. The value
>    Thomas> is provided, but I don't see anything in the spec that
>    Thomas> makes a requirement on its enforcement. Table 107 says the
>    Thomas> consumer can query it, that's about as close as it
>    Thomas> comes. There's some discussion about CM exchange too.
>
>This seems like a very strained interpretation of the spec.

I don't see how strained has anything to do with it. It's not saying
anything either way. So, a legal implementation can make either choice.
We're talking about the spec!

But, it really doesn't matter. The point is, an upper layer should be
paying attention to the number of RDMA Reads it posts, or else suffer
either the queue-stalling or connection-failing consequences. Bad stuff
either way.

Tom.

>For example, there's no explicit language in the IB spec that requires
>an HCA to use the destination LID passed via a modify QP operation, but
>I don't think anyone would seriously argue that an implementation that
>sent messages to some other random destination was compliant.
>
>In the same way, if I pass a limit for the number of outstanding
>RDMA/atomic operations into a modify QP operation, I would expect the
>HCA to use that limit.
>
> - R.
Re: [openib-general] Re: Mellanox HCAs: outstanding RDMAs
Thomas> I don't see how strained has anything to do with it. It's
Thomas> not saying anything either way. So, a legal implementation
Thomas> can make either choice. We're talking about the spec!

I guess the reason I say it is strained is because the spec does have
the following compliance statement for the modify QP verb:

    C11-8: Upon invocation of this Verb, the CI shall modify the
    attributes for the specified QP...

So what should I expect to happen if I modify the number of outstanding
RDMA Read/atomic operations? That the HCA will ignore that attribute?

To me the only sensible interpretation of the spec is that setting a
limit on outstanding operations will limit the number of outstanding
operations. If the attribute doesn't do anything, then why would the
spec include it?

 - R.
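For concreteness, the attribute under discussion appears in gen2 verbs as
max_rd_atomic (initiator depth), set on the RTR-to-RTS transition; its mirror
max_dest_rd_atomic (responder resources) is set at RTR with
IBV_QP_MAX_DEST_RD_ATOMIC. A minimal sketch, assuming libibverbs; the values
other than the depth are illustrative defaults, not recommendations:

#include <stdint.h>
#include <infiniband/verbs.h>

/* RTR -> RTS transition; max_rd_atomic is the attribute C11-8
 * obliges the CI to honor.  Timeout/retry values are illustrative. */
int qp_to_rts(struct ibv_qp *qp, uint8_t initiator_depth)
{
        struct ibv_qp_attr attr = {
                .qp_state      = IBV_QPS_RTS,
                .timeout       = 14,   /* ~4 sec, illustrative */
                .retry_cnt     = 7,
                .rnr_retry     = 7,
                .sq_psn        = 0,
                .max_rd_atomic = initiator_depth, /* reads we may initiate */
        };

        return ibv_modify_qp(qp, &attr,
                             IBV_QP_STATE | IBV_QP_TIMEOUT |
                             IBV_QP_RETRY_CNT | IBV_QP_RNR_RETRY |
                             IBV_QP_SQ_PSN | IBV_QP_MAX_QP_RD_ATOMIC);
}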
RE: [openib-general] Re: Mellanox HCAs: outstanding RDMAs
>Talpey, Thomas
>Sent: Tuesday, June 06, 2006 10:49 AM
>
>At 10:40 AM 6/6/2006, Roland Dreier wrote:
>>    Thomas> This is the difference between may and must. The value
>>    Thomas> is provided, but I don't see anything in the spec that
>>    Thomas> makes a requirement on its enforcement. Table 107 says the
>>    Thomas> consumer can query it, that's about as close as it
>>    Thomas> comes. There's some discussion about CM exchange too.
>>
>>This seems like a very strained interpretation of the spec. [...]
>
>I don't see how strained has anything to do with it. It's not saying
>anything either way. So, a legal implementation can make either choice.
>We're talking about the spec!
>
>But, it really doesn't matter. The point is, an upper layer should be
>paying attention to the number of RDMA Reads it posts, or else suffer
>either the queue-stalling or connection-failing consequences. Bad stuff
>either way.
>
>Tom.

Somewhere beneath this discussion is a bug in the application or IB
stack. I'm not sure which "may" in the spec you are referring to, but
the "may"s I have found are all for cases where the responder might
support only 1 outstanding request. In all cases the negotiation
protocol must be followed, and the requestor is not allowed to exceed
the negotiated limit.

The mechanism should be:

1. The client queries its local HCA and determines its responder
   resources (e.g. the number of concurrent outstanding RDMA Reads on
   the wire from the remote end, where this end will respond with the
   read data) and its initiator depth (e.g. the number of concurrent
   outstanding RDMA Reads which this end can initiate as the
   requestor).

2. The client puts the above information in the CM REQ.

3. The server similarly gets its information from its local CA and
   negotiates the values down to the MIN of each side:

       REP.InitiatorDepth     = MIN(REQ.ResponderResources,
                                    server's local CA's initiator depth)
       REP.ResponderResources = MIN(REQ.InitiatorDepth,
                                    server's local CA's responder resources)

   If the server does not support RDMA Reads, it can REJ. If the client
   decides the negotiated values are insufficient to meet its goals, it
   can disconnect.

4. Each side sets its QP parameters via modify QP appropriately. Note
   they too will be mirror images of each other:

       client: QP.Max RDMA Reads as initiator = REP.ResponderResources
               QP.Max RDMA Reads as responder = REP.InitiatorDepth
       server: QP.Max RDMA Reads as responder = REP.ResponderResources
               QP.Max RDMA Reads as initiator = REP.InitiatorDepth

We have done a lot of high-stress RDMA Read traffic with Mellanox HCAs
and, provided the above negotiation is followed, we have seen no
issues. Note however that by default a Mellanox HCA typically reports a
large InitiatorDepth (128) and a modest ResponderResources (4-8). Hence
when I hear that Responder Resources must be grown to 128 for some
application to work reliably, it implies the negotiation I outlined
above is not being followed.

Note that the ordering rules in table 76 of IBTA 1.2 show how reads and
writes on a send queue are ordered. There are many cases where an op
can pass an outstanding RDMA Read, hence it is not always bad to queue
extra RDMA Reads. If needed, the Fence can be set to force order. For
many apps, it is going to be better to get the items onto the queue and
let the QP handle the outstanding-reads case rather than have the app
add a level of queuing for this purpose. Letting the HCA do the queuing
will allow for a more rapid initiation of subsequent reads.
Todd Rimmer
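A minimal sketch of the MIN negotiation in step 3 above, in C; the struct
and function names are illustrative, not a real CM API:

#include <stdint.h>

struct cm_depths {
        uint8_t initiator_depth;      /* reads this side may initiate */
        uint8_t responder_resources;  /* reads this side can absorb   */
};

static uint8_t min8(uint8_t a, uint8_t b) { return a < b ? a : b; }

/* Server side: clamp the client's advertised values against the
 * local CA's limits, producing the REP both sides must honor. */
struct cm_depths negotiate_rep(struct cm_depths req,   /* from client REQ  */
                               struct cm_depths local) /* server CA limits */
{
        struct cm_depths rep;

        rep.initiator_depth     = min8(req.responder_resources,
                                       local.initiator_depth);
        rep.responder_resources = min8(req.initiator_depth,
                                       local.responder_resources);
        return rep;
}

Per step 4, each side then mirrors the REP into its modify QP call: the
client sets max_rd_atomic = REP.ResponderResources and max_dest_rd_atomic =
REP.InitiatorDepth, and the server the reverse.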
Re: [openib-general] Re: Mellanox HCAs: outstanding RDMAs
Quoting r. Talpey, Thomas [EMAIL PROTECTED]:
>But, it really doesn't matter. The point is, an upper layer should be
>paying attention to the number of RDMA Reads it posts, or else suffer
>either the queue-stalling or connection-failing consequences. Bad stuff
>either way.

Queue-stalling is not necessarily bad, for example if the ULP needs to
perform multiple RDMA Reads anyway. You can use multiple QPs if you do
not require ordering between operations.

Connection-failing *is* bad stuff. IMO it might be compliant, but it's
clearly broken, in the same way that a NIC that drops all packets might
be compliant but is broken.

-- 
MST
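A minimal sketch of the multiple-QP idea, assuming libibverbs and a set of
already-connected RC QPs negotiated as above; the structure and function
names are illustrative:

#include <infiniband/verbs.h>

struct read_fanout {
        struct ibv_qp **qps;   /* connected RC QPs */
        int             nqps;
        int             next;
};

/* Spread unordered RDMA Read WRs round-robin over the QP set so one
 * QP stalling at its read depth does not serialize the others. */
int post_read_any(struct read_fanout *f, struct ibv_send_wr *wr)
{
        struct ibv_send_wr *bad;
        struct ibv_qp *qp = f->qps[f->next];

        f->next = (f->next + 1) % f->nqps;  /* no ordering guarantee */
        return ibv_post_send(qp, wr, &bad);
}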
RE: [openib-general] Re: Mellanox HCAs: outstanding RDMAs
Todd, thanks for the set-up. I'm really glad we're having this
discussion!

Let me give an NFS/RDMA example to illustrate why this upper layer, at
least, doesn't want the HCA doing its flow control, or resource
management.

NFS/RDMA is a credit-based protocol which allows many operations in
progress at the server. Let's say the client is currently running with
an RPC slot table of 100 requests (a typical value). Of these requests,
some workload-specific percentage will be reads, writes, or metadata.

All NFS operations consist of one send from client to server, some
number of RDMA writes (for NFS reads) or RDMA reads (for NFS writes),
followed by one send from server to client. The number of RDMA read or
write operations per NFS op depends on the amount of data being read or
written, and also on the memory registration strategy in use on the
client. The highest-performing such strategy is an all-physical one,
which results in one RDMA-able segment per physical page. NFS r/w
requests are, by default, 32KB, or 8 pages typical. So, typically 8
RDMA requests (read or write) are the result.

To illustrate, let's say the client is processing a multi-threaded
workload with (say) 50% reads, 20% writes, and 30% metadata such as
lookup and getattr. A kernel build, for example. Therefore, of our 100
active operations, 50 are reads for 32KB each, 20 are writes of 32KB,
and 30 are metadata (non-RDMA). To the server, this results in 100
requests, 100 replies, 400 RDMA writes, and 160 RDMA Reads. Of course,
these overlap heavily due to the widely differing latency of each op
and the highly distributed arrival times. But, for the example, this is
a snapshot of current load.

The latency of the metadata operations is quite low, because lookup and
getattr are acting on what is effectively cached data. The reads and
writes, however, take much longer, because they reference the
filesystem. When disk queues are deep, they can take many milliseconds.

Imagine what happens if the client's IRD is 4 and the server ignores
its local ORD. As soon as a write begins execution, the server posts 8
RDMA Reads to fetch the client's write data. The first 4 RDMA Reads are
sent; the fifth stalls, and stalls the send queue! Even when three RDMA
Reads complete, the queue remains stalled; it doesn't unblock until the
fourth is done and all the RDMA Reads have been initiated.

But what just happened to all the other server send traffic? All those
metadata replies, and other reads which completed? They're stuck,
waiting for that one write request. In my example, these number 99 NFS
ops, i.e. 654 WRs! All for one NFS write! The client operation stream
effectively became single-threaded. What good is the rapid initiation
of RDMA Reads you describe in the face of this?

Yes, there are many arcane and resource-intensive ways around it. But
the simplest by far is to count the RDMA Reads outstanding, and for the
*upper layer* to honor ORD, not the HCA. Then the send queue never
blocks, and the operation stream never loses parallelism. This is what
our NFS server does.

As to the depth of IRD, this is a different calculation; it's a delay x
bandwidth product for the RDMA Read stream. 4 is good for local,
low-latency connections. But over a complicated switch infrastructure,
or heaven forbid a dark fiber long link, I guarantee it will cause a
bottleneck. This isn't an issue except for operations that care, but it
is certainly detectable.
I would like to see whether a pure RDMA Read stream can fully utilize a
typical IB fabric, and how much headroom an IRD of 4 provides. Not
much, I predict.

Closing the connection if IRD is insufficient to meet goals isn't a
good answer, IMO. How does that benefit interoperability?

Thanks for the opportunity to spout off again. Comments welcome!

Tom.

At 12:43 PM 6/6/2006, Rimmer, Todd wrote:
>Somewhere beneath this discussion is a bug in the application or IB
>stack. I'm not sure which "may" in the spec you are referring to, but
>the "may"s I have found are all for cases where the responder might
>support only 1 outstanding request.
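A minimal sketch of the upper-layer ORD accounting Tom describes, assuming
libibverbs. This is illustrative, not the actual NFS/RDMA server code; a
real implementation would need locking and would keep the deferred reads in
FIFO order rather than the LIFO list used here for brevity:

#include <stddef.h>
#include <infiniband/verbs.h>

struct ord_limiter {
        struct ibv_qp      *qp;
        int                 ord;         /* negotiated RDMA Read limit */
        int                 outstanding; /* reads currently on the wire */
        struct ibv_send_wr *deferred;    /* reads held back by the ULP */
};

/* Post an RDMA Read WR, or defer it instead of stalling the SQ. */
int ulp_post_read(struct ord_limiter *l, struct ibv_send_wr *wr)
{
        struct ibv_send_wr *bad;

        if (l->outstanding >= l->ord) {
                wr->next    = l->deferred;  /* LIFO for brevity */
                l->deferred = wr;
                return 0;
        }
        l->outstanding++;
        return ibv_post_send(l->qp, wr, &bad);
}

/* Called when an RDMA Read completion is reaped from the CQ. */
void ulp_read_done(struct ord_limiter *l)
{
        struct ibv_send_wr *wr = l->deferred, *bad;

        l->outstanding--;
        if (wr && l->outstanding < l->ord) {
                l->deferred = wr->next;
                wr->next    = NULL;
                l->outstanding++;
                ibv_post_send(l->qp, wr, &bad);
        }
}

Because excess reads are held back here rather than in the HCA's send
queue, sends for unrelated replies are never blocked behind a stalled
read, which is the parallelism argument made above.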