Mike, I am not arguing to change the standard. I am simply saying I do not want to be a victim of the default. It is my belief that very few upper layer programmers are aware of this, btw.
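(For anyone who wants to check what they are defaulting to: the per-QP RDMA Read limits can be queried up front. Below is a minimal sketch using libibverbs; the attribute names are my recollection of that API and should be verified against your headers, not taken as gospel.)

#include <stdio.h>
#include <infiniband/verbs.h>

int main(void)
{
    int num;
    struct ibv_device **devs = ibv_get_device_list(&num);
    struct ibv_context *ctx;
    struct ibv_device_attr attr;

    if (!devs || num == 0)
        return 1;
    ctx = ibv_open_device(devs[0]);
    if (!ctx || ibv_query_device(ctx, &attr))
        return 1;

    /* Incoming RDMA Reads the HCA can service per QP (the IRD ceiling);
     * this is the "modest" default on some hardware. */
    printf("max_qp_rd_atom:      %d\n", attr.max_qp_rd_atom);
    /* Outgoing RDMA Reads a QP may have in flight (the ORD ceiling). */
    printf("max_qp_init_rd_atom: %d\n", attr.max_qp_init_rd_atom);

    ibv_close_device(ctx);
    ibv_free_device_list(devs);
    return 0;
}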
The Linux NFS/RDMA upper layer implementation already deals with the issue, as I mentioned. It would certainly welcome a higher available IRD on Mellanox hardware, however. Thanks for your comments.

Tom.

At 01:55 PM 6/15/2006, Michael Krause wrote:
>As one of the authors of IB and iWARP, I can say that both Roland's and Todd's responses are correct and reflect the intent of the specifications. The number of outstanding RDMA Reads is bounded, and that bound is communicated during session establishment. The ULP can choose to be aware of this requirement (certainly when we wrote iSER and DA we were well aware of the requirement and we documented it as such in the ULP specs) and track from above so that it does not see a stall, or it can stay ignorant and deal with the stall as a result. This is a ULP choice, and it has been intentionally done that way so that the hardware can be kept as simple and as low-cost as possible while meeting the breadth of ULP needs that drove the development of these technologies.
>
>Tom, you raised this issue during iWARP's definition and the debate was conducted at least several times. The outcome of these debates is reflected in iWARP and remains aligned with IB. So, unless you really want to have the IETF and IBTA go and modify their specs, I believe you'll have to deal with the issue just as other ULPs are doing today: be aware of the constraint and write the software accordingly. The open source community isn't really the right forum to change the iWARP and IB specifications at the end of the day. Build a case in the IETF and IBTA and let those bodies determine whether it is appropriate to modify their specs or not. And yes, it would be a modification of the specs, and therefore of the hardware implementations as well, to address any interoperability requirements that would result (the change proposed could fragment the hardware offerings, as there are many thousands of devices in the market that would not necessarily support this change).
>
>Mike
>
>At 12:07 PM 6/6/2006, Talpey, Thomas wrote:
>>Todd, thanks for the set-up. I'm really glad we're having this discussion!
>>
>>Let me give an NFS/RDMA example to illustrate why this upper layer, at least, doesn't want the HCA doing its flow control or resource management.
>>
>>NFS/RDMA is a credit-based protocol which allows many operations in progress at the server. Let's say the client is currently running with an RPC slot table of 100 requests (a typical value).
>>
>>Of these requests, some workload-specific percentage will be reads, writes, or metadata. All NFS operations consist of one send from client to server, some number of RDMA Writes (for NFS reads) or RDMA Reads (for NFS writes), terminated by one send from server to client.
>>
>>The number of RDMA Read or Write operations per NFS op depends on the amount of data being read or written, and also on the memory registration strategy in use on the client. The highest-performing such strategy is an all-physical one, which results in one RDMA-able segment per physical page. NFS r/w requests are, by default, 32KB, or typically 8 pages. So, typically 8 RDMA requests (Read or Write) are the result.
>>
>>To illustrate, let's say the client is processing a multi-threaded workload with (say) 50% reads, 20% writes, and 30% metadata such as lookup and getattr. A kernel build, for example. Therefore, of our 100 active operations, 50 are reads of 32KB each, 20 are writes of 32KB, and 30 are metadata (non-RDMA).
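(To make the arithmetic below concrete, here is a throwaway sketch that recomputes the totals from this mix. The constants are just the example's assumptions, 4KB pages and 32KB I/O; this is not code from any NFS/RDMA implementation.)

#include <stdio.h>

int main(void)
{
    const int slots      = 100;            /* RPC slot table size */
    const int io_bytes   = 32 * 1024;      /* default NFS r/w size */
    const int page_bytes = 4096;           /* one RDMA segment per page */
    const int segs       = io_bytes / page_bytes;    /* 8 */

    const int nfs_reads  = slots * 50 / 100;   /* NFS reads  -> RDMA Writes */
    const int nfs_writes = slots * 20 / 100;   /* NFS writes -> RDMA Reads  */

    printf("requests:    %d\n", slots);              /* 100 */
    printf("replies:     %d\n", slots);              /* 100 */
    printf("RDMA Writes: %d\n", nfs_reads * segs);   /* 50 * 8 = 400 */
    printf("RDMA Reads:  %d\n", nfs_writes * segs);  /* 20 * 8 = 160 */
    return 0;
}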
>>To the server, this results in 100 requests, 100 replies, 400 RDMA Writes, and 160 RDMA Reads. Of course, these overlap heavily due to the widely differing latency of each op and the highly distributed arrival times. But, for the example, this is a snapshot of current load.
>>
>>The latency of the metadata operations is quite low, because lookup and getattr act on what is effectively cached data. The reads and writes, however, take much longer, because they reference the filesystem. When disk queues are deep, they can take many milliseconds.
>>
>>Imagine what happens if the client's IRD is 4 and the server ignores its local ORD. As soon as a write begins execution, the server posts 8 RDMA Reads to fetch the client's write data. The first 4 RDMA Reads are sent; the fifth stalls, and stalls the send queue! Even when three of those RDMA Reads complete, the queue remains stalled; it doesn't unblock until the fourth is done and all 8 RDMA Reads have been initiated.
>>
>>But what just happened to all the other server send traffic? All those metadata replies, and other reads which completed? They're stuck, waiting for that one write request. In my example, these number 99 NFS ops, i.e. 654 WRs! All for one NFS write! The client operation stream effectively became single-threaded. What good is the "rapid initiation of RDMA Reads" you describe in the face of this?
>>
>>Yes, there are many arcane and resource-intensive ways around it. But the simplest by far is to count the RDMA Reads outstanding, and for the *upper layer* to honor ORD, not the HCA. Then the send queue never blocks, and the operation stream never loses parallelism. This is what our NFS server does.
>>
>>As to the depth of IRD, this is a different calculation: it's a delay-bandwidth product of the RDMA Read stream. 4 is good for local, low-latency connections. But over a complicated switch infrastructure, or heaven forbid a dark-fiber long link, I guarantee it will cause a bottleneck. This isn't an issue except for operations that care, but it is certainly detectable. I would like to see whether a pure RDMA Read stream can fully utilize a typical IB fabric, and how much headroom an IRD of 4 provides. Not much, I predict.
>>
>>Closing the connection if IRD is "insufficient to meet goals" isn't a good answer, IMO. How does that benefit interoperability?
>>
>>Thanks for the opportunity to spout off again. Comments welcome!
>>
>>Tom.
>>
>>At 12:43 PM 6/6/2006, Rimmer, Todd wrote:
>>>
>>>> Talpey, Thomas
>>>> Sent: Tuesday, June 06, 2006 10:49 AM
>>>>
>>>> At 10:40 AM 6/6/2006, Roland Dreier wrote:
>>>> > Thomas> This is the difference between "may" and "must". The
>>>> > Thomas> value is provided, but I don't see anything in the spec that
>>>> > Thomas> makes a requirement on its enforcement. Table 107 says the
>>>> > Thomas> consumer can query it, that's about as close as it
>>>> > Thomas> comes. There's some discussion about CM exchange too.
>>>> >
>>>> >This seems like a very strained interpretation of the spec.
>>>>
>>>> I don't see how "strained" has anything to do with it. The spec isn't saying anything either way. So, a legal implementation can make either choice. We're talking about the spec!
>>>>
>>>> But, it really doesn't matter. The point is, an upper layer should be paying attention to the number of RDMA Reads it posts, or else suffer either the queue-stalling or the connection-failing consequences. Bad stuff either way.
>>>>
>>>> Tom.
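(The upper-layer accounting described above can be sketched in a few lines. This is a hypothetical illustration only, not the actual Linux NFS/RDMA server code; locking is omitted, and hw_post_rdma_read stands in for the real verbs post call.)

#include <stddef.h>

struct read_ctx {
    struct read_ctx *next;          /* FIFO link for deferred reads */
    /* ...SGE list, rkey, remote offset, etc. would live here... */
};

struct conn {
    unsigned ord;                   /* negotiated outbound RDMA Read limit */
    unsigned reads_outstanding;     /* reads currently posted to the HCA */
    struct read_ctx *defer_head;    /* software queue of excess reads */
    struct read_ctx *defer_tail;
};

void hw_post_rdma_read(struct conn *c, struct read_ctx *r);  /* assumed */

/* Post a read only while under ORD; otherwise hold it in software,
 * so the hardware send queue never stalls behind a fifth read. */
void ulp_post_read(struct conn *c, struct read_ctx *r)
{
    if (c->reads_outstanding < c->ord) {
        c->reads_outstanding++;
        hw_post_rdma_read(c, r);
        return;
    }
    r->next = NULL;
    if (c->defer_tail)
        c->defer_tail->next = r;
    else
        c->defer_head = r;
    c->defer_tail = r;
}

/* On read completion, one ORD slot frees up: launch the next
 * deferred read, if any. Sends were never blocked in the meantime. */
void ulp_read_done(struct conn *c)
{
    struct read_ctx *r = c->defer_head;

    c->reads_outstanding--;
    if (!r)
        return;
    c->defer_head = r->next;
    if (!c->defer_head)
        c->defer_tail = NULL;
    r->next = NULL;
    c->reads_outstanding++;
    hw_post_rdma_read(c, r);
}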
>>>Somewhere beneath this discussion is a bug in the application or the IB stack. I'm not sure which "may" in the spec you are referring to, but the "may"s I have found are all for cases where the responder might support only 1 outstanding request. In all cases the negotiation protocol must be followed, and the requestor is not allowed to exceed the negotiated limit.
>>>
>>>The mechanism should be:
>>>
>>>The client queries its local HCA and determines its responder resources (e.g. the number of concurrent outstanding RDMA Reads on the wire from the remote end, for which this end will respond with the read data) and its initiator depth (e.g. the number of concurrent outstanding RDMA Reads which this end can initiate as the requestor).
>>>
>>>The client puts the above information in the CM REQ.
>>>
>>>The server similarly gets its information from its local CA and negotiates the values down to the MIN of each side (REP.InitiatorDepth = MIN(REQ.ResponderResources, server's local CA's initiator depth); REP.ResponderResources = MIN(REQ.InitiatorDepth, server's local CA's responder resources)). If the server does not support RDMA Reads, it can REJ.
>>>
>>>If the client decides the negotiated values are insufficient to meet its goals, it can disconnect.
>>>
>>>Each side then sets its QP parameters via Modify QP. Note they too will be mirror images of each other:
>>>
>>>client:
>>>QP.Max RDMA Reads as initiator = REP.ResponderResources
>>>QP.Max RDMA Reads as responder = REP.InitiatorDepth
>>>
>>>server:
>>>QP.Max RDMA Reads as responder = REP.ResponderResources
>>>QP.Max RDMA Reads as initiator = REP.InitiatorDepth
>>>
>>>We have done a lot of high-stress RDMA Read traffic with Mellanox HCAs, and provided the above negotiation is followed, we have seen no issues. Note however that by default a Mellanox HCA typically reports a large InitiatorDepth (128) and a modest ResponderResources (4-8). Hence, when I hear that Responder Resources must be grown to 128 for some application to work reliably, it implies the negotiation I outlined above is not being followed.
>>>
>>>Note that the ordering rules in Table 76 of IBTA 1.2 show how Reads and Writes on a send queue are ordered. There are many cases where an op can pass an outstanding RDMA Read, hence it is not always bad to queue extra RDMA Reads. If needed, the Fence can be set to force ordering.
>>>
>>>For many apps, it's going to be better to get the items onto the queue and let the QP handle the outstanding-reads case rather than have the app add a level of queuing for this purpose. Letting the HCA do the queuing will allow for more rapid initiation of subsequent reads.
>>>
>>>Todd Rimmer
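(For completeness, here is a rough sketch of the server side of the MIN() negotiation Todd outlines, mapped onto libibverbs Modify QP attributes. The CM exchange itself is elided: req_initiator_depth and req_responder_resources stand for the values carried in the CM REQ, and local_* for the limits queried from the local CA, as in the query sketch near the top of this message. Field and flag names are my recollection of libibverbs, and the masks shown omit the other attributes a real RTR/RTS transition requires, so treat this as an assumption-laden sketch rather than working code.)

#include <stdint.h>
#include <string.h>
#include <infiniband/verbs.h>

static uint8_t min8(uint8_t a, uint8_t b) { return a < b ? a : b; }

int server_negotiate(struct ibv_qp *qp,
                     uint8_t req_initiator_depth,      /* from CM REQ */
                     uint8_t req_responder_resources,  /* from CM REQ */
                     uint8_t local_initiator_depth,    /* from local CA */
                     uint8_t local_responder_resources)
{
    struct ibv_qp_attr attr;

    /* REP.InitiatorDepth = MIN(REQ.ResponderResources, local initiator depth) */
    uint8_t rep_initiator_depth =
        min8(req_responder_resources, local_initiator_depth);
    /* REP.ResponderResources = MIN(REQ.InitiatorDepth, local responder resources) */
    uint8_t rep_responder_resources =
        min8(req_initiator_depth, local_responder_resources);

    /* Server QP, max reads as responder: set at the RTR transition. */
    memset(&attr, 0, sizeof attr);
    attr.qp_state = IBV_QPS_RTR;
    attr.max_dest_rd_atomic = rep_responder_resources;
    if (ibv_modify_qp(qp, &attr, IBV_QP_STATE | IBV_QP_MAX_DEST_RD_ATOMIC
                      /* | remaining required RTR attributes, elided */))
        return -1;

    /* Server QP, max reads as initiator: set at the RTS transition. */
    memset(&attr, 0, sizeof attr);
    attr.qp_state = IBV_QPS_RTS;
    attr.max_rd_atomic = rep_initiator_depth;
    if (ibv_modify_qp(qp, &attr, IBV_QP_STATE | IBV_QP_MAX_QP_RD_ATOMIC
                      /* | remaining required RTS attributes, elided */))
        return -1;

    return 0;
}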