Mike, I am not arguing to change the standard. I am simply
saying I do not want to be a victim of the default. It is my
belief that very few upper layer programmers are aware of
this, btw.

The Linux NFS/RDMA upper layer implementation already deals
with the issue, as I mentioned. It would certainly welcome a
higher available IRD on Mellanox hardware, however.

Thanks for your comments.

Tom.

At 01:55 PM 6/15/2006, Michael Krause wrote:

>As one of the authors of IB and iWARP, I can say that both Roland's and Todd's 
>responses are correct and reflect the intent of the specifications.  The number of 
>outstanding RDMA Reads is bounded, and that bound is communicated during session 
>establishment.  The ULP can choose to be aware of this requirement (certainly 
>when we wrote iSER and DA we were well aware of it, and we documented it as such 
>in the ULP specs) and track it from above so that it never sees a stall, or it can 
>stay ignorant and deal with the stall as a result.  This is a ULP choice, and it 
>has been intentionally done that way so that the hardware can be kept as simple 
>and as low-cost as possible while meeting the breadth of ULP needs that drove 
>the development of these technologies.
>
>Tom, you raised this issue during iWARP's definition, and the debate was 
>conducted at least several times.  The outcome of those debates is reflected 
>in iWARP and remains aligned with IB.  So, unless you really want to have the 
>IETF and IBTA go and modify their specs, I believe you'll have to deal with 
>the issue just as other ULPs are doing today: be aware of the constraint and 
>write the software accordingly.  The open source community isn't really the 
>right forum to change the iWARP and IB specifications at the end of the day.  
>Build a case in the IETF and IBTA and let those bodies determine whether it is 
>appropriate to modify their specs or not.  And yes, it would be a modification 
>of the specs, and therefore of the hardware implementations as well, to address 
>any interoperability requirements that would result (the proposed change could 
>fragment the hardware offerings, as there are many thousands of devices in the 
>market that would not necessarily support it).
>
>Mike
>
>
>
>
>At 12:07 PM 6/6/2006, Talpey, Thomas wrote:
>>Todd, thanks for the set-up. I'm really glad we're having this discussion!
>>
>>Let me give an NFS/RDMA example to illustrate why this upper layer,
>>at least, doesn't want the HCA doing its flow control, or resource
>>management.
>>
>>NFS/RDMA is a credit-based protocol which allows many operations in
>>progress at the server. Let's say the client is currently running with
>>an RPC slot table of 100 requests (a typical value).
>>
>>Of these requests, some workload-specific percentage will be reads,
>>writes, or metadata. All NFS operations consist of one send from
>>client to server, some number of RDMA writes (for NFS reads) or
>>RDMA reads (for NFS writes), and are terminated with one send from
>>server to client.
>>
>>The number of RDMA read or write operations per NFS op depends
>>on the amount of data being read or written, and also the memory
>>registration strategy in use on the client. The highest-performing
>>such strategy is an all-physical one, which results in one RDMA-able
>>segment per physical page. NFS r/w requests are, by default, 32KB,
>>or typically 8 pages. So, 8 RDMA requests (read or write) are usually
>>the result.
>>
>>To illustrate, let's say the client is processing a multi-threaded
>>workload, with (say) 50% reads, 20% writes, and 30% metadata
>>such as lookup and getattr. A kernel build, for example. Therefore,
>>of our 100 active operations, 50 are reads for 32KB each, 20 are
>>writes of 32KB, and 30 are metadata (non-RDMA). 
>>
>>To the server, this results in 100 requests, 100 replies, 400 RDMA
>>writes, and 160 RDMA Reads. Of course, these overlap heavily due
>>to the widely differing latency of each op and the highly distributed
>>arrival times. But, for the example this is a snapshot of current load.
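>>
>>(The arithmetic, for the record: 50 reads x (8 RDMA Writes + 1 reply),
>>20 writes x (8 RDMA Reads + 1 reply), and 30 metadata ops x (1 reply),
>>plus the 100 incoming requests: 400 RDMA Writes, 160 RDMA Reads, 100
>>replies, 100 requests.)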
>>
>>The latency of the metadata operations is quite low, because lookup
>>and getattr are acting on what is effectively cached data. The reads
>>and writes, however, are much longer, because they reference the
>>filesystem. When disk queues are deep, they can take many ms.
>>
>>Imagine what happens if the client's IRD is 4 and the server ignores
>>its local ORD. As soon as a write begins execution, the server posts
>>8 RDMA Reads to fetch the client's write data. The first 4 RDMA Reads
>>are sent, the fifth stalls, and stalls the send queue! Even when three
>>RDMA Reads complete, the queue remains stalled; it doesn't unblock
>>until the fourth is done and all the RDMA Reads have been initiated.
>>
>>But, what just happened to all the other server send traffic? All those
>>metadata replies, and other reads which completed? They're stuck,
>>waiting for that one write request. In my example, these number 99 NFS
>>ops, i.e. 654 WRs! All for one NFS write! The client operation stream
>>effectively became single threaded. What good is the "rapid initiation
>>of RDMA Reads" you describe in the face of this?
>>
>>Yes, there are many arcane and resource-intensive ways around it.
>>But the simplest by far is to count the RDMA Reads outstanding, and
>>for the *upper layer* to honor ORD, not the HCA. Then, the send queue
>>never blocks, and the operation stream never loses parallelism. This
>>is what our NFS server does.
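>>
>>This isn't the actual Linux NFS server code, but a rough kernel-style
>>sketch of that accounting might look like the following (names are made
>>up, post_send() is a hypothetical helper, and the locking a real
>>implementation needs is omitted):
>>
>>    #include <linux/list.h>
>>
>>    struct read_wr {
>>            struct list_head list;
>>            /* ... the actual RDMA Read work request ... */
>>    };
>>
>>    struct read_limiter {
>>            int ord;                  /* negotiated outbound RDMA Read limit */
>>            int outstanding;          /* RDMA Reads currently on the wire    */
>>            struct list_head pending; /* Read WRs deferred by the ULP        */
>>    };
>>
>>    extern void post_send(struct read_wr *wr);  /* hypothetical helper */
>>
>>    /* Issue a Read only if it stays within ORD; otherwise park it in
>>     * the ULP, so the hardware send queue itself never stalls. */
>>    static void submit_read(struct read_limiter *rl, struct read_wr *wr)
>>    {
>>            if (rl->outstanding < rl->ord) {
>>                    rl->outstanding++;
>>                    post_send(wr);
>>            } else {
>>                    list_add_tail(&wr->list, &rl->pending);
>>            }
>>    }
>>
>>    /* On Read completion, launch the next deferred Read, if any. */
>>    static void read_done(struct read_limiter *rl)
>>    {
>>            struct read_wr *next;
>>
>>            rl->outstanding--;
>>            if (list_empty(&rl->pending))
>>                    return;
>>            next = list_first_entry(&rl->pending, struct read_wr, list);
>>            list_del(&next->list);
>>            rl->outstanding++;
>>            post_send(next);
>>    }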
>>
>>As to the depth of IRD, this is a different calculation: it's the
>>delay-bandwidth product of the RDMA Read stream. 4 is good for local,
>>low-latency connections.
>>But over a complicated switch infrastructure, or heaven forbid a dark fiber
>>long link, I guarantee it will cause a bottleneck. This isn't an issue except
>>for operations that care, but it is certainly detectable. I would like to see
>>if a pure RDMA Read stream can fully utilize a typical IB fabric, and how
>>much headroom an IRD of 4 provides. Not much, I predict.
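>>
>>(To put rough numbers on that: a link moving about 1 GB/s with a 10 us
>>round trip holds roughly 10 KB in flight, so with 4 KB page-sized Reads
>>an IRD of 4 is already close to the minimum needed to keep one stream
>>busy; stretch the round trip to 100 us across a big fabric and you'd
>>want on the order of 25 Reads outstanding. These figures are only
>>illustrative, not measurements.)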
>>
>>Closing the connection if IRD is "insufficient to meet goals" isn't a good
>>answer, IMO. How does that benefit interoperability? 
>>
>>Thanks for the opportunity to spout off again. Comments welcome!
>>
>>Tom.
>>
>>At 12:43 PM 6/6/2006, Rimmer, Todd wrote:
>>>
>>>
>>>> Talpey, Thomas
>>>> Sent: Tuesday, June 06, 2006 10:49 AM
>>>> 
>>>> At 10:40 AM 6/6/2006, Roland Dreier wrote:
>>>> >    Thomas> This is the difference between "may" and "must". The value
>>>> >    Thomas> is provided, but I don't see anything in the spec that
>>>> >    Thomas> makes a requirement on its enforcement. Table 107 says the
>>>> >    Thomas> consumer can query it, that's about as close as it
>>>> >    Thomas> comes. There's some discussion about CM exchange too.
>>>> >
>>>> >This seems like a very strained interpretation of the spec.  For
>>>> 
>>>> I don't see how strained has anything to do with it. It's not saying
>>>> anything either way. So, a legal implementation can make either choice.
>>>> We're talking about the spec!
>>>> 
>>>> But, it really doesn't matter. The point is, an upper layer should be
>>>> paying attention to the number of RDMA Reads it posts, or else suffer
>>>> either the queue-stalling or connection-failing consequences. Bad stuff
>>>> either way.
>>>> 
>>>> Tom.
>>>
>>>Somewhere beneath this discussion is a bug in the application or IB
>>>stack.  I'm not sure which "may" in the spec you are referring to, but
>>>the "may"s I have found all are for cases where the responder might
>>>support only 1 outstanding request.  In all cases the negotiation
>>>protocol must be followed and the requestor is not allowed to exceed the
>>>negotiated limit.
>>>
>>>The mechanism should be:
>>>client queries its local HCA and determines responder resources (e.g. the
>>>number of concurrent outstanding RDMA reads on the wire from the remote
>>>end, where this end will respond with the read data) and initiator depth
>>>(e.g. the number of concurrent outstanding RDMA reads which this end can
>>>initiate as the requestor).
>>>
>>>client puts the above information in the CM REQ.
>>>
>>>server similarly gets its information from its local CA and negotiates
>>>the values down to the MIN of each side (REP.InitiatorDepth =
>>>MIN(REQ.ResponderResources, server's local CA's initiator depth);
>>>REP.ResponderResources = MIN(REQ.InitiatorDepth, server's local CA's
>>>responder resources)).  If the server does not support RDMA Reads, it can
>>>REJ.
>>>
>>>If the client decides the negotiated values are insufficient to meet its
>>>goals, it can disconnect.
>>>
>>>Each side sets its QP parameters via modify QP appropriately.  Note they
>>>too will be mirror images of each other:
>>>client:
>>>QP.Max RDMA Reads as Initiator = REP.ResponderResources
>>>QP.Max RDMA reads as responder = REP.InitiatorDepth
>>>
>>>server:
>>>QP.Max RDMA Reads as responder = REP.ResponderResources
>>>QP.Max RDMA reads as initiator = REP.InitiatorDepth
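>>>
>>>A rough sketch of that exchange and the QP programming, using libibverbs
>>>names (illustrative only: the req_*/rep_* values stand for whatever the
>>>CM REQ/REP actually carries, and error handling is omitted):
>>>
>>>    #include <infiniband/verbs.h>
>>>
>>>    #define MIN(a, b)  ((a) < (b) ? (a) : (b))
>>>
>>>    /* Server side: clamp what the client asked for to the local CA's
>>>     * limits before answering in the REP. */
>>>    static void negotiate(int req_init_depth, int req_resp_res,
>>>                          int local_init_depth, int local_resp_res,
>>>                          int *rep_init_depth, int *rep_resp_res)
>>>    {
>>>            *rep_init_depth = MIN(req_resp_res, local_init_depth);
>>>            *rep_resp_res   = MIN(req_init_depth, local_resp_res);
>>>    }
>>>
>>>    /* Client side: program the QP as the mirror image of the REP values
>>>     * (in practice folded into the usual RTR/RTS modify-QP calls). */
>>>    static int apply_client(struct ibv_qp *qp,
>>>                            int rep_init_depth, int rep_resp_res)
>>>    {
>>>            struct ibv_qp_attr attr = { 0 };
>>>
>>>            attr.max_rd_atomic      = rep_resp_res;   /* Reads this side initiates */
>>>            attr.max_dest_rd_atomic = rep_init_depth; /* Reads this side services  */
>>>            return ibv_modify_qp(qp, &attr,
>>>                                 IBV_QP_MAX_QP_RD_ATOMIC |
>>>                                 IBV_QP_MAX_DEST_RD_ATOMIC);
>>>    }
>>>
>>>The server programs its QP the same way with the two REP values swapped.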
>>>
>>>We have done a lot of high-stress RDMA Read traffic with Mellanox HCAs
>>>and, provided the above negotiation is followed, we have seen no issues.
>>>Note however that by default a Mellanox HCA typically reports a large
>>>InitiatorDepth (128) and a modest ResponderResources (4-8).  Hence when
>>>I hear that Responder Resources must be grown to 128 for some
>>>application to reliably work, it implies the negotiation I outlined
>>>above is not being followed.
>>>
>>>Note that the ordering rules in table 76 of IBTA 1.2 show how reads and
>>>writes on a send queue are ordered.  There are many cases where an op can
>>>pass an outstanding RDMA read, hence it is not always bad to queue extra
>>>RDMA reads.  If needed, the Fence can be set to force ordering.
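>>>
>>>With libibverbs, for instance, the fence is just a flag on the work
>>>request (an illustrative fragment, not from any particular ULP):
>>>
>>>    struct ibv_send_wr wr = { 0 };
>>>
>>>    wr.opcode     = IBV_WR_SEND;
>>>    wr.send_flags = IBV_SEND_FENCE | IBV_SEND_SIGNALED;
>>>    /* This send will not begin until all prior RDMA Reads on the
>>>     * send queue have completed. */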
>>>
>>>For many apps, it's going to be better to get the items onto the queue and
>>>let the QP handle the outstanding-reads case rather than have the app
>>>add a level of queuing for this purpose.  Letting the HCA do the queuing
>>>will allow for more rapid initiation of subsequent reads.
>>>
>>>Todd Rimmer
>>
>>

