Re: [openib-general] IB routing discussion summary
At 11:49 AM 2/21/2007, Sean Hefty wrote: I sent a message on this topic to the IBTA several days ago, but I am still awaiting details (likely early next week). Unclear if that will occur. I just responded to some e-mail in the IBTA on the router subject as well. Given that discussion, I suspect it will be some time coming to fully answer the router dilemma. It should not be carried in the CM REQ. The SLID / DLID of the router ports should be derived through local subnet SA / SM query. When a CM REQ traverses one or more subnets there will be potentially many SLID / DLID involved in the communication. Each router should be populating its routing tables in order to build the new LRH attached to the GRH / CM REQ that it is forwarding to the next hop. I'm referring to configuration of the QP, not the operation of the routers. To establish a connection, the passive side QP needs to transition from Init to RTR. As part of that transition, the modify QP verb needs as input the Destination LID of its local router. It sounds like you expect the passive side to perform an SA query to obtain its own local routing information, which would essentially invalidate the data carried in the primary and alternate path fields in the CM REQ. The source always queries to obtain a subnet-local router Port. A sink can simply reflect back the LRH with source / destination LID reversed, assuming it had such information, or it can query to find the optimal / preferred subnet-local router Port. From reading 12.7.11, 13.5.1, and 17.4, I do not believe that such a requirement was expected to be placed on the passive side of a connection. The initial response I received agreed with this. I'd need to go back but the architecture is predicated that the SM and SA are strictly local and for security purposes their communication should remain local. Higher level management entities built to communicate with SM and SA are responsible for cross subnet communications without exposing the SA or SM to direct interaction. P_Key and Q_Key management across subnets is an example of such communication across subnets that would not be exposed to the SA and SM. My initial thoughts are that this sounds like a good idea. It's not eliminating the need for interacting with a remote SA, so much as it abstracts it to another entity. My hope is that we can reach an agreement on the CM REQ. Depending on that, we still need to determine whether the existing SA attributes are sufficient to allow forming inter-subnet connections, and if they are, whether such attributes can be obtained. A lot of discussion will be required within the IBTA to nail anything down. As I noted above, I just provided answers to a number of questions posed as well as opened up perhaps a few more. I am not aware of a TTM to complete this work but clearly some amount of standardization is required, and it will take a bit to define the scope so that the specification does not become so large that it will take a significant amount of time to develop and, more importantly, significant resources and time to validate that the routing protocol is solid. Routing protocols are not as simple as some may think - they vary as a function of the functional robustness and scalability provided. For now, I'll assume this discussion is on hold until the IBTA gets its act together. Mike ___ openib-general mailing list openib-general@openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
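[Editor's sketch, not from the original thread: a minimal libibverbs illustration of the passive-side Init-to-RTR step being debated above. The subnet-local piece is the DLID, which for an off-subnet peer names a local router port, while the GRH fields carry the global addressing. The function name and all constants are assumptions for illustration only.]

#include <string.h>
#include <stdint.h>
#include <infiniband/verbs.h>

/* Passive side: move an RC QP from INIT to RTR.  For an off-subnet peer the
 * DLID programmed here is the *local* router's port LID; the GRH identifies
 * the remote endnode.  All constants are illustrative. */
static int passive_to_rtr(struct ibv_qp *qp, union ibv_gid remote_gid,
                          uint32_t remote_qpn, uint32_t remote_psn,
                          uint16_t local_router_lid, uint8_t port_num)
{
    struct ibv_qp_attr attr;
    memset(&attr, 0, sizeof(attr));

    attr.qp_state           = IBV_QPS_RTR;
    attr.path_mtu           = IBV_MTU_2048;
    attr.dest_qp_num        = remote_qpn;
    attr.rq_psn             = remote_psn;
    attr.max_dest_rd_atomic = 1;
    attr.min_rnr_timer      = 12;

    attr.ah_attr.dlid       = local_router_lid;  /* subnet-local: the router port */
    attr.ah_attr.sl         = 0;
    attr.ah_attr.port_num   = port_num;

    attr.ah_attr.is_global         = 1;          /* the GRH is what crosses subnets */
    attr.ah_attr.grh.dgid          = remote_gid;
    attr.ah_attr.grh.sgid_index    = 0;
    attr.ah_attr.grh.hop_limit     = 64;
    attr.ah_attr.grh.traffic_class = 0;
    attr.ah_attr.grh.flow_label    = 0;

    return ibv_modify_qp(qp, &attr,
                         IBV_QP_STATE | IBV_QP_AV | IBV_QP_PATH_MTU |
                         IBV_QP_DEST_QPN | IBV_QP_RQ_PSN |
                         IBV_QP_MAX_DEST_RD_ATOMIC | IBV_QP_MIN_RNR_TIMER);
}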
Re: [openib-general] IB routing discussion summary
At 02:05 PM 2/15/2007, Sean Hefty wrote: Is this first an IBTA problem to solve if you believe there is a problem? Based on my interpretation, I do not believe that there's an error in the architecture. It seems consistent. Additional clarification of what PathRecord fields mean when the GIDs are on different subnets may be needed, and a change to the architecture may make things easier to implement, but that's a separate matter. I contend CM does not require anything that is subnet local other than to target a given router port which should be derived from local SM/SA only Then please state how the passive side obtains the information (e.g. SLID/DLID) it needs in order to configure its QP. I claim that information is carried in the CM REQ. It should not be carried in the CM REQ. The SLID / DLID of the router ports should be derived through local subnet SA / SM query. When a CM REQ traverses one or more subnets there will be potentially many SLID / DLID involved in the communication. Each router should be populating its routing tables in order to build the new LRH attached to the GRH / CM REQ that it is forwarding to the next hop. The alternatives that I see are: 1. The passive side extracts the data from the LRH that carries the CM REQ. 2. The passive side issues its own local path record query. Will you please clarify where this information comes from? The router protocol determines path to the next hop. As noted in prior e-mails, the router works in conjunction with the SM/SA to populate its database so that any CM or other query for a path record to get to / from the router can be derived and optimized based on local policy, e.g. QoS, within each subnet. I will further state that SA-SA communication sans perhaps a P_Key / Q_Key service lookup should be avoided wherever possible. I agree - which is why my proposal avoided SA-SA communication. I see nothing in the architecture that prohibits a node from querying an SA that is not on its local subnet. I'd need to go back but the architecture is predicated that the SM and SA are strictly local and for security purposes their communication should remain local. Higher level management entities built to communicate with SM and SA are responsible for cross subnet communications without exposing the SA or SM to direct interaction. P_Key and Q_Key management across subnets is an example of such communication across subnets that would not be exposed to the SA and SM. Mike ___ openib-general mailing list openib-general@openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
Re: [openib-general] Immediate data question
At 09:37 PM 2/14/2007, Devesh Sharma wrote: On 2/14/07, Michael Krause [EMAIL PROTECTED] wrote: At 05:37 AM 2/13/2007, Devesh Sharma wrote: On 2/12/07, Devesh Sharma [EMAIL PROTECTED] wrote: On 2/10/07, Tang, Changqing [EMAIL PROTECTED] wrote: Not for the receiver, but the sender will be severely slowed down by having to wait for the RNR timeouts. RNR = Receiver Not Ready so by definition, the data flow isn't going to progress until the receiver is ready to receive data. If a receive QP enters RNR for a RC, then it is likely not progressing as desired. RNR was initially put in place to enable a receiver to create back pressure to the sender without causing a fatal error condition. It should rarely be entered and therefore should have negligible impact on overall performance; however, when a RNR occurs, no forward progress will occur so performance is essentially zero. Mike: I still do not quite understand this issue. I have two situations that have RNR triggered. 1. Process A and process B are connected with a QP. A first posts a send to B, B does not post a receive. Then A and B are doing long-running RDMA_WRITEs to each other, and A and B just check memory for the RDMA_WRITE message. Finally B will post a receive. Does the first pending send in A block all the later RDMA_WRITEs? According to the IBTA spec the HCA will process WR entries in the strict order in which they are posted, so the send will block all WRs posted after this send. Unless the HCA has multiple processing elements, and I think even then processing order will be maintained by the HCA. If not, since RNR is triggered periodically till B posts a receive, does it affect the RDMA_WRITE performance between A and B? 2. Extend the above to three processes: A connects to B, B connects to C, so B has two QPs, but one CQ. A posts a send to B, B does not post a receive, post ordering across QPs is not guaranteed hence the presence of the same CQ or different CQs will not affect anything. rather B and C are doing long-running RDMA_WRITEs, or send/recv. But B If the RDMA WRITE is _on_ B, no effect on performance. If the RDMA WRITE is _on_ C, I am sorry, I had missed that in both cases the same DMA channel is in use. it _may_ affect the performance, since the load is on the same HCA. In the case of Send/Recv it again _may_ affect the performance, for the same reason. Seems orthogonal. Any time h/w is shared, multiple flows will have an impact on one another. That is why we have the different arbitration mechanisms to enable one to control that impact. Please, can you explain it more clearly? Most I/O devices are shared by multiple applications / kernel subsystems. Hence, the device acts as a serialization point for what goes on the wire / link. Sharing = resource contention, and in order to add any structure to that contention, a number of technologies provide arbitration options. In the case of IB, the arbitration is confined to VL arbitration where a given data flow is assigned to a VL and that VL is serviced at some particular rate. A number of years ago I wrote up how one might also provide QP arbitration (not part of the IBTA specifications) and I understand some implementations have incorporated that or a variation of the mechanisms into their products. In addition to IB link contention, there is also PCI link / bus contention. For PCIe, given most designs did not want to waste resources on multiple VCs, there really isn't any standard arbitration mechanism. However, many devices, especially a device like an HCA or an RNIC, already have the concept of separate resource domains, e.g. 
QP, and they provide a mechanism to associate how the QP's DMA requests or interrupt requests are scheduled to the PCIe link. must send RNR periodically to A, right? So does the pending message from A affect B's overall performance between B and C? But the RNR NAK is not for a very long time; possibly you will not even be able to observe this performance hit. The moment the rnr_counter expires the connection will be broken! Keep in mind the timeout can be infinite. RNR NAKs are not expected to be frequent so their performance impact was considered reasonable. Thanks, I missed that. It is a subtlety within the specification that is easy to miss. Mike ___ openib-general mailing list openib-general@openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
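[Editor's sketch, not from the original thread: where the RNR knobs discussed above live in libibverbs. min_rnr_timer is programmed on the responder at RTR; rnr_retry is programmed on the requester at RTS, and the value 7 is the special "retry forever" encoding, i.e. the infinite case noted above. All values are illustrative.]

#include <string.h>
#include <stdint.h>
#include <infiniband/verbs.h>

/* Requester side: move an RC QP from RTR to RTS.  rnr_retry = 7 means the
 * HCA retries forever after an RNR NAK instead of erroring the QP. */
static int requester_to_rts(struct ibv_qp *qp, uint32_t sq_psn)
{
    struct ibv_qp_attr attr;
    memset(&attr, 0, sizeof(attr));

    attr.qp_state      = IBV_QPS_RTS;
    attr.sq_psn        = sq_psn;
    attr.timeout       = 14;  /* local ACK timeout, roughly 67 ms          */
    attr.retry_cnt     = 7;   /* transport (timeout/sequence) retries      */
    attr.rnr_retry     = 7;   /* 7 == infinite RNR retries                 */
    attr.max_rd_atomic = 1;

    return ibv_modify_qp(qp, &attr,
                         IBV_QP_STATE | IBV_QP_SQ_PSN | IBV_QP_TIMEOUT |
                         IBV_QP_RETRY_CNT | IBV_QP_RNR_RETRY |
                         IBV_QP_MAX_QP_RD_ATOMIC);
}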
Re: [openib-general] IB routing discussion summary
At 11:39 AM 2/15/2007, Sean Hefty wrote: Ideas were presented around trying to construct an 'inter-subnet path record' that contained the following: - Side A GRH.SGID = active side's Port GID - Side A GRH.DGID = passive side's Port GID - Side A LRH.SLID = any active side's port LID - Side A LRH.DLID = A subnet router - Side A LRH.SL = SL to A subnet router - Side B GRH.SGID = Side A GRH.DGID - Side B GRH.DGID = Side A GRH.SGID - Side B LRH.SLID = any passive side's port LID - Side B LRH.DLID = B subnet router - Side B LRH.SL = SL to B subnet router Until I can become convinced that the above isn't needed, I've been trying to brainstorm of ways to obtain this information. Is this first an IBTA problem to solve if you believe there is a problem? I believe the track you are on is incorrect and any attempt to surface subnet local information across subnets will create unnecessary complexity and therefore makes such solutions less practical to execute within the industry. I've tried to illustrate the role of the router, how the flows work, etc. I believe these to be correct and are reflected not only in the existing specifications but also the prior router specification work and thinking. They also parallel the IP world quite nicely which should also lend credence that subnet-local information does not need to be exchanged between subnets. I contend CM does not require anything that is subnet local other than to target a given router port which should be derived from local SM/SA only information. I will further state that SA-SA communication sans perhaps a P_Key / Q_Key service lookup should be avoided wherever possible. I strongly urge you to take this problem to the IBTA where any issues regarding specification interpretation can be sorted out and an official position taken. This will yield a faster and more successful investigation into whether there is a problem and if so, how best to solve it. Mike 0. Have the SA return pairs of PathRecords for inter-subnet queries. But, since this simply punts the problem to the SA, my other thought is to define the following: 1. Inter-subnet PathRecord/MultiPathRecord Get/GetTable requests require both an SGID and DGID, one of which must be subnet local to the processing SA. 2. PathRecord/MultiPathRecord Get/GetTable request fields are relative to the subnet specified by the SGID. 3. PathRecord GetResp/GetTableResp response fields are relative to the subnet local to the processing SA. 4. SAs are addressable by a well-known GID suffix. I think this may allow establishing inter-subnet connections. As an example of its usage: a. Active side issues a PathRecord query to the local SA with SGID=local, DGID=remote. b. SA responds with PathRecord(s). c. Active side selects local PathRecord P1. d. Active side issues a PathRecord query to the remote SA using PathRecord P1 to format the request: SGID, DGID, SLID, DLID, TC, FL, SL, etc. e. The remote SA responds with PathRecord(s). The SA must ensure that packets injected into the internetwork using P1 will route to the returned records. f. Active side selects remote PathRecord P2. g. Active side validates that remote packets injected using P2 route to P1. At this point, the active side should have path information that can be used to configure the QPs for a connection. Assuming that this will work, what I don't like about it is the validation at step g. This adds a third query that I don't see a way to eliminate. If the check fails, the client restarts at step c. 
- Sean ___ openib-general mailing list openib-general@openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
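[Editor's sketch: the two-sided path information listed in the proposal above, restated as a plain C structure to make its shape concrete. This is purely illustrative; no such attribute exists in the IBTA MADs, and the type and field names are invented for this sketch.]

#include <stdint.h>
#include <infiniband/verbs.h>   /* union ibv_gid */

/* One subnet's half of the proposed 'inter-subnet path record'. */
struct subnet_half {
    union ibv_gid sgid;   /* GRH.SGID as seen on this subnet            */
    union ibv_gid dgid;   /* GRH.DGID as seen on this subnet            */
    uint16_t      slid;   /* LRH.SLID: one of the local port's LIDs     */
    uint16_t      dlid;   /* LRH.DLID: this subnet's router port LID    */
    uint8_t       sl;     /* LRH.SL toward that router                  */
};

/* Side A = active side's subnet, Side B = passive side's subnet; by the
 * construction above, side_b.sgid equals side_a.dgid and vice versa. */
struct inter_subnet_path {
    struct subnet_half side_a;
    struct subnet_half side_b;
};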
Re: [openib-general] IB routing discussion summary
I do not see the need for any of this. The router protocol should be designed to work with each subnet's SM / SA to provide information on what GID prefix is on each router Port. This is used to look up the subnet local LRH fields. The only cross-subnet challenges are global based, e.g. what is the P_Key to use and how to manage those across subnets or how should TClass be interpreted to achieve a consistent behavior independent of how the TClass is subnet local mapped to a SL.These were the types of challenges remaining when we stopped development of the router specification. If the IBTA decides to develop a router specification then it might be best to join that effort and work it out in detail before attempting to develop the management infrastructure. May be able to slightly lag in order to validate the technical directions that the spec will take without having to wait until 1.0 to say, yep, this looks good or here is where you need to change the spec. Not clear what can be developed until there is a router specification to execute to in the industry. Mike At 01:17 PM 2/13/2007, Sean Hefty wrote: Here's a first take at summarizing the IB routing discussion. The following spec references are noted: 9.6.1.5 C9-54. The SLID shall be validated (for connected QPs). 12.7.11. CM REQ Local Port LID - is LID of remote router. 13.5.4: Defines reversible paths. The main discussion point centered on trying to meet 9.6.1.5 C9-54. This requires that the forward and reverse data flows between two QPs traverse the same router LID on both subnets. The idea was made to try to eliminate this compliance statement for packets carrying a GRH, but this is viewed as going against the spirit of IBA. Ideas were presented around trying to construct an 'inter-subnet path record' that contained the following: - Side A GRH.SGID = active side's Port GID - Side A GRH.DGID = passive side's Port GID - Side A LRH.SLID = any active side's port LID - Side A LRH.DLID = A subnet router - Side A LRH.SL = SL to A subnet router - Side B GRH.SGID = Side A GRH.DGID - Side B GRH.DGID = Side A GRH.SGID - Side B LRH.SLID = any passive side's port LID - Side B LRH.DLID = B subnet router - Side B LRH.SL = SL to B subnet router It is still unclear how such a record can be constructed. But communication with remote SAs might be achieved by using a well-known GID suffix. It's also unclear whether the fields in a path record are relative to the SA's subnet or the SGID. It's anticipated that SAs will need to interact with routers, but in an unspecified manner. ___ openib-general mailing list openib-general@openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
Re: [openib-general] Problem is routing CM REQ
At 01:36 PM 2/14/2007, Sean Hefty wrote: Assume that the active and passive sides of a connection request are on different subnets and: Active side - LID 1 Active side router - LID 2 Passive side - LID 93 Passive side router - LID 94 What values are you suggesting are used for: Active side QP - DLID Passive side QP - DLID CM REQ Primary Local Port LID Subnet A is: QP Port LID 1 Router A Port LID 2 Subnet B is: QP Port LID 93 Router B Port LID 94 Process steps: - Router A populates SM / SA A with the GID prefix it can route. SM / SA A will have configured the router Port with the appropriate local route information and hence have assigned it LID 2. - CM associated with Port LID 1 queries the SM / SA to identify a path to a GID Prefix. SM / SA returns a path record indicating a global route, i.e. one that requires a GRH, is available and provides the CM with the information targeting router Port LID 2. - CM creates a REQ and populates the global information to identify the remote endnode. The LRH generated targets Port LID 2. The GRH is generated to target the remote subnet so the router will comprehend how to process the packet. - Router A receives the packet and examines the GRH. Via its router protocol, it has previously identified what router Port will lead to the next hop on the path to the destination endnode. - If the endnode is subnet local, say subnet B, then the router generates a LRH with QP LID 93 and emits that on router Port LID 94. - QP in subnet B receives the CM REQ and validates the LRH. Given these messages are via UD service and not RC / UC, the validation rules for the LRH are different. The CM agent processes the request and returns an appropriate response by filling in a GRH that replaces the SGID with the DGID and so forth so the addresses are basically reflected back. The response uses QP port LID 93 and targets router Port 94. - Router B Port 94 receives the response. It parses the GRH and determines the next hop port. In this example, the response goes out router A Port 2 and targets QP Port LID 1. The LRH is generated using these fields. Again, since CM is targeting a UD QP, the LRH validation rules are different. - Once the connection is established, the QP on subnet A will send packets to QP on subnet B using a GRH that is processed by the router with each QP using a LRH that targets the router port locally attached to its subnet. The router is responsible for generating a LRH to forward to the next hop. These packets are now in a RC / UC data flow so the LRH validation is per the sections cited in this e-mail string. In all cases, the router protocol is responsible for generation of a LRH that will work within each subnet. There is no exchange of subnet local information between the subnets. Each subnet's SM/SA only tracks what is local to it as well as what GID prefix can be routed via a given LID. If multiple LID can route to a given GID prefix, multiple path records are returned. Which to choose is not specified by the specifications so it can be any policy one desires. If the router protocol communicates a cost to a given path in order to give an indication of appropriateness for a given workload, then this should be communicated to the CM agent. Mike ___ openib-general mailing list openib-general@openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
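[Editor's sketch, not from the original thread: a small libibverbs illustration of the address handle the active-side CM agent in the example above would use to emit the REQ. The LRH targets the local router (Port LID 2) while the GRH carries the remote endnode's GID. The helper name and constants are assumptions for illustration only.]

#include <string.h>
#include <stdint.h>
#include <infiniband/verbs.h>

/* Build a UD address handle whose LRH targets the local router while the
 * GRH targets the remote subnet's endnode (the LID 1 / LID 2 example). */
static struct ibv_ah *ah_via_router(struct ibv_pd *pd, union ibv_gid remote_gid,
                                    uint8_t port_num)
{
    struct ibv_ah_attr ah_attr;
    memset(&ah_attr, 0, sizeof(ah_attr));

    ah_attr.dlid     = 2;   /* Router A's port LID on the local subnet */
    ah_attr.sl       = 0;
    ah_attr.port_num = port_num;

    ah_attr.is_global      = 1;           /* a GRH is required to be routed */
    ah_attr.grh.dgid       = remote_gid;  /* remote subnet prefix + GUID    */
    ah_attr.grh.sgid_index = 0;
    ah_attr.grh.hop_limit  = 64;

    return ibv_create_ah(pd, &ah_attr);
}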
Re: [openib-general] IB routing discussion summary
At 02:02 PM 2/14/2007, Sean Hefty wrote: Mike, are you expecting that routers will modify CM messages as they flow between subnets? The router parses the GRH, strips the LRH, attaches a new LRH to the next hop with the contents of the LRH filled in per its internal policies. Nothing more for the main packet processing. The router interacts with each subnet's SM/SA to insure the path records can be provided to the CM to fill in the right information. Mike ___ openib-general mailing list openib-general@openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
Re: [openib-general] Problem is routing CM REQ
At 03:48 PM 2/12/2007, Sean Hefty wrote: An endnode look up should be to find the address vector to the remote. A look up may return multiple vectors. The SLID would correspond to each local subnet router port that acts as a first-hop destination to the remote subnet.I don't see why the router protocol would not simply enable all paths on the local subnet to a given remote subnet be acquired. All of the work is kept local to the SA / SM in the source subnet when determining a remote path to take. Why is there any need to define more than just this? For an RC QP, we need at least two sets of LIDs. In the simplest case, we need the SLID/router DLID for the local subnet, and the router SLID/DLID for the remote subnet. The problem is in obtaining the SLID/DLID for the remote subnet. Not quite. The router protocol should determine the next hop LID to be used to either reach the destination endnode if in its local subnet or for the next router on the path to the remote. CM only needs to be concerned with what is in a local subnet for finding the router or the endnode. It does not need to comprehend the remote subnet(s) LID. That is the router protocol to determine. CM also must understand the GIDs involved which the router will process to figure out its LID mapping to the next hop. Mike ___ openib-general mailing list openib-general@openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
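[Editor's sketch: one way to see the "local subnet only" lookup in practice is through librdmacm, which issues the path record query to the local SA on the caller's behalf; for an off-subnet destination the returned path would simply name a local router port. The helper name, addresses, and timeouts below are placeholders, not anything from the thread.]

#include <stdio.h>
#include <netdb.h>
#include <rdma/rdma_cma.h>

/* Resolve a route to dst_ip using only the local SA; with a NULL event
 * channel librdmacm operates synchronously. */
static int resolve_local_path(const char *dst_ip)
{
    struct rdma_cm_id *id = NULL;
    struct addrinfo *res = NULL;
    int ret = -1;

    if (rdma_create_id(NULL, &id, NULL, RDMA_PS_TCP))
        return -1;
    if (getaddrinfo(dst_ip, NULL, NULL, &res))
        goto out;

    ret = rdma_resolve_addr(id, NULL, res->ai_addr, 2000 /* ms */);
    if (!ret)
        ret = rdma_resolve_route(id, 2000);   /* local SA path record query */
    if (!ret)
        printf("local SA returned %d path(s)\n", id->route.num_paths);

out:
    if (res)
        freeaddrinfo(res);
    rdma_destroy_id(id);
    return ret;
}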
Re: [openib-general] Immediate data question
At 05:37 AM 2/13/2007, Devesh Sharma wrote: On 2/12/07, Devesh Sharma [EMAIL PROTECTED] wrote: On 2/10/07, Tang, Changqing [EMAIL PROTECTED] wrote: Not for the receiver, but the sender will be severely slowed down by having to wait for the RNR timeouts. RNR = Receiver Not Ready so by definition, the data flow isn't going to progress until the receiver is ready to receive data. If a receive QP enters RNR for a RC, then it is likely not progressing as desired. RNR was initially put in place to enable a receiver to create back pressure to the sender without causing a fatal error condition. It should rarely be entered and therefore should have negligible impact on overall performance; however, when a RNR occurs, no forward progress will occur so performance is essentially zero. Mike: I still do not quite understand this issue. I have two situations that have RNR triggered. 1. Process A and process B are connected with a QP. A first posts a send to B, B does not post a receive. Then A and B are doing long-running RDMA_WRITEs to each other, and A and B just check memory for the RDMA_WRITE message. Finally B will post a receive. Does the first pending send in A block all the later RDMA_WRITEs? According to the IBTA spec the HCA will process WR entries in the strict order in which they are posted, so the send will block all WRs posted after this send. Unless the HCA has multiple processing elements, and I think even then processing order will be maintained by the HCA. If not, since RNR is triggered periodically till B posts a receive, does it affect the RDMA_WRITE performance between A and B? 2. Extend the above to three processes: A connects to B, B connects to C, so B has two QPs, but one CQ. A posts a send to B, B does not post a receive, post ordering across QPs is not guaranteed hence the presence of the same CQ or different CQs will not affect anything. rather B and C are doing long-running RDMA_WRITEs, or send/recv. But B If the RDMA WRITE is _on_ B, no effect on performance. If the RDMA WRITE is _on_ C, it _may_ affect the performance, since the load is on the same HCA. In the case of Send/Recv it again _may_ affect the performance, for the same reason. Seems orthogonal. Any time h/w is shared, multiple flows will have an impact on one another. That is why we have the different arbitration mechanisms to enable one to control that impact. must send RNR periodically to A, right? So does the pending message from A affect B's overall performance between B and C? But the RNR NAK is not for a very long time; possibly you will not even be able to observe this performance hit. The moment the rnr_counter expires the connection will be broken! Keep in mind the timeout can be infinite. RNR NAKs are not expected to be frequent so their performance impact was considered reasonable. Mike Thank you. --CQ Mike ___ openib-general mailing list openib-general@openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general ___ openib-general mailing list openib-general@openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
Re: [openib-general] Problem is routing CM REQ
At 04:10 PM 2/12/2007, Jason Gunthorpe wrote: On Mon, Feb 12, 2007 at 03:31:15PM -0800, Michael Krause wrote: TClass is intended to communicate the end-to-end QoS desired. TClass is then mapped to a SL that is local to each subnet. A flow label is intended to do much the same as in the IP world and is left, in essence, to routers to manage. An endnode look up should be to find the address vector to the remote. A look up may return multiple vectors. The SLID would correspond to each local subnet router port that acts as a first-hop destination to the remote subnet. I don't see why the router protocol would not simply enable all paths on the local subnet to a given remote subnet to be acquired. All of the work is kept local to the SA / SM in the source subnet when determining a remote path to take. Why is there any need to define more than just this? Define a router protocol to communicate each subnet's prefix, TClass, etc. and apply KISS. A management entity that wants to manage how each subnet provides router management in terms of route selection, etc. can be constructed by using the existing protocols / tools combined with a new router protocol which only does DGID to next hop SLID mapping. All of this complexity is due to the RC QP requirement that the SLID of an incoming LRH match the DLID programmed into the QP. Translated into a network with routers this means that for a RC flow to successfully work both the *forward* and *reverse* direction must traverse the same router *LID* not just *port* on both subnets. That is a given since the LID = path and the same path must be used to insure strong ordering is maintained. Please see the little ascii diagram I drew in a prior email to understand my concern. There is no such restriction in a real IP network. It would be akin to having a host match the source MAC address in the ethernet frame to double check that it came from the router port it is sending outgoing packets to. Which means simple one-sided solutions from IP land don't work here. Things work exactly the way you outline today for UD. They don't work at all for the general case of RC. Get rid of the QP requirement and things work the way you outline for RC too. Keep it in and you must use the FlowLabel to force the flows onto the right router LID. The same path must always be used to maintain strong ordering. This is an immutable part of IB technology. That is why I said previously that the QP matching rules are a mistake. The best way to solve this is to change C9-54 to only be in effect if the GRH is not present. I disagree. We were very explicit in how and why we constructed those rules. CM also introduces the much smaller problem of getting the LIDs to the passive side - but that cannot be solved without a broad solution to the RC QP SLID matching problem. Mike ___ openib-general mailing list openib-general@openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
Re: [openib-general] Problem is routing CM REQ
At 01:14 PM 2/13/2007, Sean Hefty wrote: It does not need to comprehend the remote subnet(s) LID. That is the router protocol to determine. CM also must understand the GIDs involved which the router will process to figure out its LID mapping to the next hop. The CM REQ carries the remote router LID (primary local port lid - 12.7.11) and remote endpoint LID (primary remote port lid - 12.7.21). Let me clarify what the specification is saying, which is also what I'm saying. A LID is subnet local; on that we can all agree. The CM REQ contains either the LID of a local subnet CA or the LID of a local router which will move the packet to the next hop to the destination. 12.7.11 is basically saying that the remote LID is the router's LID of the local subnet's router Port. 12.7.21 also refers to the remote LID, but in each subnet that is either the router Port's LID or the destination CA. From an operational flow perspective, CM would:
- Query to see if the destination CA is on the local subnet.
- If yes, then obtain the associated records to find the local LID.
- If no, then obtain the set of records that contain the local addressing to a router Port that will progress connection establishment to the next hop on the way to the destination.
While there isn't a router specification any longer, the basic operation is very much like that of an IP subnet. The router protocol establishes a set of routes for a given subnet prefix and then communicates that to each SM/SA so that queries will resolve the optimal router Port. Chapter 8 provides clear guidance in this regard. Chapter 12 is basically stating what to plug into various fields with all LIDs being only local to the subnet where they are managed. The primary global knowledge that one must have across subnets to establish a connection or communication flow is:
- SGID
- DGID
- P_Key
- Q_Key
There really isn't much more than this to comprehend. The TClass and Flow Labels were expected to be provided via the router protocol so the management requirements are really just query / look up. Mike ___ openib-general mailing list openib-general@openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
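[Editor's sketch, not from the original thread: a concrete illustration of where two of those four global items are programmed on an endnode with libibverbs. The P_Key (via a pkey_index) and, for a UD QP such as the one the CM uses, the Q_Key are supplied at the Init transition; the SGID/DGID travel in the GRH of the address vector. Values are illustrative.]

#include <string.h>
#include <stdint.h>
#include <infiniband/verbs.h>

/* Move a UD QP to INIT, supplying the partition (P_Key index) and Q_Key.
 * IBV_QP_QKEY is only meaningful for UD QPs. */
static int ud_qp_to_init(struct ibv_qp *qp, uint8_t port_num,
                         uint16_t pkey_index, uint32_t qkey)
{
    struct ibv_qp_attr attr;
    memset(&attr, 0, sizeof(attr));

    attr.qp_state   = IBV_QPS_INIT;
    attr.port_num   = port_num;
    attr.pkey_index = pkey_index;  /* selects the P_Key shared across subnets */
    attr.qkey       = qkey;        /* checked against inbound UD packets      */

    return ibv_modify_qp(qp, &attr,
                         IBV_QP_STATE | IBV_QP_PORT |
                         IBV_QP_PKEY_INDEX | IBV_QP_QKEY);
}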
Re: [openib-general] Problem is routing CM REQ
At 02:02 PM 2/13/2007, Jason Gunthorpe wrote: On Tue, Feb 13, 2007 at 12:49:57PM -0800, Michael Krause wrote: Translated into a network with routers this means that for a RC flow to successfully work both the *forward* and *reverse* direction must traverse the same router *LID* not just *port* on both subnets. That is a given since the LID = path and same path must be used to insure strong ordering is maintained. I think you are missing what I'm saying. IB within a subnet has the path selected by the DLID only. The actual path selection is a policy decision outside the scope of the specification - it appears this is your main concern in that the specification does not state take these N parameters and apply the following algorithm to identify a path. The address vector can be comprised of many fields including a LID range. The actual DLID selected is done above as there can be a variety of policies or constraints imposed for a given data flow. I agree that packet switching within is via a DLID. So the construction process for a QP is to choose two enport LIDs, reverse them on one side and then query the SA for the forward and reverse SL. That gives you a pair of workable QPs. SL, LID, etc. are all uploaded into the management database for the SM / SA to access and there can be much more robust information loaded as well that goes well beyond what the IBTA specified in order to provide additional interpretations / information to guide path selection. A query can return multiple records if multi-path has been configured. Policy above is used to construct the CM messages which communicate the preferred path.The CM messages for establishment across subnets should be sufficient in their existing content to work independent of how the actual routing is accomplished. This same procedure doesn't work for routers. Consider a case where a router port has LID 1 and an end port has LIDs 3,4. The end port establishes two RC QPs: #1: SLID=3, DLID=1 #2: SLID=4, DLID=1 Both have the same DGID - how is the router expected to know that QP #1 requires one set of LIDs and QP #2 requires a different set? For all intents and purposes, within a local subnet, a router Port is treated the same as CA. If there are multiple paths between a router Port and a given CA Port, i.e. multiple LIDs are configured, then the router is supposed to query the SM / SA database and obtain the appropriate records and make a decision that remains valid for the lifetime of the data flow. The purpose of the TClass is to enable a local mapping to SL which can also be used as input into LID selection. The flow label is left open in its value and was expected to be used much like it is in IP. People considered encoding it or at a minimum, using it as an input parameter to identify the associated LID for the flow but that was not agreed to since the router vendors at the time wanted it left largely opaque. Section 19.2.4.1 seems to make it explicit to me that this is a valid situation. Yes, 19.2.4.1 supports multi-path within a given subnet. To have this work the router must use the flow label to identify the correct DLID. SA/CM must be enhanced in some way to let the two sides exchange flow labels. That is a policy decision or something for a TBD router protocol specification. It is not required to use the Flow Label. This problem is worse if you have multiple independent redundent routers on your subnet, or LMC != 0. Then you now have the problem of SLID matching as well as DLID matching. 
It is no worse due to the existence of multi-path. There are many variables involved in creating a viable router protocol specification which is in part, why the IBTA chose to not complete that work. Strong ordering is maintained in all cases because the routers always make consistent choices for the LRH.DLID on a session by session basis. Agreed, The router is responsible for insuring a consistent path is used for a given flow. That does not preclude multi-path nor does it make multi-path more complicated as a result. That is why I said previously that the QP matching rules are a mistake. The best way to solve this is to change C9-54 to only be in effect if the GRH is not present. I disagree. We were very explicit in how and why we constructed those rules. Do you know of a solution then? If C9-54 is a very deliberate design then it must be that the CM specification in Chapter 12 is not designed to handle the ramifications of C9-54. I just can't see how to fit both CM and C9-54 together into a workable solution. You are arguing about a router protocol problem that does not exist or perhaps I just don't get it. We did progress the router specification or at least the operating models behind it sufficiently to validate that both Chapter 9 and Chapter 12 worked as specified (as well as chapters 8 and 19). Yes, there are implementation issues within
Re: [openib-general] dapl broken for iWARP
At 07:29 AM 2/9/2007, Kanevsky, Arkady wrote: Mike, this is not a DAPL issue. There are 2 ways to deal with it. One is for all ULPs to use private data to exchange CM info. yes, some ULPs, like SDP do that in hello world message. Another is to let CM handle it. This way ULP does not have to deal with it. This is analogous to the IBTA CM IP addressing Annex. It ensure backwards compatibility and does not break any existing apps which use MPA as specified by IETF. No need to bother IETF until we have it working. Given what it took to get MPA specified, I don't see changing the specification for this as likely welcomed by many. The ULP used within the IETF are largely able to solve this problem at their login exchange so unless there is some ground swell of IETF ULP that can't solve it as these do, I think this may be a challenge to gain any traction. Mike Thanks, Arkady Kanevsky email: [EMAIL PROTECTED] Network Appliance Inc. phone: 781-768-5395 1601 Trapelo Rd. - Suite 16.Fax: 781-895-1195 Waltham, MA 02451 central phone: 781-768-5300 -Original Message- From: Michael Krause [mailto:[EMAIL PROTECTED] Sent: Thursday, February 08, 2007 4:27 PM To: Kanevsky, Arkady; Steve Wise; Arlin Davis Cc: openib-general Subject: Re: [openib-general] dapl broken for iWARP At 07:43 AM 2/8/2007, Kanevsky, Arkady wrote: That is correct. I am working with Krishna on it. Expect patches soon. By the way the problem is not DAPL specific and so is a proposed solution. There are 3 aspects of the solution. One is APIs. We suggest that we do not augment these. That is a connection requestor sets its QP RDMA ORD and IRD. When connection is established user can check the QP RDMA ORD and IRD to see what he has now to use over the connection. We may consider to extend QP attributes to support transport specific parameters passing in the future. For example, iWARP MPA CRC request. Second is the semantic that CM provides. The proposal is to match IBCM semantic. That is CM guarantee that local IRD is = remote ORD. This guarantees that incoming RDMA Read requests will not overwhelm the QP RDMA Read capabilities. Again there is not changes to IBCM only to IWCM. Notice that as part of this IWCM will pass down to driver and extract from driver needed info. The final part is iWARP CM extension to exchange RDMA ORD, IRD. This is similar to IBTA Annex for IP Addressing. The harder part that this will eventually require IETF MPA spec extension, and the fact that MPA protocol is implemented in RNIC HW by many vendors, and hence can not be done by IWCM itself. We looked at this quite a bit during the creation of the specification. All of the targeted usage models exchange this information as part of their hello or login exchanges.As such, the hum was to not change MPA to communicate such information and leave it to software to exchange these values through existing mechanisms. I seriously doubt there will be much support for modifying the MPA specification at this stage since the implementations are largely complete and a modification would have to deal with the legacy interoperability issue which likely would be solved in software any way. It would be simpler to simply modify the underlying DAPL implementation to exchange the information and keep this hidden from both the application and the RNIC providers. Mike Thanks, Arkady Kanevsky email: [EMAIL PROTECTED] Network Appliance Inc. phone: 781-768-5395 1601 Trapelo Rd. 
- Suite 16.Fax: 781-895-1195 Waltham, MA 02451 central phone: 781-768-5300 -Original Message- From: Steve Wise [mailto:[EMAIL PROTECTED] Sent: Wednesday, February 07, 2007 6:12 PM To: Arlin Davis Cc: openib-general Subject: Re: [openib-general] dapl broken for iWARP On Wed, 2007-02-07 at 15:05 -0800, Arlin Davis wrote: Steve Wise wrote: On Wed, 2007-02-07 at 14:02 -0600, Steve Wise wrote: Arlin, The OFED dapl code is assuming the responder_resources and initiator_depth passed up on a connection request event are from the remote peer. This doesn't happen for iWARP. In the current iWARP specifications, its up to the application to exchange this information somehow. So these are defaulting to 0 on the server side of any dapl connection over iWARP. This is a fairly recent change, I think. We need to come up with some way to deal with this for OFED 1.2 IMO. Yes, this was changed recently to sync up with the rdma_cm changes that exposed the values. The IWCM could set these to the device max values for instance. That would work fine
Re: [openib-general] Immediate data question
At 09:10 PM 2/11/2007, Devesh Sharma wrote: On 2/10/07, Tang, Changqing [EMAIL PROTECTED] wrote: Not for the receiver, but the sender will be severely slowed down by having to wait for the RNR timeouts. RNR = Receiver Not Ready so by definition, the data flow isn't going to progress until the receiver is ready to receive data. If a receive QP enters RNR for a RC, then it is likely not progressing as desired. RNR was initially put in place to enable a receiver to create back pressure to the sender without causing a fatal error condition. It should rarely be entered and therefore should have negligible impact on overall performance; however, when a RNR occurs, no forward progress will occur so performance is essentially zero. Mike: I still do not quite understand this issue. I have two situations that have RNR triggered. 1. Process A and process B are connected with a QP. A first posts a send to B, B does not post a receive. Then A and B are doing long-running RDMA_WRITEs to each other, and A and B just check memory for the RDMA_WRITE message. Finally B will post a receive. Does the first pending send in A block all the later RDMA_WRITEs? According to the IBTA spec the HCA will process WR entries in the strict order in which they are posted, so the send will block all WRs posted after this send. Unless the HCA has multiple processing elements, and I think even then processing order will be maintained by the HCA. If not, since RNR is triggered The source HCA is responsible for processing work requests in the order they are posted. If the SEND cannot proceed and receives a RNR, then the subsequent RDMA Write should not proceed, i.e. the sequence numbers that define the valid window will not progress and, given IB requires strong ordering within the fabric, nothing sent subsequently should be made visible at the sink HCA. In your example, if A is sending a SEND followed by a RDMA Write, the first check should have been that B had provided an ACK with a credit indicating that a SEND is allowed. If B subsequently removed access to the buffer that had to be posted to provide that credit, then it should trigger a RNR NAK and the subsequent RDMA Writes should not be visible at B since there is now, in effect, a hole in the transmission stream. periodically till B posts a receive, does it affect the RDMA_WRITE performance between A and B? 2. Extend the above to three processes: A connects to B, B connects to C, so B has two QPs, but one CQ. A posts a send to B, B does not post a receive, rather B and C are doing long-running RDMA_WRITEs, or send/recv. But B must send RNR periodically to A, right? So does the pending message from A affect B's overall performance between B and C? Neither IB nor iWARP provides any ordering guarantees between different data flows. This is strictly under application control. Hence, if a RNR NAK or whatever occurs on a RC between A and B, then it has no impact on what occurs between A and C or B and C. It is simply outside the scope of either technology to address. Mike ___ openib-general mailing list openib-general@openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
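[Editor's sketch, not from the original thread: the ordering point made above in libibverbs terms. A SEND and an RDMA WRITE posted to the same RC QP are processed in posting order, so an RNR NAK on the SEND also stalls the WRITE queued behind it. Buffer, rkey, and remote address values are placeholders.]

#include <string.h>
#include <stdint.h>
#include <infiniband/verbs.h>

/* Post a SEND followed by an RDMA WRITE on the same RC QP; the HCA executes
 * them in posting order. */
static int post_send_then_write(struct ibv_qp *qp,
                                struct ibv_sge *send_sge,
                                struct ibv_sge *write_sge,
                                uint64_t remote_addr, uint32_t rkey)
{
    struct ibv_send_wr send_wr, write_wr, *bad;
    memset(&send_wr, 0, sizeof(send_wr));
    memset(&write_wr, 0, sizeof(write_wr));

    send_wr.wr_id      = 1;
    send_wr.opcode     = IBV_WR_SEND;        /* needs a posted receive at B  */
    send_wr.sg_list    = send_sge;
    send_wr.num_sge    = 1;
    send_wr.send_flags = IBV_SEND_SIGNALED;
    send_wr.next       = &write_wr;          /* WRITE queued behind the SEND */

    write_wr.wr_id               = 2;
    write_wr.opcode              = IBV_WR_RDMA_WRITE;  /* no receive needed  */
    write_wr.sg_list             = write_sge;
    write_wr.num_sge             = 1;
    write_wr.send_flags          = IBV_SEND_SIGNALED;
    write_wr.wr.rdma.remote_addr = remote_addr;
    write_wr.wr.rdma.rkey        = rkey;

    return ibv_post_send(qp, &send_wr, &bad);
}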
Re: [openib-general] Problem is routing CM REQ
At 12:56 PM 2/12/2007, Jason Gunthorpe wrote: On Mon, Feb 12, 2007 at 09:23:06AM -0800, Sean Hefty wrote: Ah, I think I missed the key step in your scheme.. You plan to query the local SM for SGID=remote DGID=local? (ie reversed from 'normal'. I was thinking only about the SGID=local DGID=remote query direction) I'm not sure that the query needs the GIDs reversed, as long as the path is reversible. So, the local query would be: SGID=local, DGID=remote, reversible=1 (to SA) And the remote query would be: SGID=local, DGID=remote, reversible=1, (to SA') TClass FlowLabel=from previous query response 1) What does the TClass and FlowLabel returned from SGID=local DGID=remote mean? Do you use it in the Node1 - Node2 direction or the Node2 - Node1 direction or both? 1a) If it is Node1 - Node2 then the local SA has to query SA' to figure out what FlowLabel to return. 1b) If it is for both directions then somehow SA, SA' and all four router ports need to agree on global flowlabels. 2) In the 2nd query, passing SGID=local, DGID=remote is 'reversed' since SGID=local is the wrong subnet for SA'. I think defining this to mean something is risky. 2b) A PR query with TClass and FlowLabel present in the query is currently expected to return an answer with those fields matching. That implies #1b.. TClass is intended to communicate the end-to-end QoS desired. TClass is then mapped to a SL that is local to each subnet. A flow label is intended to do much the same as in the IP world and is left, in essence, to routers to manage. An endnode look up should be to find the address vector to the remote. A look up may return multiple vectors. The SLID would correspond to each local subnet router port that acts as a first-hop destination to the remote subnet. I don't see why the router protocol would not simply enable all paths on the local subnet to a given remote subnet to be acquired. All of the work is kept local to the SA / SM in the source subnet when determining a remote path to take. Why is there any need to define more than just this? Define a router protocol to communicate each subnet's prefix, TClass, etc. and apply KISS. A management entity that wants to manage how each subnet provides router management in terms of route selection, etc. can be constructed by using the existing protocols / tools combined with a new router protocol which only does DGID to next hop SLID mapping. Mike So, here is how I see this working.. - There is a single well known 'reversible' flowlabel. When a router processes a GRH with that flowlabel it produces a packet that has a SLID that is always the same, no matter what router port is used (A' or B' in my example). The LRH is also reversible according to the rules in IBA. A well known value side-steps the global information problem and allows the GRH to be reversible. - Whenever a PR has reversible=1 the result returns the well known flowlabel. The router LID is always the single shared SLID. - To get a more optimal path the following sequence of queries is used:
to SA: SGID=Node1 DGID=Node2 [In the background SA asks SA' what flow label to use]
to SA': SGID=Node1 DGID=Node2 FlowLabel=(from above)
to SA': SGID=Node2 DGID=Node1 SLID=(dlid from above) [In the background SA' asks SA what flow label to use]
to SA: SGID=Node2 DGID=Node1 FlowLabel=(from above)
It is almost guaranteed that the FlowLabel will be asymmetric. This is to keep the flowlabel space local to each subnet. 
In the background queries, SA and SA' also examine the global route topology to select an optimal no-spoof-needed router LID. The background exchange is how the disambiguation problem with multiple-router paths is solved. Implicit in this are five IBA affecting things:
- that PRs with SGID=non-local mean something specific
- PRs with DGID=non-local cause the SA to communicate with the remote SA to learn the GRH's FlowLabel (except in the case where reversible=1)
- clients can communicate with remote SA's
- Routers do the SLID spoofing you outlined.
- SA's and routers collaborate quite closely on how the router produces a LRH. In particular the SA controls the SLID spoofing
A new query type or maybe some kind of modified multi-path-record query could be defined by IBA to reduce the 6 exchanges required to something more efficient. Does this match what you are thinking?
   SA                                         SA'
Node1 -- (LID 1) Router A --- Router A' (LID A) --- Node2
     |-  (LID 2) Router A                             |
     |-  (LID 3) Router B --- Router B' (LID B) ------|
Router A and Router B are independent redundant devices, not a route cloud of some sort. B - A' is not a possible path. Since A' and B' connect to the same subnet, B - A' should be a valid path. Please don't
Re: [openib-general] Problem is routing CM REQ
At 02:47 PM 2/12/2007, Sean Hefty wrote: 1) What does the TClass and FlowLabel returned from SGID=local DGID=remote mean? Do you use it in the Node1 - Node2 direction or the Node2 - Node1 direction or both? Maybe it would help if we can agree on a set of expectations. These are what I am thinking: 1. An SA should be able to respond to a valid PR query if at least one of the GIDs in the path record is local. 2. The LIDs in a PR are relative to the SA's subnet that returned the record. 3. An IB router should not failover transparently to QPs sending traffic through that router. There is no reason for such a restriction. APM can work with routers and the IB protocol will recover from any out of order packet processing just fine. 4. A PR from the local SA with reversible=1 indicates that data sent from the remote GID to the local GID using the PR TC and FL will route locally using the specified LID pair. This holds whether the PR SGID is local or remote. 5. A PR from a remote SA with reversible=1 indicates that data sent from the local GID to the remote GID using the PR TC and FL will route remotely using the specified LID pair. This holds whether the PR SGID is local or remote. 6. A PR with reversible=0 is relative to SA's subnet. The SGID-DGID data flow over the PR TC and FL indicates the SLID-DLID mapping for that subnet. Do your expectations differ from these? The use of reversible between subnets is what's concerning me. It may be that an SA could not return any paths as reversible between two subnets without using some trick like what you mentioned. These add a requirement on the SA that they must be aware of the routes packets take between two GIDs using a given TC and FL, but I don't believe that this necessarily forces SA to SA communication. The SA may only need to exchange information with a router...? It should not force SA to SA communication. Such communication is overly complex and will be a major issue to control and manage in the end. Further, security concerns, partition management, etc. start to complex enough as it is without adding more fuel to the fire. Implicit in this are five IBA affecting things: - that PRs with SGID=non-local mean something specific I don't think that we're changing any of the meanings of the fields though. - Routers do the SLID spoofing you outlined. I'm not sure this is something that we do want now. APM should really handle path failover. There is alot of complex work in the router and SA side to make this kind of topology work, but it is critical that the clients use path queries that can provide enough data to the SA and return enough data to the client to support this. I'm still deciding if the existing path record attribute is sufficient. Our original IB router work I believe drove some of what is in the current records so I suspect they are fine as is. Mike ___ openib-general mailing list openib-general@openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
Re: [openib-general] Immediate data question
At 03:41 PM 2/7/2007, Roland Dreier wrote: Changqing What I mean is that, is there any performance penalty Changqing for receiver's overall performance if RNR happens Changqing continuously on one of the QP ? Not for the receiver, but the sender will be severely slowed down by having to wait for the RNR timeouts. RNR = Receiver Not Ready so by definition, the data flow isn't going to progress until the receiver is ready to receive data. If a receive QP enters RNR for a RC, then it is likely not progressing as desired. RNR was initially put in place to enable a receiver to create back pressure to the sender without causing a fatal error condition. It should rarely be entered and therefore should have negligible impact on overall performance however when a RNR occurs, no forward progress will occur so performance is essentially zero. Mike ___ openib-general mailing list openib-general@openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
Re: [openib-general] dapl broken for iWARP
At 07:43 AM 2/8/2007, Kanevsky, Arkady wrote: That is correct. I am working with Krishna on it. Expect patches soon. By the way, the problem is not DAPL specific and neither is the proposed solution. There are 3 aspects of the solution. One is APIs. We suggest that we do not augment these. That is, a connection requestor sets its QP RDMA ORD and IRD. When the connection is established the user can check the QP RDMA ORD and IRD to see what he has now to use over the connection. We may consider extending QP attributes to support transport specific parameter passing in the future. For example, an iWARP MPA CRC request. Second is the semantic that CM provides. The proposal is to match the IBCM semantic. That is, CM guarantees that local IRD is >= remote ORD. This guarantees that incoming RDMA Read requests will not overwhelm the QP RDMA Read capabilities. Again there are no changes to IBCM, only to IWCM. Notice that as part of this the IWCM will pass down to the driver and extract from the driver the needed info. The final part is an iWARP CM extension to exchange RDMA ORD, IRD. This is similar to the IBTA Annex for IP Addressing. The harder part is that this will eventually require an IETF MPA spec extension, and the fact that the MPA protocol is implemented in RNIC HW by many vendors means it can not be done by the IWCM itself. We looked at this quite a bit during the creation of the specification. All of the targeted usage models exchange this information as part of their hello or login exchanges. As such, the hum was to not change MPA to communicate such information and leave it to software to exchange these values through existing mechanisms. I seriously doubt there will be much support for modifying the MPA specification at this stage since the implementations are largely complete and a modification would have to deal with the legacy interoperability issue which likely would be solved in software any way. It would be simpler to simply modify the underlying DAPL implementation to exchange the information and keep this hidden from both the application and the RNIC providers. Mike Thanks, Arkady Kanevsky email: [EMAIL PROTECTED] Network Appliance Inc. phone: 781-768-5395 1601 Trapelo Rd. - Suite 16.Fax: 781-895-1195 Waltham, MA 02451 central phone: 781-768-5300 -Original Message- From: Steve Wise [mailto:[EMAIL PROTECTED] Sent: Wednesday, February 07, 2007 6:12 PM To: Arlin Davis Cc: openib-general Subject: Re: [openib-general] dapl broken for iWARP On Wed, 2007-02-07 at 15:05 -0800, Arlin Davis wrote: Steve Wise wrote: On Wed, 2007-02-07 at 14:02 -0600, Steve Wise wrote: Arlin, The OFED dapl code is assuming the responder_resources and initiator_depth passed up on a connection request event are from the remote peer. This doesn't happen for iWARP. In the current iWARP specifications, it's up to the application to exchange this information somehow. So these are defaulting to 0 on the server side of any dapl connection over iWARP. This is a fairly recent change, I think. We need to come up with some way to deal with this for OFED 1.2 IMO. Yes, this was changed recently to sync up with the rdma_cm changes that exposed the values. The IWCM could set these to the device max values for instance. That would work fine as long as you know the remote settings will be equal or better. The provider just sets the min of local device max values and the remote values provided with the request. I know Krishna Kumar is working on a solution for exchanging this info in private data so the IWCM can do the right thing. Stay tuned for a patch series to review for this. 
But this functionality is definitely post OFED-1.2. So for the OFED-1.2, I will set these to the device max in the IWCM. Assuming the other side is OFED 1.2 DAPL, then it will work fine. Steve. ___ openib-general mailing list openib-general@openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general ___ openib-general mailing list openib-general@openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general ___ openib-general mailing list openib-general@openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
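[Editor's sketch: the software-level exchange being suggested in this thread, shown with librdmacm. The ULP carries its ORD/IRD in the CM private data at connect time, so neither MPA nor the RNIC needs to change. The hello structure is hypothetical, not any standard wire format, and the function name is illustrative.]

#include <string.h>
#include <stdint.h>
#include <rdma/rdma_cma.h>

struct ord_ird_hello {           /* hypothetical ULP hello, not an IETF format */
    uint8_t initiator_depth;     /* ORD: outstanding RDMA Reads we will issue  */
    uint8_t responder_resources; /* IRD: inbound RDMA Reads we can absorb      */
};

static int connect_with_ord_ird(struct rdma_cm_id *id, uint8_t ord, uint8_t ird)
{
    struct ord_ird_hello hello = { .initiator_depth = ord,
                                   .responder_resources = ird };
    struct rdma_conn_param param;
    memset(&param, 0, sizeof(param));

    param.initiator_depth     = ord;
    param.responder_resources = ird;
    param.private_data        = &hello;       /* the ULP-level exchange */
    param.private_data_len    = sizeof(hello);

    return rdma_connect(id, &param);
}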
Re: [openib-general] Problem is routing CM REQ was: Use a GRH when appropriate for unicast packets
At 12:39 PM 2/8/2007, Hal Rosenstock wrote: On Thu, 2007-02-08 at 14:54, Sean Hefty wrote: Hum, you mean to meet the LID validation rules of 9.6.1.5? That is a huge PITA.. [IMHO, 9.6.1.5 C9-54 is a mistake, if there is a GRH then the LRH.SLID should not be validated against the QP context since it makes it extra hard for multipath routing and QoS to work...] If you examine the prior diagram, the packet validation is quite precise and intent on catching any misrouted packets as early in the validation process as possible. This particular compliance statement makes it clear as to the type of connection and how to pattern match. The protocol was designed to work within a single subnet as well as across subnets. Hence, the GRH must be validated in conjunction with the LRH and the QP context in order to ensure an intermediate component did not misroute the packet. As described, an RC QP must flow through at most a single path at any given time in order to ensure packet ordering is maintained (IB requires strong ordering so multi-path within a single RC is not allowed). As for QoS, one can arbitrate a packet for an RC QP relative to other flows without any additional complexity. If one wants to segregate a set of RC QPs onto different paths as well as arbitration slots, that is allowed and supported by the architecture even if going between the same set of ports - simply use multiple LID and SL during connection establishment. Mike Yes - this gets messy. Here is one thought on how to do this: To meet this rule each side of the CM must take the SLID from the incoming LRH as the DLID for the connection. This SLID will be one of the SLIDs for the local router. The other side doesn't need to know what it is. The passive side will get the router SLID from the REQ and the active side gets it from the ACK. The passive side is easy, it just path record queries the DGID and requests the DLID == the incoming LRH.SLID. This requires that the passive side be able to issue path record queries, but I think that it could work for static routes. A point was made to me that the remote side could be a TCA without query capabilities. Are you referring to SA query capabilities? Would such a device just be expected to work without change in an IB routed environment anyway? -- Hal There's still the issue of what value is carried in the remote port LID in the CM REQ (12.7.21), and I haven't even gotten to APM yet... The nasty problem is with the active side - CMA will select a router lid it uses as the DLID and the router may select a different LID for it to use as the SLID when it processes the ACK. By C9-54 they have to be the same: so the active side might have to do another path record query to move its DLID and SL to match the router's chosen SLID. Double suck :P As long as the SA and local routers are in sync, we may be okay here without a second path record query. - Sean ___ openib-general mailing list openib-general@openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
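For the QP-configuration side of this, here is a hedged sketch (kernel verbs) of a passive-side INIT->RTR transition in which the primary path's DLID is the subnet-local router's LID - e.g. taken from the incoming LRH.SLID as suggested above - while the GRH carries the remote GID. The QPN, PSN, MTU and hop limit are placeholders assumed to come from the CM REQ; this is not the actual ib_cm code.

#include <rdma/ib_verbs.h>

static int passive_init_to_rtr(struct ib_qp *qp, u16 router_lid,
			       union ib_gid *remote_gid, u32 remote_qpn,
			       u32 rq_psn, u8 port)
{
	struct ib_qp_attr attr = {
		.qp_state		= IB_QPS_RTR,
		.path_mtu		= IB_MTU_2048,
		.dest_qp_num		= remote_qpn,
		.rq_psn			= rq_psn,
		.max_dest_rd_atomic	= 4,
		.min_rnr_timer		= 12,
	};

	attr.ah_attr.dlid	   = router_lid;  /* subnet-local router port */
	attr.ah_attr.port_num	   = port;
	attr.ah_attr.ah_flags	   = IB_AH_GRH;	  /* GRH required off-subnet */
	attr.ah_attr.grh.dgid	   = *remote_gid;
	attr.ah_attr.grh.hop_limit = 64;

	return ib_modify_qp(qp, &attr,
			    IB_QP_STATE | IB_QP_AV | IB_QP_PATH_MTU |
			    IB_QP_DEST_QPN | IB_QP_RQ_PSN |
			    IB_QP_MAX_DEST_RD_ATOMIC | IB_QP_MIN_RNR_TIMER);
}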
Re: [openib-general] BandWidth doubt
At 02:02 AM 11/10/2006, john t wrote: Hi, I got following readings in one of my experiments: Single 64-bit xeon machine (2 dual-core 3.2 GHz Intel CPUs, linux FC4, OFED 1.0) with two Mellanox DDR (4x) HCAs (each having two ports and each connected to a PCI x8 interface) is connected to a switch (all the 4 DDR (4x) ports are connected to the switch). If I send data from mthca0-1 to mthca0-1 meaning from same port to the same port i.e. same port doing send/recv (also same cable doing send/recv) I get a BW of around 10 Gb/sec. Similarly, from mthca1-1 to mthca1-1 I get same i.e. around 10 Gb/sec. So, individual port-to-port gives 10 Gb/sec. But when I use them together i.e when I send the data from mthca0-1 to mthca0-1 AND from mthca1-1 to mthca1-1 at the same time (simultaneously) I get a BW of 6.7 Gb/sec on each port. This is less than 10 Gb/sec that is expected. Note that mthca0 and mthca1 are connected to two different PCI-x8 interfaces, so there is no question of bandwidth splitting. What could be causing such a behaviour ?? Just to add if the same thing is done between two different hosts i.e. If I send data from mthca0-0 and mthca1-1 of one host to mthca0-0 and mthca1-1 of other host, I get expected BW i.e. 10 Gb/sec on each port/link. You have two links pounding on a shared PCIe Root Complex / memory controller. This sounds like a chipset issue not an IB / software issue when it is placed under load. Mike ___ openib-general mailing list openib-general@openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
Re: [openib-general] ibstatus support for speed
At 02:43 PM 10/30/2006, Hal Rosenstock wrote: On Mon, 2006-10-30 at 17:29, Michael Krause wrote: At 02:05 PM 10/30/2006, Roland Dreier wrote: Hal> So rate = speed * width ? Yes, you should see the right thing on DDR systems etc. Strange. Bandwidth = signaling rate * width. This of course is raw bandwidth prior to encoding, protocol, etc. overheads which will derate the effective application bandwidth minimally by 20-25%. Yes of course. It's just a simple diagnostic to display the width and speed. If the goal is to provide a true indication of the maximum peak bandwidth that an application might see, That's not the goal of this simplistic tool. then stating 10 Gbps for an IB x4 SDR is clearly a misrepresentation and out of alignment with other networking links such as Ethernet, whose bandwidth customers understand to be quoted after the encoding, etc. is removed from the equation. The perpetual trend by marketing to use 10 Gbps IB as equivalent to 10 Gbps of application data is actually detrimental, not beneficial, when it comes to customers. It inevitably leads to the question of why the application is not achieving the stated bandwidth, i.e. why it is say 700-800 MB/s theoretical peak for a x4 while a 10 GbE is 1 GB/s peak. So much marketing hype has gone forward already. I realize I'm tilting at windmills, but if you are to provide a tool that is supposed to project the maximum bandwidth possible, and given the goal of OFA is to provide as much conceptual commonality with existing network stacks / links, then it would be beneficial to have this move towards a much more apples-to-apples communication of information. I know it would certainly help with having to repeatedly explain why IB 10 Gbps is not the same as 10 GbE to customers and analysts. Agreed but this is a different issue from what the tool is for. Understood. IMO this issue largely started when IB decided to use the signalling rate rather than the data rate like most other networks. Blame it on marketroids who were more concerned about their naive attempt to look better than other technology and not about customers or the people who have to continually explain how their drivel is simply wrong. Unfortunately, these same marketroids continue to perpetuate this message even now with their apples-to-oranges comparisons. It annoys customers, who, when educated, end up with a slightly less favorable opinion of the technology. Mike ___ openib-general mailing list openib-general@openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
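As a back-of-the-envelope companion to the numbers above, the sketch below (plain C, illustrative constants) separates the three figures that keep getting conflated: the signaling rate * width that the tool prints, the data rate after 8b/10b, and a rough application ceiling. The ~15% transport-overhead factor is an assumption, not a measured value.

#include <stdio.h>

int main(void)
{
	double lane_gbps[] = { 2.5, 5.0 };	/* SDR, DDR signaling rates */
	const char *name[] = { "SDR", "DDR" };
	int width = 4;

	for (int i = 0; i < 2; i++) {
		double raw  = lane_gbps[i] * width;	/* what marketing quotes   */
		double data = raw * 8.0 / 10.0;		/* after 8b/10b encoding   */
		double app  = data * 0.85;		/* minus ~15% protocol cost */
		printf("%s x%d: raw %.0f Gb/s, data %.0f Gb/s, ~%.1f Gb/s to the app\n",
		       name[i], width, raw, data, app);
	}
	return 0;
}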
Re: [openib-general] ibstatus support for speed
At 02:05 PM 10/30/2006, Roland Dreier wrote: Hal> So rate = speed * width ? Yes, you should see the right thing on DDR systems etc. Strange. Bandwidth = signaling rate * width. This of course is raw bandwidth prior to encoding, protocol, etc. overheads which will derate the effective application bandwidth minimally by 20-25%. If the goal is to provide a true indication of the maximum peak bandwidth that an application might see, then stating 10 Gbps for an IB x4 SDR is clearly a misrepresentation and out of alignment with other networking links such as Ethernet, whose bandwidth customers understand to be quoted after the encoding, etc. is removed from the equation. The perpetual trend by marketing to use 10 Gbps IB as equivalent to 10 Gbps of application data is actually detrimental, not beneficial, when it comes to customers. It inevitably leads to the question of why the application is not achieving the stated bandwidth, i.e. why it is say 700-800 MB/s theoretical peak for a x4 while a 10 GbE is 1 GB/s peak. So much marketing hype has gone forward already. I realize I'm tilting at windmills, but if you are to provide a tool that is supposed to project the maximum bandwidth possible, and given the goal of OFA is to provide as much conceptual commonality with existing network stacks / links, then it would be beneficial to have this move towards a much more apples-to-apples communication of information. I know it would certainly help with having to repeatedly explain why IB 10 Gbps is not the same as 10 GbE to customers and analysts. Mike ___ openib-general mailing list openib-general@openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
Re: [openib-general] IPoIB Question
At 10:00 PM 10/23/2006, Greg Lindahl wrote: On Mon, Oct 23, 2006 at 07:53:06AM -0500, Hubbell, Sean C Contractor/Decibel wrote: I currently have several applications that use a legacy IPv4 protocol and I use IPoIB to utilize my InfiniBand network, which works great. I have completed some timing and throughput analysis and noticed that I do not get very much more if I use an InfiniBand network interface than using my GigE network interface. You might want to note that different InfiniBand implementations have quite different performance of IPoIB, especially for UDP. Another issue is that IPoIB has quite different performance with different Linux kernels. This is especially evident for TCP, although you can use SDP to accelerate TCP sockets and avoid this issue. My question is, am I using IPoIB correctly or are these the typical numbers that everyone is seeing? It is certainly the case that there are some message patterns and situations for which InfiniBand is not much of an improvement over gigE. Unfortunately, the comparisons of IB to GbE are often apples-to-oranges comparisons even for IP over IB. Until an HCA supplies the same level of functional off-load enabled by the IP network stack that is used with Ethernet, it really isn't a fair comparison. The same is also true for many of the marketroids and their comparisons of IB to Ethernet based solutions. Fortunately, most customers are getting a bit smarter and not falling for the marketing drivel these days - certainly the OEMs don't fall for it, though the marketroids continue to come in and try to convince people it isn't an apples-to-oranges comparison. The fact is both technologies have their pros / cons and it is really the workload or production environment that determines which is the best fit instead of the force fit. In any case, not really a development issue so will drop further discussion. Mike ___ openib-general mailing list openib-general@openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
Re: [openib-general] IPoIB Question
At 10:19 AM 10/23/2006, Michael S. Tsirkin wrote: Quoting r. Sean Hubbell [EMAIL PROTECTED]: I am looking at libsdp for the TCP functionality and wanted to know if libsdp supports UDP as well. AFAIK, SDP can only emulate TCP sockets. SDP is defined to work with AF_INET applications. If using a shared library approach / pre-load, one can transparently enable any AF_INET application to utilize SDP without a recompile, etc. The SDP Port Mapper specification for iWARP / service id for IB enables the connection management (or whatever service it is implemented within) to transparently discover the real target listen port and establish an SDP session, nominally during connection establishment. Implementations may vary in the robustness or policies used to determine what to off-load, number of off-load sessions, etc. - in other words, a lot of opportunity and flexibility is provided to use SDP. Note: Winsock Direct on Windows provides an equivalent service, though it uses a proprietary protocol. Vista will have SDP as defined in the specifications. There are currently no plans to develop an equivalent for datagram applications. Any datagram application (user or kernel) can already access the hardware directly and given RDMA is not defined for datagram, it was felt such a specification would provide minimal value. Mike ___ openib-general mailing list openib-general@openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
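The preload approach mentioned above usually boils down to interposing socket() and steering AF_INET stream sockets onto the SDP address family. A toy sketch follows; AF_INET_SDP = 27 matches what libsdp has historically used, but treat the constant and the whole interposer as illustrative - the real libsdp also consults a configuration file to decide which connections to redirect.

#define _GNU_SOURCE
#include <dlfcn.h>
#include <sys/socket.h>

#ifndef AF_INET_SDP
#define AF_INET_SDP 27		/* assumed value, per historical libsdp */
#endif

int socket(int domain, int type, int protocol)
{
	static int (*real_socket)(int, int, int);

	if (!real_socket)
		real_socket = (int (*)(int, int, int)) dlsym(RTLD_NEXT, "socket");

	if (domain == AF_INET && type == SOCK_STREAM)
		domain = AF_INET_SDP;	/* transparently use SDP */

	return real_socket(domain, type, protocol);
}

Built as a shared object and loaded via LD_PRELOAD, an unmodified AF_INET application then carries its TCP streams over SDP, which is the transparency the thread describes.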
Re: [openib-general] APM support in openib stack
At 11:24 AM 10/13/2006, Sean Hefty wrote: 3. in req_handler() we follow the same steps as we have done without APM, i.e. create qpairs, change qp state to RTR and then send REP. However, when trying to change state to RTR using ib_modify_qp() I get an error (-22). Two notes: the same code will work if I pass alt_path as NULL or use the alt_path as the primary path. I must be missing something here; I assume this basic APM feature works in the RHEL4 update 4 distribution of the openib stack. I added code to the ib_cm to handle APM, but haven't ever tested it myself. I believe others have used it successfully though. What differences are there between the primary and alternate paths? I.e. are just the LIDs different, or are other values also different? The spec allows a full address vector to be specified, not just the LID. Mike ___ openib-general mailing list openib-general@openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
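For reference, a hedged sketch of what loading the alternate path at RTR time looks like with the kernel verbs; the values are placeholders, and a -22 (EINVAL) at this step usually means one of these attributes, or the attribute mask, is inconsistent with what the device or state transition accepts.

#include <rdma/ib_verbs.h>

/* Add an alternate path on top of the caller's INIT->RTR attributes.
 * rtr_mask is the usual RTR mask (IB_QP_STATE | IB_QP_AV | ...). */
static int load_alt_path_rtr(struct ib_qp *qp, struct ib_qp_attr *attr,
			     int rtr_mask, u16 alt_dlid, u8 alt_port)
{
	attr->alt_ah_attr.dlid	   = alt_dlid;
	attr->alt_ah_attr.sl	   = attr->ah_attr.sl;
	attr->alt_ah_attr.port_num = alt_port;
	attr->alt_pkey_index	   = 0;
	attr->alt_port_num	   = alt_port;
	attr->alt_timeout	   = 14;

	/* path_mig_state is armed later (IB_MIG_REARM at the RTR->RTS step) */
	return ib_modify_qp(qp, attr, rtr_mask | IB_QP_ALT_PATH);
}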
Re: [openib-general] Dropping NETIF_F_SG since no checksum feature.
At 02:46 AM 10/11/2006, Michael S. Tsirkin wrote: Quoting r. David Miller [EMAIL PROTECTED]: Subject: Re: Dropping NETIF_F_SG since no checksum feature. From: Michael S. Tsirkin [EMAIL PROTECTED] Date: Wed, 11 Oct 2006 11:05:04 +0200 So, it seems that if I set NETIF_F_SG but clear NETIF_F_ALL_CSUM, data will be copied over rather than sent directly. So why does dev.c have to force set NETIF_F_SG to off then? Because it's more efficient to copy into a linear destination buffer of an SKB than page sub-chunks when doing checksum+copy. Thanks for the explanation. Obviously its true as long as you can allocate the skb that big. I think you won't realistically be able to get 64K in a linear SKB on a busy system, though, is not that right? OTOH, having large MTU (e.g. 64K) helps performance a lot since it reduces receive side processing overhead. One thing to keep in mind is while it may help performance in a micro-benchmark, the system performance or the QoS impacts to other flows can be negatively impacted depending upon implementation. For example, consider multiple messages interleaving (heaven help implementations that are not able to interleave multiple messages) on either the transmit or receive HCA / RNIC and how the time-to-completion of any message is extended out in time as a result of the interleave. The effective throughput in terms of useful units of work can be lower as a result. The same effect can be observed when there are a significant number connections in a device being simultaneously processed. Also, if the copy-checksum is not performed on the processor where the application resides, then the performance can also be negatively impacted (want to have the right cache hot when initiated or concluded). While the aggregate computational performance of systems may be increasing at a significant rate (set aside the per core vs. aggregate core debate), the memory performance gains are much less. If you examine the longer term trends, there may be a flattening out of memory performance improvements by 2009/10 without some major changes in the way controllers and subsystems are designed. Mike ___ openib-general mailing list openib-general@openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
Re: [openib-general] ipoib: ignores dma mapping errors on TX?
At 10:24 AM 10/10/2006, Tom Tucker wrote: Does anyone know what might happen if a device tries to bus master bad_dma_address. Does it get a pci-abort, an NMI, a bus err interrupt, all of the above? It depends upon the platform. Some will enter a containment mode and, for example, shutdown the PCI Bus or the PCIe Root Port. Others may trigger a system error and shutdown the system. These responses are in part, a policy of the implementation and how the system is implemented. In future chipsets that contain IOMMU / Address Translation Protection Tables (ATPT) / pick your favorite name, the error can be contained to a single device and the appropriate error recovery triggered without requiring the system to go down. Again, all policy at the end of the day as to what action is triggered. For most, the potential for silent data corruption is too high to risk that bus or Root Port from continuing to operate without a reset / flush so containment is used at a minimum. Mike On 10/9/06 1:01 PM, Roland Dreier [EMAIL PROTECTED] wrote: Michael> It seems that IPoIB ignores the possibility that Michael> dma_map_single with DMA_TO_DEVICE direction might return Michael> dma_mapping_error. Michael> Is there some reason that such mappings can't fail? No, it's just an oversight. Most network device drivers don't check for DMA mapping errors but it's probably better to do so anyway. I added this to my queue:

commit 8edaf479946022d67350d6c344952fb65064e51b
Author: Roland Dreier [EMAIL PROTECTED]
Date: Mon Oct 9 10:54:20 2006 -0700

    IPoIB: Check for DMA mapping error for TX packets

    Signed-off-by: Roland Dreier [EMAIL PROTECTED]

diff --git a/drivers/infiniband/ulp/ipoib/ipoib_ib.c b/drivers/infiniband/ulp/ipoib/ipoib_ib.c
index f426a69..8bf5e9e 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_ib.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_ib.c
@@ -355,6 +355,11 @@ void ipoib_send(struct net_device *dev,
 	tx_req->skb = skb;
 	addr = dma_map_single(priv->ca->dma_device, skb->data, skb->len, DMA_TO_DEVICE);
+	if (unlikely(dma_mapping_error(addr))) {
+		++priv->stats.tx_errors;
+		dev_kfree_skb_any(skb);
+		return;
+	}
 	pci_unmap_addr_set(tx_req, mapping, addr);
 	if (unlikely(post_send(priv, priv->tx_head & (ipoib_sendq_size - 1),

___ openib-general mailing list openib-general@openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
Re: [openib-general] Multi-port HCA
Off-line someone asked me to clarify my earlier e-mail. Given this discussion continues, perhaps this might help explain the performance a bit more. The Max Payload Size quoted here is what is typically implemented on x86 chipsets though other chipsets may use a larger value. From a pure bandwidth perspective (which is not typical of many applications), this should be reasonably accurate. In any case, this is just a fyi. A x4 IB 5 GT/s is 20 Gbps raw (customers do comprehend the marketing hype does not translate into that bandwidth being available for applications - I have had to explain this to the press in the past about how raw does not equal application-available bandwidth). Take off 8b/10b, protocol overheads, etc. and assuming a 2KB PMTU, then one can expect to hit perhaps 14-15 Gbps per direction depending upon the workload. Let's assume an aggregate of 30 Gbps of potential application bandwidth for simplicity. The PCIe x8 2.5 GT/s is 20 Gbps raw so take off the 8b/10b, protocol overheads, control / application overheads, etc. and given it uses at most a 256B Max Payload Size on DMA Writes and cache line sized DMA Read Completions (64B) though many people use PIO Writes to avoid DMA Reads when it comes to micro-benchmarks, the actual performance is unlikely to hit what IB might drive depending upon the direction and mix of control and application data transactions. Add in the impacts on the memory controller which in real-world applications is servicing the processors quite a bit more than illustrated by micro-benchmarks and the ability of a system to drive an IB x4 DDR device at link rate is very questionable. The question is whether this really matters. If you examine most workloads on various platforms, they simply cannot generate enough bandwidth to consume the external I/O bandwidth capacity. In many cases, they are constrained by the processor or the combination of the processor / memory components. This isn't a bad thing when you think about it. For many customers, it means that the attached I/O fabrics will be sufficiently provisioned to eliminate or largely mitigate the impacts of external fabric events, e.g. congestion, and deliver a reasonable solution using the existing hardware (issues of topology, use of multi-path, etc. all come into play as a function of fabric diameter). In the end, customers care about whether the application performs as expected and where the real bottlenecks lie. For most applications, it will come down to the processor / memory subsystems and not the I/O or external fabric. While I haven't seen all of the latest DDR micro-benchmark results, I believe the x4 IB SDR numbers largely align with what I've outlined here. Mike At 02:09 AM 10/6/2006, john t wrote: Hi Shannon, The bandwidth figures that you quoted below match with my readings for single port Mellanox DDR HCA (both for unidirection and bidirection). So it seems dual port SDR HCA performs as good as single port DDR HCA. It would help if you can also tell the bandwidth that you got using one port of your dual-port SDR HCA card. Was it half the bandwidth that you stated below, which would mean having two SDR ports per HCA helps? In my case it seems having two ports (DDR) per HCA does not increase BW, since the PCI-e x8 limit is 16 Gb/sec per direction and each of the two HCA ports (DDR), though capable of transferring 16 Gb/sec in each direction, when used together cannot go above 16 Gb/sec. Regards, John T. On 10/5/06, Shannon V.
Davidson [EMAIL PROTECTED] wrote: John, In our testing with dual port Mellanox SDR HCAs, we found that not all PCI-express implementations are equal. Depending on the PCIe chipset, we measured unidirectional SDR dual-rail bandwidth ranging from 1100-1500 MB/sec and bidirectional SDR dual-rail bandwidth ranging from 1570-2600 MB/sec. YMMV, but we had good luck with Intel and Nvidia chipsets, and less success with the Broadcom Serverworks HT-1000 and HT-2000 chipsets. My last report (in June 2006) was that Broadcom was working to improve their PCI-express performance. Regards, Shannon john t wrote: Hi Bernard, I had a configuration issue. I fixed it and now I get the same BW (i.e. around 10 Gb/sec) on each port provided I use ports on different HCA cards. If I use two ports of the same HCA card then the BW gets divided between these two ports. I am using Mellanox HCA cards and doing simple send/recv using uverbs. Do you think it could be an issue with the Mellanox driver, or could it be due to a system/PCI-E limitation? Regards, John T. On 10/3/06, Bernard King-Smith [EMAIL PROTECTED] wrote: John, Whose adapter (manufacturer) are you using? It is usually an adapter implementation or driver issue that occurs when you cannot scale across multiple links. The fact that you don't scale up from one link, and that they appear to share a fixed bandwidth across N links, means that there is a driver or stack issue. At one time I think that IPoIB and maybe other IB drivers used only
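To put rough numbers on the PCIe side of this thread, a small model (plain C, with an assumed ~20 bytes of TLP header/framing overhead) of what a x8 2.5 GT/s link can deliver once 8b/10b and a 256-byte Max Payload Size are taken into account:

#include <stdio.h>

int main(void)
{
	double raw_gbps   = 2.5 * 8;		   /* x8 link, 2.5 GT/s per lane */
	double data_gbps  = raw_gbps * 8.0 / 10.0; /* after 8b/10b              */
	double mps        = 256.0;		   /* typical x86 Max Payload    */
	double overhead   = 20.0;		   /* assumed bytes per TLP      */
	double efficiency = mps / (mps + overhead);

	printf("x8 Gen1: %.1f Gb/s raw, %.1f Gb/s data, ~%.1f Gb/s of payload\n",
	       raw_gbps, data_gbps, data_gbps * efficiency);
	return 0;
}

The resulting ~14-15 Gb/s ceiling sits below a x4 DDR link's post-encoding rate, which is roughly consistent with the chipset-dependent dual-rail numbers reported above.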
Re: [openib-general] Drop in performance on Mellanox MT25204 single port DDR HCA
At 02:43 PM 10/2/2006, Roland Dreier wrote: Robert> Yes. 1250 Mbytes/sec is what we expect. You say the 128 Robert> value comes from the BIOS? If so, we need to discuss this Robert> with our BIOS team to find out why they limit it to 128, Robert> perhaps it is a BIOS bug. Yes, I believe that the BIOS is the only place that would set that value. We know that resetting the device makes it go back to a different default value, and nothing in the kernel that I know of is going to set it down to 128. 128B is the default minimum from PCIe so likely some BIOS engineer took a conservative view and chose the defaults (go figure). Setting Max Read Request Size = 4096 is preferred on any implementation as it is basically free from a chipset perspective. The chipset will likely return in cache line quantities but there are some obvious optimizations to be achieved by issuing a single DMA Read Request. Mike ___ openib-general mailing list openib-general@openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
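A sketch of how a driver could raise the setting itself rather than living with the BIOS default; the DEVCTL field encoding (bits 14:12, 0 = 128B up to 5 = 4096B) comes from the PCIe spec, and later kernels wrap the same operation in pcie_set_readrq(), so take the open-coded version as illustrative.

#include <linux/pci.h>
#include <linux/errno.h>

static int set_max_read_req_4k(struct pci_dev *pdev)
{
	int pos = pci_find_capability(pdev, PCI_CAP_ID_EXP);
	u16 devctl;

	if (!pos)
		return -ENODEV;		/* not a PCI Express function */

	pci_read_config_word(pdev, pos + PCI_EXP_DEVCTL, &devctl);
	devctl = (devctl & ~PCI_EXP_DEVCTL_READRQ) | (5 << 12); /* 4096 bytes */
	pci_write_config_word(pdev, pos + PCI_EXP_DEVCTL, devctl);
	return 0;
}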
Re: [openib-general] [PATCH 0/10] [RFC] Support for SilverStorm Virtual Ethernet I/O controller (VEx)
Silverstorm is executing a usage model that the IBTA used to develop the IB protocols. What is the problem with that? If it works and integrates into the stack, then this seems like an appropriate bit of functionality to support. The fact that one can use a standard ULP to communicate to a TCA as an alternative which is supported by the existing stack is a customer product decision at the end of the day. If Silverstorm or any IHV can show value and that it works in the stack, then it seems appropriate to support. Isn't that a fundamental principle of being an open source effort? Mike At 12:31 PM 10/3/2006, Fabian Tillier wrote: Hi Yaron, On 10/3/06, Yaron Haviv [EMAIL PROTECTED] wrote: I'm trying to figure out why this protocol makes sense As far as I understand, IPoIB can provide a Virtual NIC functionality just as well (maybe even better), with two restrictions: 1. Lack of support for Jumbo Frames 2. Doesn't support protocols other than IP (e.g. IPX, ..) Whether to use a router or virtual NIC approach for connectivity to Ethernet subnets is a design decision. We could argue until we are blue in the face about which architecture is better, but that's really not relevant. I believe we should first see if such a driver is needed and if IPoIB UD/RC cannot be leveraged for that, maybe the Ethernet emulation can just be an extension to IPoIB RC, hitting 3 birds in one stone (same infrastructure, jumbo frames for IPoIB, and Ethernet emulation for all nodes not just Gateways) You're joking right? Are you really arguing that SilverStorm should not develop a driver to support its existing devices? This really isn't complicated: 1). SilverStorm has a virtual NIC hardware device. 2). SilverStorm is committed to support OpenFabrics. The above two statements lead to the following conclusion: SilverStorm needs a driver for its devices that works with the OpenFabrics stack. This is totally orthogonal to and independent of working on IPoIB RC or any IETF efforts to define something new. - Fab ___ openib-general mailing list openib-general@openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general ___ openib-general mailing list openib-general@openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
Re: [openib-general] A critique of RDMA PUT/GET in HPC
At 08:56 AM 8/25/2006, Greg Lindahl wrote: On Fri, Aug 25, 2006 at 10:13:01AM -0500, Tom Tucker wrote: He does say this, but his analysis does not support this conclusion. His analysis revolves around MPI send/recv, not the MPI 2.0 get/put services. Nobody uses MPI put/get anyway, so leaving out analyzing that doesn't change reality much. Is this due to legacy or other reasons? One reason cited from Winsocks Direct for using the bcopy vs. the RDMA zcopy operations was the cost to register memory if done on a per operation basis, i.e. single use. The bcopy threshold was ~9KB. With the new verbs developed for iWARP and then added to IB v1.2, the bcopy threshold was reduced to ~1KB. Now, if I recall correctly, many MPI implementations split their buffer usage between what are often 1KB envelopes and what are large regions. One can persistently register the envelopes so their size does not really matter and thus could use send / receive or RDMA semantics for their update depending upon how the completions are managed. The larger data movements can be RDMA semantics if desired as these are typically large in size. A valid conclusion IMO is that MPI send/recv can be most efficiently implemented over an unconnected reliable datagram protocol that supports 64bit tag matching at the data sink. And not coincidentally, Myricom has this ;-) As do all of the non-VIA-family interconnects he mentions. Since we all landed on the same conclusion, you might think we're on to something. Or not. We've had this argument multiple times and examined all of the known and relatively volume usage models which includes the suite of MPI benchmarks used to evaluate and drive implementations. Any interconnect architecture is one of compromise if it is to be used in a volume environment - the goal for the architects is to insure the compromises do not result in a brain-dead or too diminished technology that will not meet customer requirements. With respect to reliable datagram, unless one does software multiplexing over what amounts to a reliable connection which comes with a performance penalty as well as complexity in terms of error recover, etc. logic it really does not buy one anything better than a RC model used today. Given the application mix and the customer usage model, IB provided four transport types to meet different application needs and allow people to make choices. iWARP reduced this to one since the target applications really were met with RC and reliable datagram as defined in IB simply was not being picked up or demanded by the targeted ISV. While some of us had argued for the software multiplex model, others wanted everything to be implemented in hardware so IB is what it is today. In any case, it is one of a set of reasonable compromises and for the most part, I contend it is difficult to argue that these interconnect technologies are so compromised that they are brain dead or broken. However, that's only part of the argument. Another part is that the buffer space needed to use RDMA put/get for all data links is huge. And there are some other interesting points. The buffer and context differences to track RDMA vs. Send are not significant in terms of hardware. In terms of software, memory needs to be registered in some capacity to perform DMA to it and hence, there is a cost from the OS / application perspective. Our goals were to be able to use application buffers to provide zero copy data movements as well as OS bypass. RDMA vs. 
Send does not incrementally differ in terms of resource costs in the end. I DO agree that it is interesting reading. :-), it's definitely got people fired up. Heh. Glad you found it interesting. The article is somewhat interesting but does not really present anything novel in this on-going debate on how interconnects should be designed. There will always be someone pointing out a particular issue here and there and in the end, many of these amount to mouse nuts when placed into the larger context. When they don't, a new interconnect is defined or extensions are made to compensate as nothing is ever permanent or perfect. Mike ___ openib-general mailing list openib-general@openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
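A small user-space sketch of the "register the envelopes once" point made above: a pool of 1KB bounce buffers is registered a single time with ibv_reg_mr() and reused for every small send, so per-operation registration cost is only paid for large transfers. Pool size and layout are illustrative.

#include <stdlib.h>
#include <infiniband/verbs.h>

#define ENVELOPES     64
#define ENVELOPE_SIZE 1024

struct envelope_pool {
	void	      *buf;
	struct ibv_mr *mr;
};

static int envelope_pool_init(struct envelope_pool *p, struct ibv_pd *pd)
{
	if (posix_memalign(&p->buf, 4096, ENVELOPES * ENVELOPE_SIZE))
		return -1;

	/* one registration amortized over the life of the connection */
	p->mr = ibv_reg_mr(pd, p->buf, ENVELOPES * ENVELOPE_SIZE,
			   IBV_ACCESS_LOCAL_WRITE);
	return p->mr ? 0 : -1;
}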
Re: [openib-general] basic IB doubt
At 10:14 AM 8/23/2006, Ralph Campbell wrote: On Wed, 2006-08-23 at 09:47 -0700, Caitlin Bestler wrote: [EMAIL PROTECTED] wrote: Quoting r. john t [EMAIL PROTECTED]: Subject: basic IB doubt Hi I have a very basic doubt. Suppose Host A is doing RDMA write (say 8 MB) to Host B. When data is copied into Host B's local buffer, is it guaranteed that data will be copied starting from the first location (first buffer address) to the last location (last buffer address)? or it could be in any order? Once B gets a completion (e.g. of a subsequent send), data in its buffer matches that of A, byte for byte. An excellent and concise answer. That is exactly what the application should rely upon, and nothing else. With iWARP this is very explicit, because portions of the message not only MAY be placed out of order, they SHOULD be when packets have been re-ordered by the network. But for *any* RDMA adapter there is no guarantee on what order the adapter flushes things to host memory or particularly when old contents that may be cached are invalidated or updated. The role of the completion is to limit the frequency with which the RDMA adapter MUST guarantee coherency with application visible buffers. The completion not only indicates that the entire message was received, but that it has been entirely delivered to host memory. Actually, A knows the data is in B's memory when A gets the completion notice. This is incorrect for both iWARP and IB. A completion by A only means that the receiving HCA / RNIC has the data and has generated an acknowledgement. It does not indicate that B has flushed the data to host memory. Hence, the fault zone remains the HCA / RNIC and while A may free the associated buffer for other usage, it should not rely upon the data being delivered to host memory on B. This is one of the fault scenarios I raised during the initial RDS transparent recovery assertions. If A were to issue a RDMA Read to the B targeting the associated RDMA Write memory location, then it can know the data has been placed in B's memory. B can't rely on anything unless A uses the RDMA write with immediate which puts a completion event in B's CQ. Most applications on B ignore this requirement and test for the last memory location being modified which usually works but doesn't guarantee that all the data is in memory. B cannot rely on anything until a completion is seen either through an immediate or a subsequent Send. It is not wise to rely upon IHV-specific behaviors when designing an application as even an IHV can change things over time or due to interoperability requirements, things may not work as desired which is definitely a customer complaint that many would like to avoid. BTW, the reason immediate data is 4 bytes in length is that was what was defined in VIA. Many within the IBTA wanted to get rid of immediate data but due to the requirement to support legacy VIA applications, the immediate value was left in place. The need to support a larger value was not apparent. One needs to keep in mind where the immediate resides within the wire protocol and its usage model. The past usage was to signal a PID or some other unique identifier that could be used to comprehend which thread of execution should be informed of a particular completion event. Four bytes is sufficient to communicate such information without significantly complicating or making the wire protocol too inefficient. 
Mike ___ openib-general mailing list openib-general@openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
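A hedged sketch of the pattern the thread recommends instead of polling the target buffer: make delivery observable through a completion, for example by using RDMA Write with Immediate so the sink consumes a posted receive and sees a CQ entry only after the payload has been placed. Addresses, keys and the immediate value below are placeholders.

#include <stddef.h>
#include <stdint.h>
#include <arpa/inet.h>
#include <infiniband/verbs.h>

static int post_write_with_imm(struct ibv_qp *qp, struct ibv_mr *mr,
			       void *laddr, size_t len,
			       uint64_t raddr, uint32_t rkey)
{
	struct ibv_sge sge = {
		.addr   = (uintptr_t) laddr,
		.length = (uint32_t) len,
		.lkey   = mr->lkey,
	};
	struct ibv_send_wr wr = {
		.wr_id      = 1,
		.sg_list    = &sge,
		.num_sge    = 1,
		.opcode     = IBV_WR_RDMA_WRITE_WITH_IMM,
		.send_flags = IBV_SEND_SIGNALED,
	};
	struct ibv_send_wr *bad;

	wr.imm_data            = htonl(0x1234);	/* e.g. an application tag */
	wr.wr.rdma.remote_addr = raddr;
	wr.wr.rdma.rkey        = rkey;

	return ibv_post_send(qp, &wr, &bad);
}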
Re: [openib-general] basic IB doubt
At 02:58 PM 8/24/2006, Sean Hefty wrote: We're trying to create *inter-operable* hardware and software in this community. So we follow the IB standard. Atomic operations and RDD are optional, yet still part of the IB standard. An application that makes use of either of these isn't guaranteed to operate with all IB hardware. I'm not even sure that CAs are required to implement RDMA reads. A TCA is not required to support RDMA Read. An HCA is required. It is correct that atomic and reliable datagram are optional. However, that does not mean they cannot be used or will not work in an interoperable manner. The movement to software multiplexing over an RC (a technique HP delivered to some ISVs years ago) may make RD obsolete from an execution perspective, but that does not mean it is not interoperable. As for atomics, well, they are part of IB and many within MPI would like to see them supported. Their usage should also be interoperable. It's up to the application to verify that the hardware that they're using provides the required features, or adjust accordingly, and publish those requirements to the end users. If that was being done (and it isn't), it would still be bad for the ecosystem as a whole. Applications should drive the requirements. Some poll on memory today. A lot of existing hardware provides support for this by guaranteeing that the last byte will always be written last. This doesn't mean that data cannot be placed out of order, only that the last byte is deferred. Seems much of this debate is really about how software chose to implement polling of a CQ versus polling of memory. Changing IB or iWARP semantics to compensate for what some might view as a sub-optimal implementation does not seem logical as others have been able to poll a CQ without such overheads in other environments. In fact, during the definition of IB and iWARP, it was with this knowledge that we felt the need to change the semantics was not required. Again, if a vendor wants to work with applications written this way, then this is a feature that should be provided. If a vendor doesn't care about working with those applications, or wants to require that the apps be rewritten, then this feature isn't important. But I do not see an issue with a vendor adding value beyond what's defined in the spec. It all comes down to how much of the solution needs to be fully interoperable and how much needs to be communicated as optional semantics. You could always define an API for applications to communicate their capabilities that go beyond a specification. This is in part the logic behind an iSCSI login or SDP Hello exchange where the capabilities are communicated in a standard way so software does the right thing based on the components involved. Changing fundamentals of IB and iWARP seems a bit much when it is much easier to have the ULP provide such an exchange of capabilities if people feel they are truly required. Mike ___ openib-general mailing list openib-general@openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
Re: [openib-general] basic IB doubt
At 11:55 AM 8/25/2006, Greg Lindahl wrote: On Fri, Aug 25, 2006 at 10:00:50AM -0400, Thomas Bachman wrote: Not that I have any stance on this issue, but is this the text in the spec that is being debated? (page 269, section 9.5, Transaction Ordering): An application shall not depend upon the order of data writes to memory within a message. For example, if an application sets up data buffers that overlap, for separate data segments within a message, it is not guaranteed that the last sent data will always overwrite the earlier. No. The case we're talking about is different from the example. There's text elsewhere which says, basically, that you can't access the data buffer until seeing the completion. I'm assuming that the spec authors had reason for putting this in there, so maybe they could provide guidance here? We put that text there to accommodate differing memory controller architectures / coherency protocol capabilities / etc. Basically, there is no way to guarantee that the memory is in a usable and correct state until the completion is seen. This was intended to guide software to not peek at memory but to examine a completion queue entry so that if memory is updated out of order, silent data corruption would not occur. I can't speak for the authors, but as an implementor, this has a huge impact on implementation. For example, on an architecture where you need to do work such as flushing the cache before accessing DMAed data, that's done in the completion. x86 in general is not such an architecture, but they exist. IB is intended to be portable to any CPU architecture. Invalidation protocol is one concern. The other is that a completion notification often acts as a flush of the local I/O fabric as well. In the case of an RDMA Write, the only way to safely determine complete delivery was to have an RDMA Write / Send with completion combination or an RDMA Write / RDMA Read depending upon which side required such completion knowledge. For iWarp, the issue is that packets are frequently reordered. Neither IP nor Ethernet re-orders packets that often in practice. The same is true for packet drop rates (the real issue for packet drop is the impact on performance and recovery times, which is why IB was not designed to work over long or diverse topologies where intermediate elements may see what might be termed a high packet loss rate). Mike -- greg ___ openib-general mailing list openib-general@openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
Re: [openib-general] basic IB doubt
At 12:53 PM 8/25/2006, Talpey, Thomas wrote: At 03:23 PM 8/25/2006, Greg Lindahl wrote: On Fri, Aug 25, 2006 at 03:21:20PM -0400, [EMAIL PROTECTED] wrote: I presume you meant invalidate the cache, not flush it, before accessing DMA'ed data. Yes, this is what I meant. Sorry! Flush (sync for_device) before posting. Invalidate (sync for_cpu) before processing. On some architectures, these operations flush and/or invalidate i/o pipeline caches as well. As they should. Many platforms have coherent I/O components so the explicit requirements on software to participate are often eliminated. Mike ___ openib-general mailing list openib-general@openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
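A minimal sketch of the sync discipline described above for a streaming DMA receive buffer on a non-coherent platform (function names are placeholders, not IPoIB code): sync "for device" before handing the buffer to the adapter, "for cpu" before the upper layer reads it after a completion.

#include <linux/dma-mapping.h>

static void rx_buf_post(struct device *dev, dma_addr_t addr, size_t len)
{
	/* flush CPU caches so the device sees current memory contents */
	dma_sync_single_for_device(dev, addr, len, DMA_FROM_DEVICE);
	/* ... hand the buffer to the HCA (post_recv) ... */
}

static void rx_buf_complete(struct device *dev, dma_addr_t addr, size_t len)
{
	/* invalidate stale cache lines before the CPU reads the payload */
	dma_sync_single_for_cpu(dev, addr, len, DMA_FROM_DEVICE);
	/* ... deliver the data to the upper layer ... */
}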
Re: [openib-general] basic IB doubt
At 10:45 AM 8/25/2006, Tom Tucker wrote: On Fri, 2006-08-25 at 12:51 -0400, Talpey, Thomas wrote: At 12:40 PM 8/25/2006, Sean Hefty wrote: Thomas> How does an adapter guarantee that no bridges or other Thomas> intervening devices reorder their writes, or for that Thomas> matter flush them to memory at all!? That's a good point. The HCA would have to do a read to flush the posted writes, and I'm sure it's not doing that (since it would add horrible latency for no good reason). I guess it's not safe to rely on ordering of RDMA writes after all. Couldn't the same point then be made that a CQ entry may come before the data has been posted? When the CQ entry arrives, the context that polls it off the queue must use the dma_sync_*() api to finalize any associated data transactions (known by the upper layer). This is basic, and it's the reason that a completion is so important. The completion, in and of itself, isn't what drives the synchronization. It's the transfer of control to the processor. This is a giant rat hole. On a coherent cache architecture, the CQE write posted to the bus following the write of the last byte of data will NOT be seen by the processor prior to the last byte of data. That is, write ordering is preserved in bridges. The dma_sync_* API has to do with processor cache, not transaction ordering. In fact, per this argument, at the time you called dma_sync_*, the processor may not have seen the reordered transaction yet, so what would it be syncing? Write ordering and read ordering/fence is preserved in intervening bridges. What you DON'T know is whether or not a write (which was posted and may be sitting in a bridge FIFO) has been flushed and/or propagated to memory at the time you submit the next write and/or interrupt the host. If you submit a READ following the write, however, per the PCI bus ordering rules you know that the data is in the target. Unless, of course, I'm wrong ... :-) A PCI read following a write to the same address will validate that all prior write transactions are flushed to host memory. This is one way that people have used (albeit with a performance penalty) to verify that a transaction is out of the HCA / RNIC fault zone and therefore an acknowledgement to the source means the data is safe and one can survive the HCA / RNIC failing without falling into a non-deterministic state. PCI writes are strongly ordered on any PCI technology offering. Relaxed ordering needs to be taken into account w.r.t. writes vs. reads, as well as read completions being weakly ordered. Mike ___ openib-general mailing list openib-general@openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
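The "read after write" flush described above is the same rule drivers lean on for MMIO: a non-posted read on the path forces all prior posted writes ahead of it to complete before the read returns. A tiny sketch with a hypothetical device register:

#include <linux/io.h>
#include <linux/types.h>

#define REG_DOORBELL 0x10	/* hypothetical device register offset */

static void ring_doorbell_flushed(void __iomem *base, u32 val)
{
	writel(val, base + REG_DOORBELL);
	/* non-posted read: completes only after the write above has landed */
	(void) readl(base + REG_DOORBELL);
}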
Re: [openib-general] basic IB doubt
At 09:50 AM 8/25/2006, Caitlin Bestler wrote: [EMAIL PROTECTED] wrote: Thomas How does an adapter guarantee that no bridges or other Thomas intervening devices reorder their writes, or for that Thomas matter flush them to memory at all!? That's a good point. The HCA would have to do a read to flush the posted writes, and I'm sure it's not doing that (since it would add horrible latency for no good reason). I guess it's not safe to rely on ordering of RDMA writes after all. Couldn't the same point then be made that a CQ entry may come before the data has been posted? That's why both specs (IBTA and RDMAC) are very explicit that all prior messages are complete before the CQE is given to the user. It is up to the RDMA Device and/or its driver to guarantee this by whatever means are appropriate. An implementation that allows a CQE post to pass the data placement that it is reporting on the PCI bus is in error. The critical concept of the Work Completion is that it consolidates guarantees and notificatins. The implementation can do all sorts of strange things that it thinks optimize *before* the work completion, but at the time the work completion is delivered to the user everything is supposed to be as expected. Caitlin's logic is correct and the basis for why these two specifications call out this issue. And yes, Roland, one cannot rely upon RDMA Write ordering whether for IB or iWARP. iWARP specifically allows out of order delivery. IB while providing in-order delivery due to its strong ordering protocol still has no guarantees when it comes to the memory controller and I/O technology being used. Given not everything was expected to operate over PCI, we made sure that the specifications pointed out these issues so that software would be designed to accommodate all interconnect attach types and usage models. We wanted to maximize the underlying implementation options while providing software with a consistent operating model to enable it to be simplified as well. Mike ___ openib-general mailing list openib-general@openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
Re: [openib-general] Multicast traffic performance of OFED 1.0 ipoib
Is the performance being measured on an identical topology and hardware set as before? Multicast by its very nature is sensitive to topology, hardware components used (buffer depth, latency, etc.) and workload occurring within the fabric. Loss occurs as a function of congestion or lack of forward progress resulting in a timeout and thus a toss of a packet. If the hardware is different or the settings chosen are changed, then the results would be expected to change. It is not clear what you hope to achieve with such tests as there will be other workloads flowing over the fabric which will create random HOL blocking which can result in packet loss. Multicast workloads should be tolerant of such loss. Mike At 04:30 AM 8/2/2006, Moni Levy wrote: Hi, we are doing some performance testing of multicast traffic over ipoib. The tests are performed by using iperf on dual 1.6G AMD PCI-X servers with PCI-X Tavor cards with 3.4.FW. Below are the commands that may be used to run the test.

Iperf server:
route add -net 224.0.0.0 netmask 240.0.0.0 dev ib0
/home/qa/testing-tools/iperf-2.0.2/iperf -us -B 224.4.4.4 -i 1

Iperf client:
route add -net 224.0.0.0 netmask 240.0.0.0 dev ib0
/home/qa/testing-tools/iperf-2.0.2/iperf -uc 224.4.4.4 -i 1 -b 100M -t 400 -l 100

We are looking for the max PPS rate (100 byte packet size) without losses, by changing the BW parameter and looking at the point where we get no losses reported. The best results we received were around 50k PPS. I remember that we got some 120k-140k packets of the same size running without losses. We are going to look into it and try to see where the time is spent, but any ideas are welcome. Best regards, Moni ___ openib-general mailing list openib-general@openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
Re: [openib-general] [openfabrics-ewg] OFED 1.1 release - schedule and features
At 03:49 PM 7/12/2006, Fabian Tillier wrote: Hi Mike, On 7/12/06, Michael Krause [EMAIL PROTECTED] wrote: At 09:48 AM 7/12/2006, Jeff Broughton wrote: Modifying the sockets API is just defining yet another RDMA API, and we have so many already I disagree. This effort has distilled the API to basically one for RDMA developers. Applications are supported over this via either MPI or Sockets. There's been a lot of effort to make the RDMA verbs easy to use. With the RDMA CM, socket-like connection semantics can be used to establish the connection between QPs. The connection establishment is the hard part - doing I/O is trivial in comparison. The verbs and RDMA CM have nothing to do with MPI. If an application is going to be RDMA aware, I don't see any reason it shouldn't just use the verbs directly and use the RDMA CM to establish the connections. What's your point? It seems you are in agreement that there is a single RDMA API that people can use. It seems rather self-limiting to think the traditional BSD synchronous Sockets API is all the world should be able to use when it comes to Sockets. Sockets developers could easily incorporate the extensions into their applications, providing them with improved designs and flexibility without having to learn about RDMA itself. Wait, you want applications to be able to register memory and issue RDMA operations, but not have to learn about RDMA? How does that make sense? The Sockets API extensions allow developers to register memory. That has been a desire by many when it comes to SDP or copy-avoidance technology as it optimizes the performance path by eliminating the need to do per-op registration. For the many applications which already know their working sets, they can use this to enable the OS and underlying infrastructure to take advantage of that fact to improve performance and the quality of the solution. The extensions provide the async communications and event collection mechanisms to also improve performance over the rather limiting select / poll supported by Sockets today. It currently does not support explicit RDMA but it is rather trivial to add such calls and remove the need to interject SDP if desired. The benefits of such new API extensions are there for those that want to eliminate one more ULP with the unfortunate IP cloud over its head. If the couple of calls necessary to extend this API to support direct RDMA would allow them to eliminate SDP entirely, well, that has benefits that go beyond just "it's all Sockets"; For a socket implementation to support RDMA, the socket must have an underlying RDMA QP. This means that if you want the application to not have to be verbs-aware, you can't really get rid of SDP - you're just extending SDP to let the application have a part in memory registration and RDMA, while still supporting the traditional BSD operations. This is IMO more complex than just letting applications interface directly with verbs, especially since the SDP implementation will size the QP for its own use, without a means for negotiating with the user so that you don't cause buffer overruns. Please take a look at the API extensions. I never stated that one gets rid of SDP unless one adds the RDMA-explicit calls. As for complexity, well, the goal is to extend to Sockets developers the optimal communication paradigm already available on OS such as Windows without having to live with the same unfortunate constraints imposed by the OS.
The same logic applies to extending the benefits derived from MPI, which supports async communications as well as put / get semantics which would be analogous to the additional RDMA interfaces I referenced. I find it strange that people would argue against improving the Sockets developer's tool suite when the benefits are already proven elsewhere within the industry and even within this open source effort. Giving the millions of Sockets developers the choice of a set of extensions that work over both RDMA and traditional network stacks seems like a no brainer. Trying to force them to use a native RDMA API even if semantically similar to Sockets seems like a poor path to pursue. Leave the RDMA API to the middleware providers and those that need to be close to the metal. it also eliminates the IP cloud that hovers over SDP licensing. Something that many developers and customers would appreciate. I believe that Microsoft's IP claims only apply to SDP over IB -- I don't believe SDP over iWarp is affected. I don't know how the RDMA verbs moving towards being hardware independent (wrt IB vs. iWarp) affects the IP claims, but it should certainly make things interesting if a single SDP code base can work over both IB and iWarp. SDP is SDP and it isn't just restricted to IB. I'll leave it to the lawyers to sort it out but having a single SDP with minor code execution path deltas for the IB-specifics isn't that hard to construct. It has been done on other OS already. Mike
Re: [openib-general] OFED 1.1 release - schedule and features
At 12:59 AM 7/12/2006, Tziporet Koren wrote: Scott Weitzenkamp (sweitzen) wrote: For SDP, I would like to see improved stability (maybe you have this in mind under beta quality), also how about AIO support? The rest of the list looks good. Yes - beta quality means improved stability. AIO is not planed for 1.1 (schedule issue). If needed we can add it to 1.2 Would be nice if people thought about implementing the Sockets API Extensions from the OpenGroup. They provide explicit memory management and async communications which will allow SDP performance to be fully exploited. The benefits go beyond what is found in AIO or on other OS such as Windows. If one were to extend slightly to have explicit RDMA Read and Write from the Sockets API, then it would be quite possible to eliminate SDP entirely for new applications leaving SDP strictly for legacy Sockets environments. Mike Tziporet ___ openib-general mailing list openib-general@openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general ___ openib-general mailing list openib-general@openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
Re: [openib-general] [openfabrics-ewg] OFED 1.1 release - schedule and features
At 09:48 AM 7/12/2006, Jeff Broughton wrote: Mike, The whole purpose of SDP is to make sockets go faster without having to have the applications modified. This is what the customers want. I've heard this time and time again, across a wide spectrum of customers. I am well aware of this. However, Linux / Unix do not support async communications which severely limits the potential performance benefits of SDP. When we wrote the SDP specification it was fully understood that optimal performance is achieved through async communications. We spent considerable time constructing SDP to support both synchronous and asynchronous communication paradigms which there are many applications that would benefit. Customers want to be able to use RDMA interconnects without recompilation and through the use of SDP and shared libraries this is certainly practical to execute. Developers however are not the same as customers and it is developers who would benefit from the Sockets extensions and this would in turn benefit customers. Modifying the sockets API is just defining yet another RDMA API, and we have so many already I disagree. This effort has distilled the API to basically one for RDMA developers. Applications are supported over this via either MPI or Sockets. It seems rather self limiting to think the traditional BSD synchronous Sockets API is all the world should be able to use when it comes to Sockets. Sockets developers could easily incorporate the extensions into their applications providing them with improved designs and flexibility without having to learn about RDMA itself. If the couple of calls necessary to extend this API to support direct RDMA would allow them to eliminate SDP entirely, well, that has benefits that go beyond just its all Sockets; it also eliminates the IP cloud that hovers over SDP licensing. Something that many developers and customers would appreciate. In the end, this effort could choose to progress Sockets technology and extend the number of developers and applications that can achieve optimal performance with only minor knowledge growth or they can live with the limitations of the BSD Sockets API and either accept performance loss or be forced to jump through the hoops of using other rather niche or obscure API to accomplish what is possible with a small number of Sockets extensions which were defined by people with years of experience implementing Sockets and working with application developers. Mike -Jeff From: [EMAIL PROTECTED] [ mailto:[EMAIL PROTECTED]] On Behalf Of Michael Krause Sent: Wednesday, July 12, 2006 9:23 AM To: Tziporet Koren; Scott Weitzenkamp (sweitzen) Cc: OpenFabricsEWG; openib Subject: Re: [openfabrics-ewg] [openib-general] OFED 1.1 release - schedule and features At 12:59 AM 7/12/2006, Tziporet Koren wrote: Scott Weitzenkamp (sweitzen) wrote: For SDP, I would like to see improved stability (maybe you have this in mind under beta quality), also how about AIO support? The rest of the list looks good. Yes - beta quality means improved stability. AIO is not planed for 1.1 (schedule issue). If needed we can add it to 1.2 Would be nice if people thought about implementing the Sockets API Extensions from the OpenGroup. They provide explicit memory management and async communications which will allow SDP performance to be fully exploited. The benefits go beyond what is found in AIO or on other OS such as Windows. 
If one were to extend slightly to have explicit RDMA Read and Write from the Sockets API, then it would be quite possible to eliminate SDP entirely for new applications leaving SDP strictly for legacy Sockets environments. Mike Tziporet ___ openib-general mailing list openib-general@openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general ___ openib-general mailing list openib-general@openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
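The Sockets API Extensions referred to above would let a developer register a buffer once and then post sends that complete asynchronously, which is the property SDP needs to reach its full performance. A minimal sketch of that programming model follows; the names (es_register, es_send_async, es_wait) are purely illustrative and are not the actual Open Group function names.

/* Illustrative only: hypothetical extended-sockets calls, not the real
 * Open Group Sockets API Extensions.  The point is the model: explicit
 * registration plus asynchronous completion, so no copies and no blocking. */
#include <stddef.h>

typedef struct es_mem es_mem_t;       /* handle for an explicitly registered buffer */
typedef struct es_event es_event_t;   /* completion of a previously posted request  */

int es_register(int sock, void *buf, size_t len, es_mem_t **mem);
int es_send_async(int sock, es_mem_t *mem, size_t len, void *cookie);
int es_wait(int sock, es_event_t *ev);        /* reap one completion */

static int send_block(int sock, void *buf, size_t len)
{
    es_mem_t *mem;
    es_event_t ev;

    if (es_register(sock, buf, len, &mem))    /* register / pin once        */
        return -1;
    if (es_send_async(sock, mem, len, buf))   /* post without copying       */
        return -1;
    /* ... overlap other work here ... */
    return es_wait(sock, &ev);                /* completion => buf reusable */
}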
Re: [openib-general] [PATCH v3 1/7] AMSO1100 Low Level Driver.
At 10:14 AM 6/23/2006, Grant Grundler wrote: On Fri, Jun 23, 2006 at 04:04:31PM +0200, Arjan van de Ven wrote: I thought the posted write WILL eventually get to adapter memory. Not stall forever cached in a bridge. I'm wrong? I'm not sure there is a theoretical upper bound I'm not aware of one either since MMIO writes can travel across many other chips that are not constrained by PCI ordering rules (I'm thinking of SGI Altix...) It is processor / coherency backplane technology specific as to the number of outstanding writes. There is also no guarantee that such writes will hit the top of the PCI hierarchy in the order they were posted in a multi-core / processor system. Hence, it is up to software to guarantee that ordering is preserved and to not assume anything about ordering from a hardware perspective. Once a transaction hits the PCI hierarchy, then the PCI ordering rules apply and depending upon the transaction type and other rules, what is guaranteed is deterministic in nature. (and if it's several msec per bridge, then you have a lot of latency anyway) That's what my original concern was when I saw you point this out. But MMIO reads here would be expensive and many drivers tolerate this latency in exchange for avoiding the MMIO read in the performance path. As the saying goes, MMIO Reads are pure evil and should be avoided at all costs if performance is the goal. Even in a relatively flat I/O hierarchy, the additional latency is non-trivial and can lead to a significant loss in performance for the system. Mike ___ openib-general mailing list openib-general@openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
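To make the software-ordering point concrete, the usual driver pattern is a wmb() between the descriptor update and the posted doorbell write, with the expensive MMIO read reserved for the rare case where a flush must be guaranteed. A minimal sketch follows; the chip_* structures and the doorbell register are made up for illustration and are not the AMSO1100 driver's real ones.

#include <linux/types.h>
#include <linux/io.h>

struct chip_desc { u64 addr; u32 len; u32 flags; };
struct chip_priv {
	struct chip_desc *ring;     /* descriptor ring in coherent DMA memory */
	void __iomem *db_reg;       /* adapter doorbell register              */
};

static void chip_post_work(struct chip_priv *priv, const struct chip_desc *desc, u32 idx)
{
	priv->ring[idx] = *desc;    /* make the work visible in host memory   */

	wmb();                      /* descriptor must be globally visible
	                             * before the adapter can see the doorbell */

	writel(idx, priv->db_reg);  /* posted MMIO write: cheap, but may linger
	                             * in bridges for an unbounded time        */
}

static void chip_flush_doorbell(struct chip_priv *priv)
{
	/* Only when software truly must know the doorbell has reached the
	 * device: a read flushes prior posted writes but stalls the CPU,
	 * so keep it out of the performance path. */
	(void)readl(priv->db_reg);
}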
Re: [openib-general] Mellanox HCAs: outstanding RDMAs
As one of the authors of IB and iWARP, I can say that both Roland and Todd's responses are correct and the intent of the specifications. The number of outstanding RDMA Reads is bounded and that is communicated during session establishment. The ULP can choose to be aware of this requirement (certainly when we wrote iSER and DA we were well aware of the requirement and we documented it as such in the ULP specs) and track it from above so that it does not see a stall, or it can stay ignorant and deal with the stall as a result. This is a ULP choice and has been intentionally done that way so that the hardware can be kept as simple and as low cost as possible while meeting the breadth of ULP needs that were used to develop these technologies. Tom, you raised this issue during iWARP's definition and the debate was conducted at least several times. The outcome of these debates is reflected in iWARP and remains aligned with IB. So, unless you really want to have the IETF and IBTA go and modify their specs, I believe you'll have to deal with the issue just as other ULPs are doing today and be aware of the constraint and write the software accordingly. The open source community isn't really the right forum to change iWARP and IB specifications at the end of the day. Build a case in the IETF and IBTA and let those bodies determine whether it is appropriate to modify their specs or not. And yes, it is a modification of the specs and therefore of the hardware implementations as well, to address any interoperability requirements that would result (the change proposed could fragment the hardware offerings as there are many thousands of devices in the market that would not necessarily support this change). Mike At 12:07 PM 6/6/2006, Talpey, Thomas wrote: Todd, thanks for the set-up. I'm really glad we're having this discussion! Let me give an NFS/RDMA example to illustrate why this upper layer, at least, doesn't want the HCA doing its flow control, or resource management. NFS/RDMA is a credit-based protocol which allows many operations in progress at the server. Let's say the client is currently running with an RPC slot table of 100 requests (a typical value). Of these requests, some workload-specific percentage will be reads, writes, or metadata. All NFS operations consist of one send from client to server, some number of RDMA writes (for NFS reads) or RDMA reads (for NFS writes), then terminated with one send from server to client. The number of RDMA read or write operations per NFS op depends on the amount of data being read or written, and also the memory registration strategy in use on the client. The highest-performing such strategy is an all-physical one, which results in one RDMA-able segment per physical page. NFS r/w requests are, by default, 32KB, or 8 pages typical. So, typically 8 RDMA requests (read or write) are the result. To illustrate, let's say the client is processing a multi-threaded workload, with (say) 50% reads, 20% writes, and 30% metadata such as lookup and getattr. A kernel build, for example. Therefore, of our 100 active operations, 50 are reads for 32KB each, 20 are writes of 32KB, and 30 are metadata (non-RDMA). To the server, this results in 100 requests, 100 replies, 400 RDMA writes, and 160 RDMA Reads. Of course, these overlap heavily due to the widely differing latency of each op and the highly distributed arrival times. But, for the example this is a snapshot of current load. 
The latency of the metadata operations is quite low, because lookup and getattr are acting on what is effectively cached data. The reads and writes, however, are much longer, because they reference the filesystem. When disk queues are deep, they can take many ms. Imagine what happens if the client's IRD is 4 and the server ignores its local ORD. As soon as a write begins execution, the server posts 8 RDMA Reads to fetch the client's write data. The first 4 RDMA Reads are sent, the fifth stalls, and stalls the send queue! Even when three RDMA Reads complete, the queue remains stalled; it doesn't unblock until the fourth is done and all the RDMA Reads have been initiated. But, what just happened to all the other server send traffic? All those metadata replies, and other reads which completed? They're stuck, waiting for that one write request. In my example, these number 99 NFS ops, i.e. 654 WRs! All for one NFS write! The client operation stream effectively became single threaded. What good is the rapid initiation of RDMA Reads you describe in the face of this? Yes, there are many arcane and resource-intensive ways around it. But the simplest by far is to count the RDMA Reads outstanding, and for the *upper layer* to honor ORD, not the HCA. Then, the send queue never blocks, and the operation streams never lose parallelism. This is what our NFS server does. As to the depth of IRD, this is a different calculation; it's a delay-bandwidth product of the RDMA Read stream. 4 is good for local, low latency connections. But over a
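A sketch of the upper-layer accounting Tom describes, in plain C: the ULP counts the RDMA Reads it has issued against the peer's advertised IRD and queues the excess itself, so the hardware send queue is never asked to hold a stalled Read. The names (ord_limiter, post_read_wr) are illustrative, not the NFS/RDMA server's actual code.

/* Illustrative only: ULP-level accounting of outstanding RDMA Reads. */
#include <stddef.h>

#define WAITQ_SIZE 256

struct read_req;                          /* one RDMA Read the ULP wants issued  */
void post_read_wr(struct read_req *r);    /* assumed: hands the WR to the verbs  */

struct ord_limiter {
	unsigned int limit;               /* peer's IRD, learned at connect time */
	unsigned int outstanding;         /* Reads currently on the wire         */
	struct read_req *waitq[WAITQ_SIZE];
	unsigned int head, tail;          /* ULP queue, *not* the HCA send queue */
};

void ulp_issue_read(struct ord_limiter *l, struct read_req *r)
{
	if (l->outstanding < l->limit) {
		l->outstanding++;
		post_read_wr(r);                       /* never exceeds IRD   */
	} else {
		l->waitq[l->tail++ % WAITQ_SIZE] = r;  /* park it above verbs */
	}
}

void ulp_read_done(struct ord_limiter *l)      /* called per Read completion */
{
	l->outstanding--;
	while (l->outstanding < l->limit && l->head != l->tail) {
		l->outstanding++;
		post_read_wr(l->waitq[l->head++ % WAITQ_SIZE]);
	}
}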
Re: [openib-general] IB MTU tunable for uDAPL and/or Intel MPI?
At 10:44 AM 6/9/2006, Scott Weitzenkamp (sweitzen) wrote: While we're talking about MTUs, is the IB MTU tunable in uDAPL and/or Intel MPI via env var or config file? Looks like Intel MPI 2.0.1 uses 2K for IB MTU like MVAPICH does in OFED 1.0 rc4 and rc6, I'd like to try 1K with Intel MPI. IB MTU should be set on a per-path basis by the SM. An application should examine the PMTU for a given path and take appropriate action; this really only applies to UD, as connected mode should automatically SAR (segment and reassemble) requests. Communicating PMTU to an application should not occur unless it is datagram based. The same is true for iWARP where TCP / IP takes care of the PMTU on behalf of the ULP / application. If you want to control PMTU, then do so via the SM directly, which was the intention of the architecture and specification. Mike Scott From: [EMAIL PROTECTED] [ mailto:[EMAIL PROTECTED]] On Behalf Of Scott Weitzenkamp (sweitzen) Sent: Thursday, June 08, 2006 4:38 PM To: Tziporet Koren; [EMAIL PROTECTED] Cc: openib-general Subject: RE: [openib-general] OFED-1.0-rc6 is available The MTU change undoes the changes for bug 81, so I have reopened bug 81 ( http://openib.org/bugzilla/show_bug.cgi?id=81). With rc6, PCI-X osu_bw and osu_bibw performance is bad, and PCI-E osu_bibw performance is bad. I've enclosed some performance data, look at rc4 vs rc5 vs rc6 for Cougar/Cheetah/LionMini. Are there other benchmarks driving the changes in rc6 (and rc4)? Scott Weitzenkamp SQA and Release Manager Server Virtualization Business Unit Cisco Systems OSU MPI: · Added mpi_alltoall fine tuning parameters · Added default configuration/documentation file $MPIHOME/etc/mvapich.conf · Added shell configuration files $MPIHOME/etc/mvapich.csh , $MPIHOME/etc/mvapich.csh · Default MTU was changed back to 2K for InfiniHost III Ex and InfiniHost III Lx HCAs. For the InfiniHost card the recommended value is: VIADEV_DEFAULT_MTU=MTU1024 ___ openib-general mailing list openib-general@openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
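For a connected QP the PMTU from the SA path record is simply programmed into the QP at connection setup rather than exposed as an application tunable. A rough libibverbs sketch follows; the remaining RTR attributes and their mask bits, noted in the comment, are assumed to be filled in by the real connection code.

#include <infiniband/verbs.h>

/* Sketch: apply the PMTU reported by the SA path record during the
 * INIT->RTR transition.  A real transition must also supply IBV_QP_AV,
 * IBV_QP_DEST_QPN, IBV_QP_RQ_PSN, IBV_QP_MAX_DEST_RD_ATOMIC and
 * IBV_QP_MIN_RNR_TIMER in the same ibv_modify_qp() call. */
static int apply_path_mtu(struct ibv_qp *qp, enum ibv_mtu path_rec_mtu)
{
	struct ibv_qp_attr attr = {
		.qp_state = IBV_QPS_RTR,
		.path_mtu = path_rec_mtu,   /* e.g. IBV_MTU_1024 or IBV_MTU_2048 */
	};

	return ibv_modify_qp(qp, &attr, IBV_QP_STATE | IBV_QP_PATH_MTU);
}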
Re: [openib-general] Re: Mellanox HCAs: outstanding RDMAs
Whether iWARP or IB, there is a fixed number of RDMA Requests allowed to be outstanding at any given time. If one posts more RDMA Read requests than the fixed number, the transmit queue is stalled. This is documented in both technology specifications. It is something that all ULPs should be aware of, and some go so far as to communicate that as part of the Hello / login exchange. This allows the ULP implementation to determine whether it wants to stall or wants to wait until Read Responses complete before sending another request. This isn't something silent; this isn't something new; this is something for the ULP implementation to decide how to deal with. BTW, this is part of the hardware and associated specifications so it is up to software to deal with the limited hardware resources and the associated consequences. Please keep in mind that there are a limited number of RDMA Request / Atomic resource slots at the receiving HCA / RNIC. These are kept in hardware, thus one must know the exact limit to avoid creating protocol problems. A ULP transmitter may post to the transmit queue more than the allotted slots but the transmitting (source) HCA / RNIC must not issue them to the remote. These requests do cause the source to stall. This is a well understood problem, and if people give the iSCSI / iSER, DA, or SDP specs a good read they can see that this issue is comprehended. I agree with people that ULP designers / implementers must pay close attention to this constraint as it is in the iWARP / IB specifications for a very good reason, and these semantics must be preserved to maintain the ordering requirements that are used by the overall RDMA protocols themselves. Mike At 05:24 AM 6/6/2006, Talpey, Thomas wrote: At 03:43 AM 6/6/2006, Michael S. Tsirkin wrote: Quoting r. Talpey, Thomas [EMAIL PROTECTED]: Semantically, the provider is not required to provide any such flow control behavior by the way. The Mellanox one apparently does, but it is not a requirement of the verbs, it's a requirement on the upper layer. If more RDMA Reads are posted than the remote peer supports, the connection may break. This does not sound right. Isn't this the meaning of this field: Initiator Depth: Number of RDMA Read / atomic operations outstanding at any time? Shouldn't any provider enforce this limit? The core spec does not require it. An implementation *may* enforce it, but is not *required* to do so. And as pointed out in the other message, there are repercussions of doing so. I believe the silent queue stalling is a bit of a time bomb for upper layers, whose implementers are quite likely unaware of the danger. I greatly prefer an implementation which simply sends the RDMA Read request, resulting in a failed (but unblocked!) connection. Silence is a very dangerous thing, no matter how helpful the intent. Tom. ___ openib-general mailing list openib-general@openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
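For reference, the bound Mike describes is exchanged at connection establishment; with the OpenIB CMA the ULP states it explicitly in the connection parameters. A sketch, assuming cm_id has already resolved its address and route:

#include <rdma/rdma_cma.h>
#include <string.h>

/* Sketch: advertise how many RDMA Reads we will keep in flight toward the
 * peer (initiator_depth) and how many inbound Reads we can serve at once
 * (responder_resources).  The ULP must then track its own issued Reads
 * against the value the peer accepted, or the send queue will stall. */
static int connect_with_read_limits(struct rdma_cm_id *cm_id)
{
	struct rdma_conn_param param;

	memset(&param, 0, sizeof(param));
	param.initiator_depth     = 4;
	param.responder_resources = 4;
	param.retry_count         = 7;
	param.rnr_retry_count     = 7;

	return rdma_connect(cm_id, &param);
}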
Re: [openib-general] QoS RFC - Resend using a friendly mailer
High-level feedback:
- An IB fabric could be used for a single ULP and still require QoS. The issue is how to differentiate flows on a given shared element within the fabric.
- QoS controls must be dynamic. The document references initialization as the time when decisions are made, but obviously that is just a first pass on use of the fabric and not what it will become in potentially a short period of time.
- QoS also involves multi-path support (not really touched upon in terms of specifics in this document). Distributing or segregating work, even if for the same ULP, should be done across multiple or distinct paths. In one sense this may complicate the work, but in another it is simpler in that arbitration controls for shared links become easier to manage if the number of flows is reduced.
- IP over IB defines a multicast group which is ultimately a spanning tree. That should not constrain what paths are used to communicate between endnode pairs. That only defines the multicast paths, which are not strongly ordered relative to the unicast traffic. Further, IP over IB may operate using the RC mode between endnodes. It is very simple to replicate RC and then segregate these into QoS domains (one could just align priority with 802.1p for simplicity and practical execution) which can in turn flow over shared or distinct paths.
- IB is a centrally managed fabric. Adding the SID into records and such really isn't going to help solve the problem unless there is also a centralized management entity well above IB that can prioritize communication service rates for different ULPs and endnode pairs. Given most of these centralized management entities are rather ignorant of IB at the moment, this presents a chicken-and-egg dilemma which is further complicated by developing SOA technology. It might be more valuable in one sense to examine SOA technology and how it is translating itself to, say, Ethernet and then see how this can be leveraged for IB.
- QoS needs to examine the sums of the consumers of a given path and their service rate requirements. It isn't just about setting a priority level but also about the packet injection rate to the fabric at that priority. This needs to be taken into account as well.
Overall, it is not clear to me what the end value of this document is. The challenge for any network admin is to translate SOA-driven requirements into fabric control knob settings. Without such translation algorithms / understanding, it is not clear that there is anything truly missing in the IBTA spec suite or that this RFC will really advance the integration of IB into the data center in a truly meaningful manner. Mike
At 07:53 AM 5/30/2006, Eitan Zahavi wrote: To: OPENIB openib-general@openib.org Subject: QoS RFC - Resend using a friendly mailer --text follows this line-- Hi All Please find the attached RFC describing how QoS policy support could be implemented in the OpenFabrics stack. Your comments are welcome. Eitan RFC: OpenFabrics Enhancements for QoS Support === Authors: . Eitan Zahavi [EMAIL PROTECTED] Date: May 2006. Revision: 0.1 Table of contents: 1. Overview 2. Architecture 3. Supported Policy 4. CMA functionality 5. IPoIB functionality 6. SDP functionality 7. SRP functionality 8. iSER functionality 9. OpenSM functionality 1. Overview Quality of Service requirements stem from the realization of I/O consolidation over an IB network: As multiple applications and ULPs share the same fabric, means to control their use of the network resources are becoming a must. 
The basic need is to differentiate the service levels provided to different traffic flows. Such that a policy could be enforced and control each flow utilization of the fabric resources. IBTA specification defined several hardware features and management interfaces to support QoS: * Up to 15 Virtual Lanes (VL) could carry traffic in a non-blocking manner * Arbitration between traffic of different VL is performed by a 2 priority levels weighted round robin arbiter. The arbiter is programmable with a sequence of (VL, weight) pairs and maximal number of high priority credits to be processed before low priority is served * Packets carry class of service marking in the range 0 to 15 in their header SL field * Each switch can map the incoming packet by its SL to a particular output VL based on programmable table VL=SL-to-VL-MAP(in-port, out-port, SL) * The Subnet Administrator controls each communication flow parameters by providing them as a response to Path Record query The IB QoS features provide the means to implement a DiffServ like architecture. DiffServ architecture (IETF RFC2474 2475) is widely used today in highly dynamic fabrics. This proposal provides the detailed functional definition for the various software elements that are required to enable a DiffServ like architecture over the OpenFabrics software stack. 2. Architecture This proposal split
Re: [openib-general] re RDS missing features
At 10:42 AM 5/1/2006, Ranjit Pandit wrote: On 5/1/06, Or Gerlitz [EMAIL PROTECTED] wrote: Can you elaborate on each of the features, specifically the following points are of interest to us: +1 so are you running Oracle Loopback traffic over RDS sockets? if yes, what is the issue here? the openib CMA supports listen/connect on loopback addresses (eg 127.0.0.1 or IPoIB local address) Yes. There is no issue. It's just next in line for me to implement. +2 by failover, are you referring to APM? that is failover between IB paths to/from the same HCA over which the original connection/QP was established or are you talking about failover between HCAs Failover within and across HCAs. APM does not work for failover across HCAs. That is because there are two different types of failover being discussed. APM is completely transparent to the IB RC connections, thus there is no disruption or loss of data. Failover across HCAs is in effect replaying ULP transactions across a new RC connection. Without an application / ULP level acknowledgement, there is still a hole in the RDS proposal that has been raised and acknowledged in the past as existing, as recently as the Sonoma get together. I still have not seen a response to my inquiry about these ULP and API changes being at least comprehended beyond the Oracle usage model and perhaps being reviewed within the IETF, given they represent changes in API and communication semantics. If the goal is to have RDS be a generic service then it should be reviewed and validated by other potential consumers as well as those subsystems that may be impacted. Mike +3 is the no support for /proc like for RDS an issue to run crload or demo Oracle (that is specific tuning and usage of non defaults is needed for any/optimal operation) No, this does not affect core functionality. You should be able to run Oracle or crload without this feature. That was a list of things that still need to be implemented for GA and not just demo Or. [openfabrics-ewg] Before we can start testing - we need to ensure that RDS is fully ported. Pandit, Ranjit rpandit at silverstorm.com Following features are yet to be implemented in OpenFabric Rds: 1. Failover 2. Loopback connections 3. support for /proc fs like Rds config, stats and info. Ranjit ___ openib-general mailing list openib-general@openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
Re: [openib-general] re RDS missing features
Given this is an extension to Sockets, should it not also be reviewed by the Sockets owners? What about the API itself? Any plans to make this portable to other OS / endnodes or have a spec and associated wire protocol that is reviewed perhaps in the IETF so it is applicable to more than just Oracle? It seems this really should be standardized within the IETF to gain broad adoption and insure it will be interoperable across all implementations not just OpenFabric's. At 10:42 AM 5/1/2006, Ranjit Pandit wrote: On 5/1/06, Or Gerlitz [EMAIL PROTECTED] wrote: Can you elaborate on each of the features, specifically the following points are of interest to us: +1 so you running Oracle Loopback traffic over RDS sockets? if yes, what the issue here? the openib CMA supports listen/connect on loopback addresses (eg 127.0.0.1 or IPoIB local address) Yes. There is no issue. It's just next in line for me to implement. +2 by failover, are you referring to APM? that is failover between IB pathes to/from the same HCA over which the original connection/QP was established or you are talking on failover between HCAs Failover within and across HCAs. APM does not work for failover across HCAs. For OpenFabric, one would need to have this work across RNIC as well. APM is not part of iWARP so can't be relied upon. +3 is the no support for /proc like for RDS an issue to run crload or demo Oracle (that is specific tuning and usage of non defaults is needed for any/optimal operation) No, this does not affect core functionality. You should be able to run Oracle or crload without this feature. That was a list of things that still need to be implemented for GA and not just demo Or. [openfabrics-ewg] Before we can start testing - we needto ensure that RDS is fully ported. Pandit, Ranjit rpandit at silverstorm.com Following features are yet to be implemented in OpenFabric Rds: 1. Failover 2. Loopback connections 3. support for /proc fs like Rds config, stats and info. Ranjit ___ openib-general mailing list openib-general@openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general ___ openib-general mailing list openib-general@openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general ___ openib-general mailing list openib-general@openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
Re: [openib-general] mthca FMR correctness (and memory windows)
At 05:10 PM 3/20/2006, Fabian Tillier wrote: On 3/20/06, Talpey, Thomas [EMAIL PROTECTED] wrote: Ok, this is a longer answer. At 06:08 PM 3/20/2006, Fabian Tillier wrote: As to using FMRs to create virtually contiguous regions, the last data I saw about this related to SRP (not on OpenIB), and resulted in a gain of ~25% in throughput when using FMRs vs the full frontal DMA MR. So there is definitely something to be gained by creating virtually contiguous regions, especially if you're doing a lot of RDMA reads for which there's a fairly low limit to how many can be in flight (4 comes to mind). 25% throughput over what workload? And I assume this was with the lazy deregistration method implemented with the current fmr pool? What was your analysis of the reason for the improvement - if it was merely reducing the op count on the wire, I think your issue lies elsewhere. This was a large block read workload (since HDDs typically give better read performance than write). It was with lazy deregistration, and the analysis was that the reduction of the op count on the wire was the reason. It may well have to do with how the target chose to respond, though, and I have no idea how that side of things was implemented. It could well be that performance could be improved without going with FMRs. Quite often performance is governed by the target more than the initiator, as it is in turn governed by its local cache and disk mechanism performance / capacity. Large data movements typically are a low op count from the initiator's perspective; therefore it seems a bit odd to state that performance can be dramatically impacted by the op count on the wire. Also, see previous paragraph - if your SRP is fast but not safe, then only fast but not safe applications will want to use it. Fibre channel adapters do not introduce this vulnerability, but they go fast. I can show you NFS running this fast too, by the way. Why can't Fibre Channel adapters, or any locally attached hardware for that matter, DMA anywhere in memory? Unless the chipset somehow protects against it, doesn't locally attached hardware have free rein over DMA? As a general practice, future volume I/O chipsets across multiple market segments will implement an IOMMU to restrict where DMA is allowed. Both AMD and Intel have recently announced specifications to this effect which reflect what has been implemented in many non-x86 chipset offerings. Whether a given OS always requires this protection to be enabled is implementation-specific but it is something that many within the industry and customer base require. Mike Also, please don't take my anecdotal benchmark results as an endorsement of the Mellanox FMR design - the data was presented to me by Mellanox as a reason to add FMR support to the Windows stack (which currently uses the full frontal approach due to limitations of the verbs API and how it needs to be used for storage). I never had a chance to look into why the gains were so large, and it could be either the SRP target implementation, a hardware limitation, or a number of other issues, especially since a read workload results in RDMA Writes from the target to the host which can be pipelined much deeper than RDMA Reads. 
- Fab ___ openib-general mailing list openib-general@openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general ___ openib-general mailing list openib-general@openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
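For reference, the gain discussed above comes from collapsing a per-page scatter list into a single registration; with the kernel FMR pool that looks roughly like the following sketch (error handling and the surrounding SRP structures are omitted, and exact field names may differ by kernel version).

#include <linux/err.h>
#include <rdma/ib_verbs.h>
#include <rdma/ib_fmr_pool.h>

/* Sketch: map a physical page list as one virtually contiguous region via
 * the FMR pool (lazy deregistration), so a large I/O needs one RDMA
 * operation instead of one per page. */
static struct ib_pool_fmr *map_pages(struct ib_fmr_pool *pool,
				     u64 *page_list, int npages, u64 iova)
{
	struct ib_pool_fmr *fmr;

	fmr = ib_fmr_pool_map_phys(pool, page_list, npages, iova);
	if (IS_ERR(fmr))
		return NULL;

	/* fmr->fmr->rkey now describes the whole page list as a single
	 * segment for the remote peer to use. */
	return fmr;
}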
Re: [openib-general] Re: RFC: e2e credits
At 03:31 AM 3/24/2006, Hal Rosenstock wrote: On Thu, 2006-03-23 at 12:34, Michael S. Tsirkin wrote: Quoting r. Hal Rosenstock [EMAIL PROTECTED]: Sean, just to wrap it up, the API at the verbs layer will look like the below, and then ULPs just put the value they want in the CM and CM will pass it in to the low level. So this is our question, right? CM REQ and REP messages include the following field: --- 12.7.26 END-TO-END FLOW CONTROL Signifies whether the local CA actually implements End-to-End Flow Control (1), or instead always advertises 'invalid credits' (0). See section 9.7.7.2 End-to-End (Message Level) Flow Control for more detail. --- Consider an implementation that advertises valid credits for some connections, and always advertises invalid credits for other connections. This is compliant since the IB spec says (end-to-end (message level) flow control, Requester Behaviour): Even a responder which does generate end-to-end credits may choose to send the 'invalid' code in the AETH I did some spec reading to find this and found the following which I think makes the current requirement clear: p.347 line 37 states HCA receive queues must generate end-to-end credits (except for QPs associated with a SRQ), but TCA receive queues are not required to do so. This appears to be informative text. I first found the following: p. 348 has the compliances for this for both HCAs and TCAs: C9-150.2.1: For QPs that are not associated with an SRQ, each HCA receive queue shall generate end-to-end flow control credits. If a QP is associated with an SRQ, the HCA receive queue shall not generate end-to-end flow control credits. o9-95.2.1: Each TCA receive queue may generate end-to-end credits except for QPs that are associated with an SRQ. If a TCA supports SRQ, the TCA must not generate End-to-End Flow Control Credits for QPs associated with an SRQ. C9-151: If a TCA's given receive queue generates End-to-End credits, then the corresponding send queue shall receive and respond to those credits. This is a requirement on each send queue of a CA. The above informative text also references the CA requirements in chapter 17 and on p. 1026 line 25 there is a row in the table for end to end flow control for RC consistent with the above. p.1028 has the compliances for this. Is it compliant for CM implementations to set/clear the End-to-End Flow Control field accordingly, taking it to mean whether the local CA actually implements End-to-End Flow Control (1), or instead always advertises 'invalid credits'(0) *for the specific connection*? So IMO the intent of what was written is clear (on a per CA basis) and this is a spec change which is OK to propose but needs a different writeup. The spec was written to minimize the impact to TCAs in terms of e2e credits. HCAs are expected to use e2e credits all the time, sans SRQ, which is a special multiplexing case where e2e isn't that beneficial / logical to support. Each connection must negotiate this per CA pair as no single policy was deemed practical across all usage models. I don't think there would be much support for changing the spec. BTW, iWARP does not support e2e credits as it relies upon the ULP to advertise and track its buffer usage. It was therefore deemed unnecessary. Mike ___ openib-general mailing list openib-general@openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
Re: [openib-general] mthca FMR correctness (and memory windows)
At 04:30 PM 3/20/2006, Talpey, Thomas wrote: At 06:00 PM 3/20/2006, Sean Hefty wrote: Can you provide more details on this statement? When are you fencing the send queue when using memory windows? Infiniband 101, and VI before it. Memory windows fence later operations on the send queue until the bind completes. It's a misguided attempt to make upper layers' job easier because they can post a bind and then immediately post a send carrying the rkey. In reality, it introduces bubbles in the send pipeline and reduces op rates dramatically. The requirement / semantics were derived from the ULP being used to construct the technology. The combination of a bind-n-send operation was to reduce the software interactions with the device by consolidating this into a combo operation. I do not follow your logic that this creates a bubble in the send pipeline as there were also ordering and correctness issues w.r.t. subsequent operations to the send. The bind-n-send is a single operation and its fence semantics were required to allow the bind to complete before informing the remote of the subsequent information in order to avoid race conditions. I argued against them in iWARP verbs, and lost. If Linux could introduce a way to make the fencing behavior optional, I would lead the parade. I fear most hardware is implemented otherwise. Hardware generally implements operations in the order they are posted to a given QP, i.e. it is a serial execution flow that allows pipelined operations to be posted and executed by the hardware. Scaling is achieved by executing across a set of QP and thus a set of resources. The ordering domain requirements are kept simple to allow low-cost hardware implementations. This does not preclude software from executing across a set of QP in any order that it desires. Yes, I know about binding on a separate queue. That doesn't work, because windows are semantically not fungible (for security reasons). You could always simply allow a region to be accessible across multiple operations but then again storage argued that it must only be accessible for a single op thus things like FMR, bind-n-send, etc. were all created. To say that storage was not listened to or their needs were not met or balanced against what is practical to implement in either the creation of IB or iWARP is simply incorrect. Mike Tom. ___ openib-general mailing list openib-general@openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general ___ openib-general mailing list openib-general@openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
Re: [openib-general] IPoIB and lid change
At 09:43 AM 2/10/2006, Grant Grundler wrote: On Fri, Feb 10, 2006 at 11:05:34AM -0500, Hal Rosenstock wrote: Hi, Roland! One issue we have with IPoIB is that IPoIB may cache a remote node path for a long time. Remote LID may get changed e.g. if the SM is changed, and IPoIB might lose connectivity. I wonder if this is why when I reload the IB drivers on one node I sometimes have to reload them on other nodes too. Otherwise ping over IPoIB doesn't work. If endnodes are not periodically refreshing their caches or are not subscribing to event management to be informed that a refresh is in order, then endnodes will fall out of sync and would need to be restarted to re-establish communication. This is a classic problem that was illustrated in various early router protocols and is why today's protocols in many cases implement a two-pronged approach - limited cache lifetime and proactive cache event updates. The remote LID may get changed for other reasons too without an SM change (SM merge of 2 separate subnets). How can this be handled ? Isn't this just another case of the SM changing for one of the subnets? An SM merge that involves updating LIDs is a non-trivial event. It requires connections to be effectively restarted, as one cannot ascertain whether all packets are flushed from the fabric otherwise - that can cause silent data corruption. For a subsystem such as IP over IB, a LID update should result in an unsolicited ARP / ND exchange which will cause all remote endnodes to receive the new information. Mike ___ openib-general mailing list openib-general@openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
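A minimal sketch of the two-pronged approach described above (the structures are illustrative, not the actual IPoIB path cache): each cached path carries an expiry time, and an SM/LID-change event handler invalidates entries proactively instead of waiting for the timer.

/* Illustrative only: not the real IPoIB path cache. */
#include <stdbool.h>
#include <stdint.h>
#include <time.h>

#define PATH_LIFETIME_SEC 60          /* prong 1: limited cache lifetime */

struct path_entry {
	uint16_t dlid;                /* cached remote LID */
	time_t   expires;
	bool     valid;
};

static bool path_usable(const struct path_entry *p)
{
	/* Expired or invalidated entries force a fresh SA path query. */
	return p->valid && time(NULL) < p->expires;
}

static void on_sm_event(struct path_entry *cache, int n)
{
	/* Prong 2: an SM change or LID reassignment event proactively drops
	 * every cached path rather than waiting for the lifetime to lapse. */
	for (int i = 0; i < n; i++)
		cache[i].valid = false;
}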
Re: [dat-discussions] [openib-general] [RFC] DAT2.0immediatedataproposal
At 03:36 PM 2/8/2006, Arlin Davis wrote: Roland Dreier wrote: Michael: So, here we have a long discussion on attempting to perpetuate a concept that is not universal across transports and was deemed to have minimal value that most wanted to see removed from the architecture. But this discussion is being driven by an application developer who does see value in immediate data. Arlin, can you quantify the benefit you see from RDMA write with immediate vs. RDMA write followed by a send? We need speed and simplicity. A very latency sensitive application that requires immediate notification of RDMA write completion on the remote node without ANY latency penalties associated with combining operations, HCA priority rules across QPs, wire congestion, etc. An application that has no requirement for messaging outside of remote rdma write completion notifications. The application would not have to register and manage additional message buffers on either side; we can just size the queues accordingly and post zero-byte messages. We need something that would be equivalent to sitting there polling on the last byte of inbound data. But, since data ordering within an operation is not guaranteed, that is not an option. So, rdma with immediate data is the most optimal and simple method for indication of RDMA-write completion that we have available today. In fact, I would like to see it increased in size to make it even more useful. RDMA Write with Immediate carries its immediate data in an IB extended transport header. It is a fixed-size (32-bit) quantity and not one subject to change, i.e. increasing its size is not an option. Your argument above reinforces that the particular application need is IB-specific and thus should not be part of a general API but a transport-specific API. If the application will only operate optimally using immediate data, then it is only suitable for an IB fabric. This reinforces the need for a transport-specific API. Those applications that simply want to enable completion notification when an RDMA Write has occurred can use a general purpose API that is interconnect independent and whose code is predicated upon an RDMA Write - Send set of operations. This will enable application portability across all interconnect types. Mike ___ openib-general mailing list openib-general@openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
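For reference, the two alternatives being compared look like this with libibverbs (a sketch: the SGEs, rkey and remote address are assumed to be set up already). The first form is IB-only; the second, an RDMA Write followed by a Send on the same QP, is the portable pattern Mike recommends.

#include <infiniband/verbs.h>
#include <arpa/inet.h>
#include <string.h>

/* IB-only: RDMA Write that also consumes a receive WQE at the peer and
 * delivers 32 bits of immediate data in the peer's completion. */
static int write_with_imm(struct ibv_qp *qp, struct ibv_sge *sge,
			  uint64_t raddr, uint32_t rkey, uint32_t imm)
{
	struct ibv_send_wr wr, *bad;

	memset(&wr, 0, sizeof(wr));
	wr.opcode              = IBV_WR_RDMA_WRITE_WITH_IMM;
	wr.imm_data            = htonl(imm);
	wr.sg_list             = sge;
	wr.num_sge             = 1;
	wr.send_flags          = IBV_SEND_SIGNALED;
	wr.wr.rdma.remote_addr = raddr;
	wr.wr.rdma.rkey        = rkey;
	return ibv_post_send(qp, &wr, &bad);
}

/* Portable (IB and iWARP): RDMA Write, then a Send carrying the
 * notification; posted to the same QP so they arrive in order. */
static int write_then_send(struct ibv_qp *qp, struct ibv_sge *data_sge,
			   uint64_t raddr, uint32_t rkey, struct ibv_sge *msg_sge)
{
	struct ibv_send_wr wwr, swr, *bad;

	memset(&wwr, 0, sizeof(wwr));
	wwr.opcode              = IBV_WR_RDMA_WRITE;
	wwr.sg_list             = data_sge;
	wwr.num_sge             = 1;
	wwr.wr.rdma.remote_addr = raddr;
	wwr.wr.rdma.rkey        = rkey;
	wwr.next                = &swr;

	memset(&swr, 0, sizeof(swr));
	swr.opcode     = IBV_WR_SEND;
	swr.sg_list    = msg_sge;
	swr.num_sge    = 1;
	swr.send_flags = IBV_SEND_SIGNALED;

	return ibv_post_send(qp, &wwr, &bad);
}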
Re: [openib-general] Re: [PATCH] [RFC] - example user moderdmaping/pongprogram using CMA
At 11:04 AM 2/8/2006, Michael S. Tsirkin wrote: Quoting r. Steve Wise [EMAIL PROTECTED]: Subject: Re: [openib-general] Re: [PATCH] [RFC] - example user moderdmaping/pongprogram using CMA On Wed, 2006-02-08 at 19:10 +0200, Michael S. Tsirkin wrote: Quoting r. Sean Hefty [EMAIL PROTECTED]: Subject: RE: [openib-general] Re: [PATCH] [RFC] - example user mode rdmaping/pongprogram using CMA Steve, looks like you have at most a single receive work request posted at the receive workqueue at all times. If true, this is *really* not a good idea, performance-wise, even if you actually have at most 1 packet in flight. Can you provide some more details on this? See 9.7.7.2 end-to-end (message level) flow control I just read this section in the 1.2 version of the spec, and I still don't understand what the issue really is. 9.7.7.2 talks about IBA doing flow control based on the RECV WQEs posted. rping always ensures that there is a RECV posted before the peer can send. This is ensured by the rping protocol itself (see the comment at the front of rping.c describing the ping loop). I'm only ever sending one outstanding message via SEND/RECV. I would rather post exactly what is needed, than post some number of RECVs just to be safe. Sorry if I'm being dense. What am I missing here? Steve. As far as I know, the credits are only updated by the ACK messages. If there is a single work request outstanding on the RQ, the ACK of the SEND message will have the credit field value 0 (since exactly one receive WR was outstanding, and that is now consumed). As a result the remote side will think that there are no receive WQEs and will slow down (what the spec refers to as being WQE limited). Correct. The ACK / NAK protocol used by IB is what returns credits. In order to pipeline and improve performance, you must post multiple receive work requests to account for the expected round-trip time of the fabric and the associated CA processing. Mike ___ openib-general mailing list openib-general@openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
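A minimal libibverbs sketch of what posting multiple receives means in practice: keep a small ring of receive buffers posted and repost each slot as its completion is reaped, so the ACKs always advertise spare credits (buffer and MR setup are assumed to have been done already).

#include <infiniband/verbs.h>
#include <stdint.h>
#include <string.h>

#define RX_DEPTH 16   /* enough to cover the fabric round trip */

static int post_one_recv(struct ibv_qp *qp, struct ibv_mr *mr,
			 char *buf, size_t len, uint64_t slot)
{
	struct ibv_sge sge = {
		.addr   = (uintptr_t)buf,
		.length = len,
		.lkey   = mr->lkey,
	};
	struct ibv_recv_wr wr, *bad;

	memset(&wr, 0, sizeof(wr));
	wr.wr_id   = slot;
	wr.sg_list = &sge;
	wr.num_sge = 1;
	return ibv_post_recv(qp, &wr, &bad);
}

/* Keep RX_DEPTH receives outstanding; call post_one_recv() for a slot
 * again each time that slot's completion is polled from the CQ. */
static int prime_receive_queue(struct ibv_qp *qp, struct ibv_mr *mr,
			       char *bufs, size_t each)
{
	for (uint64_t i = 0; i < RX_DEPTH; i++)
		if (post_one_recv(qp, mr, bufs + i * each, each, i))
			return -1;
	return 0;
}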
Re: [openib-general] Re: [PATCH] [RFC] - example user moderdmaping/pongprogram using CMA
At 11:35 AM 2/8/2006, Steve Wise wrote: I just read this section in the 1.2 version of the spec, and I still don't understand what the issue really is? 9.7.7.2 talks about IBA doing flow control based on the RECV WQEs posted. rping always ensures that there is a RECV posted before the peer can send. This is ensured by the rping protocol itself (see the comment at the front of rping.c describing the ping loop). I'm only ever sending one outstanding message via SEND/RECV. I would rather post exactly what is needed, than post some number of RECVs just to be safe. Sorry if I'm being dense. What am I missing here? Steve. As far as I know, the credits are only updated by the ACK messages. If there is a single work request outstanding on the RQ, the ACK of the SEND message will have the credit field value 0 (since exactly one receive WR was outstanding, and that is now consumed). As a result the remote side will think that there are no receive WQEs and will slow down (what the spec refers to as being WQE limited). Oh. I understand now. This is an issue with only 1 RQ WQE posted and how IB tries to inform the peer transport of the WQE count. For iWARP, none of this transport-level flow control happens (and I'm more familiar with iWARP than IB). For iWARP, we decided not to implement application receiver based flow control due to two items: TCP provides transport-level flow control (IB does not provide the equivalent per se), and upon examination of the majority of the ULPs, they exchange and track the number of receive buffers allowed to be processed, thus there is no need to replicate this in iWARP. There are some subtleties as well between a message-based transport and a byte stream such as TCP that go into the equation but these are not that important for most application writers to deal with. Mike ___ openib-general mailing list openib-general@openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
RE: [dat-discussions] [openib-general] [RFC] DAT2.0immediatedataproposal
At 09:16 PM 2/6/2006, Sean Hefty wrote: The requirement is to provide an API that supports RDMA writes with immediate data. A send that follows an RDMA write is not immediate data, and the API should not be constructed around trying to make it so. To be clear, I believe that write with immediate should be part of the normal APIs, rather than an extension, but should be designed around those devices that provide it natively. One thing to keep in mind is that the IBTA workgroup responsible for the transport wanted to eliminate immediate data support entirely but it was retained solely to enable VIA application migration (even though the application base was quite small). If that requirement could have been eliminated, then it would have been gone in a heartbeat. Given that an RDMA-WRITE followed by a SEND provides the same application semantics based on the use models, iWARP chose not to support immediate data. So, here we have a long discussion on attempting to perpetuate a concept that is not universal across transports and was deemed to have minimal value that most wanted to see removed from the architecture. One has to question the value of trying to develop any API / software to support immediate data instead of just enabling the preferred method, which is RDMA WRITE - SEND. I agree with those who have contended that this is difficult to do in a general purpose fashion. When all of this is taken into account, it seems the only good engineering answer is to eliminate immediate data support from the software and focus on the method that works across all interconnects. Mike ___ openib-general mailing list openib-general@openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
Re: [openib-general] [ANNOUNCE] Contribute RDS(ReliableDatagramSockets) to OpenIB
At 12:49 PM 11/14/2005, Nitin Hande wrote: Michael Krause wrote: At 01:01 PM 11/11/2005, Nitin Hande wrote: Michael Krause wrote: At 10:28 AM 11/9/2005, Rick Frank wrote: Yes, the application is responsible for detecting lost msgs at the application level - the transport can not do this. RDS does not guarantee that a message has been delivered to the application - just that once the transport has accepted a msg it will deliver the msg to the remote node in order without duplication - dealing with retransmissions, etc due to sporadic / intermittent msg loss over the interconnect. If after accepting the send - the current path fails - then RDS will transparently fail over to another path - and if required will resend / send any already queued msgs to the remote node - again insuring that no msg is duplicated and they are in order. This is no different than APM - with the exception that RDS can do this across HCAs. The application - Oracle in this case - will deal with detecting a catastrophic path failure - either due to a send that does not arrive and or a timedout response or send failure returned from the transport. If there is no network path to a remote node - it is required that we remove the remote node from the operating cluster to avoid what is commonly termed as a split brain condition - otherwise known as a partition in time. BTW - in our case - the application failure domain logic is the same whether we are using UDP / uDAPL / iTAPI / TCP / SCTP / etc. Basically, if we can not talk to a remote node - after some defined period of time - we will remove the remote node from the cluster. In this case the database will recover all the interesting state that may have been maintained on the removed node - allowing the remaining nodes to continue. If later on, communication to the remote node is restored - it will be allowed to rejoin the cluster and take on application load. Please clarify the following which was in the document provided by Oracle. On page 3 of the RDS document, under the section RDP Interface, the 2nd and 3rd paragraphs are state: * RDP does not guarantee that a datagram is delivered to the remote application. * It is up to the RDP client to deal with datagrams lost due to transport failure or remote application failure. The HCA is still a fault domain with RDS - it does not address flushing data out of the HCA fault domain, nor does it sound like it ensures that CQE loss is recoverable. I do believe RDS will replay all of the sendmsg's that it believes are pending, but it has no way to determine if already sent sendmsgs were actually successfully delivered to the remote application unless it provides some level of resync of the outstanding sends not completed from an application's perspective as well as any state updated via RDMA operations which may occur without an explicit send operation to flush to a known state. If RDS could define a mechanism that the application could use to inform the sender to resync and replay on catastrophic failure, is that a correct understanding of your suggestion ? I'm not suggesting anything at this point. I'm trying to reconcile the documentation with the e-mail statements made by its proponents. I'm still trying to ascertain whether RDS completely recovers from HCA failure (assuming there is another HCA / path available) between the two endnodes Reading at the doc and the thread, it looks like we need src/dst port for multiplexing connections, we need seq/ack# for resyncing, we need some kind of window availability for flow control. 
Aren't we very close to a TCP header? TCP does not provide an end-to-end acknowledgement to the application as implemented by most OSes. Unless one ties the TCP ACK to the application's consumption of the receive data, there is no method to ascertain that the application really received the data. The application would be required to send its own application-level acknowledgement. I believe the intent is for applications to remain responsible for the end-to-end receipt of data and that RDS and the interconnect are simply responsible for the exchange at the lower levels. Yes, a TCP ack only implies that the remote stack has received the data, and means nothing to the application. It is the application which has to send an application-level ack to its peer. TCP ACK was intended to be an end-to-end ACK but implementations took it to be a lower-level ACK only. A TCP stack linked into an application, as demonstrated by multiple IHVs and research efforts, does provide an end-to-end ACK and considerable performance improvements over the traditional network stack implementations. Some claim it is more than good enough to eliminate the need for protocol off-load / RDMA, which is true for many applications (certainly for most Sockets, etc.) but not true when one takes advantage of the RDMA comms paradigm, which has benefit for a number of applications. Mike ___ openib-general mailing list openib-general@openib.org http://openib.org
Re: [openib-general] [ANNOUNCE] Contribute RDS(ReliableDatagramSockets) to OpenIB
At 10:28 AM 11/9/2005, Rick Frank wrote: Yes, the application is responsible for detecting lost msgs at the application level - the transport can not do this. RDS does not guarantee that a message has been delivered to the application - just that once the transport has accepted a msg it will deliver the msg to the remote node in order without duplication - dealing with retransmissions, etc due to sporadic / intermittent msg loss over the interconnect. If after accepting the send - the current path fails - then RDS will transparently fail over to another path - and if required will resend / send any already queued msgs to the remote node - again insuring that no msg is duplicated and they are in order. This is no different than APM - with the exception that RDS can do this across HCAs. The application - Oracle in this case - will deal with detecting a catastrophic path failure - either due to a send that does not arrive and or a timedout response or send failure returned from the transport. If there is no network path to a remote node - it is required that we remove the remote node from the operating cluster to avoid what is commonly termed as a split brain condition - otherwise known as a partition in time. BTW - in our case - the application failure domain logic is the same whether we are using UDP / uDAPL / iTAPI / TCP / SCTP / etc. Basically, if we can not talk to a remote node - after some defined period of time - we will remove the remote node from the cluster. In this case the database will recover all the interesting state that may have been maintained on the removed node - allowing the remaining nodes to continue. If later on, communication to the remote node is restored - it will be allowed to rejoin the cluster and take on application load. Please clarify the following which was in the document provided by Oracle. On page 3 of the RDS document, under the section RDP Interface, the 2nd and 3rd paragraphs are state: * RDP does not guarantee that a datagram is delivered to the remote application. * It is up to the RDP client to deal with datagrams lost due to transport failure or remote application failure. The HCA is still a fault domain with RDS - it does not address flushing data out of the HCA fault domain, nor does it sound like it ensures that CQE loss is recoverable. I do believe RDS will replay all of the sendmsg's that it believes are pending, but it has no way to determine if already sent sendmsgs were actually successfully delivered to the remote application unless it provides some level of resync of the outstanding sends not completed from an application's perspective as well as any state updated via RDMA operations which may occur without an explicit send operation to flush to a known state. I'm still trying to ascertain whether RDS completely recovers from HCA failure (assuming there is another HCA / path available) between the two endnodes. Mike ___ openib-general mailing list openib-general@openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
Re: [openib-general] [ANNOUNCE] Contribute RDS (ReliableDatagramSockets) to OpenIB
At 02:09 PM 11/9/2005, Greg Lindahl wrote: On Wed, Nov 09, 2005 at 01:57:06PM -0800, Michael Krause wrote: What you indicate above is that RDS will implement a resync of the two sides of the association to determine what has been successfully sent. More accurate to say that it could implement that. I'm just kibbutzing on someone else's proposal. This then implies that the reliability of the underlying interconnect isn't as critical per se as the end-to-end RDS protocol will assure that data is delivered to the RDS components in the face of hardware failures. Correct? Yes. That's the intent that I see in the proposal. The implementation required to actually support this may not be what the proposers had in mind. If it is to be reasonably robust, then RDS should be required to support the resync between the two sides of the communication. This aligns with the stated objective of implementing reliability in one location in software and one location in hardware. Without such resync being required in the ULP, then one ends up with a ULP that falls shorts of its stated objectives and pushes complexity back up to the application which is where the advocates have stated it is too complex or expensive to get it correct. This sort of message service, by the way, has a long history in distributed computing. Yep. Mike ___ openib-general mailing list openib-general@openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
RE: [openib-general] [ANNOUNCE] Contribute RDS ( ReliableDatagramSockets) to OpenIB
At 10:48 AM 11/10/2005, Caitlin Bestler wrote: Mike Krause wrote in response to Greg Lindahl: If it is to be reasonably robust, then RDS should be required to support the resync between the two sides of the communication. This aligns with the stated objective of implementing reliability in one location in software and one location in hardware. Without such resync being required in the ULP, then one ends up with a ULP that falls short of its stated objectives and pushes complexity back up to the application, which is where the advocates have stated it is too complex or expensive to get it correct. I haven't reread all of the RDS fine print to double-check this, but my impression is that RDS semantics exactly match the subset of MPI point-to-point communications where the receiving rank is required to have pre-posted buffers before the send is allowed. My concern is the requirement that RDS resync the structures in the face of failure and know whether to re-transmit or how to deal with duplicates. Having pre-posted buffers will help enable the resync to be accomplished, but pre-posting should not be equated with being able to deal with duplicates or to verify that duplicates do not occur. Mike ___ openib-general mailing list openib-general@openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
Re: [openib-general] [ANNOUNCE] Contribute RDS (ReliableDatagramSockets) to OpenIB
At 12:37 PM 11/8/2005, Hal Rosenstock wrote: On Tue, 2005-11-08 at 15:33, Ranjit Pandit wrote: Using APM is not useful because it doesn't provide failover across HCA's. Can't APM be made to work across HCAs ? No. It requires state that is only within the HCA and there are other aspects that prevent this, e.g. no single unified QP space across all HCAs, etc. Mike ___ openib-general mailing list openib-general@openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
Re: [openib-general] [ANNOUNCE] Contribute RDS (ReliableDatagramSockets) to OpenIB
At 12:33 PM 11/8/2005, Ranjit Pandit wrote: Mike wrote: - RDS does not solve a set of failure models. For example, if a RNIC / HCA were to fail, then one cannot simply replay the operations on another RNIC / HCA without extracting state, etc. and providing some end-to-end sync of what was really sent / received by the application. Yes, one can recover from cable or switch port failure by using APM style recovery but that is only one class of faults. The harder faults either result in the end node being cast out of the cluster or see silent data corruption unless additional steps are taken to transparently recover - again app writers don't want to solve the hard problems; they want that done for them. The current reference implementation of RDS solves the HCA failure case as well. Since applications don't need to keep connection states, it's easier to handle cases like HCA and intermediate path failures. As far as the application is concerned, every sendmsg 'could' result in a new connection setup in the driver. If the current path fails, RDS reestablishes a connection, if available, on a different port or a different HCA, and replays the failed messages. Using APM is not useful because it doesn't provide failover across HCA's. I think others may disagree about whether RDS solves the problem. You have no way of knowing whether something was received or not into the other node's coherency domain without some intermediary's or the application's involvement to confirm the data arrived. As such, you might see many hardware-level acks occur and not know there is a real failure. If an application takes any action assuming that send complete means it is delivered, then it is subject to silent data corruption. Hence, RDS can replay to its heart's content, but until there is an application or middleware level of acknowledgement, you have not solved the fault domain issues. Some may be happy with this as they just cast out the endnode from the cluster / database but others see the loss of a server as a big deal so may not be happy to see this occur. It really comes down to whether you believe losing a server is worthwhile just for a local failure event which is not fatal to the rest of the server. APM's value is the ability to recover from link failure. It has the same value for any other ULP in that it recovers transparently to the ULP. Mike ___ openib-general mailing list openib-general@openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
Re: [openib-general] [ANNOUNCE] Contribute RDS (ReliableDatagramSockets) to OpenIB
At 11:42 AM 11/9/2005, Greg Lindahl wrote: On Tue, Nov 08, 2005 at 01:08:13PM -0800, Michael Krause wrote: If an application takes any action assuming that send complete means it is delivered, then it is subject to silent data corruption. Right. That's the same as pretty much all other *transport* layers. I don't think anyone's asserting RDS is any different: you can't assume the other side's application received and acted on your message until the other side's application tells you that it did. So, things like HCA failure are not transparent and one cannot simply replay the operations since you don't know what was really seen by the other side unless the application performs the resync itself. Hence, while RDS can attempt to retransmit, the application must deal with duplicates, etc. or note the error, resync, and retransmit to avoid duplicates. BTW, host-based transport implementations can transparently recover from device failure on behalf of applications since their state is in the host and not in the failed device - this is true for networking, storage, etc. HCA / RNIC / TOE / FC / etc. all lose state or cannot be trusted and thus must rely upon upper-level software to perform the recovery, resync, retransmission, etc. Unless RDS has implemented its own state checkpoint between endnodes, this class of failures must be solved by the application since it cannot be solved in the hardware. Hence, RDS may push some of its reliability requirements to the interconnect, but it does not eliminate all reliability requirements from the application or RDS itself. Mike ___ openib-general mailing list openib-general@openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
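A note on the duplicate-handling point above: the kind of application-level sequencing being described is small but cannot be skipped. The following is a minimal sketch in C of one way a consumer above RDS could filter transport-level replays after a failover; the header layout, field names, and function names are hypothetical illustrations, not part of RDS or of any code discussed in this thread.

#include <stdint.h>

/* Hypothetical application-level header carried inside each RDS datagram. */
struct app_hdr {
        uint64_t seq;   /* sender-assigned, monotonically increasing */
};

struct app_peer_state {
        uint64_t next_tx_seq;   /* next sequence number this side will send   */
        uint64_t last_rx_seq;   /* highest sequence delivered to the consumer */
};

/*
 * Called for every datagram received from a peer.  Returns 1 if the payload
 * should be delivered, 0 if it is a duplicate caused by a transport-level
 * replay (for example RDS resending queued messages after failing over to
 * another port or HCA).
 */
static int app_accept(struct app_peer_state *st, const struct app_hdr *h)
{
        if (h->seq <= st->last_rx_seq)
                return 0;               /* duplicate - drop silently        */
        st->last_rx_seq = h->seq;
        return 1;                       /* new data - deliver, then ack it  */
}

An end-to-end acknowledgement (not shown) still has to flow back before the sender may treat a message as delivered; a send completion alone is not that acknowledgement, which is exactly the silent-corruption point being argued.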
Re: [openib-general] [ANNOUNCE] Contribute RDS(ReliableDatagramSockets) to OpenIB
At 10:28 AM 11/9/2005, Rick Frank wrote: Yes, the application is responsible for detecting lost msgs at the application level - the transport cannot do this. RDS does not guarantee that a message has been delivered to the application - just that once the transport has accepted a msg it will deliver the msg to the remote node in order without duplication - dealing with retransmissions, etc. due to sporadic / intermittent msg loss over the interconnect. If, after accepting the send, the current path fails - then RDS will transparently fail over to another path - and if required will resend / send any already queued msgs to the remote node - again ensuring that no msg is duplicated and they are in order. This is no different than APM - with the exception that RDS can do this across HCAs. The application - Oracle in this case - will deal with detecting a catastrophic path failure - either due to a send that does not arrive and/or a timed-out response or send failure returned from the transport. If there is no network path to a remote node - it is required that we remove the remote node from the operating cluster to avoid what is commonly termed a split-brain condition - otherwise known as a partition in time. BTW - in our case - the application failure domain logic is the same whether we are using UDP / uDAPL / iTAPI / TCP / SCTP / etc. Basically, if we cannot talk to a remote node - after some defined period of time - we will remove the remote node from the cluster. In this case the database will recover all the interesting state that may have been maintained on the removed node - allowing the remaining nodes to continue. If later on, communication to the remote node is restored - it will be allowed to rejoin the cluster and take on application load. One might still be able to talk to the remote node across another HCA, but that does not mean one has an understanding of the state at the remote node unless the failure is noted and a resync of state occurs, or the remote is able to deal with duplicates, etc. This has nothing to do with the API or the transport involved but, as Caitlin noted, the difference between knowing a send buffer is free vs. knowing that the application received the data requested. Therefore, one has only reduced the reliability / robustness problem space to some extent but has not solved it by the use of RDS. Mike
Re: [openib-general] [ANNOUNCE] Contribute RDS (ReliableDatagramSockets) to OpenIB
At 01:24 PM 11/9/2005, Greg Lindahl wrote: On Wed, Nov 09, 2005 at 12:18:28PM -0800, Michael Krause wrote: So, things like HCA failure are not transparent and one cannot simply replay the operations since you don't know what was really seen by the other side unless the application performs the resync itself. I think you are over-stating the case. On the remote end, the kernel piece of RDS knows what it presented to the remote application, ditto on the local end. If only an HCA fails, and not the sending and receiving kernels or applications, that knowledge is not lost. Perhaps you were assuming that RDS would be implemented only in firmware on the HCA, and there is no kernel piece that knows what's going on. I hadn't seen that stated by anyone, and of course there are several existing and contemplated OpenIB devices that are considerably different from the usual offload engine. You could also choose to implement RDS using an offload engine and still keep enough state in the kernel to recover. I hadn't assumed anything. I'm simply trying to understand the assertions concerning availability and recovery. What you indicate above is that RDS will implement a resync of the two sides of the association to determine what has been successfully sent. It will then retransmit what has not been delivered, transparently to the application. This then implies that the reliability of the underlying interconnect isn't as critical per se, as the end-to-end RDS protocol will assure that data is delivered to the RDS components in the face of hardware failures. Correct? Mike ___ openib-general mailing list openib-general@openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
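If the kernel piece of RDS does keep that knowledge on both ends, the resync being asked about reduces to a small exchange after a failover: each side reports the highest sequence number it actually handed to its consumer, and the sender replays everything newer from its unacknowledged queue. A sketch, with hypothetical structure and function names:

#include <stdint.h>

/* Hypothetical resync message exchanged by the two RDS endpoints after a
 * path or HCA failover, before normal traffic resumes. */
struct rds_resync {
        uint64_t highest_delivered;   /* last seq handed to the local consumer */
};

/*
 * Sender side: given the peer's report, pick the first sequence number to
 * replay.  Because the receiver filters on sequence number, replaying from
 * here cannot surface duplicates to the application above.
 */
static uint64_t first_seq_to_replay(const struct rds_resync *peer)
{
        return peer->highest_delivered + 1;
}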
Re: [swg] Re: [openib-general] round 2 - proposal for socket based connectionmodel
Just to correct one comment: A ULP written to TCP/IP can use an RDMA transport without change. An example is SDP, though that does not mean a ULP must use what SDP uses. Also, please keep in mind that SDP on iWARP uses the port mapper protocol to obtain the IP address and port to target for the connection request. So, the TCP connection establishment is to the RDMA listen endpoint from the start, and the SDP hello exchange then fills in the rest of the parameters required to determine whether the connection should proceed and what resources should be configured when the response is generated. I will also re-iterate what another person stated, and that is to separate out the interface from the wire protocol. The IBTA defines wire protocols / semantics while OpenIB is defining its API to communicate the wire protocol and associated semantics. I agree with that person on this point and their other point on the need for the IBTA to construct a solid spec for the wire protocol and associated semantics. OpenIB will then determine how best to implement, but these are separate efforts and it would be more productive for all to table the discussion for now. The original request was whether something would break if the private data size was changed. It was noted that one cannot know what will or will not break; thus the requirement is to provide a method for software to note the difference in the layout. How is left for the IBTA to specify. Just a thought.. Mike At 03:43 PM 10/25/2005, Sean Hefty wrote: Kanevsky, Arkady wrote: What are you trying to achieve? I'm trying to define a connection *service* for Infiniband that uses TCP/IP addresses as its user interface. That service will have its own protocol, in much the same way that SDP, SRP, etc. do today. I am trying to define an IB REQ protocol extension that supports IP connection 5-tuple exchange between connection requestor and responder. Why? What need is there for a protocol extension to the IB CM? To me, this is similar to setting a bit in the CM REQ to indicate that the private data format looks like SDP's private data. The format of the _private_ data shouldn't be known to the CM; that's why it's private data. And define mapping between IP 5-tuple and IB entities. No mapping between IP and IB addresses was defined in the proposal. Defining this mapping is required to make this work. Right now, the mapping is the responsibility of every user. That way a ULP which was written to TCP/IP, UDP/IP, SCTP/IP (and so on) can use an RDMA transport without change. A ULP written to TCP/IP can use an RDMA transport without change. They use SDP. However, an application that wants to take advantage of QP semantics must change. (And if they want to take full advantage of RDMA, they'll likely need to be re-architected as well.) The goal in that case becomes to permit them to establish connections using TCP/IP addresses. To meet this goal, we need to define how to map IP addresses to and from IB addresses. That mapping is part of the protocol, and is missing from the proposal. And if the application isn't going to know that it's running on Infiniband, then the mapping must also include mapping to a destination service ID. To modify a ULP to know that it runs on top of IB vs. iWARP vs. (any other RDMA transport) is a bad idea. It is one thing to choose the proper port to connect. It is completely different to ask a ULP to parse private data in a transport-specific way. The same protocol must support both user level ULPs and kernel level ULPs.
Defining an interface that allows a ULP to use either iWarp, IB, or some other random RDMA transport is an implementation issue. However, it requires something that maps IP to IB addresses (including service IDs). To be more concrete, you've gone from having source and destination TCP/IP addresses to including them in a CM REQ. What translated the source and destination IP addresses into GIDs and a PKey? Who converted those into IB routing information? How was the destination of the CM REQ determined? What service ID was selected? - Sean ___ openib-general mailing list openib-general@openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
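For reference, the mapping Sean is asking about is essentially what the RDMA CM / CMA work discussed elsewhere in this archive ended up providing. Below is a sketch of an active-side connect driven purely by an IP address, using the librdmacm calls as they later shipped; the wait_for() helper is hypothetical, QP creation via rdma_create_qp() and all error cleanup are omitted, and the timeout values are arbitrary.

#include <rdma/rdma_cma.h>
#include <netinet/in.h>
#include <string.h>

/* Hypothetical helper: blocks on the event channel until the given
 * RDMA_CM_EVENT_* arrives and acknowledges it. */
extern void wait_for(struct rdma_event_channel *ch, enum rdma_cm_event_type ev);

int connect_by_ip(struct rdma_event_channel *ch, struct sockaddr_in *dst)
{
        struct rdma_cm_id *id;
        struct rdma_conn_param param;

        if (rdma_create_id(ch, &id, NULL, RDMA_PS_TCP))
                return -1;

        /* Resolves the destination IP to a GID / PKey and binds a local device. */
        if (rdma_resolve_addr(id, NULL, (struct sockaddr *) dst, 2000))
                return -1;
        wait_for(ch, RDMA_CM_EVENT_ADDR_RESOLVED);

        /* Performs the SA path record query on IB fabrics. */
        if (rdma_resolve_route(id, 2000))
                return -1;
        wait_for(ch, RDMA_CM_EVENT_ROUTE_RESOLVED);

        /* The service ID is derived from the port space and destination port;
         * a QP would normally be created on the id before this call. */
        memset(&param, 0, sizeof param);
        return rdma_connect(id, &param);
}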
RE: [openib-general] TCP/IP connection service over IB
At 12:50 PM 10/21/2005, Fab Tillier wrote: From: James Lentini [ mailto:[EMAIL PROTECTED]] Sent: Friday, October 21, 2005 12:38 PM On Fri, 21 Oct 2005, Sean Hefty wrote: sean version(8) | reserved(8) | src port (16) version(1) | reserved(1) | src port (2) sean src ip (16) sean dst ip (16) sean user private data (56) /* for version 1 */ Are the numbers in parens in bytes or bits? It looks like a mixture to me. Uhm, they were a mix. Changed above to bytes. Ok. I assume that your 1 byte of version information is broken into two 4-bit pieces, one for the protocol version and one for the IP version. Doesn't leading-zero-padding the IPv4 addresses to 16 bytes eliminate the need for an IP version field? Not really. The same logic was used in the SDP port mapper for iWARP, where there was still an IP version provided so that the space remained constant while the end node would know how to parse the message. Mike ___ openib-general mailing list openib-general@openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
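Spelled out as a byte-accurate structure, the layout being discussed (1-byte version, 1-byte reserved, 2-byte source port, two 16-byte addresses, 56 bytes left to the user) exactly fills the 92 bytes of REQ private data. The struct and field names below are illustrative only, not from any specification:

#include <stdint.h>

/* Illustrative packing of the discussed CM REQ private data layout:
 * 1 + 1 + 2 + 16 + 16 + 56 = 92 bytes. */
struct cm_ip_private_data {
        uint8_t  version;        /* high nibble: protocol ver, low nibble: IP ver */
        uint8_t  reserved;
        uint16_t src_port;       /* network byte order */
        uint8_t  src_ip[16];     /* IPv6, or IPv4 zero-padded to 16 bytes */
        uint8_t  dst_ip[16];
        uint8_t  user_data[56];  /* remaining space handed to the ULP */
} __attribute__((packed));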
[openib-general] Re: [swg] Re: private data...
This is really an IBTA issue to resolve and to ensure that backward compatibility with existing applications is maintained. Hence, this exercise of who is broken or not is inherently flawed in that one cannot comprehend all implementations that may exist. Therefore, the spec should use either a new version number or a reserved bit to indicate whether there is a defined format to the private data portion or not. This is no different than what is done in other technologies such as PCIe. Those applications that require the existing semantics will be confined to the existing associated infrastructure. Those that want the new IP semantics set the bit / version and operate within the restricted private data space available. It is that simple. Mike At 07:31 AM 10/20/2005, Jimmy Hill wrote: A Linux uDAPL-based system infrastructure application I am working on at IBM currently depends on 64 bytes of Private Data for Connect and Accept as well. -- jimmy Oracle currently depends on 64 bytes of private data for connect and accept. - Original Message - From: Kanevsky, Arkady To: Davis, Arlin R ; [EMAIL PROTECTED] ; Grant Grundler Cc: [EMAIL PROTECTED] ; openib-general@openib.org Sent: Wednesday, October 19, 2005 11:31 AM Subject: RE: [dat-discussions] RE: [openib-general] Re: iWARP emulation protocol Arlin, just to clarify, Intel MPI will not have problems with using less than 64 bytes of private data. If a solution will provide you with 48 bytes of private data, will it be sufficient? Arkady Arkady Kanevsky email: [EMAIL PROTECTED] Network Appliance phone: 781-768-5395 375 Totten Pond Rd. Fax: 781-895-1195 Waltham, MA 02451-2010 central phone: 781-768-5300 -Original Message- From: Davis, Arlin R [ mailto:[EMAIL PROTECTED]] Sent: Wednesday, October 19, 2005 11:30 AM To: [EMAIL PROTECTED] ; Grant Grundler Cc: [EMAIL PROTECTED] ; openib-general@openib.org Subject: RE: [dat-discussions] RE: [openib-general] Re: iWARP emulation protocol Arkady, Intel MPI (a real consumer of uDAPL) has no problem with this change. -arlin From: [EMAIL PROTECTED] [ mailto:[EMAIL PROTECTED]] On Behalf Of Kanevsky, Arkady Sent: Wednesday, October 19, 2005 6:40 AM To: Grant Grundler; Caitlin Bestler Cc: Roland Dreier; [EMAIL PROTECTED]; [EMAIL PROTECTED]; openib-general@openib.org Subject: [dat-discussions] RE: [openib-general] Re: iWARP emulation protocol Grant, The developers of the application(s) in question are aware of the discussion. I will leave it to them to respond. I bring the discussion points to the weekly DAT Collaborative meeting, which we have every Wednesday. I apologize that the DAT Collaborative charter does not allow submitting contributions without joining the DAT Collaborative. But this is no different from Linux not accepting any contributions without a proper license. But rest assured that as Chair I bring the concerns and suggestions stated in the email discussion to the DAT meetings. Arkady -Original Message- From: Grant Grundler [mailto:[EMAIL PROTECTED] ] Sent: Tuesday, October 18, 2005 8:02 PM To: Caitlin Bestler Cc: Grant Grundler; Roland Dreier; Kanevsky, Arkady; [EMAIL PROTECTED]; [EMAIL PROTECTED]; openib-general@openib.org Subject: Re: [openib-general] Re: iWARP emulation protocol On Tue, Oct 18, 2005 at 04:40:54PM -0700, Caitlin Bestler wrote: Roland (and the rest of us) would like to see someone name a real consumer of the proposed interface.
i.e., who depends on this change? Then the dependency for that use/user can be discussed and appropriate tradeoffs made. Make sense? Unfortunately not every application that is under development, or even deployed, can be discussed in a google-searchable public forum. That especially applies to user-mode development. Well, this is open source. While I don't want to preclude closed source development, it's usually necessary to have an open source consumer that any open source developer can test with. So I could have actually tested such applications and still not be free to cite them here. Understood. I'm not asking *you* to cite one unless you happen to own one of the consumers. With any luck some of them are following the discussion and will jump in on their own. Unfortunately, since they are developing to uDAPL they are unlikely to be following this discussion. It doesn't help that the DAT yahoo-groups.com mailing list is rejecting my replies. It would be helpful if someone following this forum could share Roland's question with the DAT mailing list if it didn't make it there already and possibly explain why naming a consumer is necessary. hth, grant
Re: [openib-general] I/O controllers
At 10:41 PM 10/18/2005, Mohit Katiyar, Noida wrote: Hi all, Can anyone tell me whether there is a specific I/O controller for the connection between the TCA and SCSI devices, or will any I/O controller work between the TCA and SCSI devices? See various IB vendors for their offerings, which include attachment to various I/O device types. Their web pages contain plenty of appropriate information. Mike ___ openib-general mailing list openib-general@openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
Re: [openib-general] IB and FC
These types of discussions should be taken up with IB technology / OEM vendors directly as they have nothing to do with development. Mike At 06:28 AM 10/15/2005, Mohit Katiyar, Noida wrote: Hi all, Sorry, the previous mail got scrapped due to HTML pictures, so now with text pictures. I just can't clear up a doubt about IB. In the first figure given below, the max speed that can be obtained between the client and the I/O storage is 2 Gb/s. [Text figure 1: each client is connected to two FC switches, which connect over FC cables to the I/O storage.] While in the figure given below, the client to IB-FC gateway speed is 10 Gb/s, from the gateway to the I/O storage it is 2 Gb/s, and if port aggregation is applied at the gateway then 4 Gb/s. So the total effective speed from client to I/O storage can at most reach 4 Gb/s. [Text figure 2: each client is connected over IB cables through switches to an IB-FC gateway/router, which connects over FC cables to the I/O storage.] So can anyone explain to me whether I am correct in my approach? Are there any other advantages in shifting from the figure 1 architecture to the figure 2 architecture? It does not seem advantageous to shift from an FC SAN to an IB FC SAN through such a pattern. Can anyone help me in deciding about this?? Thanks in advance Mohit Katiyar ___ openib-general mailing list openib-general@openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
RE: [openib-general] [RFC] IB address translation using ARP
At 03:14 PM 10/12/2005, Caitlin Bestler wrote: -Original Message- From: [EMAIL PROTECTED] [ mailto:[EMAIL PROTECTED]] On Behalf Of Sean Hefty Sent: Wednesday, October 12, 2005 2:36 PM To: Michael Krause Cc: openib-general@openib.org Subject: Re: [openib-general] [RFC] IB address translation using ARP Michael Krause wrote: 1. Applications want to use existing API to identify remote endnodes / services. To clarify, the applications want to use IP based addressing to identify remote endnotes. The connection API is under development. No, I think Mike's comment was dead on. Applications want to use the existing API. They want to use the existing API even when the API is clearly defective. Note that there are several generations of host-resolution APIs for the IP world, with the earlier ones clearly being heavily inferior (not thread safe, not IPv4/IPv6 neutral, etc). But they have not been eliminated. Why, because applications want to use the existing API. If application developers were rationale and totally open to adopt new ideas instantly then the active side would ask to make a connection to a *service*, not to a host with a service qualifier. A new API may be under development to meet new needs. But keep in mind that the application developers expect it to be as close to what they are used to as possible, and will grumble that it is not 100% compatible. This all comes down to economics which is why some ULP such as SDP are created. Let's examine SDP for a moment. The purpose of SDP to enable synchronous and asynchronous Sockets applications to transparently run unmodified over a RDMA capable interconnect. Unmodified means no source code changes and no recompile required (this is possible if the Sockets library is a shared library and dynamically linked). The first part of unmodified means that the existing address / service resolution API calls work (further, no change to the address family, etc. is required to make this work either). Hence, pick any of the get* API calls that are in use today and they should just work. How does this work? The SDP implementation takes on the burden for the application developer. For iWARP, there really isn't anything special that has to be done as these calls all should provide the necessary information. The port mapper protocol would be invoked which would map to the actual RDMA listen QP and target RNIC. For IB, there is some additional work both in using SID as well as resolving the IP address to the IB address vector but the work isn't that hard to implement (we know this because this has all been implemented on various OS within the industry). The same will be true for NFS/RDMA and iSER - again all use the existing interfaces to identify the address / service and map to an address vector (and again, all of this has been implemented on various OS within the industry). The above makes ISV and customers very happy as they can take advantage of RDMA technologies without having to go through the lengthy and expensive qualification process that comes when any application is modified / recompiled. This keeps costs low and improves TTM. As for the RDMA connection API, that is simply attempting to abstract to a common interface that any ULP implementation can use to access either iWARP or IB. The RDMA connection API should not be viewed as something end application developers will use but towards middleware developers. This allows everyone to use IP addresses, port spaces, etc. 
through the existing application API while allowing RDMA to transparently add some intelligence to the process and eventually enable new capabilities like policy management (e.g. how best to map ULP QoS needs to a given path, service rate,etc.) without permuting everything above. Keeping things transparent is best for all. Attempting to require end application developers to modify their code will result in slower adoption and reduced utilization of RDMA technologies within the industry. It really is all about economics and re-using the existing ecosystem / infrastructure. Mike ___ openib-general mailing list openib-general@openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
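The "unmodified application" point is easiest to see with an ordinary sockets client: nothing below mentions SDP or IB, yet the same source can be carried over SDP by an interposing library (for example a libsdp-style preload shim), with no recompile needed for a dynamically linked binary. The host and port are placeholders.

#include <sys/types.h>
#include <sys/socket.h>
#include <netdb.h>
#include <string.h>
#include <unistd.h>

/* An ordinary TCP client: address / service resolution via getaddrinfo(),
 * then connect().  With an SDP shim interposed, the same calls end up on an
 * SDP/RDMA connection with no source changes. */
int open_conn(const char *host, const char *port)
{
        struct addrinfo hints, *res, *ai;
        int fd = -1;

        memset(&hints, 0, sizeof hints);
        hints.ai_family   = AF_UNSPEC;
        hints.ai_socktype = SOCK_STREAM;

        if (getaddrinfo(host, port, &hints, &res))
                return -1;
        for (ai = res; ai; ai = ai->ai_next) {
                fd = socket(ai->ai_family, ai->ai_socktype, ai->ai_protocol);
                if (fd < 0)
                        continue;
                if (connect(fd, ai->ai_addr, ai->ai_addrlen) == 0)
                        break;
                close(fd);
                fd = -1;
        }
        freeaddrinfo(res);
        return fd;
}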
RE: [openib-general] [RFC] IB address translation using ARP
At 09:59 AM 10/12/2005, Caitlin Bestler wrote: From: [EMAIL PROTECTED] [ mailto:[EMAIL PROTECTED]] On Behalf Of Michael Krause Sent: Wednesday, October 12, 2005 8:24 AM To: Hal Rosenstock; Sean Hefty Cc: Openib Subject: RE: [openib-general] [RFC] IB address translation using ARP At 07:45 AM 10/10/2005, Hal Rosenstock wrote: On Sun, 2005-10-09 at 10:19, Sean Hefty wrote: I think iWARP can be on top of TCP or SCTP. But why wouldn't it care ? I'm referring to the case that iWarp is running over TCP. I know that it can run over SCTP, but I'm not familiar with the details of that protocol. With TCP, this is an end-to-end connection, so layering iWarp over it, only the endpoints need to deal with it. I believe the same is true for SCTP. Yes, SCTP is similar in those regards. SCTP creates a connection and then multiplexes a set of sessions over it. You can conceptually think of it as akin to IB RD but where all QP are bound to the same EEC. SCTP preserves all QP to QP semantics, including buffers posted to specific buffers and credits. So SCTP will allows multiple in-flight messages for each RDMA stream in the association. Yep. This is where iWARP differs from IB RD in that IB restricts this to a single in-flight message per EEC at a time while iWARP allows multiple in-flight over either transport type supported. The logic behind why IB RD was constructed the way it was is somewhat complex but one of the core requirements was to enable a QP to communicate across multiple EEC while preserving an ordering domain within an EEC. Given all of this needed to be implemented in hardware, i.e. without host software intervention, for both main data path and error management, the restriction to a single message was required. I and several others had created a proprietary RDMA RC followed by a RD implementation 10+ years ago so we had a reasonable understanding of the error / complexity trade-offs. Given the distances were within a usec or each other and one could support multiple EEC per endnode pair, the performance / scaling impacts were not seen as overly restrictive and met the software application usage models quite nicely. Anyway, there are differences between iWARP / SCTP and IB RD so people cannot equate them beyond some base conceptual level aspects. Mike ___ openib-general mailing list openib-general@openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
Re: [openib-general] IRQ sharing on PCIe bus
At 02:05 PM 10/10/2005, Roland Dreier wrote: Roland BTW, for INTx emulation on PCI Express, there are no Roland physical interrupt lines -- interrupts are asserted and Roland deasserted with messages. So PCI Express interrupts are Roland unshared. Michael They are messages upstream that any device. ^ sent Sorry. Insert sent above. That doesn't parse for me. Was what I said wrong? No. Just clarifying that they are not unique per device. INTx being a message does not change the fundamental semantics of a wire being asserted. Hence, if the wire was shared before, then there is no reason why this would not be the same with PCIe sans. It really is an OS issue as to how INTx interrupts are assigned to different processors and to what extent then end up being shared. The host bridge can play some tricks as well as you noted. Again, the goal within the PCI-SIG is to move people to MSI-X and to eliminate INTx long-term. In fact, one area under development is asking the SIG's members whether INTx can be eliminated entirely which would go a long ways to simplifying designs both in hardware and software. Mike ___ openib-general mailing list openib-general@openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
Re: [openib-general] Re: [PATCH] [CMA] RDMA CM abstraction module
At 01:09 PM 10/10/2005, Christoph Hellwig wrote: On Mon, Oct 10, 2005 at 12:53:29PM -0700, Michael Krause wrote: standards. There are also the new standard Sockets extension API available today that might be extended sometime in the future to include explicit which is never going to get into linux. one more of these braindead standards people masturbating in a dark room and coming up with a frankenstein bastard cases. Everyone is free to have an opinion. Sockets extensions are not braindead nor created using whatever methods you envision. The extensions were created by Sockets engineers with 20+ years experience. But, hey, why put any faith into people who develop and implement Sockets for a living? One day perhaps you'll learn a bit of professionalism and perhaps open your mind that there are people out in the world besides yourself you don't take a NIH approach to the world and are actually qualified engineers who have a clue. All you get with these constant unprofessional diatribes is a continual loss in credibility. But, hey, that is just an opinion. BTW, do you feel the same way about the people who created IB? How about iWARP? How about PCIe? Are all of the engineers who work on trying to accelerate technology, its performance, etc. who take into account and try to find a balanced approach to problem solving simply all in dark little rooms? All of these specs are created by companies. Those same companies who fund open source efforts and many of the people working here. One last thing, I'm not the only person who feels this way about your unprofessional behavior. There are many others who have simply don't want to bother writing or have simply written you off as whatever. Sad state to be in and I suspect you don't care since you view them all as in dark little rooms anyway. Just something you might want to keep in mind. There is a much larger world out there where people value other people's professional opinions and ideas. They don't simply discount what they produce because it was not done in whatever form you prefer. It is called reality. Get used to it. Mike ___ openib-general mailing list openib-general@openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
RE: [openib-general] Re: [PATCH] [CMA] RDMA CM abstraction module
At 12:13 PM 10/10/2005, Fab Tillier wrote: From: Sean Hefty [ mailto:[EMAIL PROTECTED]] Sent: Monday, October 10, 2005 11:16 AM Michael S. Tsirkin wrote: Maybe rdma_connection (these things encapsulate connectin state)? Or, rdma_sock or rdma_socket, since people are used to the fact that connections are sockets? Any objection to rdma_socket? I don't like rdma_socket, since you can't actually perform any I/O operations on the rdma_socket, unlike normal sockets. We're dealing only with the connection part of the problem, and the name should reflect that. So rdma_connection, rdma_conn, or rdma_cid seem more appropriate. Naming should not involve sockets as that is part of existing standards. There are also the new standard Sockets extension API available today that might be extended sometime in the future to include explicit RDMA support should people decide to bypass SDP and go straight to a more robust API definition. The Sockets Extensions already comprehend explicit memory management, async comms, etc. making a significant improvement over the existing sync Sockets as well as going further in solving areas like memory management beyond what was done in Winsocks. Mike ___ openib-general mailing list openib-general@openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
Re: [openib-general] [RFC] IB address translation using ARP
At 10:40 AM 10/10/2005, Sean Hefty wrote: Hal Rosenstock wrote: What about the case of iWARP - IB ? Crossing IB shouldn't matter. iWarp should simply cross the IB subnet using IPoIB. You could build a gateway to make the transfer across IB more efficient, but it's not required. I don't understand this statement. iWARP is RDMA based, and if someone wanted to build a gateway with IB in between, it should be mapped to an IB RC connection 1:1. Going through IPoIB is a waste and would result in a very poorly performing solution (not that such a solution would deliver stellar performance to start with). Prior similar solutions used a ULP over IB, and the gateway then provided the ULP over TOE and would then be easily extended to do iWARP. In general, you would want to have defined domains for each interconnect and not try to add poor-ROI superset functionality of one over the other - a waste of time and money. Mike ___ openib-general mailing list openib-general@openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
Re: [openib-general] IRQ sharing on PCIe bus
At 09:22 AM 10/10/2005, Roland Dreier wrote: yipee Hi, My setup is a 3GHz Xeon (x86_64) with a 2.6.13.2 yipee kernel. A Mellanox memfree PCIe ddr HCA is connected. Why yipee do I see IRQ sharing although I'm using msi_x and PCIe? yipee Doesn't IRQ sharing only happen on older non PCIe busses? I think the messages you see are coming from the ACPI interrupt routing that is done when the driver calls pci_enable_device(). However, if you use MSI-X then that interrupt won't actually be used. If you check /proc/interrupts you should see ib_mthca using 3 non-shared interrupts. BTW, for INTx emulation on PCI Express, there are no physical interrupt lines -- interrupts are asserted and deasserted with messages. So PCI Express interrupts are unshared. They are messages upstream that any device. However, the PCI Express host bridge turns those interrupts into real interrupts to the system's interrupt controller, and for that part of the story, it's entirely possible for two different PCI Express devices to end up sharing the same interrupt line. Correct, the host bridge may map them to a monarch processor and thus any or all devices can share the same interrupt. This is why within the PCI-SIG we recommend using MSI-X and long-term, many of us would simply like to drop INTx and make MSI-X mandatory. Mike ___ openib-general mailing list openib-general@openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
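For completeness, this is roughly what asking for unshared MSI-X vectors looks like from a driver today; the pci_alloc_irq_vectors() interface shown postdates this thread (drivers of this era called the MSI-X enable routine directly), and the handler, name, and vector count are placeholders.

#include <linux/pci.h>
#include <linux/interrupt.h>

/* Placeholder per-vector handler. */
static irqreturn_t my_eq_handler(int irq, void *dev_id)
{
        return IRQ_HANDLED;
}

/* Request up to three MSI-X vectors (e.g. one per event queue) so the
 * device's interrupts are never shared with other functions, then attach
 * a handler to each vector. */
static int my_setup_irqs(struct pci_dev *pdev, void *priv)
{
        int nvec, i, err;

        nvec = pci_alloc_irq_vectors(pdev, 1, 3, PCI_IRQ_MSIX);
        if (nvec < 0)
                return nvec;            /* caller may fall back to INTx */

        for (i = 0; i < nvec; i++) {
                err = request_irq(pci_irq_vector(pdev, i), my_eq_handler,
                                  0, "mydev", priv);
                if (err)
                        return err;
        }
        return 0;
}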
Re: [openib-general] [RFC] IB address translation using ARP
At 01:59 PM 10/10/2005, Sean Hefty wrote: Michael Krause wrote: What about the case of iWARP - IB ? Crossing IB shouldn't matter. iWarp should simply cross the IB subnet using IPoIB. You could build a gateway to make the transfer across IB more efficient, but it's not required.I don't understand this statement. iWARP is RDMA based and if someone I was referring to the case where both endpoints are running over iWarp, with IB being one of the subnets being crossed. I believe that you're referring to one side running over iWarp, and the other running over IB, with an application level gateway in between. For the latter case, I would think that the gateway needs to establish iWarp connections for any IP addresses that reside on the IB subnet behind it, with a separate IB connection on the back-end. It seems to me that this would occur transparently to the application using iWarp. iWARP with IB in between seems like a waste of time to do (very small if any market for such a beast). IB HCA on a host with an iWARP edge device may be reasonable but again seems like a waste to construct. These types of corner usage models while of interest to comprehend to see if there is any architectural issues to insure they are not precluded really are just that, corner cases, and little time or effort should be spent on their support. Mike ___ openib-general mailing list openib-general@openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
RE: [openib-general] [RFC] IB address translation using ARP
At 06:38 AM 9/30/2005, Caitlin Bestler wrote: -Original Message- From: [EMAIL PROTECTED] [ mailto:[EMAIL PROTECTED]] On Behalf Of Roland Dreier Sent: Thursday, September 29, 2005 6:50 PM To: Sean Hefty Cc: Openib Subject: Re: [openib-general] [RFC] IB address translation using ARP Sean Can you explain how RDMA works in this case? This is simply Sean performing IP routing, and not IB routing, correct? Are you Sean referring to a protocol running on top of IP or IB directly? Sean Is the router establishing a second reliable connection on Sean the backend? Does it simply translate headers as packets Sean pass through in this case? I think the usage model is the following: you have some magic device that has an IB port on one side and something else on the other side. Think of something like a gateway that talks SDP on the IB side and TCP/IP on the other side. You configure your IPoIB routing so that this magic device is the next hop for talking to hosts on the IP network on the other side. Now someone tries to make an SDP connection to an IP address on the other side of the magic device. Routing tables + ARP give it the GID of the IB port of this magic device. It connects to the magic device and run SDP to talk to the magic device, and the magic device magically splices this into a TCP connection to the real destination. Or the same idea for an NFS/RDMA - NFS/UDP gateway, etc. Those examples are all basically application level gateways. As such they would have no transport or connection setup implications. The application level gateway simply offers a service on network X that it fulfills on network Y. But as far as network X is concerned the gateway IS the server. It must be viewed as such. The cross over point between the two domains represents independent management domains, trust domains, reliable delivery domains, etc. I do not believe it is possible to construct a transport layer gateway that bridges RDMA between IB and iWARP while appearing to be a normal RDMA endpoint on both networks. Higher level gateways will be possible for many applications, but I don't see how that relates to connection establishment. That would require having an end-to-end reliable connection, complete with flow control semantics, that bridged the two networks by some method other than encapsulation or tunneling. We took steps to insure that both IB and iWARP could transmit packets in the main data path very efficiently between the two interconnects but it was never envisioned that a connection was truly end-to-end transparent across the gateway component. I think most of the architects would not support such an effort to define such a beast. There are many issues in attempting such an offering. Just examine all of the problems with the existing iSCSI to FC solutions; they ignore a number of customer issues and hence have been relegated in many customer minds as TTM, play toys not ready for prime time. This is one of the many reasons why iSCSI has not taken off as the hype portrayed. It would be best to define a CM architecture that enabled communication between like endpoints and avoid the gateway dilemma. Let the gateway provider work out such issues as there are many requirements already on each side of these interconnects. Mike ___ openib-general mailing list openib-general@openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
RE: [openib-general] [RFC] IB address translation using ARP
At 06:24 AM 9/30/2005, Yaron Haviv wrote: -Original Message- From: Roland Dreier [ mailto:[EMAIL PROTECTED]] Sent: Thursday, September 29, 2005 9:50 PM To: Sean Hefty Cc: Yaron Haviv; Openib Subject: Re: [openib-general] [RFC] IB address translation using ARP I think the usage model is the following: you have some magic device that has an IB port on one side and something else on the other side. Think of something like a gateway that talks SDP on the IB side and TCP/IP on the other side. Also applicable to two IB ports, e.g. forwarding SDP traffic from one IB partition to SDP on another partition (may even be the same port with two P_Keys), and doing some load-balancing or traffic management in between, overall there are many use cases for that. While I can envision how an endpoint could communicate with another in separate partitions, doing so really violates the spirit of the partitioning where endpoints must be in the same partition in order to see one another and communicate. Attempting to create an intermediary who has insights into both and then somehow is able to communicate how to find one another using some proprietary (can't be through standards that I can think of) method, seems like way too much complexity to be worth it. Mike ___ openib-general mailing list openib-general@openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
RE: [openib-general] [RFC] libibverbs completion event handling
At 03:33 PM 9/21/2005, Caitlin Bestler wrote: I'm not sure I follow what a completion channel is. My understanding is that work completions are stored in user-accessible memory (typically a ring buffer). This enables fast-path reaping of work completions. The OS has no involvement unless notifications are enabled. The completion vector is used to report completion notifications. So is the completion vector a *single* resource used by the driver/verbs to report completions, where said notifications are then split into user context dependent completion channels? The RDMAC verbs did not define callbacks to userspace at all. Instead it is assumed that the proxy for user mode services will receive the callbacks, and how it relays those notifications to userspace is outside the scope of the verbs. Correct. Both uDAPL and ITAPI define relays of notifications to AEVDS/CNOs and/or file descriptors. Forwarding a completion notification to userspace in order to make a callback in userspace so that it can kick an fd to wake up another thread doesn't make much sense. The uDAPL/ITAPI/whatever proxy can perform all of these functions without any device dependencies and in a way that is fully optimal for the usermode API that is being used. Exactly. This was the intention. Does not really matter what the API is but that there by an API that does this work on behalf of the consumer. For kernel clients, I don't see any need for anything beyond the already defined callbacks direct from the device-dependent code. This was the intention when we designed the verbs. Even in the typical case where the usermode application does an evd_wait() on the DAT or ITAPI endpoint, the DAT/ITAPI proxy will be able to determine which thread should be woken and could even do so optimally. It also allows the proxy to implemenet Access Layer features such as EVD thresholding without device-specific support. Correct. -Original Message- From: [EMAIL PROTECTED] [ mailto:[EMAIL PROTECTED]] On Behalf Of Roland Dreier Sent: Wednesday, September 21, 2005 12:22 PM To: openib-general@openib.org Subject: [openib-general] [RFC] libibverbs completion event handling While thinking about how to handle some of the issues raised by Al Viro in http://lkml.org/lkml/2005/9/16/146, I realized that our verbs interface could be improved to make delivery of completion events more flexible. For example, Arlin's request for using one FD for each CQ can be accomodated quite nicely. The basic idea is to create new objects that I call completion vectors and completion channels. Completion vectors refer to the interrupt generated when a completion event occurs. With the current drivers, there will always be a single completion vector, but once we have full MSI-X support, multiple completion vectors will be possible. When I proposed the use of multiple completion handlers, it was based on the operating assumption that either MSI or MSI-X be used by the underlying hardware. Either is possible - MSI limits it to a single address with 32 data values which allows different handlers to be bound to each value though targeting a single processor. MSI-X builds upon technology we've been shipping for nearly 20 years now and allows up to 2048 different addresses which may target or multiple processors. Any API should be able to deal with both approaches thus should not assume anything about whether one or more handlers are bound to a given processor. Orthogonal to this is the notion of a completion channel. 
This is a FD used for delivering completion events to userspace. Completion vectors are handled by the kernel, and userspace cannot change the number of vectors that are available. On the other hand, completion channels are created at the request of a userspace process, and userspace can create as many channels as it wants. Every userspace CQ has a completion vector and a completion channel. Multiple CQs can share the same completion vector and/or the same completion channel. CQs with different completion vectors can still share a completion channel, and vice versa. The exact API would be something like the below. Thoughts? Why wouldn't it just be akin to the verbs interface - here are the event handler and callback routines to associate with a given CQ. The handler might be nothing more than an index into a set of functions that are stored within the kernel - these functions are either device-specific (i.e. supplied by the IHV) or OS-specific, such as dealing with error events (might also have a device-specific component as well). When the routine is invoked, it basically has three parameters: the CQ to target, the number of CQEs to reap, and the address at which to store the CQEs. I do not see what more is required. Mike Thanks, Roland struct ibv_comp_channel { int fd; }; /** * ibv_create_comp_channel - Create a completion event channel */ extern struct ibv_comp_channel *ibv_create_comp_channel(struct ibv_context *context);
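To show how the proposed objects fit together, here is a sketch of the pattern as it ultimately shipped in libibverbs: the channel comes from ibv_create_comp_channel(), is passed to ibv_create_cq() together with a completion vector index, and a consumer then arms, blocks, re-arms, and drains. Error handling and the surrounding CQ/QP setup are omitted.

#include <infiniband/verbs.h>

/* Blocks for one completion event on the channel, then drains the CQ.
 * Returns the number of completions reaped, or -1 on error. */
int wait_and_drain(struct ibv_cq *cq, struct ibv_comp_channel *chan)
{
        struct ibv_cq *ev_cq;
        void *ev_ctx;
        struct ibv_wc wc[16];
        int n, total = 0;

        /* Arm the CQ so the next completion raises an event on chan->fd
         * (the fd can equally be handed to poll()/epoll() in a server). */
        if (ibv_req_notify_cq(cq, 0))
                return -1;

        if (ibv_get_cq_event(chan, &ev_cq, &ev_ctx))
                return -1;
        ibv_ack_cq_events(ev_cq, 1);

        /* Re-arm before draining to avoid a missed-event race, then poll dry. */
        if (ibv_req_notify_cq(ev_cq, 0))
                return -1;
        while ((n = ibv_poll_cq(ev_cq, 16, wc)) > 0)
                total += n;

        return n < 0 ? -1 : total;
}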
Re: [openib-general][PATCH][RFC]: CMA IB implementation
At 05:30 PM 9/21/2005, Caitlin Bestler wrote: On 9/21/05, Sean Hefty [EMAIL PROTECTED] wrote: Caitlin Bestler wrote: That's certainly an acceptably low overhead for iWARP IHVs, provided there are applications that want this control and *not* also need even more IB-specific CM control. I still have the same skepticism I had for the IT-API's exposing of paths via a transport neutral API. Namely, is there really any basis to select amongst multiple paths from transport neutral code? The same applies to caching of address translations on a transport neutral basis. Is it really possible to do in any way that makes sense? Wouldn't caching at a lower layer, with transport/device specific knowledge, make more sense? I guess I view this API slightly differently than being just a transport neutral connection interface. I also see it as a way to connect over IB using IP addresses, which today is only possible if using ib_at. That is, the API could do both. Given that purpose I can envision an IB-aware application that needed to use IP addresses and wanted to take charge of caching the translation. But viewing this in a wider scope raises a second question. Shouldn't iSER be using the same routines to establish connections? While many applications do use IP addresses, unless one goes the route of defining an IP address per path (something that iSCSI does comprehend today), IB multi-path (and I suspect eventually Ethernet's multi-path support) will require interconnect specific interfaces. Ideally, applications / ULP define the destination and QoS requirements - what we used to call an address vector. Middleware maps those to a interconnect-specific path on behalf of the application / ULP. This is done underneath the API as part of the OS / RDMA infrastructure. Such an approach works quite well for many applications / ULP however it should not be the only one supported as it assumes that the OS / RDMA infrastructure is sufficiently robust to apply policy management decisions in conjunction with the fabric management being deployed. Given IB SM will vary in robustness, there must also exist API that allow applications / ULP to comprehend the set of paths and select accordingly. I can envision how to construct such a knowledge that is interconnect independent but it requires more standardization about what defines the QoS requirements - latency, bandwidth, service rate, no single point of failure, etc. What I see so far does not address these issues. Mike ___ openib-general mailing list openib-general@openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
Re: [openib-general] Re: RDMA Generic Connection Management
At 07:46 AM 8/31/2005, James Lentini wrote: On Tue, 30 Aug 2005, Roland Dreier wrote: I just committed this SRP fix, which should make sure we don't use a device after it's gone. And it actually simplifies the code a teeny bit... The device could still be used after it's gone. For example: - the user is configuring SRP via sysfs. The thread in srp_create_target() has just called ib_sa_path_rec_get() [srp.c line 1209] and is waiting for the path record query to complete in wait_for_completion() - the SA callback, srp_path_rec_completion(), is called. This callback thread will make several verb calls (ib_create_cq, ib_req_notify_cq, ib_create_qp, ...) without any coordination with the hotplug device removal callback, srp_remove_one Notice that if the SA client's hotplug removal function, ib_sa_remove_one(), ensured that all callbacks had completed before returning the problem would be fixed. This would protect all ULPs from having to deal with hotplug races in their SA callback function. The fix belongs in the SA client (the core stack), not in SRP. All the ULPs are deficient with respect to their hotplug synchronization. Given that there is a common problem, doesn't it make sense to try and solve it in a generic way instead of in each ULP? There are two approaches to device removal to consider - both are required to have a credible solution: (1) Inform all entities that a planned device removal is to occur and allow them to close gracefully or migrate to alternatives. Ideally, the OS comprehends whether the removal will result in the loss of any critical resources and not inform or take action unless it knows the removal is something that the system can survive. Doing this requires the ULP to register interest with the OS in a particular hardware resource. This also allows the OS to construct a resource analysis tool to determine whether the removal of a device will be a good idea or not. This is really outside the scope of an RDMA infrastructure and should be done by the OS through an OS defined API which is applicable to all types of hardware resources and sub-systems. (2) Design all ULP to handle surprise removal, e.g. device failure, from the start and allow them to close gracefully or migrate to alternatives. The OS would inform the device driver of the failure if the device driver has not already discovered the problem. The OS would also inform interested parties of the device failure. The device driver would simply error out all users of the device instance - there are already error codes defined for IB and iWARP for this purpose. The associated verbs resources should be released as the ULP closes out its resources through the verbs API (we did define the verbs to clean up resources that the infrastructure may allocate on behalf of the ULP). Activities such as listen entries would be released just like what is done for Sockets, etc. today. Device addition is simply a matter of informing policy or whatever service management within the OS that determines what services should be available on a given device. The device driver really does not need to do anything special. One area to consider is whether a planned migration of a service needs to be supported. This is generally best handled by the ULP with only a small set of services required of the infrastructure, e.g. get / set of QP / LLP context and then coordinating any other aspects with the appropriate SM or network services such updating address vectors or fabric management / configuration. 
In general, the ULP should already be designed to handle the error condition and whether they support a managed / planned removal or migration is perhaps the only potential area of deficiency. Mike ___ openib-general mailing list openib-general@openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
Re: [openib-general] Re: uverbs comp events
At 11:11 AM 8/19/2005, Roland Dreier wrote: Arlin Yes, this is certainly another option; albeit one that Arlin requires more system resources. Why not take full advantage Arlin of the FD resource we already have? It's your call, but Arlin uDAPL and other multi-thread applications could make good Arlin use of a wakeup feature with these event interfaces. An Arlin event model that allows users to create events and get Arlin events but requires them to use side band mechanisms to Arlin trigger the event seems incomplete to me. I disagree. Right now the CQ FD is a pretty clean concept: you read CQ events out of it. If you want to trigger a CQ event, then you could post a work request to a QP that generates a completion event. Adding a new system call for queuing synthetic events seems like growing an ugly wart to me. If we look at the analogous design of a multi-threaded network server, where a thread might block waiting for input on a socket, we see that there's no system call to inject synthetic data into a network socket. I'd rather fix the uDAPL design instead of adding ugliness to the kernel to work around it. Please take a look at the Sockets API Extensions standard that was published quite awhile back to insure that the infrastructure can support this API as well. The API was developed by a set of Sockets developers and addresses a number of concerns for async communications, event management, explicit memory management, etc. It is also well suited to have SDP transparently implemented underneath it. Mike ___ openib-general mailing list openib-general@openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
RE: [openib-general] [ANNOUNCE] Initial trunk checkin of ISERinitiator
At 08:04 AM 8/19/2005, Yaron Haviv wrote: -Original Message- From: Christoph Hellwig [mailto:[EMAIL PROTECTED]] Sent: Friday, August 19, 2005 10:22 AM To: Roland Dreier Cc: Yaron Haviv; Christoph Hellwig; Grant Grundler; open- [EMAIL PROTECTED]; openib-general@openib.org Subject: Re: [openib-general] [ANNOUNCE] Initial trunk checkin of ISERinitiator On Thu, Aug 18, 2005 at 09:24:24PM -0700, Roland Dreier wrote: Yaron Not every one wants to keep on doing target discovery with Yaron Python scripts, Come on, this is just a stupid statement. The whole point of putting device management in userspace is so that everybody has the flexibility to use whatever discovery mechanism they want. And just FYI. If you ever want an iSER implementation merged it will have to work the same way. Look at how the open-iscsi TCP initator does it. Good point, the high-level functionality in iSER is all done in Open-iSCSI and its userspace extensions iSER just deals with the data transfer and is layered under Open-iSCSI by the way can you point me to the iSCSI HBA that delivers better performance, latency, and memory consumption and what about the price of that HBA and the attached 10GbE switch Is any of this really relevant? The focus here is open source and creating a RDMA infrastructure for ULP to use. The market will decide whether a given technology survives or not. It isn't up to the open source community. Please take personal opinions on whether a technology will succeed elsewhere. Mike ___ openib-general mailing list openib-general@openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
Re: [openib-general] Re: [ULP] how to choose appropriate ULPs for application
At 06:39 AM 7/13/2005, James Lentini wrote: On Tue, 12 Jul 2005, xg wang wrote: Frankly speaking, I cannot distinguish the function of SDP and DAPL. Since Lustre is a file system, it runs in the kernel. So I think maybe kDAPL is better. SDP stands for the Sockets Direct Protocol. The protocol is designed to support the Berkeley Sockets API. This allows code already using the Sockets API to easily use InfiniBand by simply changing the socket type. One clarification: SDP supports both synchronous and asynchronous sockets. The OpenGroup ICSC released the Sockets extensions API a number of months back that enables the full performance provided by SDP to be tapped. The SDP specifications can be found at the IBTA (for IB) and at the RDMAC (for RNIC) web sites. kDAPL is the kernel Direct Access Provider Library. It is an API that supports RDMA networks (InfiniBand, iWARP, etc.). But for a ULP application, what is the advantage and disadvantage of SDP and DAPL? When you implement an application, will you use SDP or DAPL, and why? I just wonder about the difference between them from the application view. First off, SDP is a protocol and kDAPL is an API. Since SDP is a protocol, you will only be able to communicate with other nodes that implement SDP. Another thing to consider is the differences in the APIs. SDP is accessed with the traditional Sockets API. This makes porting applications to it easy, but doesn't give you much fine-grained control over how the RDMA network is used. Sockets by definition is interconnect and topology independent. Network controls are managed separately. The best that an application should do is signal its requirements, e.g. using diffserv or a similar standard. kDAPL was designed specifically for RDMA networks with lots of features that allow you to control how the network is used. This is good if you are writing new code, but means that old code needs substantial porting. Ideally, applications stay out of such decisions. Middleware's job is to handle application optimization, etc. so that the end consumer stays as ignorant as possible and thus focused on their application's needs, not the network's. The middleware API - whether DAPL, IT API, RNIC PI, whatever - can provide the hooks needed to manage the usage from a given endnode's perspective. But even here, the real network management - what routes are actually used, the arbitration for QoS, etc. - should also be outside of the middleware's control. It simply manages a set of local resources and allows the fabric management to do the rest. There is more to this than that, but that is how IB was constructed, which is no different in many respects from how IP works as well. Mike ___ openib-general mailing list openib-general@openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
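[Editor's note] As an illustration of the "just change the socket type" claim above, here is a sketch of a TCP client converted to SDP. AF_INET_SDP is not a standard kernel constant; OFED's SDP module historically registered a private address family (the value 27 below is an assumption, check the installed sdp headers), and libsdp could alternatively be preloaded so that unmodified AF_INET sockets are redirected.

#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>

#ifndef AF_INET_SDP
#define AF_INET_SDP 27          /* assumed value; normally provided by the SDP package's headers */
#endif

int sdp_client_connect(const char *ip, unsigned short port)
{
    struct sockaddr_in sa = { .sin_family = AF_INET, .sin_port = htons(port) };
    if (inet_pton(AF_INET, ip, &sa.sin_addr) != 1)
        return -1;

    /* The only source change from a plain TCP client is the address family. */
    int fd = socket(AF_INET_SDP, SOCK_STREAM, 0);
    if (fd < 0)
        return -1;
    return connect(fd, (struct sockaddr *)&sa, sizeof sa) ? -1 : fd;
}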
Re: [openib-general] Re: [ULP] how to choose appropriate ULPs for application
At 11:18 AM 7/13/2005, James Lentini wrote: On Wed, 13 Jul 2005, Michael Krause wrote: At 06:39 AM 7/13/2005, James Lentini wrote: kDAPL was designed specifically for RDMA networks with lots of features that allow you to control how the network is used. This is good if you are writing new code, but means that old code needs substantial porting. Ideally, applications stay out of such decisions. Middleware's job is to handle application optimization, etc. so that the end consumer stays as ignorant as possible and thus focused on their application's needs, not the network's. The middleware API - whether DAPL, IT API, RNIC PI, whatever - can provide the hooks needed to manage the usage from a given endnode's perspective. But even here, the real network management - what routes are actually used, the arbitration for QoS, etc. - should also be outside of the middleware's control. It simply manages a set of local resources and allows the fabric management to do the rest. There is more to this than that, but that is how IB was constructed, which is no different in many respects from how IP works as well. Let me clarify: kDAPL users can specify exactly how data is transferred (SEND, RDMA Write, RDMA Read), how completion events are processed, how memory is registered, etc. This is the network control I was referring to. In retrospect, it would be more correct to refer to this as adapter control. Just to nit-pick this a bit: the ULP determines what type of operation to use - SEND, RDMA Write, RDMA Read, or Atomic (where supported by the interconnect). The middleware API or the verbs API provides an interface to abstract the IHV hardware-specifics from the ULP, allowing the ULP to be implemented across a variety of technologies. Thus, the API is just an abstraction of the underlying semantics and does not make decisions or provide much in the way of controls in any regard unless additional value-add beyond the underlying hardware semantics is transparently implemented within the API implementation itself. For example, SDP defines specifically when to use a SEND for control operations and when to use an RDMA operation for a zcopy transfer. SDP itself does not care what underlying API is used to access the associated hardware resources, but it does define what resources and associated services are used. SDP can be implemented directly on the verbs API (just like MPI) and operate quite nicely without the additional middleware API in the execution path. Apologies for the nit-pick, but the API does not provide any type of control other than to act as an abstraction funnel between the ULP / application and the underlying hardware. There are opportunities to provide transparent value-add controls, but I don't believe this open source effort is focused on these at this time. Mike ___ openib-general mailing list openib-general@openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
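[Editor's note] A minimal verbs-level sketch of that division of labor, using libibverbs-style structures: the ULP picks the opcode (SEND for a control message, RDMA Write for a zcopy transfer) and the API merely passes the work request through. The queue pair, memory keys, and addresses are assumed to have been set up elsewhere.

#include <string.h>
#include <infiniband/verbs.h>

int post_transfer(struct ibv_qp *qp, void *laddr, uint32_t lkey, uint32_t len,
                  uint64_t raddr, uint32_t rkey, int zcopy)
{
    struct ibv_sge sge = { .addr = (uintptr_t)laddr, .length = len, .lkey = lkey };
    struct ibv_send_wr wr, *bad;

    memset(&wr, 0, sizeof wr);
    wr.sg_list    = &sge;
    wr.num_sge    = 1;
    wr.send_flags = IBV_SEND_SIGNALED;

    if (zcopy) {                              /* ULP chose the zcopy path          */
        wr.opcode              = IBV_WR_RDMA_WRITE;
        wr.wr.rdma.remote_addr = raddr;
        wr.wr.rdma.rkey        = rkey;
    } else {                                  /* ULP chose a control-message SEND  */
        wr.opcode = IBV_WR_SEND;
    }
    return ibv_post_send(qp, &wr, &bad);      /* the verbs layer just forwards it  */
}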
Re: [openib-general] Re: IP addressing on InfiniBand networks (Caitlin Bestler)
At 06:37 AM 7/11/2005, James Lentini wrote: On Tue, 5 Jul 2005, Michael Krause wrote: The intention was to allow one to manage the fabric by having mapping functions from traditional IP management applications to the IB GID to minimize the amount of work to enable IB within a solution. I was unaware of this. What happened to the mapping functions? What did the API look like and how was it going to be implemented? The IBTA specs are not API specifications. They define semantics and wire protocols. As for what was envisioned which guided the spec, most data center management applications understand a variety of management objects that represent various IP-based attributes. A GID is close enough to an IPv6 address that many of these objects could be easily modified via a plug-in such that IB could have been easily slid into the associated management applications. Investigations occurred, e.g. into providing the necessary plug-ins for OpenView, Tivoli, etc. This is one area where IB is insufficient in terms of a viable ecosystem since there really isn't any way to manage IB in the enterprise without requiring significant amounts of training and a new tool chain to be deployed. Mike ___ openib-general mailing list openib-general@openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
Re: [openib-general] vendor_id, vendor_part_id, hw_ver
At 04:20 PM 7/8/2005, Kevin Reilly wrote: Mike, Ideally you're right: the ULP shouldn't care what HCA it's running on. There are some practical reasons why a ULP might want to know the vendor and part number it was using, such as for debug or to take advantage of a performance nuance of a particular HCA. Create a private interface since it is unique per HCA. This type of information was rejected during the IB and iWARP verbs creation as something best handled out of band / out of spec. Please do not push for this to be in a ULP itself or any standard RDMA API since it isn't required for all and most likely is very rare, given it was rejected by all companies involved in creating these technologies to date. Mike ___ openib-general mailing list openib-general@openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
Re: [openib-general] vendor_id, vendor_part_id, hw_ver
At 02:00 PM 7/8/2005, Roland Dreier wrote: Kevin Is openIB going to do anything to enumerate the Kevin vendor_id, vendor_part_id and hw_ver in a common header Kevin file or is it the responsibility of a ULP running on top of Kevin the lib to understand these values? I don't have anything planned to enumerate those values. I would guess that only a very few ULPs should even look at them. Why would this ever have to be examined by a ULP? This seems like a low-level driver issue. Given that the verbs semantics abstract the hardware, aside from the HCA-specific driver, the rest of the stack should be unaware of anything below. Mike ___ openib-general mailing list openib-general@openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
RE: [openib-general] [iser] about the target
At 06:14 AM 7/6/2005, Rimmer, Todd wrote: I would like to clarify the comment on SRP. There are companies presently shipping and demonstrating SRP native IB storage, for example: Engenio (formerly LSI), Raytheon, Data Direct, and Mellanox. SRP was designed for highly optimized storage access across an RDMA capable transport, and hence is capable of very high performance. Longer term the storage vendors anticipate that iSCSI will be the focus for long haul (remote backup, etc.) type solutions, while IB Native Storage and FC will be the focus for data center high performance storage solutions. iSCSI within the data center is quite real for many applications, especially blades where there is already an Ethernet interface. The reason for iSER was to take advantage of the iSCSI ecosystem while providing an RDMA-focused data mover. SRP is primarily a data mover and does not define the rest of the management, etc. interfaces. When used to move data to FC, the FC infrastructure is leveraged, but SRP by itself does nothing really in this regard. No one stated that SRP could not deliver performance, only that the rest of the infrastructure is not defined and must rely upon other standards / plug-ins / etc. I do not want to get into a vision / marketing debate - I was just explaining why we created iSER instead of just enhancing SRP. Mike Todd R. -Original Message- From: Michael Krause [ mailto:[EMAIL PROTECTED]] Sent: Tuesday, July 05, 2005 12:28 PM To: Ian Jiang; openib-general@openib.org Subject: Re: [openib-general] [iser] about the target At 06:07 PM 7/4/2005, Ian Jiang wrote: Hi! I am new to the iSER. On https://openib.org/tiki/tiki-index.php?page=iSER, it is said that iSER currently contains initiator only (no target). Will the target come out later? How did they test the iSER initiator without an iSER target? Could you give some explanation? From a practical perspective, there are very few iSCSI targets shipping today. Most people had envisioned iSER over IB to a gateway Ethernet device since native IB storage is also quite rare in terms of real product. For many of us, our push for iSER over IB was to replace SRP, which has a deficient ecosystem and thus is not really used beyond some basic Fibre Channel gateway cards. Mike Thanks! Ian Jiang [EMAIL PROTECTED] Computer Architecture Laboratory Institute of Computing Technology Chinese Academy of Sciences Beijing, P.R. China Zip code: 100080 Tel: +86-10-62564394 (office) ___ openib-general mailing list openib-general@openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
Re: [openib-general] Re: IP addressing on InfiniBand networks (Caitlin Bestler)
At 10:49 AM 6/30/2005, Roland Dreier wrote: Michael Being the person who led the addressing definition for Michael IB, I can state quite clearly that GIDs are NOT IPv6 Michael addresses. They were intentionally defined to have a Michael similar look-n-feel since they were derived in large part Michael from Future I/O which had them as real IPv6 addresses. Michael But again, they are NOT IPv6 addresses. The IBA spec seems to have a different idea. In fact chapter 4 says: A GID is a valid 128-bit IPv6 address (per RFC 2373) I wrote the original spec here. The text was supposed to be updated to clarify the rest of the sentence, i.e. the additional rules, etc. make it not a real IPv6 address from the IETF's perspective but something quite close. The intention was to allow one to manage the fabric by having mapping functions from traditional IP management applications to the IB GID to minimize the amount of work to enable IB within a solution. Mike ___ openib-general mailing list openib-general@openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
Re: [openib-general] [iser] about the target
At 06:07 PM 7/4/2005, Ian Jiang wrote: Hi! I am new to the iSER. On https://openib.org/tiki/tiki-index.php?page=iSER, it is said that iSER currently contains initiator only (no target). Will the target come out later? How did they test the iSER initiator without an iSER target? Could you give some explanation? From a practical perspective, there are very few iSCSI targets shipping today. Most people had envisioned iSER over IB to a gateway Ethernet device since native IB storage is also quite rare in terms of real product. For many of us, our push for iSER over IB was to replace SRP, which has a deficient ecosystem and thus is not really used beyond some basic Fibre Channel gateway cards. Mike Thanks! Ian Jiang [EMAIL PROTECTED] Computer Architecture Laboratory Institute of Computing Technology Chinese Academy of Sciences Beijing, P.R. China Zip code: 100080 Tel: +86-10-62564394 (office) ___ openib-general mailing list openib-general@openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
Re: [openib-general] Re: IP addressing on InfiniBand networks (Caitlin Bestler)
At 10:39 PM 6/29/2005, Bill Strahm wrote: -- Message: 2 Date: Wed, 29 Jun 2005 09:00:37 -0700 From: Roland Dreier [EMAIL PROTECTED] Subject: Re: [openib-general] IP addressing on InfiniBand networks To: Caitlin Bestler [EMAIL PROTECTED] Cc: Lentini, James [EMAIL PROTECTED], Christoph Hellwig [EMAIL PROTECTED], openib-general openib-general@openib.org Message-ID: [EMAIL PROTECTED] Content-Type: text/plain; charset=us-ascii Caitlin An assigned GID meets all of the requirements for an IA Caitlin Address. I think taking advantage of that existing Caitlin capability is just one of many options that can be done Caitlin by the IB CM rather than forcing IB specific changes up Caitlin to the application layer. Just to be clear, the IBA spec is very clear that a GID _is_ an IPv6 address. - R. Just to be REALLY clear - IANA has not allocated IPv6 address space to any InfiniBand entities - so they are not Internet IPv6 addresses. GIDs are formatted like IPv6 addresses but in no sense should EVER be used as an IP layer 3 address. From an IPoIB stance - IB is just a very over-engineered Layer 2 that has a singularly large MAC address. From an IB ULP point of view - how you get to the layer 2 address needed to perform communications that do not include IP isn't a problem - but let me tell you, the leadership of the IETF is scared of IB because of statements that IB GIDs _ARE_ IPv6 addresses. Being the person who led the addressing definition for IB, I can state quite clearly that GIDs are NOT IPv6 addresses. They were intentionally defined to have a similar look-n-feel since they were derived in large part from Future I/O which had them as real IPv6 addresses. But again, they are NOT IPv6 addresses. For IP over IB, it is unfortunate that we could not have simply used a raw datagram service, as that would have made life very simple, but we are in the state we are, so that means there is a UD transport providing a layer 2 Ethernet equivalent of functionality. Mike ___ openib-general mailing list openib-general@openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
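[Editor's note] The format-only relationship can be shown in code: a GID can be copied into an in6_addr and rendered with ordinary IPv6 text tooling for management or mapping purposes, even though it is not an IETF-assigned address and must never be used as a layer 3 address. A small sketch follows; the raw GID bytes are assumed to come from a port GID query.

#include <stdio.h>
#include <string.h>
#include <arpa/inet.h>

void print_gid(const unsigned char raw_gid[16])
{
    struct in6_addr as_v6;
    char buf[INET6_ADDRSTRLEN];

    memcpy(&as_v6, raw_gid, 16);                   /* same 128-bit layout            */
    inet_ntop(AF_INET6, &as_v6, buf, sizeof buf);  /* reuse the IPv6 text formatting */
    printf("GID (IPv6-style text form): %s\n", buf);
}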
Re: [openib-general] mapping between IP address and device name
At 10:30 AM 6/24/2005, Roland Dreier wrote: Thomas As I said - I am not attached to ATS. I would welcome an Thomas alternative. Sure, understood. I'm suggesting a slight tweak to the IB wire protocol. I don't think there's a difference in the security provided, and carrying the peer address in the CM private data avoids a lot of the conceptual and implementation difficulties of ATS. Thomas But in the absence of one, I like what we have. Also, I do Thomas not want to saddle the NFS/RDMA transport with carrying an Thomas IP address purely for the benefit of a missing transport Thomas facility. After all NFS/RDMA works on iWARP too. I'm not sure I understand this objection. We wouldn't be saddling the transport with anything -- simply specifying in the binding of NFS/RDMA to IB that certain information is carried in the private data fields of the CM messages used to establish a connection. Clearly iWARP would use its own mechanism for providing the peer address. This would be exactly analogous to the situation for SDP -- obviously SDP running on iWARP does not use the IB CM to exchange IP address information in the same way that SDP over IB does. Actually, SDP on iWARP uses the SDP port mapper protocol to comprehend the IP address / port tuples used on both sides of the communication before the connection is established (this protocol could be used by any mapping service since it is implemented on top of UDP and so could be re-used by other subsystems like NFS). The TCP transport then connects normally and one can ask it for the IP address / port tuple that is really being used. Port mapper may be viewed as akin to the SID protocol defined for IB. The SDP hello is then exchanged in the byte stream as opposed to the IB CM. The port mapper supports both centrally managed and distributed usage models, supports the ability to return a different IP address than requested, supports multiple IP addresses per port, etc. One can construct a very flexible infrastructure that supports nearly any type of mapping one desires to the same or different hardware or endnodes. It is fairly lightweight and can support caching of data for a period of time or even a one-shot connection attempt. Mike ___ openib-general mailing list openib-general@openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
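[Editor's note] For illustration only, here is one way the active side might lay out the peer address in the REQ private data, along the lines suggested above for an NFS/RDMA-to-IB binding. This is a hypothetical layout, not a defined wire format; SDP's hello header and the port mapper protocol define their own formats, and the IB CM REQ allows 92 bytes of private data.

#include <stdint.h>
#include <string.h>

/* Hypothetical layout: source IP address and port carried in the CM REQ
 * private data so the passive side learns the peer address without ATS. */
struct example_req_priv_data {
    uint8_t  ip_version;    /* 4 or 6 */
    uint8_t  reserved;
    uint16_t src_port;      /* network byte order */
    uint8_t  src_addr[16];  /* IPv4 in the last 4 bytes, or a full IPv6 address */
};

static inline int pack_priv_data(uint8_t *priv, size_t priv_len,
                                 const struct example_req_priv_data *pd)
{
    if (priv_len < sizeof *pd)      /* must fit in the 92-byte REQ private data */
        return -1;
    memcpy(priv, pd, sizeof *pd);
    return 0;
}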
Re: [openib-general] How about ib_send_page() ?
At 12:13 PM 6/3/2005, Sean Hefty wrote: Fab Tillier wrote: Ok, so this question is from a noob, but here goes anyway. Why can't IPoIB advertise a larger MTU than the UD MTU, and then just fragment large IP packets up if they need to go over the IB UD transport? Is there any reason this couldn't work? If it does, it allows IPoIB to expose a single MTU to the OS, and take care of the rest under the covers. Just a thought. I don't remember seeing a response to this. Something like this could work. I guess one disadvantage is that it can be less efficient if you lose a lot of packets. TCP would resend an entire MTU (as seen by TCP) of data if only a single IB packet were lost. Why not just use the IETF draft for RC / UC based IP over IB and not worry about creating something new? Mike ___ openib-general mailing list openib-general@openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
RE: [openib-general] How about ib_send_page() ?
At 09:28 AM 6/7/2005, Fab Tillier wrote: From: Roland Dreier [ mailto:[EMAIL PROTECTED]] Sent: Tuesday, June 07, 2005 8:38 AM Michael Why not just use the IETF draft for RC / UC based IP over Michael IB and not worry about creating something new? I think we've come full circle. The original post was a suggestion on how to handle the fact that the connected-mode IPoIB draft requires a network stack to deal with different MTUs for different destinations on the same logical link. That's right - by implementing IP segmentation in the IPoIB driver when going over UD, the driver could expose a single MTU to the network stack, thereby removing all the issues related to having per-endpoint MTUs. Keeping a 2K MTU for RC mode doesn't really take advantage of IB's RC capabilities. I'd probably target 64K as the MTU. The draft should state a minimum for all RC / UC, which should be the TCP MSS. SAR over a UD endpoint independent of the underlying physical MTU can be done, but it should not require end-to-end understanding of the operation, i.e. the send side tells its local stack that the TCP MSS is X while the receive side only posts 2-4 KB buffers. This has been done over Ethernet for years. Mike ___ openib-general mailing list openib-general@openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
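[Editor's note] A sketch of the segmentation idea only, not of the actual IPoIB driver: the stack sees one large MTU and the driver splits each outbound datagram into link-MTU-sized UD payloads. Both send_ud_fragment and the 2 KB link MTU are assumptions for illustration; a real implementation would also need a small fragment header so the receiver can reassemble, independent of what buffer sizes the receive side posts.

#include <stdint.h>
#include <stddef.h>

#define LINK_MTU 2048   /* physical UD payload limit; example value */

/* Hypothetical transmit hook supplied by the driver. */
extern int send_ud_fragment(const uint8_t *frag, size_t len,
                            uint16_t datagram_id, uint16_t frag_index, int last);

int send_large_datagram(const uint8_t *buf, size_t len, uint16_t id)
{
    size_t off = 0;
    uint16_t idx = 0;

    while (off < len) {
        size_t chunk = (len - off < LINK_MTU) ? len - off : LINK_MTU;
        int last = (off + chunk == len);
        if (send_ud_fragment(buf + off, chunk, id, idx++, last))
            return -1;              /* a real driver would requeue, not drop */
        off += chunk;
    }
    return 0;
}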
Re: [Rdma-developers] Re: [openib-general] OpenIB and OpenRDMA: Convergence on common RDMA APIs and ULPs for Linux
At 06:47 AM 5/28/2005, Christoph Hellwig wrote: On Sat, May 28, 2005 at 05:17:54AM -0700, Sukanta ganguly wrote: That's a pretty bold statement. Linux grew up to be popular via mass acceptance. Seems like that charter has changed and a few have control over Linux and its future. The My way or the highway philosophy has gotten embedded in the Linux way of life. Life is getting tough. You're totally missing the point. Linux is successful exactly because it's looking for the right solution, not something the business people need short-term. Hence why some of us contend that the end-game, i.e. the right solution, is not necessarily the short-term implementation that is present today and just evolves, creating the legacy inertia that I wrote about earlier. I think there is validity to having an implementation to critique - accept, reject, modify. I think there is validity to examining industry standards as the basis for new work / implementation. If people are unwilling to discuss these standards and only stay focused on their business people's short-term needs, then some might contend, as above, that Linux is evolving to be much like the dreaded Pacific NW company in the end. Not intending to offend anyone, but if there can be no debate without implementation on what is the right solution, then people might as well just go off and implement and propose their solution for incorporation into the Linux kernel. It may be that OpenIB wins in the end or it may be that it does not. Just having OpenIB subsume control of anything iWARP or impose only DAPL for all RDMA infrastructure because it just happens to be there today seems rather stifling. Just stating that some OpenIB steering group is somehow empowered to decide this for Linux is also rather strange. Open source is about being open and not under the control of any one entity in the end. Perhaps that is no longer the case. Mike ___ openib-general mailing list openib-general@openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
Re: [Rdma-developers] Re: [openib-general] OpenIB and OpenRDMA: Convergence on common RDMA APIs and ULPs for Linux
At 06:40 AM 5/27/2005, Sukanta ganguly wrote: Venkata, How will that work? If the RNIC offloads RDMA and TCP completely from the Operating System and does not share any state information, then the application running on the host will never be in a position to utilize the socket interface to use the communication logic to send and receive data between the remote node and itself. Some information needs to be shared. How much of it and what exactly needs to be shared is the question. Ok. It all depends upon what level of integration / interaction a TOE, and thus an RNIC, will have with the host network stack. For example, if a customer wants to have TCP and IP stats kept for the off-loaded stack even if it is just being used for RDMA, then there needs to be a method defined to consolidate these stats back into the host network stack tool chain. Similarly, if one wants to maintain a single routing table to manage, etc. on the host, then the RNIC needs to access / update that information accordingly. One can progress through other aspects of integration, e.g. connection management, security interactions (e.g. DOS protection), and so forth. What is exposed again depends upon the level of integration and how customers want to manage their services. This problem also exists for IB, but most people have not thought about this from a customer perspective and how to integrate the IB semantics into the way customers manage their infrastructures, do billing, etc. For some environments, they simply do not care, but if IB is to be used in the enterprise space, then some thought will be required here since most IT organizations don't see anything as being free or self-managed. Again, Sockets is an application API and not how one communicates to a TOE or RDMA component. The RNIC PI has been proposed as an interface to the RDMA functionality. The PI supports all of the iWARP and IB v 1.2 verbs. Mike Thanks SG --- Venkata Jagana [EMAIL PROTECTED] wrote: [EMAIL PROTECTED] wrote on 05/25/2005 09:47:00 PM: Venkata, Interesting coincidence: I was talking with someone (at HP) today who knows substantially more than I do about RNICs. They indicated RNICs need to manage TCP state on the card from userspace. I suspect that's only possible through a private interface (e.g. ioctl() or /proc) or the non-existent (in kernel.org) TOE implementation. Is this correct? Not correct. Since RNICs are offloaded adapters with RDMA protocols layered on top of the TCP stack, they do maintain the TCP state internally, but it is not exposed to the host. RNICs expose only the RNIC Verbs interface to the host, not a TOE interface. Thanks Venkat hth, grant ___ Rdma-developers mailing list [EMAIL PROTECTED] https://lists.sourceforge.net/lists/listinfo/rdma-developers ___ openib-general mailing list openib-general@openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
Re: [Rdma-developers] Re: [openib-general] OpenIB and OpenRDMA: Convergence on common RDMA APIs and ULPs for Linux
At 09:29 AM 5/27/2005, Grant Grundler wrote: On Fri, May 27, 2005 at 07:24:44AM -0700, Michael Krause wrote: ... Again, Sockets is an application API and not how one communicates to a TOE or RDMA component. Mike, What address family is used to open a socket over iWARP? AF_INET? Or something else? TCP = AF_INET. Address family != Sockets. Sockets is an API that can operate over multiple address families. An application can be coded to Sockets, IT API, DAPL, or a verbs interface like the RNIC PI. It is a matter of choice as well as what is trying to be accomplished. The RNIC PI is an acceptable interface for any RDMA-focused ULP. There are pros / cons to using such a verbs interface directly, but I do not believe anyone can deny that a general-purpose verbs API is a good thing at the end of the day as it works for the volume verbs definition. Whether one applies further hardware semantics abstraction such as IT API / DAPL should be a choice for the individual subsystem as there is no single right answer across all subsystems. Attempting to force-fit one isn't practical. I understand most of what you wrote but am still missing one bit: How is the RNIC told what the peer IP is it should communicate with? The destination address (IB GID or IP) is derived from the CM services. This is where the two interconnects differ in what is required to physically inject a packet on the wire. This is why I call it out as separate from the verbs interface and something that could be abstracted to some extent but that, at the end of the day, really requires the subsystem to understand the underlying fabric type to make some intelligent choices. Given this effort is still nascent, most of the issues beyond basic bootstrap have not really been discussed as yet. The RNIC PI has been proposed as an interface to the RDMA functionality. The PI supports all of the iWARP and IB v 1.2 verbs. That's good. Folks from the RDMA Consortium will have to look at openib implementations and see what's missing / wrong. Then submit proposals to fill in the gaps. I'm obviously not the first one to say this. There are two open source efforts. The question is whether to move to a single effort (I tried to get this to occur before OpenIB was formally launched, but it seemed to fall on deaf ears for TTM marketing purposes) or whether to just coordinate on some of the basics. My preference remains that the efforts remain strictly focused on the RDMA infrastructure and interconnect-specific components and leave the ULP / services as separate efforts that will make their own decisions on how best to interface with the RDMA infrastructure. I expect most of the principals involved with openib.org do NOT have time to browse through the RNIC PI at this point. They are struggling to get openib.org filled in sufficiently so it can go into a commercial distro (RH/SuSE primarily). Hence why OpenRDMA needs to get source developed to enable the RNIC community. If people find value in the work, then people can look at finding the right solution for both IB and iWARP when it makes sense. Revenue for them comes from selling IB equipment. Having openib.org code in kernel.org is a key enabler for getting into commercial distros. I expect the same is true for RNIC vendors as well. RNIC Vendors (and related switch Vendors) will have to decide which path is the right one for them to get the support into kernel.org. Several openib.org people have suggested one (like I have). RNIC folks need to listen and decide if the advice is good or not.
If RNIC folks think they know better, then please take another look at where openib.org is today and where rdmaconsortium is. I'm certain openib.org would be dead now if policies and direction changes had not been made last year as demanded by several key Linux developers and users (Gov Labs). Understood. Mike ___ openib-general mailing list openib-general@openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general