Re: [openib-general] question - mapping QPIDs back to ptrs
> The Chelsio driver is hogging lots of memory right now for mapping
> PDIDs, QPIDs, CQIDs, and STAG IDs back to their respective kernel
> structures. This is done via an array of pointers, indexed by the ID.
> The critical performance mapping is finding a QP struct from the QPID in
> the poll path.

mthca rolls its own two-level sparse arrays (the mthca_array_xxx stuff), but it would probably be smarter to use the kernel's radix tree code. I've been meaning to benchmark mthca after converting those tables to radix trees, to see if it makes a difference.

- R.
___
openib-general mailing list
openib-general@openib.org
http://openib.org/mailman/listinfo/openib-general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
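To make the memory trade-off concrete, here is a minimal sketch of a two-level sparse table that maps an integer ID to an object. It is written in Python for brevity and is not the mthca code or the kernel radix tree; the class name, the page size and the method names are all assumptions made for illustration.

```python
class TwoLevelArray:
    """Sparse ID -> object table: a top-level list of lazily allocated pages.

    Hypothetical illustration only; names and page size are assumptions.
    """

    def __init__(self, max_ids, page_size=512):
        self.page_size = page_size
        n_pages = (max_ids + page_size - 1) // page_size
        # Each top-level slot is None or [used_count, slots].
        self.pages = [None] * n_pages

    def set(self, idx, obj):
        if obj is None:
            raise ValueError("use clear() to remove an entry")
        p, off = divmod(idx, self.page_size)
        if self.pages[p] is None:
            # Allocate the second-level page only on first use.
            self.pages[p] = [0, [None] * self.page_size]
        page = self.pages[p]
        if page[1][off] is None:
            page[0] += 1
        page[1][off] = obj

    def get(self, idx):
        # Lookup is two index operations, which suits a hot poll path.
        p, off = divmod(idx, self.page_size)
        page = self.pages[p]
        return None if page is None else page[1][off]

    def clear(self, idx):
        p, off = divmod(idx, self.page_size)
        page = self.pages[p]
        if page is None or page[1][off] is None:
            return
        page[1][off] = None
        page[0] -= 1
        if page[0] == 0:
            # Free the page once its last entry is gone.
            self.pages[p] = None

    def allocated_pages(self):
        return sum(1 for page in self.pages if page is not None)
```

With IDs that cluster in a few ranges, only the pages that are actually touched get allocated, which is the point of the sparse layout compared with one flat array of pointers.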
Re: [openib-general] Question about the query QP mask
> What should be the expected behavior?
> Should this description be changed, or should the low-level drivers
> of mthca and ipath be changed?

The mask is used as a hint to the low-level driver about which attributes the consumer cares about. The driver may fill in more fields, but it can use the mask to optimize some calls, if filling in a particular field is expensive and that field is not requested by the consumer. I guess we should update the documentation to reflect this.

- R.
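A rough sketch of the "mask as a hint" behaviour described above, in Python rather than driver C. The flag names, the `hw` dictionary and the `query_qp` function are hypothetical and do not correspond to the real verbs API; the only point is that cheap fields may always be filled in while an expensive one is fetched only when the mask asks for it.

```python
from enum import IntFlag


class QpAttrMask(IntFlag):
    # Hypothetical flag names, for illustration only.
    STATE = 1
    PATH_MTU = 2
    TIMEOUT = 4


def query_qp(hw, mask):
    """Return QP attributes, treating `mask` as a hint rather than a filter."""
    # Cheap fields: filled in whether or not they were requested.
    attrs = {"state": hw["state"], "path_mtu": hw["path_mtu"]}
    # Expensive field: only fetched when the consumer asked for it.
    if mask & QpAttrMask.TIMEOUT:
        attrs["timeout"] = hw["read_timeout"]()
    return attrs
```

A consumer that passes only `QpAttrMask.STATE` may still see `path_mtu` in the result, but must not rely on `timeout` being present.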
Re: [openib-general] Question about multicast GIDs
Robert Walsh wrote:
> Roland Dreier wrote:
>> > Is there a registration authority for multicast GIDs? Or at least a
>> > safe way of assigning a range of GIDs to a vendor?
>>
>> I don't think so. Perhaps RFC 3307 would be of some use...
>
> Ah - looks exactly like what I was looking for. Thanks.

Hmm - spoke too soon. This seems to be related to IPv6 multicast addresses, but not IB. The idea is similar, but the allocation mechanism is entirely arbitrary (but consistent) and I don't think it would map from IPv6 to IB in any meaningful way.

I'll talk to the folks here who are on the various IB committees and see if they have any thoughts on this.

Regards,
Robert.
Re: [openib-general] Question about multicast GIDs
Roland Dreier wrote:
> > Is there a registration authority for multicast GIDs? Or at least a
> > safe way of assigning a range of GIDs to a vendor?
>
> I don't think so. Perhaps RFC 3307 would be of some use...

Ah - looks exactly like what I was looking for. Thanks.
Re: [openib-general] Question about multicast GIDs
> Is there a registration authority for multicast GIDs? Or at least a
> safe way of assigning a range of GIDs to a vendor?

I don't think so. Perhaps RFC 3307 would be of some use...

- R.
Re: [openib-general] question on QoS support
On Mon, 2006-11-06 at 13:13, Wang, Feiyi wrote:
> Hal -
>
> Please see the output for active port 1 (although there are two ports on
> this HCA, the second one is disabled now).
>
> #smpquery portinfo 8 1
> # Port info: Lid 8 port 1
> Mkey:............................0x
> GidPrefix:.......................0xfe80
> Lid:.............................0x0008
> SMLid:...........................0x0001
> CapMask:.........................0x2510a68
>         IsTrapSupported
>         IsAutomaticMigrationSupported
>         IsSLMappingSupported
>         IsLedInfoSupported
>         IsSystemImageGUIDsupported
>         IsCommunicatonManagementSupported
>         IsVendorClassSupported
>         IsCapabilityMaskNoticeSupported
>         IsClientRegistrationSupported
> DiagCode:........................0x
> MkeyLeasePeriod:.................0
> LocalPort:.......................1
> LinkWidthEnabled:................1X or 4X
> LinkWidthSupported:..............1X or 4X
> LinkWidthActive:.................4X
> LinkSpeedSupported:..............2.5 or 5.0 Gbps
> LinkState:.......................Active
> PhysLinkState:...................LinkUp
> LinkDownDefState:................Polling
> ProtectBits:.....................0
> LMC:.............................0
> LinkSpeedActive:.................2.5 Gbps
> LinkSpeedEnabled:................2.5 or 5.0 Gbps
> NeighborMTU:.....................2048
> SMSL:............................0
> VLCap:...........................VL0-7
> InitType:........................0x00
> VLHighLimit:.....................255

OK; this is pretty conclusive.

> VLArbHighCap:....................8
> VLArbLowCap:.....................8
> InitReply:.......................0x00
> MtuCap:..........................2048
> VLStallCount:....................7
> HoqLife:.........................31
> OperVLs:.........................VL0-7
> PartEnforceInb:..................0
> PartEnforceOutb:.................0
> FilterRawInb:....................0
> FilterRawOutb:...................0
> MkeyViolations:..................0
> PkeyViolations:..................0
> QkeyViolations:..................0
> GuidCap:.........................32
> ClientReregister:................0
> SubnetTimeout:...................18
> RespTimeVal:.....................16
> LocalPhysErr:....................8
> OverrunErr:......................8
> MaxCreditHint:...................0
> RoundTrip:.......................0

Do you have an IB analyzer ?

-- Hal

> Feiyi
>
> -----Original Message-----
> From: Hal Rosenstock [mailto:[EMAIL PROTECTED]
> Sent: Friday, November 03, 2006 3:58 PM
> To: Wang, Feiyi
> Cc: openib-general@openib.org
> Subject: RE: [openib-general] question on QoS support
>
> On Fri, 2006-11-03 at 15:56, Wang, Feiyi wrote:
> > 255
> >
> > I think I tested with default 0 before, that is send at most one packet
> > before give low priority table the chance according to IBA. It doesn't
> > seem to make a difference though.
> > I was hoping you would say 0 as that means 1 packet before looking at > low priority. > > 255 means unbounded packets on high priority. Can you send me the > results of smpquery portinfo on that port to ensure that it is being set > properly ? > > -- Hal > > > Feiyi > > > > > > -Original Message- > > From: Hal Rosenstock [mailto:[EMAIL PROTECTED] > > Sent: Friday, November 03, 2006 3:51 PM > > To: Wang, Feiyi > > Cc: openib-general@openib.org > > Subject: RE: [openib-general] question on QoS support > > > > On Fri, 2006-11-03 at 15:43, Wang, Feiyi wrote: > > > The test is done on two hosts, say A and B. A has 4x SDR (run > > ib_rdam_bw > > > as server), B has 4x DDR (run more than one thread of ib_rdma_bw as > > > clients). The sl2vl table read as: > > > > > > smpquery sl2vl 7 > > > # SL2VL table: Lid 7 > > > # SL: | 0| 1| 2| 3| 4| 5| 6| 7| 8| > > 9|10|11|12|13|14|15| > > > ports: in 0, out 0: | 0| 1| 2| 3| 4| 5| 6| 7| 0| 1| 2| 3| 4| 5| 6| > > 7| > > > > > > smpquery vlarb 7 > > > # VLArbitration tables: Lid 7 port 0 LowCap 8 HighCap 8 > > > # Low priority VL Arbitration Table: > > > VL: |0x0 |0x1 |0x2 |0x3 |0x4 |0x5 |0x6 |0x7 | > > > WEIGHT: |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 | > > > # High priority VL Arbitration Table: &g
Re: [openib-general] question on QoS support
Hal -

Please see the output for active port 1 (although there are two ports on this HCA, the second one is disabled now).

#smpquery portinfo 8 1
# Port info: Lid 8 port 1
Mkey:............................0x
GidPrefix:.......................0xfe80
Lid:.............................0x0008
SMLid:...........................0x0001
CapMask:.........................0x2510a68
        IsTrapSupported
        IsAutomaticMigrationSupported
        IsSLMappingSupported
        IsLedInfoSupported
        IsSystemImageGUIDsupported
        IsCommunicatonManagementSupported
        IsVendorClassSupported
        IsCapabilityMaskNoticeSupported
        IsClientRegistrationSupported
DiagCode:........................0x
MkeyLeasePeriod:.................0
LocalPort:.......................1
LinkWidthEnabled:................1X or 4X
LinkWidthSupported:..............1X or 4X
LinkWidthActive:.................4X
LinkSpeedSupported:..............2.5 or 5.0 Gbps
LinkState:.......................Active
PhysLinkState:...................LinkUp
LinkDownDefState:................Polling
ProtectBits:.....................0
LMC:.............................0
LinkSpeedActive:.................2.5 Gbps
LinkSpeedEnabled:................2.5 or 5.0 Gbps
NeighborMTU:.....................2048
SMSL:............................0
VLCap:...........................VL0-7
InitType:........................0x00
VLHighLimit:.....................255
VLArbHighCap:....................8
VLArbLowCap:.....................8
InitReply:.......................0x00
MtuCap:..........................2048
VLStallCount:....................7
HoqLife:.........................31
OperVLs:.........................VL0-7
PartEnforceInb:..................0
PartEnforceOutb:.................0
FilterRawInb:....................0
FilterRawOutb:...................0
MkeyViolations:..................0
PkeyViolations:..................0
QkeyViolations:..................0
GuidCap:.........................32
ClientReregister:................0
SubnetTimeout:...................18
RespTimeVal:.....................16
LocalPhysErr:....................8
OverrunErr:......................8
MaxCreditHint:...................0
RoundTrip:.......................0

Feiyi

-----Original Message-----
From: Hal Rosenstock [mailto:[EMAIL PROTECTED]
Sent: Friday, November 03, 2006 3:58 PM
To: Wang, Feiyi
Cc: openib-general@openib.org
Subject: RE: [openib-general] question on QoS support

On Fri, 2006-11-03 at 15:56, Wang, Feiyi wrote:
> 255
>
> I think I tested with default 0 before, that is send at most one packet
> before give low priority table the chance according to IBA. It doesn't
> seem to make a difference though.

I was hoping you would say 0 as that means 1 packet before looking at low priority.

255 means unbounded packets on high priority. Can you send me the results of smpquery portinfo on that port to ensure that it is being set properly ?
-- Hal > Feiyi > > > -Original Message- > From: Hal Rosenstock [mailto:[EMAIL PROTECTED] > Sent: Friday, November 03, 2006 3:51 PM > To: Wang, Feiyi > Cc: openib-general@openib.org > Subject: RE: [openib-general] question on QoS support > > On Fri, 2006-11-03 at 15:43, Wang, Feiyi wrote: > > The test is done on two hosts, say A and B. A has 4x SDR (run > ib_rdam_bw > > as server), B has 4x DDR (run more than one thread of ib_rdma_bw as > > clients). The sl2vl table read as: > > > > smpquery sl2vl 7 > > # SL2VL table: Lid 7 > > # SL: | 0| 1| 2| 3| 4| 5| 6| 7| 8| > 9|10|11|12|13|14|15| > > ports: in 0, out 0: | 0| 1| 2| 3| 4| 5| 6| 7| 0| 1| 2| 3| 4| 5| 6| > 7| > > > > smpquery vlarb 7 > > # VLArbitration tables: Lid 7 port 0 LowCap 8 HighCap 8 > > # Low priority VL Arbitration Table: > > VL: |0x0 |0x1 |0x2 |0x3 |0x4 |0x5 |0x6 |0x7 | > > WEIGHT: |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 | > > # High priority VL Arbitration Table: > > VL: |0x0 |0x1 |0x2 |0x3 |0x4 |0x5 |0x6 |0x7 | > > WEIGHT: |0x1 |0x0 |0x8 |0x0 |0x0 |0x0 |0x0 |0x0 | > > > > Low priority table entries are all zero to skip. > > High priority table give VL 0 and VL 2 different weight. > > > > The SL is specified on command line, one thread with SL 0, the other > > thread with SL 2. > > > > Thanks for looking into this, and let me know if more info is needed. > > What's the limit of high priority ? > > -- Hal > > > Feiyi > > > > > > > > -Original Message- > > From: Hal Rosenstock [mailto:[EMAIL PROTECTED] > > Sent: Friday
Re: [openib-general] question on QoS support
On Fri, 2006-11-03 at 15:56, Wang, Feiyi wrote: > 255 > > I think I tested with default 0 before, that is send at most one packet > before give low priority table the chance according to IBA. It doesn't > seem to make a difference though. I was hoping you would say 0 as that means 1 packet before looking at low priority. 255 means unbounded packets on high priority. Can you send me the results of smpquery portinfo on that port to ensure that it is being set properly ? -- Hal > Feiyi > > > -Original Message- > From: Hal Rosenstock [mailto:[EMAIL PROTECTED] > Sent: Friday, November 03, 2006 3:51 PM > To: Wang, Feiyi > Cc: openib-general@openib.org > Subject: RE: [openib-general] question on QoS support > > On Fri, 2006-11-03 at 15:43, Wang, Feiyi wrote: > > The test is done on two hosts, say A and B. A has 4x SDR (run > ib_rdam_bw > > as server), B has 4x DDR (run more than one thread of ib_rdma_bw as > > clients). The sl2vl table read as: > > > > smpquery sl2vl 7 > > # SL2VL table: Lid 7 > > # SL: | 0| 1| 2| 3| 4| 5| 6| 7| 8| > 9|10|11|12|13|14|15| > > ports: in 0, out 0: | 0| 1| 2| 3| 4| 5| 6| 7| 0| 1| 2| 3| 4| 5| 6| > 7| > > > > smpquery vlarb 7 > > # VLArbitration tables: Lid 7 port 0 LowCap 8 HighCap 8 > > # Low priority VL Arbitration Table: > > VL: |0x0 |0x1 |0x2 |0x3 |0x4 |0x5 |0x6 |0x7 | > > WEIGHT: |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 | > > # High priority VL Arbitration Table: > > VL: |0x0 |0x1 |0x2 |0x3 |0x4 |0x5 |0x6 |0x7 | > > WEIGHT: |0x1 |0x0 |0x8 |0x0 |0x0 |0x0 |0x0 |0x0 | > > > > Low priority table entries are all zero to skip. > > High priority table give VL 0 and VL 2 different weight. > > > > The SL is specified on command line, one thread with SL 0, the other > > thread with SL 2. > > > > Thanks for looking into this, and let me know if more info is needed. > > What's the limit of high priority ? 
> > -- Hal > > > Feiyi > > > > > > > > -Original Message- > > From: Hal Rosenstock [mailto:[EMAIL PROTECTED] > > Sent: Friday, November 03, 2006 3:27 PM > > To: Wang, Feiyi > > Cc: openib-general@openib.org > > Subject: Re: [openib-general] question on QoS support > > > > On Fri, 2006-11-03 at 15:12, Feiyi Wang wrote: > > > In our test at the ORNL - it appears you can "turn off" the traffic > by > > > giving every VL weight 0. > > > > A weight of 0 indicates to skip that entry. > > > > > As soon as you assign non-zero VL weight, > > > the traffic starts to flow, however, VL with more weight doesn't > have > > > expected preference treatment. In other words, traffic shaping > didn't > > > take place. smpquery vlarb verified the mapping table was there. > > > > correctly ? > > > > Is it high or low priority or both ? > > > > What about SL2VLMapping table ? Is it setup correctly ? > > > > What's your topology for this ? > > > > Can you send your SL2VLMapping and VLarbitration configuration ? > > > > > I believe the scenario described below 'should' be able to generate > > > congestion point ... but it would be helpful if someone can > elaborate > > > a way to "look into" how/if scheduling/arbitration take place. > > > > The only ways I know would be to look at either the packets on the > wire > > or what you are doing with multiple streams which seems valid to me. > > > > Have you read section 7.6.9.2 (p. 189-190) in IBA 1.2 volume 1 to > > understand how to configure this ? > > > > -- Hal > > > > > Best, > > > > > > Feiyi > > > > > > > > > On 02 Nov 2006 10:49:04 -0500, Hal Rosenstock <[EMAIL PROTECTED]> > > wrote: > > > > Hi Oliver, > > > > > > > > On Thu, 2006-11-02 at 10:20, Oliver wrote: > > > > > Hi, Hal - > > > > > > > > > > > How is this being observed/measured ? > > > > > > > > > > Host A, B, with 4x DDR both connected to Flextronic switch. > > > > > A single process of ibv_read_bw gives about 1415MB /s average > > > > > bandwidth. 
Two concurrent process report 714.45 MB/s each, dead > > even. > > > > > Now if I bump up one process with a different SL, then I expect > to > > see > > > > > shaping to take place. Please let me if the scenario makes > sense. > > > &
Re: [openib-general] question on QoS support
255

I think I tested with the default of 0 before, that is, send at most one packet before giving the low-priority table a chance, according to IBA. It doesn't seem to make a difference though.

Feiyi

-----Original Message-----
From: Hal Rosenstock [mailto:[EMAIL PROTECTED]
Sent: Friday, November 03, 2006 3:51 PM
To: Wang, Feiyi
Cc: openib-general@openib.org
Subject: RE: [openib-general] question on QoS support

On Fri, 2006-11-03 at 15:43, Wang, Feiyi wrote:
> The test is done on two hosts, say A and B. A has 4x SDR (run ib_rdma_bw
> as server), B has 4x DDR (run more than one thread of ib_rdma_bw as
> clients). The sl2vl table read as:
>
> smpquery sl2vl 7
> # SL2VL table: Lid 7
> # SL:               | 0| 1| 2| 3| 4| 5| 6| 7| 8| 9|10|11|12|13|14|15|
> ports: in 0, out 0: | 0| 1| 2| 3| 4| 5| 6| 7| 0| 1| 2| 3| 4| 5| 6| 7|
>
> smpquery vlarb 7
> # VLArbitration tables: Lid 7 port 0 LowCap 8 HighCap 8
> # Low priority VL Arbitration Table:
> VL:     |0x0 |0x1 |0x2 |0x3 |0x4 |0x5 |0x6 |0x7 |
> WEIGHT: |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |
> # High priority VL Arbitration Table:
> VL:     |0x0 |0x1 |0x2 |0x3 |0x4 |0x5 |0x6 |0x7 |
> WEIGHT: |0x1 |0x0 |0x8 |0x0 |0x0 |0x0 |0x0 |0x0 |
>
> Low priority table entries are all zero to skip.
> High priority table give VL 0 and VL 2 different weight.
>
> The SL is specified on command line, one thread with SL 0, the other
> thread with SL 2.
>
> Thanks for looking into this, and let me know if more info is needed.

What's the limit of high priority ?

-- Hal

> Feiyi
>
>
>
> -----Original Message-----
> From: Hal Rosenstock [mailto:[EMAIL PROTECTED]
> Sent: Friday, November 03, 2006 3:27 PM
> To: Wang, Feiyi
> Cc: openib-general@openib.org
> Subject: Re: [openib-general] question on QoS support
>
> On Fri, 2006-11-03 at 15:12, Feiyi Wang wrote:
> > In our test at the ORNL - it appears you can "turn off" the traffic by
> > giving every VL weight 0.
>
> A weight of 0 indicates to skip that entry.
> > > As soon as you assign non-zero VL weight, > > the traffic starts to flow, however, VL with more weight doesn't have > > expected preference treatment. In other words, traffic shaping didn't > > take place. smpquery vlarb verified the mapping table was there. > > correctly ? > > Is it high or low priority or both ? > > What about SL2VLMapping table ? Is it setup correctly ? > > What's your topology for this ? > > Can you send your SL2VLMapping and VLarbitration configuration ? > > > I believe the scenario described below 'should' be able to generate > > congestion point ... but it would be helpful if someone can elaborate > > a way to "look into" how/if scheduling/arbitration take place. > > The only ways I know would be to look at either the packets on the wire > or what you are doing with multiple streams which seems valid to me. > > Have you read section 7.6.9.2 (p. 189-190) in IBA 1.2 volume 1 to > understand how to configure this ? > > -- Hal > > > Best, > > > > Feiyi > > > > > > On 02 Nov 2006 10:49:04 -0500, Hal Rosenstock <[EMAIL PROTECTED]> > wrote: > > > Hi Oliver, > > > > > > On Thu, 2006-11-02 at 10:20, Oliver wrote: > > > > Hi, Hal - > > > > > > > > > How is this being observed/measured ? > > > > > > > > Host A, B, with 4x DDR both connected to Flextronic switch. > > > > A single process of ibv_read_bw gives about 1415MB /s average > > > > bandwidth. Two concurrent process report 714.45 MB/s each, dead > even. > > > > Now if I bump up one process with a different SL, then I expect to > see > > > > shaping to take place. Please let me if the scenario makes sense. > > > > > > It makes sense. However, if the higher priority traffic does not > fill > > > the scheduling, the low priority can take up the slack so I'm not > sure > > > if this is what you are seeing or something else. > > > > > > It might be interesting to try the same thing at SDR speeds. > > > > > > -- Hal > > > > > > > > Yes, 8 VLs should be supported in your subnet. 
You can verify > this with > > > > > smpquery portinfo on the HCA port and examine OperVLs assuming > the port > > > > > is ACTIVE. > > > > > > > > yes, I verified the data VL support, it is 8. I will poke for more > > > > info with suggested commands by Sasha. > > > > > > > > > > A related question is, if I modify qos setting in SM,
Re: [openib-general] question on QoS support
On Fri, 2006-11-03 at 15:12, Feiyi Wang wrote:
> In our test at the ORNL - it appears you can "turn off" the traffic by
> giving every VL weight 0.

A weight of 0 indicates to skip that entry.

> As soon as you assign non-zero VL weight,
> the traffic starts to flow, however, VL with more weight doesn't have
> expected preference treatment. In other words, traffic shaping didn't
> take place. smpquery vlarb verified the mapping table was there.

correctly ?

Is it high or low priority or both ?

What about SL2VLMapping table ? Is it setup correctly ?

What's your topology for this ?

Can you send your SL2VLMapping and VLarbitration configuration ?

> I believe the scenario described below 'should' be able to generate
> congestion point ... but it would be helpful if someone can elaborate
> a way to "look into" how/if scheduling/arbitration take place.

The only ways I know would be to look at either the packets on the wire or what you are doing with multiple streams, which seems valid to me.

Have you read section 7.6.9.2 (p. 189-190) in IBA 1.2 volume 1 to understand how to configure this ?

-- Hal

> Best,
>
> Feiyi
>
>
> On 02 Nov 2006 10:49:04 -0500, Hal Rosenstock <[EMAIL PROTECTED]> wrote:
> > Hi Oliver,
> >
> > On Thu, 2006-11-02 at 10:20, Oliver wrote:
> > > Hi, Hal -
> > >
> > > > How is this being observed/measured ?
> > >
> > > Host A, B, with 4x DDR both connected to Flextronics switch.
> > > A single process of ibv_read_bw gives about 1415 MB/s average
> > > bandwidth. Two concurrent processes report 714.45 MB/s each, dead even.
> > > Now if I bump up one process with a different SL, then I expect to see
> > > shaping to take place. Please let me know if the scenario makes sense.
> >
> > It makes sense. However, if the higher priority traffic does not fill
> > the scheduling, the low priority can take up the slack so I'm not sure
> > if this is what you are seeing or something else.
> >
> > It might be interesting to try the same thing at SDR speeds.
> >
> > -- Hal
> >
> > > > Yes, 8 VLs should be supported in your subnet. You can verify this with
> > > > smpquery portinfo on the HCA port and examine OperVLs assuming the port
> > > > is ACTIVE.
> > >
> > > yes, I verified the data VL support, it is 8. I will poke for more
> > > info with suggested commands by Sasha.
> > >
> > > > > A related question is, if I modify qos setting in SM, do I need to
> > > > > restart SA on each hosts for it to see the changes? (I am hoping not,
> > > > > as I tried in the test, it doesn't seem to make a difference)
> > > >
> > > > Not sure what you mean. SA is tightly coupled with the OpenSM. Do you
> > > > mean SA client ? The client hosts don't need restarting but did you
> > > > restart OpenSM with your QoS configuration ?
> > >
> > > I mean client SA. yes, I understand OpenSM needs to be restarted.
> > >
> > > > BTW, which OpenSM are you running ?
> > >
> > > OFED 1.1 based.
> > >
> > > thanks
> > >
> > > - Oliver
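On the SL2VLMapping question, a small sketch of how an SL resolves to a VL may help. It uses the single row that `smpquery sl2vl 7` printed elsewhere in this thread (SLs 0-7 map to VLs 0-7, and SLs 8-15 wrap around to VLs 0-7); the function name is an assumption and the code is Python for illustration only.

```python
# SL2VL row as printed by `smpquery sl2vl 7` in this thread:
# list index = SL, value = VL.
SL2VL = [0, 1, 2, 3, 4, 5, 6, 7, 0, 1, 2, 3, 4, 5, 6, 7]


def vl_for_sl(sl, table=SL2VL):
    """Look up the VL a given SL is mapped to (hypothetical helper)."""
    if not 0 <= sl < len(table):
        raise ValueError("SL must be in the range 0-15")
    return table[sl]
```

With this table the two test streams (SL 0 and SL 2) land on VL 0 and VL 2, so they are at least on different VLs before arbitration comes into play.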
Re: [openib-general] question on QoS support
In our test at ORNL, it appears you can "turn off" the traffic by giving every VL weight 0. As soon as you assign a non-zero VL weight, the traffic starts to flow; however, the VL with more weight doesn't get the expected preferential treatment. In other words, traffic shaping didn't take place. smpquery vlarb verified the mapping table was there.

I believe the scenario described below 'should' be able to generate a congestion point ... but it would be helpful if someone could elaborate on a way to "look into" how/if scheduling/arbitration takes place.

Best,

Feiyi

On 02 Nov 2006 10:49:04 -0500, Hal Rosenstock <[EMAIL PROTECTED]> wrote:
> Hi Oliver,
>
> On Thu, 2006-11-02 at 10:20, Oliver wrote:
> > Hi, Hal -
> >
> > > How is this being observed/measured ?
> >
> > Host A, B, with 4x DDR both connected to Flextronics switch.
> > A single process of ibv_read_bw gives about 1415 MB/s average
> > bandwidth. Two concurrent processes report 714.45 MB/s each, dead even.
> > Now if I bump up one process with a different SL, then I expect to see
> > shaping to take place. Please let me know if the scenario makes sense.
>
> It makes sense. However, if the higher priority traffic does not fill
> the scheduling, the low priority can take up the slack so I'm not sure
> if this is what you are seeing or something else.
>
> It might be interesting to try the same thing at SDR speeds.
>
> -- Hal
>
> > > Yes, 8 VLs should be supported in your subnet. You can verify this with
> > > smpquery portinfo on the HCA port and examine OperVLs assuming the port
> > > is ACTIVE.
> >
> > yes, I verified the data VL support, it is 8. I will poke for more
> > info with suggested commands by Sasha.
> >
> > > > A related question is, if I modify qos setting in SM, do I need to
> > > > restart SA on each hosts for it to see the changes? (I am hoping not,
> > > > as I tried in the test, it doesn't seem to make a difference)
> > >
> > > Not sure what you mean. SA is tightly coupled with the OpenSM. Do you
> > > mean SA client ? The client hosts don't need restarting but did you
> > > restart OpenSM with your QoS configuration ?
> >
> > I mean client SA. yes, I understand OpenSM needs to be restarted.
> >
> > > BTW, which OpenSM are you running ?
> >
> > OFED 1.1 based.
> >
> > thanks
> >
> > - Oliver
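For a sense of what "expected preference treatment" would look like, here is an idealized weighted round-robin over the high-priority table that `smpquery vlarb 7` reported in this thread (VL 0 weight 1, VL 2 weight 8, all others 0). This is a sketch under stated assumptions and not the IBA arbiter: it counts abstract weight units, ignores packet-size rounding and the high-priority limit, and the function and variable names are made up for illustration.

```python
# High-priority VL arbitration table as (VL, weight) pairs, taken from the
# `smpquery vlarb 7` output quoted in this thread.
HIGH_TABLE = [(0, 1), (1, 0), (2, 8), (3, 0), (4, 0), (5, 0), (6, 0), (7, 0)]


def arbitrate(table, backlogged, rounds):
    """Idealized weighted round-robin; returns weight units sent per VL.

    Assumptions: every backlogged VL can always use its full weight, and
    packet granularity is ignored.
    """
    sent = {vl: 0 for vl, _ in table}
    for _ in range(rounds):
        for vl, weight in table:
            # A weight of 0 means "skip this entry"; an idle VL is skipped too.
            if weight == 0 or vl not in backlogged:
                continue
            sent[vl] += weight
    return sent
```

In this model, `arbitrate(HIGH_TABLE, {0, 2}, 1000)` gives VL 2 eight times the units of VL 0, which is the split one would hope to see at a saturated port. If only VL 0 has traffic queued, it gets every turn, which matches the point made elsewhere in the thread that other traffic can take up the slack.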
Re: [openib-general] Question on ucma
Sean posted 7 patches that include the ucma support. You'll need those + the one librdmacm patch he posted.

Steve.

On Fri, 2006-11-03 at 13:59 +0530, Krishna Kumar2 wrote:
> Hi,
>
> I installed the 2.6.19-rc3 bits, and when I try to run
> perftest/rdma_bw (with '-c' option), I get the error :
> "librdmacm: Couldnt open rdma_cm ABI version".
>
> I found that this is due to ucma not being present in
> mainline kernel bits (which creates /sys/class/misc/rdma_cm).
> So how can I resolve this and run these tests ?
>
> Thanks,
>
> - KK
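A quick way to tell whether the running kernel has ucma loaded is to look for the sysfs entry mentioned above. The sketch below is Python and rests on one assumption that the thread does not state: that the ABI version is exposed as a file named `abi_version` inside /sys/class/misc/rdma_cm. The function name is also made up.

```python
import os


def rdma_cm_abi_version(sysfs_dir="/sys/class/misc/rdma_cm"):
    """Return the rdma_cm ABI version string, or None if it is not available.

    Assumption: the version lives in `<sysfs_dir>/abi_version`.
    """
    path = os.path.join(sysfs_dir, "abi_version")
    try:
        with open(path) as f:
            return f.read().strip()
    except OSError:
        # Directory or file missing: ucma is probably not present or loaded.
        return None
```

If this returns None on a kernel without the ucma patches, that would be consistent with the librdmacm error quoted above.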
Re: [openib-general] question on QoS support
Hi Oliver,

On Thu, 2006-11-02 at 10:20, Oliver wrote:
> Hi, Hal -
>
> > How is this being observed/measured ?
>
> Host A, B, with 4x DDR both connected to Flextronic switch.
> A single process of ibv_read_bw gives about 1415MB /s average
> bandwidth. Two concurrent process report 714.45 MB/s each, dead even.
> Now if I bump up one process with a different SL, then I expect to see
> shaping to take place. Please let me if the scenario makes sense.

It makes sense. However, if the higher priority traffic does not fill the scheduling, the low priority can take up the slack so I'm not sure if this is what you are seeing or something else.

It might be interesting to try the same thing at SDR speeds.

-- Hal

> > Yes, 8 VLs should be supported in your subnet. You can verify this with
> > smpquery portinfo on the HCA port and examine OperVLs assuming the port
> > is ACTIVE.
>
> yes, I verified the data VL support, it is 8. I will poke for more
> info with suggested commands by Sasha.
>
> > > A related question is, if I modify qos setting in SM, do I need to
> > > restart SA on each hosts for it to see the changes? (I am hoping not,
> > > as I tried in the test, it doesn't seem to make a difference)
> >
> > Not sure what you mean. SA is tightly coupled with the OpenSM. Do you
> > mean SA client ? The client hosts don't need restarting but did you
> > restart OpenSM with your QoS configuration ?
>
> I mean client SA. yes, I understand OpenSM needs to be restarted.
>
> > BTW, which OpenSM are you running ?
>
> OFED 1.1 based.
>
> thanks
>
> - Oliver
Re: [openib-general] question on QoS support
Hi, Hal -

> How is this being observed/measured ?

Hosts A and B, both with 4x DDR, are connected to a Flextronics switch. A single process of ibv_read_bw gives about 1415 MB/s average bandwidth. Two concurrent processes report 714.45 MB/s each, dead even. Now if I bump up one process with a different SL, then I expect to see shaping take place. Please let me know if the scenario makes sense.

> Yes, 8 VLs should be supported in your subnet. You can verify this with
> smpquery portinfo on the HCA port and examine OperVLs assuming the port
> is ACTIVE.

Yes, I verified the data VL support, it is 8. I will poke for more info with the commands suggested by Sasha.

> > A related question is, if I modify qos setting in SM, do I need to
> > restart SA on each hosts for it to see the changes? (I am hoping not,
> > as I tried in the test, it doesn't seem to make a difference)
>
> Not sure what you mean. SA is tightly coupled with the OpenSM. Do you
> mean SA client ? The client hosts don't need restarting but did you
> restart OpenSM with your QoS configuration ?

I mean client SA. Yes, I understand OpenSM needs to be restarted.

> BTW, which OpenSM are you running ?

OFED 1.1 based.

thanks

- Oliver
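As a back-of-the-envelope check on what "shaping" should look like in this test, the sketch below splits the observed aggregate bandwidth in proportion to VL weights. The weights 4 and 8 are the ones from the qos_vlarb_high setting in the original post; the assumptions are that both streams saturate the same port, that arbitration is purely weight-proportional, and that the aggregate stays at 2 x 714.45 MB/s. The function name is made up.

```python
def expected_shares(total_mb_s, weights):
    """Split a total bandwidth across VLs in proportion to their weights."""
    total_weight = sum(weights.values())
    return {vl: total_mb_s * w / total_weight for vl, w in weights.items()}


# Aggregate observed with two equal streams: 2 x 714.45 MB/s.
total = 2 * 714.45
# Hypothetical expectation for VL 0 (weight 4) against VL 2 (weight 8).
shares = expected_shares(total, {0: 4, 2: 8})
```

Under those assumptions the two streams would move from an even split to roughly one third and two thirds of the aggregate, so a persistent dead-even result suggests the weights are not taking effect at the congested port.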
Re: [openib-general] question on QoS support
On Thu, 2006-11-02 at 09:15, Makia Minich wrote: > Hal Rosenstock wrote: > > Makia, > > > > On Wed, 2006-11-01 at 17:42, Makia Minich wrote: > >> It just so happens that we've started looking at this here at ORNL as > >> well. I had a question about the options. The manpage makes it seem > >> that you can set these qos options (e.g. qos_high_limit) from the > >> command line, but I haven't been overly successful. > > > > What are you referring to in the man page ? > > OK, re-reading the man page section on qos, I now realize that I didn't > understand the statement "cached options file" on my initial read > through. So, now I've got it. > > > Which OpenSM are you using (trunk or 1.1 based) ? > > 1.1 based > > >> Is there an example of this being done? > > > > Yes in both the man page under QOS CONFIGURATION or under > > osm/doc/qos-config.txt in the repository. > > I see that that file doesn't install in the doc directory with OFED, > perhaps that should be added (so that I can find it in the ${OFED}/doc > directory). I used that doc and put it pretty much verbatim into the man page so IMO this is somewhat redundant but it could be added to the next release if you think this adds value (having the separate docs). -- Hal > >> Or is changing the /var/cache/osm/opensm.opts file > >> the preferred method of changing the options? > > > > I think it's the only way but it is imperative QoS is enabled for this > > to have any effect. > > > > -- Hal > > That part I've got set in the opensm.opts file: > > no_qos FALSE > > >> Sasha Khapyorsky wrote: > >>> On 16:52 Wed 01 Nov , Oliver wrote: > Hi, folks - > > I am trying to verify and evaluate IB QoS support, running openSM as > subnet manager. The perftest program is extended to set SL as command > line options instead of default 0, and by modifying VL arbitration > tables, I am expecting to see the traffic shaping can actually take > place, but it did not. 
More details on configuration: > > in opensm.opts: > # QoS default options > qos_high_limit 255 # disable low priority table > qos_vlarb_high: 0:4,1:4,2:8,3:0, 4:0 # this is to give VL 2 > (corresponding to SL 2) a higher weight 8 > qos_sl2vl 0,1,2,3,4, ... # no changes here > > I think (though not verified) the Voltaire HCA we are using can > support 8 data VLs. I don't have much more information to go on why > qos shaping is not taking place, any suggestions? > >>> You can verify actual port's parameters with smpquery (from diags), you > >>> will need to run to get QoS related parameters: > >>> > >>> smpquery portinfo ... > >>> smpquery vlarb ... > >>> smpquery sl2vl ... > >>> > >>> Sasha > >>> > A related question is, if I modify qos setting in SM, do I need to > restart SA on each hosts for it to see the changes? (I am hoping not, > as I tried in the test, it doesn't seem to make a difference) > > Thanks for help. > -- > Oliver > > ___ > openib-general mailing list > openib-general@openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general > > >>> ___ > >>> openib-general mailing list > >>> openib-general@openib.org > >>> http://openib.org/mailman/listinfo/openib-general > >>> > >>> To unsubscribe, please visit > >>> http://openib.org/mailman/listinfo/openib-general > >>> > >>> > > > > ___ openib-general mailing list openib-general@openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
Re: [openib-general] question on QoS support
Hal Rosenstock wrote:
> Makia,
>
> On Wed, 2006-11-01 at 17:42, Makia Minich wrote:
>> It just so happens that we've started looking at this here at ORNL as
>> well. I had a question about the options. The manpage makes it seem
>> that you can set these qos options (e.g. qos_high_limit) from the
>> command line, but I haven't been overly successful.
>
> What are you referring to in the man page ?

OK, re-reading the man page section on qos, I now realize that I didn't
understand the statement "cached options file" on my initial read
through. So, now I've got it.

> Which OpenSM are you using (trunk or 1.1 based) ?

1.1 based

>> Is there an example of this being done?
>
> Yes in both the man page under QOS CONFIGURATION or under
> osm/doc/qos-config.txt in the repository.

I see that that file doesn't install in the doc directory with OFED,
perhaps that should be added (so that I can find it in the ${OFED}/doc
directory).

>> Or is changing the /var/cache/osm/opensm.opts file
>> the preferred method of changing the options?
>
> I think it's the only way but it is imperative QoS is enabled for this
> to have any effect.
>
> -- Hal

That part I've got set in the opensm.opts file:

no_qos FALSE

--
Makia Minich <[EMAIL PROTECTED]>
National Center for Computation Science
Oak Ridge National Laboratory
Phone: 865.574.7460
Re: [openib-general] question on QoS support
Makia,

On Wed, 2006-11-01 at 17:42, Makia Minich wrote:
> It just so happens that we've started looking at this here at ORNL as
> well. I had a question about the options. The manpage makes it seem
> that you can set these qos options (e.g. qos_high_limit) from the
> command line, but I haven't been overly successful.

What are you referring to in the man page ?
Which OpenSM are you using (trunk or 1.1 based) ?

> Is there an example of this being done?

Yes, both in the man page under QOS CONFIGURATION and in
osm/doc/qos-config.txt in the repository.

> Or is changing the /var/cache/osm/opensm.opts file
> the preferred method of changing the options?

I think it's the only way, but it is imperative that QoS is enabled for
this to have any effect.

-- Hal
Re: [openib-general] question on QoS support
Hi Oliver,

On Wed, 2006-11-01 at 16:52, Oliver wrote:
> Hi, folks -
>
> I am trying to verify and evaluate IB QoS support, running openSM as
> subnet manager. The perftest program is extended to set SL as command
> line options instead of default 0, and by modifying VL arbitration
> tables, I am expecting to see the traffic shaping can actually take
> place,

How is this being observed/measured ?

> but it did not. More details on configuration:
>
> in opensm.opts:
> # QoS default options
> qos_high_limit 255 # disable low priority table

This doesn't disable the low priority table, but it won't be scheduled
unless there are no high priority packets to send.

> qos_vlarb_high: 0:4,1:4,2:8,3:0, 4:0 # this is to give VL 2
> (corresponding to SL 2) a higher weight 8
> qos_sl2vl 0,1,2,3,4, ... # no changes here
>
> I think (though not verified) the Voltaire HCA we are using can
> support 8 data VLs.

Yes, 8 VLs should be supported in your subnet. You can verify this with
smpquery portinfo on the HCA port and examine OperVLs, assuming the port
is ACTIVE.

> I don't have much more information to go on why
> qos shaping is not taking place, any suggestions?

Sasha's email is a good start. We can go from there.

> A related question is, if I modify qos setting in SM, do I need to
> restart SA on each hosts for it to see the changes? (I am hoping not,
> as I tried in the test, it doesn't seem to make a difference)

Not sure what you mean. SA is tightly coupled with the OpenSM. Do you
mean SA client ? The client hosts don't need restarting, but did you
restart OpenSM with your QoS configuration ? BTW, which OpenSM are you
running ?

-- Hal

> Thanks for help.
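As a rough aid for reading the weights in qos_vlarb_high, here is a small sketch in plain C. It assumes simple weighted round-robin among VLs that all have traffic queued on the same port; that is an assumption for illustration, since the real arbiter also depends on the high limit and on which VLs are actually busy. The struct and function names are made up and do not come from OpenSM.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical representation of one "VL:weight" pair from the table. */
struct vlarb_entry {
	unsigned int vl;
	unsigned int weight;
};

/*
 * Percentage of the table's total weight assigned to `vl`.
 * Assumption: all listed VLs are backlogged, so share ~ weight / sum.
 * Uses integer division, so the result is rounded down.
 */
static unsigned int vl_share_percent(const struct vlarb_entry *tbl,
				     size_t n, unsigned int vl)
{
	unsigned int total = 0, mine = 0;

	for (size_t i = 0; i < n; i++) {
		total += tbl[i].weight;
		if (tbl[i].vl == vl)
			mine += tbl[i].weight;
	}
	return total ? (100 * mine) / total : 0;
}
```

With the table from the thread (0:4, 1:4, 2:8, 3:0, 4:0), this gives VL 2 half of the total weight and VLs 0 and 1 a quarter each, which is the kind of ratio one would look for when measuring, provided the flows really contend on the same link.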
Re: [openib-general] question on QoS support
On 17:42 Wed 01 Nov , Makia Minich wrote:
> It just so happens that we've started looking at this here at ORNL as
> well. I had a question about the options. The manpage makes it seem
> that you can set these qos options (e.g. qos_high_limit) from the
> command line,

AFAIK there is a -Q option which enables/disables QoS configuration; it
does not affect the qos_high_limit parameter in particular.
Configuration parameters (qos_max_vls, qos_high_limit, qos_vlarb_high,
qos_vlarb_low and qos_sl2vl templates) should be specified in the
opensm.opts file (or another OpenSM configuration file, which does not
exist yet).

> but I haven't been overly successful. Is there an example
> of this being done? Or is changing the /var/cache/osm/opensm.opts file
> the preferred method of changing the options?

Yes, you need to specify QoS parameters in the opensm.opts file. There
is a readme file, osm/doc/qos-config.txt, which describes the details (I
think the man page has a similar section too).

Ah, an important note: with OFED, QoS is disabled by default in OpenSM,
so the -Q option should be used, which for OFED means --qos. OpenSM from
trunk supports QoS configuration by default and the -Q option disables
this (and means --no-qos). This can be confusing, I know.

Sasha
Re: [openib-general] question on QoS support
It just so happens that we've started looking at this here at ORNL as
well. I had a question about the options. The manpage makes it seem
that you can set these qos options (e.g. qos_high_limit) from the
command line, but I haven't been overly successful. Is there an example
of this being done? Or is changing the /var/cache/osm/opensm.opts file
the preferred method of changing the options?

--
Makia Minich <[EMAIL PROTECTED]>
National Center for Computation Science
Oak Ridge National Laboratory
Phone: 865.574.7460
Re: [openib-general] question on QoS support
On 16:52 Wed 01 Nov , Oliver wrote:
> Hi, folks -
>
> I am trying to verify and evaluate IB QoS support, running openSM as
> subnet manager. The perftest program is extended to set SL as command
> line options instead of default 0, and by modifying VL arbitration
> tables, I am expecting to see the traffic shaping can actually take
> place, but it did not. More details on configuration:
>
> in opensm.opts:
> # QoS default options
> qos_high_limit 255 # disable low priority table
> qos_vlarb_high: 0:4,1:4,2:8,3:0, 4:0 # this is to give VL 2
> (corresponding to SL 2) a higher weight 8
> qos_sl2vl 0,1,2,3,4, ... # no changes here
>
> I think (though not verified) the Voltaire HCA we are using can
> support 8 data VLs. I don't have much more information to go on why
> qos shaping is not taking place, any suggestions?

You can verify the port's actual parameters with smpquery (from diags).
To get the QoS-related parameters you will need to run:

  smpquery portinfo ...
  smpquery vlarb ...
  smpquery sl2vl ...

Sasha

> A related question is, if I modify qos setting in SM, do I need to
> restart SA on each hosts for it to see the changes? (I am hoping not,
> as I tried in the test, it doesn't seem to make a difference)
>
> Thanks for help.
> --
> Oliver
Re: [openib-general] Question about ehca CQ handling
> While looking over the ehca driver from the perspective of adding a
> "peek CQ" operation, I noticed some code that looked funny.
>
> In hipz_set_cqx_n0() and hipz_set_cqx_n1(), what is the point of the
> calls to hipz_galpa_load_cq()? The return value is discarded. I see
> that hipz_galpa_load_cq() dereferences a volatile pointer internally,
> so I'm guessing this is some sort of ordering constraint. But would
> it be just as good to do "barrier()" there?
>
> - R.

No, barrier won't help, the I/O bus connection is theoretically allowed
to reorder and aggregate writes in a defined pattern. The recommended
way to ensure that the ehca chip actually has seen the write is doing a
read on the same address.

Gruss / Regards . . .
Christoph R
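The write-then-read-back pattern described above can be modelled in a few lines of userspace C. This is only a stand-in: the function name write_and_flush() is hypothetical, the "register" is ordinary memory, and a real driver would go through its platform's MMIO accessors rather than a bare pointer. It shows the shape of the idiom, not the ehca code itself.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Store a value to a (simulated) device register, then read the same
 * address back. On real hardware the store may be posted or combined
 * by the bus; the read of the same address is what forces it to
 * complete. Callers are free to ignore the returned value, which is
 * why a discarded load can still be meaningful.
 */
static uint64_t write_and_flush(volatile uint64_t *reg, uint64_t value)
{
	*reg = value;	/* may still be in flight on a real I/O bus */
	return *reg;	/* volatile load of the same address */
}
```

A compiler-only barrier() would stop the compiler from reordering the store, but it says nothing to the bus, which is the distinction the reply draws.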
Re: [openib-general] Question about interrupt generation
Hi,

One more question. What kind of event mask helps mask the interrupts?

thanks
harish

On 9/5/06, harish <[EMAIL PROTECTED]> wrote:
> Hi All,
>
> I tried the following simple experiment and am not able to understand
> the results:
>
> Calculated the number of interrupts generated by the infiniband [with
> little or no traffic to the NIC] over a period of 10 seconds and saw
> around 10-20 interrupts/sec. Then ran a netperf test and saw around
> 100+ K interrupts/sec. This screwed up my host machine.
>
> To reduce the impact of the interrupts, I added a callback that is
> scheduled to be called periodically every few microseconds, which masks
> the irq line used by the NIC and a little later unmasks it. Noticed
> that with no traffic, I see anywhere between 30-50K interrupts/sec.
> With the netperf traffic, I see around 120K+ interrupts/sec.
>
> Am a newbie to infiniband technology and so do not understand why so
> many interrupts are getting generated when I have my callback
> periodically called. Could it be that the Infiniband supports MSI? Or
> is what I am seeing IPIs? Or does Infiniband generate interrupts based
> on types of events and what I am doing by masking/unmasking the
> interrupt line is one such event?
>
> Any information/suggestions would be useful.
>
> Thanks in advance,
> harish
Re: [openib-general] question: ib_umem page_size
> > It gives the page size for the user memory described by the struct.
> > The idea was that if/when someone tries to optimize for huge pages,
> > then the low-level driver can know that a region is using huge pages
> > without having to walk through the page list and search for the
> > minimum physically contiguous size.
>
> Hmm, mthca_reg_user_mr seems to do:
>
> len = sg_dma_len(&chunk->page_list[j]) >> shift
>
> which means that dma_len must be a multiple of page size.
>
> Is this intentional?

Yes, it's intentional I think. I'm probably missing something, but the
upper layer has just told mthca_reg_user_mr() that the page size for
this region is (1
Re: [openib-general] question: ib_umem page_size
Quoting r. Roland Dreier <[EMAIL PROTECTED]>:
> Subject: Re: question: ib_umem page_size
>
> Michael> Roland, could you please clarify what does the page_size
> Michael> field in struct ib_umem do?
>
> It gives the page size for the user memory described by the struct.
> The idea was that if/when someone tries to optimize for huge pages,
> then the low-level driver can know that a region is using huge pages
> without having to walk through the page list and search for the
> minimum physically contiguous size.

Hmm, mthca_reg_user_mr seems to do:

	len = sg_dma_len(&chunk->page_list[j]) >> shift

which means that dma_len must be a multiple of page size.

Is this intentional?

--
MST
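To illustrate why the shift implies page-multiple lengths, here is a small standalone sketch in plain C. The helper names pages_in_segment() and bytes_lost() are hypothetical and not part of mthca; they only mirror the `len >> shift` arithmetic quoted above.

```c
#include <assert.h>
#include <stddef.h>

/* Whole pages covered by a DMA segment, mirroring `dma_len >> shift`. */
static size_t pages_in_segment(size_t dma_len, unsigned int shift)
{
	return dma_len >> shift;
}

/* Bytes the shift silently discards when dma_len is not a page multiple. */
static size_t bytes_lost(size_t dma_len, unsigned int shift)
{
	return dma_len & (((size_t)1 << shift) - 1);
}
```

For a shift of 12 (4 KB pages), a segment of 8192 bytes yields two pages and loses nothing, while a segment of 6144 bytes yields one page and drops the trailing 2048 bytes, so the computation is only exact when every segment length is a multiple of the page size.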
Re: [openib-general] Question about QP's in timewait state and CM stale conn rejects
>Cool, I would go for XOR-ing a random value with the **local id**.
>
>Sean, my understanding is that this can be narrowed down to doing so in:
>
>1) cm_alloc_id() after calling idr_get_new_above()
>2) cm_free_id() before calling idr_remove()
>3) cm_get_id() before calling idr_find()
>
>and initializing the random value we XOR in ib_cm_init()
>
>What do you think?

I like this approach as well. I need to see what else I have in my queue
first, but will work on a patch, since it seems straightforward.

- Sean
Re: [openib-general] Question about QP's in timewait state and CM stale conn rejects
Sean Hefty wrote:
> When a new REQ is received, we enter its timewait structure into two trees: one
> sorted by remote ID, one sorted by remote QPN. If the REQ is new, both would
> succeed, and timewait_info would be NULL. Since timewait_info is not NULL, we
> are dealing with a REQ that re-uses the same remote ID or same remote QPN. If
> the new REQ has the same remote ID (cm_get_id() returns non-NULL), we treat it
> as a duplicate, otherwise it's marked as stale.

OK, thanks for clarifying this.

Or.
Re: [openib-general] Question about QP's in timewait state and CM stale conn rejects
Roland Dreier wrote:
> Sean> If we record a base offset, we can start at any random
> Sean> number. We just need to always add/subtract the base when
> Sean> getting a value from the IDR.
>
> Good point -- or better still, we could XOR in a random bit pattern.
> That way we don't have to keep straight when to add and when to subtract.

Cool, I would go for XOR-ing a random value with the **local id**.

Sean, my understanding is that this can be narrowed down to doing so in:

1) cm_alloc_id() after calling idr_get_new_above()
2) cm_free_id() before calling idr_remove()
3) cm_get_id() before calling idr_find()

and initializing the random value we XOR in ib_cm_init()

What do you think?

Or.
Re: [openib-general] Question about QP's in timewait state and CM stale conn rejects
    Sean> If we record a base offset, we can start at any random
    Sean> number. We just need to always add/subtract the base when
    Sean> getting a value from the IDR.

Good point -- or better still, we could XOR in a random bit pattern.
That way we don't have to keep straight when to add and when to subtract.

 - R.
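To make the XOR suggestion concrete, here is a minimal sketch in plain userspace C. The names id_xor_mask, id_mask_init(), idr_index_to_local_id() and local_id_to_idr_index() are hypothetical and are not taken from the CM code; the only point is that the same operation converts in both directions.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical: random pattern chosen once when the module initializes. */
static uint32_t id_xor_mask;

/* Record the random pattern; the caller supplies the random value. */
static void id_mask_init(uint32_t random_value)
{
	id_xor_mask = random_value;
}

/* Small, dense IDR index -> scrambled ID handed out to the peer. */
static uint32_t idr_index_to_local_id(uint32_t idr_index)
{
	return idr_index ^ id_xor_mask;
}

/* Scrambled ID received back -> IDR index. XOR is its own inverse. */
static uint32_t local_id_to_idr_index(uint32_t local_id)
{
	return local_id ^ id_xor_mask;
}
```

Because the IDR itself still sees small indices, this keeps the allocator compact while the IDs visible on the wire start from an unpredictable value. With a base offset the two directions would need an add and a subtract respectively; here both helpers have the same body.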
Re: [openib-general] Question about QP's in timewait state and CM stale conn rejects
>> If we get here, this means that the REQ was a new REQ and not a
>> duplicate, but the remote_id or remote_qpn is already in use. We need
>> to reject the new REQ as containing stale data.
>
>I don't follow, if we get to the else case it's because cm_get_id()
>returned NULL. This holds when idr_find() returns NULL or when the
>entry returned is associated with a different remote_id, so what makes
>you conclude that "the remote_id or remote_qpn is already in use"???

When a new REQ is received, we enter its timewait structure into two
trees: one sorted by remote ID, one sorted by remote QPN. If the REQ is
new, both would succeed, and timewait_info would be NULL. Since
timewait_info is not NULL, we are dealing with a REQ that re-uses the
same remote ID or same remote QPN. If the new REQ has the same remote ID
(cm_get_id() returns non-NULL), we treat it as a duplicate, otherwise
it's marked as stale.

- Sean
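The decision described above can be summarized as a tiny classifier. This is a simplified model in plain C, not the CM implementation: the enum, the function name classify_req() and its two parameters are invented for illustration, and the tree inserts and ID lookup are reduced to their outcomes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum req_verdict { REQ_NEW, REQ_DUPLICATE, REQ_STALE };

/*
 * existing_tw:   non-NULL if inserting the REQ's timewait entry collided
 *                with an entry already present in the remote-ID tree or
 *                the remote-QPN tree.
 * id_pair_found: whether looking up the colliding entry's
 *                (local_id, remote_id) pair found a matching CM ID,
 *                i.e. the lookup returned non-NULL.
 */
static enum req_verdict classify_req(const void *existing_tw,
				     bool id_pair_found)
{
	if (existing_tw == NULL)
		return REQ_NEW;		/* both inserts succeeded */
	return id_pair_found ? REQ_DUPLICATE : REQ_STALE;
}
```

In this model the stale case is exactly "a collision was found, but the ID pair does not match a live entry", which is the situation the question was about.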
Re: [openib-general] Question about QP's in timewait state and CM stale conn rejects
>> Just to emphasize what Sean has pointed out, you are asking how can a CM
>> consumer know that a **local** QPN is not in the timewait state
>> according to the **remote** CM. Since the issue is with the remote CM,
>> it seems to me that pushing down timewait into verbs is not the correct
>> direction to look at.

We should still ensure that we don't give a user a local QPN that we
know is in timewait.

For example, user 1 connects over a QP, transfers some data, then
destroys the QP. User 2 allocates a new QP. Can user 2 get the same QP
as user 1? If so, user 2 is likely to see a stale connection. An option
at this point is for user 2 to destroy the QP and allocate a new one. If
they do this, will they get the same QP again? Now imagine if user 1 had
created 1000 connections.

I believe that we should make things as easy on user 2 as possible,
including reducing the chance of giving them a QP that the remote side
is likely to have in timewait.

- Sean
Re: [openib-general] Question about QP's in timewait state and CM stale conn rejects
>How about (for the meantime, till this rework is designed && done) going
>to projecting the initial random local id into the range of (say)
>[0-1022] (i think 1023 is prime, if not choose a prime near it) this way
>with very good probability and with very little overhead on memory
>consumption a client connect/reboot/"reconnect" would work.

If we record a base offset, we can start at any random number. We just
need to always add/subtract the base when getting a value from the IDR.

- Sean
Re: [openib-general] Question about QP's in timewait state and CM stale conn rejects
    Or> How about (for the meantime, till this rework is designed &&
    Or> done) going to projecting the initial random local id into the
    Or> range of (say) [0-1022] (i think 1023 is prime, if not choose
    Or> a prime near it) this way with very good probability and with
    Or> very little overhead on memory consumption a client
    Or> connect/reboot/"reconnect" would work.

Of course 1023 is not prime -- since (a^2 - b^2) = (a - b) * (a + b), it
follows 2^10 - 1 = (2^5 - 1) * (2^5 + 1) = 31 * 33. I don't see why you
care about the range being prime, but the closest primes to 1024 are
1021 and 1031.

 - R.
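The arithmetic above is easy to check mechanically. The following is a small standalone C sketch using trial division; the helper names is_prime(), prev_prime() and next_prime() are made up for this example.

```c
#include <assert.h>
#include <stdbool.h>

/* Trial division; adequate for numbers of this size. */
static bool is_prime(unsigned int n)
{
	if (n < 2)
		return false;
	for (unsigned int d = 2; d * d <= n; d++)
		if (n % d == 0)
			return false;
	return true;
}

/* Largest prime <= n, or 0 if there is none. */
static unsigned int prev_prime(unsigned int n)
{
	while (n >= 2 && !is_prime(n))
		n--;
	return n >= 2 ? n : 0;
}

/* Smallest prime >= n (assumes n is far below UINT_MAX). */
static unsigned int next_prime(unsigned int n)
{
	while (!is_prime(n))
		n++;
	return n;
}
```

Checking 1023 against 31 * 33 and searching in both directions from 1024 reproduces the two primes named in the message.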
Re: [openib-general] Question about QP's in timewait state and CM stale conn rejects
This email appears in the archive but seems not to have been distributed
to the subscribers, so I am reposting it.

Or Gerlitz wrote:
> Sean Hefty wrote:
>> Even if we pushed timewait handling under verbs, a user could always
>> get a QP that the remote side thinks is connected. The original
>> connection could fail to disconnect because of lost DREQs. So,
>> locally, the QP could have exited timewait, while the remote side
>> still thinks that it's connected.
>
> Sean,
>
> If you don't mind (also related to the patch you have sent Eric of
> randomizing the initial local cm id) to get into this deeper, can we do
> here a quick code review of the REQ matching logic? I wrote what i
> understand below.
>
>> static struct cm_id_private * cm_match_req(struct cm_work *work,
>> +					     struct cm_id_private *cm_id_priv)
>> +{
>> +	struct cm_id_private *listen_cm_id_priv, *cur_cm_id_priv;
>> +	struct cm_timewait_info *timewait_info;
>> +	struct cm_req_msg *req_msg;
>> +	unsigned long flags;
>> +
>> +	req_msg = (struct cm_req_msg *)work->mad_recv_wc->recv_buf.mad;
>> +
>> +	/* Check for duplicate REQ and stale connections. */
>> +	spin_lock_irqsave(&cm.lock, flags);
>> +	timewait_info = cm_insert_remote_id(cm_id_priv->timewait_info);
>> +	if (!timewait_info)
>> +		timewait_info = cm_insert_remote_qpn(cm_id_priv->timewait_info);
>
> This if() holds when entry is present in
> remote_id_table OR entry is present in
> remote_qpn_table
>
>> +	if (timewait_info) {
>> +		cur_cm_id_priv = cm_get_id(timewait_info->work.local_id,
>> +					   timewait_info->work.remote_id);
>> +		spin_unlock_irqrestore(&cm.lock, flags);
>> +		if (cur_cm_id_priv) {
>> +			cm_dup_req_handler(work, cur_cm_id_priv);
>> +			cm_deref_id(cur_cm_id_priv);
>
> entry exists in local_id_table, looking on
> dup_req_handler() i see it sends REP when the id is in "MRA sent" and
> sends a STALE_CONN REJ when the id is in timewait state, else it does
> nothing.
>
>> +		} else
>> +			cm_issue_rej(work->port, work->mad_recv_wc,
>> +				     IB_CM_REJ_STALE_CONN, CM_MSG_RESPONSE_REQ,
>> +				     NULL, 0);
>
> what is this case? there is no entry but there is
> remote or entries???
>
>> +		goto error;
>> +	}
>
> Or.
Re: [openib-general] Question about QP's in timewait state and CM stale conn rejects
This email appears in the archive but seems not to have been distributed
to the subscribers, so I am reposting it.

Or Gerlitz wrote:
> Arlin Davis wrote:
>> We are running into connection reject issues (IB_CM_REJ_STALE_CONN)
>> with our application under heavy load and lots of connections.
>>
>> We occasionally get a reject based on the QP being in timewait state
>> leftover from a prior connection. It appears that the CM keeps track
>> of the QP's in timewait state on both sides of the connection,
>
> How did you verify that? The CM generates a REJ with IB_CM_REJ_STALE_CONN
> in two flows for the passive side (ie rejecting a REQ) and one flow for
> the active side (ie rejecting a REP).
>
>> How can a consumer know for sure that the new QP will not be in a
>> timewait state according to the CM? Does it make sense to push the
>> timewait functionality down into verbs? If not, is there a way for the
>> CM to hold a reference to the QP until the timewait expires?
>
> Just to emphasize what Sean has pointed out, you are asking how can a CM
> consumer know that a **local** QPN is not in the timewait state
> according to the **remote** CM. Since the issue is with the remote CM,
> it seems to me that pushing down timewait into verbs is not the correct
> direction to look at.
>
> Or.
Re: [openib-general] Question about QP's in timewait state and CM stale conn rejects
>>> +		} else
>>> +			cm_issue_rej(work->port, work->mad_recv_wc,
>>> +				     IB_CM_REJ_STALE_CONN, CM_MSG_RESPONSE_REQ,
>>> +				     NULL, 0);
>>
>> what is this case? there is no entry but there is
>> remote or entries???

> If we get here, this means that the REQ was a new REQ and not a
> duplicate, but the remote_id or remote_qpn is already in use. We need
> to reject the new REQ as containing stale data.

I don't follow, if we get to the else case it's because cm_get_id()
returned NULL. This holds when idr_find() returns NULL or when the
entry returned is associated with a different remote_id, so what makes
you conclude that "the remote_id or remote_qpn is already in use"???

> +static struct cm_id_private * cm_get_id(__be32 local_id, __be32 remote_id)
> +{
> +	struct cm_id_private *cm_id_priv;
> +
> +	cm_id_priv = idr_find(&cm.local_id_table, (__force int) local_id);
> +	if (cm_id_priv) {
> +		if (cm_id_priv->id.remote_id == remote_id)
> +			atomic_inc(&cm_id_priv->refcount);
> +		else
> +			cm_id_priv = NULL;
> +	}
> +
> +	return cm_id_priv;
> +}

Or.
Re: [openib-general] Question about QP's in timewait state and CM stale conn rejects
Sean Hefty wrote:
> Or Gerlitz wrote:
>> If you don't mind (also related to the patch you have sent Eric of
>> randomizing the initial local cm id) to get into this deeper, can we do

> There's an issue trying to randomize the initial local CM ID. The way
> the IDR works, if you start at a high value, then the IDR size grows up
> to the size of the first value, which can result in memory allocation
> failures. In my tests, using a random value would frequently result in
> connection failures because of low memory.

> My conclusion is that the local ID assignment in the IB CM needs to be
> reworked, or we will run into a condition that after X number of
> connections have been established, we will be unable to create any new
> connections, even if the previous connections have all been destroyed.

How about (for the meantime, till this rework is designed && done) going
to projecting the initial random local id into the range of (say)
[0-1022] (i think 1023 is prime, if not choose a prime near it) this way
with very good probability and with very little overhead on memory
consumption a client connect/reboot/"reconnect" would work.

Or.
Re: [openib-general] Question about QP's in timewait state and CM stale conn rejects
Or Gerlitz wrote:
> If you don't mind (also related to the patch you have sent Eric of
> randomizing the initial local cm id) to get into this deeper, can we do

There's an issue trying to randomize the initial local CM ID. The way the IDR works, if you start at a high value, then the IDR size grows up to the size of the first value, which can result in memory allocation failures. In my tests, using a random value would frequently result in connection failures because of low memory.

My conclusion is that the local ID assignment in the IB CM needs to be reworked, or we will run into a condition that after X number of connections have been established, we will be unable to create any new connections, even if the previous connections have all been destroyed.

>> static struct cm_id_private * cm_match_req(struct cm_work *work,
>> +					      struct cm_id_private *cm_id_priv)
>> +{
>> +	struct cm_id_private *listen_cm_id_priv, *cur_cm_id_priv;
>> +	struct cm_timewait_info *timewait_info;
>> +	struct cm_req_msg *req_msg;
>> +	unsigned long flags;
>> +
>> +	req_msg = (struct cm_req_msg *)work->mad_recv_wc->recv_buf.mad;
>> +
>> +	/* Check for duplicate REQ and stale connections. */
>> +	spin_lock_irqsave(&cm.lock, flags);
>> +	timewait_info = cm_insert_remote_id(cm_id_priv->timewait_info);
>> +	if (!timewait_info)
>> +		timewait_info = cm_insert_remote_qpn(cm_id_priv->timewait_info);
>
> This if() holds when entry is present in remote_id_table OR entry is
> present in remote_qpn_table

correct

>> +	if (timewait_info) {
>> +		cur_cm_id_priv = cm_get_id(timewait_info->work.local_id,
>> +					   timewait_info->work.remote_id);
>
> + spin_unlock_irqrestore(&cm.lock, flags);
>
>> +		if (cur_cm_id_priv) {
>> +			cm_dup_req_handler(work, cur_cm_id_priv);
>> +			cm_deref_id(cur_cm_id_priv);
>
> entry exists in local_id_table, looking on dup_req_handler() i see it
> sends REP when the id is in "MRA sent" and sends a STALE_CONN REJ when
> the id is in timewait state, else it does nothing.

It sends an MRA if in the MRA sent state, or a reject as indicated.

>> +		} else
>> +			cm_issue_rej(work->port, work->mad_recv_wc,
>> +				     IB_CM_REJ_STALE_CONN, CM_MSG_RESPONSE_REQ,
>> +				     NULL, 0);
>
> what is this case? there is no entry but there is remote or entries???

If we get here, this means that the REQ was a new REQ and not a duplicate, but the remote_id or remote_qpn is already in use. We need to reject the new REQ as containing stale data.

- Sean
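The three outcomes discussed in this exchange can be summarized in a small stand-alone sketch. It is not the CM code: `enum req_outcome` and `classify_req` are hypothetical names, and the three booleans stand in for the results of the remote ID insert, the remote QPN insert, and the local ID lookup.

```c
#include <stdbool.h>

/* Hypothetical outcome labels for an incoming REQ; not CM identifiers. */
enum req_outcome {
	REQ_NEW,       /* no collision: go on to match a listener */
	REQ_DUPLICATE, /* an existing connection was found: duplicate handling */
	REQ_STALE      /* remote ID or QPN in use, but no matching connection */
};

/* Classify a REQ from the results of the three table lookups. */
static enum req_outcome classify_req(bool remote_id_in_use,
				     bool remote_qpn_in_use,
				     bool local_entry_found)
{
	if (!remote_id_in_use && !remote_qpn_in_use)
		return REQ_NEW;
	return local_entry_found ? REQ_DUPLICATE : REQ_STALE;
}
```

The `REQ_STALE` branch corresponds to the final `else` in the quoted patch, where the REQ is rejected with IB_CM_REJ_STALE_CONN.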
Re: [openib-general] Question about QP's in timewait state and CM stale conn rejects
Arlin Davis wrote:
> How can a consumer know for sure that the new QP will not be in a
> timewait state according to the CM?

Given that the QP may have been in use by another process, I don't think that there's any way for the new owner to know.

> Does it make sense to push the timewait functionality down into verbs?

This may be a clean way of handling the issue, but... see below.

> If not, is there a way for the
> CM to hold a reference to the QP until the timewait expires?

For userspace QPs, the CM doesn't have access to the QP, so some sort of special call into verbs would be needed.

Even if we pushed timewait handling under verbs, a user could always get a QP that the remote side thinks is connected. The original connection could fail to disconnect because of lost DREQs. So, locally, the QP could have exited timewait, while the remote side still thinks that it's connected.

- Sean
Re: [openib-general] question: ib_umem page_size
Quoting r. Roland Dreier <[EMAIL PROTECTED]>:
> Subject: Re: question: ib_umem page_size
>
> Michael> Roland, could you please clarify what the page_size
> Michael> field in struct ib_umem does?
>
> It gives the page size for the user memory described by the struct.
> The idea was that if/when someone tries to optimize for huge pages,
> then the low-level driver can know that a region is using huge pages
> without having to walk through the page list and search for the
> minimum physically contiguous size.

Thought so. Cool, that's exactly what I'm trying to do.

--
MST
Re: [openib-general] question: ib_umem page_size
Michael> Roland, could you please clarify what the page_size
Michael> field in struct ib_umem does?

It gives the page size for the user memory described by the struct. The idea was that if/when someone tries to optimize for huge pages, then the low-level driver can know that a region is using huge pages without having to walk through the page list and search for the minimum physically contiguous size.

- R.
Re: [openib-general] Question about the IPoIB bandwidth performance ?
At 12:36 PM 6/5/2006, Talpey, Thomas wrote:
> Thanks Parks, this is a very interesting perspective. I will avoid
> going into my rant about edge devices for now, however. :-)

Cool, you can send it direct if you want.

> I am not sure what you mean about using SDP "end to end". I assume
> you would perhaps use SDP to these edge nodes, but this would require
> terminating the SDP connection and re-issuing the stream over TCP
> to the Panasas box, wouldn't it?

Yes, it would probably have to work that way. Another problem would be that SDP is not routable.

> Would this bridging be done in-kernel, like your IPoIB/Ethernet solution
> today, or would you implement a daemon? It will be a difficult challenge,
> I predict.

We are just starting to think about things like this, and trying to keep an open mind to all possibilities. We have no solutions to do this yet. There might be better ways. So you are correct: we haven't thought it all the way through and have no alternative plan other than IPoIB at the moment. My next step will be testing 4x-DDR IPoIB before doing anything else.

parks

* Correspondence *
This email contains no programmatic content that requires independent ADC review
Re: [openib-general] Question about the IPoIB bandwidth performance ?
Thanks Parks, this is a very interesting perspective. I will avoid going into my rant about edge devices for now, however. :-)

I am not sure what you mean about using SDP "end to end". I assume you would perhaps use SDP to these edge nodes, but this would require terminating the SDP connection and re-issuing the stream over TCP to the Panasas box, wouldn't it?

Would this bridging be done in-kernel, like your IPoIB/Ethernet solution today, or would you implement a daemon? It will be a difficult challenge, I predict.

Tom.

At 02:16 PM 6/5/2006, Parks Fields wrote:
>>I consider IPoIB to be Ethernet emulation.
>>
>>As for apples and oranges, my point exactly.
>
>It is not really about comparisons. Here at LANL we have an
>environment where all our new clusters have to mount our global
>parallel file system, Panasas. It is Ethernet and will be for a while.
>
>Cluster interconnect is IB and the compute nodes do NOT have
>Ethernet, so we created i/o nodes to "bridge" IB to Ethernet.
>
>Compute node ---IB--- i/o node ---10gig--- ethernet switch --- panasas
>
>We like to match/balance the network bandwidth to storage
>bandwidth, plus try to achieve 1 GB/sec per TF of the machine. EX:
>50TF machine = 50 GB/sec of storage bandwidth needed.
>
>So if IPoIB would give us ~700 MB/sec and came out the other side
>with 10gigE at ~800 that would be nice.
>Hope this helps. We are now trying to find out if SDP will work end-to-end.
>
>thanks
>parks
Re: [openib-general] Question about the IPoIB bandwidth performance ?
> I consider IPoIB to be Ethernet emulation.
>
> As for apples and oranges, my point exactly.

It is not really about comparisons. Here at LANL we have an environment where all our new clusters have to mount our global parallel file system, Panasas. It is Ethernet and will be for a while.

Cluster interconnect is IB and the compute nodes do NOT have Ethernet, so we created i/o nodes to "bridge" IB to Ethernet.

Compute node ---IB--- i/o node ---10gig--- ethernet switch --- panasas

We like to match/balance the network bandwidth to storage bandwidth, plus try to achieve 1 GB/sec per TF of the machine. EX: 50TF machine = 50 GB/sec of storage bandwidth needed.

So if IPoIB would give us ~700 MB/sec and came out the other side with 10gigE at ~800 that would be nice.
Hope this helps. We are now trying to find out if SDP will work end-to-end.

thanks
parks
Re: [openib-general] Question about the IPoIB bandwidth performance ?
Tom,

We are in the process of measuring the CPU utilization in our NFS/RDMA experiments in contrast with regular NFS. We also intend to include netperf numbers and will keep you posted with our results as soon as possible.

Helen

- original Message -
>From [EMAIL PROTECTED] Mon Jun 5 09:03:56 2006

Helen, have you measured the CPU utilizations during these runs? Perhaps you are out of CPU.

Outrageous opinion follows.

Frankly, an IB HCA running Ethernet emulation is approximately the world's worst 10GbE adapter (not to put too fine of a point on it :-) ) There is no hardware checksumming, nor large-send offloading, both of which force overhead onto software. And, as you just discovered it isn't even 10Gb!

In general, network emulation layers are always going to perform more poorly than native implementations. But this is only a generality learned from years of experience with them.

Tom.
Re: [openib-general] Question about the IPoIB bandwidth performance ?
> Thomas Talpey said:
> At 11:38 AM 6/5/2006, hbchen wrote:
> >Even with this IB-4X = 8Gb/sec = 1024 MB/sec the IPoIB bandwidth utilization is still very low.
> >>> IPoIB=420MB/sec
> >>> bandwidth utilization= 420/1024 = 41.01%
>
> Helen, have you measured the CPU utilizations during these runs?
> Perhaps you are out of CPU.
>
> Outrageous opinion follows.
>
> Frankly, an IB HCA running Ethernet emulation is approximately the
> world's worst 10GbE adapter (not to put too fine of a point on it :-) )
> There is no hardware checksumming, nor large-send offloading, both
> of which force overhead onto software. And, as you just discovered
> it isn't even 10Gb!
>
> In general, network emulation layers are always going to perform more
> poorly than native implementations. But this is only a generality learned
> from years of experience with them
>
> Tom.

Hold on here. Who said anything about Ethernet emulation? Hal said he is running straight Netperf over IB, not Ethernet emulation. I don't think that any IB HCAs today support offloaded checksum and large send. You are comparing apples and oranges. The only appropriate comparison is to use the IBM HCA compared to the mthca adapters. I think Hal's point is actually comparing "any" IB adapter against GigE and Myrinet. Both the mthca and IBM HCAs should get similar IPoIB performance using identical OpenIB stacks.

Bernie King-Smith
IBM Corporation
Server Group
Cluster System Performance
[EMAIL PROTECTED] (845)433-8483
Tie. 293-8483 or wombat2 on NOTES

"We are not responsible for the world we are born into, only for the world we leave when we die. So we have to accept what has gone before us and work to change the only thing we can, -- The Future."
William Shatner
RE: [openib-general] Question about the IPoIB bandwidth performance ?
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of hbchen
Sent: Monday, June 05, 2006 9:12 AM
To: Talpey, Thomas
Cc: openib-general@openib.org
Subject: Re: [openib-general] Question about the IPoIB bandwidth performance ?

Talpey, Thomas wrote:
> At 11:38 AM 6/5/2006, hbchen wrote:
>> Even with this IB-4X = 8Gb/sec = 1024 MB/sec the IPoIB bandwidth utilization is still very low.
>> IPoIB=420MB/sec
>> bandwidth utilization= 420/1024 = 41.01%
>
> Helen, have you measured the CPU utilizations during these runs?
> Perhaps you are out of CPU.

Tom,
I am HB Chen from LANL, not the Helen Chen from SNL.
I didn't run out of CPU. It is about 70-80% of CPU utilization.

> Outrageous opinion follows.
>
> Frankly, an IB HCA running Ethernet emulation is approximately the
> world's worst 10GbE adapter (not to put too fine of a point on it :-) )

The IP over Myrinet (Ethernet emulation) can reach up to 96%-98% bandwidth utilization; why not the IPoIB?

[Felix:] As pointed out earlier: it is the message rate. If you change the MTU to 1500B (instead of the non-standard 9000B jumbo frames), performance will drop into the same range as what you see with IPoIB (limited by the receiver).

HB Chen
[EMAIL PROTECTED]

> There is no hardware checksumming, nor large-send offloading, both
> of which force overhead onto software. And, as you just discovered
> it isn't even 10Gb!
>
> In general, network emulation layers are always going to perform more
> poorly than native implementations. But this is only a generality learned
> from years of experience with them.
>
> Tom.
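The message-rate argument can be made concrete with a little arithmetic: at a fixed throughput, smaller packets mean more packets per second for the receiver to process. The helper below is only a rough sketch. `packets_per_second` is a hypothetical name, headers are ignored, and 1 MB is taken as 10^6 bytes.

```c
#include <stdint.h>

/*
 * Packets per second needed to carry bytes_per_sec when each packet
 * holds payload_bytes of data (rounded up; protocol headers ignored).
 */
static uint64_t packets_per_second(uint64_t bytes_per_sec,
				   uint32_t payload_bytes)
{
	if (payload_bytes == 0)
		return 0;
	return (bytes_per_sec + payload_bytes - 1) / payload_bytes;
}
```

Under these assumptions, carrying 420 MB/sec takes 280,000 packets per second at a 1500-byte MTU, but only about 46,700 at 9000 bytes.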
Re: [openib-general] Question about the IPoIB bandwidth performance ?
At 12:11 PM 6/5/2006, hbchen wrote:
>>Perhaps you are out of CPU.
>
>Tom,
>I am HB Chen from LANL not the Helen Chen from SNL.

Oops, sorry! I have too many email messages going by. :-)

HB, then.

>I didn't run out of CPU. It is about 70-80 % of CPU utilization.

But, is one CPU at 100%? Interrupt processing, for example.

>>Outrageous opinion follows.
>>
>>Frankly, an IB HCA running Ethernet emulation is approximately the
>>world's worst 10GbE adapter (not to put too fine of a point on it :-) )
>
>The IP over Myrinet ( Ethernet emulation) can reach upto 96%-98% bandwidth
>utilization why not the IPoIB ?

I am not familiar with the implementation Myrinet uses. In any case, I am not saying that an emulation can't reach certain goals, just that they will pretty much always be inferior to native approaches. Sometimes far inferior.

Tom.
Re: [openib-general] Question about the IPoIB bandwidth performance ?
Talpey, Thomas wrote:
> At 11:38 AM 6/5/2006, hbchen wrote:
>> Even with this IB-4X = 8Gb/sec = 1024 MB/sec the IPoIB bandwidth utilization is still very low.
>> IPoIB=420MB/sec
>> bandwidth utilization= 420/1024 = 41.01%
>
> Helen, have you measured the CPU utilizations during these runs?
> Perhaps you are out of CPU.

Tom,
I am HB Chen from LANL, not the Helen Chen from SNL.
I didn't run out of CPU. It is about 70-80% of CPU utilization.

> Outrageous opinion follows.
>
> Frankly, an IB HCA running Ethernet emulation is approximately the
> world's worst 10GbE adapter (not to put too fine of a point on it :-) )

The IP over Myrinet (Ethernet emulation) can reach up to 96%-98% bandwidth utilization; why not the IPoIB?

HB Chen
[EMAIL PROTECTED]

> There is no hardware checksumming, nor large-send offloading, both
> of which force overhead onto software. And, as you just discovered
> it isn't even 10Gb!
>
> In general, network emulation layers are always going to perform more
> poorly than native implementations. But this is only a generality learned
> from years of experience with them.
>
> Tom.
Re: [openib-general] Question about the IPoIB bandwidth performance ?
At 11:38 AM 6/5/2006, hbchen wrote:
>Even with this IB-4X = 8Gb/sec = 1024 MB/sec the IPoIB bandwidth utilization
>is still very low.
>>> IPoIB=420MB/sec
>>> bandwidth utilization= 420/1024 = 41.01%

Helen, have you measured the CPU utilizations during these runs? Perhaps you are out of CPU.

Outrageous opinion follows.

Frankly, an IB HCA running Ethernet emulation is approximately the world's worst 10GbE adapter (not to put too fine of a point on it :-) ) There is no hardware checksumming, nor large-send offloading, both of which force overhead onto software. And, as you just discovered it isn't even 10Gb!

In general, network emulation layers are always going to perform more poorly than native implementations. But this is only a generality learned from years of experience with them.

Tom.
Re: [openib-general] Question about the IPoIB bandwidth performance ?
Hal Rosenstock wrote:
> On Mon, 2006-06-05 at 11:12, hbchen wrote:
> > Hi,
> > I have a question about the IPoIB bandwidth performance.
> > I did netperf testing using Single GigE, Myrinet D card, Myrinet 10G
> > ethernet card, and Voltaire Infiniband 4X HCA400Ex (PCI-Express interface).
> >
> > NIC (Jumbo enabled)                  Line bandwidth (LB)  IPoverNIC      Utilization  Notes
> >                                                           bandwidth      (IPoNIC/LB)
> > -----------------------------------  -------------------  -------------  -----------  -----
> > Single Gigabit NIC (PCI-X)           1Gb/sec=125MB/sec    120MB/sec      96%
> > Myrinet D card (PCI-X)               250MB/sec            240-245MB/sec  96%-98%
> > Myrinet 10G Ethernet (PCI-Express)   10Gb/sec=1280MB/sec  980MB/sec      76.6%        My testing using Linux 2.6.14.6
> >                                                           1225MB/sec     95.7%        Data from Myrinet website
> > IB HCA 4X (PCI-Express)              10Gb/sec=1280MB/sec  420MB/sec      32.8%        My testing using Linux 2.6.14.6
> >                                                           474MB/sec      37%          Best from OpenIB mailing list (2.6.12-rc5 patch 1)
> >
> > Why is the bandwidth utilization of IPoIB so low compared to the other
> > NICs?
>
> One thing to note is that the max utilization of 10G IB (4x) is 8G due
> to the signalling being included in this rate (unlike ethernet whose
> rate represents the data rate and does not include the signalling
> overhead).
>
> -- Hal

You also have larger IP packets when you use GigE (especially with large send/offload) and Myrinet. I think Myrinet uses a 60K MTU, and for GigE without large send you get a 9000 MTU. With large send you get a 64K buffer to the adapter, so fragmentation to 1500/9000 IP packets is offloaded to the adapter. Currently with IPoIB using UD mode, you have to generate lots of 2K packets. With serialized IPoIB drivers you end up bottlenecking on a single CPU. There is an IPoIB-CM IETF spec out which should significantly improve IPoIB performance if implemented.

> > There must be a lot of room to improve the IPoIB software to reach 75%+
> > bandwidth utilization.
> >
> > HB Chen
> > Los Alamos National Lab
> > [EMAIL PROTECTED]

Bernie King-Smith
IBM Corporation
Server Group
Cluster System Performance
[EMAIL PROTECTED] (845)433-8483
Tie. 293-8483 or wombat2 on NOTES

"We are not responsible for the world we are born into, only for the world we leave when we die. So we have to accept what has gone before us and work to change the only thing we can, -- The Future."
William Shatner
Re: [openib-general] Question about the IPoIB bandwidth performance ?
Hal Rosenstock wrote:
> On Mon, 2006-06-05 at 11:12, hbchen wrote:
>> Hi,
>> I have a question about the IPoIB bandwidth performance.
>> I did netperf testing using Single GigE, Myrinet D card, Myrinet 10G
>> ethernet card, and Voltaire Infiniband 4X HCA400Ex (PCI-Express interface).
>>
>> NIC (Jumbo enabled)                  Line bandwidth (LB)  IPoverNIC      Utilization  Notes
>>                                                           bandwidth      (IPoNIC/LB)
>> -----------------------------------  -------------------  -------------  -----------  -----
>> Single Gigabit NIC (PCI-X)           1Gb/sec=125MB/sec    120MB/sec      96%
>> Myrinet D card (PCI-X)               250MB/sec            240-245MB/sec  96%-98%
>> Myrinet 10G Ethernet (PCI-Express)   10Gb/sec=1280MB/sec  980MB/sec      76.6%        My testing using Linux 2.6.14.6
>>                                                           1225MB/sec     95.7%        Data from Myrinet website
>> IB HCA 4X (PCI-Express)              10Gb/sec=1280MB/sec  420MB/sec      32.8%        My testing using Linux 2.6.14.6
>>                                                           474MB/sec      37%          Best from OpenIB mailing list (2.6.12-rc5 patch 1)
>>
>> Why is the bandwidth utilization of IPoIB so low compared to the other
>> NICs?
>
> One thing to note is that the max utilization of 10G IB (4x) is 8G due
> to the signalling being included in this rate (unlike ethernet whose
> rate represents the data rate and does not include the signalling
> overhead).

Hal,
Even with this IB-4X = 8Gb/sec = 1024 MB/sec the IPoIB bandwidth utilization is still very low.
>> IPoIB=420MB/sec
>> bandwidth utilization= 420/1024 = 41.01%

HB

> -- Hal
>
>> There must be a lot of room to improve the IPoIB software to reach 75%+
>> bandwidth utilization.
>>
>> HB Chen
>> Los Alamos National Lab
>> [EMAIL PROTECTED]
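The utilization arithmetic in this exchange can be reproduced with two small helpers. This is a sketch under stated assumptions: the 8/10 factor reflects 8b/10b link encoding, which is what turns the 10 Gb/sec signalling rate into the 8 Gb/sec data rate mentioned above, and both function names are hypothetical.

```c
/* Data rate in Gb/sec left after 8b/10b encoding (8 data bits per 10 signalled bits). */
static double data_rate_gbps(double signalling_gbps)
{
	return signalling_gbps * 8.0 / 10.0;
}

/* Utilization as a percentage; both arguments must use the same unit. */
static double utilization_percent(double measured, double available)
{
	return 100.0 * measured / available;
}
```

With the thread's convention of 1 Gb/sec = 128 MB/sec, 8 Gb/sec is 1024 MB/sec and 420 MB/sec works out to about 41%. With 1 Gb/sec = 125 MB/sec, the ceiling is 1000 MB/sec and the same measurement is 42%. Either way the conclusion is unchanged.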
Re: [openib-general] Question about the IPoIB bandwidth performance ?
On Mon, 2006-06-05 at 11:12, hbchen wrote:
> Hi,
> I have a question about the IPoIB bandwidth performance.
> I did netperf testing using Single GigE, Myrinet D card, Myrinet 10G
> ethernet card, and Voltaire Infiniband 4X HCA400Ex (PCI-Express interface).
>
> NIC (Jumbo enabled)                  Line bandwidth (LB)  IPoverNIC      Utilization  Notes
>                                                           bandwidth      (IPoNIC/LB)
> -----------------------------------  -------------------  -------------  -----------  -----
> Single Gigabit NIC (PCI-X)           1Gb/sec=125MB/sec    120MB/sec      96%
> Myrinet D card (PCI-X)               250MB/sec            240-245MB/sec  96%-98%
> Myrinet 10G Ethernet (PCI-Express)   10Gb/sec=1280MB/sec  980MB/sec      76.6%        My testing using Linux 2.6.14.6
>                                                           1225MB/sec     95.7%        Data from Myrinet website
> IB HCA 4X (PCI-Express)              10Gb/sec=1280MB/sec  420MB/sec      32.8%        My testing using Linux 2.6.14.6
>                                                           474MB/sec      37%          Best from OpenIB mailing list (2.6.12-rc5 patch 1)
>
> Why is the bandwidth utilization of IPoIB so low compared to the other
> NICs?

One thing to note is that the max utilization of 10G IB (4x) is 8G due to the signalling being included in this rate (unlike ethernet whose rate represents the data rate and does not include the signalling overhead).

-- Hal

> There must be a lot of room to improve the IPoIB software to reach 75%+
> bandwidth utilization.
>
> HB Chen
> Los Alamos National Lab
> [EMAIL PROTECTED]
Re: [openib-general] question regarding GRH flag in ib_ah_attr
On Sun, 2006-05-14 at 15:30, Jason Gunthorpe wrote:
> On Sun, May 14, 2006 at 07:40:25AM -0400, Hal Rosenstock wrote:
> > > > Not always true in terms of local subnet (multicast and management MAD
> > > > response exceptions).
> > >
> > > Yes, but these are well specified. Multicast must always have a GRH.
> > > MAD requests are covered under my scenario above and MAD responses
> > > to MAD requests with GRH's are specified to use the GRH and set the
> > > HopLimit = 0xFF.
> >
> > Where does the spec say HopLmt needs to be 0xFF for multicast ?
>
> I meant that the spec says a MAD response with a GRH should have 0xFF
> for HopLmt. (13.5.4.4)

Right; from the MAD response rules.

> I'd expect the Multicast HopLmt to come from the SA, just like in the
> unicast case.

OK; that's what I thought.

> > Off subnet is either determined by the prefix comparison or HopLimit >=2
> > in the response from the SA. The latter is implied by C8-16 on p. 229.
>
> The only possible downside of using HopLimit, that I can see, is
> compatibility with existing SA's. Do all existing SA's set HopLmt to 0
> or 1 in path record responses? (Since no SA's support routers,
> that would be correct..)

I would argue that the implementations would not be conformant if that were not the case currently.

> Scope should not be a problem because the SA can follow whatever
> scope based rules might exist and then set HopLimit properly.

Sure, the SA would certainly use the scope to know whether it needs to go beyond the local subnet for path resolution (both unicast and multicast).

> FWIW, my vote would be to use HopLimit, since that lets the SA
> tell the client if it should use a GRH. With prefix comparison GRH
> usage is not under the control of the SA - so it is less flexible.

Makes sense to me (now)...

-- Hal

> Jason
Re: [openib-general] question regarding GRH flag in ib_ah_attr
On Sun, May 14, 2006 at 07:40:25AM -0400, Hal Rosenstock wrote:
> > > Not always true in terms of local subnet (multicast and management MAD
> > > response exceptions).
> >
> > Yes, but these are well specified. Multicast must always have a GRH.
> > MAD requests are covered under my scenario above and MAD responses
> > to MAD requests with GRH's are specified to use the GRH and set the
> > HopLimit = 0xFF.
>
> Where does the spec say HopLmt needs to be 0xFF for multicast ?

I meant that the spec says a MAD response with a GRH should have 0xFF for HopLmt. (13.5.4.4)

I'd expect the Multicast HopLmt to come from the SA, just like in the unicast case.

> Off subnet is either determined by the prefix comparison or HopLimit >=2
> in the response from the SA. The latter is implied by C8-16 on p. 229.

The only possible downside of using HopLimit, that I can see, is compatibility with existing SA's. Do all existing SA's set HopLmt to 0 or 1 in path record responses? (Since no SA's support routers, that would be correct..)

Scope should not be a problem because the SA can follow whatever scope based rules might exist and then set HopLimit properly.

FWIW, my vote would be to use HopLimit, since that lets the SA tell the client if it should use a GRH. With prefix comparison GRH usage is not under the control of the SA - so it is less flexible.

Jason
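The two candidate tests for the unicast case discussed in this thread can be written side by side. This is an illustrative sketch only: both function names are hypothetical, the hop limit is assumed to come from the SA's path record response, and the GID prefix is taken to be the upper 64 bits of the GID. Multicast is left out, since the thread agrees it always carries a GRH.

```c
#include <stdbool.h>
#include <stdint.h>

/* HopLimit test: a hop limit of 0 or 1 cannot be forwarded off the local subnet. */
static bool needs_grh_by_hop_limit(uint8_t hop_limit)
{
	return hop_limit >= 2;
}

/* Prefix test: differing 64-bit GID prefixes imply an off-subnet destination. */
static bool needs_grh_by_prefix(uint64_t sgid_prefix, uint64_t dgid_prefix)
{
	return sgid_prefix != dgid_prefix;
}
```

The first form keeps the decision with the SA, which is the flexibility argued for above. The second can be evaluated by the client alone.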
Re: [openib-general] question regarding GRH flag in ib_ah_attr
On Fri, 2006-05-12 at 13:55, Sean Hefty wrote:
> Jason Gunthorpe wrote:
> > How about this, how do you see this scenario:
> >
> > 1) Client gets a DGID from 'someplace'
> > 2) Client sends a SA query to resolve the DGID to a Path Record
> > 3) Client configures a QP based on the Path Record
> >
> > Now, the question I'm interested in is this:
> > During step #3 what test should the client apply to determine if a
> > GRH should be used with the QP.
>
> This is the scenario that I need to resolve.
>
> What would happen if the GRH flag were always set?

That would work but there would be additional overhead (especially for small packets this would be more noticeable) in the local subnet case.

> Set only if the GID prefixes of the SGID/DGID were different?

That's one way although it is more complex than what Jason has been proposing for this (SA response with HopLimit>=2). I'm not yet sure that the latter is sufficient as I think there may be other factors as to whether a packet is forwarded off subnet. One is the prefix scope (but I would think link local scopes should be limited in HopLimit except for multicasts (Jason cited that multicasts were required to have HopLimit 0xFF) but they require GRHs anyhow) so maybe I'm wrong about this and HopLimit>=2 is sufficient.

-- Hal

> - Sean
Re: [openib-general] question regarding GRH flag in ib_ah_attr
On Fri, 2006-05-12 at 13:10, Jason Gunthorpe wrote:
> On Fri, May 12, 2006 at 08:11:17AM -0400, Hal Rosenstock wrote:
>
> > > To allow what Roland is talking about you need an unambiguous
> > > mechanism where the SA can signal to the client that the path
> > > needs a GRH.
> >
> > Ah, you are referring to the SA path record response not the request.
>
> Yes.. Though I think we are still talking about different things in a
> few places ;>
>
> How about this, how do you see this scenario:
>
> 1) Client gets a DGID from 'someplace'
> 2) Client sends a SA query to resolve the DGID to a Path Record
> 3) Client configures a QP based on the Path Record
>
> Now, the question I'm interested in is this:
> During step #3 what test should the client apply to determine if a
> GRH should be used with the QP.
>
> Other issues around the GRH like management MAD responses use and
> multicast I feel are well specified and don't need more consideration.

Thanks for clarifying.

> > > Think of it the other way, HopLimit < 2 means it _can't_ be forwarded
> > > off subnet, so that result from the SA should _always_ cause the
> > > requesting client to not use a GRH for that path.
> >
> > Not always true in terms of local subnet (multicast and management MAD
> > response exceptions).
>
> Yes, but these are well specified. Multicast must always have a GRH.
> MAD requests are covered under my scenario above and MAD responses
> to MAD requests with GRH's are specified to use the GRH and set the
> HopLimit = 0xFF.

Where does the spec say HopLmt needs to be 0xFF for multicast ?

> Also, I would assume when building a router that multicast packets
> with a hop limit of 0 are non-forwardable based on the rules in IBA.

A hop limit of 0 or 1, for both unicast and multicast.

> > Are you saying HopLimit is supplied to the SA in the request ? It could
> > be but it's optional in general. In the router case, an off subnet DGID
> > should be sufficient. I would think the HopLimit (as well as the other
> > GRH fields) would need to be returned by the SA to the client.
>
> Talking about a request for a Path to the SA from a client now:
> I would suggest that if the client wishes to restrict itself to paths
> that are only on-link then it could send a SA request with the
> path record HopLimit=0.

Yes (or HopLimit=1).

> A SA request with HopLimit=* (masked out
> of component mask) should let the SA return routed paths.

Yes.

> I also think that the SA response should have a HopLimit of 0 for
> local paths

1 would be valid here too.

> and a HopLimit >= 2 for routed paths.

Yes.

> However, I can't find any wording in IBA that would require this
> behavior.

In terms of the SA responses to Path/MultiPathRecord requests, the
HopLimit is required to be filled in in the response. Is that what you
are asking ? It's up to the SA to determine this, and for the client to
use the returned values subsequently, just as it does for DLIDs, SLs,
etc.

> > Not sure exactly what you mean by full control over the routing header
> > (GRH). The SA supplies the info for the headers to the client and the
> > client is responsible for putting the correct info in the headers. Do
> > you mean supplies sufficient info for the client to do this correctly ?
> > If so, I agree.
>
> As far as I can see IBA includes all header information for the GRH
> and LRH in the PathRecord response. It does not define how to
> determine if the path described by a PathRecord response requires
> a GRH or not.

I think the rules are there: Multicasts always have a GRH. Unicasts off
subnet have a GRH, and on subnet it is optional. Off subnet is
determined either by the prefix comparison or by HopLimit >= 2 in the
response from the SA. The latter is implied by C8-16 on p. 229.

-- Hal

> Thanks,
> Jason
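The three rules Hal lists above could be written down roughly as follows. This is a minimal sketch in Python, not kernel code: the names `Grh`, `is_multicast_gid` and `grh_rule` are made up for illustration, GIDs are assumed to be 16-byte `bytes` values, and the caller is assumed to have already decided "off subnet" by whichever test (prefix comparison or HopLimit) ends up being chosen.

```python
from enum import Enum


class Grh(Enum):
    REQUIRED = "required"
    OPTIONAL = "optional"


def is_multicast_gid(gid: bytes) -> bool:
    """Multicast GIDs have all ones in their top eight bits (0xFF)."""
    if len(gid) != 16:
        raise ValueError("a GID is 16 bytes")
    return gid[0] == 0xFF


def grh_rule(dgid: bytes, off_subnet: bool) -> Grh:
    """Hypothetical helper applying the rules from the message above.

    off_subnet is supplied by the caller; this sketch does not choose
    between the prefix comparison and the HopLimit >= 2 test.
    """
    if is_multicast_gid(dgid):
        return Grh.REQUIRED      # multicasts always have a GRH
    if off_subnet:
        return Grh.REQUIRED      # unicast off subnet has a GRH
    return Grh.OPTIONAL          # unicast on subnet: GRH is optional
```

Keeping the off-subnet decision as an input leaves the open question in this thread (which test to use) outside the helper.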
Re: [openib-general] question regarding GRH flag in ib_ah_attr
Jason Gunthorpe wrote:
> How about this, how do you see this scenario:
>
> 1) Client gets a DGID from 'someplace'
> 2) Client sends a SA query to resolve the DGID to a Path Record
> 3) Client configures a QP based on the Path Record
>
> Now, the question I'm interested in is this:
> During step #3 what test should the client apply to determine if a
> GRH should be used with the QP.

This is the scenario that I need to resolve.

What would happen if the GRH flag were always set?
Set only if the GID prefixes of the SGID/DGID were different?

- Sean
Re: [openib-general] question regarding GRH flag in ib_ah_attr
On Fri, May 12, 2006 at 08:11:17AM -0400, Hal Rosenstock wrote:

> > To allow what Roland is talking about you need an unambiguous
> > mechanism where the SA can signal to the client that the path
> > needs a GRH.
>
> Ah, you are referring to the SA path record response not the request.

Yes.. Though I think we are still talking about different things in a
few places ;>

How about this, how do you see this scenario:

 1) Client gets a DGID from 'someplace'
 2) Client sends a SA query to resolve the DGID to a Path Record
 3) Client configures a QP based on the Path Record

Now, the question I'm interested in is this:
During step #3 what test should the client apply to determine if a
GRH should be used with the QP.

Other issues around the GRH, like its use in management MAD responses
and multicast, I feel are well specified and don't need more
consideration.

> > Think of it the other way, HopLimit < 2 means it _can't_ be forwarded
> > off subnet, so that result from the SA should _always_ cause the
> > requesting client to not use a GRH for that path.
>
> Not always true in terms of local subnet (multicast and management MAD
> response exceptions).

Yes, but these are well specified. Multicast must always have a GRH.
MAD requests are covered under my scenario above, and MAD responses to
MAD requests with GRHs are specified to use the GRH and set the
HopLimit = 0xFF.

Also, I would assume when building a router that multicast packets
with a hop limit of 0 are non-forwardable based on the rules in IBA.

> Are you saying HopLimit is supplied to the SA in the request ? It could
> be but it's optional in general. In the router case, an off subnet DGID
> should be sufficient. I would think the HopLimit (as well as the other
> GRH fields) would need to be returned by the SA to the client.

Talking about a request for a Path to the SA from a client now:
I would suggest that if the client wishes to restrict itself to paths
that are only on-link then it could send a SA request with the path
record HopLimit=0. A SA request with HopLimit=* (masked out of the
component mask) should let the SA return routed paths.

I also think that the SA response should have a HopLimit of 0 for
local paths and a HopLimit >= 2 for routed paths.

However, I can't find any wording in IBA that would require this
behavior.

> Not sure exactly what you mean by full control over the routing header
> (GRH). The SA supplies the info for the headers to the client and the
> client is responsible for putting the correct info in the headers. Do
> you mean supplies sufficient info for the client to do this correctly ?
> If so, I agree.

As far as I can see, IBA includes all header information for the GRH
and LRH in the PathRecord response. It does not define how to
determine if the path described by a PathRecord response requires a
GRH or not.

Thanks,
Jason
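To make the request side of this suggestion concrete, here is a small Python model of a query in which a component mask selects the fields that must match, so a masked-out HopLimit acts as a wildcard. It is a sketch under stated assumptions: the mask bit values, the `Path` record and `sa_query` are hypothetical and do not reproduce the real PathRecord layout, and exact matching on selected fields is assumed.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical component-mask bits for this sketch only; the real
# PathRecord bit positions are defined by the IBA spec.
COMP_DGID = 1 << 0
COMP_HOP_LIMIT = 1 << 1


@dataclass(frozen=True)
class Path:
    dgid: str        # destination GID, written as text for brevity
    hop_limit: int   # HopLmt the SA would return for this path


def sa_query(table: List[Path], comp_mask: int,
             dgid: Optional[str] = None,
             hop_limit: Optional[int] = None) -> List[Path]:
    """Return the paths matching every field selected in comp_mask.

    A field whose bit is clear is a wildcard and is not compared.
    """
    matches = []
    for path in table:
        if comp_mask & COMP_DGID and path.dgid != dgid:
            continue
        if comp_mask & COMP_HOP_LIMIT and path.hop_limit != hop_limit:
            continue
        matches.append(path)
    return matches
```

With a table holding one local path (HopLimit 0) and one routed path (HopLimit >= 2), a query that selects HopLimit=0 never returns the routed path, while a query with HopLimit masked out can.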
Re: [openib-general] question regarding GRH flag in ib_ah_attr
On Thu, 2006-05-11 at 13:12, Jason Gunthorpe wrote:
> On Thu, May 11, 2006 at 07:20:19AM -0400, Hal Rosenstock wrote:
>
> > That would be a simpler check but HopLimit is not a required component
> > of PathRecord but I think this may not be sufficient as just because a
> > HopLimit >= 2 doesn't mean that a packet would be forwarded off subnet.
>
> I was thinking of the other direction: How does the requestor/client
> know if a Path requires a GRH.

The requester/client needs to request a path for a DGID which is off
(the local) subnet.

> To allow what Roland is talking about you need an unambiguous
> mechanism where the SA can signal to the client that the path
> needs a GRH.

Ah, you are referring to the SA path record response not the request.

> The only field I can see that could be used for that is HopLimit..

That's one. The ugly prefix comparison would be another.

> Think of it the other way, HopLimit < 2 means it _can't_ be forwarded
> off subnet, so that result from the SA should _always_ cause the
> requesting client to not use a GRH for that path.

Not always true in terms of local subnet (multicast and management MAD
response exceptions).

> Any test beyond HopLimit could be done in the SA prior to returning
> the path records to the client.

Are you saying HopLimit is supplied to the SA in the request ? It could
be, but it's optional in general. In the router case, an off subnet
DGID should be sufficient. I would think the HopLimit (as well as the
other GRH fields) would need to be returned by the SA to the client.

> If further tests are put in the client
> they only limit the routing configurations that are possible.

Not sure what further tests you are referring to here. I agree with the
goal not to add any unnecessary constraints on routing configurations.

> Note:
> Although 8.3.6 specifies that 0 and 1 don't let the packet off
> the subnet, table 60 says that CA's should set the HopLimit
> to 0 and the 'first' router should fill it in. Hmm..

Interesting. The description in table 60 also says "Alternately set
according to application."

> > Why is a request with just a non link local prefix (with HopLimit
> > wildcarded) not sufficient ?
>
> I think it would be best if the SA had full control over what headers
> the CA's put on their packets on a path by path basis. That allows for
> the most flexibility down the road.

Not sure exactly what you mean by full control over the routing header
(GRH). The SA supplies the info for the headers to the client and the
client is responsible for putting the correct info in the headers. Do
you mean supplies sufficient info for the client to do this correctly ?
If so, I agree.

-- Hal

> Jason
Re: [openib-general] question regarding GRH flag in ib_ah_attr
On Thu, May 11, 2006 at 10:21:08AM -0700, Sean Hefty wrote:
> Hal Rosenstock wrote:
> >Anytime the send is off the local subnet (as well as multicast), a GRH
> >is required. Also, there is a management response rule for responding
> >when the request contained a GRH that requires a GRH (13.5.4.4 p. 769).
> Reading through the responses, I think my problems are worse. Now I'm not
> even sure how I determine which remote node I'm trying to talk to short of
> hard-coding the DGID...
> We currently use ARP to resolve an IP address to a DGID, which I don't
> believe will work across a router. Does an app even know enough to be able
> to get a path record?

The only wrinkle I can see you having is how to choose between multiple
DGIDs when generating the ARP response. I don't think that is a serious
issue, though, since any GID to any GID should be routable on the
subnet.

I haven't looked at the ARP code, but based on the RFCs the IPv4 ARP
process would be more or less:

1) Send an ARP datagram to the broadcast multicast group LID w/ GRH.
   The ARP packet includes the IPv4 address of the sender and the
   GID/QPN (hardware address) of the sender, asking for the hardware
   address of the target IPv4 address.

   A router must support multicast routing so that the ARP request is
   forwarded to the remote subnet. It has a GRH of course, so this is
   OK. The SM and router work together to make this happen.

2) The ARP responder matches the target IP address, then gets the IP
   of the requestor and the GID/QPN from the ARP packet's sender
   fields.

   We are still OK since the GID in the ARP packet's sender fields is
   global.

3) The ARP responder produces a unicast packet to the IPv4 requestor
   address:
   - The sender's GID/QPN is converted into a path, either from a
     local cache or via a SA query. The sender's GID combined with any
     of the target's GIDs should be sufficient to ask the SA for a
     path.
     [Note that you must use the _hardware_ address here; you cannot
     just look up the IPv4 sender address in the neighbor cache. This
     is needed to support ARP tricks like zeroconf that use null
     source IPs]
   - This query results in a path record for communication with the
     sender.
     [Some implementations will learn based on ARP requests and will
     update the neighbor cache here]
   - The path record is used to generate the unicast headers, GRH and
     all, if necessary.
   - The same SGID that was used in the path record query above is
     returned in the ARP response as the target's address.

   Since the SA specifies the path to get back to the requestor based
   only on the GID in the ARP request, it can produce a path that
   crosses the router.

4) The ARP requestor now gets the responder's GID/QPN from the unicast
   ARP response and does the same path lookup that the ARP responder
   did, to get the 'reverse' path.

   Again, since the SA is now involved, the resulting path can cross
   the router.

IPv6 is similar, but the packet format is different and the 'ARP' (NS
packet) request is sent to a multicast address chosen by 'hashing' the
IPv6 address.

Hope this helps,
Jason
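Steps 2 and 3 of the ARP walk-through above can be modelled in a few lines of Python. This is a rough sketch, not the IPoIB code: `ArpPacket`, `answer_arp`, the dictionary used as a path cache and the `sa_query` callback are all assumptions made for the example, GIDs are plain strings, and the QPN is left out. The point it illustrates is that the responder keys the path lookup on the requestor's GID (its hardware address), never on its IP address.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Tuple


@dataclass(frozen=True)
class ArpPacket:
    sender_ip: str                    # may be null for zeroconf-style probes
    sender_gid: str                   # sender hardware address (QPN omitted)
    target_ip: str
    target_gid: Optional[str] = None  # only known in a reply


def answer_arp(req: ArpPacket, my_ip: str, my_gid: str,
               cache: Dict[Tuple[str, str], dict],
               sa_query: Callable[[str, str], dict]):
    """Return (reply, path) for a request aimed at my_ip, else None."""
    if req.target_ip != my_ip:
        return None
    # Resolve the path from the hardware address in the request, using
    # the local cache first and the (hypothetical) SA query otherwise.
    key = (my_gid, req.sender_gid)
    if key not in cache:
        cache[key] = sa_query(my_gid, req.sender_gid)
    # The SGID used in the path lookup is the one advertised in the reply.
    reply = ArpPacket(sender_ip=my_ip, sender_gid=my_gid,
                      target_ip=req.sender_ip, target_gid=req.sender_gid)
    return reply, cache[key]
```

The requestor would run the mirror-image lookup on the reply's `sender_gid` to obtain the reverse path, as in step 4.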
Re: [openib-general] question regarding GRH flag in ib_ah_attr
On Thu, 2006-05-11 at 13:29, Roland Dreier wrote:
> Sean> We currently use ARP to resolve an IP address to a DGID,
> Sean> which I don't believe will work across a router. Does an
> Sean> app even know enough to be able to get a path record?
>
> I think you're fine. The IB router just has to handle forwarding
> multicasts

Specifically IPoIB broadcast

> between two IB subnets for ARP to work.

Yes, because an IPoIB subnet can span multiple IB subnets.

> If there's also an IP router in between the two hosts

when the hosts are on different IP(oIB) subnets.

> then there's a problem, but I don't think it's that reasonable
> to expect to make a direct RDMA connection in that case.

That's a different case; you don't ARP off your IPoIB subnet; you get
the next hop router towards that IPoIB subnet.

-- Hal

> - R.
Re: [openib-general] question regarding GRH flag in ib_ah_attr
Sean> We currently use ARP to resolve an IP address to a DGID,
Sean> which I don't believe will work across a router. Does an
Sean> app even know enough to be able to get a path record?

I think you're fine. The IB router just has to handle forwarding
multicasts between two IB subnets for ARP to work. If there's also an
IP router in between the two hosts then there's a problem, but I don't
think it's that reasonable to expect to make a direct RDMA connection
in that case.

- R.
Re: [openib-general] question regarding GRH flag in ib_ah_attr
Hal Rosenstock wrote:
> Anytime the send is off the local subnet (as well as multicast), a GRH
> is required. Also, there is a management response rule for responding
> when the request contained a GRH that requires a GRH (13.5.4.4 p. 769).

Reading through the responses, I think my problems are worse. Now I'm
not even sure how I determine which remote node I'm trying to talk to
short of hard-coding the DGID...

We currently use ARP to resolve an IP address to a DGID, which I don't
believe will work across a router. Does an app even know enough to be
able to get a path record?

- Sean
Re: [openib-general] question regarding GRH flag in ib_ah_attr
On Thu, May 11, 2006 at 07:20:19AM -0400, Hal Rosenstock wrote:

> That would be a simpler check but HopLimit is not a required component
> of PathRecord but I think this may not be sufficient as just because a
> HopLimit >= 2 doesn't mean that a packet would be forwarded off subnet.

I was thinking of the other direction: how does the requestor/client
know if a path requires a GRH?

To allow what Roland is talking about you need an unambiguous
mechanism where the SA can signal to the client that the path needs a
GRH. The only field I can see that could be used for that is HopLimit.

Think of it the other way: HopLimit < 2 means it _can't_ be forwarded
off subnet, so that result from the SA should _always_ cause the
requesting client to not use a GRH for that path.

Any test beyond HopLimit could be done in the SA prior to returning
the path records to the client. If further tests are put in the client
they only limit the routing configurations that are possible.

Note:
  Although 8.3.6 specifies that 0 and 1 don't let the packet off the
  subnet, table 60 says that CAs should set the HopLimit to 0 and the
  'first' router should fill it in. Hmm..

> Why is a request with just a non link local prefix (with HopLimit
> wildcarded) not sufficient ?

I think it would be best if the SA had full control over what headers
the CAs put on their packets on a path by path basis. That allows for
the most flexibility down the road.

Jason
RE: [openib-general] question regarding GRH flag in ib_ah_attr
I agree with Hal.
If you look for a Path Record to ANOTHER subnet you should provide the
GRH in the sent packet address ...

Eitan Zahavi
Design Technology Director
Mellanox Technologies LTD
Tel:+972-4-9097208
Fax:+972-4-9593245
P.O. Box 586 Yokneam 20692 ISRAEL

> -----Original Message-----
> From: [EMAIL PROTECTED] [mailto:openib-general-
> [EMAIL PROTECTED] On Behalf Of Hal Rosenstock
> Sent: Thursday, May 11, 2006 2:20 PM
> To: Jason Gunthorpe
> Cc: Roland Dreier; openib-general@openib.org
> Subject: Re: [openib-general] question regarding GRH flag in ib_ah_attr
>
> On Thu, 2006-05-11 at 01:48, Jason Gunthorpe wrote:
> > On Wed, May 10, 2006 at 09:56:58PM -0700, Roland Dreier wrote:
> > > Hal> Huh ? In this case, aren't the subnet prefixes required
> > > Hal> to be different ?
> > >
> > > It's kind of a crazy thing to do but I don't see anything in the IB
> > > spec that forbids two subnets with the same subnet prefix, or any
> > > reason why a router couldn't route between them. The SMs would just
> > > have to be smart enough to return the LID of the router for paths to
> > > ports on the other subnet, and the routers would have to have explicit
> > > routes rather than forwarding based on just GID prefix.
> >
> > Hmm, this is an interesting point, you can do this in IP land using
> > host routes.
> >
> > How about this - the Path record (and related) SA responses include
> > the Hop Limit fields and the spec says:
> >
> > 8.3.6 Hop Limit: [..] Setting this value to 0 or 1 will ensure that
> > the packet will not be forwarded beyond the local subnet.
> >
> > So, it is within the spec to use HopLmt >= 2 as the GRH required flag.
>
> That would be a simpler check but HopLimit is not a required component
> of PathRecord but I think this may not be sufficient as just because a
> HopLimit >= 2 doesn't mean that a packet would be forwarded off subnet.
>
> > I'd propose that the combination of a non-link-local prefix and a >= 2
> > Hop Limit should force a GRH. SM's that do not support routers should
> > always fill in 0 for HopLmt.
>
> Why is a request with just a non link local prefix (with HopLimit
> wildcarded) not sufficient ?
>
> -- Hal
>
> > Jason
Re: [openib-general] question regarding GRH flag in ib_ah_attr
On Thu, 2006-05-11 at 01:48, Jason Gunthorpe wrote:
> On Wed, May 10, 2006 at 09:56:58PM -0700, Roland Dreier wrote:
> > Hal> Huh ? In this case, aren't the subnet prefixes required
> > Hal> to be different ?
> >
> > It's kind of a crazy thing to do but I don't see anything in the IB
> > spec that forbids two subnets with the same subnet prefix, or any
> > reason why a router couldn't route between them. The SMs would just
> > have to be smart enough to return the LID of the router for paths to
> > ports on the other subnet, and the routers would have to have explicit
> > routes rather than forwarding based on just GID prefix.
>
> Hmm, this is an interesting point, you can do this in IP land using
> host routes.
>
> How about this - the Path record (and related) SA responses include
> the Hop Limit fields and the spec says:
>
> 8.3.6 Hop Limit: [..] Setting this value to 0 or 1 will ensure that
> the packet will not be forwarded beyond the local subnet.
>
> So, it is within the spec to use HopLmt >= 2 as the GRH required flag.

That would be a simpler check, but HopLimit is not a required
component of PathRecord. I also think it may not be sufficient: just
because HopLimit >= 2 doesn't mean that a packet would be forwarded
off subnet.

> I'd propose that the combination of a non-link-local prefix and a >= 2
> Hop Limit should force a GRH. SM's that do not support routers should
> always fill in 0 for HopLmt.

Why is a request with just a non link local prefix (with HopLimit
wildcarded) not sufficient ?

-- Hal

> Jason
Re: [openib-general] question regarding GRH flag in ib_ah_attr
On Thu, 2006-05-11 at 00:56, Roland Dreier wrote:
> Hal> Huh ? In this case, aren't the subnet prefixes required
> Hal> to be different ?
>
> It's kind of a crazy thing to do but I don't see anything in the IB
> spec that forbids two subnets with the same subnet prefix,

There are errata against the current confusion in the IBA spec in
terms of GID v. subnet prefix. The bottom line on this is:
Each subnet is uniquely identified with a subnet ID known as the
Subnet Prefix.

> or any reason why a router couldn't route between them. The SMs would just
> have to be smart enough to return the LID of the router for paths to
> ports on the other subnet, and the routers would have to have explicit
> routes rather than forwarding based on just GID prefix.

Assuming the above is ignored (and the subnet prefixes are not
unique), the routers along any particular path would just have
explicit routes for one of these duplicate subnets, right ?

-- Hal

> - R.
Re: [openib-general] question regarding GRH flag in ib_ah_attr
On Wed, May 10, 2006 at 09:56:58PM -0700, Roland Dreier wrote:
> Hal> Huh ? In this case, aren't the subnet prefixes required
> Hal> to be different ?
>
> It's kind of a crazy thing to do but I don't see anything in the IB
> spec that forbids two subnets with the same subnet prefix, or any
> reason why a router couldn't route between them. The SMs would just
> have to be smart enough to return the LID of the router for paths to
> ports on the other subnet, and the routers would have to have explicit
> routes rather than forwarding based on just GID prefix.

Hmm, this is an interesting point, you can do this in IP land using
host routes.

How about this - the Path record (and related) SA responses include
the Hop Limit fields and the spec says:

 8.3.6 Hop Limit: [..] Setting this value to 0 or 1 will ensure that
 the packet will not be forwarded beyond the local subnet.

So, it is within the spec to use HopLmt >= 2 as the GRH required flag.

I'd propose that the combination of a non-link-local prefix and a >= 2
Hop Limit should force a GRH. SM's that do not support routers should
always fill in 0 for HopLmt.

Jason
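Jason's proposed combination (non-link-local DGID prefix and Hop Limit >= 2) could look something like the sketch below. It is illustrative Python only: `PathRecord` here is a hypothetical two-field stand-in for the SA response, not the real record layout, and `grh_forced` is a made-up name.

```python
from dataclasses import dataclass

LINK_LOCAL_PREFIX = bytes.fromhex("fe80000000000000")  # FE80::/64


@dataclass(frozen=True)
class PathRecord:
    """Hypothetical subset of the fields an SA response would carry."""
    dgid: bytes      # 16-byte destination GID
    hop_limit: int   # HopLmt as filled in by the SA


def grh_forced(path: PathRecord) -> bool:
    """True when the proposal above would force a GRH on this path."""
    non_link_local = path.dgid[:8] != LINK_LOCAL_PREFIX
    return non_link_local and path.hop_limit >= 2
```

Under this rule an SM that does not support routers and always fills in 0 for HopLmt never forces a GRH on unicast paths, which matches the last sentence of the proposal.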
Re: [openib-general] question regarding GRH flag in ib_ah_attr
Hal> What you are describing is similar to a NAT function for IB
Hal> which would need to be supported in the IB edge router to
Hal> that private network.

Why does there have to be any NAT? The router would just have to
replace the DLID the same as it usually does. I don't see why the GID
prefix makes any difference really.

- R.
Re: [openib-general] question regarding GRH flag in ib_ah_attr
Hal> Huh ? In this case, aren't the subnet prefixes required
Hal> to be different ?

It's kind of a crazy thing to do but I don't see anything in the IB
spec that forbids two subnets with the same subnet prefix, or any
reason why a router couldn't route between them. The SMs would just
have to be smart enough to return the LID of the router for paths to
ports on the other subnet, and the routers would have to have explicit
routes rather than forwarding based on just GID prefix.

- R.
Re: [openib-general] question regarding GRH flag in ib_ah_attr
On Wed, 2006-05-10 at 21:26, Hal Rosenstock wrote:
> On Wed, 2006-05-10 at 19:44, Roland Dreier wrote:
> > Sean> Does anyone know how the user determines if the grh flag
> > Sean> should be set in the ib_ah_attr when allocating an ib_ah?
> > Sean> Do they do this by examining the GIDs in a path record?
> >
> > Good question. It's always needed for multicast, of course. For
> > unicast, I guess one could look at whether the subnet prefixes of the
> > SGID and DGID are the same, but I'm not sure that's sufficient -- a
> > router could conceivably sit between two subnets with the same subnet
> > prefix.
>
> Huh ? In this case, aren't the subnet prefixes required to be
> different ?

Not just different but globally unique, right ?

What you are describing is similar to a NAT function for IB which
would need to be supported in the IB edge router to that private
network.

-- Hal

> -- Hal
>
> > Perhaps some of the Obsidian guys could comment?
> >
> > - R.
Re: [openib-general] question regarding GRH flag in ib_ah_attr
On Wed, 2006-05-10 at 19:35, Sean Hefty wrote:
> For context, I'm trying to work backwards from sending a message on a
> UD QP to determine what information is needed and how it is obtained.
>
> Does anyone know how the user determines if the grh flag should be set
> in the ib_ah_attr when allocating an ib_ah? Do they do this by
> examining the GIDs in a path record?

Anytime the send is off the local subnet (as well as multicast), a GRH
is required. Also, there is a management response rule that requires a
GRH when responding to a request that contained a GRH (13.5.4.4
p. 769).

-- Hal

> - Sean
Re: [openib-general] question regarding GRH flag in ib_ah_attr
On Wed, 2006-05-10 at 19:44, Roland Dreier wrote:
> Sean> Does anyone know how the user determines if the grh flag
> Sean> should be set in the ib_ah_attr when allocating an ib_ah?
> Sean> Do they do this by examining the GIDs in a path record?
>
> Good question. It's always needed for multicast, of course. For
> unicast, I guess one could look at whether the subnet prefixes of the
> SGID and DGID are the same, but I'm not sure that's sufficient -- a
> router could conceivably sit between two subnets with the same subnet
> prefix.

Huh ? In this case, aren't the subnet prefixes required to be
different ?

-- Hal

> Perhaps some of the Obsidian guys could comment?
>
> - R.
Re: [openib-general] question regarding GRH flag in ib_ah_attr
On Wed, May 10, 2006 at 04:44:42PM -0700, Roland Dreier wrote:
> Sean> Does anyone know how the user determines if the grh flag
> Sean> should be set in the ib_ah_attr when allocating an ib_ah?
> Sean> Do they do this by examining the GIDs in a path record?
>
> Good question. It's always needed for multicast, of course. For
> unicast, I guess one could look at whether the subnet prefixes of the
> SGID and DGID are the same, but I'm not sure that's sufficient -- a
> router could conceivably sit between two subnets with the same subnet
> prefix.
> Perhaps some of the Obsidian guys could comment?

Our intention in the absence of standardization is to leverage common
practice in IPv6 for numbering, which means that global prefixes need
to be globally unique (or at least site unique). A generic N port
router cannot connect subnets with the same prefix because it is
ambiguous where to send the packets.

Logically I think the GRH usage should be selected after the output
port is determined, based on matching the port's PortInfo.GIDPrefix
and the IBA default prefix (the link local prefix FE80::, which is
always on-link) against the DGID. If there is a match it is on link;
otherwise it is off link, through a router, and a GRH is necessary.

Right now IBA only allows two prefixes, FE80:: and PortInfo.GIDPrefix,
so the check described above can be reduced to comparing the SGID and
DGID prefixes: if they are different and the DGID prefix is not FE80::
then it is off link and needs a GRH.

Regards,
Jason
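The reduced check in the last paragraph of this message (compare the SGID and DGID prefixes and treat FE80:: as always on link) might be sketched like this. It is an illustration in Python rather than driver code; `gid_prefix` and `needs_grh_by_prefix` are names invented for the example, GIDs are assumed to be 16-byte `bytes` values, and only the unicast case is covered.

```python
LINK_LOCAL_PREFIX = bytes.fromhex("fe80000000000000")  # FE80::/64


def gid_prefix(gid: bytes) -> bytes:
    """Return the upper 64 bits (the subnet prefix) of a 128-bit GID."""
    if len(gid) != 16:
        raise ValueError("a GID is 16 bytes")
    return gid[:8]


def needs_grh_by_prefix(sgid: bytes, dgid: bytes) -> bool:
    """Unicast only: off link when the prefixes differ and the DGID
    prefix is not the link local prefix."""
    dgid_prefix = gid_prefix(dgid)
    if dgid_prefix == LINK_LOCAL_PREFIX:
        return False                      # FE80:: is always on link
    return dgid_prefix != gid_prefix(sgid)
```

Note that this test cannot see a router sitting between two subnets that share a prefix, which is the case Roland raises in the quoted text.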
Re: [openib-general] question regarding GRH flag in ib_ah_attr
Sean> Does anyone know how the user determines if the grh flag
Sean> should be set in the ib_ah_attr when allocating an ib_ah?
Sean> Do they do this by examining the GIDs in a path record?

Good question. It's always needed for multicast, of course. For unicast, I guess one could look at whether the subnet prefixes of the SGID and DGID are the same, but I'm not sure that's sufficient -- a router could conceivably sit between two subnets with the same subnet prefix.

Perhaps some of the Obsidian guys could comment?

- R.
Re: [openib-general] Question on : ib_reg_phys_mr()
On Sat, 8 Apr 2006, Devesh Sharma wrote:
> In your nfs-rdma context what this function is supposed to do?

It should create a memory region for the specified address range. For the exact semantics, see the IBTA spec's description of the REGISTER PHYSICAL MEMORY REGION verb (section 11.2.8.3 of the 1.2 spec).

> I know that this function returns memory region, but what is the
> difference from other mr returning functions? why get_dma_mr can't
> be used?

get_dma_mr() will return a memory region which covers all of physical memory. For security reasons, it is not always desirable to expose all of physical memory. ib_reg_phys_mr() allows for more fine-grained access control.
Re: [openib-general] Question on : ib_reg_phys_mr()
Thanks, James, for the quick reply.

In your nfs-rdma context, what is this function supposed to do? I know that it returns a memory region, but how does it differ from the other functions that return an MR? Why can't get_dma_mr be used?

Devesh

On 4/7/06, James Lentini <[EMAIL PROTECTED]> wrote:
> On Fri, 7 Apr 2006, Devesh Sharma wrote:
> > Hello list,
> > In Ib kernel verbs there is a function ib_reg_phys_mr().
> > I am not able to trace the call of this verb by any ulp or uverb.
> > Who calls this function?
>
> NFS-RDMA uses this function:
> http://sourceforge.net/projects/nfs-rdma
>
> > Is this function mandatory to be supported by the HCA driver provider?
>
> As a ULP implementer, I expect it to be supported. It is a standard
> IBTA verb.
Re: [openib-general] Question on : ib_reg_phys_mr()
On Fri, 7 Apr 2006, Devesh Sharma wrote:
> Hello list,
> In Ib kernel verbs there is a function ib_reg_phys_mr().
> I am not able to trace the call of this verb by any ulp or uverb.
> Who calls this function?

NFS-RDMA uses this function:
http://sourceforge.net/projects/nfs-rdma

> Is this function mandatory to be supported by the HCA driver provider?

As a ULP implementer, I expect it to be supported. It is a standard IBTA verb.
Re: [openib-general] Question on get_dma_mr()
Hi list and Roland,

Is this verb (ib_get_dma_mr) equivalent to the verb explained in section 11.2.8.1, Allocate L_key?

On 3/30/06, Steve Wise <[EMAIL PROTECTED]> wrote:
> On Wed, 2006-03-29 at 20:35 -0800, Roland Dreier wrote:
> > Devesh> Here I am saying that assigning Key is sufficient Or there
> > Devesh> are some other specific steps to be taken?
> >
> > It would depend on the device. You can look at the mthca, ipath and ehca
> > drivers' implementation of get_dma_mr() for examples.
>
> As well as the iwarp devices in the iwarp branch: amso1100 and cxgb3.
Re: [openib-general] Question on get_dma_mr()
On Wed, 2006-03-29 at 20:35 -0800, Roland Dreier wrote:
> Devesh> Here I am saying that assigning Key is sufficient Or there
> Devesh> are some other specific steps to be taken?
>
> It would depend on the device. You can look at the mthca, ipath and ehca
> drivers' implementation of get_dma_mr() for examples.

As well as the iwarp devices in the iwarp branch: amso1100 and cxgb3.
Re: [openib-general] Question on get_dma_mr()
Yes, OK. Thanks for replying once again.

Devesh

On 3/30/06, Roland Dreier <[EMAIL PROTECTED]> wrote:
> Devesh> Here I am saying that assigning Key is sufficient Or there
> Devesh> are some other specific steps to be taken?
>
> It would depend on the device. You can look at the mthca, ipath and ehca
> drivers' implementation of get_dma_mr() for examples.
>
> - R.
Re: [openib-general] Question on get_dma_mr()
Devesh> Here I am saying that assigning Key is sufficient Or there
Devesh> are some other specific steps to be taken?

It would depend on the device. You can look at the mthca, ipath and ehca drivers' implementation of get_dma_mr() for examples.

- R.
Re: [openib-general] Question on get_dma_mr()
On 3/29/06, Roland Dreier <[EMAIL PROTECTED]> wrote:
> Devesh> S/G entry ?
>
> scatter gather entry
>
> Devesh> What is the size of this region ? is there any limitation
> Devesh> in providing this size?
>
> It must be large enough to cover all DMA (bus) addresses for the device.
>
> Devesh> Finally you mean to say in the implementation of this
> Devesh> function providing a unique L_Key and R_Key is
> Devesh> sufficient. Is it?
>
> I can't really understand this question. Of course keys must be
> unique -- if two regions had the same key, then there would be no way
> for the HCA to know which one to use.

Here I am asking whether assigning a key is sufficient, or whether there are some other specific steps to be taken.

> - R.
Re: [openib-general] Question on get_dma_mr()
Devesh> S/G entry ?

scatter gather entry

Devesh> What is the size of this region ? is there any limitation
Devesh> in providing this size?

It must be large enough to cover all DMA (bus) addresses for the device.

Devesh> Finally you mean to say in the implementation of this
Devesh> function providing a unique L_Key and R_Key is
Devesh> sufficient. Is it?

I can't really understand this question. Of course keys must be unique -- if two regions had the same key, then there would be no way for the HCA to know which one to use.

- R.
Re: [openib-general] Question on get_dma_mr()
Thanks to all of you.

On 3/27/06, Roland Dreier <[EMAIL PROTECTED]> wrote:
> Devesh> Hello all, Please any body explain me about the
> Devesh> functionality of verbs ib_get_dma_mr()?
>
> Actually, the responses you've gotten are not quite right.
> ib_get_dma_mr() returns a memory region that can be used for any _bus_
> addresses. In other words, if an S/G entry is passed to the driver

S/G entry?

> that uses the L_Key from ib_get_dma_mr() and an address of, say,
> 0xdeadbeef, then the RDMA device should use a bus address of 0xdeadbeef
> to access that memory.

What is the size of this region? Is there any limitation in providing this size?

> The difference between bus addresses and physical addresses is
> significant when IOMMUs are present.
>
> This is somewhat similar to the verbs extensions notion of "reserved
> L_Key," except that it also provides an R_Key and the ability to
> specify the access permissions of the region.

Finally, you mean to say that in the implementation of this function, providing a unique L_Key and R_Key is sufficient. Is it?

> - R.
Re: [openib-general] question related to rdma_bind_addr
On Sun, 26 Mar 2006, Or Gerlitz wrote:
> > I would find calling it rdma_bind_device() confusing.
>
> Why? I find it very much unconfusing.

I associate the word bind with bind(2). For that reason, rdma_bind_addr() is a good name because it is the CMA's analog for bind(2). Since it isn't related to bind(2), I find the name rdma_bind_device(dst_addr) confusing.

> > In any event, I don't find the functionality very interesting.
>
> Hey, as I mentioned earlier in this thread, the interest came from a
> ***possible*** enhancement to the open iscsi initiator design, now
> being discussed, in which a transport (TCP/iSER/iSCSI offload
> HW/etc.) is asked to create its connection resources synchronously.
> Not sure what your interest in that is.

I was speaking from my experience with NFS/RDMA. If this functionality is necessary for implementing iSER, I would definitely support adding it.
Re: [openib-general] Question on get_dma_mr()
Devesh> Hello all, Please any body explain me about the
Devesh> functionality of verbs ib_get_dma_mr()?

Actually, the responses you've gotten are not quite right. ib_get_dma_mr() returns a memory region that can be used for any _bus_ addresses. In other words, if an S/G entry is passed to the driver that uses the L_Key from ib_get_dma_mr() and an address of, say, 0xdeadbeef, then the RDMA device should use a bus address of 0xdeadbeef to access that memory.

The difference between bus addresses and physical addresses is significant when IOMMUs are present.

This is somewhat similar to the verbs extensions notion of "reserved L_Key," except that it also provides an R_Key and the ability to specify the access permissions of the region.

- R.
Re: [openib-general] question related to rdma_bind_addr
James Lentini wrote:
> On Thu, 23 Mar 2006, Sean Hefty wrote:
> > I think that Or is just exploring the idea of synchronously binding
> > to a local *device* based on a remote address. This would allow an
> > application to bind, then allocate PDs, CQs, QPs, etc. up front,
> > rather than deferring resource allocation until address resolution
> > completes.

Exactly.

> > Yes - this is what rdma_bind_addr(src_addr) does. But I can envision
> > adding a new call, rdma_bind_device(dst_addr), provided some use for
> > it can be found.

Indeed, but hold your horses. I told you I was just trying to find out whether an implementation is possible; there is no real need yet...

> I would find calling it rdma_bind_device() confusing.

Why? I find it very much unconfusing.

> In any event, I don't find the functionality very interesting.

Hey, as I mentioned earlier in this thread, the interest came from a ***possible*** enhancement to the open iscsi initiator design, now being discussed, in which a transport (TCP/iSER/iSCSI offload HW/etc.) is asked to create its connection resources synchronously. Not sure what your interest in that is.

Or.
Re: [openib-general] Question on get_dma_mr()
Devesh Sharma wrote:
> Please any body explain me about the functionality of verbs
> ib_get_dma_mr()?
> What is the need of this function?
> what a driver implementer is supposed to implement in this function?

This function returns a memory region for all of system memory. See mthca_provider.c for an implementation.

- Sean
RE: [openib-general] question related to rdma_bind_addr
On Thu, 23 Mar 2006, Sean Hefty wrote:
> >What does it mean to bind to a remote address? What functionality
> >would that enable? Spoofing?
>
> I think that Or is just exploring the idea of synchronously binding
> to a local *device* based on a remote address.
>
> This would allow an application to bind, then allocate PDs, CQs,
> QPs, etc. up front, rather than deferring resource allocation until
> address resolution completes. A ULP may be able to take advantage
> of this, but I can't personally say that I know what benefit it
> would provide. (Maybe avoid the need to keep track of everything
> that must be allocated once address resolution completes?)
>
> >When I think of bind(2), I only think of binding to local
> >addresses.
>
> Yes - this is what rdma_bind_addr(src_addr) does. But I can
> envision adding a new call, rdma_bind_device(dst_addr), provided
> some use for it can be found.

I would find calling it rdma_bind_device() confusing. Could you modify the behavior of rdma_resolve_addr() to set the cma id's device field before returning? If so, that would be better than adding a new function.

In any event, I don't find the functionality very interesting. If the address resolved properly, it would speed up setup time (of course setup is not generally a bottleneck). In the error case (which is when address resolution would take a long time) things aren't any faster. Also consumers would still need to handle an asynchronous event.
Re: [openib-general] Question on get_dma_mr()
On Fri, 2006-03-24 at 12:51 +0530, Devesh Sharma wrote:
> Hello all,
>
> Please any body explain me about the functionality of verbs
> ib_get_dma_mr()?
> What is the need of this function?
> what a driver implementer is supposed to implement in this function?

It returns a MR that maps all of physical memory.

Steve.
RE: [openib-general] question related to rdma_bind_addr
>What does it mean to bind to a remote address? What functionality
>would that enable? Spoofing?

I think that Or is just exploring the idea of synchronously binding to a local *device* based on a remote address.

This would allow an application to bind, then allocate PDs, CQs, QPs, etc. up front, rather than deferring resource allocation until address resolution completes. A ULP may be able to take advantage of this, but I can't personally say that I know what benefit it would provide. (Maybe avoid the need to keep track of everything that must be allocated once address resolution completes?)

>When I think of bind(2), I only think of binding to local addresses.

Yes - this is what rdma_bind_addr(src_addr) does. But I can envision adding a new call, rdma_bind_device(dst_addr), provided some use for it can be found.

- Sean
RE: [openib-general] question related to rdma_bind_addr
On Thu, 23 Mar 2006, Sean Hefty wrote:
> >I could not confirm my assumptions from looking at the cma/addr
> >code, but if I am correct this opens the door for a future
> >enhancement of rdma_bind_addr() to work on non-local addresses.
>
> I believe that could be the case.

What does it mean to bind to a remote address? What functionality would that enable? Spoofing?

When I think of bind(2), I only think of binding to local addresses.
Re: [openib-general] question related to rdma_bind_addr
Sean Hefty wrote:
> > If my understanding is correct, the current code of rdma_bind_addr
> > assumes you supply it one of three (all are **src** address)
> > +1 ANY (0.0.0.0) addr
> > +2 local loopback addr
> > +3 other local addr
>
> Correct. - Note that currently a valid port number needs to be
> provided, but this is a temporary restriction.

I am not sure I understand your comment on the port number. Do you mean the ((struct sockaddr_in *)addr)->sin_port field of addr?

> > So it is not possible to synchronously create and bind the cma id to
> > an ib device based on the destination address (which is the typical
> > info the active side has).
>
> It is not possible to synchronously bind based on the destination
> address. rdma_bind_addr() binds synchronously to a local device based
> on a local address only. To bind based on a destination address, you
> use rdma_resolve_addr(). However, the lookup may involve issuing an
> ARP request in order to determine the remote hardware address, which
> is needed in resolving the route.

rdma_resolve_addr resolves two things based on the dest address:
+1 the local IB device to use (plus its port number, pkey etc.)
+2 the remote (dest) IB GID (or iWARP MAC)

Now, I was thinking that the first step of getting the local device based on the dest address is done by ip_route_output_key() and friends, so you synchronously get a network device (on which you later issue the ARP) whose private/rdma pointer is an ipoib_device, which has the ib device.

I could not confirm my assumptions from looking at the cma/addr code, but if I am correct this opens the door for a future enhancement of rdma_bind_addr() to work on non-local addresses.

> I'm not sure that binding to a local device synchronously based on a
> remote address is exactly impossible. But it doesn't remove the need
> to resolve the remote address to a hardware address, which is
> asynchronous.

Sure, I see that you kind of confirm my assumption that it's possible.

Thanks,
Or.
Re: [openib-general] question related to rdma_bind_addr
Or Gerlitz wrote:
> At this point i see an actual need, it just related to some change we
> discuss in the open scsi model for iser integration, and i wanted to
> make sure that currently creating the IB resources in synchronous
> manner is impossible.

Sorry, my fingers are broken today... I meant to say "I still ***don't*** see an actual need".

Or.
Re: [openib-general] Question On mad.c
Devesh Sharma wrote:
> In mad.c, while calling the ib_post_recv() operation:
>
>     spin_lock_irqsave(&recv_queue->lock, flags);
>     post = (++recv_queue->count < recv_queue->max_active);
>     list_add_tail(&mad_priv->header.mad_list.list, &recv_queue->list);
>     spin_unlock_irqrestore(&recv_queue->lock, flags);
>     ret = ib_post_recv(qp_info->qp, &recv_wr, &bad_recv_wr);
>
> This is in a while loop that runs as long as the "post" variable
> remains true. The value of max_active is 512, so the loop will run
> 512 times. If the QP on which this posting is going on does not
> support posting 512 receive descriptors, then what will happen?
> Although the max_recv supported is returned during QP creation, the
> loop is independent of it.

The QP is created with a size of IB_MAD_QP_RECV_SIZE (512). If the hardware cannot support this size of a QP, then the create QP call will fail. I.e. the hardware can provide a QP that is larger, but not smaller. The code cannot adjust to using a larger size without resizing the corresponding CQ, which is not yet supported.

- Sean
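The counting behaviour of that loop can be checked with a small user-space simulation. This is a sketch, not the mad.c code: `struct recv_queue_sim` and `post_until_full()` are hypothetical names, locking and list handling are left out, a counter increment stands in for ib_post_recv(), and the loop is assumed to be a do/while whose body runs at least once.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Queue size quoted in the thread for IB_MAD_QP_RECV_SIZE. */
#define MAD_QP_RECV_SIZE 512

/* Hypothetical stand-in for the receive queue bookkeeping. */
struct recv_queue_sim {
    int count;      /* receives currently posted */
    int max_active; /* queue capacity */
};

/* Run the posting loop once and return how many receives it posted.
 * The caller must pass a queue that is not already full. */
static int post_until_full(struct recv_queue_sim *q)
{
    int posted = 0;
    bool post;

    assert(q != NULL && q->count < q->max_active);

    do {
        /* Same test as in the quoted snippet: keep going only
         * while the queue is still below capacity after this one. */
        post = (++q->count < q->max_active);
        posted++; /* stands in for ib_post_recv() */
    } while (post);

    return posted;
}
```

Starting from an empty queue with max_active set to 512, the loop posts exactly 512 receives and stops with count equal to max_active, which matches the QP size requested at creation. That is why the loop never posts more than the QP was created to hold.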
Re: [openib-general] Question about QP access flags (struct ib_qp_attr.qp_access_flags)
Ralph Campbell wrote:
Ralph> When ib_modify_qp() is called with the IB_QP_ACCESS_FLAGS
Ralph> set in the mask, what values should be used in struct
Ralph> ib_qp_attr.qp_access_flags? The IB spec. seems to indicate
Ralph> that RDMA and atomic operations are all enabled or disabled
Ralph> as a group but all I see in ib_verbs.h is the enum
Ralph> ib_access_flags which is used for memory region access.
Ralph> These are more fine grained than the IB spec. implies for
Ralph> QPs. So I can see qp_access_flags being either a boolean
Ralph> or perhaps a new enum defined for the values for
Ralph> qp_access_flags.

Roland> I think the IB spec is at best ambiguous as to whether RDMA
Roland> and atomics are enabled as a group or not.

Roland> The values are IB_ACCESS_REMOTE_ATOMIC, IB_ACCESS_REMOTE_WRITE,
Roland> and IB_ACCESS_REMOTE_READ or-ed together I think.

Roland's response is correct. Atomics and RDMA reads are enabled separately. (See page 573 of release 1.2 of the spec. I interpreted the separate bullets to mean that they are set separately.) I think it makes sense to keep this distinction, since atomics are also an optional feature of an HCA.

If you look in cm.c for init_qp_attr, you can see how the IB CM sets the mask and QP attributes.

- Sean
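As a rough illustration of "or-ed together", the sketch below builds such a mask in plain C. The enum is a local copy made for the example: the names drop the IB_ prefix on purpose, and the bit positions are an assumption rather than a quotation from ib_verbs.h. `build_qp_access_flags()` is a hypothetical helper, not part of the verbs API.

```c
#include <assert.h>
#include <stdbool.h>

/* Local illustration only; the real flags live in ib_verbs.h as
 * enum ib_access_flags. Bit positions here are assumed. */
enum access_flags {
    ACCESS_LOCAL_WRITE   = 1,
    ACCESS_REMOTE_WRITE  = 1 << 1,
    ACCESS_REMOTE_READ   = 1 << 2,
    ACCESS_REMOTE_ATOMIC = 1 << 3
};

/* Build a QP access mask with each remote capability chosen
 * separately. Atomics are an optional HCA feature, so asking for
 * them on an HCA without atomics is treated as a caller bug here. */
static int build_qp_access_flags(bool allow_write, bool allow_read,
                                 bool allow_atomic, bool hca_has_atomics)
{
    int flags = 0;

    if (allow_write)
        flags |= ACCESS_REMOTE_WRITE;
    if (allow_read)
        flags |= ACCESS_REMOTE_READ;
    if (allow_atomic) {
        assert(hca_has_atomics);
        flags |= ACCESS_REMOTE_ATOMIC;
    }
    return flags;
}
```

In kernel code the resulting value would be stored in ib_qp_attr.qp_access_flags and passed to ib_modify_qp() with IB_QP_ACCESS_FLAGS set in the mask, as described in the question. Keeping read, write and atomic as separate bits is what lets a consumer enable RDMA reads without atomics.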
Re: [openib-general] Question about locked pages
Jeff> Ditto (I thought those were shmem values / didn't think they
Jeff> had any effect on Open IB). The information that I got was
Jeff> third-hand, which is why I posted here to ask about it. :-)

Jeff> I'll remove them from the FAQ entry -- any other comments?

Well, a normal user can't use "ulimit -l" to increase their limit on locked memory. However I've never really looked into what the cleanest way to increase the limit is. /etc/security/limits.conf is part of the answer, but ssh+privilege separation can cause that to break as well.

- R.
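One way to see which limit a process actually ended up with (for example after logging in over ssh) is to query it from inside the process. The sketch below uses getrlimit() with RLIMIT_MEMLOCK on Linux; `get_memlock_limit()` and `print_memlock_limit()` are hypothetical helper names introduced for the example.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/resource.h>

/* Fetch the soft and hard limits on locked memory for this process.
 * Returns 0 on success and -1 if getrlimit() fails. */
static int get_memlock_limit(rlim_t *soft, rlim_t *hard)
{
    struct rlimit rl;

    assert(soft != NULL && hard != NULL);

    if (getrlimit(RLIMIT_MEMLOCK, &rl) != 0)
        return -1;

    *soft = rl.rlim_cur;
    *hard = rl.rlim_max;
    return 0;
}

/* Print the soft limit in bytes, or "unlimited". */
static void print_memlock_limit(void)
{
    rlim_t soft, hard;

    if (get_memlock_limit(&soft, &hard) != 0) {
        perror("getrlimit");
        return;
    }
    if (soft == RLIM_INFINITY)
        printf("locked memory: unlimited\n");
    else
        printf("locked memory: %llu bytes\n", (unsigned long long)soft);
}
```

The soft limit is the value "ulimit -l" reports (in kilobytes), and an unprivileged process can raise it only up to the hard limit. On systems that use pam_limits, the corresponding entry in /etc/security/limits.conf is the memlock item; whether it takes effect for a given login path is exactly the ssh caveat raised above.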