Re: [infiniband-diags] [libibmad] Support new ibccquery congestion control tool
Hi Al, On Wed, Sep 21, 2011 at 2:25 PM, Albert Chu ch...@llnl.gov wrote: Hey Hal, Ira answered most of the important stuff already. But some extra details ... Second the current alignment breaks libibmad. So it would be a lot more code to support the miss-alignment and would probably have to be changed anyway. As far as I can tell, all the code in libibmad currently assumes 32 bit alignment. For example, BITSOFFS(48, 32) leads to an offset of -16, which is not good. I played around with updating the code in libibmad, but it was a nice chunk of changes for 1 measly field. We could remove the problem field until the issue is resolved in IBTA? Or perhaps we need to really redo chunks of libibmad to deal with this. OK; maybe this is a new case of misalignment than has been dealt with in the past. Ideally/hopefully, we don't need to deal with this. -- Hal Also, would you comment on what testing has been done with this ? I've gotten ibccquery to talk to Mellanox CAs and Switches and output the current config. I've got a side tool that can do rudimentary congestion control settings, so at the minimum I can twiddle some values and they show up in the output of ibccquery. I haven't done any testing w/ real network traffic yet, so there could be lingering bugs. But this is just the initial first pass. Al On Wed, 2011-09-21 at 08:49 -0700, Ira Weiny wrote: On Wed, 21 Sep 2011 07:17:38 -0700 Hal Rosenstock hal.rosenst...@gmail.com wrote: Hi Al, On Mon, Sep 19, 2011 at 6:06 PM, Albert Chu ch...@llnl.gov wrote: The following patches add a new tool ibccquery to infiniband-diags. It supports the querying of various congestion control settings. Related updates to libibmad are also included. Looks good to me :-) Just a few comments below: Attaching rather than inlining patches makes it harder to comment. On 0001-Add-support-for-congestion-control-mads.patch, is ib_rpc_cc_t really needed ? Couldn't mkey in existing rpc struct just be reused/overloaded for this (and change comment to indicate mkey or cckey) and then some code could be eliminated ? I am not sure I like overloading fields like this. I will take a look at it and see if it looks good but in general to keep ABI and clarity of the code I preferred the separate struct. On 0001-Support-ibccquery-congestion-control-query-tool.patch, I'm worried about the following: + /* XXX: Q3/2010 errata lists first entry offset at 80, but we assume + * will be updated to 96 once CurrentTimeStamp field is word aligned. + * In addition, assume max 13 log events instead of 16. Due to + * errata changes increasing size of CA log event, 16 log events is + * no longer possible to fit in max MAD size. + */ As far as the 13 v. 16 entries, this appears correct to me (MAD size) but I'm concerned about changing the offset from 80 to 96 for better alignment as this is putting the cart before the horse a little as since these changes have not been finalized AFAIK at the IBTA. Yes, it is a bit premature. I have submitted the above alignment as a comment to the IBTA but as you say it is not published. Most importantly the miss-alignment breaks the convention of the spec. So I don't think the IBTA will reject the comment. Second the current alignment breaks libibmad. So it would be a lot more code to support the miss-alignment and would probably have to be changed anyway. Also, would you comment on what testing has been done with this ? Right, the real question is what does current hardware do? We have been unable to determine if any of the vendors support the errata fully or specifically the miss-aligned CurrentTimeStamp. When I asked the vendors I got concrete answers back, so we proceeded with trying to reverse engineer it. Right now the query succeeds, that is all we know. Perhaps someone on the list can help us find out? :-D In the meantime we wanted to get comments on the patches. Ira -- Hal Al -- Albert Chu ch...@llnl.gov Computer Scientist High Performance Systems Division Lawrence Livermore National Laboratory -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- Albert Chu ch...@llnl.gov Computer Scientist High Performance Systems Division Lawrence Livermore National Laboratory -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: ib-diags: compatability issue with ibstat
On Wed, Sep 21, 2011 at 6:35 PM, Jason Gunthorpe jguntho...@obsidianresearch.com wrote: On Wed, Sep 21, 2011 at 03:32:54PM -0700, Ira Weiny wrote: To be clear I am not against using ibverbs on the linux side. It sounds like that would be the best move going forward. Technically umad already needs to use verbs on Linux because that is the only way to get the subnet timeout value that is needed to compute the MAD RPC timeout. Currently umad ignores this part of IBA ... Determining the MAD timeout is more complicated than this. Subnet timeout is insufficient (so you would need more than just this from verbs - things that would not be part of verbs) if this mechanism were to be made compliant rather than a configurable timeout as it is now. -- Hal Jason -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [ofw] ib-diags: compatability issue with ibstat
On Wed, Sep 21, 2011 at 8:14 PM, Ira Weiny wei...@llnl.gov wrote: On Wed, 21 Sep 2011 15:32:54 -0700 Ira Weiny wei...@llnl.gov wrote: On Wed, 21 Sep 2011 15:23:45 -0700 Hefty, Sean sean.he...@intel.com wrote: Does this mean ibstatus does not work on Windows? We do not support any of the scripts on windows. As far as I could tell, the scripts look like they just do post-processing of available output. Good to know, thanks. How are you proposing the addition to ibverbs? It seems this would break ABI there. On windows, libibumad uses libibverbs to obtain whatever data it needs. I'm assuming that a non-MAD application will eventually be able to use ibverbs to determine the rate, without needing to parse data from some file. Does windows have sysfs like capabilities which would make the separate layer clean? There's nothing like sysfs on windows. If the data is available from ibverbs, windows could pull the data from there, even if the diags do not want to rely on ibverbs on the linux side. To be clear I am not against using ibverbs on the linux side. It sounds like that would be the best move going forward. Honestly, ibverbs supplies similar functionality in ibv_devinfo. :-/ It seems both ibstat and ibstatus should be dropped and ibv_devinfo enhanced to supply, Rate, Physical state, Capability Mask, and the transport (on a per port basis)[*]. There might be some other functionality as well. Roland, ibv_devinfo, is in the examples directory. Is there any reason this should not be used as the official tool? If not I think we should use the code from ibv_devinfo as a basis for ibstat. But I hate to see 2 implementations so close. Even if ibv_devinfo is updated to include the additional information, do we want to require libibverbs, etc. on any IB management machine just for this ? That's not the case today on IB management nodes. -- Hal Ira [*] There are cards which support this. However, ibverbs does not have a transport field for each port. Example: 15:54:10 ibv_devinfo -d mlx4_1 | grep trans transport: InfiniBand (0) 15:55:22 ibstat mlx4_1 | grep Link Physical state: LinkUp Link layer: InfiniBand Link layer: Ethernet Full output from ibstat: 15:55:32 ibstat mlx4_1 CA 'mlx4_1' CA type: MT26428 Number of ports: 2 Firmware version: 2.9.1000 Hardware version: b0 Node GUID: 0x0002c9030008e7f0 System image GUID: 0x0002c9030008e7f3 Port 1: State: Initializing Physical state: LinkUp Rate: 10 Base lid: 0 LMC: 0 SM lid: 0 Capability mask: 0x02510868 Port GUID: 0x0002c9030008e7f1 Link layer: InfiniBand Port 2: State: Down Physical state: Disabled Rate: 10 Base lid: 0 LMC: 0 SM lid: 0 Capability mask: 0x0001 Port GUID: 0x0202c9fffe08e7f1 Link layer: Ethernet Ira - Sean -- Ira Weiny Math Programmer/Computer Scientist Lawrence Livermore National Lab 925-423-8008 wei...@llnl.gov -- Ira Weiny Member of Technical Staff Lawrence Livermore National Lab 925-423-8008 wei...@llnl.gov ___ ofw mailing list o...@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ofw -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [ofw] ib-diags: compatability issue with ibstat
On Wed, Sep 21, 2011 at 3:06 PM, Hefty, Sean sean.he...@intel.com wrote: commit 1344cb3feacafc462440dabfa5997c5205486d83 added support for FDR10 in a way that is not compatible with Windows support. Windows does not use files to read attribute information. I will probably need to obtain the necessary information using ibverbs on windows by reading port attributes. I don't think FDR10 support is available through ibverbs on linux yet, but would this be acceptable? Is there some other way that you'd like to handle this? The other option I can think of is moving is_fdr10() and sys_read_string() out of ibstat.c and into a linux specific file, so that windows can provide its own implementation. Thoughts? I think there are 2 (related but somewhat separate) issues here: 1. How to obtain is_fdr10 2. Whether ibverbs should be used for this (in windows, linux, both) The only way to determine whether fdr10 is active or not is via the vendor proprietary MAD. That info may be reflected in some other API (and/or file) so that MAD does not need to be reissued. In a separate thread on linux-rdma, there was discussion on a couple of different ways to do that from verbs and in this thread that there's no sysfs equivalent in Windows. You've already stated that the Windows support is using libibverbs for libibumad support so it seems appropriate to me to do the same here (in Windows at least). -- Hal - Sean ___ ofw mailing list o...@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ofw -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [ofw] ib-diags: compatability issue with ibstat
On Fri, Sep 23, 2011 at 9:28 AM, Hal Rosenstock hal.rosenst...@gmail.com wrote: On Wed, Sep 21, 2011 at 3:06 PM, Hefty, Sean sean.he...@intel.com wrote: commit 1344cb3feacafc462440dabfa5997c5205486d83 added support for FDR10 in a way that is not compatible with Windows support. Windows does not use files to read attribute information. I will probably need to obtain the necessary information using ibverbs on windows by reading port attributes. I don't think FDR10 support is available through ibverbs on linux yet, but would this be acceptable? Is there some other way that you'd like to handle this? The other option I can think of is moving is_fdr10() and sys_read_string() out of ibstat.c and into a linux specific file, so that windows can provide its own implementation. Thoughts? I think there are 2 (related but somewhat separate) issues here: 1. How to obtain is_fdr10 2. Whether ibverbs should be used for this (in windows, linux, both) The only way to determine whether fdr10 is active or not is via the vendor proprietary MAD. That info may be reflected in some other API (and/or file) so that MAD does not need to be reissued. In a separate thread on linux-rdma, there was discussion on a couple of different ways to do that from verbs and in this thread that there's no sysfs equivalent in Windows. You've already stated that the Windows support is using libibverbs for libibumad support so it seems appropriate to me to do the same here (in Windows at least). The proper place for is_fdr10 is in libibumad (and then we wouldn't be discussing libibverbs w/ibstat) but that was not done to avoid a change to the umad port structure. -- Hal -- Hal - Sean ___ ofw mailing list o...@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ofw -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [ofw] ib-diags: compatability issue with ibstat
Even if ibv_devinfo is updated to include the additional information, do we want to require libibverbs, etc. on any IB management machine just for this ? That's not the case today on IB management nodes. IMO, it's acceptable for ibverbs to be the basic requirement for any IB userspace application. I think you'll find the ibumad duplicates functionally that ibverbs provides, and that's why windows already has the dependency built in. - Sean
RE: [ofw] ib-diags: compatability issue with ibstat
The only way to determine whether fdr10 is active or not is via the vendor proprietary MAD. That info may be reflected in some other API (and/or file) so that MAD does not need to be reissued. In a separate thread on linux-rdma, there was discussion on a couple of different ways to do that from verbs and in this thread that there's no sysfs equivalent in Windows. You've already stated that the Windows support is using libibverbs for libibumad support so it seems appropriate to me to do the same here (in Windows at least). The proper place for is_fdr10 is in libibumad (and then we wouldn't be discussing libibverbs w/ibstat) but that was not done to avoid a change to the umad port structure. ibstat determines is_fdr10 by reading a file. That same data could just as easily be exported as a port attribute, which would make it belong to ibverbs, rather than umad. If the only point of exporting is_fdr10 from the kernel is for an ib management diag to display the value, then it's not a useful value for verbs applications.
RE: [ofw] ib-diags: compatability issue with ibstat
So far, it seems that the choices are: 1. Remove ibstat and use ibv_devinfo instead 2. Change ibstat to obtain the fdr10 information using ibverbs 3. Move the is_fdr10 functionality into OS specific files or code sections of ibstat 4. Change ibstat to obtain fdr10 data using MADs or some other OS independent mechanism 5. Change ibumad so that it provides the fdr10 data Note: umad has OS dependent implementation anyway, but the windows umad would depend on the data being available from ibverbs 6. ??? Can someone explain why the information can't come from the umad_port::rate field? - Sean N�r��yb�X��ǧv�^�){.n�+{��ٚ�{ay�ʇڙ�,j��f���h���z��w��� ���j:+v���w�j�mzZ+�ݢj��!�i
Re: ib-diags: compatability issue with ibstat
On Fri, Sep 23, 2011 at 08:57:12AM -0400, Hal Rosenstock wrote: Technically umad already needs to use verbs on Linux because that is the only way to get the subnet timeout value that is needed to compute the MAD RPC timeout. Currently umad ignores this part of IBA ... Determining the MAD timeout is more complicated than this. Subnet timeout is insufficient (so you would need more than just this from verbs - things that would not be part of verbs) if this mechanism were to be made compliant rather than a configurable timeout as it is now. I didn't have any problem with this in python-rdma. The only wonky bit was that verbs was needed for subnet timeout because it isn't in sysfs. Everything else went according to the spec. Jason -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [ofw] ib-diags: compatability issue with ibstat
1. Remove ibstat and use ibv_devinfo instead 2. Change ibstat to obtain the fdr10 information using ibverbs 3. Move the is_fdr10 functionality into OS specific files or code sections of ibstat 4. Change ibstat to obtain fdr10 data using MADs or some other OS independent mechanism 5. Change ibumad so that it provides the fdr10 data Note: umad has OS dependent implementation anyway, but the windows umad would depend on the data being available from ibverbs 6. ??? Can someone explain why the information can't come from the umad_port::rate field? Because fdr10 is non standard, there's no rate for this; it overloads QDR based rates as both are 10 Gbps but have different link encodings. Ok, then for choices 6-8: 6. Have ibstat just report the rate without specifying the link encoding 7. Use umad_port:rate to encode the rate, link encoding, inter-packet delay, etc. (I'm only partially kidding here.) 8. Drop ibstat from windows. Windows users will be directed to use ibv_devinfo. Ira, what are your thoughts on this?
Re: ib-diags: compatability issue with ibstat
On 9/23/2011 1:55 PM, Jason Gunthorpe wrote: On Fri, Sep 23, 2011 at 01:31:20PM -0400, Hal Rosenstock wrote: The only wonky bit was that verbs was needed for subnet timeout because it isn't in sysfs. Everything else went according to the spec. I don't see how your algorithm can be compliant with just subnet timeout. Really? 13.4.6.3 specifies the formula CommTimeValueOutCommTimeValueInRespTimeValue 4.096 µsec x ( 2 +2 +2 ) Which boils down to 4.096 µsec x (2 x 2**(Subnet_Timeout) + 2**(RespTimeValue)) For the on subnet case. RespTimeValue comes from the class port info of the target manager, and C13-13.1.1 gives guidance what value to use for the initial class port info query to get the value. Which is exactly what I implemented. It isn't hard. and for SM MADs you need PortInfo:RespTimeValue (do you do SM MADs too ?) My point was that verbs alone is insufficient as you need a MAD to do this and the first round trip needs an artificial timeout (before you know the RespTimeValue from wherever you get it). Clearly, for non SM MADs, having subnet timeout as a port attribute saves that round trip. -- Hal Jason -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 33/55] infiniband: irq: Remove IRQF_DISABLED
On Thu, Sep 22, 2011 at 6:48 PM, Yong Zhang yong.zha...@gmail.com wrote: So like I said above, you could take it if you want to. Great, I queued this up for 3.2, thanks. -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: ib-diags: compatability issue with ibstat
On Fri, Sep 23, 2011 at 02:01:50PM -0400, Hal Rosenstock wrote: On 9/23/2011 1:55 PM, Jason Gunthorpe wrote: On Fri, Sep 23, 2011 at 01:31:20PM -0400, Hal Rosenstock wrote: The only wonky bit was that verbs was needed for subnet timeout because it isn't in sysfs. Everything else went according to the spec. I don't see how your algorithm can be compliant with just subnet timeout. Really? 13.4.6.3 specifies the formula CommTimeValueOutCommTimeValueInRespTimeValue 4.096 ?sec x ( 2 +2 +2 ) Which boils down to 4.096 ?sec x (2 x 2**(Subnet_Timeout) + 2**(RespTimeValue)) For the on subnet case. RespTimeValue comes from the class port info of the target manager, and C13-13.1.1 gives guidance what value to use for the initial class port info query to get the value. Which is exactly what I implemented. It isn't hard. and for SM MADs you need PortInfo:RespTimeValue (do you do SM MADs too ?) You should really read what I wrote. To be very clear, my implementation uses SubnetTimeout and a fixed value for RespTimeValue for initial queries directed at the SA. It switches to SubnetTimeout plus CPI.RespTimeValue once the SA's CPI is queried. It uses PathRecord.PacketLifetime and a fixed value for RespTimeValue for initial queries to non-SA managers in the network, and again switches to CPI.RespTimeValue once the manager's CPI is queried - eg the PMA code does all this. This is all exactly the required behavior outlined in the spec. My point was that verbs alone is insufficient as you need a MAD to do this and the first round trip needs an artificial timeout (before you know the RespTimeValue from wherever you get it). Clearly, for non SM MADs, having subnet timeout as a port attribute saves that round trip. And my original point is that Subnet_Timeout is not optional, IBA says you need the value to compute the timeouts. Jason -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 4/7] IB/qib: qpn lookup optimizations
The heavy weight spinlock in qib_lookup_qpn() is replaced with the RCU locking mechanism. The hash list itself is now accessed via jhash functions vs. the mod. The changes should benefit multiple receive contexts in different processors by not contending for the lock to just read the hash structures. The patch also adds a lookaside_qp (pointer) and a lookaside_qpn in the context. The interrupt handler will test the current packet's qpn against lookaside_qpn if the lookaside_qp pointer is non-null. The pointer is NULL'ed when the interrupt handler exits. Signed-off-by: Mike Marciniszyn mike.marcinis...@qlogic.com --- drivers/infiniband/hw/qib/qib.h|3 + drivers/infiniband/hw/qib/qib_driver.c |9 drivers/infiniband/hw/qib/qib_qp.c | 77 drivers/infiniband/hw/qib/qib_verbs.c | 36 ++- drivers/infiniband/hw/qib/qib_verbs.h |3 + 5 files changed, 87 insertions(+), 41 deletions(-) diff --git a/drivers/infiniband/hw/qib/qib.h b/drivers/infiniband/hw/qib/qib.h index 97e6233..b881bdc 100644 --- a/drivers/infiniband/hw/qib/qib.h +++ b/drivers/infiniband/hw/qib/qib.h @@ -223,6 +223,9 @@ struct qib_ctxtdata { /* ctxt rcvhdrq head offset */ u32 head; u32 pkt_count; + /* lookaside fields */ + struct qib_qp *lookaside_qp; + u32 lookaside_qpn; /* QPs waiting for context processing */ struct list_head qp_wait_list; }; diff --git a/drivers/infiniband/hw/qib/qib_driver.c b/drivers/infiniband/hw/qib/qib_driver.c index 89264ff..d35c9d3 100644 --- a/drivers/infiniband/hw/qib/qib_driver.c +++ b/drivers/infiniband/hw/qib/qib_driver.c @@ -547,6 +547,15 @@ move_along: updegr = 0; } } + /* +* Notify qib_destroy_qp() if it is waiting +* for lookaside_qp to finish. +*/ + if (rcd-lookaside_qp) { + if (atomic_dec_and_test(rcd-lookaside_qp-refcount)) + wake_up(rcd-lookaside_qp-wait); + rcd-lookaside_qp = NULL; + } rcd-head = l; rcd-pkt_count += i; diff --git a/drivers/infiniband/hw/qib/qib_qp.c b/drivers/infiniband/hw/qib/qib_qp.c index 9d094f9..500981b 100644 --- a/drivers/infiniband/hw/qib/qib_qp.c +++ b/drivers/infiniband/hw/qib/qib_qp.c @@ -34,6 +34,7 @@ #include linux/err.h #include linux/vmalloc.h +#include linux/jhash.h #include qib.h @@ -204,6 +205,13 @@ static void free_qpn(struct qib_qpn_table *qpt, u32 qpn) clear_bit(qpn BITS_PER_PAGE_MASK, map-page); } +static inline unsigned qpn_hash(struct qib_ibdev *dev, u32 qpn) +{ + return jhash_1word(qpn, dev-qp_rnd) + (dev-qp_table_size - 1); +} + + /* * Put the QP into the hash table. * The hash table holds a reference to the QP. @@ -211,22 +219,23 @@ static void free_qpn(struct qib_qpn_table *qpt, u32 qpn) static void insert_qp(struct qib_ibdev *dev, struct qib_qp *qp) { struct qib_ibport *ibp = to_iport(qp-ibqp.device, qp-port_num); - unsigned n = qp-ibqp.qp_num % dev-qp_table_size; unsigned long flags; + unsigned n = qpn_hash(dev, qp-ibqp.qp_num); spin_lock_irqsave(dev-qpt_lock, flags); + atomic_inc(qp-refcount); if (qp-ibqp.qp_num == 0) - ibp-qp0 = qp; + rcu_assign_pointer(ibp-qp0, qp); else if (qp-ibqp.qp_num == 1) - ibp-qp1 = qp; + rcu_assign_pointer(ibp-qp1, qp); else { qp-next = dev-qp_table[n]; - dev-qp_table[n] = qp; + rcu_assign_pointer(dev-qp_table[n], qp); } - atomic_inc(qp-refcount); spin_unlock_irqrestore(dev-qpt_lock, flags); + synchronize_rcu(); } /* @@ -236,29 +245,32 @@ static void insert_qp(struct qib_ibdev *dev, struct qib_qp *qp) static void remove_qp(struct qib_ibdev *dev, struct qib_qp *qp) { struct qib_ibport *ibp = to_iport(qp-ibqp.device, qp-port_num); - struct qib_qp *q, **qpp; + unsigned n = qpn_hash(dev, qp-ibqp.qp_num); unsigned long flags; - qpp = dev-qp_table[qp-ibqp.qp_num % dev-qp_table_size]; - spin_lock_irqsave(dev-qpt_lock, flags); if (ibp-qp0 == qp) { - ibp-qp0 = NULL; atomic_dec(qp-refcount); + rcu_assign_pointer(ibp-qp0, NULL); } else if (ibp-qp1 == qp) { - ibp-qp1 = NULL; atomic_dec(qp-refcount); - } else + rcu_assign_pointer(ibp-qp1, NULL); + } else { + struct qib_qp *q, **qpp; + + qpp = dev-qp_table[n]; for (; (q = *qpp) != NULL; qpp = q-next) if (q == qp) { - *qpp = qp-next; - qp-next = NULL; atomic_dec(qp-refcount); +
[PATCH 1/7] IB/qib: Optimize RC/UC code by IB operation
The memset for zeroing the work completion had been unconditional. This patch removes the memset and moves the zeroing into the work completion with a more explicit field by field set. With this patch, non-ONLY/non-LAST packets will avoid the overhead since they will not generate a completion. Signed-off-by: Mike Marciniszyn mike.marcinis...@qlogic.com --- drivers/infiniband/hw/qib/qib_rc.c | 19 +-- drivers/infiniband/hw/qib/qib_uc.c | 14 +++--- 2 files changed, 24 insertions(+), 9 deletions(-) diff --git a/drivers/infiniband/hw/qib/qib_rc.c b/drivers/infiniband/hw/qib/qib_rc.c index eca0c41..5c1c49b 100644 --- a/drivers/infiniband/hw/qib/qib_rc.c +++ b/drivers/infiniband/hw/qib/qib_rc.c @@ -1955,8 +1955,6 @@ void qib_rc_rcv(struct qib_ctxtdata *rcd, struct qib_ib_header *hdr, break; } - memset(wc, 0, sizeof wc); - if (qp-state == IB_QPS_RTR !(qp-r_flags QIB_R_COMM_EST)) { qp-r_flags |= QIB_R_COMM_EST; if (qp-ibqp.event_handler) { @@ -2009,16 +2007,19 @@ send_middle: goto rnr_nak; qp-r_rcv_len = 0; if (opcode == OP(SEND_ONLY)) - goto send_last; - /* FALLTHROUGH */ + goto no_immediate_data; + /* FALLTHROUGH for SEND_ONLY_WITH_IMMEDIATE */ case OP(SEND_LAST_WITH_IMMEDIATE): send_last_imm: wc.ex.imm_data = ohdr-u.imm_data; hdrsize += 4; wc.wc_flags = IB_WC_WITH_IMM; - /* FALLTHROUGH */ + goto send_last; case OP(SEND_LAST): case OP(RDMA_WRITE_LAST): +no_immediate_data: + wc.wc_flags = 0; + wc.ex.imm_data = 0; send_last: /* Get the number of bytes the message was padded by. */ pad = (be32_to_cpu(ohdr-bth[0]) 20) 3; @@ -2051,6 +2052,12 @@ send_last: wc.src_qp = qp-remote_qpn; wc.slid = qp-remote_ah_attr.dlid; wc.sl = qp-remote_ah_attr.sl; + /* zero fields that are N/A */ + wc.vendor_err = 0; + wc.pkey_index = 0; + wc.dlid_path_bits = 0; + wc.port_num = 0; + wc.csum_ok = 0; /* Signal completion event if the solicited bit is set. */ qib_cq_enter(to_icq(qp-ibqp.recv_cq), wc, (ohdr-bth[0] @@ -2089,7 +2096,7 @@ send_last: if (opcode == OP(RDMA_WRITE_FIRST)) goto send_middle; else if (opcode == OP(RDMA_WRITE_ONLY)) - goto send_last; + goto no_immediate_data; ret = qib_get_rwqe(qp, 1); if (ret 0) goto nack_op_err; diff --git a/drivers/infiniband/hw/qib/qib_uc.c b/drivers/infiniband/hw/qib/qib_uc.c index 32ccf3c..d31f33a 100644 --- a/drivers/infiniband/hw/qib/qib_uc.c +++ b/drivers/infiniband/hw/qib/qib_uc.c @@ -270,7 +270,6 @@ void qib_uc_rcv(struct qib_ibport *ibp, struct qib_ib_header *hdr, psn = be32_to_cpu(ohdr-bth[2]); opcode = 24; - memset(wc, 0, sizeof wc); /* Compare the PSN verses the expected PSN. */ if (unlikely(qib_cmp24(psn, qp-r_psn) != 0)) { @@ -370,7 +369,7 @@ send_first: } qp-r_rcv_len = 0; if (opcode == OP(SEND_ONLY)) - goto send_last; + goto no_immediate_data; else if (opcode == OP(SEND_ONLY_WITH_IMMEDIATE)) goto send_last_imm; /* FALLTHROUGH */ @@ -389,8 +388,11 @@ send_last_imm: wc.ex.imm_data = ohdr-u.imm_data; hdrsize += 4; wc.wc_flags = IB_WC_WITH_IMM; - /* FALLTHROUGH */ + goto send_last; case OP(SEND_LAST): +no_immediate_data: + wc.ex.imm_data = 0; + wc.wc_flags = 0; send_last: /* Get the number of bytes the message was padded by. */ pad = (be32_to_cpu(ohdr-bth[0]) 20) 3; @@ -418,6 +420,12 @@ last_imm: wc.src_qp = qp-remote_qpn; wc.slid = qp-remote_ah_attr.dlid; wc.sl = qp-remote_ah_attr.sl; + /* zero fields that are N/A */ + wc.vendor_err = 0; + wc.pkey_index = 0; + wc.dlid_path_bits = 0; + wc.port_num = 0; + wc.csum_ok = 0; /* Signal completion event if the solicited bit is set. */ qib_cq_enter(to_icq(qp-ibqp.recv_cq), wc, (ohdr-bth[0] -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 7/7] IB/qib: remove s_lock around header validation
Observation in qib_ruc_check_hdr() shows that the s_lock is not required in the normal case. The r_lock is held in all cases, and protects the qp fields that are read. The s_lock will be needed to around the call to qib_migrate_qp() to insure that the send engine sees a consistent set of fields. Signed-off-by: Mike Marciniszyn mike.marcinis...@qlogic.com --- drivers/infiniband/hw/qib/qib_driver.c |5 - drivers/infiniband/hw/qib/qib_rc.c |4 +--- drivers/infiniband/hw/qib/qib_ruc.c|7 ++- drivers/infiniband/hw/qib/qib_uc.c |7 +-- 4 files changed, 8 insertions(+), 15 deletions(-) diff --git a/drivers/infiniband/hw/qib/qib_driver.c b/drivers/infiniband/hw/qib/qib_driver.c index 1ba9da9..8191925 100644 --- a/drivers/infiniband/hw/qib/qib_driver.c +++ b/drivers/infiniband/hw/qib/qib_driver.c @@ -313,7 +313,6 @@ static u32 qib_rcv_hdrerr(struct qib_ctxtdata *rcd, struct qib_pportdata *ppd, u32 opcode; u32 psn; int diff; - unsigned long flags; /* Sanity check packet */ if (tlen 24) @@ -368,7 +367,6 @@ static u32 qib_rcv_hdrerr(struct qib_ctxtdata *rcd, struct qib_pportdata *ppd, switch (qp-ibqp.qp_type) { case IB_QPT_RC: - spin_lock_irqsave(qp-s_lock, flags); ruc_res = qib_ruc_check_hdr( ibp, hdr, @@ -376,11 +374,8 @@ static u32 qib_rcv_hdrerr(struct qib_ctxtdata *rcd, struct qib_pportdata *ppd, qp, be32_to_cpu(ohdr-bth[0])); if (ruc_res) { - spin_unlock_irqrestore(qp-s_lock, - flags); goto unlock; } - spin_unlock_irqrestore(qp-s_lock, flags); /* Only deal with RDMA Writes for now */ if (opcode diff --git a/drivers/infiniband/hw/qib/qib_rc.c b/drivers/infiniband/hw/qib/qib_rc.c index a75b713..23c905e 100644 --- a/drivers/infiniband/hw/qib/qib_rc.c +++ b/drivers/infiniband/hw/qib/qib_rc.c @@ -1889,10 +1889,8 @@ void qib_rc_rcv(struct qib_ctxtdata *rcd, struct qib_ib_header *hdr, } opcode = be32_to_cpu(ohdr-bth[0]); - spin_lock_irqsave(qp-s_lock, flags); if (qib_ruc_check_hdr(ibp, hdr, has_grh, qp, opcode)) - goto sunlock; - spin_unlock_irqrestore(qp-s_lock, flags); + return; psn = be32_to_cpu(ohdr-bth[2]); opcode = 24; diff --git a/drivers/infiniband/hw/qib/qib_ruc.c b/drivers/infiniband/hw/qib/qib_ruc.c index eb78d93..b4b37e4 100644 --- a/drivers/infiniband/hw/qib/qib_ruc.c +++ b/drivers/infiniband/hw/qib/qib_ruc.c @@ -260,12 +260,15 @@ static int gid_ok(union ib_gid *gid, __be64 gid_prefix, __be64 id) /* * - * This should be called with the QP s_lock held. + * This should be called with the QP r_lock held. + * + * The s_lock will be acquired around the qib_migrate_qp() call. */ int qib_ruc_check_hdr(struct qib_ibport *ibp, struct qib_ib_header *hdr, int has_grh, struct qib_qp *qp, u32 bth0) { __be64 guid; + unsigned long flags; if (qp-s_mig_state == IB_MIG_ARMED (bth0 IB_BTH_MIG_REQ)) { if (!has_grh) { @@ -295,7 +298,9 @@ int qib_ruc_check_hdr(struct qib_ibport *ibp, struct qib_ib_header *hdr, if (be16_to_cpu(hdr-lrh[3]) != qp-alt_ah_attr.dlid || ppd_from_ibp(ibp)-port != qp-alt_ah_attr.port_num) goto err; + spin_lock_irqsave(qp-s_lock, flags); qib_migrate_qp(qp); + spin_unlock_irqrestore(qp-s_lock, flags); } else { if (!has_grh) { if (qp-remote_ah_attr.ah_flags IB_AH_GRH) diff --git a/drivers/infiniband/hw/qib/qib_uc.c b/drivers/infiniband/hw/qib/qib_uc.c index 7aa346f..56fa7a9 100644 --- a/drivers/infiniband/hw/qib/qib_uc.c +++ b/drivers/infiniband/hw/qib/qib_uc.c @@ -243,7 +243,6 @@ void qib_uc_rcv(struct qib_ibport *ibp, struct qib_ib_header *hdr, int has_grh, void *data, u32 tlen, struct qib_qp *qp) { struct qib_other_headers *ohdr; - unsigned long flags; u32 opcode; u32 hdrsize; u32 psn; @@ -263,10 +262,8 @@ void qib_uc_rcv(struct qib_ibport *ibp, struct qib_ib_header *hdr, } opcode = be32_to_cpu(ohdr-bth[0]); - spin_lock_irqsave(qp-s_lock, flags); if (qib_ruc_check_hdr(ibp, hdr, has_grh, qp, opcode)) - goto sunlock; - spin_unlock_irqrestore(qp-s_lock,
[PATCH 6/7] IB/qib: memcpy optimizations
The default memcpy used by qib_copy_sge() ends up being a rep movsb on x86_64, which is pretty slow. This fix adds an x86_64 specific routine that 1) probes for X86_FEATURE_REP_GOOD and 2) uses an inline asm routine builton rep movsq that testing has shown is better than the builtin memcpy for all cases up to 4K. The probing routine is now called when the qib module is loaded to enable the optimization. When X86_FEATURE_REP_GOOD is not set, the routine uses the kernel's unrolled __memcpy when the length is more than 64 and the builtin memcpy otherwise. This patch also adds the cache bypass copies from older releases. Testing has shown that AMD cpus benefit with a 40% improvement in netperf/ipoib. The cache_bypass_copy module parameter can be used to enable on non-AMD CPUs. The qib_verbs_send_dma() and qib_copy_from_sge are also changed to use the memcpy_string_op() to improve packet delivery performance to the send engine. The existing copy as well as a new stub probe routine are maintained as weak symbols for other architectures. Signed-off-by: Mike Marciniszyn mike.marcinis...@qlogic.com --- drivers/infiniband/hw/qib/Makefile |2 .../infiniband/hw/qib/memcpy_cachebypass_x86_64.S | 266 drivers/infiniband/hw/qib/qib_copy_sge_x86_64.c| 146 +++ drivers/infiniband/hw/qib/qib_driver.c |5 drivers/infiniband/hw/qib/qib_init.c |1 drivers/infiniband/hw/qib/qib_rc.c |1 drivers/infiniband/hw/qib/qib_uc.c |1 drivers/infiniband/hw/qib/qib_verbs.c | 16 + drivers/infiniband/hw/qib/qib_verbs.h | 56 9 files changed, 488 insertions(+), 6 deletions(-) create mode 100644 drivers/infiniband/hw/qib/memcpy_cachebypass_x86_64.S create mode 100644 drivers/infiniband/hw/qib/qib_copy_sge_x86_64.c diff --git a/drivers/infiniband/hw/qib/Makefile b/drivers/infiniband/hw/qib/Makefile index f12d7bb..911e24c 100644 --- a/drivers/infiniband/hw/qib/Makefile +++ b/drivers/infiniband/hw/qib/Makefile @@ -11,5 +11,5 @@ ib_qib-y := qib_cq.o qib_diag.o qib_dma.o qib_driver.o qib_eeprom.o \ # 6120 has no fallback if no MSI interrupts, others can do INTx ib_qib-$(CONFIG_PCI_MSI) += qib_iba6120.o -ib_qib-$(CONFIG_X86_64) += qib_wc_x86_64.o +ib_qib-$(CONFIG_X86_64) += qib_wc_x86_64.o qib_copy_sge_x86_64.o memcpy_cachebypass_x86_64.o ib_qib-$(CONFIG_PPC64) += qib_wc_ppc64.o diff --git a/drivers/infiniband/hw/qib/memcpy_cachebypass_x86_64.S b/drivers/infiniband/hw/qib/memcpy_cachebypass_x86_64.S new file mode 100644 index 000..41096c2 --- /dev/null +++ b/drivers/infiniband/hw/qib/memcpy_cachebypass_x86_64.S @@ -0,0 +1,266 @@ +/* + * Copyright (c) 2006, 2007, 2008 QLogic Corporation. All rights reserved. + * Copyright (c) 2003, 2004, 2005, 2006 PathScale, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + *copyright notice, this list of conditions and the following + *disclaimer. + * + * - Redistributions in binary form must reproduce the above + *copyright notice, this list of conditions and the following + *disclaimer in the documentation and/or other materials + *provided with the distribution. + * + * THE SOFTWARE IS PROVIDED AS IS, WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ + + .text + .p2align 4,,15 + /* rdi destination, rsi source, rdx count */ + .globl memcpy_cachebypass + .type memcpy_cachebypass, @function +# loads bypass the cache, stores fill the cache +memcpy_cachebypass: + movq%rdi, %rax +.L5: + cmpq$15, %rdx + ja .L34 +.L3: + cmpl$8, %edx/* rdx is 0..15 */ + jbe .L9 +.L6: + testb $8, %dxl/* rdx is 3,5,6,7,9..15 */ + je .L13 + movq(%rsi), %rcx + addq$8, %rsi + movq%rcx, (%rdi) + addq$8, %rdi +.L13: + testb $4, %dxl + je .L15 + movl(%rsi), %ecx + addq$4, %rsi + movl
[PATCH] IB/qib: Add logic for affinity hint
Call irq_set_affinity_hint to provide user mode programs like irqbalance the information to be able to distribute qib interrupts appropriately. The logic allocates all non-receive interrupts to the first CPU local to the HCA. Receive interrupts are allocated round robin starting with the second CPU local to the HCA with potential wrap back to the second CPU. This patch also adds a refinement to the name registered with MSIX interrupts so that user level scripts can determine the device associated with the IRQs when there are multiple HCAs with a potentially different set of local CPUs. Signed-off-by: Mike Marciniszyn mike.marcinis...@qlogic.com --- drivers/infiniband/hw/qib/qib.h | 10 +++ drivers/infiniband/hw/qib/qib_iba7322.c | 107 ++- drivers/infiniband/hw/qib/qib_pcie.c| 21 ++ 3 files changed, 104 insertions(+), 34 deletions(-) diff --git a/drivers/infiniband/hw/qib/qib.h b/drivers/infiniband/hw/qib/qib.h index b881bdc..6b811e3 100644 --- a/drivers/infiniband/hw/qib/qib.h +++ b/drivers/infiniband/hw/qib/qib.h @@ -427,6 +427,14 @@ struct qib_verbs_txreq { /* how often we check for packet activity for power on hours (in seconds) */ #define ACTIVITY_TIMER 5 +#define MAX_NAME_SIZE 64 +struct qib_msix_entry { + struct msix_entry msix; + void *arg; + char name[MAX_NAME_SIZE]; + cpumask_var_t mask; +}; + /* Below is an opaque struct. Each chip (device) can maintain * private data needed for its operation, but not germane to the * rest of the driver. For convenience, we define another that @@ -1355,7 +1363,7 @@ int qib_pcie_init(struct pci_dev *, const struct pci_device_id *); int qib_pcie_ddinit(struct qib_devdata *, struct pci_dev *, const struct pci_device_id *); void qib_pcie_ddcleanup(struct qib_devdata *); -int qib_pcie_params(struct qib_devdata *, u32, u32 *, struct msix_entry *); +int qib_pcie_params(struct qib_devdata *, u32, u32 *, struct qib_msix_entry *); int qib_reinit_intr(struct qib_devdata *); void qib_enable_intx(struct pci_dev *); void qib_nomsi(struct qib_devdata *); diff --git a/drivers/infiniband/hw/qib/qib_iba7322.c b/drivers/infiniband/hw/qib/qib_iba7322.c index f3f4b55..1e44cab 100644 --- a/drivers/infiniband/hw/qib/qib_iba7322.c +++ b/drivers/infiniband/hw/qib/qib_iba7322.c @@ -540,8 +540,7 @@ struct qib_chip_specific { u32 lastbuf_for_pio; u32 stay_in_freeze; u32 recovery_ports_initted; - struct msix_entry *msix_entries; - void **msix_arg; + struct qib_msix_entry *msix_entries; unsigned long *sendchkenable; unsigned long *sendgrhchk; unsigned long *sendibchk; @@ -638,24 +637,24 @@ static struct { int lsb; int port; /* 0 if not port-specific, else port # */ } irq_table[] = { - { QIB_DRV_NAME, qib_7322intr, -1, 0 }, - { QIB_DRV_NAME (buf avail), qib_7322bufavail, + { , qib_7322intr, -1, 0 }, + { (buf avail), qib_7322bufavail, SYM_LSB(IntStatus, SendBufAvail), 0 }, - { QIB_DRV_NAME (sdma 0), sdma_intr, + { (sdma 0), sdma_intr, SYM_LSB(IntStatus, SDmaInt_0), 1 }, - { QIB_DRV_NAME (sdma 1), sdma_intr, + { (sdma 1), sdma_intr, SYM_LSB(IntStatus, SDmaInt_1), 2 }, - { QIB_DRV_NAME (sdmaI 0), sdma_idle_intr, + { (sdmaI 0), sdma_idle_intr, SYM_LSB(IntStatus, SDmaIdleInt_0), 1 }, - { QIB_DRV_NAME (sdmaI 1), sdma_idle_intr, + { (sdmaI 1), sdma_idle_intr, SYM_LSB(IntStatus, SDmaIdleInt_1), 2 }, - { QIB_DRV_NAME (sdmaP 0), sdma_progress_intr, + { (sdmaP 0), sdma_progress_intr, SYM_LSB(IntStatus, SDmaProgressInt_0), 1 }, - { QIB_DRV_NAME (sdmaP 1), sdma_progress_intr, + { (sdmaP 1), sdma_progress_intr, SYM_LSB(IntStatus, SDmaProgressInt_1), 2 }, - { QIB_DRV_NAME (sdmaC 0), sdma_cleanup_intr, + { (sdmaC 0), sdma_cleanup_intr, SYM_LSB(IntStatus, SDmaCleanupDone_0), 1 }, - { QIB_DRV_NAME (sdmaC 1), sdma_cleanup_intr, + { (sdmaC 1), sdma_cleanup_intr, SYM_LSB(IntStatus, SDmaCleanupDone_1), 2 }, }; @@ -2567,9 +2566,13 @@ static void qib_7322_nomsix(struct qib_devdata *dd) int i; dd-cspec-num_msix_entries = 0; - for (i = 0; i n; i++) - free_irq(dd-cspec-msix_entries[i].vector, -dd-cspec-msix_arg[i]); + for (i = 0; i n; i++) { + irq_set_affinity_hint( + dd-cspec-msix_entries[i].msix.vector, NULL); + free_cpumask_var(dd-cspec-msix_entries[i].mask); + free_irq(dd-cspec-msix_entries[i].msix.vector, + dd-cspec-msix_entries[i].arg); + } qib_nomsix(dd); } /*
[PATCH 3/7] IB/qib: Eliminate divide/mod in converting idx to egr buf pointer
The context init now saves a shift from rcvegrbufs_perchunk rcvegrbufs_perchunk_shift using ilog2. A BUG_ON protects the power of 2 assumption. Signed-off-by: Mike Marciniszyn mike.marcinis...@qlogic.com --- drivers/infiniband/hw/qib/qib.h |8 ++-- drivers/infiniband/hw/qib/qib_driver.c |6 +++--- drivers/infiniband/hw/qib/qib_iba6120.c |2 ++ drivers/infiniband/hw/qib/qib_iba7220.c |2 ++ drivers/infiniband/hw/qib/qib_iba7322.c |2 ++ drivers/infiniband/hw/qib/qib_init.c|3 +++ 6 files changed, 18 insertions(+), 5 deletions(-) diff --git a/drivers/infiniband/hw/qib/qib.h b/drivers/infiniband/hw/qib/qib.h index ee993e7..97e6233 100644 --- a/drivers/infiniband/hw/qib/qib.h +++ b/drivers/infiniband/hw/qib/qib.h @@ -171,7 +171,9 @@ struct qib_ctxtdata { /* how many alloc_pages() chunks in rcvegrbuf_pages */ u32 rcvegrbuf_chunks; /* how many egrbufs per chunk */ - u32 rcvegrbufs_perchunk; + u16 rcvegrbufs_perchunk; + /* ilog2 of above */ + u16 rcvegrbufs_perchunk_shift; /* order for rcvegrbuf_pages */ size_t rcvegrbuf_size; /* rcvhdrq size (for freeing) */ @@ -940,7 +942,9 @@ struct qib_devdata { /* chip address space used by 4k pio buffers */ u32 align4k; /* size of each rcvegrbuffer */ - u32 rcvegrbufsize; + u16 rcvegrbufsize; + /* log2 of above */ + u16 rcvegrbufsize_shift; /* localbus width (1, 2,4,8,16,32) from config space */ u32 lbus_width; /* localbus speed in MHz */ diff --git a/drivers/infiniband/hw/qib/qib_driver.c b/drivers/infiniband/hw/qib/qib_driver.c index 23e584f..89264ff 100644 --- a/drivers/infiniband/hw/qib/qib_driver.c +++ b/drivers/infiniband/hw/qib/qib_driver.c @@ -279,10 +279,10 @@ bail: */ static inline void *qib_get_egrbuf(const struct qib_ctxtdata *rcd, u32 etail) { - const u32 chunk = etail / rcd-rcvegrbufs_perchunk; - const u32 idx = etail % rcd-rcvegrbufs_perchunk; + const u32 chunk = etail rcd-rcvegrbufs_perchunk_shift; + const u32 idx = etail ((u32)rcd-rcvegrbufs_perchunk - 1); - return rcd-rcvegrbuf[chunk] + idx * rcd-dd-rcvegrbufsize; + return rcd-rcvegrbuf[chunk] + (idx rcd-dd-rcvegrbufsize_shift); } /* diff --git a/drivers/infiniband/hw/qib/qib_iba6120.c b/drivers/infiniband/hw/qib/qib_iba6120.c index d8ca0a0..781a802 100644 --- a/drivers/infiniband/hw/qib/qib_iba6120.c +++ b/drivers/infiniband/hw/qib/qib_iba6120.c @@ -3273,6 +3273,8 @@ static int init_6120_variables(struct qib_devdata *dd) /* we always allocate at least 2048 bytes for eager buffers */ ret = ib_mtu_enum_to_int(qib_ibmtu); dd-rcvegrbufsize = ret != -1 ? max(ret, 2048) : QIB_DEFAULT_MTU; + BUG_ON(!is_power_of_2(dd-rcvegrbufsize)); + dd-rcvegrbufsize_shift = ilog2(dd-rcvegrbufsize); qib_6120_tidtemplate(dd); diff --git a/drivers/infiniband/hw/qib/qib_iba7220.c b/drivers/infiniband/hw/qib/qib_iba7220.c index e1f9474..3f1d562 100644 --- a/drivers/infiniband/hw/qib/qib_iba7220.c +++ b/drivers/infiniband/hw/qib/qib_iba7220.c @@ -4085,6 +4085,8 @@ static int qib_init_7220_variables(struct qib_devdata *dd) /* we always allocate at least 2048 bytes for eager buffers */ ret = ib_mtu_enum_to_int(qib_ibmtu); dd-rcvegrbufsize = ret != -1 ? max(ret, 2048) : QIB_DEFAULT_MTU; + BUG_ON(!is_power_of_2(dd-rcvegrbufsize)); + dd-rcvegrbufsize_shift = ilog2(dd-rcvegrbufsize); qib_7220_tidtemplate(dd); diff --git a/drivers/infiniband/hw/qib/qib_iba7322.c b/drivers/infiniband/hw/qib/qib_iba7322.c index 5ea9ece..f3f4b55 100644 --- a/drivers/infiniband/hw/qib/qib_iba7322.c +++ b/drivers/infiniband/hw/qib/qib_iba7322.c @@ -6205,6 +6205,8 @@ static int qib_init_7322_variables(struct qib_devdata *dd) /* we always allocate at least 2048 bytes for eager buffers */ dd-rcvegrbufsize = max(mtu, 2048); + BUG_ON(!is_power_of_2(dd-rcvegrbufsize)); + dd-rcvegrbufsize_shift = ilog2(dd-rcvegrbufsize); qib_7322_tidtemplate(dd); diff --git a/drivers/infiniband/hw/qib/qib_init.c b/drivers/infiniband/hw/qib/qib_init.c index 021636d..21ffa7c 100644 --- a/drivers/infiniband/hw/qib/qib_init.c +++ b/drivers/infiniband/hw/qib/qib_init.c @@ -183,6 +183,9 @@ struct qib_ctxtdata *qib_create_ctxtdata(struct qib_pportdata *ppd, u32 ctxt) rcd-rcvegrbuf_chunks = (rcd-rcvegrcnt + rcd-rcvegrbufs_perchunk - 1) / rcd-rcvegrbufs_perchunk; + BUG_ON(!is_power_of_2(rcd-rcvegrbufs_perchunk)); + rcd-rcvegrbufs_perchunk_shift = + ilog2(rcd-rcvegrbufs_perchunk); } return rcd; } -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 2/7] IB/qib: decode path mtu optimization
Store both the encoded and decoded mtu in the qp structure as a minor optimization UC/RC receive routines. Signed-off-by: Mike Marciniszyn mike.marcinis...@qlogic.com --- drivers/infiniband/hw/qib/qib_qp.c|4 +++- drivers/infiniband/hw/qib/qib_rc.c|6 +++--- drivers/infiniband/hw/qib/qib_uc.c|4 ++-- drivers/infiniband/hw/qib/qib_verbs.h |1 + 4 files changed, 9 insertions(+), 6 deletions(-) diff --git a/drivers/infiniband/hw/qib/qib_qp.c b/drivers/infiniband/hw/qib/qib_qp.c index e16751f..9d094f9 100644 --- a/drivers/infiniband/hw/qib/qib_qp.c +++ b/drivers/infiniband/hw/qib/qib_qp.c @@ -765,8 +765,10 @@ int qib_modify_qp(struct ib_qp *ibqp, struct ib_qp_attr *attr, } } - if (attr_mask IB_QP_PATH_MTU) + if (attr_mask IB_QP_PATH_MTU) { qp-path_mtu = pmtu; + qp-pmtu = ib_mtu_enum_to_int(pmtu); + } if (attr_mask IB_QP_RETRY_CNT) { qp-s_retry_cnt = attr-retry_cnt; diff --git a/drivers/infiniband/hw/qib/qib_rc.c b/drivers/infiniband/hw/qib/qib_rc.c index 5c1c49b..5d77cb8 100644 --- a/drivers/infiniband/hw/qib/qib_rc.c +++ b/drivers/infiniband/hw/qib/qib_rc.c @@ -239,7 +239,7 @@ int qib_make_rc_req(struct qib_qp *qp) u32 len; u32 bth0; u32 bth2; - u32 pmtu = ib_mtu_enum_to_int(qp-path_mtu); + u32 pmtu = qp-pmtu; char newreq; unsigned long flags; int ret = 0; @@ -1732,7 +1732,7 @@ static int qib_rc_rcv_error(struct qib_other_headers *ohdr, * same request. */ offset = ((psn - e-psn) QIB_PSN_MASK) * - ib_mtu_enum_to_int(qp-path_mtu); + qp-pmtu; len = be32_to_cpu(reth-length); if (unlikely(offset + len != e-rdma_sge.sge_length)) goto unlock_done; @@ -1876,7 +1876,7 @@ void qib_rc_rcv(struct qib_ctxtdata *rcd, struct qib_ib_header *hdr, u32 psn; u32 pad; struct ib_wc wc; - u32 pmtu = ib_mtu_enum_to_int(qp-path_mtu); + u32 pmtu = qp-pmtu; int diff; struct ib_reth *reth; unsigned long flags; diff --git a/drivers/infiniband/hw/qib/qib_uc.c b/drivers/infiniband/hw/qib/qib_uc.c index d31f33a..9a4630f 100644 --- a/drivers/infiniband/hw/qib/qib_uc.c +++ b/drivers/infiniband/hw/qib/qib_uc.c @@ -51,7 +51,7 @@ int qib_make_uc_req(struct qib_qp *qp) u32 hwords; u32 bth0; u32 len; - u32 pmtu = ib_mtu_enum_to_int(qp-path_mtu); + u32 pmtu = qp-pmtu; int ret = 0; spin_lock_irqsave(qp-s_lock, flags); @@ -249,7 +249,7 @@ void qib_uc_rcv(struct qib_ibport *ibp, struct qib_ib_header *hdr, u32 psn; u32 pad; struct ib_wc wc; - u32 pmtu = ib_mtu_enum_to_int(qp-path_mtu); + u32 pmtu = qp-pmtu; struct ib_reth *reth; int ret; diff --git a/drivers/infiniband/hw/qib/qib_verbs.h b/drivers/infiniband/hw/qib/qib_verbs.h index 95e5b47..ec3711f 100644 --- a/drivers/infiniband/hw/qib/qib_verbs.h +++ b/drivers/infiniband/hw/qib/qib_verbs.h @@ -485,6 +485,7 @@ struct qib_qp { u8 alt_timeout; /* Alternate path timeout for this QP */ u8 port_num; enum ib_mtu path_mtu; + u32 pmtu; /* decoded from path_mtu */ u32 remote_qpn; u32 qkey; /* QKEY for this QP (for UD or RD) */ u32 s_size; /* send work queue size */ -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 0/7] qib optimizations
The following patch series contains optimizations that benefit qib verbs performance. -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
how to debug (mlx4) CQ overrun
I have a test program that does RDMA read-write as the following: node A: server listens and handles connection requests setup a piece of memory initialized to 0 node B: two processes parent child child: 1. setup a new channel with server, including a CQ with 1024 entries (ibv_create_cq(ctx, 1024, NULL, channel, 0);) 2. RDMA sequential write (8192 bytes a time) to server memory 4. sync with parent parent: 1. setup the new channel with server, including a CQ with 1024 entries (ibv_create_cq(ctx, 1024, NULL, channel, 0);) 3. RDMA sequential read (8192 byes a time) to the same piece of memory from server - check the buffer contents. - if memory content is still zero, re-read 4. sync with child The parent hangs (but child finishes its write) after the following pops up in /var/log/messages: mlx4_core :06:00.0: CQ overrun on CQN 87 I have my own counters that restrict the read (and write) to 512 max. Both write and read are blocking (i.e. cq is polled after each read/write). I suspect I do not have the cq poll logic correct. The question here is .. is there any diag tool available to check on the internal counters (and /or states) of ibverbs library and/or kernel drivers (to help RDMA applications debug) ? In my case, it hangs around 14546 block (i.e. after 14546*8192 byes). Thanks, Wendy -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: how to debug (mlx4) CQ overrun
On Fri, Sep 23, 2011 at 02:15:30PM -0700, Wendy Cheng wrote: I have my own counters that restrict the read (and write) to 512 max. Both write and read are blocking (i.e. cq is polled after each read/write). I suspect I do not have the cq poll logic correct. The question here is .. is there any diag tool available to check on the internal counters (and /or states) of ibverbs library and/or kernel drivers (to help RDMA applications debug) ? In my case, it hangs around 14546 block (i.e. after 14546*8192 byes). There are not really any tools, but this is usually straightforward to look at from your app. Every time you post to the send Q increment a counter. Everytime you get something back from ibv_poll_cq increment another counter. The (A - B) must never exceed the number of entries in the CQ, and it must not exceed the number of entries in the send Q (very important). This assumes you are posting everything with IBV_SEND_SIGNALED. Doing otherwise is basically the same but there is a bit more complexity to manage the CQ counter as each completion represents multiple sendQ entries. Make sure you check for error codes from ibv_post_send. Jason -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html