Re: [infiniband-diags] [libibmad] Support new ibccquery congestion control tool

2011-09-23 Thread Hal Rosenstock
Hi Al,

On Wed, Sep 21, 2011 at 2:25 PM, Albert Chu ch...@llnl.gov wrote:
 Hey Hal,

 Ira answered most of the important stuff already.  But some extra
 details ...

 Second the current alignment breaks libibmad.  So it would be a lot
 more code to support the miss-alignment and would probably have to be
 changed anyway.

 As far as I can tell, all the code in libibmad currently assumes 32 bit
 alignment.  For example, BITSOFFS(48, 32) leads to an offset of -16,
 which is not good.  I played around with updating the code in libibmad,
 but it was a nice chunk of changes for 1 measly field.  We could remove
 the problem field until the issue is resolved in IBTA?  Or perhaps we
 need to really redo chunks of libibmad to deal with this.

OK; maybe this is a new case of misalignment than has been dealt with
in the past. Ideally/hopefully, we don't need to deal with this.

-- Hal

 Also, would you comment on what testing has been done with this ?

 I've gotten ibccquery to talk to Mellanox CAs and Switches and output
 the current config.  I've got a side tool that can do rudimentary
 congestion control settings, so at the minimum I can twiddle some values
 and they show up in the output of ibccquery.  I haven't done any testing
 w/ real network traffic yet, so there could be lingering bugs.  But this
 is just the initial first pass.

 Al

 On Wed, 2011-09-21 at 08:49 -0700, Ira Weiny wrote:
 On Wed, 21 Sep 2011 07:17:38 -0700
 Hal Rosenstock hal.rosenst...@gmail.com wrote:

  Hi Al,
 
  On Mon, Sep 19, 2011 at 6:06 PM, Albert Chu ch...@llnl.gov wrote:
   The following patches add a new tool ibccquery to infiniband-diags.  It
   supports the querying of various congestion control settings.  Related
   updates to libibmad are also included.
 
  Looks good to me :-) Just a few comments below:
 
  Attaching rather than inlining patches makes it harder to comment.
 
  On 0001-Add-support-for-congestion-control-mads.patch, is ib_rpc_cc_t
  really needed ? Couldn't mkey in existing rpc struct just be
  reused/overloaded for this (and change comment to indicate mkey or
  cckey) and then some code could be eliminated ?

 I am not sure I like overloading fields like this.  I will take a look at it
 and see if it looks good but in general to keep ABI and clarity of the code
 I preferred the separate struct.

 
  On 0001-Support-ibccquery-congestion-control-query-tool.patch, I'm
  worried about the following:
  +           /* XXX: Q3/2010 errata lists first entry offset at 80, but we 
  assume
  +            * will be updated to 96 once CurrentTimeStamp field is word 
  aligned.
  +            * In addition, assume max 13 log events instead of 16.  Due to
  +            * errata changes increasing size of CA log event, 16 log 
  events is
  +            * no longer possible to fit in max MAD size.
  +            */
 
  As far as the 13 v. 16 entries, this appears correct to me (MAD size)
  but I'm concerned about changing the offset from 80 to 96 for better
  alignment as this is putting the cart before the horse a little as
  since these changes have not been finalized AFAIK at the IBTA.

 Yes, it is a bit premature.  I have submitted the above alignment as a 
 comment
 to the IBTA but as you say it is not published.  Most importantly the
 miss-alignment breaks the convention of the spec.  So I don't think the IBTA
 will reject the comment.

 Second the current alignment breaks libibmad.  So it would be a lot more code
 to support the miss-alignment and would probably have to be changed anyway.

 
  Also, would you comment on what testing has been done with this ?
 

 Right, the real question is what does current hardware do?

 We have been unable to determine if any of the vendors support the errata
 fully or specifically the miss-aligned CurrentTimeStamp.  When I asked the
 vendors I got concrete answers back, so we proceeded with trying to reverse
 engineer it.  Right now the query succeeds, that is all we know.

 Perhaps someone on the list can help us find out?  :-D

 In the meantime we wanted to get comments on the patches.

 Ira

  -- Hal
 
   Al
  
   --
   Albert Chu
   ch...@llnl.gov
   Computer Scientist
   High Performance Systems Division
   Lawrence Livermore National Laboratory
  
  
  --
  To unsubscribe from this list: send the line unsubscribe linux-rdma in
  the body of a message to majord...@vger.kernel.org
  More majordomo info at  http://vger.kernel.org/majordomo-info.html


 --
 Albert Chu
 ch...@llnl.gov
 Computer Scientist
 High Performance Systems Division
 Lawrence Livermore National Laboratory



--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: ib-diags: compatability issue with ibstat

2011-09-23 Thread Hal Rosenstock
On Wed, Sep 21, 2011 at 6:35 PM, Jason Gunthorpe
jguntho...@obsidianresearch.com wrote:
 On Wed, Sep 21, 2011 at 03:32:54PM -0700, Ira Weiny wrote:

 To be clear I am not against using ibverbs on the linux side.  It
 sounds like that would be the best move going forward.

 Technically umad already needs to use verbs on Linux because that is
 the only way to get the subnet timeout value that is needed to compute
 the MAD RPC timeout. Currently umad ignores this part of IBA ...

Determining the MAD timeout is more complicated than this. Subnet
timeout is insufficient (so you would need more than just this from
verbs - things that would not be part of verbs) if this mechanism were
to be made compliant rather than a configurable timeout as it is now.

-- Hal

 Jason
 --
 To unsubscribe from this list: send the line unsubscribe linux-rdma in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html

--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [ofw] ib-diags: compatability issue with ibstat

2011-09-23 Thread Hal Rosenstock
On Wed, Sep 21, 2011 at 8:14 PM, Ira Weiny wei...@llnl.gov wrote:
 On Wed, 21 Sep 2011 15:32:54 -0700
 Ira Weiny wei...@llnl.gov wrote:

 On Wed, 21 Sep 2011 15:23:45 -0700
 Hefty, Sean sean.he...@intel.com wrote:

   Does this mean ibstatus does not work on Windows?
 
  We do not support any of the scripts on windows.  As far as I could tell, 
  the scripts look like they just do post-processing of available output.
 

 Good to know, thanks.

   How are you proposing the addition to ibverbs?  It seems this would 
   break ABI
   there.
 
  On windows, libibumad uses libibverbs to obtain whatever data it needs.  
  I'm assuming that a non-MAD application will eventually be able to use 
  ibverbs to determine the rate, without needing to parse data from some 
  file.
 
   Does windows have sysfs like capabilities which would make the separate 
   layer
   clean?
 
  There's nothing like sysfs on windows.  If the data is available from 
  ibverbs, windows could pull the data from there, even if the diags do not 
  want to rely on ibverbs on the linux side.
 

 To be clear I am not against using ibverbs on the linux side.  It sounds 
 like that would be the best move going forward.


 Honestly, ibverbs supplies similar functionality in ibv_devinfo.  :-/

 It seems both ibstat and ibstatus should be dropped and ibv_devinfo enhanced 
 to supply, Rate, Physical state, Capability Mask, and the transport (on a per 
 port basis)[*].  There might be some other functionality as well.  Roland, 
 ibv_devinfo, is in the examples directory.  Is there any reason this should 
 not be used as the official tool?  If not I think we should use the code 
 from ibv_devinfo as a basis for ibstat.  But I hate to see 2 implementations 
 so close.

Even if ibv_devinfo is updated to include the additional information,
do we want to require libibverbs, etc. on any IB management machine
just for this ? That's not the case today on IB management nodes.

-- Hal


 Ira

 [*] There are cards which support this.  However, ibverbs does not have a 
 transport field for each port.  Example:

 15:54:10  ibv_devinfo -d mlx4_1 | grep trans
        transport:                      InfiniBand (0)

 15:55:22  ibstat mlx4_1 | grep Link
                Physical state: LinkUp
                Link layer: InfiniBand
                Link layer: Ethernet


 Full output from ibstat:

 15:55:32  ibstat mlx4_1
 CA 'mlx4_1'
        CA type: MT26428
        Number of ports: 2
        Firmware version: 2.9.1000
        Hardware version: b0
        Node GUID: 0x0002c9030008e7f0
        System image GUID: 0x0002c9030008e7f3
        Port 1:
                State: Initializing
                Physical state: LinkUp
                Rate: 10
                Base lid: 0
                LMC: 0
                SM lid: 0
                Capability mask: 0x02510868
                Port GUID: 0x0002c9030008e7f1
                Link layer: InfiniBand
        Port 2:
                State: Down
                Physical state: Disabled
                Rate: 10
                Base lid: 0
                LMC: 0
                SM lid: 0
                Capability mask: 0x0001
                Port GUID: 0x0202c9fffe08e7f1
                Link layer: Ethernet


 Ira

  - Sean


 --
 Ira Weiny
 Math Programmer/Computer Scientist
 Lawrence Livermore National Lab
 925-423-8008
 wei...@llnl.gov


 --
 Ira Weiny
 Member of Technical Staff
 Lawrence Livermore National Lab
 925-423-8008
 wei...@llnl.gov
 ___
 ofw mailing list
 o...@lists.openfabrics.org
 http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ofw

--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [ofw] ib-diags: compatability issue with ibstat

2011-09-23 Thread Hal Rosenstock
On Wed, Sep 21, 2011 at 3:06 PM, Hefty, Sean sean.he...@intel.com wrote:
 commit 1344cb3feacafc462440dabfa5997c5205486d83 added support for FDR10 in a 
 way that is not compatible with Windows support.  Windows does not use files 
 to read attribute information.

 I will probably need to obtain the necessary information using ibverbs on 
 windows by reading port attributes.  I don't think FDR10 support is available 
 through ibverbs on linux yet, but would this be acceptable?  Is there some 
 other way that you'd like to handle this?  The other option I can think of is 
 moving is_fdr10() and sys_read_string() out of ibstat.c and into a linux 
 specific file, so that windows can provide its own implementation.  Thoughts?

I think there are 2 (related but somewhat separate) issues here:
1. How to obtain is_fdr10
2. Whether ibverbs should be used for this (in windows, linux, both)

The only way to determine whether fdr10 is active or not is via the
vendor proprietary MAD. That info may be reflected in some other API
(and/or file) so that MAD does not need to be reissued. In a separate
thread on linux-rdma, there was discussion on a couple of different
ways to do that from verbs and in this thread that there's no sysfs
equivalent in Windows. You've already stated that the Windows support
is using libibverbs for libibumad support so it seems appropriate to
me to do the same here (in Windows at least).

-- Hal


 - Sean
 ___
 ofw mailing list
 o...@lists.openfabrics.org
 http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ofw

--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [ofw] ib-diags: compatability issue with ibstat

2011-09-23 Thread Hal Rosenstock
On Fri, Sep 23, 2011 at 9:28 AM, Hal Rosenstock
hal.rosenst...@gmail.com wrote:
 On Wed, Sep 21, 2011 at 3:06 PM, Hefty, Sean sean.he...@intel.com wrote:
 commit 1344cb3feacafc462440dabfa5997c5205486d83 added support for FDR10 in a 
 way that is not compatible with Windows support.  Windows does not use files 
 to read attribute information.

 I will probably need to obtain the necessary information using ibverbs on 
 windows by reading port attributes.  I don't think FDR10 support is 
 available through ibverbs on linux yet, but would this be acceptable?  Is 
 there some other way that you'd like to handle this?  The other option I can 
 think of is moving is_fdr10() and sys_read_string() out of ibstat.c and into 
 a linux specific file, so that windows can provide its own implementation.  
 Thoughts?

 I think there are 2 (related but somewhat separate) issues here:
 1. How to obtain is_fdr10
 2. Whether ibverbs should be used for this (in windows, linux, both)

 The only way to determine whether fdr10 is active or not is via the
 vendor proprietary MAD. That info may be reflected in some other API
 (and/or file) so that MAD does not need to be reissued. In a separate
 thread on linux-rdma, there was discussion on a couple of different
 ways to do that from verbs and in this thread that there's no sysfs
 equivalent in Windows. You've already stated that the Windows support
 is using libibverbs for libibumad support so it seems appropriate to
 me to do the same here (in Windows at least).

The proper place for is_fdr10 is in libibumad (and then we wouldn't be
discussing libibverbs w/ibstat) but that was not done to avoid a
change to the umad port structure.

-- Hal


 -- Hal


 - Sean
 ___
 ofw mailing list
 o...@lists.openfabrics.org
 http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ofw


--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [ofw] ib-diags: compatability issue with ibstat

2011-09-23 Thread Hefty, Sean
 Even if ibv_devinfo is updated to include the additional information,
 do we want to require libibverbs, etc. on any IB management machine
 just for this ? That's not the case today on IB management nodes.

IMO, it's acceptable for ibverbs to be the basic requirement for any IB 
userspace application.  I think you'll find the ibumad duplicates functionally 
that ibverbs provides, and that's why windows already has the dependency built 
in.

- Sean


RE: [ofw] ib-diags: compatability issue with ibstat

2011-09-23 Thread Hefty, Sean
  The only way to determine whether fdr10 is active or not is via the
  vendor proprietary MAD. That info may be reflected in some other API
  (and/or file) so that MAD does not need to be reissued. In a separate
  thread on linux-rdma, there was discussion on a couple of different
  ways to do that from verbs and in this thread that there's no sysfs
  equivalent in Windows. You've already stated that the Windows support
  is using libibverbs for libibumad support so it seems appropriate to
  me to do the same here (in Windows at least).
 
 The proper place for is_fdr10 is in libibumad (and then we wouldn't be
 discussing libibverbs w/ibstat) but that was not done to avoid a
 change to the umad port structure.

ibstat determines is_fdr10 by reading a file.  That same data could just as 
easily be exported as a port attribute, which would make it belong to ibverbs, 
rather than umad.

If the only point of exporting is_fdr10 from the kernel is for an ib management 
diag to display the value, then it's not a useful value for verbs applications.


RE: [ofw] ib-diags: compatability issue with ibstat

2011-09-23 Thread Hefty, Sean
So far, it seems that the choices are:

1. Remove ibstat and use ibv_devinfo instead
2. Change ibstat to obtain the fdr10 information using ibverbs
3. Move the is_fdr10 functionality into OS specific files or code sections of 
ibstat
4. Change ibstat to obtain fdr10 data using MADs or some other OS independent 
mechanism
5. Change ibumad so that it provides the fdr10 data
   Note: umad has OS dependent implementation anyway, but the windows umad would
 depend on the data being available from ibverbs
6. ???

Can someone explain why the information can't come from the umad_port::rate 
field?

- Sean
N�r��yb�X��ǧv�^�)޺{.n�+{��ٚ�{ay�ʇڙ�,j��f���h���z��w���
���j:+v���w�j�mzZ+�ݢj��!�i

Re: ib-diags: compatability issue with ibstat

2011-09-23 Thread Jason Gunthorpe
On Fri, Sep 23, 2011 at 08:57:12AM -0400, Hal Rosenstock wrote:

  Technically umad already needs to use verbs on Linux because that is
  the only way to get the subnet timeout value that is needed to compute
  the MAD RPC timeout. Currently umad ignores this part of IBA ...
 
 Determining the MAD timeout is more complicated than this. Subnet
 timeout is insufficient (so you would need more than just this from
 verbs - things that would not be part of verbs) if this mechanism were
 to be made compliant rather than a configurable timeout as it is now.

I didn't have any problem with this in python-rdma. The only wonky bit
was that verbs was needed for subnet timeout because it isn't in
sysfs. Everything else went according to the spec.

Jason
--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [ofw] ib-diags: compatability issue with ibstat

2011-09-23 Thread Hefty, Sean
   1. Remove ibstat and use ibv_devinfo instead
   2. Change ibstat to obtain the fdr10 information using ibverbs
   3. Move the is_fdr10 functionality into OS specific files or code
 sections of ibstat
   4. Change ibstat to obtain fdr10 data using MADs or some other OS
 independent mechanism
   5. Change ibumad so that it provides the fdr10 data
 Note: umad has OS dependent implementation anyway, but the windows
 umad would
   depend on the data being available from ibverbs
   6. ???
 
   Can someone explain why the information can't come from the
 umad_port::rate field?
 
 
 Because fdr10 is non standard, there's no rate for this; it overloads QDR
 based rates as both are 10 Gbps but have different link encodings.

Ok, then for choices 6-8:

6. Have ibstat just report the rate without specifying the link encoding
7. Use umad_port:rate to encode the rate, link encoding, inter-packet delay, 
etc.
   (I'm only partially kidding here.)
8. Drop ibstat from windows.  Windows users will be directed to use ibv_devinfo.

Ira, what are your thoughts on this?


Re: ib-diags: compatability issue with ibstat

2011-09-23 Thread Hal Rosenstock
On 9/23/2011 1:55 PM, Jason Gunthorpe wrote:
 On Fri, Sep 23, 2011 at 01:31:20PM -0400, Hal Rosenstock wrote:
 
 The only wonky bit
 was that verbs was needed for subnet timeout because it isn't in
 sysfs. Everything else went according to the spec.

 I don't see how your algorithm can be compliant with just subnet timeout.
 
 Really?
 
 13.4.6.3  specifies the formula
 
  CommTimeValueOutCommTimeValueInRespTimeValue
 4.096 µsec x ( 2  +2 +2   )
 
 Which boils down to 
 
 4.096 µsec x (2 x 2**(Subnet_Timeout) + 2**(RespTimeValue))
 
 For the on subnet case.
 
 RespTimeValue comes from the class port info of the target manager,
 and C13-13.1.1 gives guidance what value to use for the initial class
 port info query to get the value.
 
 Which is exactly what I implemented.
 
 It isn't hard.

and for SM MADs you need PortInfo:RespTimeValue (do you do SM MADs too ?)

My point was that verbs alone is insufficient as you need a MAD to do
this and the first round trip needs an artificial timeout (before you
know the RespTimeValue from wherever you get it). Clearly, for non SM
MADs, having subnet timeout as a port attribute saves that round trip.

-- Hal

 Jason
 

--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 33/55] infiniband: irq: Remove IRQF_DISABLED

2011-09-23 Thread Roland Dreier
On Thu, Sep 22, 2011 at 6:48 PM, Yong Zhang yong.zha...@gmail.com wrote:
 So like I said above, you could take it if you want to.

Great, I queued this up for 3.2, thanks.
--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: ib-diags: compatability issue with ibstat

2011-09-23 Thread Jason Gunthorpe
On Fri, Sep 23, 2011 at 02:01:50PM -0400, Hal Rosenstock wrote:
 On 9/23/2011 1:55 PM, Jason Gunthorpe wrote:
  On Fri, Sep 23, 2011 at 01:31:20PM -0400, Hal Rosenstock wrote:
  
  The only wonky bit
  was that verbs was needed for subnet timeout because it isn't in
  sysfs. Everything else went according to the spec.
 
  I don't see how your algorithm can be compliant with just subnet timeout.
  
  Really?
  
  13.4.6.3  specifies the formula
  
   CommTimeValueOutCommTimeValueInRespTimeValue
  4.096 ?sec x ( 2  +2 +2   )
  
  Which boils down to 
  
  4.096 ?sec x (2 x 2**(Subnet_Timeout) + 2**(RespTimeValue))
  
  For the on subnet case.
  
  RespTimeValue comes from the class port info of the target manager,
  and C13-13.1.1 gives guidance what value to use for the initial class
  port info query to get the value.
  
  Which is exactly what I implemented.
  
  It isn't hard.
 
 and for SM MADs you need PortInfo:RespTimeValue (do you do SM MADs too ?)

You should really read what I wrote. 

To be very clear, my implementation uses SubnetTimeout and a fixed
value for RespTimeValue for initial queries directed at the SA.

It switches to SubnetTimeout plus CPI.RespTimeValue once the SA's CPI is
queried.

It uses PathRecord.PacketLifetime and a fixed value for RespTimeValue
for initial queries to non-SA managers in the network, and again
switches to CPI.RespTimeValue once the manager's CPI is queried - eg
the PMA code does all this.

This is all exactly the required behavior outlined in the spec.

 My point was that verbs alone is insufficient as you need a MAD to do
 this and the first round trip needs an artificial timeout (before you
 know the RespTimeValue from wherever you get it). Clearly, for non SM
 MADs, having subnet timeout as a port attribute saves that round trip.

And my original point is that Subnet_Timeout is not optional, IBA says
you need the value to compute the timeouts.

Jason
--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 4/7] IB/qib: qpn lookup optimizations

2011-09-23 Thread Mike Marciniszyn
The heavy weight spinlock in qib_lookup_qpn() is replaced with
the RCU locking mechanism.  The hash list itself is now accessed
via jhash functions vs. the mod.

The changes should benefit multiple receive contexts in different
processors by not contending for the lock to just read the hash
structures.

The patch also adds a lookaside_qp (pointer) and a lookaside_qpn
in the context.  The interrupt handler will test the current
packet's qpn against lookaside_qpn if the lookaside_qp pointer
is non-null.  The pointer is NULL'ed when the interrupt
handler exits.

Signed-off-by: Mike Marciniszyn mike.marcinis...@qlogic.com
---
 drivers/infiniband/hw/qib/qib.h|3 +
 drivers/infiniband/hw/qib/qib_driver.c |9 
 drivers/infiniband/hw/qib/qib_qp.c |   77 
 drivers/infiniband/hw/qib/qib_verbs.c  |   36 ++-
 drivers/infiniband/hw/qib/qib_verbs.h  |3 +
 5 files changed, 87 insertions(+), 41 deletions(-)

diff --git a/drivers/infiniband/hw/qib/qib.h b/drivers/infiniband/hw/qib/qib.h
index 97e6233..b881bdc 100644
--- a/drivers/infiniband/hw/qib/qib.h
+++ b/drivers/infiniband/hw/qib/qib.h
@@ -223,6 +223,9 @@ struct qib_ctxtdata {
/* ctxt rcvhdrq head offset */
u32 head;
u32 pkt_count;
+   /* lookaside fields */
+   struct qib_qp *lookaside_qp;
+   u32 lookaside_qpn;
/* QPs waiting for context processing */
struct list_head qp_wait_list;
 };
diff --git a/drivers/infiniband/hw/qib/qib_driver.c 
b/drivers/infiniband/hw/qib/qib_driver.c
index 89264ff..d35c9d3 100644
--- a/drivers/infiniband/hw/qib/qib_driver.c
+++ b/drivers/infiniband/hw/qib/qib_driver.c
@@ -547,6 +547,15 @@ move_along:
updegr = 0;
}
}
+   /*
+* Notify qib_destroy_qp() if it is waiting
+* for lookaside_qp to finish.
+*/
+   if (rcd-lookaside_qp) {
+   if (atomic_dec_and_test(rcd-lookaside_qp-refcount))
+   wake_up(rcd-lookaside_qp-wait);
+   rcd-lookaside_qp = NULL;
+   }
 
rcd-head = l;
rcd-pkt_count += i;
diff --git a/drivers/infiniband/hw/qib/qib_qp.c 
b/drivers/infiniband/hw/qib/qib_qp.c
index 9d094f9..500981b 100644
--- a/drivers/infiniband/hw/qib/qib_qp.c
+++ b/drivers/infiniband/hw/qib/qib_qp.c
@@ -34,6 +34,7 @@
 
 #include linux/err.h
 #include linux/vmalloc.h
+#include linux/jhash.h
 
 #include qib.h
 
@@ -204,6 +205,13 @@ static void free_qpn(struct qib_qpn_table *qpt, u32 qpn)
clear_bit(qpn  BITS_PER_PAGE_MASK, map-page);
 }
 
+static inline unsigned qpn_hash(struct qib_ibdev *dev, u32 qpn)
+{
+   return jhash_1word(qpn, dev-qp_rnd) 
+   (dev-qp_table_size - 1);
+}
+
+
 /*
  * Put the QP into the hash table.
  * The hash table holds a reference to the QP.
@@ -211,22 +219,23 @@ static void free_qpn(struct qib_qpn_table *qpt, u32 qpn)
 static void insert_qp(struct qib_ibdev *dev, struct qib_qp *qp)
 {
struct qib_ibport *ibp = to_iport(qp-ibqp.device, qp-port_num);
-   unsigned n = qp-ibqp.qp_num % dev-qp_table_size;
unsigned long flags;
+   unsigned n = qpn_hash(dev, qp-ibqp.qp_num);
 
spin_lock_irqsave(dev-qpt_lock, flags);
+   atomic_inc(qp-refcount);
 
if (qp-ibqp.qp_num == 0)
-   ibp-qp0 = qp;
+   rcu_assign_pointer(ibp-qp0, qp);
else if (qp-ibqp.qp_num == 1)
-   ibp-qp1 = qp;
+   rcu_assign_pointer(ibp-qp1, qp);
else {
qp-next = dev-qp_table[n];
-   dev-qp_table[n] = qp;
+   rcu_assign_pointer(dev-qp_table[n], qp);
}
-   atomic_inc(qp-refcount);
 
spin_unlock_irqrestore(dev-qpt_lock, flags);
+   synchronize_rcu();
 }
 
 /*
@@ -236,29 +245,32 @@ static void insert_qp(struct qib_ibdev *dev, struct 
qib_qp *qp)
 static void remove_qp(struct qib_ibdev *dev, struct qib_qp *qp)
 {
struct qib_ibport *ibp = to_iport(qp-ibqp.device, qp-port_num);
-   struct qib_qp *q, **qpp;
+   unsigned n = qpn_hash(dev, qp-ibqp.qp_num);
unsigned long flags;
 
-   qpp = dev-qp_table[qp-ibqp.qp_num % dev-qp_table_size];
-
spin_lock_irqsave(dev-qpt_lock, flags);
 
if (ibp-qp0 == qp) {
-   ibp-qp0 = NULL;
atomic_dec(qp-refcount);
+   rcu_assign_pointer(ibp-qp0, NULL);
} else if (ibp-qp1 == qp) {
-   ibp-qp1 = NULL;
atomic_dec(qp-refcount);
-   } else
+   rcu_assign_pointer(ibp-qp1, NULL);
+   } else {
+   struct qib_qp *q, **qpp;
+
+   qpp = dev-qp_table[n];
for (; (q = *qpp) != NULL; qpp = q-next)
if (q == qp) {
-   *qpp = qp-next;
-   qp-next = NULL;
atomic_dec(qp-refcount);
+   

[PATCH 1/7] IB/qib: Optimize RC/UC code by IB operation

2011-09-23 Thread Mike Marciniszyn
The memset for zeroing the work completion had been unconditional.

This patch removes the memset and moves the zeroing into the work
completion with a more explicit field by field set.   With this patch,
non-ONLY/non-LAST packets will avoid the overhead since they will
not generate a completion.

Signed-off-by: Mike Marciniszyn mike.marcinis...@qlogic.com
---
 drivers/infiniband/hw/qib/qib_rc.c |   19 +--
 drivers/infiniband/hw/qib/qib_uc.c |   14 +++---
 2 files changed, 24 insertions(+), 9 deletions(-)

diff --git a/drivers/infiniband/hw/qib/qib_rc.c 
b/drivers/infiniband/hw/qib/qib_rc.c
index eca0c41..5c1c49b 100644
--- a/drivers/infiniband/hw/qib/qib_rc.c
+++ b/drivers/infiniband/hw/qib/qib_rc.c
@@ -1955,8 +1955,6 @@ void qib_rc_rcv(struct qib_ctxtdata *rcd, struct 
qib_ib_header *hdr,
break;
}
 
-   memset(wc, 0, sizeof wc);
-
if (qp-state == IB_QPS_RTR  !(qp-r_flags  QIB_R_COMM_EST)) {
qp-r_flags |= QIB_R_COMM_EST;
if (qp-ibqp.event_handler) {
@@ -2009,16 +2007,19 @@ send_middle:
goto rnr_nak;
qp-r_rcv_len = 0;
if (opcode == OP(SEND_ONLY))
-   goto send_last;
-   /* FALLTHROUGH */
+   goto no_immediate_data;
+   /* FALLTHROUGH for SEND_ONLY_WITH_IMMEDIATE */
case OP(SEND_LAST_WITH_IMMEDIATE):
 send_last_imm:
wc.ex.imm_data = ohdr-u.imm_data;
hdrsize += 4;
wc.wc_flags = IB_WC_WITH_IMM;
-   /* FALLTHROUGH */
+   goto send_last;
case OP(SEND_LAST):
case OP(RDMA_WRITE_LAST):
+no_immediate_data:
+   wc.wc_flags = 0;
+   wc.ex.imm_data = 0;
 send_last:
/* Get the number of bytes the message was padded by. */
pad = (be32_to_cpu(ohdr-bth[0])  20)  3;
@@ -2051,6 +2052,12 @@ send_last:
wc.src_qp = qp-remote_qpn;
wc.slid = qp-remote_ah_attr.dlid;
wc.sl = qp-remote_ah_attr.sl;
+   /* zero fields that are N/A */
+   wc.vendor_err = 0;
+   wc.pkey_index = 0;
+   wc.dlid_path_bits = 0;
+   wc.port_num = 0;
+   wc.csum_ok = 0;
/* Signal completion event if the solicited bit is set. */
qib_cq_enter(to_icq(qp-ibqp.recv_cq), wc,
 (ohdr-bth[0] 
@@ -2089,7 +2096,7 @@ send_last:
if (opcode == OP(RDMA_WRITE_FIRST))
goto send_middle;
else if (opcode == OP(RDMA_WRITE_ONLY))
-   goto send_last;
+   goto no_immediate_data;
ret = qib_get_rwqe(qp, 1);
if (ret  0)
goto nack_op_err;
diff --git a/drivers/infiniband/hw/qib/qib_uc.c 
b/drivers/infiniband/hw/qib/qib_uc.c
index 32ccf3c..d31f33a 100644
--- a/drivers/infiniband/hw/qib/qib_uc.c
+++ b/drivers/infiniband/hw/qib/qib_uc.c
@@ -270,7 +270,6 @@ void qib_uc_rcv(struct qib_ibport *ibp, struct 
qib_ib_header *hdr,
 
psn = be32_to_cpu(ohdr-bth[2]);
opcode = 24;
-   memset(wc, 0, sizeof wc);
 
/* Compare the PSN verses the expected PSN. */
if (unlikely(qib_cmp24(psn, qp-r_psn) != 0)) {
@@ -370,7 +369,7 @@ send_first:
}
qp-r_rcv_len = 0;
if (opcode == OP(SEND_ONLY))
-   goto send_last;
+   goto no_immediate_data;
else if (opcode == OP(SEND_ONLY_WITH_IMMEDIATE))
goto send_last_imm;
/* FALLTHROUGH */
@@ -389,8 +388,11 @@ send_last_imm:
wc.ex.imm_data = ohdr-u.imm_data;
hdrsize += 4;
wc.wc_flags = IB_WC_WITH_IMM;
-   /* FALLTHROUGH */
+   goto send_last;
case OP(SEND_LAST):
+no_immediate_data:
+   wc.ex.imm_data = 0;
+   wc.wc_flags = 0;
 send_last:
/* Get the number of bytes the message was padded by. */
pad = (be32_to_cpu(ohdr-bth[0])  20)  3;
@@ -418,6 +420,12 @@ last_imm:
wc.src_qp = qp-remote_qpn;
wc.slid = qp-remote_ah_attr.dlid;
wc.sl = qp-remote_ah_attr.sl;
+   /* zero fields that are N/A */
+   wc.vendor_err = 0;
+   wc.pkey_index = 0;
+   wc.dlid_path_bits = 0;
+   wc.port_num = 0;
+   wc.csum_ok = 0;
/* Signal completion event if the solicited bit is set. */
qib_cq_enter(to_icq(qp-ibqp.recv_cq), wc,
 (ohdr-bth[0] 

--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 7/7] IB/qib: remove s_lock around header validation

2011-09-23 Thread Mike Marciniszyn
Observation in qib_ruc_check_hdr() shows that the s_lock is not required
in the normal case.   The r_lock is held in all cases, and protects
the qp fields that are read.

The s_lock will be needed to around the call to qib_migrate_qp() to
insure that the send engine sees a consistent set of fields.

Signed-off-by: Mike Marciniszyn mike.marcinis...@qlogic.com
---
 drivers/infiniband/hw/qib/qib_driver.c |5 -
 drivers/infiniband/hw/qib/qib_rc.c |4 +---
 drivers/infiniband/hw/qib/qib_ruc.c|7 ++-
 drivers/infiniband/hw/qib/qib_uc.c |7 +--
 4 files changed, 8 insertions(+), 15 deletions(-)

diff --git a/drivers/infiniband/hw/qib/qib_driver.c 
b/drivers/infiniband/hw/qib/qib_driver.c
index 1ba9da9..8191925 100644
--- a/drivers/infiniband/hw/qib/qib_driver.c
+++ b/drivers/infiniband/hw/qib/qib_driver.c
@@ -313,7 +313,6 @@ static u32 qib_rcv_hdrerr(struct qib_ctxtdata *rcd, struct 
qib_pportdata *ppd,
u32 opcode;
u32 psn;
int diff;
-   unsigned long flags;
 
/* Sanity check packet */
if (tlen  24)
@@ -368,7 +367,6 @@ static u32 qib_rcv_hdrerr(struct qib_ctxtdata *rcd, struct 
qib_pportdata *ppd,
 
switch (qp-ibqp.qp_type) {
case IB_QPT_RC:
-   spin_lock_irqsave(qp-s_lock, flags);
ruc_res =
qib_ruc_check_hdr(
ibp, hdr,
@@ -376,11 +374,8 @@ static u32 qib_rcv_hdrerr(struct qib_ctxtdata *rcd, struct 
qib_pportdata *ppd,
qp,
be32_to_cpu(ohdr-bth[0]));
if (ruc_res) {
-   spin_unlock_irqrestore(qp-s_lock,
-  flags);
goto unlock;
}
-   spin_unlock_irqrestore(qp-s_lock, flags);
 
/* Only deal with RDMA Writes for now */
if (opcode 
diff --git a/drivers/infiniband/hw/qib/qib_rc.c 
b/drivers/infiniband/hw/qib/qib_rc.c
index a75b713..23c905e 100644
--- a/drivers/infiniband/hw/qib/qib_rc.c
+++ b/drivers/infiniband/hw/qib/qib_rc.c
@@ -1889,10 +1889,8 @@ void qib_rc_rcv(struct qib_ctxtdata *rcd, struct 
qib_ib_header *hdr,
}
 
opcode = be32_to_cpu(ohdr-bth[0]);
-   spin_lock_irqsave(qp-s_lock, flags);
if (qib_ruc_check_hdr(ibp, hdr, has_grh, qp, opcode))
-   goto sunlock;
-   spin_unlock_irqrestore(qp-s_lock, flags);
+   return;
 
psn = be32_to_cpu(ohdr-bth[2]);
opcode = 24;
diff --git a/drivers/infiniband/hw/qib/qib_ruc.c 
b/drivers/infiniband/hw/qib/qib_ruc.c
index eb78d93..b4b37e4 100644
--- a/drivers/infiniband/hw/qib/qib_ruc.c
+++ b/drivers/infiniband/hw/qib/qib_ruc.c
@@ -260,12 +260,15 @@ static int gid_ok(union ib_gid *gid, __be64 gid_prefix, 
__be64 id)
 
 /*
  *
- * This should be called with the QP s_lock held.
+ * This should be called with the QP r_lock held.
+ *
+ * The s_lock will be acquired around the qib_migrate_qp() call.
  */
 int qib_ruc_check_hdr(struct qib_ibport *ibp, struct qib_ib_header *hdr,
  int has_grh, struct qib_qp *qp, u32 bth0)
 {
__be64 guid;
+   unsigned long flags;
 
if (qp-s_mig_state == IB_MIG_ARMED  (bth0  IB_BTH_MIG_REQ)) {
if (!has_grh) {
@@ -295,7 +298,9 @@ int qib_ruc_check_hdr(struct qib_ibport *ibp, struct 
qib_ib_header *hdr,
if (be16_to_cpu(hdr-lrh[3]) != qp-alt_ah_attr.dlid ||
ppd_from_ibp(ibp)-port != qp-alt_ah_attr.port_num)
goto err;
+   spin_lock_irqsave(qp-s_lock, flags);
qib_migrate_qp(qp);
+   spin_unlock_irqrestore(qp-s_lock, flags);
} else {
if (!has_grh) {
if (qp-remote_ah_attr.ah_flags  IB_AH_GRH)
diff --git a/drivers/infiniband/hw/qib/qib_uc.c 
b/drivers/infiniband/hw/qib/qib_uc.c
index 7aa346f..56fa7a9 100644
--- a/drivers/infiniband/hw/qib/qib_uc.c
+++ b/drivers/infiniband/hw/qib/qib_uc.c
@@ -243,7 +243,6 @@ void qib_uc_rcv(struct qib_ibport *ibp, struct 
qib_ib_header *hdr,
int has_grh, void *data, u32 tlen, struct qib_qp *qp)
 {
struct qib_other_headers *ohdr;
-   unsigned long flags;
u32 opcode;
u32 hdrsize;
u32 psn;
@@ -263,10 +262,8 @@ void qib_uc_rcv(struct qib_ibport *ibp, struct 
qib_ib_header *hdr,
}
 
opcode = be32_to_cpu(ohdr-bth[0]);
-   spin_lock_irqsave(qp-s_lock, flags);
if (qib_ruc_check_hdr(ibp, hdr, has_grh, qp, opcode))
-   goto sunlock;
-   spin_unlock_irqrestore(qp-s_lock, 

[PATCH 6/7] IB/qib: memcpy optimizations

2011-09-23 Thread Mike Marciniszyn
The default memcpy used by qib_copy_sge() ends up being a rep movsb on x86_64,
which is pretty slow.

This fix adds an x86_64 specific routine that 1) probes for X86_FEATURE_REP_GOOD
and 2) uses an inline asm routine builton rep movsq that testing has shown is
better than the builtin memcpy for all cases up to 4K.  The probing routine is
now called when the qib module is loaded to enable the optimization.   When
X86_FEATURE_REP_GOOD is not set, the routine uses the kernel's unrolled __memcpy
when the length is more than 64 and the builtin memcpy otherwise.

This patch also adds the cache bypass copies from older releases.  Testing has
shown that AMD cpus benefit with a 40% improvement in netperf/ipoib.

The cache_bypass_copy module parameter can be used to enable on non-AMD
CPUs.

The qib_verbs_send_dma() and qib_copy_from_sge are also changed to use the
memcpy_string_op() to improve packet delivery performance to the
send engine.

The existing copy as well as a new stub probe routine are maintained as weak
symbols for other architectures.

Signed-off-by: Mike Marciniszyn mike.marcinis...@qlogic.com
---
 drivers/infiniband/hw/qib/Makefile |2 
 .../infiniband/hw/qib/memcpy_cachebypass_x86_64.S  |  266 
 drivers/infiniband/hw/qib/qib_copy_sge_x86_64.c|  146 +++
 drivers/infiniband/hw/qib/qib_driver.c |5 
 drivers/infiniband/hw/qib/qib_init.c   |1 
 drivers/infiniband/hw/qib/qib_rc.c |1 
 drivers/infiniband/hw/qib/qib_uc.c |1 
 drivers/infiniband/hw/qib/qib_verbs.c  |   16 +
 drivers/infiniband/hw/qib/qib_verbs.h  |   56 
 9 files changed, 488 insertions(+), 6 deletions(-)
 create mode 100644 drivers/infiniband/hw/qib/memcpy_cachebypass_x86_64.S
 create mode 100644 drivers/infiniband/hw/qib/qib_copy_sge_x86_64.c

diff --git a/drivers/infiniband/hw/qib/Makefile 
b/drivers/infiniband/hw/qib/Makefile
index f12d7bb..911e24c 100644
--- a/drivers/infiniband/hw/qib/Makefile
+++ b/drivers/infiniband/hw/qib/Makefile
@@ -11,5 +11,5 @@ ib_qib-y := qib_cq.o qib_diag.o qib_dma.o qib_driver.o 
qib_eeprom.o \
 # 6120 has no fallback if no MSI interrupts, others can do INTx
 ib_qib-$(CONFIG_PCI_MSI) += qib_iba6120.o
 
-ib_qib-$(CONFIG_X86_64) += qib_wc_x86_64.o
+ib_qib-$(CONFIG_X86_64) += qib_wc_x86_64.o qib_copy_sge_x86_64.o 
memcpy_cachebypass_x86_64.o
 ib_qib-$(CONFIG_PPC64) += qib_wc_ppc64.o
diff --git a/drivers/infiniband/hw/qib/memcpy_cachebypass_x86_64.S 
b/drivers/infiniband/hw/qib/memcpy_cachebypass_x86_64.S
new file mode 100644
index 000..41096c2
--- /dev/null
+++ b/drivers/infiniband/hw/qib/memcpy_cachebypass_x86_64.S
@@ -0,0 +1,266 @@
+/*
+ * Copyright (c) 2006, 2007, 2008 QLogic Corporation. All rights reserved.
+ * Copyright (c) 2003, 2004, 2005, 2006 PathScale, Inc. All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ * Redistribution and use in source and binary forms, with or
+ * without modification, are permitted provided that the following
+ * conditions are met:
+ *
+ *  - Redistributions of source code must retain the above
+ *copyright notice, this list of conditions and the following
+ *disclaimer.
+ *
+ *  - Redistributions in binary form must reproduce the above
+ *copyright notice, this list of conditions and the following
+ *disclaimer in the documentation and/or other materials
+ *provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED AS IS, WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+
+   .text
+   .p2align 4,,15
+   /* rdi  destination, rsi source, rdx count */
+   .globl  memcpy_cachebypass
+   .type   memcpy_cachebypass, @function
+# loads bypass the cache, stores fill the cache
+memcpy_cachebypass:
+   movq%rdi, %rax
+.L5:
+   cmpq$15, %rdx
+   ja  .L34
+.L3:
+   cmpl$8, %edx/* rdx is 0..15 */
+   jbe .L9
+.L6:
+   testb   $8, %dxl/* rdx is 3,5,6,7,9..15 */
+   je  .L13
+   movq(%rsi), %rcx
+   addq$8, %rsi
+   movq%rcx, (%rdi)
+   addq$8, %rdi
+.L13:
+   testb   $4, %dxl
+   je  .L15
+   movl(%rsi), %ecx
+   addq$4, %rsi
+   movl

[PATCH] IB/qib: Add logic for affinity hint

2011-09-23 Thread Mike Marciniszyn
Call irq_set_affinity_hint to provide user mode programs
like irqbalance the information to be able to distribute
qib interrupts appropriately.

The logic allocates all non-receive interrupts to the first
CPU local to the HCA.  Receive interrupts are allocated round
robin starting with the second CPU local to the HCA with
potential wrap back to the second CPU.

This patch also adds a refinement to the name registered with
MSIX interrupts so that user level scripts can determine
the device associated with the IRQs when there are
multiple HCAs with a potentially different set of
local CPUs.

Signed-off-by: Mike Marciniszyn mike.marcinis...@qlogic.com
---
 drivers/infiniband/hw/qib/qib.h |   10 +++
 drivers/infiniband/hw/qib/qib_iba7322.c |  107 ++-
 drivers/infiniband/hw/qib/qib_pcie.c|   21 ++
 3 files changed, 104 insertions(+), 34 deletions(-)

diff --git a/drivers/infiniband/hw/qib/qib.h b/drivers/infiniband/hw/qib/qib.h
index b881bdc..6b811e3 100644
--- a/drivers/infiniband/hw/qib/qib.h
+++ b/drivers/infiniband/hw/qib/qib.h
@@ -427,6 +427,14 @@ struct qib_verbs_txreq {
 /* how often we check for packet activity for power on hours (in seconds) */
 #define ACTIVITY_TIMER 5
 
+#define MAX_NAME_SIZE 64
+struct qib_msix_entry {
+   struct msix_entry msix;
+   void *arg;
+   char name[MAX_NAME_SIZE];
+   cpumask_var_t mask;
+};
+
 /* Below is an opaque struct. Each chip (device) can maintain
  * private data needed for its operation, but not germane to the
  * rest of the driver.  For convenience, we define another that
@@ -1355,7 +1363,7 @@ int qib_pcie_init(struct pci_dev *, const struct 
pci_device_id *);
 int qib_pcie_ddinit(struct qib_devdata *, struct pci_dev *,
const struct pci_device_id *);
 void qib_pcie_ddcleanup(struct qib_devdata *);
-int qib_pcie_params(struct qib_devdata *, u32, u32 *, struct msix_entry *);
+int qib_pcie_params(struct qib_devdata *, u32, u32 *, struct qib_msix_entry *);
 int qib_reinit_intr(struct qib_devdata *);
 void qib_enable_intx(struct pci_dev *);
 void qib_nomsi(struct qib_devdata *);
diff --git a/drivers/infiniband/hw/qib/qib_iba7322.c 
b/drivers/infiniband/hw/qib/qib_iba7322.c
index f3f4b55..1e44cab 100644
--- a/drivers/infiniband/hw/qib/qib_iba7322.c
+++ b/drivers/infiniband/hw/qib/qib_iba7322.c
@@ -540,8 +540,7 @@ struct qib_chip_specific {
u32 lastbuf_for_pio;
u32 stay_in_freeze;
u32 recovery_ports_initted;
-   struct msix_entry *msix_entries;
-   void  **msix_arg;
+   struct qib_msix_entry *msix_entries;
unsigned long *sendchkenable;
unsigned long *sendgrhchk;
unsigned long *sendibchk;
@@ -638,24 +637,24 @@ static struct {
int lsb;
int port; /* 0 if not port-specific, else port # */
 } irq_table[] = {
-   { QIB_DRV_NAME, qib_7322intr, -1, 0 },
-   { QIB_DRV_NAME  (buf avail), qib_7322bufavail,
+   { , qib_7322intr, -1, 0 },
+   {  (buf avail), qib_7322bufavail,
SYM_LSB(IntStatus, SendBufAvail), 0 },
-   { QIB_DRV_NAME  (sdma 0), sdma_intr,
+   {  (sdma 0), sdma_intr,
SYM_LSB(IntStatus, SDmaInt_0), 1 },
-   { QIB_DRV_NAME  (sdma 1), sdma_intr,
+   {  (sdma 1), sdma_intr,
SYM_LSB(IntStatus, SDmaInt_1), 2 },
-   { QIB_DRV_NAME  (sdmaI 0), sdma_idle_intr,
+   {  (sdmaI 0), sdma_idle_intr,
SYM_LSB(IntStatus, SDmaIdleInt_0), 1 },
-   { QIB_DRV_NAME  (sdmaI 1), sdma_idle_intr,
+   {  (sdmaI 1), sdma_idle_intr,
SYM_LSB(IntStatus, SDmaIdleInt_1), 2 },
-   { QIB_DRV_NAME  (sdmaP 0), sdma_progress_intr,
+   {  (sdmaP 0), sdma_progress_intr,
SYM_LSB(IntStatus, SDmaProgressInt_0), 1 },
-   { QIB_DRV_NAME  (sdmaP 1), sdma_progress_intr,
+   {  (sdmaP 1), sdma_progress_intr,
SYM_LSB(IntStatus, SDmaProgressInt_1), 2 },
-   { QIB_DRV_NAME  (sdmaC 0), sdma_cleanup_intr,
+   {  (sdmaC 0), sdma_cleanup_intr,
SYM_LSB(IntStatus, SDmaCleanupDone_0), 1 },
-   { QIB_DRV_NAME  (sdmaC 1), sdma_cleanup_intr,
+   {  (sdmaC 1), sdma_cleanup_intr,
SYM_LSB(IntStatus, SDmaCleanupDone_1), 2 },
 };
 
@@ -2567,9 +2566,13 @@ static void qib_7322_nomsix(struct qib_devdata *dd)
int i;
 
dd-cspec-num_msix_entries = 0;
-   for (i = 0; i  n; i++)
-   free_irq(dd-cspec-msix_entries[i].vector,
-dd-cspec-msix_arg[i]);
+   for (i = 0; i  n; i++) {
+   irq_set_affinity_hint(
+ dd-cspec-msix_entries[i].msix.vector, NULL);
+   free_cpumask_var(dd-cspec-msix_entries[i].mask);
+   free_irq(dd-cspec-msix_entries[i].msix.vector,
+  dd-cspec-msix_entries[i].arg);
+   }
qib_nomsix(dd);
}
/* 

[PATCH 3/7] IB/qib: Eliminate divide/mod in converting idx to egr buf pointer

2011-09-23 Thread Mike Marciniszyn
The context init now saves a shift from rcvegrbufs_perchunk
rcvegrbufs_perchunk_shift using ilog2.   A BUG_ON protects the
power of 2 assumption.

Signed-off-by: Mike Marciniszyn mike.marcinis...@qlogic.com
---
 drivers/infiniband/hw/qib/qib.h |8 ++--
 drivers/infiniband/hw/qib/qib_driver.c  |6 +++---
 drivers/infiniband/hw/qib/qib_iba6120.c |2 ++
 drivers/infiniband/hw/qib/qib_iba7220.c |2 ++
 drivers/infiniband/hw/qib/qib_iba7322.c |2 ++
 drivers/infiniband/hw/qib/qib_init.c|3 +++
 6 files changed, 18 insertions(+), 5 deletions(-)

diff --git a/drivers/infiniband/hw/qib/qib.h b/drivers/infiniband/hw/qib/qib.h
index ee993e7..97e6233 100644
--- a/drivers/infiniband/hw/qib/qib.h
+++ b/drivers/infiniband/hw/qib/qib.h
@@ -171,7 +171,9 @@ struct qib_ctxtdata {
/* how many alloc_pages() chunks in rcvegrbuf_pages */
u32 rcvegrbuf_chunks;
/* how many egrbufs per chunk */
-   u32 rcvegrbufs_perchunk;
+   u16 rcvegrbufs_perchunk;
+   /* ilog2 of above */
+   u16 rcvegrbufs_perchunk_shift;
/* order for rcvegrbuf_pages */
size_t rcvegrbuf_size;
/* rcvhdrq size (for freeing) */
@@ -940,7 +942,9 @@ struct qib_devdata {
/* chip address space used by 4k pio buffers */
u32 align4k;
/* size of each rcvegrbuffer */
-   u32 rcvegrbufsize;
+   u16 rcvegrbufsize;
+   /* log2 of above */
+   u16 rcvegrbufsize_shift;
/* localbus width (1, 2,4,8,16,32) from config space  */
u32 lbus_width;
/* localbus speed in MHz */
diff --git a/drivers/infiniband/hw/qib/qib_driver.c 
b/drivers/infiniband/hw/qib/qib_driver.c
index 23e584f..89264ff 100644
--- a/drivers/infiniband/hw/qib/qib_driver.c
+++ b/drivers/infiniband/hw/qib/qib_driver.c
@@ -279,10 +279,10 @@ bail:
  */
 static inline void *qib_get_egrbuf(const struct qib_ctxtdata *rcd, u32 etail)
 {
-   const u32 chunk = etail / rcd-rcvegrbufs_perchunk;
-   const u32 idx =  etail % rcd-rcvegrbufs_perchunk;
+   const u32 chunk = etail  rcd-rcvegrbufs_perchunk_shift;
+   const u32 idx =  etail  ((u32)rcd-rcvegrbufs_perchunk - 1);
 
-   return rcd-rcvegrbuf[chunk] + idx * rcd-dd-rcvegrbufsize;
+   return rcd-rcvegrbuf[chunk] + (idx  rcd-dd-rcvegrbufsize_shift);
 }
 
 /*
diff --git a/drivers/infiniband/hw/qib/qib_iba6120.c 
b/drivers/infiniband/hw/qib/qib_iba6120.c
index d8ca0a0..781a802 100644
--- a/drivers/infiniband/hw/qib/qib_iba6120.c
+++ b/drivers/infiniband/hw/qib/qib_iba6120.c
@@ -3273,6 +3273,8 @@ static int init_6120_variables(struct qib_devdata *dd)
/* we always allocate at least 2048 bytes for eager buffers */
ret = ib_mtu_enum_to_int(qib_ibmtu);
dd-rcvegrbufsize = ret != -1 ? max(ret, 2048) : QIB_DEFAULT_MTU;
+   BUG_ON(!is_power_of_2(dd-rcvegrbufsize));
+   dd-rcvegrbufsize_shift = ilog2(dd-rcvegrbufsize);
 
qib_6120_tidtemplate(dd);
 
diff --git a/drivers/infiniband/hw/qib/qib_iba7220.c 
b/drivers/infiniband/hw/qib/qib_iba7220.c
index e1f9474..3f1d562 100644
--- a/drivers/infiniband/hw/qib/qib_iba7220.c
+++ b/drivers/infiniband/hw/qib/qib_iba7220.c
@@ -4085,6 +4085,8 @@ static int qib_init_7220_variables(struct qib_devdata *dd)
/* we always allocate at least 2048 bytes for eager buffers */
ret = ib_mtu_enum_to_int(qib_ibmtu);
dd-rcvegrbufsize = ret != -1 ? max(ret, 2048) : QIB_DEFAULT_MTU;
+   BUG_ON(!is_power_of_2(dd-rcvegrbufsize));
+   dd-rcvegrbufsize_shift = ilog2(dd-rcvegrbufsize);
 
qib_7220_tidtemplate(dd);
 
diff --git a/drivers/infiniband/hw/qib/qib_iba7322.c 
b/drivers/infiniband/hw/qib/qib_iba7322.c
index 5ea9ece..f3f4b55 100644
--- a/drivers/infiniband/hw/qib/qib_iba7322.c
+++ b/drivers/infiniband/hw/qib/qib_iba7322.c
@@ -6205,6 +6205,8 @@ static int qib_init_7322_variables(struct qib_devdata *dd)
 
/* we always allocate at least 2048 bytes for eager buffers */
dd-rcvegrbufsize = max(mtu, 2048);
+   BUG_ON(!is_power_of_2(dd-rcvegrbufsize));
+   dd-rcvegrbufsize_shift = ilog2(dd-rcvegrbufsize);
 
qib_7322_tidtemplate(dd);
 
diff --git a/drivers/infiniband/hw/qib/qib_init.c 
b/drivers/infiniband/hw/qib/qib_init.c
index 021636d..21ffa7c 100644
--- a/drivers/infiniband/hw/qib/qib_init.c
+++ b/drivers/infiniband/hw/qib/qib_init.c
@@ -183,6 +183,9 @@ struct qib_ctxtdata *qib_create_ctxtdata(struct 
qib_pportdata *ppd, u32 ctxt)
rcd-rcvegrbuf_chunks = (rcd-rcvegrcnt +
rcd-rcvegrbufs_perchunk - 1) /
rcd-rcvegrbufs_perchunk;
+   BUG_ON(!is_power_of_2(rcd-rcvegrbufs_perchunk));
+   rcd-rcvegrbufs_perchunk_shift =
+   ilog2(rcd-rcvegrbufs_perchunk);
}
return rcd;
 }

--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 2/7] IB/qib: decode path mtu optimization

2011-09-23 Thread Mike Marciniszyn
Store both the encoded and decoded mtu in the qp structure as a minor
optimization UC/RC receive routines.

Signed-off-by: Mike Marciniszyn mike.marcinis...@qlogic.com
---
 drivers/infiniband/hw/qib/qib_qp.c|4 +++-
 drivers/infiniband/hw/qib/qib_rc.c|6 +++---
 drivers/infiniband/hw/qib/qib_uc.c|4 ++--
 drivers/infiniband/hw/qib/qib_verbs.h |1 +
 4 files changed, 9 insertions(+), 6 deletions(-)

diff --git a/drivers/infiniband/hw/qib/qib_qp.c 
b/drivers/infiniband/hw/qib/qib_qp.c
index e16751f..9d094f9 100644
--- a/drivers/infiniband/hw/qib/qib_qp.c
+++ b/drivers/infiniband/hw/qib/qib_qp.c
@@ -765,8 +765,10 @@ int qib_modify_qp(struct ib_qp *ibqp, struct ib_qp_attr 
*attr,
}
}
 
-   if (attr_mask  IB_QP_PATH_MTU)
+   if (attr_mask  IB_QP_PATH_MTU) {
qp-path_mtu = pmtu;
+   qp-pmtu = ib_mtu_enum_to_int(pmtu);
+   }
 
if (attr_mask  IB_QP_RETRY_CNT) {
qp-s_retry_cnt = attr-retry_cnt;
diff --git a/drivers/infiniband/hw/qib/qib_rc.c 
b/drivers/infiniband/hw/qib/qib_rc.c
index 5c1c49b..5d77cb8 100644
--- a/drivers/infiniband/hw/qib/qib_rc.c
+++ b/drivers/infiniband/hw/qib/qib_rc.c
@@ -239,7 +239,7 @@ int qib_make_rc_req(struct qib_qp *qp)
u32 len;
u32 bth0;
u32 bth2;
-   u32 pmtu = ib_mtu_enum_to_int(qp-path_mtu);
+   u32 pmtu = qp-pmtu;
char newreq;
unsigned long flags;
int ret = 0;
@@ -1732,7 +1732,7 @@ static int qib_rc_rcv_error(struct qib_other_headers 
*ohdr,
 * same request.
 */
offset = ((psn - e-psn)  QIB_PSN_MASK) *
-   ib_mtu_enum_to_int(qp-path_mtu);
+   qp-pmtu;
len = be32_to_cpu(reth-length);
if (unlikely(offset + len != e-rdma_sge.sge_length))
goto unlock_done;
@@ -1876,7 +1876,7 @@ void qib_rc_rcv(struct qib_ctxtdata *rcd, struct 
qib_ib_header *hdr,
u32 psn;
u32 pad;
struct ib_wc wc;
-   u32 pmtu = ib_mtu_enum_to_int(qp-path_mtu);
+   u32 pmtu = qp-pmtu;
int diff;
struct ib_reth *reth;
unsigned long flags;
diff --git a/drivers/infiniband/hw/qib/qib_uc.c 
b/drivers/infiniband/hw/qib/qib_uc.c
index d31f33a..9a4630f 100644
--- a/drivers/infiniband/hw/qib/qib_uc.c
+++ b/drivers/infiniband/hw/qib/qib_uc.c
@@ -51,7 +51,7 @@ int qib_make_uc_req(struct qib_qp *qp)
u32 hwords;
u32 bth0;
u32 len;
-   u32 pmtu = ib_mtu_enum_to_int(qp-path_mtu);
+   u32 pmtu = qp-pmtu;
int ret = 0;
 
spin_lock_irqsave(qp-s_lock, flags);
@@ -249,7 +249,7 @@ void qib_uc_rcv(struct qib_ibport *ibp, struct 
qib_ib_header *hdr,
u32 psn;
u32 pad;
struct ib_wc wc;
-   u32 pmtu = ib_mtu_enum_to_int(qp-path_mtu);
+   u32 pmtu = qp-pmtu;
struct ib_reth *reth;
int ret;
 
diff --git a/drivers/infiniband/hw/qib/qib_verbs.h 
b/drivers/infiniband/hw/qib/qib_verbs.h
index 95e5b47..ec3711f 100644
--- a/drivers/infiniband/hw/qib/qib_verbs.h
+++ b/drivers/infiniband/hw/qib/qib_verbs.h
@@ -485,6 +485,7 @@ struct qib_qp {
u8 alt_timeout; /* Alternate path timeout for this QP */
u8 port_num;
enum ib_mtu path_mtu;
+   u32 pmtu;   /* decoded from path_mtu */
u32 remote_qpn;
u32 qkey;   /* QKEY for this QP (for UD or RD) */
u32 s_size; /* send work queue size */

--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 0/7] qib optimizations

2011-09-23 Thread Mike Marciniszyn
The following patch series contains optimizations that benefit
qib verbs performance.
--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


how to debug (mlx4) CQ overrun

2011-09-23 Thread Wendy Cheng
I have a test program that does RDMA read-write as the following:

node A: server listens and handles connection requests
   setup a piece of memory initialized to 0
node B: two processes parent  child

child:
  1. setup a new channel with server, including a CQ with 1024 entries
(ibv_create_cq(ctx, 1024, NULL, channel, 0);)
  2. RDMA sequential write (8192 bytes a time) to server memory
  4. sync with parent

parent:
   1. setup the new channel with server, including a CQ with 1024 entries
  (ibv_create_cq(ctx, 1024, NULL, channel, 0);)
3. RDMA sequential read (8192 byes a time) to the same piece of
memory from server
 - check the buffer contents.
 - if memory content is still zero, re-read
4. sync with child

The parent hangs (but child finishes its write) after the following
pops up in /var/log/messages:
 mlx4_core :06:00.0: CQ overrun on CQN 87

I have my own counters that restrict the read (and write) to 512 max.
Both write and read are blocking (i.e. cq is polled after each
read/write). I suspect I do not have the cq poll logic correct. The
question here is .. is there any diag tool available to check on the
internal counters (and /or states) of ibverbs library and/or kernel
drivers (to help RDMA applications debug) ? In my case, it hangs
around 14546 block (i.e. after 14546*8192 byes).

Thanks,
Wendy
--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: how to debug (mlx4) CQ overrun

2011-09-23 Thread Jason Gunthorpe
On Fri, Sep 23, 2011 at 02:15:30PM -0700, Wendy Cheng wrote:

 I have my own counters that restrict the read (and write) to 512 max.
 Both write and read are blocking (i.e. cq is polled after each
 read/write). I suspect I do not have the cq poll logic correct. The
 question here is .. is there any diag tool available to check on the
 internal counters (and /or states) of ibverbs library and/or kernel
 drivers (to help RDMA applications debug) ? In my case, it hangs
 around 14546 block (i.e. after 14546*8192 byes).

There are not really any tools, but this is usually straightforward to
look at from your app.

Every time you post to the send Q increment a counter. Everytime you
get something back from ibv_poll_cq increment another counter.

The (A - B) must never exceed the number of entries in the CQ, and it
must not exceed the number of entries in the send Q (very important).

This assumes you are posting everything with IBV_SEND_SIGNALED. Doing
otherwise is basically the same but there is a bit more complexity to
manage the CQ counter as each completion represents multiple sendQ
entries.

Make sure you check for error codes from ibv_post_send.

Jason
--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html