Re: Promiscuous mode support in IPoIB

2011-07-21 Thread Devesh Sharma
Thanks, Woodruff, for your response. Eli, can you please throw some
light on this? Thank you for your help and time.

On Mon, Jul 18, 2011 at 11:55 PM, Woodruff, Robert J
robert.j.woodr...@intel.com wrote:
 Devesh Sharma wrote,

Hello List,

Could someone kindly help me with this? I need to know about promiscuous mode
support in IPoIB, and I am unsure whether it is possible at all on the
IPoIB stack. If yes, then how?

 I am not sure that it is supported, but Eli would know for sure.




-- 
Please don't print this E-mail unless you really need to - this will
preserve trees on planet earth.


Re: [RFC] XRC upstream merge reboot

2011-07-21 Thread Jack Morgenstein
On Wednesday 20 July 2011 21:51, Hefty, Sean wrote:
 I've tried to come up with a clean way to determine the lifetime of an xrc tgt qp,
 and I think the best approach is still:
 
 1. Allow the creating process to destroy it at any time, and
 
 2a. If not explicitly destroyed, the tgt qp is bound to the lifetime of the 
 xrc domain
 or
 2b. The creating process specifies during the creation of the tgt qp
 whether the qp should be destroyed on exit. 
 
 The MPIs associate an xrc domain with a job, so this should work.
 Everything else significantly complicates the usage model and implementation,
 both for verbs and the CM.  An application can maintain a reference count
 out of band with a persistent server and use explicit destruction
 if they want to share the xrcd across jobs.
I assume that you intend the persistent server to replace the reg_xrc_rcv_qp/
unreg_xrc_rcv_qp verbs.  Correct?
 
 Option 2a is the current implementation, but 2b should be a minor change.
 I'd like to reach a consensus on the right approach here, since there doesn't
 appear to be issues elsewhere.  
 
 - Sean

I have no opinion either way (with regard to tgt qp registration and reference 
counting).
The OFED xrc implementation was driven by the requirements of the MPI community.

If MPI can use a different XRC domain per job (and deallocate the domain
at the job's end), this would solve the tgt qp lifetime problem (-- by
destroying all the tgt qp's when the xrc domain is deallocated).

Regarding option 2b: do you mean that in this case the tgt qp is NOT bound to the
XRC domain lifetime?  Who destroys the tgt qp in this case when the creator
indicates that the tgt qp should not be destroyed on exit?

I am concerned with backwards compatibility here.  It seems that XRC users will
need to change their source code, not just recompile.  I am assuming that OFED
will take the mainstream kernel implementation at some point.  Since this is
**userspace** code, there could be a problem if OFED users upgrade their OFED
installation to one which supports the new interface.  This could be especially
difficult if, for example, the customer is using 3rd-party packages which
utilize the current OFED xrc interface.  We could start seeing customers not
take new OFED releases solely because of the XRC incompatibility (or worse,
customers upgrading and then finding out that their 3rd-party XRC apps no
longer work).

Having a new OFED support BOTH interfaces is a nightmare I don't even want to 
think about!

-Jack


Re: [RFC] XRC upstream merge reboot

2011-07-21 Thread Jack Morgenstein
On Thursday 21 July 2011 10:38, Jack Morgenstein wrote:
 Having a new OFED support BOTH interfaces is a nightmare I don't even want to 
 think about!

I overreacted here; sorry about that.  I know that it will be difficult
to support both the old and the new interface.  However, to support the
current OFED customer base, we will do so, in order to ease the interface
transition for the apps, which we expect will take place over time.

Sean, we will do our best to help you with getting XRC into the mainstream
kernel. 

-Jack


Re: [RFC] XRC upstream merge reboot

2011-07-21 Thread Jeff Squyres
On Jul 21, 2011, at 3:38 AM, Jack Morgenstein wrote:

 If MPI can use a different XRC domain per job (and deallocate the domain
 at the job's end), this would solve the tgt qp lifetime problem (-- by
 destroying all the tgt qp's when the xrc domain is deallocated).

What happens if the MPI job crashes and does not properly deallocate the XRC 
domain / tgt qp?

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/



Re: [RFC] XRC upstream merge reboot

2011-07-21 Thread Jeff Squyres
On Jul 21, 2011, at 8:47 AM, Jack Morgenstein wrote:

 [snip]
 When the last user of an XRC domain exits cleanly (or crashes), the domain 
 should be destroyed.
 In this case, with Sean's design, the tgt qp's for the XRC domain should also 
 be destroyed.

Sounds perfect.

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/



[PATCH] IB/qib: defer hca error events to tasklet

2011-07-21 Thread Mike Marciniszyn
With ib_qib options as follows:

options ib_qib krcvqs=1 pcie_caps=0x51 rcvhdrcnt=4096 singleport=1 ibmtu=4

A run of ib_write_bw -a yields the following:

---------------------------------------------------------------------------------------
 #bytes     #iterations    BW peak[MB/sec]    BW average[MB/sec]
 1048576    5000           2910.64            229.80
---------------------------------------------------------------------------------------

The top CPU use in a profile is:

CPU: Intel Architectural Perfmon, speed 2400.15 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask
of 0x00 (No unit mask) count 1002300
Counted LLC_MISSES events (Last level cache demand requests from this core that
missed the LLC) with a unit mask of 0x41 (No unit mask) count 1
samples  %        samples  %        app name     symbol name
15237    29.2642  964      17.1195  ib_qib.ko    qib_7322intr
12320    23.6618  1040     18.4692  ib_qib.ko    handle_7322_errors
4106      7.8860  0         0       vmlinux      vsnprintf


Analysis of the stats, profile, the code, and the annotated profile indicate:
- All of the overflow interrupts (one per packet overflow) are serviced
  on CPU0 with no mitigation on the frequency.
- All of the receive interrupts are being serviced by CPU0.  (That is the way
  truescale.cmds statically allocates the kctx IRQs to CPUs.)
- The code is spending all of its time servicing QIB_I_C_ERROR RcvEgrFullErr
  interrupts on CPU0, starving the packet receive processing
- The decode_err routine is very inefficient, using a printf variant to format
  a %s, and it continues to loop even after the errs mask has been cleared
- Both qib_7322intr and handle_7322_errors read PCI registers, which is very
  inefficient.

The fix does the following:
- Adds a tasklet to service the QIB_I_C_ERROR
- Replaces the very inefficient scnprintf() with a memcpy().  A field is added
  to qib_hwerror_msgs to save the sizeof(string) at compile time so that a
  strlen is not needed during err_decode() (a rough sketch follows below)
- Services the most frequent errors (Overflows) first to exit the loop as
  early as possible
- Exits the loop as soon as the errs mask is clear rather than fruitlessly
  looping through the msp array
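
As a rough sketch of the err_decode() idea (illustration only; the function
signature, the QIB_ERR_MSG macro name and the mask==0 terminator are assumed
for the example, not taken from the patch hunks below):

    #include <linux/types.h>	/* u64 */
    #include <linux/string.h>	/* memcpy */

    struct qib_hwerror_msgs {
    	u64 mask;
    	const char *msg;
    	size_t sz;		/* sizeof(msg) captured at compile time */
    };

    #define QIB_ERR_MSG(m, s) { .mask = m, .msg = s, .sz = sizeof(s) }

    static size_t err_decode(char *buf, size_t blen, u64 errs,
    			 const struct qib_hwerror_msgs *msp)
    {
    	size_t len = 0;

    	for (; errs && msp && msp->mask; msp++) {
    		if (!(errs & msp->mask))
    			continue;
    		errs &= ~msp->mask;		/* loop ends once errs is clear */
    		if (len + msp->sz >= blen)
    			break;
    		memcpy(buf + len, msp->msg, msp->sz - 1);	/* no strlen() */
    		len += msp->sz - 1;
    		buf[len++] = ',';
    	}
    	if (len)
    		buf[--len] = '\0';	/* replace the trailing comma */
    	else if (blen)
    		buf[0] = '\0';
    	return len;
    }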

With this fix the performance changes to:

---------------------------------------------------------------------------------------
 #bytes     #iterations    BW peak[MB/sec]    BW average[MB/sec]
 1048576    5000           2990.64            2941.35
---------------------------------------------------------------------------------------

During testing of the error handling overflow patch, it was determined
that some CPUs were slower when servicing both overflow
and receive interrupts on CPU0 with different MSI interrupt vectors.

This patch adds an option (krcvq01_no_msi) to not use a dedicated MSI
interrupt for kctxs < 2 and to service them on the default
interrupt.  For some CPUs, the interrupt enter/exit is
more costly than the additional PCI read in the default handler.

Signed-off-by: Mike Marciniszyn mike.marcinis...@qlogic.com
---
 drivers/infiniband/hw/qib/qib.h |3 +
 drivers/infiniband/hw/qib/qib_iba7322.c |   71 ++-
 2 files changed, 53 insertions(+), 21 deletions(-)

diff --git a/drivers/infiniband/hw/qib/qib.h b/drivers/infiniband/hw/qib/qib.h
index 769a1d9..c9624ea 100644
--- a/drivers/infiniband/hw/qib/qib.h
+++ b/drivers/infiniband/hw/qib/qib.h
@@ -1012,6 +1012,8 @@ struct qib_devdata {
u8 psxmitwait_supported;
/* cycle length of PS* counters in HW (in picoseconds) */
u16 psxmitwait_check_rate;
+   /* high volume overflow errors defered to tasklet */
+   struct tasklet_struct error_tasklet;
 };
 
 /* hol_state values */
@@ -1433,6 +1435,7 @@ extern struct mutex qib_mutex;
 struct qib_hwerror_msgs {
u64 mask;
const char *msg;
+   size_t sz;
 };
 
 #define QLOGIC_IB_HWE_MSG(a, b) { .mask = a, .msg = b }
diff --git a/drivers/infiniband/hw/qib/qib_iba7322.c b/drivers/infiniband/hw/qib/qib_iba7322.c
index 821226c..5ea9ece 100644
--- a/drivers/infiniband/hw/qib/qib_iba7322.c
+++ b/drivers/infiniband/hw/qib/qib_iba7322.c
@@ -114,6 +114,10 @@ static ushort qib_singleport;
 module_param_named(singleport, qib_singleport, ushort, S_IRUGO);
MODULE_PARM_DESC(singleport, "Use only IB port 1; more per-port buffer space");
 
+static ushort qib_krcvq01_no_msi;
+module_param_named(krcvq01_no_msi, qib_krcvq01_no_msi, ushort, S_IRUGO);
+MODULE_PARM_DESC(krcvq01_no_msi, "No MSI for kctx < 2");
+
 /*
  * Receive header queue sizes
  */
@@ -1106,9 +1110,9 @@ static inline u32 read_7322_creg32_port(const struct qib_pportdata *ppd,
 #define AUTONEG_TRIES 3 /* sequential retries to negotiate DDR */
 
 #define HWE_AUTO(fldname) { .mask = SYM_MASK(HwErrMask, fldname##Mask), \
-   .msg = #fldname }
+   .msg = #fldname , .sz = sizeof(#fldname) }
 #define HWE_AUTO_P(fldname, port) { .mask = SYM_MASK(HwErrMask, \
-   

Re: [PATCH] IB/qib: defer hca error events to tasklet

2011-07-21 Thread Roland Dreier
 +static ushort qib_krcvq01_no_msi;
 +module_param_named(krcvq01_no_msi, qib_krcvq01_no_msi, ushort, S_IRUGO);
 +MODULE_PARM_DESC(krcvq01_no_msi, "No MSI for kctx < 2");

First, the obvious question: is there really no better way to handle this
than to have yet another cryptically named module parameter for users
to try both ways?

Second, why can't this just be module_param() (why is _named needed?)?


RE: [RFC] XRC upstream merge reboot

2011-07-21 Thread Hefty, Sean
 If you use file descriptors for the XRC domain, then when the last user of the
 domain exits, the domain
 gets destroyed (at least this is in OFED.  Sean's code looks the same).
 
 In this case, the kernel cleanup code for the process should close the XRC
 domains opened by that
 process, so there is no leakage.
 
 When the last user of an XRC domain exits cleanly (or crashes), the domain
 should be destroyed.
 In this case, with Sean's design, the tgt qp's for the XRC domain should also
 be destroyed.
 
 Sean, is this correct?

This is correct.
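
For reference, that behavior falls out of the normal file-descriptor release
path; a minimal sketch of the pattern (the names and the per-domain refcount
here are illustrative only, not the actual uverbs code):

    #include <linux/fs.h>
    #include <linux/module.h>
    #include <linux/slab.h>
    #include <linux/atomic.h>

    struct xrcd_obj {
    	atomic_t refcnt;	/* one count per process sharing the domain */
    	/* ... domain state, tgt qp list ... */
    };

    static void destroy_xrcd(struct xrcd_obj *xrcd)
    {
    	/* would also tear down any remaining tgt qps here */
    	kfree(xrcd);
    }

    /*
     * ->release() runs when the last reference to the file goes away,
     * which the kernel guarantees on clean exit and on a crash alike.
     */
    static int xrcd_release(struct inode *inode, struct file *filp)
    {
    	struct xrcd_obj *xrcd = filp->private_data;

    	if (atomic_dec_and_test(&xrcd->refcnt))
    		destroy_xrcd(xrcd);
    	return 0;
    }

    static const struct file_operations xrcd_fops = {
    	.owner   = THIS_MODULE,
    	.release = xrcd_release,
    };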


RE: [PATCH] IB/qib: defer hca error events to tasklet

2011-07-21 Thread Mike Marciniszyn
There is no choice but a module parameter, since the MSI-X initialization is done
at driver load, and the option is only needed after testing reveals that the
change is necessary.

I agree that there are a lot of module parameters.  Some that don't allocate
resources at load time could be sysfs.

I can re-issue the patch with module_param and change the identifier.

The module parameters used throughout the code usually prepend a qib_ or an
ib_qib_ prefix, and I was following precedent in the code.

The only one without a prefix that I see is sdma_descq_cnt, which could change
as well to use module_param.
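
Roughly, the re-issued declaration would then look something like this (a
sketch only; the final naming is subject to the respin):

    static ushort krcvq01_no_msi;
    module_param(krcvq01_no_msi, ushort, S_IRUGO);
    MODULE_PARM_DESC(krcvq01_no_msi, "No MSI for kctx < 2");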

Mike

 -Original Message-
 From: rol...@purestorage.com [mailto:rol...@purestorage.com] On Behalf
 Of Roland Dreier
 Sent: Thursday, July 21, 2011 11:47 AM
 To: Mike Marciniszyn
 Cc: linux-rdma@vger.kernel.org
 Subject: Re: [PATCH] IB/qib: defer hca error events to tasklet

  +static ushort qib_krcvq01_no_msi;
  +module_param_named(krcvq01_no_msi, qib_krcvq01_no_msi, ushort,
 S_IRUGO);
  +MODULE_PARM_DESC(krcvq01_no_msi, "No MSI for kctx < 2");

 First, the obvious question: is there really no better way to handle
 this
 than to have yet another cryptically named module parameter for users
 to try both ways?

 Second, why can't this just be module_param() (why is _named needed?)?




RE: [RFC] XRC upstream merge reboot

2011-07-21 Thread Hefty, Sean
  I've tried to come up with a clean way to determine the lifetime of an xrc
 tgt qp,
  and I think the best approach is still:
 
  1. Allow the creating process to destroy it at any time, and
 
  2a. If not explicitly destroyed, the tgt qp is bound to the lifetime of the
 xrc domain
  or
  2b. The creating process specifies during the creation of the tgt qp
  whether the qp should be destroyed on exit.
 
  The MPIs associate an xrc domain with a job, so this should work.
  Everything else significantly complicates the usage model and
 implementation,
  both for verbs and the CM.  An application can maintain a reference count
  out of band with a persistent server and use explicit destruction
  if they want to share the xrcd across jobs.
 I assume that you intend the persistent server to replace the reg_xrc_rcv_qp/
 unreg_xrc_rcv_qp verbs.  Correct?

I'm suggesting that anyone who wants to share an xrcd across jobs can use out 
of band communication to maintain their own reference count, rather than 
pushing that feature into the mainline.  This requires a code change for apps 
that have coded to OFED and use this feature.

 I have no opinion either way (with regard to tgt qp registration and reference
 counting).
 The OFED xrc implementation was driven by the requirements of the MPI
 community.

From the email threads I followed, it was a request from HP MPI.  The other 
MPIs have used the same interface since it was what was defined, but do not 
appear to be sharing the xrcd across jobs.  HP has since canceled their MPI 
product.

 Regarding option 2b: do you mean that in this case the tgt qp is NOT bound to
 the
 XRC domain lifetime? who destroys the tgt qp in this case when the creator
 indicates
 that the tgt qp should not be destroyed on exit?

With option 2b, the tgt qp lifetime is either tied to the life of the creating 
process or the xrcd.  The creating process specifies which on creation.  
Basically, the choice allows the creating process to destroy the tgt qp when it 
exits, rather than waiting until the xrcd is closed.  Note that ibverbs only 
considers the life of the tgt qp, but we also need to consider the life of 
a corresponding connection maintained by the IB CM.
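
To make 2b concrete, here is a minimal sketch of the intended cleanup rule
(all names are hypothetical, chosen only to illustrate the semantics, not a
proposed ABI):

    /* Hypothetical types for illustration only. */
    struct xrc_tgt_qp {
    	int destroy_on_exit;	/* chosen by the creating process at creation time */
    	/* ... */
    };

    void destroy_tgt_qp(struct xrc_tgt_qp *qp);

    /* Runs when the creating process goes away (cleanly or not). */
    static void creator_exit_cleanup(struct xrc_tgt_qp *qp)
    {
    	if (qp->destroy_on_exit)
    		destroy_tgt_qp(qp);	/* 2b: lifetime tied to the creating process */
    	/* else: 2a behavior -- the qp survives until the xrcd is destroyed */
    }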
 
 I am concerned with backwards compatibility, here.  It seems that XRC users
 will need to
 change their source-code, not just recompile.  I am assuming that OFED will
 take the
 mainstream kernel implementation at some point.  Since this is **userspace**
 code, there could be a problem
 if OFED users upgrade their OFED installation to one which supports the new
 interface.
 This could be especially difficult if, for example, the customer is using 3rd-
 party packages
 which utilize the current OFED xrc interface.  We could start seeing customers
 not take
 new OFED releases solely because of the XRC incompatibility (or worse,
 customers upgrading
 and then finding out that their 3rd-party XRC apps no longer work).

Eventually, the xrc users should change their source code to move away from the 
OFED compatibility APIs.  An app needs to recompile regardless.  Existing apps 
will run into issues if they share the xrcd across jobs.  In that case, they 
will leak tgt qps.  There are also issues if an app calls the OFED 
ibv_modify_xrc_rcv_qp() or ibv_query_xrc_rcv_qp() APIs from a process other 
than the one which created the qp.  These are the main risks that I see.

 Having a new OFED support BOTH interfaces is a nightmare I don't even want to
 think about!

We're already in a situation where there are multiple libibverbs interfaces.  
The OFED compatibility patch to libibverbs was added specifically so that OFED 
could support both sets of APIs, while being binary compatible with the 
upstream ibverbs.  The proposed kernel patches do not support the functionality 
required for the OFED APIs, but it's not clear whether apps are really 
dependent on that functionality.  (I don't want to make MPI have to change 
their code right away either.)

- Sean


Re: [PATCH] Support optional performance counters, including congestion control performance counters.

2011-07-21 Thread Ira Weiny
On Wed, 20 Jul 2011 17:04:43 -0700
Albert Chu ch...@llnl.gov wrote:

 Hey everyone,
 
 Here's a new patch series for the optional performance counters.  It
 fixes up the issues brought up by Hal.  I also include a new patch that
 fixes up the incorrect BITSOFFS for other fields in libibmad.

Thanks, all 3 applied.

 
 Al
 
 On Wed, 2011-07-20 at 11:35 -0700, Hal Rosenstock wrote:
  Hi again Al,
  
  On 7/20/2011 1:38 PM, Albert Chu wrote:
   Hey Hal,
   
   Thanks for the nit-catches.  As for
   
    + {32, 2, "PortVLXmitFlowCtlUpdateErrors0", mad_dump_uint},
    + {34, 2, "PortVLXmitFlowCtlUpdateErrors1", mad_dump_uint},
    + {36, 2, "PortVLXmitFlowCtlUpdateErrors2", mad_dump_uint},
    + {38, 2, "PortVLXmitFlowCtlUpdateErrors3", mad_dump_uint},
    + {40, 2, "PortVLXmitFlowCtlUpdateErrors4", mad_dump_uint},
    + {42, 2, "PortVLXmitFlowCtlUpdateErrors5", mad_dump_uint},
    + {44, 2, "PortVLXmitFlowCtlUpdateErrors6", mad_dump_uint},
    + {46, 2, "PortVLXmitFlowCtlUpdateErrors7", mad_dump_uint},
    + {48, 2, "PortVLXmitFlowCtlUpdateErrors8", mad_dump_uint},
    + {50, 2, "PortVLXmitFlowCtlUpdateErrors9", mad_dump_uint},
    + {52, 2, "PortVLXmitFlowCtlUpdateErrors10", mad_dump_uint},
    + {54, 2, "PortVLXmitFlowCtlUpdateErrors11", mad_dump_uint},
    + {56, 2, "PortVLXmitFlowCtlUpdateErrors12", mad_dump_uint},
    + {58, 2, "PortVLXmitFlowCtlUpdateErrors13", mad_dump_uint},
    + {60, 2, "PortVLXmitFlowCtlUpdateErrors14", mad_dump_uint},
    + {62, 2, "PortVLXmitFlowCtlUpdateErrors15", mad_dump_uint},
  
   Don't these need to be BITSOFFS(nn, 2)  ?
   
   Perhaps there's a subtlety I'm missing.  If these require BITSOFFS, then
    wouldn't the 16-bit fields require it too?  There are many places
    amongst the performance counters where BITSOFFS isn't used with 16-bit
    fields.
  
  Yes; it looks like any field less than 32 bits should use BITSOFFS, so I
  think that there are some existing things to fix in fields.c (the 16-bit
  fields that are not using the macro).
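
  For example (just a sketch, assuming BITSOFFS takes the bit offset and width
  exactly as in the BITSOFFS(nn, 2) form quoted above), the new entries would
  then read:

     + {BITSOFFS(32, 2), "PortVLXmitFlowCtlUpdateErrors0", mad_dump_uint},
     + {BITSOFFS(34, 2), "PortVLXmitFlowCtlUpdateErrors1", mad_dump_uint},
     ...
     + {BITSOFFS(62, 2), "PortVLXmitFlowCtlUpdateErrors15", mad_dump_uint},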
  
  -- Hal
  
   Al
 -- 
 Albert Chu
 ch...@llnl.gov
 Computer Scientist
 High Performance Systems Division
 Lawrence Livermore National Laboratory
 


-- 
Ira Weiny
Math Programmer/Computer Scientist
Lawrence Livermore National Lab
925-423-8008
wei...@llnl.gov