Re: Promiscuous mode support in IPoIB
Thanks Woodruff for your response. Eli, can you please throw some light on this? Thank you for your help and time. On Mon, Jul 18, 2011 at 11:55 PM, Woodruff, Robert J robert.j.woodr...@intel.com wrote: Devesh Sharma wrote: Hello List, kindly could someone help me with this. I need to know about promiscuous mode support in IPoIB, and I am unsure whether it is possible at all on the IPoIB stack. If yes, then how? I am not sure that it is supported, but Eli would know for sure. -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC] XRC upstream merge reboot
On Wednesday 20 July 2011 21:51, Hefty, Sean wrote: I've tried to come up with a clean way to determine the lifetime of an xrc tgt qp, and I think the best approach is still: 1. Allow the creating process to destroy it at any time, and 2a. If not explicitly destroyed, the tgt qp is bound to the lifetime of the xrc domain, or 2b. The creating process specifies during the creation of the tgt qp whether the qp should be destroyed on exit. The MPIs associate an xrc domain with a job, so this should work. Everything else significantly complicates the usage model and implementation, both for verbs and the CM. An application can maintain a reference count out of band with a persistent server and use explicit destruction if they want to share the xrcd across jobs. I assume that you intend the persistent server to replace the reg_xrc_rcv_qp/unreg_xrc_rcv_qp verbs. Correct? Option 2a is the current implementation, but 2b should be a minor change. I'd like to reach a consensus on the right approach here, since there don't appear to be issues elsewhere. - Sean I have no opinion either way (with regard to tgt qp registration and reference counting). The OFED xrc implementation was driven by the requirements of the MPI community. If MPI can use a different XRC domain per job (and deallocate the domain at the job's end), this would solve the tgt qp lifetime problem (by destroying all the tgt qp's when the xrc domain is deallocated). Regarding option 2b: do you mean that in this case the tgt qp is NOT bound to the XRC domain lifetime? Who destroys the tgt qp in this case, when the creator indicates that the tgt qp should not be destroyed on exit? I am concerned with backwards compatibility here. It seems that XRC users will need to change their source code, not just recompile. I am assuming that OFED will take the mainstream kernel implementation at some point. 
Since this is **userspace** code, there could be a problem if OFED users upgrade their OFED installation to one which supports the new interface. This could be especially difficult if, for example, the customer is using 3rd-party packages which utilize the current OFED xrc interface. We could start seeing customers not take new OFED releases solely because of the XRC incompatibility (or worse, customers upgrading and then finding out that their 3rd-party XRC apps no longer work). Having a new OFED support BOTH interfaces is a nightmare I don't even want to think about! -Jack
Re: [RFC] XRC upstream merge reboot
On Thursday 21 July 2011 10:38, Jack Morgenstein wrote: Having a new OFED support BOTH interfaces is a nightmare I don't even want to think about! I over-reacted here, sorry about that. I know that it will be difficult to support both the old and the new interfaces. However, to support the current OFED customer base, we will do so, easing the interface transition for the apps, which we expect will take place over time. Sean, we will do our best to help you with getting XRC into the mainstream kernel. -Jack
Re: [RFC] XRC upstream merge reboot
On Jul 21, 2011, at 3:38 AM, Jack Morgenstein wrote: If MPI can use a different XRC domain per job (and deallocate the domain at the job's end), this would solve the tgt qp lifetime problem (by destroying all the tgt qp's when the xrc domain is deallocated). What happens if the MPI job crashes and does not properly deallocate the XRC domain / tgt qp? -- Jeff Squyres jsquy...@cisco.com For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/
Re: [RFC] XRC upstream merge reboot
On Jul 21, 2011, at 8:47 AM, Jack Morgenstein wrote: [snip] When the last user of an XRC domain exits cleanly (or crashes), the domain should be destroyed. In this case, with Sean's design, the tgt qp's for the XRC domain should also be destroyed. Sounds perfect. -- Jeff Squyres jsquy...@cisco.com
[PATCH] IB/qib: defer hca error events to tasklet
With ib_qib options as follows:

options ib_qib krcvqs=1 pcie_caps=0x51 rcvhdrcnt=4096 singleport=1 ibmtu=4

A run of ib_write_bw -a yields the following:

--
#bytes  #iterations  BW peak[MB/sec]  BW average[MB/sec]
1048576 5000         2910.64          229.80
--

The top cpu use in a profile is:

CPU: Intel Architectural Perfmon, speed 2400.15 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (No unit mask) count 1002300
Counted LLC_MISSES events (Last level cache demand requests from this core that missed the LLC) with a unit mask of 0x41 (No unit mask) count 1

samples  %        samples  %        app name   symbol name
15237    29.2642  964      17.1195  ib_qib.ko  qib_7322intr
12320    23.6618  1040     18.4692  ib_qib.ko  handle_7322_errors
4106     7.8860   0        0        vmlinux    vsnprintf

Analysis of the stats, profile, the code, and the annotated profile indicate:
- All of the overflow interrupts (one per packet overflow) are serviced on CPU0 with no mitigation on the frequency.
- All of the receive interrupts are being serviced by CPU0. (That is the way truescale.cmds statically allocates the kctx IRQs to CPUs.)
- The code is spending all of its time servicing QIB_I_C_ERROR RcvEgrFullErr interrupts on CPU0, starving the packet receive processing.
- The decode_err routine is very inefficient, using a printf variant to format a %s, and continues to loop when the errs mask has been cleared.
- Both qib_7322intr and handle_7322_errors read PCI registers, which is very inefficient.

The fix does the following:
- Adds a tasklet to service the QIB_I_C_ERROR.
- Replaces the very inefficient scnprintf() with a memcpy(). A field is added to qib_hwerror_msgs to save the sizeof(string) at compile time so that a strlen is not needed during err_decode().
- The most frequent errors (overflows) are serviced first to exit the loop as early as possible.
- The loop now exits as soon as the errs mask is clear, rather than fruitlessly looping through the msp array.

With this fix the performance changes to:

--
#bytes  #iterations  BW peak[MB/sec]  BW average[MB/sec]
1048576 5000         2990.64          2941.35
--

During testing of the error handling overflow patch, it was determined that some CPUs were slower when servicing both overflow and receive interrupts on CPU0 with different MSI interrupt vectors. This patch adds an option (krcvq01_no_msi) to not use a dedicated MSI interrupt for kctxs < 2 and to service them on the default interrupt. For some CPUs, the cost of the interrupt enter/exit is more costly than the additional PCI read in the default handler.

Signed-off-by: Mike Marciniszyn mike.marcinis...@qlogic.com
---
 drivers/infiniband/hw/qib/qib.h         |  3 +
 drivers/infiniband/hw/qib/qib_iba7322.c | 71 ++-
 2 files changed, 53 insertions(+), 21 deletions(-)

diff --git a/drivers/infiniband/hw/qib/qib.h b/drivers/infiniband/hw/qib/qib.h
index 769a1d9..c9624ea 100644
--- a/drivers/infiniband/hw/qib/qib.h
+++ b/drivers/infiniband/hw/qib/qib.h
@@ -1012,6 +1012,8 @@ struct qib_devdata {
 	u8 psxmitwait_supported;
 	/* cycle length of PS* counters in HW (in picoseconds) */
 	u16 psxmitwait_check_rate;
+	/* high volume overflow errors defered to tasklet */
+	struct tasklet_struct error_tasklet;
 };

 /* hol_state values */
@@ -1433,6 +1435,7 @@ extern struct mutex qib_mutex;
 struct qib_hwerror_msgs {
 	u64 mask;
 	const char *msg;
+	size_t sz;
 };

 #define QLOGIC_IB_HWE_MSG(a, b) { .mask = a, .msg = b }
diff --git a/drivers/infiniband/hw/qib/qib_iba7322.c b/drivers/infiniband/hw/qib/qib_iba7322.c
index 821226c..5ea9ece 100644
--- a/drivers/infiniband/hw/qib/qib_iba7322.c
+++ b/drivers/infiniband/hw/qib/qib_iba7322.c
@@ -114,6 +114,10 @@ static ushort qib_singleport;
 module_param_named(singleport, qib_singleport, ushort, S_IRUGO);
 MODULE_PARM_DESC(singleport, "Use only IB port 1; more per-port buffer space");

+static ushort qib_krcvq01_no_msi;
+module_param_named(krcvq01_no_msi, qib_krcvq01_no_msi, ushort, S_IRUGO);
+MODULE_PARM_DESC(krcvq01_no_msi, "No MSI for kctx < 2");
+
 /*
  * Receive header queue sizes
  */
@@ -1106,9 +1110,9 @@ static inline u32 read_7322_creg32_port(const struct qib_pportdata *ppd,
 #define AUTONEG_TRIES 3 /* sequential retries to negotiate DDR */

 #define HWE_AUTO(fldname) { .mask = SYM_MASK(HwErrMask, fldname##Mask), \
-	.msg = #fldname }
+	.msg = #fldname , .sz = sizeof(#fldname) }
 #define HWE_AUTO_P(fldname, port) { .mask = SYM_MASK(HwErrMask, \
-
Re: [PATCH] IB/qib: defer hca error events to tasklet
+static ushort qib_krcvq01_no_msi;
+module_param_named(krcvq01_no_msi, qib_krcvq01_no_msi, ushort, S_IRUGO);
+MODULE_PARM_DESC(krcvq01_no_msi, "No MSI for kctx < 2");

First, the obvious question: is there really no better way to handle this than to have yet another cryptically named module parameter for users to try both ways? Second, why can't this just be module_param() (why is _named needed?)?
RE: [RFC] XRC upstream merge reboot
If you use file descriptors for the XRC domain, then when the last user of the domain exits, the domain gets destroyed (at least this is how it works in OFED; Sean's code looks the same). In this case, the kernel cleanup code for the process should close the XRC domains opened by that process, so there is no leakage. When the last user of an XRC domain exits cleanly (or crashes), the domain should be destroyed. In this case, with Sean's design, the tgt qp's for the XRC domain should also be destroyed. Sean, is this correct? This is correct.
RE: [PATCH] IB/qib: defer hca error events to tasklet
There is no choice but a module parameter, since the MSI-X initialization is done at driver load, and the option is only needed after testing reveals that the change is necessary. I agree that there are a lot of module parameters. Some that don't allocate resources at load time could be sysfs. I can re-issue the patch with module_param and change the identifier. The module parameters used throughout the code usually prepend a qib_ or an ib_qib_, and I was following precedent in the code. The only one without a prefix I see is sdma_descq_cnt, which could change as well to use module_param. Mike

-Original Message- From: rol...@purestorage.com [mailto:rol...@purestorage.com] On Behalf Of Roland Dreier Sent: Thursday, July 21, 2011 11:47 AM To: Mike Marciniszyn Cc: linux-rdma@vger.kernel.org Subject: Re: [PATCH] IB/qib: defer hca error events to tasklet

+static ushort qib_krcvq01_no_msi;
+module_param_named(krcvq01_no_msi, qib_krcvq01_no_msi, ushort, S_IRUGO);
+MODULE_PARM_DESC(krcvq01_no_msi, "No MSI for kctx < 2");

First, the obvious question: is there really no better way to handle this than to have yet another cryptically named module parameter for users to try both ways? Second, why can't this just be module_param() (why is _named needed?)?
RE: [RFC] XRC upstream merge reboot
I've tried to come up with a clean way to determine the lifetime of an xrc tgt qp, and I think the best approach is still: 1. Allow the creating process to destroy it at any time, and 2a. If not explicitly destroyed, the tgt qp is bound to the lifetime of the xrc domain, or 2b. The creating process specifies during the creation of the tgt qp whether the qp should be destroyed on exit. The MPIs associate an xrc domain with a job, so this should work. Everything else significantly complicates the usage model and implementation, both for verbs and the CM. An application can maintain a reference count out of band with a persistent server and use explicit destruction if they want to share the xrcd across jobs. I assume that you intend the persistent server to replace the reg_xrc_rcv_qp/unreg_xrc_rcv_qp verbs. Correct? I'm suggesting that anyone who wants to share an xrcd across jobs can use out of band communication to maintain their own reference count, rather than pushing that feature into the mainline. This requires a code change for apps that have coded to OFED and use this feature. I have no opinion either way (with regard to tgt qp registration and reference counting). The OFED xrc implementation was driven by the requirements of the MPI community. From the email threads I followed, it was a request from HP MPI. The other MPIs have used the same interface since it was what was defined, but do not appear to be sharing the xrcd across jobs. HP has since canceled their MPI product. Regarding option 2b: do you mean that in this case the tgt qp is NOT bound to the XRC domain lifetime? Who destroys the tgt qp in this case, when the creator indicates that the tgt qp should not be destroyed on exit? With option 2b, the tgt qp lifetime is either tied to the life of the creating process or the xrcd. The creating process specifies which on creation. 
Basically, the choice allows the creating process to destroy the tgt qp when it exits, rather than waiting until the xrcd is closed. Note that ibverbs only considers the life of the tgt qp, but we also need to consider the life of a corresponding connection maintained by the IB CM. I am concerned with backwards compatibility, here. It seems that XRC users will need to change their source-code, not just recompile. I am assuming that OFED will take the mainstream kernel implementation at some point. Since this is **userspace** code, there could be a problem if OFED users upgrade their OFED installation to one which supports the new interface. This could be especially difficult if, for example, the customer is using 3rd-party packages which utilize the current OFED xrc interface. We could start seeing customers not take new OFED releases solely because of the XRC incompatibility (or worse, customers upgrading and then finding out that their 3rd-party XRC apps no longer work). Eventually, the xrc users should change their source code to move away from the OFED compatibility APIs. An app needs to recompile regardless. Existing apps will run into issues if they share the xrcd across jobs. In that case, they will leak tgt qps. There are also issues if an app calls the OFED ibv_modify_xrc_rcv_qp() or ibv_query_xrc_rcv_qp() APIs from a process other than the one which created the qp. These are the main risks that I see. Having a new OFED support BOTH interfaces is a nightmare I don't even want to think about! We're already in a situation where there are multiple libibverbs interfaces. The OFED compatibility patch to libibverbs was added specifically so that OFED could support both sets of APIs, while being binary compatible with the upstream ibverbs. The proposed kernel patches do not support the functionality required for the OFED APIs, but it's not clear whether apps are really dependent on that functionality. 
(I don't want to make MPI have to change their code right away either.) - Sean
Re: [PATCH] Support optional performance counters, including congestion control performance counters.
On Wed, 20 Jul 2011 17:04:43 -0700 Albert Chu ch...@llnl.gov wrote: Hey everyone, Here's a new patch series for the optional performance counters. It fixes up the issues brought up by Hal. I also include a new patch that fixes up the incorrect BITSOFFS for other fields in libibmad. Thanks, all 3 applied. Al On Wed, 2011-07-20 at 11:35 -0700, Hal Rosenstock wrote: Hi again Al, On 7/20/2011 1:38 PM, Albert Chu wrote: Hey Hal, Thanks for the nit-catches. As for

+ {32, 2, "PortVLXmitFlowCtlUpdateErrors0", mad_dump_uint},
+ {34, 2, "PortVLXmitFlowCtlUpdateErrors1", mad_dump_uint},
+ {36, 2, "PortVLXmitFlowCtlUpdateErrors2", mad_dump_uint},
+ {38, 2, "PortVLXmitFlowCtlUpdateErrors3", mad_dump_uint},
+ {40, 2, "PortVLXmitFlowCtlUpdateErrors4", mad_dump_uint},
+ {42, 2, "PortVLXmitFlowCtlUpdateErrors5", mad_dump_uint},
+ {44, 2, "PortVLXmitFlowCtlUpdateErrors6", mad_dump_uint},
+ {46, 2, "PortVLXmitFlowCtlUpdateErrors7", mad_dump_uint},
+ {48, 2, "PortVLXmitFlowCtlUpdateErrors8", mad_dump_uint},
+ {50, 2, "PortVLXmitFlowCtlUpdateErrors9", mad_dump_uint},
+ {52, 2, "PortVLXmitFlowCtlUpdateErrors10", mad_dump_uint},
+ {54, 2, "PortVLXmitFlowCtlUpdateErrors11", mad_dump_uint},
+ {56, 2, "PortVLXmitFlowCtlUpdateErrors12", mad_dump_uint},
+ {58, 2, "PortVLXmitFlowCtlUpdateErrors13", mad_dump_uint},
+ {60, 2, "PortVLXmitFlowCtlUpdateErrors14", mad_dump_uint},
+ {62, 2, "PortVLXmitFlowCtlUpdateErrors15", mad_dump_uint},

Don't these need to be BITSOFFS(nn, 2)? Perhaps there's a subtlety I'm missing. If these require BITSOFFS, then wouldn't the 16-bit fields require them too? There are many places amongst the performance counters where BITSOFFS isn't used with 16-bit fields. Yes; it looks like any field less than 32 bits should use BITSOFFS, so I think that there are some existing things to fix in fields.c (the 16-bit fields that are not using the macro). 
-- Hal

Al
--
Albert Chu ch...@llnl.gov Computer Scientist High Performance Systems Division Lawrence Livermore National Laboratory
--
Ira Weiny Math Programmer/Computer Scientist Lawrence Livermore National Lab 925-423-8008 wei...@llnl.gov