ibutils license problems + a few other minor issues
The ibutils tarball as found for download on the openfabrics web site has some licensing issues that need resolved ASAP. The tarball in its current form is probably illegal to distribute until the items are fixed. So, here's what popped up in review (done by Jay Fenlason): The COPYING files sprinkled around in the tarball are not consistent. Some of them contain a copy of the GPLv2, some of them contain a statement that the package is licensed under either GPLv2 in the file COPYING or the OFA BSD license included here in. Of course, it doesn't work to reference a COPYING file from within the COPYING file you are referencing and say it has the information you aren't including here. I would suggest that the proper place for the statement of GPLv2 or BSD is in possibly a LICENSE file with the COPYING file being the GPLv2 as stated. It also seems redundant to have multiple COPYING files and possibly also multiple LICENSE files in a single tarball, so cleanup of unnecessary files seems in order here. There are a number of files who have headers that indicate they are *not* under either GPLv2 or BSD. ibdm/replace/regex.[ch] are LGPLv2+, so need a LGPLv2+ COPYING somewhere ibdm/replace/{memset.c,realloc.c,malloc.c} are GPLv2+ versus GPLv2 ibdm/src/fabric_sim.cpp claims to be confidential and proprietary! It refers to LICENSE.txt for t&c. There is no LICENSE.txt under ibdm/. There is one under ibmtsim/utils/ though. ibmgtsim/config/{missing,config.sub,ltmain.sh,depcomp,config.guess} is GPLv2+ ibmgtsim/config/install-sh is X license ibmgtsim/utils/{,un}install.sh is GPLv2+ There are also a few other minor issues with the tarball as it builds: /usr/bin/ibnlparse has no man page /usr/share/man/man1/ibdm-ibnl-file.1.gz /usr/share/man/man1/ibdm-topo-file.1.gz Are in man1, but have no corresponding executables. Perhaps they should be in a different man section? -- Doug Ledford GPG KeyID: CFBFF194 http://people.redhat.com/dledford Infiniband specific RPMs available at http://people.redhat.com/dledford/Infiniband signature.asc Description: OpenPGP digital signature
Re: [ewg] [PATCH v4] IB Core: RAW ETH support
On 06/16/2010 02:35 PM, Steve Wise wrote: > Jason Gunthorpe wrote: >> On Wed, Jun 16, 2010 at 12:12:06PM -0500, Steve Wise wrote: >> >> >>>>> Granted our dev process may not be documented, but I always assumed >>>>> the general idea was to get changes accepted upstream, then pull >>>>> into ofed. OFED is just a mechanism to make top-of-tree linux work >>>>> on distro kernels. There are some exceptions, but this stuff >>>>> shouldn't be an exception. >>>>> >> >> >>>> That is what many people wish for, me included, but it is not at all >>>> what generally happens :( >>>> >>>> In my observation the typical flow is: >>>> - A patch is written, it may or may not be sent to the list >>>> - 'business drivers' get it slammed into OFED right away >>>> - A patch is finally sent for proper review >>>> - It is not merged, there are comments.. >>>> - Interest in doing anything is lost because it is already in >>>>OFED and that is all that matters, right? >>>> - People complain. >>>> >>>> For instance, the iWarp thingy we were just discussing fits this >>>> process rather well. >>>> >> >> >>> You're wrong. I started that iWARP change in 2007 on LKLM. I >>> proposed a few ideas and show the pros/cons of each. And it was >>> NAKed 100% by mister miller.It was then included in OFED as a >>> last resort only because I couldn't get any help with trying to add >>> this upstream in any form. I even spent a few weeks developing a >>> way to administor "iwarp only" ipaddresses, but Roland didn't like >>> that scheme for various reasons. So please don't mention that >>> particular patch as a "bad process" unless you want to argue with me >>> some more about it. >>> >> >> Uhm, what you just described does fit my process outline: >> >> #1 - Patch written, sent to LKML. Check. >> #3 - Patch sent for proper review - in 2007. Check. >> #4 - Not merged. NAK by DM. Check. >> #2 - 'business drivers' force it into OFED - 'last resort' ie, iWarp >> cards can't be used without some fix. Check. >> #5 - Interest is lost. Yep, this was done in 2007, and it was idle >> till now. Check. >> #6 - People Complain. Hmm. Yep. Check. >> >> > > Note the ordering is different. IE I tried very hard to get the "right" > solution designed and agreed-upon upstream. But failed. That's my > bad.I did, however help with the iWARP core code including neighbour > update net events which did go in upstream before ofed. > >> Don't think I'm being critical toward only you, or singling out that >> little iWarp patch. But it really isn't special, or different, or an >> exception. Pick nearly any patch in OFED and someone will rush to its >> defense with a 'we tried to follow the process and it failed, so we >> did it anyway' argument. >> >> I also didn't say this is the only way that RDMA development goes, >> lots and lots of stuff goes into mainline first, from everyone. >> >> Jason >> > > > OFED maintainers should be more rigid, perhaps, with requiring that > changes be accepted upstream first. One observation is that there is no > "OFED RDMA maintainer", aka a Roland Dreier, for the OFED code. So each > driver maintainer pretty much has free reign to do the right thing or > the wrong thing... Yep, no doubt that has an impact on things. It's for this very reason that our next operating system is not following OFED but instead is using upstream as its basis. That will be true from now on with our products. -- Doug Ledford GPG KeyID: CFBFF194 http://people.redhat.com/dledford Infiniband specific RPMs available at http://people.redhat.com/dledford/Infiniband signature.asc Description: OpenPGP digital signature
Re: [ewg] [PATCH v4] IB Core: RAW ETH support
On 06/16/2010 01:12 PM, Steve Wise wrote: > Jason Gunthorpe wrote: >> On Wed, Jun 16, 2010 at 09:09:59AM -0500, Steve Wise wrote: >> >> >>> Granted our dev process may not be documented, but I always assumed >>> the general idea was to get changes accepted upstream, then pull into >>> ofed. OFED is just a mechanism to make top-of-tree linux work on >>> distro kernels. There are some exceptions, but this stuff shouldn't >>> be an exception. >>> >> >> That is what many people wish for, me included, but it is not at all >> what generally happens :( >> >> In my observation the typical flow is: >> - A patch is written, it may or may not be sent to the list >> - 'business drivers' get it slammed into OFED right away >> - A patch is finally sent for proper review >> - It is not merged, there are comments.. >> - Interest in doing anything is lost because it is already in >>OFED and that is all that matters, right? >> - People complain. >> >> For instance, the iWarp thingy we were just discussing fits this >> process rather well. >> >> > > You're wrong. I started that iWARP change in 2007 on LKLM. I proposed > a few ideas and show the pros/cons of each. And it was NAKed 100% by > mister miller.It was then included in OFED as a last resort only ^ Which, of course, is the problem. Once you have a solution besides "get it upstream", you throw whatever you feel like into OFED instead of whatever upstream will accept. How long has OFED shipped the XRC stuff now while it *still* isn't upstream? > because I couldn't get any help with trying to add this upstream in any > form. Again, OFED is part of the reason this failed. That users had someplace else to get "working" code besides upstream meant that you didn't have end users putting pressure on the upstream kernel folks to accept *some* form of solution. So, your job was harder because there were no users present to put pressure on mister miller or others, and then you perpetuated the issue by caving and going back to OFED "as a last resort". It has become a "last resort" so often now that trying to get things upstream first is just a sort of private joke amongst some people I think. > I even spent a few weeks developing a way to administor "iwarp > only" ipaddresses, but Roland didn't like that scheme for various > reasons. So please don't mention that particular patch as a "bad > process" unless you want to argue with me some more about it. > > Also, the chelsio iWARP driver has 100% been upstream first, then > ofed. Some of us are indeed trying to do the right thing. > > OFED just needs to go away. It's been far too abused for far too long and it's mere existence is hindering upstream development. I appreciate that you attempt to do the right thing most of the time, but it really needs to be all of the time, and you need your users right there beside you in order to carry the weight you need in order to get solutions designed and accepted instead of running into the brick wall you ran into before. -- Doug Ledford GPG KeyID: CFBFF194 http://people.redhat.com/dledford Infiniband specific RPMs available at http://people.redhat.com/dledford/Infiniband signature.asc Description: OpenPGP digital signature
Re: [PATCH 2/2] ib/cm: Bump reference count on cm_id before invoking callback
On Wed, 2011-02-23 at 08:17 -0800, Hefty, Sean wrote: > When processing a SIDR REQ, the ib_cm allocates a new cm_id. The refcount > of the cm_id is initialized to 1. However, cm_process_work will > decrement the refcount after invoking all callbacks. The result is that > the cm_id will end up with refcount set to 0 by the end of the sidr req > handler. > > If a user tries to destroy the cm_id, the destruction will proceed, under > the incorrect assumption that no other threads are referencing the cm_id. > This can lead to a crash when the cm callback thread tries to access > the cm_id. > > This problem was noticed as part of a larger investigation with kernel > crashes in the rdma_cm when running on a real time OS. > > Signed-off-by: Sean Hefty Acked-by: Doug Ledford > --- > > drivers/infiniband/core/cm.c |1 + > 1 files changed, 1 insertions(+), 0 deletions(-) > > diff --git a/drivers/infiniband/core/cm.c b/drivers/infiniband/core/cm.c > index 64e0903..1d9616b 100644 > --- a/drivers/infiniband/core/cm.c > +++ b/drivers/infiniband/core/cm.c > @@ -2989,6 +2989,7 @@ static int cm_sidr_req_handler(struct cm_work *work) > goto out; /* No match. */ > } > atomic_inc(&cur_cm_id_priv->refcount); > + atomic_inc(&cm_id_priv->refcount); > spin_unlock_irq(&cm.lock); > > cm_id_priv->id.cm_handler = cur_cm_id_priv->id.cm_handler; > > -- Doug Ledford GPG KeyID: CFBFF194 http://people.redhat.com/dledford Infiniband specific RPMs available at http://people.redhat.com/dledford/Infiniband signature.asc Description: This is a digitally signed message part
Re: [PATCH 1/2] rdma/cm: Fix crash in request handlers
On Wed, 2011-02-23 at 08:11 -0800, Hefty, Sean wrote: > Doug Ledford and RedHat reported a crash when running the rdma_cm on a real > time OS. The crash has the following call trace: > > cm_process_work >cma_req_handler > cma_disable_callback > rdma_create_id > kzalloc > init_completion > cma_get_net_info > cma_save_net_info > cma_any_addr > cma_zero_addr > rdma_translate_ip > rdma_copy_addr > cma_acquire_dev > rdma_addr_get_sgid > ib_find_cached_gid > cma_attach_to_dev > ucma_event_handler > kzalloc > ib_copy_ah_attr_to_user > cma_comp > > [ preempted ] > > cma_write > copy_from_user > ucma_destroy_id >copy_from_user >_ucma_find_context >ucma_put_ctx >ucma_free_ctx > rdma_destroy_id > cma_exch > cma_cancel_operation > rdma_node_get_transport > > rt_mutex_slowunlock > bad_area_nosemaphore > oops_enter > > They were able to reproduce the crash multiple times with the following > details: > > Crash seems to always happen on the: > mutex_unlock(&conn_id->handler_mutex); > as conn_id looks to have been freed during this code path. > > An examination of the code shows that a race exists in the request handlers. > When a new connection request is received, the rdma_cm allocates a new > connection identifier. This identifier has a single reference count on it. > If a user calls rdma_destroy_id() from another thread after receiving a > callback, rdma_destroy_id will proceed to destroy the id and free > the associated memory. However, the request handlers may still be in > the process of running. When control returns to the request handlers, > they can attempt to access the newly created identifiers. > > Fix this by holding a reference on the newly created rdma_cm_id until > the request handler is through accessing it. > > Signed-off-by: Sean Hefty Acked-by: Doug Ledford > --- > This is obviously serious enough to cause a crash, but the bug has also been > there > for a while before anyone detected it. I have no real opinion on whether this > targets 2.6.38 (if that's even possible at this point) or 2.6.39. > > drivers/infiniband/core/cma.c | 15 +++ > 1 files changed, 15 insertions(+), 0 deletions(-) > > diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c > index 6884da2..e450c5a 100644 > --- a/drivers/infiniband/core/cma.c > +++ b/drivers/infiniband/core/cma.c > @@ -1210,6 +1210,11 @@ static int cma_req_handler(struct ib_cm_id *cm_id, > struct ib_cm_event *ib_event) > cm_id->context = conn_id; > cm_id->cm_handler = cma_ib_handler; > > + /* > + * Protect against the user destroying conn_id from another thread > + * until we're done accessing it. > + */ > + atomic_inc(&conn_id->refcount); > ret = conn_id->id.event_handler(&conn_id->id, &event); > if (!ret) { > /* > @@ -1222,8 +1227,10 @@ static int cma_req_handler(struct ib_cm_id *cm_id, > struct ib_cm_event *ib_event) > ib_send_cm_mra(cm_id, CMA_CM_MRA_SETTING, NULL, 0); > mutex_unlock(&lock); > mutex_unlock(&conn_id->handler_mutex); > + cma_deref_id(conn_id); > goto out; > } > + cma_deref_id(conn_id); > > /* Destroy the CM ID by returning a non-zero value. */ > conn_id->cm_id.ib = NULL; > @@ -1425,17 +1432,25 @@ static int iw_conn_req_handler(struct iw_cm_id *cm_id, > event.param.conn.private_data_len = iw_event->private_data_len; > event.param.conn.initiator_depth = attr.max_qp_init_rd_atom; > event.param.conn.responder_resources = attr.max_qp_rd_atom; > + > + /* > + * Protect against the user destroying conn_id from another thread > + * until we're done accessing it. > + */ > + atomic_inc(&conn_id->refcount); > ret = conn_id->id.event_handler(&conn_id->id, &event); > if (ret) { > /* User wants to destroy the CM ID */ > conn_id->cm_id.iw = NULL; > cma_exch(conn_id, CMA_DESTROYING); > mutex_unlock(&conn_id->handler_mutex); > + cma_deref_id(conn_id); > rdma_destroy_id(&conn_id->id); > goto out; > } > > mutex_unlock(&conn_id->handler_mutex); > + cma_deref_id(conn_id); > > out: > if (dev) > > -- Doug Ledford GPG KeyID: CFBFF194 http://people.redhat.com/dledford Infiniband specific RPMs available at http://people.redhat.com/dledford/Infiniband signature.asc Description: This is a digitally signed message part
Re: [PATCH 1/2] rdma/cm: Fix crash in request handlers
On Mon, 2011-03-14 at 21:22 -0700, Roland Dreier wrote: > On Mon, Mar 14, 2011 at 7:27 PM, Roland Dreier wrote: > > Doesn't that mean unprivileged userspace could trigger a use-after-free > > in the kernel? (and it might be malicious code, not buggy userspace) > > >From reading the code a bit, I guess ucma is OK in this area. I do see > what seems like an exploitable race in ucma_create_id(): > > - one thread create an id with an invalid userspace pointer >(so the copy_to_user in ucma_create_id returns -EFAULT >and calls rdma_destroy_id before idr_remove) > - another thread guess the id that is going to be returned and >call ucma_destroy_id() > > if the second thread hits the window where the cm_id is > destroyed but the ctx is still in the idr, it can trigger a double free. > > But this is a completely different bug. > > - R. Heh, nice. This would be a good example of why I prefer that rules such as "cm_destroy_id/rdma_destroy_id/ucma_destroy_id can only be called once per ID" be coded into the destroy function themselves and enforced. As long as you are talking about user space guessing ID's, it is patently impossible to guarantee that we won't call a destroy function more than once. -- Doug Ledford GPG KeyID: CFBFF194 http://people.redhat.com/dledford Infiniband specific RPMs available at http://people.redhat.com/dledford/Infiniband signature.asc Description: This is a digitally signed message part
[Patch] mlx4: make mcg entry size a module parameter
We ran across a problem at a customer's site where the qp array in the mcg entry was being filling up and denying further qp attaches. In addition, the upcoming SRIOV support is expected to increase use of this array as well for unicast address steering (or so the comments in a patch Yevgeny's original patch that statically increased the size of the mcg entries stated). But, since increasing the size of this to an arbitrarily large number is just a waste of memory, and since we don't know that 0x200 will be large enough for all use cases, make the option a module parameter instead. This has been tested at our customer's site and solves their problem. commit ff608ce370b49d2e5b614ff91f4e23b5deaac8a4 Author: Doug Ledford Date: Wed Mar 23 12:20:47 2011 -0400 mlx4: make the size of the mcg entry a module parameter Testing showed that the default size of 0x100 could be overrun. Bumping to 0x200 would fix the problem, but only until we hit the cap again, and at the expense of making memory consumption in all scenarios worse. So, make the size of the mcg entry a module parameter and let those people who need to bump the size do so. Signed-off-by: Doug Ledford diff --git a/drivers/infiniband/hw/mlx4/main.c b/drivers/infiniband/hw/mlx4/main.c index c7a6213..322f0af 100644 --- a/drivers/infiniband/hw/mlx4/main.c +++ b/drivers/infiniband/hw/mlx4/main.c @@ -158,7 +158,7 @@ static int mlx4_ib_query_device(struct ib_device *ibdev, props->masked_atomic_cap = IB_ATOMIC_HCA; props->max_pkeys = dev->dev->caps.pkey_table_len[1]; props->max_mcast_grp = dev->dev->caps.num_mgms + dev->dev->caps.num_amgms; - props->max_mcast_qp_attach = dev->dev->caps.num_qp_per_mgm; + props->max_mcast_qp_attach = dev->dev->caps.num_qp_per_mcg; props->max_total_mcast_qp_attach = props->max_mcast_qp_attach * props->max_mcast_grp; props->max_map_per_fmr = (1 << (32 - ilog2(dev->dev->caps.num_mpts))) - 1; diff --git a/drivers/net/mlx4/fw.c b/drivers/net/mlx4/fw.c index 5de1db8..800cb2d 100644 --- a/drivers/net/mlx4/fw.c +++ b/drivers/net/mlx4/fw.c @@ -434,8 +434,8 @@ int mlx4_QUERY_DEV_CAP(struct mlx4_dev *dev, struct mlx4_dev_cap *dev_cap) dev_cap->reserved_mrws, dev_cap->reserved_mtts); mlx4_dbg(dev, "Max PDs: %d, reserved PDs: %d, reserved UARs: %d\n", dev_cap->max_pds, dev_cap->reserved_pds, dev_cap->reserved_uars); - mlx4_dbg(dev, "Max QP/MCG: %d, reserved MGMs: %d\n", -dev_cap->max_pds, dev_cap->reserved_mgms); + mlx4_dbg(dev, "Max QP/MCG: %d, Max MCGs: %d, reserved MGMs: %d\n", +dev_cap->max_qp_per_mcg, dev_cap->max_mcgs, dev_cap->reserved_mgms); mlx4_dbg(dev, "Max CQEs: %d, max WQEs: %d, max SRQ WQEs: %d\n", dev_cap->max_cq_sz, dev_cap->max_qp_sz, dev_cap->max_srq_sz); mlx4_dbg(dev, "Local CA ACK delay: %d, max MTU: %d, port width cap: %d\n", diff --git a/drivers/net/mlx4/main.c b/drivers/net/mlx4/main.c index 2765a3c..5306141 100644 --- a/drivers/net/mlx4/main.c +++ b/drivers/net/mlx4/main.c @@ -101,6 +101,10 @@ module_param_named(use_prio, use_prio, bool, 0444); MODULE_PARM_DESC(use_prio, "Enable steering by VLAN priority on ETH ports " "(0/1, default 0)"); +static int log_mcg_size = 8; +module_param_named(log_mcg_size, log_mcg_size, int, 0444); +MODULE_PARM_DESC(log_mcg_size, "Log2 size of MCG struct (8-11)"); + static int log_mtts_per_seg = ilog2(MLX4_MTT_ENTRY_PER_SEG); module_param_named(log_mtts_per_seg, log_mtts_per_seg, int, 0444); MODULE_PARM_DESC(log_mtts_per_seg, "Log2 number of MTT entries per segment (1-7)"); @@ -203,7 +207,14 @@ static int mlx4_dev_cap(struct mlx4_dev *dev, struct mlx4_dev_cap *dev_cap) dev->caps.reserved_srqs = dev_cap->reserved_srqs; dev->caps.max_sq_desc_sz = dev_cap->max_sq_desc_sz; dev->caps.max_rq_desc_sz = dev_cap->max_rq_desc_sz; - dev->caps.num_qp_per_mgm = MLX4_QP_PER_MGM; + dev->caps.max_qp_per_mcg = dev_cap->max_qp_per_mcg; + dev->caps.max_mcgs = dev_cap->max_mcgs; + i = 0; + do { + dev->caps.mcg_entry_size = 1 << (log_mcg_size - i++); + dev->caps.num_qp_per_mcg = 4 * (dev->caps.mcg_entry_size / 16 - 2); + } while (dev->caps.num_qp_per_mcg > dev->caps.max_qp_per_mcg); + /* * Subtract 1 from the limit because we need to allocate a * spare CQE so the HCA HW can tell the difference between an @@ -642,7 +653,7 @@ static int mlx4_init_icm(struct mlx4_dev *dev, struct mlx4_
Re: [PATCH for-next V2 01/22] IB/core: Reserve bits in enum ib_qp_create_flags for low-level driver use
On 8/3/2012 4:40 AM, Jack Morgenstein wrote: > Reserve bits 26-31 for internal use by low-level drivers. Two > such bits are used in the mlx4 driver SRIOV IB implementation. > > These enum additions guarantee that the core layer will never use > these bits, so that low level drivers may safely make use of them. > > Signed-off-by: Jack Morgenstein > --- > include/rdma/ib_verbs.h |3 +++ > 1 files changed, 3 insertions(+), 0 deletions(-) > > diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h > index 07996af..46bc045 100644 > --- a/include/rdma/ib_verbs.h > +++ b/include/rdma/ib_verbs.h > @@ -614,6 +614,9 @@ enum ib_qp_type { > enum ib_qp_create_flags { > IB_QP_CREATE_IPOIB_UD_LSO = 1 << 0, > IB_QP_CREATE_BLOCK_MULTICAST_LOOPBACK = 1 << 1, > + /* reserve bits 26-31 for low level drivers' internal use */ > + IB_QP_CREATE_RESERVED_START = 1 << 26, > + IB_QP_CREATE_RESERVED_END = 1 << 31, > }; > > struct ib_qp_init_attr { > Reserving 6 bits for driver use out of 32 seems reasonable. Acked-by: Doug Ledford -- Doug Ledford GPG KeyID: 0E572FDD http://people.redhat.com/dledford Infiniband specific RPMs available at http://people.redhat.com/dledford/Infiniband signature.asc Description: OpenPGP digital signature
Re: [PATCH for-next V2 02/22] IB/core: change pkey table lookups to support full and partial membership for the same pkey
On 8/3/2012 4:40 AM, Jack Morgenstein wrote: > Enhance the cached and non-cached pkey table lookups to enable limited and > full > members of the same pkey to co-exist in the pkey table. > > This is necessary for SRIOV to allow for a scheme where some guests would > have the full > membership pkey in their virtual pkey table, where other guests on the same > hypervisor > would have the limited one. In that sense, its an extension of the IBTA model > for > non virtualized nodes. OK, maybe I'm not getting something, but I'm curious why we always pick the full pkey in preference to the partial pkey. Shouldn't we pick the pkey that's appropriate for the vHCA sending the message? Also, given the rule of least surprise, don't you think it would be best to rename this function ib_find_cached_full_or_parital_pkey and in your next patch instead of naming it ib_find_exact_pkey just call that one ib_find_cached_pkey? > To accomplish this, we need both the limited and full membership pkeys to be > present > in the master's (hypervisor physical port) pkey table. > > The algorithm for supporting pkey tables which contain both the limited and > the full > membership versions of the same pkey works as follows: > > When scanning the pkey table for a 15 bit pkey: > > A. If there is a full member version of that pkey anywhere > in the table, return its index (even if a limited-member > version of the pkey exists earlier in the table). > > B. If the full member version is not in the table, > but the limited-member version is in the table, > return the index of the limited pkey. > > Signed-off-by: Liran Liss > Signed-off-by: Jack Morgenstein > Signed-off-by: Or Gerlitz > --- > drivers/infiniband/core/cache.c | 14 +++--- > drivers/infiniband/core/device.c | 17 + > 2 files changed, 24 insertions(+), 7 deletions(-) > > diff --git a/drivers/infiniband/core/cache.c b/drivers/infiniband/core/cache.c > index 9353992..0f2f2b7 100644 > --- a/drivers/infiniband/core/cache.c > +++ b/drivers/infiniband/core/cache.c > @@ -167,6 +167,7 @@ int ib_find_cached_pkey(struct ib_device *device, > unsigned long flags; > int i; > int ret = -ENOENT; > + int partial_ix = -1; > > if (port_num < start_port(device) || port_num > end_port(device)) > return -EINVAL; > @@ -179,10 +180,17 @@ int ib_find_cached_pkey(struct ib_device *device, > > for (i = 0; i < cache->table_len; ++i) > if ((cache->table[i] & 0x7fff) == (pkey & 0x7fff)) { > - *index = i; > - ret = 0; > - break; > + if (cache->table[i] & 0x8000) { > + *index = i; > + ret = 0; > + break; > + } else > + partial_ix = i; > } > + if (ret && partial_ix >= 0) { > + *index = partial_ix; > + ret = 0; > + } > > read_unlock_irqrestore(&device->cache.lock, flags); > > diff --git a/drivers/infiniband/core/device.c > b/drivers/infiniband/core/device.c > index e711de4..a645c68 100644 > --- a/drivers/infiniband/core/device.c > +++ b/drivers/infiniband/core/device.c > @@ -707,18 +707,27 @@ int ib_find_pkey(struct ib_device *device, > { > int ret, i; > u16 tmp_pkey; > + int partial_ix = -1; > > for (i = 0; i < device->pkey_tbl_len[port_num - start_port(device)]; > ++i) { > ret = ib_query_pkey(device, port_num, i, &tmp_pkey); > if (ret) > return ret; > - > if ((pkey & 0x7fff) == (tmp_pkey & 0x7fff)) { > - *index = i; > - return 0; > + /* if there is full-member pkey take it.*/ > + if (tmp_pkey & 0x8000) { > + *index = i; > + return 0; > + } > + if (partial_ix < 0) > + partial_ix = i; > } > } > - > + /*no full-member, if exists take the limited*/ > + if (partial_ix >= 0) { > + *index = partial_ix; > + return 0; > + } > return -ENOENT; > } > EXPORT_SYMBOL(ib_find_pkey); > -- Doug Ledford GPG KeyID: 0E572FDD http://people.redhat.com/dledford Infiniband specific RPMs available at http://people.redhat.com/dledford/Infiniband signature.asc Description: OpenPGP digital signature
Re: [PATCH for-next V2 03/22] IB/core: Add ib_find_exact_cached_pkey() to search for 16-bit pkey match
On 8/3/2012 4:40 AM, Jack Morgenstein wrote: > When port pkey table potentially contains both full and partial > membership copies for the same pkey, we need a function to find > the exact (16-bit) pkey index. The code on this patch is fine, just see my previous email about the function naming... > This is particularly necessary when the master forwards QP1 MADS > sent by guests. If the guest has sent the MAD with a limited > membership pkey, we wish to forward the MAD using the same limited > membership pkey. Since master may have both the limited and > the full member pkeys in its table, we must make sure to retrieve > the limited membership pkey in this case. > > This requires the 16-bit pkey lookup function (which includes the > membership bit). > > Signed-off-by: Jack Morgenstein > Signed-off-by: Or Gerlitz > --- > drivers/infiniband/core/cache.c | 32 > include/rdma/ib_cache.h | 16 > 2 files changed, 48 insertions(+), 0 deletions(-) > > diff --git a/drivers/infiniband/core/cache.c b/drivers/infiniband/core/cache.c > index 0f2f2b7..d8a8c83 100644 > --- a/drivers/infiniband/core/cache.c > +++ b/drivers/infiniband/core/cache.c > @@ -198,6 +198,38 @@ int ib_find_cached_pkey(struct ib_device *device, > } > EXPORT_SYMBOL(ib_find_cached_pkey); > > +int ib_find_exact_cached_pkey(struct ib_device *device, > + u8port_num, > + u16 pkey, > + u16 *index) > +{ > + struct ib_pkey_cache *cache; > + unsigned long flags; > + int i; > + int ret = -ENOENT; > + > + if (port_num < start_port(device) || port_num > end_port(device)) > + return -EINVAL; > + > + read_lock_irqsave(&device->cache.lock, flags); > + > + cache = device->cache.pkey_cache[port_num - start_port(device)]; > + > + *index = -1; > + > + for (i = 0; i < cache->table_len; ++i) > + if (cache->table[i] == pkey) { > + *index = i; > + ret = 0; > + break; > + } > + > + read_unlock_irqrestore(&device->cache.lock, flags); > + > + return ret; > +} > +EXPORT_SYMBOL(ib_find_exact_cached_pkey); > + > int ib_get_cached_lmc(struct ib_device *device, > u8port_num, > u8*lmc) > diff --git a/include/rdma/ib_cache.h b/include/rdma/ib_cache.h > index 00a2b8e..ad9a3c2 100644 > --- a/include/rdma/ib_cache.h > +++ b/include/rdma/ib_cache.h > @@ -101,6 +101,22 @@ int ib_find_cached_pkey(struct ib_device*device, > u16 *index); > > /** > + * ib_find_exact_cached_pkey - Returns the PKey table index where a specified > + * PKey value occurs. Comparison uses the FULL 16 bits (incl membership > bit) > + * @device: The device to query. > + * @port_num: The port number of the device to search for the PKey. > + * @pkey: The PKey value to search for. > + * @index: The index into the cached PKey table where the PKey was found. > + * > + * ib_find_exact_cached_pkey() searches the specified PKey table in > + * the local software cache. > + */ > +int ib_find_exact_cached_pkey(struct ib_device*device, > + u8 port_num, > + u16 pkey, > + u16 *index); > + > +/** > * ib_get_cached_lmc - Returns a cached lmc table entry > * @device: The device to query. > * @port_num: The port number of the device to query. > -- Doug Ledford GPG KeyID: 0E572FDD http://people.redhat.com/dledford Infiniband specific RPMs available at http://people.redhat.com/dledford/Infiniband signature.asc Description: OpenPGP digital signature
Re: [PATCH for-next V2 04/22] IB/mlx4: SRIOV IB context objects and proxy/tunnel sqp support
On 8/3/2012 4:40 AM, Jack Morgenstein wrote: > 1. Introduce the basic sriov parvirtualization context objects >for multiplexing and demultiplexing MADs. > 2. Introduce support for the new proxy and tunnel QP types. > > This patch introduces the objects required by the master > for managing QP paravirtualization for guests. > > struct mlx4_ib_sriov{} is created by the master only. > It is a container for the following: > 1. All the info required by the PPF to multiplex and de-multiplex MADs >(including those from the PF). (struct mlx4_ib_demux_ctx demux) OK, so can we have at least a single reference to the various abbreviations before using them exclusively? I know PF and PPF may be common, but it might be nice that they were used once in full form before abbreviated in commit messages. > 2. All the info required to manage alias GUIDs (i.e., the GUID at >index 0 that each guest perceives. In fact, this is not the >GUID which is actually at index 0, but is, in fact, the GUID >which is at index[] in the physical table. OK, this has been one of the things that has made reviewing this difficult. I freely admit that I've steadfastly ignored SRIOV for as long as I can, so maybe this is just me. But, in the context of this driver, how am I supposed to know which code paths will be on the host and which on the guest? Also, I note that you do math every time you want to know if you are on a parent device or a virtual device. Do you really want to do math all the time, or would it be better to save off your status on device init and just refer to that when you would do math in this patch? -- Doug Ledford GPG KeyID: 0E572FDD http://people.redhat.com/dledford Infiniband specific RPMs available at http://people.redhat.com/dledford/Infiniband signature.asc Description: OpenPGP digital signature
Re: [PATCH for-next V2 03/22] IB/core: Add ib_find_exact_cached_pkey() to search for 16-bit pkey match
On 8/3/2012 4:40 AM, Jack Morgenstein wrote: > When port pkey table potentially contains both full and partial > membership copies for the same pkey, we need a function to find > the exact (16-bit) pkey index. > > This is particularly necessary when the master forwards QP1 MADS > sent by guests. If the guest has sent the MAD with a limited > membership pkey, we wish to forward the MAD using the same limited > membership pkey. Since master may have both the limited and > the full member pkeys in its table, we must make sure to retrieve > the limited membership pkey in this case. > > This requires the 16-bit pkey lookup function (which includes the > membership bit). As a second note, I would like to know why Intel (previously QLogic) does not use these functions in their driver and what it would take to get all drivers to use the functions. Do we need to add more to them? In my opinion these should be generally useful and used by all drivers. Mike? -- Doug Ledford GPG KeyID: 0E572FDD http://people.redhat.com/dledford Infiniband specific RPMs available at http://people.redhat.com/dledford/Infiniband signature.asc Description: OpenPGP digital signature
Re: [PATCH for-next V2 03/22] IB/core: Add ib_find_exact_cached_pkey() to search for 16-bit pkey match
On 9/11/2012 3:07 PM, Roland Dreier wrote: > On Tue, Sep 11, 2012 at 10:12 AM, Doug Ledford wrote: >> As a second note, I would like to know why Intel (previously QLogic) >> does not use these functions in their driver and what it would take to >> get all drivers to use the functions. Do we need to add more to them? >> In my opinion these should be generally useful and used by all drivers. > > Use which functions? The P_Key lookup functions? > > What would a low-level driver use them for? I thought these are for > use by upper-level protocols. Well, at this point, the mlx4 driver uses them, the rdmacm kernel driver uses them, and both QLogic/Intel drivers have their own internal pkey table implementation. So, it isn't so much upper layer as it is drivers. -- Doug Ledford GPG KeyID: 0E572FDD http://people.redhat.com/dledford Infiniband specific RPMs available at http://people.redhat.com/dledford/Infiniband signature.asc Description: OpenPGP digital signature
Re: [PATCH for-next V2 03/22] IB/core: Add ib_find_exact_cached_pkey() to search for 16-bit pkey match
On 9/11/2012 4:43 PM, Roland Dreier wrote: > On Tue, Sep 11, 2012 at 1:34 PM, Doug Ledford wrote: >> Well, at this point, the mlx4 driver uses them, the rdmacm kernel driver >> uses them, and both QLogic/Intel drivers have their own internal pkey >> table implementation. So, it isn't so much upper layer as it is drivers. > > rdmacm is an upper-level protocol (it's above the midlayer / hardware > abstraction). Yeah, I know. My point wasn't that it was a low level item, just that it's the only upper layer consumer that I saw. > mlx4 and mthca look up P_Keys because of internal details of how they > send MADs, and really they should move to maintaining their own P_Key > table too. Why not make the routines useful for all users instead of multiple implementations of the same thing? -- Doug Ledford GPG KeyID: 0E572FDD http://people.redhat.com/dledford Infiniband specific RPMs available at http://people.redhat.com/dledford/Infiniband signature.asc Description: OpenPGP digital signature
Re: [PATCH for-next V2 02/22] IB/core: change pkey table lookups to support full and partial membership for the same pkey
On 09/12/2012 03:56 AM, Jack Morgenstein wrote: > On the Hypervisor, however, we assume that if both versions of the pkey are > in its pkey table, > then for its own infiniband operation (as opposed to performing its pkey > virtualizing function), > it should operate with the highest membership type in its table for a given > 15-bit pkey. That's what I was looking for. So, how can you know this assumption is correct? It seems to me that if someone wanted to restrict membership of the hypervisor as part of a security lockdown, then give full membership to a guest because that guest is some high security, single task guest, then this assumption would break things (the user would be able to assign the full membership key to the guest OK, but regardless of how they wanted the hypervisor to be subscribed to that particular pkey, it would always get the full membership from the guest). >> Shouldn't we pick the >> pkey that's appropriate for the vHCA sending the message? > > We do. When QPs on the guests are created, the modify-qp commands are not > executed on the guest, > but rather are passed to the PPF for processing. The PPF replaces the > guest-provided virtual pkey-index > value with the appropriate physical pkey-index value. See procedure > "update_pkey_index" in file > resource_tracker.c, and all the places it is called (i.e., in the wrapper > functions for the various > modify-qp firmware commands). That's fine for the guest, but I don't see how this solves the assumption issue. My concern is that I could see a valid scenario that a user might want to implement for security reasons that I think makes your assumption above incorrect. -- Doug Ledford GPG KeyID: 0E572FDD http://people.redhat.com/dledford signature.asc Description: OpenPGP digital signature
Problem running rping over Intel adapters
We ran into this problem when testing rping over Intel/QLogic hardware: [root@rdmaperf3 ~]# rping -s -a 172.31.2.103 -v wait for CONNECTED state 10 connect error -1 cma event RDMA_CM_EVENT_REJECTED, error 28 [root@rdmaperf3 ~]# [root@rdmaperf8 ~]# rping -c -a 172.31.2.103 -v -C 5 cma event RDMA_CM_EVENT_CONNECT_ERROR, error -1 wait for CONNECTED state 4 connect error -1 [root@rdmaperf8 ~]# Turns out this is because of a couple things: 1) rping, on the client side, clears the conn_params for the newly to be attempted connection, then sets: conn_param.responder_resources = 1; conn_param.initiator_depth = 1; conn_param.retry_count = 10; On the accept side, rping clears the conn_params and then sets just the responder_resources and initiator_depth, without even checking the incoming requested conn_param values from the incoming cm_id. So, OK, you can get away with that since this is a simple test program, but still not "best programming practices". However, the important part here is the retry_count of 10. That won't work on Intel/QLogic hardware. 2) the qib driver enforces a maximum of 7 for retry_count. I don't see anything in the spec that specifies a maximum for this entry, and in particular I know it doesn't call out for 7 to mean infinite retries like it does for rnr_retry_count. I don't think the spec really cares how we solve this, and I don't think there is a hard limit of 7 for the retry_count like the qib driver enforces. On the other hand, the spec doesn't call out a limit on the retry_count but I would assume each driver has the option to implement their own "reasonable, implementation defined" limit in a case like this. So, do we make qib more liberal in its acceptance of retry_count or do we fix rping to use a smaller number? Matters not to me... -- Doug Ledford GPG KeyID: 0E572FDD http://people.redhat.com/dledford Infiniband specific RPMs available at http://people.redhat.com/dledford/Infiniband signature.asc Description: OpenPGP digital signature
Re: Problem running rping over Intel adapters
On 10/3/2012 5:50 PM, Hefty, Sean wrote: >> 1) rping, on the client side, clears the conn_params for the newly to be >> attempted connection, then sets: >> >> conn_param.responder_resources = 1; >> conn_param.initiator_depth = 1; >> conn_param.retry_count = 10; > > >> 2) the qib driver enforces a maximum of 7 for retry_count. I don't see >> anything in the spec that specifies a maximum for this entry, and in >> particular I know it doesn't call out for 7 to mean infinite retries >> like it does for rnr_retry_count. > > On IB, retry count is a 3-bit value. See the CM REQ message. > > The kernel rdma cm should probably check for a valid value and reduce it > accordingly. Then it's also fair to say that right now rping passes a known invalid value and needs fixed ;-) -- Doug Ledford GPG KeyID: 0E572FDD http://people.redhat.com/dledford Infiniband specific RPMs available at http://people.redhat.com/dledford/Infiniband signature.asc Description: OpenPGP digital signature
Re: Problem running rping over Intel adapters
On 10/3/2012 6:02 PM, Hefty, Sean wrote: >>>> 1) rping, on the client side, clears the conn_params for the >>>> newly to be attempted connection, then sets: >>>> >>>> conn_param.responder_resources = 1; conn_param.initiator_depth >>>> = 1; conn_param.retry_count = 10; >>> >>> >>>> 2) the qib driver enforces a maximum of 7 for retry_count. I >>>> don't see anything in the spec that specifies a maximum for >>>> this entry, and in particular I know it doesn't call out for 7 >>>> to mean infinite retries like it does for rnr_retry_count. >>> >>> On IB, retry count is a 3-bit value. See the CM REQ message. >>> >>> The kernel rdma cm should probably check for a valid value and >>> reduce it >> accordingly. >> >> Then it's also fair to say that right now rping passes a known >> invalid value and needs fixed ;-) > > For IB, but I wasn't sure about iWarp. It doesn't look like it's > used there though. I would hope it's not used on iWARPit's running over it's own link layer where this is neither defined nor applicable. > If not, I'll apply a patch to rping, plus submit a fix for the > kernel to validate the value. -- Doug Ledford GPG KeyID: 0E572FDD http://people.redhat.com/dledford Infiniband specific RPMs available at http://people.redhat.com/dledford/Infiniband signature.asc Description: OpenPGP digital signature
[Patch libibverbs 0/5] Minor updates
This is a short series of fixes for minor issues. The most important being a memcpy buffer overrun on ppc arch for the compatibility wrapper. The rest are usability fixes for the example programs. Doug Ledford (5): Add an error when the user specifies an invalid port Don't allow port == 0 as an option Fix the compatibility wrapper on ppc Don't print link phys state on iWARP Don't try to send UD messages larger than MTU examples/devinfo.c | 15 +++ examples/ud_pingpong.c | 14 ++ src/compat-1_0.c | 42 -- 3 files changed, 65 insertions(+), 6 deletions(-) -- 1.7.7.6 -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[Patch libibverbs 1/5] Add an error when the user specifies an invalid port
Signed-off-by: Doug Ledford --- examples/devinfo.c |8 +++- 1 files changed, 7 insertions(+), 1 deletions(-) diff --git a/examples/devinfo.c b/examples/devinfo.c index 7dc0463..b1a3b2e 100644 --- a/examples/devinfo.c +++ b/examples/devinfo.c @@ -218,10 +218,16 @@ static int print_hca_cap(struct ibv_device *ib_dev, uint8_t ib_port) goto cleanup; } if (ibv_query_device(ctx, &device_attr)) { - fprintf(stderr, "Failed to query device props"); + fprintf(stderr, "Failed to query device props\n"); rc = 2; goto cleanup; } + if (ib_port && ib_port > device_attr.phys_port_cnt) { + fprintf(stderr, "Invalid port requested for device\n"); + /* rc = 3 is taken by failure to clean up */ + rc = 4; + goto cleanup; + } printf("hca_id:\t%s\n", ibv_get_device_name(ib_dev)); printf("\ttransport:\t\t\t%s (%d)\n", -- 1.7.7.6 -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[Patch libibverbs 4/5] Don't print link phys state on iWARP
The physical link state on iWARP transports has no meaning, so don't print it out at all. Signed-off-by: Doug Ledford --- examples/devinfo.c |5 +++-- 1 files changed, 3 insertions(+), 2 deletions(-) diff --git a/examples/devinfo.c b/examples/devinfo.c index d6e9218..ff078e4 100644 --- a/examples/devinfo.c +++ b/examples/devinfo.c @@ -327,8 +327,9 @@ static int print_hca_cap(struct ibv_device *ib_dev, uint8_t ib_port) width_str(port_attr.active_width), port_attr.active_width); printf("\t\t\tactive_speed:\t\t%s (%d)\n", speed_str(port_attr.active_speed), port_attr.active_speed); - printf("\t\t\tphys_state:\t\t%s (%d)\n", - port_phy_state_str(port_attr.phys_state), port_attr.phys_state); + if (ib_dev->transport_type == IBV_TRANSPORT_IB) + printf("\t\t\tphys_state:\t\t%s (%d)\n", + port_phy_state_str(port_attr.phys_state), port_attr.phys_state); if (print_all_port_gids(ctx, port, port_attr.gid_tbl_len)) goto cleanup; -- 1.7.7.6 -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[Patch libibverbs 3/5] Fix the compatibility wrapper on ppc
The ppc arch packs the work request struct 1.0 in such a way that a straight memcpy won't work. Instead, break the copy out into chunks whenever the sizes don't match for given portions of the struct. Found by built in gcc memcpy buffer overflow checks. Help on the right fix provided by Jakub Jelinek from the gcc team inside Red Hat. Signed-off-by: Doug Ledford --- src/compat-1_0.c | 42 -- 1 files changed, 40 insertions(+), 2 deletions(-) diff --git a/src/compat-1_0.c b/src/compat-1_0.c index fff6412..36884bb 100644 --- a/src/compat-1_0.c +++ b/src/compat-1_0.c @@ -353,8 +353,46 @@ static int post_send_wrapper_1_0(struct ibv_qp_1_0 *qp, struct ibv_send_wr_1_0 * real_wr->wr_id = w->wr_id; real_wr->next = NULL; - memcpy(&real_wr->sg_list, &w->sg_list, - sizeof *w - offsetof(struct ibv_send_wr, sg_list)); +#define TEST_SIZE_2_POINT(f1, f2) \ +((offsetof(struct ibv_send_wr, f1) - offsetof(struct ibv_send_wr, f2)) \ +== offsetof(struct ibv_send_wr_1_0, f1) - offsetof(struct ibv_send_wr_1_0, f2)) +#define TEST_SIZE_TO_END(f1) \ +((sizeof(struct ibv_send_wr) - offsetof(struct ibv_send_wr, f1)) == \ + (sizeof(struct ibv_send_wr_1_0) - offsetof(struct ibv_send_wr_1_0, f1))) + + if (TEST_SIZE_TO_END (sg_list)) + memcpy(&real_wr->sg_list, &w->sg_list, sizeof *real_wr + - offsetof(struct ibv_send_wr, sg_list)); + else if (TEST_SIZE_2_POINT (imm_data, sg_list) && +TEST_SIZE_TO_END (wr)) { + /* we have alignment up to wr, but padding between +* imm_data and wr, and we know wr itself is the +* same size */ + memcpy(&real_wr->sg_list, &w->sg_list, + offsetof(struct ibv_send_wr, imm_data) - + offsetof(struct ibv_send_wr, sg_list) + + sizeof real_wr->imm_data); + memcpy(&real_wr->wr, &w->wr, sizeof real_wr->wr); + } else { + real_wr->sg_list = w->sg_list; + real_wr->num_sge = w->num_sge; + real_wr->opcode = w->opcode; + real_wr->send_flags = w->send_flags; + real_wr->imm_data = w->imm_data; + if (TEST_SIZE_TO_END (wr)) + memcpy(&real_wr->wr, &w->wr, + sizeof real_wr->wr); + else { + real_wr->wr.atomic.remote_addr = + w->wr.atomic.remote_addr; + real_wr->wr.atomic.compare_add = + w->wr.atomic.compare_add; + real_wr->wr.atomic.swap = + w->wr.atomic.swap; + real_wr->wr.atomic.rkey = + w->wr.atomic.rkey; + } + } if (is_ud) real_wr->wr.ud.ah = w->wr.ud.ah->real_ah; -- 1.7.7.6 -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[Patch libibverbs 2/5] Don't allow port == 0 as an option
We special case port == 0 to mean all ports, and it's the default, so if a user passes in 0, they likely meant 1 instead. Throw an error because they probably didn't mean to specify the default behavior of scan all ports. Path of least surprise and all that. Signed-off-by: Doug Ledford --- examples/devinfo.c |2 +- 1 files changed, 1 insertions(+), 1 deletions(-) diff --git a/examples/devinfo.c b/examples/devinfo.c index b1a3b2e..d6e9218 100644 --- a/examples/devinfo.c +++ b/examples/devinfo.c @@ -385,7 +385,7 @@ int main(int argc, char *argv[]) case 'i': ib_port = strtol(optarg, NULL, 0); - if (ib_port < 0) { + if (ib_port <= 0) { usage(argv[0]); return 1; } -- 1.7.7.6 -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[Patch libibverbs 5/5] Don't try to send UD messages larger than MTU
The UD protocol doesn't support message sizes larger than the path MTU. We don't go so far as to check path MTU, but we do check port MTU. This prevents failed runs of the pingpong_ud program with large MTUs. Signed-off-by: Doug Ledford --- examples/ud_pingpong.c | 14 ++ 1 files changed, 14 insertions(+), 0 deletions(-) diff --git a/examples/ud_pingpong.c b/examples/ud_pingpong.c index 71b152d..9a1ae13 100644 --- a/examples/ud_pingpong.c +++ b/examples/ud_pingpong.c @@ -323,6 +323,20 @@ static struct pingpong_context *pp_init_ctx(struct ibv_device *ib_dev, int size, ibv_get_device_name(ib_dev)); return NULL; } + { + struct ibv_port_attr port_info = { 0 }; + int mtu; + + if (ibv_query_port(ctx->context, port, &port_info)) { + fprintf(stderr, "Unable to query port info for port %d\n", port); + return NULL; + } + mtu = 1 << (port_info.active_mtu + 7); + if (size > mtu) { + fprintf(stderr, "Requested size larger than port MTU (%d)\n", mtu); + return NULL; + } + } if (use_event) { ctx->channel = ibv_create_comp_channel(ctx->context); -- 1.7.7.6 -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[Patch] ib/mlx4: Fix build error on platforms where UL is not 64bit
Line 110 uses UL as a compiler cast for the 0x constant, but it's not large enough to hold a 64bit value on 32bit arches. Use ULL instead. Signed-off-by: Doug Ledford --- drivers/infiniband/hw/mlx4/alias_GUID.c |2 +- 1 files changed, 1 insertions(+), 1 deletions(-) diff --git a/drivers/infiniband/hw/mlx4/alias_GUID.c b/drivers/infiniband/hw/mlx4/alias_GUID.c index d2fb38d..cb790a8 100644 --- a/drivers/infiniband/hw/mlx4/alias_GUID.c +++ b/drivers/infiniband/hw/mlx4/alias_GUID.c @@ -107,7 +107,7 @@ static __be64 get_cached_alias_guid(struct mlx4_ib_dev *dev, int port, int index { if (index >= NUM_ALIAS_GUID_PER_PORT) { pr_err("%s: ERROR: asked for index:%d\n", __func__, index); - return (__force __be64) ((u64) 0xUL); + return (__force __be64) ((u64) 0xULL); } return *(__be64 *)&dev->sriov.demux[port - 1].guid_cache[index]; } -- 1.7.7.6 -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 3/4 for opensm] /etc/init.d/opensmd: Improve systemd integration
On 10/24/2012 9:27 AM, Alex Netes wrote: > Hi Bart, > > On 16:44 Fri 21 Sep , Bart Van Assche wrote: >> On recent SLES and openSUSE systems it is necessary to read the >> file /lib/lsb/init-functions instead of /etc/rc.status for proper >> integration with systemd. Certain implementations of killproc >> need a pidfile so make sure that one gets created. Also, correct >> the opensm service dependency list. >> >> Signed-off-by: Bart Van Assche >> Cc: Doug Ledford >> --- >> configure.in |6 ++ >> scripts/opensm.init.in| 26 ++ >> scripts/redhat-opensm.init.in |2 ++ >> 3 files changed, 26 insertions(+), 8 deletions(-) >> >> diff --git a/configure.in b/configure.in >> index f660a5b..4515ae2 100644 >> --- a/configure.in >> +++ b/configure.in >> @@ -11,6 +11,12 @@ AM_INIT_AUTOMAKE >> AC_SUBST(RELEASE, ${RELEASE:-unknown}) >> AC_SUBST(TARBALL, ${TARBALL:-${PACKAGE}-${VERSION}.tar.gz}) >> >> +default_rdma_service=openibd >> +AC_ARG_WITH([rdma_service], >> +AC_HELP_STRING([--with-rdma-service=name], >> + [name of the RDMA service: "rdma" when using >> /etc/init.d/rdma to start RDMA services; "openibd" when using >> /etc/init.d/openibd to start RDMA services >> [default=${default_rdma_service}]])) >> +AC_SUBST(RDMA_SERVICE, ${with_rdma_service:-${default_rdma_service}}) >> + > > There is also an option to run opensm with ibsim. In that case neither rdma > nor openibd should run. ibsim is unlikely to be used by end users or system administrators in a normal startup environment. I doubt it needs to be accounted for in the init script. Most users simply don't even know enough about ibsim to be able to use it at all. >> if { rpm -q sles-release || rpm -q openSUSE-release; } >/dev/null 2>&1; then >> default_start="2 3 5" >> default_stop="0 1 4 6" >> diff --git a/scripts/opensm.init.in b/scripts/opensm.init.in >> index 007dae4..ddd1d6b 100644 >> --- a/scripts/opensm.init.in >> +++ b/scripts/opensm.init.in >> @@ -7,8 +7,8 @@ >> # >> ### BEGIN INIT INFO >> # Provides: opensm >> -# Required-Start: $syslog >> -# Required-Stop: >> +# Required-Start: $syslog @RDMA_SERVICE@ >> +# Required-Stop: $syslog @RDMA_SERVICE@ >> # Default-Start: @DEFAULT_START@ >> # Default-Stop: @DEFAULT_STOP@ >> # Description: Manage OpenSM >> @@ -42,14 +42,22 @@ >> >> prefix=@prefix@ >> exec_prefix=@exec_prefix@ >> +pidfile=/var/run/opensm.pid >> >> # Source function library. >> if [[ -s /etc/init.d/functions ]]; then >> +# RHEL / CentOS / SL / Fedora. >> . /etc/init.d/functions >> rc_status() { :; } >> rc_exit() { exit $RETVAL; } >> -fi >> -if [[ -s /etc/rc.status ]]; then >> +elif [[ -s /lib/lsb/init-functions ]]; then >> +# SLES / openSuSE / Debian. >> +. /lib/lsb/init-functions >> +rc_exit() { exit $RETVAL; } >> +success() { log_success_msg; } >> +failure() { log_failure_msg; } >> +elif [[ -s /etc/rc.status ]]; then >> +# Older SuSE systems. >> . /etc/rc.status >> failure() { rc_status -v; } >> success() { rc_status -v; } >> @@ -61,10 +69,13 @@ if [[ -s $CONFIG ]]; then >> fi >> >> start () { >> +if [ -e $pidfile ]; then > > On opensm segfault (happens one in a while :), pidfile won't be removed, so > you won't be able to start the opensm again. I guess that same thing can > happen on warm reboot. The stop action in the script should handle cleanup for you. If it's called, and a pid file exists, but the pid is not running, it should remove the pid file and the subsystem lock file so that a clean start works. >> +echo Already started >> +return 1 >> +fi > > -- Alex > -- > To unsubscribe from this list: send the line "unsubscribe linux-rdma" in > the body of a message to majord...@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > -- Doug Ledford GPG KeyID: 0E572FDD http://people.redhat.com/dledford Infiniband specific RPMs available at http://people.redhat.com/dledford/Infiniband signature.asc Description: OpenPGP digital signature
Re: [PATCH 3/4 for opensm] /etc/init.d/opensmd: Improve systemd integration
On 10/24/2012 10:45 AM, Bart Van Assche wrote: > On 10/24/12 16:33, Doug Ledford wrote: >> On 10/24/2012 9:27 AM, Alex Netes wrote: >>> On 16:44 Fri 21 Sep , Bart Van Assche wrote: >>>> start () { >>>> +if [ -e $pidfile ]; then >>> >>> On opensm segfault (happens one in a while :), pidfile won't be >>> removed, so >>> you won't be able to start the opensm again. I guess that same thing can >>> happen on warm reboot. >> >> The stop action in the script should handle cleanup for you. If it's >> called, and a pid file exists, but the pid is not running, it should >> remove the pid file and the subsystem lock file so that a clean start >> works. > > Hello Doug, > > Do we really need the lock file ? On some Linux systems (Ubuntu) the > /var/lock/subsys directory does not even exist. > > Bart. > It's an old holdover from LSB compliance. *I* don't personally care one way or another, but it's supposed to be there for ancient scripts that look there to find out if a service is running. Probably none of those ancient scripts look for opensm as it wasn't around back when they were written and the "new and improved" way of checking a service in LSB is to run the init script with the status command. But, that's why I have it there. -- Doug Ledford GPG KeyID: 0E572FDD http://people.redhat.com/dledford Infiniband specific RPMs available at http://people.redhat.com/dledford/Infiniband signature.asc Description: OpenPGP digital signature
Re: [PATCH] opensm/configure.in: Remove Default-Start from opensmd init script
On 01/30/13 03:59, Bart Van Assche wrote: > On 01/29/13 18:18, Alex Netes wrote: >> During opensm RPM packaging, `chkconfig --add opensmd` is called. >> `chkconfig --add` creates the appropriate entry as specified by the >> default values in the init script. Having opensmd run by default on boot >> isn't desired. >> >> Signed-off-by: Alex Netes >> --- >> configure.in | 5 +++-- >> 1 file changed, 3 insertions(+), 2 deletions(-) >> >> diff --git a/configure.in b/configure.in >> index 4515ae2..f798be2 100644 >> --- a/configure.in >> +++ b/configure.in >> @@ -18,12 +18,13 @@ AC_ARG_WITH([rdma_service], >> AC_SUBST(RDMA_SERVICE, ${with_rdma_service:-${default_rdma_service}}) >> >> if { rpm -q sles-release || rpm -q openSUSE-release; } >/dev/null >> 2>&1; then >> - default_start="2 3 5" >> default_stop="0 1 4 6" >> else >> - default_start="2 3 4 5" >> default_stop="0 1 6" >> fi >> + >> +default_start="null" >> + >> AC_SUBST(DEFAULT_START, $default_start) >> AC_SUBST(DEFAULT_STOP, $default_stop) > > Sorry but this patch doesn't make sense to me. This patch will prevent > anyone to enable opensm to run during boot via chkconfig. It does not. It means that chkconfig add does not create any start links, just all kill links. To actually enabled it, chkconfig --level 2345 opensmd on will still work. Which was the intent: leave it off by default, enabled by the admin. > How about > replacing the above by the (untested) patch below ? > > iff --git a/opensm.spec.in b/opensm.spec.in > index 6ae525b..2325b70 100644 > --- a/opensm.spec.in > +++ b/opensm.spec.in > @@ -107,13 +107,6 @@ rm -rf $RPM_BUILD_ROOT > > %post > if [ $1 = 1 ]; then > -if [ -e /sbin/chkconfig ]; then > -/sbin/chkconfig --add opensmd > -elif [ -e /usr/sbin/update-rc.d ]; then > -/usr/sbin/update-rc.d opensmd defaults > -else > -/usr/lib/lsb/install_initd /etc/init.d/opensmd > -fi > if type systemctl >/dev/null 2>&1; then > systemctl --system daemon-reload > fi This is not the correct fix. It leaves the init script unadded to the system, which means chkconfig --list will not show it properly until the user manually adds it later. -- Doug Ledford GPG KeyID: 0E572FDD http://people.redhat.com/dledford Infiniband specific RPMs available at http://people.redhat.com/dledford/Infiniband -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] opensm/configure.in: Remove Default-Start from opensmd init script
On 1/30/2013 11:00 AM, Bart Van Assche wrote: > On 01/30/13 16:43, Doug Ledford wrote: >> On 01/30/13 03:59, Bart Van Assche wrote: >>> On 01/29/13 18:18, Alex Netes wrote: >>>> During opensm RPM packaging, `chkconfig --add opensmd` is called. >>>> `chkconfig --add` creates the appropriate entry as specified by the >>>> default values in the init script. Having opensmd run by default on >>>> boot >>>> isn't desired. >>>> >>>> Signed-off-by: Alex Netes >>>> --- >>>>configure.in | 5 +++-- >>>>1 file changed, 3 insertions(+), 2 deletions(-) >>>> >>>> diff --git a/configure.in b/configure.in >>>> index 4515ae2..f798be2 100644 >>>> --- a/configure.in >>>> +++ b/configure.in >>>> @@ -18,12 +18,13 @@ AC_ARG_WITH([rdma_service], >>>>AC_SUBST(RDMA_SERVICE, >>>> ${with_rdma_service:-${default_rdma_service}}) >>>> >>>>if { rpm -q sles-release || rpm -q openSUSE-release; } >/dev/null >>>> 2>&1; then >>>> - default_start="2 3 5" >>>> default_stop="0 1 4 6" >>>>else >>>> - default_start="2 3 4 5" >>>> default_stop="0 1 6" >>>>fi >>>> + >>>> +default_start="null" >>>> + >>>>AC_SUBST(DEFAULT_START, $default_start) >>>>AC_SUBST(DEFAULT_STOP, $default_stop) >>> >>> Sorry but this patch doesn't make sense to me. This patch will prevent >>> anyone to enable opensm to run during boot via chkconfig. >> >> It does not. It means that chkconfig add does not create any start >> links, just all kill links. To actually enabled it, chkconfig --level >> 2345 opensmd on will still work. Which was the intent: leave it off by >> default, enabled by the admin. >> >>> How about >>> replacing the above by the (untested) patch below ? >>> >>> iff --git a/opensm.spec.in b/opensm.spec.in >>> index 6ae525b..2325b70 100644 >>> --- a/opensm.spec.in >>> +++ b/opensm.spec.in >>> @@ -107,13 +107,6 @@ rm -rf $RPM_BUILD_ROOT >>> >>> %post >>> if [ $1 = 1 ]; then >>> -if [ -e /sbin/chkconfig ]; then >>> -/sbin/chkconfig --add opensmd >>> -elif [ -e /usr/sbin/update-rc.d ]; then >>> -/usr/sbin/update-rc.d opensmd defaults >>> -else >>> -/usr/lib/lsb/install_initd /etc/init.d/opensmd >>> -fi >>> if type systemctl >/dev/null 2>&1; then >>> systemctl --system daemon-reload >>> fi >> >> This is not the correct fix. It leaves the init script unadded to the >> system, which means chkconfig --list will not show it properly until the >> user manually adds it later. > > Which convention is followed for other packages ? This is what I found in > the Fedora 18 iscsi-initiator-utils package > (http://be.mirror.eurid.eu/fedora/linux/releases/18/Fedora/source/SRPMS/i/iscsi-initiator-utils-6.2.0.872-19.fc18.src.rpm): > > * iscsid.init: Default-Start: 3 4 5 > * iscsi-initiator-utils.spec: Okay, first off, any package that still uses the SysV initscripts as of Fedora 18 is not what I would call a package that is keeping up with the Fedora packaging guidelines or Fedora technologies. As such, I'm not really sure you want to use it as an example of a good package. However, that being said, you will note in this spec file that the iSCSI initiator package does exactly what you removed, or suggested be removed, from the opensmd spec file. It unilaterally adds the initscript to the system. The default start/stop settings are different, but the add action is the same. All initscripts should be added to the system, regardless of their default start/stop settings, and the default-start and default-stop should be used to control *how* they are added by default, and chkconfig --level .* [on|off] should be used to control whether or not they are on or off differently than their default settings. > %post > /sbin/ldconfig > if [ "$1" -eq "1" ]; then > if [ ! -f %{_sysconfdir}/iscsi/initiatorname.iscsi ]; then > echo "InitiatorName=`/sbin/iscsi-iname`" > > %{_sysconfdir}/iscsi/initiatorname.iscsi > fi > /sbin/chkconfig --add iscsid > /sbin/chkconfig --add iscsi > fi > > Bart. signature.asc Description: OpenPGP digital signature
Re: [PATCH] opensm/configure.in: Remove Default-Start from opensmd init script
On 1/30/2013 2:12 PM, Bart Van Assche wrote: > On 01/30/13 18:48, Doug Ledford wrote: >> On 1/30/2013 11:00 AM, Bart Van Assche wrote: >>> Which convention is followed for other packages ? This is what I >>> found in >>> the Fedora 18 iscsi-initiator-utils package >>> (http://be.mirror.eurid.eu/fedora/linux/releases/18/Fedora/source/SRPMS/i/iscsi-initiator-utils-6.2.0.872-19.fc18.src.rpm): >>> >>> >>> * iscsid.init: Default-Start: 3 4 5 >>> * iscsi-initiator-utils.spec: >> >> Okay, first off, any package that still uses the SysV initscripts as of >> Fedora 18 is not what I would call a package that is keeping up with the >> Fedora packaging guidelines or Fedora technologies. As such, I'm not >> really sure you want to use it as an example of a good package. >> However, that being said, you will note in this spec file that the iSCSI >> initiator package does exactly what you removed, or suggested be >> removed, from the opensmd spec file. It unilaterally adds the >> initscript to the system. The default start/stop settings are >> different, but the add action is the same. >> >> All initscripts should be added to the system, regardless of their >> default start/stop settings, and the default-start and default-stop >> should be used to control *how* they are added by default, and chkconfig >> --level .* [on|off] should be used to control whether or >> not they are on or off differently than their default settings. > > Thanks for the detailed reply. Regarding the purpose of the patch at the > start of this thread: do you know whether it is LSB-compliant to use > "Default-Start: null" or should "Default-Start" be left out entirely in > order not to create the start links ? See e.g. > http://refspecs.linuxbase.org/LSB_3.2.0/LSB-Core-generic/LSB-Core-generic/initscrcomconv.html. > > > Bart. I've always used "Default-start:" without any listed levels, but I haven't really tested it either and the spec is not specific about this particular possible configuration. signature.asc Description: OpenPGP digital signature
Re: [PATCH] opensm/configure.in: Remove Default-Start from opensmd init script
On 01/31/13 02:21, Alex Netes wrote: > On 14:24 Wed 30 Jan , Doug Ledford wrote: >> On 1/30/2013 2:12 PM, Bart Van Assche wrote: >>> On 01/30/13 18:48, Doug Ledford wrote: >>>> On 1/30/2013 11:00 AM, Bart Van Assche wrote: >>>>> Which convention is followed for other packages ? This is what I >>>>> found in >>>>> the Fedora 18 iscsi-initiator-utils package >>>>> (http://be.mirror.eurid.eu/fedora/linux/releases/18/Fedora/source/SRPMS/i/iscsi-initiator-utils-6.2.0.872-19.fc18.src.rpm): >>>>> >>>>> >>>>> * iscsid.init: Default-Start: 3 4 5 >>>>> * iscsi-initiator-utils.spec: >>>> >>>> Okay, first off, any package that still uses the SysV initscripts as of >>>> Fedora 18 is not what I would call a package that is keeping up with the >>>> Fedora packaging guidelines or Fedora technologies. As such, I'm not >>>> really sure you want to use it as an example of a good package. >>>> However, that being said, you will note in this spec file that the iSCSI >>>> initiator package does exactly what you removed, or suggested be >>>> removed, from the opensmd spec file. It unilaterally adds the >>>> initscript to the system. The default start/stop settings are >>>> different, but the add action is the same. >>>> >>>> All initscripts should be added to the system, regardless of their >>>> default start/stop settings, and the default-start and default-stop >>>> should be used to control *how* they are added by default, and chkconfig >>>> --level .* [on|off] should be used to control whether or >>>> not they are on or off differently than their default settings. >>> >>> Thanks for the detailed reply. Regarding the purpose of the patch at the >>> start of this thread: do you know whether it is LSB-compliant to use >>> "Default-Start: null" or should "Default-Start" be left out entirely in >>> order not to create the start links ? See e.g. >>> http://refspecs.linuxbase.org/LSB_3.2.0/LSB-Core-generic/LSB-Core-generic/initscrcomconv.html. >>> >>> >>> Bart. >> >> I've always used "Default-start:" without any listed levels, but I >> haven't really tested it either and the spec is not specific about this >> particular possible configuration. >> >> > > It's how the script was defined before. The behavior on RHEL and SLES is > different when not specifying "Default-start:". On RHEL, `chkconfig opensmd > on` > adds the service to the default runlevels: 2 3 4 5. While on SLES it doesn't. That sounds like a bug in SLES to be honest. chkconfig opensmd on without --levels should follow the Default-Start: item (just like chkconfig add opensmd). -- Doug Ledford GPG KeyID: 0E572FDD http://people.redhat.com/dledford Infiniband specific RPMs available at http://people.redhat.com/dledford/Infiniband -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] ipoib: fix hard_header return value
If you have a patched up dhcp server (and dhclient), they will use AF_PACKET/SOCK_DGRAM pair to send dhcp packets over IPoIB. This has worked since forever if you use OFED kernels or one of the distribution kernels. However, when testing an upstream kernel, it has been broken for a very long time (I tested 2.6.34, 2.6.38, 3.0, 3.1, 3.8, HEAD). It turns out that the hard_header routine in ipoib is not following the API and is returning 0 even when it pushed data onto the skb. This then causes af_packet.c to overwrite the header just pushed with data from user space. This header is immediately referenced in the ipoib_start_xmit routine, so I'm wondering how this ever worked in distro/ofed kernels that also have this bug, but fixing the bug here makes things work in upstream kernels. Signed-off-by: Doug Ledford --- drivers/infiniband/ulp/ipoib/ipoib_main.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c b/drivers/infiniband/ulp/ipoib/ipoib_main.c index 8534afd..31dd2a7 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_main.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c @@ -828,7 +828,7 @@ static int ipoib_hard_header(struct sk_buff *skb, */ memcpy(cb->hwaddr, daddr, INFINIBAND_ALEN); - return 0; + return sizeof *header; } static void ipoib_set_mcast_list(struct net_device *dev) -- 1.8.1.2 -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] ipoib: fix hard_header return value
On 03/26/2013 12:46 PM, Roland Dreier wrote: >> Checkpatch recommends since some time to use sizeof(e) instead of sizeof e, >> isn't it ? > > I actually prefer "sizeof e" since sizeof is an operator, not a > function. "sizeof(e)" looks just as silly as "return(e)" to me. > > I'll apply this patch soon, it's a good catch. > Cool, thanks! -- Doug Ledford GPG KeyID: 0E572FDD http://people.redhat.com/dledford signature.asc Description: OpenPGP digital signature
Re: [PATCH] ipoib: fix hard_header return value
On 03/26/2013 02:16 PM, Jason Gunthorpe wrote: > On Tue, Mar 26, 2013 at 09:46:28AM -0700, Roland Dreier wrote: >>> Checkpatch recommends since some time to use sizeof(e) instead of sizeof e, >>> isn't it ? >> >> I actually prefer "sizeof e" since sizeof is an operator, not a >> function. "sizeof(e)" looks just as silly as "return(e)" to me. > > Sizeof is used as an expression, return is not. > > They have different precedence rules: > return e + 1; // == return (e + 1) > vs > sizeof e + 1; // == sizeof(e) + 1 > > Or weirder: > return (void *)x; // OK > vs > sizeof (void *)x; // <-- compile error > > Coding sizeof as a function call frees the reader from having to > check/know the obscure precedence rules for sizeof. Personally, I couldn't care less. But about 5 lines outside of the context of this patch in the same function is another use of that same expression. The real reason I used the form I did was to match the style of that other instance. I prepared a separate patch to convert both of them to the checkpatch preferred style until I saw Roland say he didn't want it. But what I wouldn't have done is submitted a patch that caused a single function to have two different styles within 8 lines of each other just because some nanny state nag script tells me too. -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/2] Ad IB_MTU_1500|9000 enums.
On 04/19/2013 11:24 AM, Jeff Squyres (jsquyres) wrote: > On Apr 12, 2013, at 11:40 AM, Jeff Squyres (jsquyres) > wrote: > >>> As an aside I like the use of RDMA_MTU_* for these values. Again to >>> distinguish them from the IBTA values. But I know that is poor form. >> >> So what's the right way to move forward on this? Is it this: >> >> enum ib_mtu { >> IB_MTU_256 = 1, >> IB_MTU_512 = 2, >> IB_MTU_1024 = 3, >> IB_MTU_2048 = 4, >> IB_MTU_4096 = 5, >> RDMA_MTU_1500 = 1500, >> RDMA_MTU_9000 =9000 >> }; > > > Bump. > This seems reasonable, but still concerns me a bit. The original version was flat out wrong because you can't re-arrange any exposed enum like this without requiring that all user space apps be recompiled. This is especially true because ibv_mtu_enum_to_int is an inline, so any compiled programs won't have been updated by a shared library update, yet the new enum values can end up showing up, so the apps break when they start seeing -1 as the return value, or when they interpret 1500 as 2048, etc. I wonder if the correct way forward isn't to leave these two functions alone (as inlines you can't change them without a recompile anyway). However, I would officially change the documentation to specify that both enum ib_mtu in the queue_pair structs and the result of ib_mtu_enum_to_int is not exact on non-IB link layers and is deprecated in favor of a new function: ib_qp_mtu() (or similar) that takes a queue pair and returns the real mtu for that queue pair based on params and device the queue pair is on (I could also see it being ibv_dev_mtu(struct ibv_device *) instead if you want to query the mtu before you make a queue pair instead). Then for both IBoE and USNIC devices, I would just store the actual MTU in the internal ibv device struct and return that when requested. This maintains 100% back compatibility with existing apps but allows a path for people to get better performance/utilization by using the new function instead of the old one. -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/2] Ad IB_MTU_1500|9000 enums.
On 04/20/2013 07:29 PM, Hefty, Sean wrote: >> This seems reasonable, but still concerns me a bit. The original >> version was flat out wrong because you can't re-arrange any exposed >> enum like this without requiring that all user space apps be >> recompiled. This is especially true because ibv_mtu_enum_to_int is >> an inline > > ib_mtu_enum_to_int() is a kernel function, not user space, so I think > we're fine here, unless you're concerned about drivers built out of > tree. Well, although *I* might have to worry about out of kernel drivers, I wouldn't suggest such for upstream. However, for some reason I had it in my mind when I was reading the patch that it was against libibverbs. That's what I get for staying up late and reviewing when I'm tired :-/ -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/2] Ad IB_MTU_1500|9000 enums.
On 04/22/2013 02:00 PM, Jeff Squyres (jsquyres) wrote: > 2. Change all instances of ib_mtu/ibv_mtu to an int. Code such as > "switch(mtu) case IBV_MTU_1024: ..." will need to be updated to > "switch(mtu) case 1024: ...". > > PRO: solves the problem for all MTU values > PRO: eliminates the num-to-int translation functions > CON: much driver code will need to be updated per above, and also > update logic checking for out-of-bounds MTU calues > CON: similarly, userspace apps will need to be updated; it might be > worthwhile to bump libibverbs to 2.x, and then intentionally change > the MTU field names in ibv_port_attr and ibv_qp_attr so that apps > using those fields will fail to compile with libibverbs 2.x > (and therefore forcibly realize they need to adapt to the new > int MTU values) I was actually thinking that an ibverbs API version 2.0 might be an interesting way to go. The proliferation of non-IB link layers providing the verbs API make some of the original assumptions of IB link layer in the original API obsolete. But, if we were to do that, I'd take some time to really think the issue over and try to catch all of the needed updates in one go. -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] ipoib: fix hard_header return value
On 04/01/13 17:25, Or Gerlitz wrote: > Doug Ledford wrote: >> If you have a patched up dhcp server (and dhclient), > > Could you be more specific, I assume you refer to the ISC dhcp bits, > which version and which patches? Any version of dhcp server, and the improved-xid and lpf-ib patches primarily. We've carried those patches forever, but as far as I know, they still have not been taken by ISC. Without them, dhcp server will only work with a cooked socket interface. You can't use raw as the socket type when compiling or else it won't work on IB. With the patches, you can enable the raw socket method, and on IB it will fall back to use PACKET instead. > AFAIK they don't give you access to > their source repo but rather only to drops plus possibly patches which > is a bit more tedious to follow, oh wel... > >> they will use AF_PACKET/SOCK_DGRAM pair to send dhcp packets over IPoIB. > > >> This has worked since forever if you use OFED kernels or one of the >> distribution >> kernels. However, when testing an upstream kernel, it has been broken >> for a very long time (I tested 2.6.34, 2.6.38, 3.0, 3.1, 3.8, HEAD). > > IMO doesn't seem relevant to the upstream commit message I disagree. I don't buy the whole "we are upstream, nothing else matters or is relevant" philosophy. The truth of the matter is that there is essentially a fork between upstream and OFED. I plan on spending some time bringing some of the relevant fixes present in OFED and not upstream back to upstream. In the context of attempting to manually merge some of that fork back together, I see no reason whatsoever to ignore relevant historical information during that process. >> It turns out that the hard_header routine in ipoib is not following >> the API and is returning 0 even when it pushed data onto the skb. This >> then causes af_packet.c to overwrite the header just pushed with data >> from user space. This header is immediately referenced in the >> ipoib_start_xmit routine > > cool, I assume we want this fix to go for -stable after spending some > time upstream, e.g probably by the time 3.9 is released and some more > testing is done on the patch (I'll advocate for that @ MLNX, copied > some folks now) get that to -stable too. Yes, it can go to -stable. But, given that no one has noticed before now that it doesn't work, I'm guessing not many people are using straight upstream (which is something that needs fixed IMO). > Erez, in the code we use internally which is based on upstream 3.7 do > we have DHCP/IPoIB working without this patch? > >> so I'm wondering how this ever worked in >> distro/ofed kernels that also have this bug, but fixing the bug here >> makes things work in upstream kernels. > > same for the last three lines > > > Signed-off-by: Doug Ledford > --- > drivers/infiniband/ulp/ipoib/ > ipoib_main.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c > b/drivers/infiniband/ulp/ipoib/ipoib_main.c > index 8534afd..31dd2a7 100644 > --- a/drivers/infiniband/ulp/ipoib/ipoib_main.c > +++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c > @@ -828,7 +828,7 @@ static int ipoib_hard_header(struct sk_buff *skb, > */ > memcpy(cb->hwaddr, daddr, INFINIBAND_ALEN); > > - return 0; > + return sizeof *header; > } > > static void ipoib_set_mcast_list(struct net_device *dev) > -- Doug Ledford GPG KeyID: 0E572FDD http://people.redhat.com/dledford Infiniband specific RPMs available at http://people.redhat.com/dledford/Infiniband -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/2] libibverbs: Use autoreconf in autogen.sh
On 04/30/2013 10:29 AM, Jeff Squyres (jsquyres) wrote: > Bump bump. :-) This is fine with me, however, I think you also need to bump the autotools version to the latest upstream. The automated checkers in our build environment is spitting out errors about a number of upstream packages where the autotools used to configure the package does not include proper arm support. The latest autotools bring in all of the forthcoming arm variants. So I would like to see both of these things done. > On Apr 25, 2013, at 11:38 AM, Jeff Squyres (jsquyres) > wrote: > >> Bump. >> >> On Apr 22, 2013, at 1:41 PM, Jeff Squyres wrote: >> >>> The old sequence of Autotools commands listed in autogen.sh is no >>> longer correct. Instead, just use the single "autoreconf" command, >>> which will invoke all the Right Autotools commands in the correct >>> order. >>> >>> Signed-off-by: Jeff Squyres >>> --- >>> autogen.sh | 6 +- >>> 1 file changed, 1 insertion(+), 5 deletions(-) >>> >>> diff --git a/autogen.sh b/autogen.sh >>> index fd47839..6c9233e 100755 >>> --- a/autogen.sh >>> +++ b/autogen.sh >>> @@ -1,8 +1,4 @@ >>> #! /bin/sh >>> >>> set -x >>> -aclocal -I config >>> -libtoolize --force --copy >>> -autoheader >>> -automake --foreign --add-missing --copy >>> -autoconf >>> +autoreconf -ifv -I config >>> -- >>> 1.8.1.1 >>> >> >> >> -- >> Jeff Squyres >> jsquy...@cisco.com >> For corporate legal information go to: >> http://www.cisco.com/web/about/doing_business/legal/cri/ >> >> -- >> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in >> the body of a message to majord...@vger.kernel.org >> More majordomo info at http://vger.kernel.org/majordomo-info.html > > -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: libibverbs / libmlx4 release
On 05/20/2013 12:49 PM, Roland Dreier wrote: > On Mon, May 20, 2013 at 5:37 AM, Or Gerlitz wrote: >> Following what we discussed last week during the Linux Foundation EU summit, >> I think it would be good to follow what you said and have a point release >> for libibverbs and libmlx4 before we pull in the verbs extensions framework >> and features that use it (XRC, Flow-Steering, etc more fun). >> >> I mentioned to you that we have some more libmlx4 patches, but its totally >> OK for us to submit them after that release, makes sense? > > That's fine, I'll do the releases this week. If you haven't already done so, can you please update to the latest autotools and reroll the autotools files when you do the release? -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: BUG: unable to handle kernel paging request at 0000000000070a78 IPoIB
On 05/23/2013 11:38 AM, Jack Wang wrote: > Tainted: G O 3.4.23-pserver-hotfix+ #109 System manufacturer ^^^ I would try a newer kernel. There are a couple known issues fixed since this kernel (including a memory corrupter that was involved with neighbor list handling, and some of your traces look vaguely familiar to that old failuer). -- Doug Ledford GPG KeyID: 0E572FDD http://people.redhat.com/dledford signature.asc Description: OpenPGP digital signature
Re: BUG: unable to handle kernel paging request at 0000000000070a78 IPoIB
On 05/23/2013 02:53 PM, Jack Wang wrote: > On 2013年05月23日 19:41, Doug Ledford wrote: >> On 05/23/2013 11:38 AM, Jack Wang wrote: >>> Tainted: G O 3.4.23-pserver-hotfix+ #109 System manufacturer >> ^^^ >> >> I would try a newer kernel. There are a couple known issues fixed since >> this kernel (including a memory corrupter that was involved with >> neighbor list handling, and some of your traces look vaguely familiar to >> that old failuer). >> >> >> > > Thanks Doug for reply, I tried branch rdma-for-linus, It panic in other > places. > > Could you point me which commit do you mean exactly? > > Regards, > Jack > Just try the official v3.9 kernel from Linus and see how it does. A 'git checkout v3.9' will do the trick. -- Doug Ledford GPG KeyID: 0E572FDD http://people.redhat.com/dledford signature.asc Description: OpenPGP digital signature
[Patch infiniband-diags:perfquery] Loop through all local HCAs/ports
The -a mode of perfquery is intended to loop through all ports on a single HCA and provide aggregated output across all ports. The -l mode is intended to loop through all ports of a single HCA and output non-aggregated data. Neither mode addresses a machine with more than one HCA. Furthermore, I found both -a and -l failed to loop properly on my Mellanox adapter (it would read the first port and error out trying to read the second). So, I wrote a new switch, -H, that loops through all ports on all HCAs in the system. Because of how it's implemented, it gets around the problem that both -a and -l had on my machine when dealing with the second Mellanox port. It, however, does not do aggregated output because each HCA/port combination is treated as its own device. I forgot to update the man page though. If the current infiniband-diags maintainer wants it, I can add that (that's assuming the base patch is acceptable). I think Ira is doing that now, right? Anyway, attached is the patch. diff -up infiniband-diags-1.5.8/src/perfquery.c.hcas infiniband-diags-1.5.8/src/perfquery.c --- infiniband-diags-1.5.8/src/perfquery.c.hcas 2011-02-16 05:13:21.0 -0500 +++ infiniband-diags-1.5.8/src/perfquery.c 2011-07-29 17:17:41.599262511 -0400 @@ -347,7 +347,7 @@ static void reset_counters(int extended, } static int reset, reset_only, all_ports, loop_ports, port, extended, xmt_sl, -rcv_sl, xmt_disc, rcv_err, smpl_ctl; +rcv_sl, xmt_disc, rcv_err, smpl_ctl, all_hcas; static void common_func(ib_portid_t * portid, int port_num, int mask, unsigned query, unsigned reset, @@ -447,12 +447,16 @@ static int process_opt(void *context, in case 'R': reset_only++; break; + case 'H': + all_hcas++; + break; default: return -1; } return 0; } +#define MAX_NAMES 10 int main(int argc, char **argv) { int mgmt_classes[4] = { IB_SMI_CLASS, IB_SMI_DIRECT_CLASS, IB_SA_CLASS, @@ -467,6 +471,10 @@ int main(int argc, char **argv) int start_port = 1; int enhancedport0; int i; + int names; + int cur_name = 0; + char name_list[10][UMAD_CA_NAME_LEN]; + umad_ca_t ca; const struct ibdiag_opt opts[] = { {"extended", 'x', 0, NULL, "show extended port counters"}, @@ -476,6 +484,7 @@ int main(int argc, char **argv) {"rcverr", 'E', 0, NULL, "show Rcv Error Details"}, {"smplctl", 'c', 0, NULL, "show samples control"}, {"all_ports", 'a', 0, NULL, "show aggregated counters"}, + {"all_hcas", 'H', 0, NULL, "iterate through all local HCAs and ports"}, {"loop_ports", 'l', 0, NULL, "iterate through each port"}, {"reset_after_read", 'r', 0, NULL, "reset counters after read"}, {"Reset_only", 'R', 0, NULL, "only reset counters"}, @@ -484,10 +493,12 @@ int main(int argc, char **argv) char usage_args[] = " [ [[port] [reset_mask]]]"; const char *usage_examples[] = { "\t\t# read local port's performance counters", + "-H\t\t# read performance counters on all local HCAs/ports", "32 1\t\t# read performance counters from lid 32, port 1", "-x 32 1\t# read extended performance counters from lid 32, port 1", "-a 32\t\t# read performance counters from lid 32, all ports", "-r 32 1\t# read performance counters and reset", + "-r -H\t\t# read and reset counters on all local HCAs/ports", "-x -r 32 1\t# read extended performance counters and reset", "-R 0x20 1\t# reset performance counters of port 1 only", "-x -R 0x20 1\t# reset extended performance counters of port 1 only", @@ -503,118 +514,173 @@ int main(int argc, char **argv) argc -= optind; argv += optind; + if (all_hcas && argc > 0) + IBERROR("Invalid input: all_hcas and any port/lid/guid are " + "not allowed together."); + if (all_hcas && (loop_ports || all_ports || (port==ALL_PORTS))) + IBERROR("Invalid input: all_hcas already goes over all ports, " + "but in a way that's incompatible with the all_ports " + "option or variants."); + if (argc > 1) port = strtoul(argv[1], 0, 0); if (argc > 2) mask = strtoul(argv[2], 0, 0); - srcport = mad_rpc_open_port(ibd_ca, ibd_ca_port, mgmt_classes, 4); - if (!srcport) - IBERROR("Failed to open '%s' port '%d'", ibd_ca, ibd_ca_port); - - if (argc) { - if (ib_resolve_portid_str_via(&portid, argv[0], ibd_dest_type, - ibd_sm_id, srcport) < 0) - IBERROR("can't resolve
Re: [Patch infiniband-diags:perfquery] Loop through all local HCAs/ports
On 07/29/2011 05:57 PM, Doug Ledford wrote: + if (umad_get_ca(name_list[cur_name], &ca)) + /* We're done, the next name was +* empty, just exit gracefully +*/ + exit(0); In retrosight, this may not exit quietly. I can't tell because the last adapter in my system causes the program to exit noisily right now for totally unrelated reasons, so if I got that fixed, I may find this isn't as quite of an exit as I thought. This may need to be changed to a check for a NULL name instead, on NULL exit quietly, and if non-NULL then attempt to get it, and then on failure be noisy. -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Patch infiniband-diags:perfquery] Loop through all local HCAs/ports
On 07/29/2011 06:09 PM, Jason Gunthorpe wrote: On Fri, Jul 29, 2011 at 05:57:57PM -0400, Doug Ledford wrote: The -a mode of perfquery is intended to loop through all ports on a single HCA and provide aggregated output across all ports. The -l mode is intended to loop through all ports of a single HCA and output non-aggregated data. Actually, none of these modes are intended to support HCA scenarios, they are all only for switches. Well, they attempt to work on local HCAs if you don't specify a switch lid to query. So, intended or not, they are already being attempted to be used in this fashion out in the field. Not sure what I think of this, is dumping counters on all local HCA ports really that interesting? Would this be better done by doing something fancy with nodeGUID so at least all ports on remote HCAs can be dumped too? The request came in from one of our partners who wanted it for tracking performance stats specifically on the local machine. -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Patch infiniband-diags:perfquery] Loop through all local HCAs/ports
On 7/29/2011 6:26 PM, Jason Gunthorpe wrote: >>> Not sure what I think of this, is dumping counters on all local HCA >>> ports really that interesting? Would this be better done by doing >>> something fancy with nodeGUID so at least all ports on remote HCAs can >>> be dumped too? >> >> The request came in from one of our partners who wanted it for >> tracking performance stats specifically on the local machine. > > Well, I think it would be best to make this work generally which is > fairly hard, unfortunately. > - If the destination is the local HCA then you have to iterate over >all local ports with matching node GUIDs by opening devices > - If the destination is a remote HCA then you have to query the SA >for all nodes with a matching GUID and iterate over them. Ira has >been working on some common code for this.. > - Maybe you want to cross product and try to query the SA attached to >all local end ports to try and find all ports. That would actually be >very useful for many situations I know of... > > Could you make your patch just do #1 and continue to misbehave for > the other cases? Well, it sort of already does. The patch looks bigger than it is because I had to re-indent in order to put the main functional code inside of a loop. For the most part, the loop is unchanged. Just apply the patch then look at what we do when all_hcas is non-0 right before the do { and right before the } while (ibd_ca); It iterates over all the hca names that libibumad gives us, and then loops over all the ports as returned by libibumad's umad_get_ca(). -- Doug Ledford GPG KeyID: CFBFF194 http://people.redhat.com/dledford Infiniband specific RPMs available at http://people.redhat.com/dledford/Infiniband signature.asc Description: OpenPGP digital signature
[Patch libibverbs-1.1.5] Fix broken compat assumption on ppc arch
In the compat-1.0 wrapper, the post_send wrapper assumes that the layout of the tail end of the ibv_send_wr struct is the same in version 1.0 and 1.1. Unfortunately, padding of the struct makes this untrue depending on the architecture. In my case, it was ppc that was failing to compile because we were triggering the gcc built in memcpy check. We were scribbling past the end of the structure of the new structure because the old structure has alignment padding that the new one doesn't. There are two spots that are padded in the version 1.0 ibv_send_wr struct, 4 bytes before sg_list, and 4 more bytes before wr. So this patch conditionalizes whether or not we use memcpy on the basis of data that will be reduced to constants at compile time, so even though there is an if statement, it gets reduced to just the proper method for the given arch at compile time. And the patch will attempt a simple memcpy if possible, or two memcpy's in the case that wr needed padding to be aligned (as is the case on ppc), and fall back to field based copy just in case this is compiled on some arch with really odd field alignment characteristics. Signed-off-by: Doug Ledford diff -up libibverbs-1.1.5/src/compat-1_0.c.memcpy libibverbs-1.1.5/src/compat-1_0.c --- libibverbs-1.1.5/src/compat-1_0.c.memcpy2011-06-28 17:13:45.0 -0400 +++ libibverbs-1.1.5/src/compat-1_0.c 2011-08-02 12:07:20.114511405 -0400 @@ -350,8 +350,46 @@ static int post_send_wrapper_1_0(struct real_wr->wr_id = w->wr_id; real_wr->next = NULL; - memcpy(&real_wr->sg_list, &w->sg_list, - sizeof *w - offsetof(struct ibv_send_wr, sg_list)); +#define TEST_SIZE_2_POINT(f1, f2) \ +((offsetof(struct ibv_send_wr, f1) - offsetof(struct ibv_send_wr, f2)) \ +== offsetof(struct ibv_send_wr_1_0, f1) - offsetof(struct ibv_send_wr_1_0, f2)) +#define TEST_SIZE_TO_END(f1) \ +((sizeof(struct ibv_send_wr) - offsetof(struct ibv_send_wr, f1)) == \ + (sizeof(struct ibv_send_wr_1_0) - offsetof(struct ibv_send_wr_1_0, f1))) + + if (TEST_SIZE_TO_END (sg_list)) + memcpy(&real_wr->sg_list, &w->sg_list, sizeof *real_wr + - offsetof(struct ibv_send_wr, sg_list)); + else if (TEST_SIZE_2_POINT (imm_data, sg_list) && +TEST_SIZE_TO_END (wr)) { + /* we have alignment up to wr, but padding between +* imm_data and wr, and we know wr itself is the +* same size */ + memcpy(&real_wr->sg_list, &w->sg_list, + offsetof(struct ibv_send_wr, imm_data) - + offsetof(struct ibv_send_wr, sg_list) + + sizeof real_wr->imm_data); + memcpy(&real_wr->wr, &w->wr, sizeof real_wr->wr); + } else { + real_wr->sg_list = w->sg_list; + real_wr->num_sge = w->num_sge; + real_wr->opcode = w->opcode; + real_wr->send_flags = w->send_flags; + real_wr->imm_data = w->imm_data; + if (TEST_SIZE_TO_END (wr)) + memcpy(&real_wr->wr, &w->wr, + sizeof real_wr->wr); + else { + real_wr->wr.atomic.remote_addr = + w->wr.atomic.remote_addr; + real_wr->wr.atomic.compare_add = + w->wr.atomic.compare_add; + real_wr->wr.atomic.swap = + w->wr.atomic.swap; + real_wr->wr.atomic.rkey = + w->wr.atomic.rkey; + } + } if (is_ud) real_wr->wr.ud.ah = w->wr.ud.ah->real_ah;
Re: Building 3.1-rc9 in kernel infiniband support with OFED libraries
- Original Message - > On Thu, Oct 13, 2011 at 10:24 AM, Christoph Lameter > wrote: > >> Clean reviewed patches for this are where? > > > > They are in OFED-1.5.3.1 so they were already released. > > OFED is neither clean nor reviewed. Really. The stuff in OFED > always > needs a bunch > of review before it is suitable for upstream. Ignoring ABI stability > is just one problem > that OFED code typically has. Indeed. Because it is in OFED means it is in testing, nothing more. A better acronym for OFED is Open Fabrics Experimental Distribution. These issues are the primary reason we don't follow OFED in RHEL6. -- Doug Ledford GPG KeyID: 0E572FDD http://people.redhat.com/dledford -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Building 3.1-rc9 in kernel infiniband support with OFED libraries
- Original Message - > On Tue, Oct 11, 2011 at 7:39 PM, Jason Gunthorpe > wrote: > > On Tue, Oct 11, 2011 at 09:02:41AM -0500, Christoph Lameter wrote: > >> Has XRC support not been merged? How can I build the OFED > >> libraries > >> against Linux 3.1? I'd really like to get rid of the OFED kernel > >> tree > >> nightmare. > > > > You have to use upstream libraries with upstream kernels. Be warned > > that the OFED libraries of the same SONAME are not ABI compatible > > with > > upstream libraries. > > Why is the OFED libibverbs library binary incompatible with the > non-OFED libibverbs library ? Well, as an example: Upstream (Roland's) libibverbs (and this may not be entirely clean, I do have the original RoCE support in here, and plan to switch over to IBoE when it's available): /usr/include/infiniband/verbs.h struct ibv_srq { struct ibv_context *context; void *srq_context; struct ibv_pd *pd; uint32_thandle; pthread_mutex_t mutex; pthread_cond_t cond; uint32_tevents_completed; }; >From OFED's libibverbs: /usr/include/infiniband/verbs.h struct ibv_srq { struct ibv_context *context; void *srq_context; struct ibv_pd *pd; uint32_thandle; uint32_tevents_completed; uint32_txrc_srq_num; struct ibv_xrc_domain *xrc_domain; struct ibv_cq *xrc_cq; pthread_mutex_t mutex; pthread_cond_t cond; }; So, while they added a few new things, they also reordered existing items. This is a public structure passed back to the app and which the app might read any of the items if they are interested. So, with a struct change like this you would normally require an update to the library symbol map and have a compat wrapper so that binaries compiled with either the old or new struct definition can link against the library. However, here is the library symbol maps for ibv_create_srq in each library source: Roland's: ibv_create_srq{IBVERBS_1.0} OFED's: ibv_create_srq{IBVERBS_1.0} This is despite the fact that whenever an ibverbs interface has needed an update, Roland has done the right thing and updated the symbol version to the IBVERBS_1.1 space in the map and created a back compat wrapper in the 1.0 space. So it's not even like the OFED folks didn't have examples of how to do this properly staring them in the face, they just didn't care to do it properly. > Why hasn't XRC support been implemented > in the OFED libibverbs library such that applications built against > the upstream libibverbs headers also work with the latest OFED > version > of that library ? Because OFED exists. That's the *real* reason they don't bother to write this stuff right in the first place. OFED is released as one single unit, and there is no guarantee of API or ABI compatibility from one release to the next. And there is no guarantee of API/ABI compatibility between any given OFED released library and any other upstream library (aka, no guarantee that the OFED libibverbs will work with the upstream librdmacm). If something doesn't work, then their answer is "Upgrade to the latest OFED on all machines, and recompile your apps as needed". If you had your own tree and didn't have to answer to any upstream review, especially on controversial new features or difficult designs, would you bother? -- Doug Ledford GPG KeyID: 0E572FDD http://people.redhat.com/dledford -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Building 3.1-rc9 in kernel infiniband support with OFED libraries
- Original Message - > On Wed, Oct 12, 2011 at 9:32 AM, Wendy Cheng > wrote: > > The OFED package itself does include XRC support. The issue here > > (my > > guess) is that its build script needs to understand the running > > system's kernel version to decide what should be pulled (from the > > source). Linux 3.1 could be too new for OFED build script to make a > > correct decision. Nevertheless, mix-matching OFED modules/libraries > > is > > a *bad* idea. > > No. The same userspace build should work with all kernel versions. Wendy's referring to something other than what you are thinking. The same libibverbs user space build should work on all kernels going back a long way, except when you are talking about OFED their libibverbs is hard coded to assume XRC support and fail if it isn't present, so an OFED libibverbs won't really work without also the OFED kernel module stack. The script Wendy referred to is the script that checks the running kernel's version in order to determine which backport patches need applied to the ofa_kernel source tree in order to build the OFED kernel modules for your running kernel. Without that ofa_kernel build, the OFED libibverbs will indeed fail to run on the running kernel. And that script hasn't been updated to support version 3.x kernels last I checked, so she's right, the script itself doesn't recognize the running kernel version, so ofa_kernel modules don't get built, so OFED libibverbs won't work anyway. So, she's absolutely right, unless you want to start ripping hard coded assumptions about the existence of XRC support out of things like OFED's libibverbs, then out of qperf and a number of their other various packages, then you have to pair the OFED kernel modules and user space packages, they can not be separated. -- Doug Ledford GPG KeyID: 0E572FDD http://people.redhat.com/dledford -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [ewg] [RFC] – Proposal for new process for OFED releases
On 12/01/2011 02:53 PM, Tziporet Koren wrote: > > We propose a new process for the OFED releases starting from next OFED > release: > - OFED content will be the relevant kernel.org modules and user space > released packages > - OFED will offer only backports to the distros (no fixes) > - OFED package will be used for easy installation of all packages in a > friendly manner Yay!!! I'm all in favor. -- Doug Ledford GPG KeyID: 0E572FDD -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[Patch] mlx4: fix ppc64 build issue
This may not be an issue with current gcc, I simply haven't tried it to know. But with the gcc shipped in rhel6 (gcc 4.4.6), you can't both define a function as static and also EXPORT_SYMBOL that function on ppc. This fixes that in the mlx4 driver. -- Doug Ledford GPG KeyID: 0E572FDD http://people.redhat.com/dledford commit 5e3ddfd1616295b47785ab81e330177cb962070a Author: Doug Ledford Date: Fri Feb 17 10:35:29 2012 -0500 mlx4: fix build error powerpc is picky, if you set a function to static and then try to EXPORT_SYMBOL_GPL that function, you get this error message: drivers/net/ethernet/mellanox/mlx4/fw.c:687: error: __ksymtab_mlx4_QUERY_PORT causes a section type conflict Signed-off-by: Doug Ledford diff --git a/drivers/net/ethernet/mellanox/mlx4/fw.c b/drivers/net/ethernet/mellanox/mlx4/fw.c index 8a21e10..9ea7cab 100644 --- a/drivers/net/ethernet/mellanox/mlx4/fw.c +++ b/drivers/net/ethernet/mellanox/mlx4/fw.c @@ -685,7 +685,7 @@ int mlx4_QUERY_PORT_wrapper(struct mlx4_dev *dev, int slave, return err; } -static int mlx4_QUERY_PORT(struct mlx4_dev *dev, void *ptr, u8 port) +int mlx4_QUERY_PORT(struct mlx4_dev *dev, void *ptr, u8 port) { struct mlx4_cmd_mailbox *outbox = ptr; diff --git a/drivers/net/ethernet/mellanox/mlx4/mr.c b/drivers/net/ethernet/mellanox/mlx4/mr.c index 8deeef9..25a80d7 100644 --- a/drivers/net/ethernet/mellanox/mlx4/mr.c +++ b/drivers/net/ethernet/mellanox/mlx4/mr.c @@ -304,7 +304,7 @@ static int mlx4_HW2SW_MPT(struct mlx4_dev *dev, struct mlx4_cmd_mailbox *mailbox MLX4_CMD_TIME_CLASS_B, MLX4_CMD_WRAPPED); } -static int mlx4_mr_reserve_range(struct mlx4_dev *dev, int cnt, int align, +int mlx4_mr_reserve_range(struct mlx4_dev *dev, int cnt, int align, u32 *base_mridx) { struct mlx4_priv *priv = mlx4_priv(dev); @@ -320,14 +320,14 @@ static int mlx4_mr_reserve_range(struct mlx4_dev *dev, int cnt, int align, } EXPORT_SYMBOL_GPL(mlx4_mr_reserve_range); -static void mlx4_mr_release_range(struct mlx4_dev *dev, u32 base_mridx, int cnt) +void mlx4_mr_release_range(struct mlx4_dev *dev, u32 base_mridx, int cnt) { struct mlx4_priv *priv = mlx4_priv(dev); mlx4_bitmap_free_range(&priv->mr_table.mpt_bitmap, base_mridx, cnt); } EXPORT_SYMBOL_GPL(mlx4_mr_release_range); -static int mlx4_mr_alloc_reserved(struct mlx4_dev *dev, u32 mridx, u32 pd, +int mlx4_mr_alloc_reserved(struct mlx4_dev *dev, u32 mridx, u32 pd, u64 iova, u64 size, u32 access, int npages, int page_shift, struct mlx4_mr *mr) { @@ -457,7 +457,7 @@ int mlx4_mr_alloc(struct mlx4_dev *dev, u32 pd, u64 iova, u64 size, u32 access, } EXPORT_SYMBOL_GPL(mlx4_mr_alloc); -static void mlx4_mr_free_reserved(struct mlx4_dev *dev, struct mlx4_mr *mr) +void mlx4_mr_free_reserved(struct mlx4_dev *dev, struct mlx4_mr *mr) { int err; @@ -852,7 +852,7 @@ err_free: } EXPORT_SYMBOL_GPL(mlx4_fmr_alloc); -static int mlx4_fmr_alloc_reserved(struct mlx4_dev *dev, u32 mridx, +int mlx4_fmr_alloc_reserved(struct mlx4_dev *dev, u32 mridx, u32 pd, u32 access, int max_pages, int max_maps, u8 page_shift, struct mlx4_fmr *fmr) { @@ -954,7 +954,7 @@ int mlx4_fmr_free(struct mlx4_dev *dev, struct mlx4_fmr *fmr) } EXPORT_SYMBOL_GPL(mlx4_fmr_free); -static int mlx4_fmr_free_reserved(struct mlx4_dev *dev, struct mlx4_fmr *fmr) +int mlx4_fmr_free_reserved(struct mlx4_dev *dev, struct mlx4_fmr *fmr) { if (fmr->maps) return -EBUSY;
[Patch] iser: free ib connection resources in the proper place
We allocate the login dma buffers in iser_verbs.c as part of alloc_ib_conn_resources(), however we are freeing them in iser_initiator.c as part of iser_free_rx_descriptors(). This is needlessly confusing. We have an alloc_rx_descriptors() and it doesn't alloc something that the free_rx_descriptors() frees, and we have an alloc_ib_conn_resources() that allocs something not freed by free_ib_conn_resources(). Clean that up. Signed-off-by: Doug Ledford --- drivers/infiniband/ulp/iser/iser_initiator.c | 12 drivers/infiniband/ulp/iser/iser_verbs.c | 12 2 files changed, 12 insertions(+), 12 deletions(-) diff --git a/drivers/infiniband/ulp/iser/iser_initiator.c b/drivers/infiniband/ulp/iser/iser_initiator.c index a607542..622e985 100644 --- a/drivers/infiniband/ulp/iser/iser_initiator.c +++ b/drivers/infiniband/ulp/iser/iser_initiator.c @@ -220,18 +220,6 @@ void iser_free_rx_descriptors(struct iser_conn *ib_conn) struct iser_rx_desc *rx_desc; struct iser_device *device = ib_conn->device; - if (ib_conn->login_buf) { - if (ib_conn->login_req_dma) - ib_dma_unmap_single(device->ib_device, - ib_conn->login_req_dma, - ISCSI_DEF_MAX_RECV_SEG_LEN, DMA_TO_DEVICE); - if (ib_conn->login_resp_dma) - ib_dma_unmap_single(device->ib_device, - ib_conn->login_resp_dma, - ISER_RX_LOGIN_SIZE, DMA_FROM_DEVICE); - kfree(ib_conn->login_buf); - } - if (!ib_conn->rx_descs) return; diff --git a/drivers/infiniband/ulp/iser/iser_verbs.c b/drivers/infiniband/ulp/iser/iser_verbs.c index e28877c..46ba982 100644 --- a/drivers/infiniband/ulp/iser/iser_verbs.c +++ b/drivers/infiniband/ulp/iser/iser_verbs.c @@ -274,6 +274,18 @@ static int iser_free_ib_conn_res(struct iser_conn *ib_conn, int can_destroy_id) ib_conn->cma_id = NULL; kfree(ib_conn->page_vec); + if (ib_conn->login_buf) { + if (ib_conn->login_req_dma) + ib_dma_unmap_single(device->ib_device, + ib_conn->login_req_dma, + ISCSI_DEF_MAX_RECV_SEG_LEN, DMA_TO_DEVICE); + if (ib_conn->login_resp_dma) + ib_dma_unmap_single(device->ib_device, + ib_conn->login_resp_dma, + ISER_RX_LOGIN_SIZE, DMA_FROM_DEVICE); + kfree(ib_conn->login_buf); + } + return 0; } -- 1.7.6.5 -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
ibacm fixes/updates
Hi Sean, We were asked to integrate this into our product for scalability purposes. While working on that, I ran across a number of issues that needed fixed up (typical for a new package). So, here's a number of patches and fix ups for those things. First, the package is ibacm but the binary is ib_acm. Trust me, it's best to pick one of the other and stick with it. Especially since you used ibacm for the log and pid files too. In my build, I went ahead and changed the binary name to match the package and log and pid file names. I patched the Makefile.am and reran the autoconf tools to make this happen. I've attached that patch. In the package spec file don't differentiate between %{ver} and %{version}, these should always be the same and you should never need to use anything other than %{version}. Your spec file had a srv component, that I suspect was intended to hold the init script, but no init script was ever written and the section was incomplete (so your spec file wouldn't even build). I rewrote the spec file so that the base component is the service. In truth, for this you really want it to be a service and on by default if installed. I've attached my spec file. I've also attached the init script I made. By default ibacm expects to find its configuration files in /etc/ibacm. This adds to the proliferation of directories in /etc/ needlessly. We already have a number of RDMA related directories to choose from depending on your install (OFED == /etc/ofed or /etc/openib in the old days, RHEL5 == /etc/openib from the old days, RHEL6 and Fedora are /etc/rdma, don't know what SuSE uses). Would be best if these files were in the same place as the other RDMA related files. And to make that happen easily, it would be best if they picked up on %{sysconfdir} from the ./configure invocation in order to set the directory. I used a static patch for now because I'm behind on my work, so updating the configure script was more than I had time to do. I'm not bothering to include my patch as it needs done the other way so my patch is purely a stopgap that should not see the time of day ;-) OK, gotta run. Would like to see this stuff picked up, especially the init stuff. Getting it in your release before we have one init script, SuSE another, and OFED another will help keep things uniform. -- Doug Ledford GPG KeyID: 0E572FDD http://people.redhat.com/dledford #!/bin/bash # # Bring up/down the ibacm daemon # # chkconfig: 2345 25 75 # description: Starts/Stops InfiniBand ACM service # ### BEGIN INIT INFO # Provides: ibacm # Default-Start: 2 3 4 5 # Default-Stop: 0 1 6 # Required-Start: rdma $network # Required-Stop: rdma $network # Should-Start: opensm # Should-Stop: opensm # Short-Description: Starts and stops the InfiniBand ACM service # Description: The InfiniBand ACM service provides a user space implementation # of someting resembling an ARP cache for InfiniBand SA queries and # host route lookups. ### END INIT INFO pidfile=/var/run/ibacm.pid subsys=/var/lock/subsys/ibacm prog=/usr/sbin/ibacm . /etc/rc.d/init.d/functions start() { echo -n "Starting ibacm daemon:" daemon $prog RC=$? [ $RC -eq 0 ] && touch $subsys echo return $RC } stop() { echo -n "Stopping ibacm daemon:" killproc -p $pidfile $prog RC=$? rm -f $subsys echo return $RC } status() { if [ ! -f $subsys -a ! -f $pidfile ]; then return 3 fi if [ -f $pidfile ]; then checkpid `cat $pidfile` return $? fi if [ -f $subsys ]; then return 2 fi } restart () { stop start } condrestart () { [ -e $subsys ] && restart || return 0 } usage () { echo echo "Usage: `basename $0` {start|stop|restart|condrestart|try-restart|force-reload|status}" echo return 2 } case $1 in start|stop|restart|condrestart|try-restart|force-reload) [ `id -u` != "0" ] && exit 4 ;; esac case $1 in start) start; RC=$? ;; stop) stop; RC=$? ;; restart) restart; RC=$? ;; reload) RC=3 ;; condrestart) condrestart; RC=$? ;; try-restart) condrestart; RC=$? ;; force-reload) condrestart; RC=$? ;; status) status; RC=$? ;; *) usage; RC=$? ;; esac exit $RC Name: ibacm Version: 1.0.5 Release: 1%{?dist} Summary: InfiniBand Communication Manager Assistant Group: System Environment/Daemons License: GPLv2 or BSD Url: http://www.openfabrics.org/ Source: http://www.openfabrics.org/downloads/rdmacm/%{name}-%{version}.tar.gz Source1: ibacm.init Patch0: ibacm-1.0.5-make.patch Patch1: ibacm-1.0.5-conf.patch BuildRoot: %{_tmppath}/%{name}-%{version}-%{release}-root-%(%{__id_u} -n) BuildRequires: libibverbs-devel >= 1.1-1, autoconf, libtool, libibumad-devel Requires(post): chkconfig Requires(preun): chkconfig ExcludeArch: s390 s390x %descripti
Re: ibacm fixes/updates
On 02/28/2012 04:08 PM, Doug Ledford wrote: > By default ibacm expects to find its configuration files in /etc/ibacm. > This adds to the proliferation of directories in /etc/ needlessly. We > already have a number of RDMA related directories to choose from > depending on your install (OFED == /etc/ofed or /etc/openib in the old > days, RHEL5 == /etc/openib from the old days, RHEL6 and Fedora are > /etc/rdma, don't know what SuSE uses). Would be best if these files > were in the same place as the other RDMA related files. And to make > that happen easily, it would be best if they picked up on %{sysconfdir} > from the ./configure invocation in order to set the directory. I used a > static patch for now because I'm behind on my work, so updating the > configure script was more than I had time to do. I'm not bothering to > include my patch as it needs done the other way so my patch is purely a > stopgap that should not see the time of day ;-) Oh, and another issue, the pid and lock files are not in the LSB standard locations: Pid file: /var/run/ Lock file: /var/lock/subsys/ (this is created by the init script, and it is doing the right thing here) Right now the program creates the pid file in /var/lock (or /var/run/lock, don't know which and they are linked so it shows up both places), but not in /var/run which is the standard location for pid files. The port file can stay where it is. -- Doug Ledford GPG KeyID: 0E572FDD http://people.redhat.com/dledford signature.asc Description: OpenPGP digital signature
[Patch opensm] Allow for easily configuring multiple fabrics on one opensm server
There are two things that stand in the way of opensm being run on redundant fabrics easily: 1) The opensm init script only starts one instance of opensm and opensm will only work on one fabric per instance 2) Even if you start multiple instances, you have to hand modify config files for each instance and then when you upgrade the opensm rpm you either loose your modifications or loose getting new default settings I worked around both of these issues, I've attached the files I used to do so. First, I have an opensm init script that allows starting multiple opensm instances. It supports configuring this in one of two ways: 1) Create multiple opensm.conf files, each with a numbered suffix (so opensm.conf.1, opensm.conf.2, etc.) and it will start one opensm instance per config file. This allows an admin to copy the default config over and edit the things they need, and on rpm upgrade there will be a new default opensm.conf file so they can diff between their edited version and the new default and see if there are changes they need to bring back in. This also allows for complete flexibility in setting up the different fabrics, for instance you could use one type of routing on one and a totally different type on the others. 2) Edit the file /etc/sysconfig/opensm and define more than one GUID in the GUIDs variable. This will cause the opensm init script to automatically start one instance per GUID, passing the GUID in on the command line. For the most part, this works well. However, openmpi in particular doesn't like you to have physically separate fabrics that have the same subnet_prefix, and you can't specify a subnet_prefix on the command line to go along with the GUIDs. So I wrote a patch for that and made the init script unilaterally increment the subnet prefix for each different GUID it's attaching to. All in all, we use the attached opensm file in /etc/sysconfig as the standard place you put options belonging to an init script, we have the opensm init script, the subnet_prefix patch I wrote, and with those combined things work quite well. However, I will note that our init script does not (and will not ever) play the passwordless root ssh stuff that upstream does. This is considered a serious security risk on side. The idea that a customer (let's say a wall street bank) should set up passwordless root ssh on their cluster that's a backend to their web farm? Oh hell no... I might recommend that it is long since past time for that particular misfeature of the upstream opensm init script to be done away with. Personally, I would simply recommend that on failover from a primary to a backup that it simply scan the fabric and build a "current guid2lid" map from what it finds, then start updating from there. Or something like that. But passwordless ssh...bleh. Oh, and while I've got your ear...is there a good reason the opensm libs have been soname bumping so frequently? Is it not possible to extend the APIs without soname bumps quite so often? -- Doug Ledford GPG KeyID: 0E572FDD http://people.redhat.com/dledford #!/bin/bash # # Bring up/down opensm # # chkconfig: - 15 85 # description: Activates/Deactivates InfiniBand Subnet Manager # config: /etc/ofed/opensm.conf # ### BEGIN INIT INFO # Provides: opensm # Default-Stop: 0 1 2 3 4 5 6 # Required-Start: rdma # Required-Stop: rdma # Short-Description: Starts/Stops the InfiniBand Subnet Manager # Description: Starts/Stops the InfiniBand Subnet Manager ### END INIT INFO . /etc/rc.d/init.d/functions prog=/usr/sbin/opensm PID_FILE=/var/run/opensm.pid [ -f /etc/sysconfig/opensm ] && . /etc/sysconfig/opensm [ -n "$PRIORITY" ] && prio="-p $PRIORITY" ACTION=$1 start() { local OSM_PID= if [ -z "$GUIDS" ]; then CONFIGS="" CONFIG_CNT=0 for conf in /etc/rdma/opensm.conf.[0-9]*; do CONFIGS="$CONFIGS $conf" let CONFIG_CNT++ done else GUID_CNT=0 for guid in $GUIDS; do let GUID_CNT++ done fi [ -f /var/lock/subsys/opensm ] && return 0 # Start opensm echo -n "Starting IB Subnet Manager: " [ -n "$PRIORITY" ] && echo -n "Priority=$PRIORITY " [ -n "$GUIDS" ] && echo -n "$GUID_CNT guids " [ -n "$CONFIGS" ] && echo -n "$CONFIG_CNT instances " if [ -n "$GUIDS" ]; then SUBNET_COUNT=0 for guid in $GUIDS; do SUBNET_PREFIX=`printf "0xfe8000%02d" $SUBNET_COUNT` $prog -B $prio -g $guid --subnet_prefix $SUBNET_PREFIX >/dev/null 2>&1 let SUBNET_COUNT++ done elif [ -n "$CONFIGS" ]; then for config in $CONFIGS; do $prog -B $prio -F $config >/dev/null 2&g
Re: [Patch opensm] Allow for easily configuring multiple fabrics on one opensm server
On 02/29/2012 02:22 PM, Ira Weiny wrote: > Doug, > > First thanks for this. Some comments below. > > On Wed, 29 Feb 2012 00:01:16 -0500 > Doug Ledford wrote: > >> There are two things that stand in the way of opensm being run on >> redundant fabrics easily: >> >> 1) The opensm init script only starts one instance of opensm and opensm >> will only work on one fabric per instance >> 2) Even if you start multiple instances, you have to hand modify config >> files for each instance and then when you upgrade the opensm rpm you >> either loose your modifications or loose getting new default settings >> >> I worked around both of these issues, I've attached the files I used to >> do so. >> >> First, I have an opensm init script that allows starting multiple opensm >> instances. It supports configuring this in one of two ways: >> >> 1) Create multiple opensm.conf files, each with a numbered suffix (so >> opensm.conf.1, opensm.conf.2, etc.) and it will start one opensm >> instance per config file. This allows an admin to copy the default >> config over and edit the things they need, and on rpm upgrade there will >> be a new default opensm.conf file so they can diff between their edited >> version and the new default and see if there are changes they need to >> bring back in. This also allows for complete flexibility in setting up >> the different fabrics, for instance you could use one type of routing on >> one and a totally different type on the others. >> >> 2) Edit the file /etc/sysconfig/opensm and define more than one GUID in >> the GUIDs variable. This will cause the opensm init script to >> automatically start one instance per GUID, passing the GUID in on the >> command line. > > I know you are going for ease of use here, which is good, however, I worry > about this file becoming a redefinition of opensm.conf. Hehehe, I don't think you'll ever have to worry about that. You have looked at opensm.conf in recent times I take it? Replacing that with command line options in a shell startup script isn't reasonable. However, if you are going to run a redundant fabric setup, then the two things you *know* you will have to set are the guid and subnet_prefix (assuming you want to use openmpi). If you are going to run master/slave setup, then the one thing you *know* you will have to set is the priority. Supporting setting those items in an init script is reasonable. Beyond that, I would agree, you should just edit the config files. -- Doug Ledford GPG KeyID: 0E572FDD http://people.redhat.com/dledford signature.asc Description: OpenPGP digital signature
Re: [Patch opensm] Allow for easily configuring multiple fabrics on one opensm server
On 02/29/2012 09:15 PM, Brian Ginsbach wrote: > On Wed, Feb 29, 2012 at 02:47:00PM -0500, Doug Ledford wrote: >> On 02/29/2012 02:22 PM, Ira Weiny wrote: >>> Doug, >>> >>> First thanks for this. Some comments below. >>> >>> On Wed, 29 Feb 2012 00:01:16 -0500 >>> Doug Ledford wrote: >>> >>>> There are two things that stand in the way of opensm being run on >>>> redundant fabrics easily: >>>> >>>> 1) The opensm init script only starts one instance of opensm and opensm >>>> will only work on one fabric per instance >>>> 2) Even if you start multiple instances, you have to hand modify config >>>> files for each instance and then when you upgrade the opensm rpm you >>>> either loose your modifications or loose getting new default settings >>>> >>>> I worked around both of these issues, I've attached the files I used to >>>> do so. >>>> >>>> First, I have an opensm init script that allows starting multiple opensm >>>> instances. It supports configuring this in one of two ways: >>>> >>>> 1) Create multiple opensm.conf files, each with a numbered suffix (so >>>> opensm.conf.1, opensm.conf.2, etc.) and it will start one opensm >>>> instance per config file. This allows an admin to copy the default >>>> config over and edit the things they need, and on rpm upgrade there will >>>> be a new default opensm.conf file so they can diff between their edited >>>> version and the new default and see if there are changes they need to >>>> bring back in. This also allows for complete flexibility in setting up >>>> the different fabrics, for instance you could use one type of routing on >>>> one and a totally different type on the others. >>>> >>>> 2) Edit the file /etc/sysconfig/opensm and define more than one GUID in >>>> the GUIDs variable. This will cause the opensm init script to >>>> automatically start one instance per GUID, passing the GUID in on the >>>> command line. >>> >>> I know you are going for ease of use here, which is good, however, I worry >>> about this file becoming a redefinition of opensm.conf. >> >> Hehehe, I don't think you'll ever have to worry about that. You have >> looked at opensm.conf in recent times I take it? Replacing that with >> command line options in a shell startup script isn't reasonable. >> >> However, if you are going to run a redundant fabric setup, then the two >> things you *know* you will have to set are the guid and subnet_prefix >> (assuming you want to use openmpi). If you are going to run > > Assuming you are doing this for openmpi. The subnet_prefix should > not be needed if the separate subnets are for disjoint networks > (mpi and storage) or multiple storage networks. True enough, but that's why I said openmpi. It is, after all, a primary IB fabric user. >> master/slave setup, then the one thing you *know* you will have to set >> is the priority. Supporting setting those items in an init script is >> reasonable. Beyond that, I would agree, you should just edit the config >> files. >> > > Not everything can be done in the config files. I'm not sure that > it is a good idea to have every opensm instance using the same > temporary and cache directories (OSM_TMP_DIR and OSM_CACHE_DIR > environment variables). Seems like these fall into the *know* you > will have to set category. Unless opensm is smart enough to allow more than one instance to open the same log file and interleave their log messages successfully. Temporary files or cache files could do something like use a pid suffix if need be. But yes, I see your point. Opensm has lots of junk it likes to put on the drive :-/ > You'd also want to make sure that other potentially very useful > things are configured in the config files (e.g. log_file and > log_prefix). Aren't these also things you *know* you will have to > set. I would say we are simply getting to the point where we *know* we need opensm to handle more than one fabric from a single instance ;-) -- Doug Ledford GPG KeyID: 0E572FDD http://people.redhat.com/dledford signature.asc Description: OpenPGP digital signature
Re: [Patch opensm] Allow for easily configuring multiple fabrics on one opensm server
On 3/2/2012 5:30 AM, Alex Netes wrote: > On 11:22 Wed 29 Feb , Ira Weiny wrote: >> Doug, >> >> First thanks for this. Some comments below. >> >> On Wed, 29 Feb 2012 00:01:16 -0500 >> Doug Ledford wrote: >> >>> There are two things that stand in the way of opensm being run on >>> redundant fabrics easily: >>> >>> 1) The opensm init script only starts one instance of opensm and opensm >>> will only work on one fabric per instance >>> 2) Even if you start multiple instances, you have to hand modify config >>> files for each instance and then when you upgrade the opensm rpm you >>> either loose your modifications or loose getting new default settings >>> >>> I worked around both of these issues, I've attached the files I used to >>> do so. >>> >>> First, I have an opensm init script that allows starting multiple opensm >>> instances. It supports configuring this in one of two ways: >>> >>> 1) Create multiple opensm.conf files, each with a numbered suffix (so >>> opensm.conf.1, opensm.conf.2, etc.) and it will start one opensm >>> instance per config file. This allows an admin to copy the default >>> config over and edit the things they need, and on rpm upgrade there will >>> be a new default opensm.conf file so they can diff between their edited >>> version and the new default and see if there are changes they need to >>> bring back in. This also allows for complete flexibility in setting up >>> the different fabrics, for instance you could use one type of routing on >>> one and a totally different type on the others. >>> >>> 2) Edit the file /etc/sysconfig/opensm and define more than one GUID in >>> the GUIDs variable. This will cause the opensm init script to >>> automatically start one instance per GUID, passing the GUID in on the >>> command line. >> >> I know you are going for ease of use here, which is good, however, I worry >> about this file becoming a redefinition of opensm.conf. >> >>> >>> For the most part, this works well. However, openmpi in particular >>> doesn't like you to have physically separate fabrics that have the same >>> subnet_prefix, and you can't specify a subnet_prefix on the command line >>> to go along with the GUIDs. So I wrote a patch for that and made the >>> init script unilaterally increment the subnet prefix for each different >>> GUID it's attaching to. >> >> If you only allow option 1 above this takes care of itself by making the >> admin configure his subnet prefixes in each config file as appropriate. The >> only down side is the loss of new configuration options as you upgrade. >> However, that is probably better taken care of by a default config file with >> each package. I mentioned this to Sasha years back and was denied since >> "you can always generate a new one with '-c'". :-( >> >> Alex would a default config file be acceptable? It would mean more work on >> your part. >> > > What the default opensm.conf would be used for? Just as a reference to the > default values? No, he's referring to having a default config file that is parsed, then an override config file that is parsed where you only put options you want to update in the override config file. That way you could have, for instance, a default opensm.conf in the normal location and totally unedited so that it gets updated with each update of the opensm rpm, then you could create an opensm.conf.1 that is empty except for just a guid setting, a subnet_prefix setting, maybe a cache dir setting, etc. In that way, if say the default routing engine gets a new option in the future, your override config file won't already be populated with the old stuff. It's a means of inheritance that is functionally identical to specifying all this stuff on the command line, but doesn't require a huge command line or a complex init script. -- Doug Ledford GPG KeyID: 0E572FDD http://people.redhat.com/dledford Infiniband specific RPMs available at http://people.redhat.com/dledford/Infiniband signature.asc Description: OpenPGP digital signature
Re: [Patch opensm] Allow for easily configuring multiple fabrics on one opensm server
On 3/2/2012 10:31 AM, Doug Ledford wrote: > On 3/2/2012 5:30 AM, Alex Netes wrote: >> What the default opensm.conf would be used for? Just as a reference to the >> default values? > > No, he's referring to having a default config file that is parsed, then > an override config file that is parsed where you only put options you > want to update in the override config file. That way you could have, > for instance, a default opensm.conf in the normal location and totally > unedited so that it gets updated with each update of the opensm rpm, > then you could create an opensm.conf.1 that is empty except for just a > guid setting, a subnet_prefix setting, maybe a cache dir setting, etc. > In that way, if say the default routing engine gets a new option in the > future, your override config file won't already be populated with the > old stuff. It's a means of inheritance that is functionally identical > to specifying all this stuff on the command line, but doesn't require a > huge command line or a complex init script. And for what it's worth, it could be as simply done as the attached (untested, but compiled) patch. -- Doug Ledford GPG KeyID: 0E572FDD http://people.redhat.com/dledford Infiniband specific RPMs available at http://people.redhat.com/dledford/Infiniband diff -up opensm-3.3.13/opensm/main.c.config opensm-3.3.13/opensm/main.c --- opensm-3.3.13/opensm/main.c.config 2012-03-02 10:35:26.783996345 -0500 +++ opensm-3.3.13/opensm/main.c 2012-03-02 10:46:33.471939369 -0500 @@ -131,6 +131,13 @@ static void show_usage(void) " The name of the OpenSM config file. When not specified\n" " " OSM_DEFAULT_CONFIG_FILE " will be used (if exists).\n\n"); + printf("--extra-config, -E \n" + " The name of an OpenSM config file used to over ride\n" + " the entries in the primary config file. This is\n" + " useful when you have more than one opensm instance\n" + " to manage and you want them all to have a central,\n" + " shared set of options and you want a second, smaller\n" + " config file to hold their fabric specific options.\n\n"); printf("--create-config, -c \n" " OpenSM will dump its configuration to the specified file and exit.\n" " This is a way to generate OpenSM configuration file template.\n\n"); @@ -569,10 +576,10 @@ int main(int argc, char *argv[]) boolean_t run_once_flag = FALSE; int32_t vendor_debug = 0; int next_option; - char *conf_template = NULL, *config_file = NULL; + char *conf_template, *config_file, *extra_config_file; uint32_t val; const char *const short_option = - "F:c:i:w:O:f:ed:D:g:l:L:s:t:a:u:m:X:R:zM:U:S:P:Y:ANBIQvVhoryxp:n:q:k:C:G:H:"; + "F:E:c:i:w:O:f:ed:D:g:l:L:s:t:a:u:m:X:R:zM:U:S:P:Y:ANBIQvVhoryxp:n:q:k:C:G:H:"; /* In the array below, the 2nd parameter specifies the number @@ -584,6 +591,7 @@ int main(int argc, char *argv[]) const struct option long_option[] = { {"version", 0, NULL, 12}, {"config", 1, NULL, 'F'}, + {"extra-config", 1, NULL, 'E'}, {"create-config", 1, NULL, 'c'}, {"debug", 1, NULL, 'd'}, {"guid", 1, NULL, 'g'}, @@ -647,6 +655,7 @@ int main(int argc, char *argv[]) {"torus_config", 1, NULL, 10}, {NULL, 0, NULL, 0} /* Required at the end of the array */ }; + conf_template = config_file = extra_config_file = NULL; /* force stdout to be line-buffered */ setvbuf(stdout, NULL, _IOLBF, BUFSIZ); @@ -672,6 +681,11 @@ int main(int argc, char *argv[]) config_file = optarg; printf("Config file is `%s`:\n", config_file); break; + case 'E': + extra_config_file = optarg; + printf("Extra Config file is `%s`:\n", + extra_config_file); + break; default: break; } @@ -687,6 +701,11 @@ int main(int argc, char *argv[]) if (osm_subn_parse_conf_file(config_file, &opt) < 0) printf("\nFail to parse config file \'%s\'\n", config_file); + if (extra_config_file) + if (osm_subn_parse_conf_file(extra_config_file, &opt) < 0) + printf("\nFailed to parse extra config file `%s`\n", + extra_config_file); + printf("Command Line Arguments:\n"); do { next_option = getopt_long_only(argc, argv, short_option, signature.asc Description: OpenPGP digital signature
Re: [Patch opensm] Allow for easily configuring multiple fabrics on one opensm server
- Original Message - > On 3/1/2012 8:31 AM, Doug Ledford wrote: > > I would say we are simply getting to the point where we *know* we > > need > > opensm to handle more than one fabric from a single instance ;-) > > Why does a single OpenSM need to handle multiple subnets/fabrics ? > What's the issue with running multiple OpenSMs with each handling > it's > own subnet/fabric ? Because it wasn't built with the idea in mind of sharing logs or cache directories, etc. Not to mention that the proliferation of files when you start multiple instances of opensm is pretty insane (have you counted how many different files opensm can be configured to require now a days...). -- Doug Ledford GPG KeyID: 0E572FDD http://people.redhat.com/dledford -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Patch opensm] Allow for easily configuring multiple fabrics on one opensm server
- Original Message - > On 3/5/2012 10:28 AM, Doug Ledford wrote: > > - Original Message - > >> On 3/1/2012 8:31 AM, Doug Ledford wrote: > >>> I would say we are simply getting to the point where we *know* we > >>> need > >>> opensm to handle more than one fabric from a single instance ;-) > >> > >> Why does a single OpenSM need to handle multiple subnets/fabrics ? > >> What's the issue with running multiple OpenSMs with each handling > >> it's > >> own subnet/fabric ? > > > > Because it wasn't built with the idea in mind of sharing logs or > > cache > > directories, etc. > > Yes, each instance has it's own configuration and output files. > > > Not to mention that the proliferation of files when > > you start multiple instances of opensm is pretty insane (have you > > counted how many different files opensm can be configured to > > require > > now a days...). > > I'm not disagreeing with the fact that there are numerous config > files > (the main config file is already large enough without the myriad of > features with their own separate config files) but different subnets > have totally different configuration requirements. It's not just the > subnet prefix, SM priority, and the few other config items that you > identified. In some cases, and in other cases the different subnets are perfect mirrors of each other (redundant, identical fabrics). > So what would make this better in your mind ? If I got to wave a magic wand, I would say it's time that opensm management started getting simpler, not more complex. With all of the various options for routing engines and QoS and partitions, it's gotten to where you need the equivalent of the old Cisco CNA in order to configure opensm ;-) -- Doug Ledford GPG KeyID: 0E572FDD http://people.redhat.com/dledford -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC] Proposal to change Node Description naming scheme for HCA's
I'm fine with this change. If people agree, I'll "make it so" in our products. Sent from my ASUS Eee Pad Ira Weiny wrote: First, a question: what package installs the openibd script in OFED? For the life of me I can't find this script in 1.5.4.1 or 3.2 ... :-/ [*] Right now the "standard" for node description is, AFAIK, " HCA-", where num is simply a counter for the HCA's as they are found in /sys/class/infiniband. The problem is resolving this "random" HCA number to an actual HCA on the host. I thought about including the node description in ibstat but that seems a bit short sighted. I think the better solution would be to append the hca name (ie mlx4_X, qibX, etc) to the hostname for the Node Description. Hacking the RHEL start up script is really easy to do this and results in nice names on the fabric which are easily resolved by the infiniband-diags, ibverbs, and perftest utilities. bash-4.1# ibhosts Ca : 0x0002c90300325280 ports 1 "ending mlx4_2" Ca : 0x00117579da38 ports 1 "happy qib0" Ca : 0x0002c90300108f2e ports 1 "ending mlx4_1" Ca : 0x00117577d90e ports 2 "ending qib0" Ca : 0x0002c903004bebda ports 2 "happy mlx4_0" bash-4.1# hostname happy bash-4.1# ibstat mlx4_0 CA 'mlx4_0' CA type: MT26428 Number of ports: 2 Firmware version: 2.8.600 ... bash-4.1# ibv_rc_pingpong -d mlx4_0 local address: LID 0x0008, QPN 0x16004a, PSN 0x2e8316 ... bash-4.1# rdma_bw -d mlx4_0 6089: | port=18515 | ib_port=1 | size=65536 | tx_depth=100 | sl=0 | iters=1000 | duplex=0 | cma=0 | ... I realize this is really a distro thing but it would be nice if we could agree to change the current "standard". I can send a patch for RHEL and OFED (if I someone can point me to the openibd script or srpm). Thoughts? Ira [*] Last I knew openibd does the same as RHEL's rdma start up script in this regard. -- Ira Weiny Member of Technical Staff Lawrence Livermore National Lab 925-423-8008 wei...@llnl.gov -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: ibacm fixes/updates
On 04/02/2012 04:49 PM, Hefty, Sean wrote: >> By default ibacm expects to find its configuration files in /etc/ibacm. >> This adds to the proliferation of directories in /etc/ needlessly. We >> already have a number of RDMA related directories to choose from >> depending on your install (OFED == /etc/ofed or /etc/openib in the old >> days, RHEL5 == /etc/openib from the old days, RHEL6 and Fedora are >> /etc/rdma, don't know what SuSE uses). Would be best if these files >> were in the same place as the other RDMA related files. And to make >> that happen easily, it would be best if they picked up on %{sysconfdir} >> from the ./configure invocation in order to set the directory. I used a >> static patch for now because I'm behind on my work, so updating the >> configure script was more than I had time to do. I'm not bothering to >> include my patch as it needs done the other way so my patch is purely a >> stopgap that should not see the time of day ;-) > > I've moved the config files to /etc/rdma. I'm still trying to figure out how > to make them use %{sysconfdir}, so I can avoid hard coding any paths. > > Note that the config files are NOT needed by default. If the option config > file (acm_opts.cfg) is not found, the ibacm service will simply use the > default settings. The default values are the same as defined in the sample > option file. > > The acm_addr.cfg file included with the package is a sample file only. The > address file should be generated by running the ib_acme program on the same > node. Copying the acm_addr.cfg file included with the package will not work, > since the addresses must be unique per node. If no address config file is > found, the ibacm service will automatically call ib_acme to generate one. > > So... I think you want to remove the following install lines from your > makefile.am patch. > > +install-exec-hook: > + if ! test -d $(DESTDIR)$(sysconfdir); then \ > + mkdir -p $(DESTDIR)$(sysconfdir); \ > + fi; \ > + install -m 644 acm_opts.cfg $(DESTDIR)$(sysconfdir)/acm_opts.cfg; \ > + install -m 644 acm_addr.cfg $(DESTDIR)$(sysconfdir)/acm_addr.cfg; OK, will do. -- Doug Ledford GPG KeyID: 0E572FDD http://people.redhat.com/dledford signature.asc Description: OpenPGP digital signature
Re: [PATCH] ibacm: Fixes to ACM package to support distros
On 04/03/2012 03:10 PM, Hefty, Sean wrote: > Set of changes to fixup the ibacm package for inclusion into RedHat 6. > Changes are based on feedback from Doug Ledford . > These are primarily changes to the build files, along with name changes > to the man pages and sample configuration files. > > Rename the ib_acm service to match the package name, ibacm. > > Rename the ibacm configuration files to use the prefix 'ibacm' instead > of 'acm'. The new sample files are 'ibacm_addr.cfg' and 'ibacm_opts.cfg'. > > Move location of ACM lock and configuration files and ibacm.pid > files. They are currently in non-standard locations. > > Modify ibacm and ib_acme to use $sysconfdir and $bindir configure > values. The ibacm_addr.cfg and ibacm_opt.cfg files will now be > read/written to $sysconfdir/rdma by default. And ibacm will execute > $bindir/ib_acme if it needs to create the ibacm_addr.cfg file. Without > $bindir, the ibacm service can fail to launch ib_acme when started > from an init script. > > Add init script as part of install. The init script is installed > into $sysconfdir/init.d. > > Fixup man pages based on changes. > > Signed-off-by: Doug Ledford > Signed-off-by: Sean Hefty > --- > Makefile.am| 30 + > acm_addr.cfg | 24 -- > acm_opts.cfg | 130 > > ibacm.init | 102 > ibacm.spec.in | 68 ++--- > ibacm_addr.cfg | 24 ++ > ibacm_opts.cfg | 130 > > linux/osd.h| 12 - > man/ib_acm.1 | 130 > > man/ib_acm.7 | 33 -- > man/ib_acme.1 | 14 +++--- > man/ibacm.1| 130 > > man/ibacm.7| 31 + > src/acm.c | 17 --- > src/acme.c | 16 +++ > 15 files changed, 523 insertions(+), 368 deletions(-) > delete mode 100644 acm_addr.cfg > delete mode 100644 acm_opts.cfg > create mode 100644 ibacm.init > create mode 100644 ibacm_addr.cfg > create mode 100644 ibacm_opts.cfg > delete mode 100644 man/ib_acm.1 > delete mode 100644 man/ib_acm.7 > create mode 100644 man/ibacm.1 > create mode 100644 man/ibacm.7 Thanks for this Sean. A few last tweaks inline below: > diff --git a/Makefile.am b/Makefile.am > index 503ad72..58e3415 100644 > --- a/Makefile.am > +++ b/Makefile.am > @@ -1,12 +1,12 @@ > INCLUDES = -I$(srcdir)/include -I$(srcdir)/linux > > -AM_CFLAGS = -g -Wall -D_GNU_SOURCE > +AM_CFLAGS = -g -Wall -D_GNU_SOURCE -DSYSCONFDIR=\"${sysconfdir}\" > -DBINDIR=\"$(bindir)\" OK, so you're making is so that the --sysconfdir option to configure works now, and works via always passing sysconfdir on the build command line (versus the other option which is to put it into config.h). Cool. This is a fine solution IMO. > -EXTRA_DIST = src/acm_mad.h src/libacm.h \ > - linux/osd.h linux/dlist.h ibacm.spec.in $(man_MANS) acm_opts.cfg \ > - acm_addr.cfg > +EXTRA_DIST = src/acm_mad.h src/libacm.h ibacm.init \ > + linux/osd.h linux/dlist.h ibacm.spec.in $(man_MANS) ibacm_opts.cfg > \ > + ibacm_addr.cfg > + > +install-exec-hook: > + if ! test -d $(DESTDIR)$(sysconfdir); then \ > + mkdir -p $(DESTDIR)$(sysconfdir); \ > + fi; \ > + if ! test -d $(DESTDIR)$(sysconfdir)/rdma; then \ > + mkdir -p $(DESTDIR)$(sysconfdir)/rdma; \ > + fi; \ > + if ! test -d $(DESTDIR)$(sysconfdir)/init.d; then \ > + mkdir -p $(DESTDIR)$(sysconfdir)/init.d; \ > + fi; \ > + install -m 755 ibacm.init $(DESTDIR)$(sysconfdir)/init.d/ibacmd; I'm curious why you didn't just do install -D -m 755 ibacm.init $(DESTDIR)$(sysconfdir)/init.d/ibacmd instead of all the individual calls to mkdir. Also, a minor nit, but along the lines of the whole naming thing, I would name the init script ibacm since the program is ibacm. Usually when the program name is used to name the init script, they match exactly (when possible, but it's a minor issue so I wouldn't change it once the package has been in use for a while, but that's not the case here, it's a new package, so we can do that little cleanup with no cost up until the package gets in wide spread use). > +PATH=$PATH:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin I'm not very keen on including /usr/local in the search path on system binaries. It can be done, but we would patch it out (and I imagine SuSE wo
Re: [RFC] Proposal to change Node Description naming scheme for HCA's
On 04/04/2012 01:12 PM, Jason Gunthorpe wrote: > On Tue, Apr 03, 2012 at 07:14:42PM -0700, Ira Weiny wrote: >> On Mon, 2 Apr 2012 12:45:45 -0600 >> Jason Gunthorpe wrote: >> >>> On Mon, Apr 02, 2012 at 03:27:35PM +, Heinz, Michael William wrote: >>> >>>> Any ideas on how we could solve the hostname problem while we're >>>> changing the description? >>> >>> The node description needs to be set from the DCHP notifier script >>> chain (eg /etc/network/if-up.d/ on Debian) and also from a udev rule >>> triggered on device insertion. >> >> Jason, I'm confused. Do you mean DHCP? > > Yes.. > >> It seems are you indicating you would like to see if[up/down] bring >> ports up/down like they do for IP? > > No.. > >> On my ubuntu box (the closest thing I have to Debian) it looks like >> ifupdown owns /etc/network/if-up.d and that is not specific to DHCP. >> I don't think DHCP should be required for IB. > > if-up.d/ is a '.d' directory, the idea is individual packages drop > their script files in the directory and the system runs them at > defined times. No package owns the directory. > > The purpose of placing a hook here is this call path is used when the > hostname changes via DHCP so you can have a chance to reset the node > description. > >> Using udev to set this __seems__ like a better idea although I have >> not prototyped it. > > The purpose of the udev hook path is to set the node description on > initial device insertion, which may be before, or after the DHCP > process completes, in such cases. Well, a udev rule is guaranteed to be before dhcp completes *on that device*. It might be that dhcp has completed on another device and that the other device is actually where the hostname is pulled, but the udev rule will always be before this given device in the rule has completed dhcp. > It may also be before or after the openibd script.. Non hot-plug events are going to end up always being before the openibd script. This is merely a side effect of the fact that udev start is called before the normal init script processing is started (this all changes somewhat in the systemd world though). > Having init.d > scripts depend on the ordering of hardware discovery and module > loading is considered sketchy these days with all the parallel boot > fancyness and what not. I would agree with that. However... > IMHO we should have a udev rule that also loads the higher level IB > modules when any RDMA capable device is discovered, including the mlx4 > IB layer, uverbs, umad, etc. This way systems that have RDMA will load > the right modules and systems that don't, won't. Fully supporting hot > plug, of course. One of the niceties of the rdma/openibd init script is the ability to completely reload the stack. That goes away if we switch to a udev load of the stack (well, unless you now create a script in /sbin and have the udev rule call that script, but the script should no longer be in the initrddir if it's not an init script). > This would broadly eliminate the openibd script, integrate more > correctly with the modern distro world, be better prepared for systemd > and just be an overall better example for distros to follow :) This is mostly true, but you do have to sacrifice the one item I listed above. >>> It should probably not be set from the openibd script.. >> >> I agree there might be better ways but I am not sure I follow your >> proposal. Furthermore, I don't know that a start up script of some >> sort is all that evil. >> >> Finally, I think Michael brings up a good point about which package >> should own any such scripts. > > udev is like if-up.d/, there is a rules directory packages can drop > hook scripts into that run at the appropriate time. Correct. On Red Hat Enterprise Linux I could have the rdma package drop in an /sbin/rdma script that would bring the stack up (and possibly reload it, but I'm a bit sketchy on that idea given that this would no longer be an init script but something else), have it drop a small script in /etc/dhcp/dhclient.d/ to set the node description after dhcp completes (the script in /sbin would also have to set the node description from whatever information is available on load just in case the machine doesn't even use dhcp though), have it drop a rules file in /etc/udev.d/rules.d for bringing up the stack on device discovery (this one is a bit tricky though, you basically have to match against all possible RDMA devices and bring up the stack on the presence of any one of them, and your script that you call to bring the stack up needs to be safe against parallel invocations), and then also
Re: [PATCH] ibacm: Fixes to ACM package to support distros
On 04/05/2012 03:10 PM, Hefty, Sean wrote: > I've incorporated most of the changes, but see below. > >>> -EXTRA_DIST = src/acm_mad.h src/libacm.h \ - linux/osd.h >>> linux/dlist.h ibacm.spec.in $(man_MANS) acm_opts.cfg \ - >>> acm_addr.cfg +EXTRA_DIST = src/acm_mad.h src/libacm.h ibacm.init >>> \ + linux/osd.h linux/dlist.h ibacm.spec.in $(man_MANS) >>> ibacm_opts.cfg >> \ >>> +ibacm_addr.cfg + +install-exec-hook: + if ! test -d >>> $(DESTDIR)$(sysconfdir); then \ + mkdir -p >>> $(DESTDIR)$(sysconfdir); \ +fi; \ + if ! test -d >>> $(DESTDIR)$(sysconfdir)/rdma; then \ + mkdir -p >>> $(DESTDIR)$(sysconfdir)/rdma; \ + fi; \ + if ! test -d >>> $(DESTDIR)$(sysconfdir)/init.d; then \ +mkdir -p >>> $(DESTDIR)$(sysconfdir)/init.d; \ + fi; \ + install -m 755 >>> ibacm.init $(DESTDIR)$(sysconfdir)/init.d/ibacmd; >> >> I'm curious why you didn't just do install -D -m 755 ibacm.init >> $(DESTDIR)$(sysconfdir)/init.d/ibacmd instead of all the >> individual calls to mkdir. > > ignorance on the existence of the '-D' option :) > >>> +PATH=$PATH:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin >> >> >>> I'm not very keen on including /usr/local in the search path on system >> binaries. It can be done, but we would patch it out (and I imagine >> SuSE would too). It's a support issue. Having /usr/local override >> the system installed binaries means that you can end up with >> strange things happening, and support scratching their head and >> going "what the hell is causing that" and in the end it's because a >> different binary than the one RPM installed is actually running. > > I create a .tar.gz package using 'make dist', copy it to another > system, then install it using 'configure && make install'. When I do > that, sysconfdir defaults to /usr/local/etc, sbindir /usr/local/sbin, > and bindir to /usr/local/bin. I added /usr/local to PATH, so that > the init script would work. Otherwise I get a 'command not found' > error when running the script. On a side note, my ibacm > configuration files end up in /usr/local/etc/rdma. configure --prefix=/usr --sysconfdir=/etc && make && make install > To say that I'm not sure what autotools and the init scripts are > doing is an understatement... The init script finds ibacm in > /usr/sbin, Which means some time it got installed there, either by an rpm or something else. > but not /usr/local/sbin where it ended up being installed. > I'm not sure how to fix that. Do an actual rpm build and install from the rpm. Or use the command I listed above to get the configure directories correct (configure defaults to /usr/local). But it's generally best to have your files tracked by rpm. It makes upgrades and uninstalls work much more smoothly. And the rpmbuild program and the macros in the spec file processing get all your directories correct for you too. Assuming you have an rpm based system that you are running from of course. In that case, it's fairly simple to do something like the following: make dist rpmbuild --rebuild rpm -Uvh ~/rpmbuild/RPMS// You might need this in your ~/.rpmmacros file: %_topdir %(echo $HOME)/rpmbuild -- Doug Ledford GPG KeyID: 0E572FDD http://people.redhat.com/dledford signature.asc Description: OpenPGP digital signature
Re: [PATCH] ibacm: Fixes to ACM package to support distros
On 04/05/2012 04:40 PM, Jason Gunthorpe wrote: > On Thu, Apr 05, 2012 at 07:10:25PM +, Hefty, Sean wrote: >>>> +PATH=$PATH:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin >>> >>> I'm not very keen on including /usr/local in the search path on system > >> To say that I'm not sure what autotools and the init scripts are >> doing is an understatement... The init script finds ibacm in >> /usr/sbin, but not /usr/local/sbin where it ended up being >> installed. I'm not sure how to fix that. > > In this instance you need to have configure variable substitute your > init script as well. You should not manipulate PATH from an init.d > script, and you should use an absolute path to the daemon to avoid > contamination from the invoking user's environment. > > So you want: > > %prefix$/bin/IBACM_BIN -foo > > (use start-stop-daemon if appropriate for RHEL, I'm not sure what > their conventions are...) > > There might be a better choice than %prefix% for this, check the > config.sub output for a better substitution. > > Also, these days it would be very forward-thinking of you to > contemplate how this will work in a systemd world. Make sure it can be > demand started, have a daemon that has a running mode that is systemd > compatible, etc. I think Doug was saying that the latest FC has IB and > systemd together? The latest FC has IB and systemd, but IB still uses SysV init scripts and systemd does the emulation to start IB. It's on my list of things ToDo to get it converted to native systemd. My seemingly never ending ToDo list :-/ -- Doug Ledford GPG KeyID: 0E572FDD http://people.redhat.com/dledford signature.asc Description: OpenPGP digital signature
Re: [PATCH v2] ibacm: Fixes to ACM package to support distros
On 04/06/2012 01:12 PM, Hefty, Sean wrote: > ibacm: Fixes to ACM package to support distros > > From: Sean Hefty > > Set of changes to fixup the ibacm package for inclusion into RedHat 6. > Changes are based on feedback from Doug Ledford . > These are primarily changes to the build files, along with name changes > to the man pages and sample configuration files. > > Rename the ib_acm service to match the package name, ibacm. > > Rename the ibacm configuration files to use the prefix 'ibacm' instead > of 'acm'. The new sample files are 'ibacm_addr.cfg' and 'ibacm_opts.cfg'. > > Move location of ACM lock and configuration files and ibacm.pid > files. They are currently in non-standard locations. > > Modify ibacm and ib_acme to use $sysconfdir, $bindir, and rdmadir > configure values. The ibacm_addr.cfg and ibacm_opt.cfg files will now be > read/written to $sysconfdir/$rdmadir by default, with rdmadir defaulting > to 'rdma' if not specified.. And ibacm will execute > $bindir/ib_acme if it needs to create the ibacm_addr.cfg file. Without > $bindir, the ibacm service can fail to launch ib_acme when started > from an init script. > > Add init script as part of install. The init script is installed > into $sysconfdir/init.d. The init script is processed by configure, > so that it executes the correct ibacm service that was installed. > > Fixup man pages based on changes. > > Signed-off-by: Doug Ledford > Signed-off-by: Sean Hefty > --- > Changes from v1: > > Removed mkdir calls from Makefile and replaced them with -D install option. > > Removed setting PATH in init script. Init script is now processed by > configure > to handle the case where the daemon is not installed in /usr/sbin. > Renamed init script to match package / daemon name. > > User can now specify the directory for configuration files by setting > rdmadir= when running configure. By default rdmadir=rdma. > rdmadir is passed to the compiler as a define. > > Fixed other minor issues, like name changes in man pages. This looks good enough to me Sean. Ack to the version 2 patch set. -- Doug Ledford GPG KeyID: 0E572FDD http://people.redhat.com/dledford signature.asc Description: OpenPGP digital signature
[PATCH 0/9] IB/ipoib: fixup multicast locking issues
This is the re-ordered, squashed version of my 22 patch set that I posted on Feb 11. There are a few minor differences between that set and this one. They are: 1) Rename __ipoib_mcast_continue_join_thread to __ipoib_mcast_schedule_join_thread 2) Make __ipoib_mcast_schedule_join_thread cancel any delayed work to avoid us accidentally trying to queue the single work struct instance twice (which doesn't work) 3) Slight alter layout of __ipoib_mcast_schedule_join_thread. Logic is the same modulo #2, but indenting is reduced and readability increased 4) Switch a few instances of FLAG_ADMIN_UP to FLAG_OPER_UP 5) Add a couple missing spinlocks so that we always call the schedule helper with the spinlock held 6) Make sure that we only clear the BUSY flag once we have done all the other things we are going to do to the mcast entry, and if possible, only call complete after we have released the spinlock 7) Fix the usage of time_before_eq when we should have just used time_before in ipoib_mcast_join_task 8) Create/destroy priv->wq in a slightly different point of ipoib_transport_dev_init/ipoib_transport_dev_cleanup This entire patchset was intended to address the issue of ipoib interfaces being brought up/down in a tight loop, which will hardlock a standard v3.19 kernel. It succeeds at resolving that problem. In order to be sure this patchset does not introduce other problems, and in order to ensure that this rework of the patches into a new set does not break bisectability, this entire patchset has been extensively tested, starting with the first patch and going through the last. I used a 12 machine group plus the subnet manager to test these patches. 1 machine ran ifconfig up/ifconfig down in a tight loop tests 1 machine ran rmmod/insmod ib_ipoib in a loop with a 10 second pause between insmod and rmmod 1 machine ran rmmod/insmod ib_ipoib in a tight loop with only a .1 second pause between insmod and rmmod 9 machines that kept their interfaces up and ran iperf servers, 6 also ran ping6 instances to the addresses of all 12 machines, 3 ran iperf clients that sent data to all 9 iperf servers in an infinite loop 1 subnet manager machine that otherwise did not participate, but during testing was set to restart opensm once every 30 seconds to force net re-register events on all 12 machines in the group In addition to the configuration of various machines above to test data transfers, the IPoIB infrastructure itself contained several elements designed to test specific multicast capabilities. The primary P_Key, the one with the ping6 instances running on it, intentionally had some well known multicast groups not defined in order to intentionally cause failed sendonly multicast joins on the same device that needed to work with IPv6 pings as well as IPv4 multicast. One of the alternate P_Key interfaces was defined with a minimum rate of 56GBit/s, so all machines without 56GBit/s capability were unable to ever join the broadcast group on these P_Keys. This was done to make sure that when the broadcast group is not joined, no other multicast joins, sendonly or otherwise, are ever sent. It also was done to make sure that failed attempts to join the broadcast group honored the backoff delays properly. Note: both machines that were doing the insmod/rmmod loops were changed to not have any P_Key interfaces defined other than the default P_Key interface. It is known that repeated insmod/rmmod of the ib_ipoib interface is fragile and easily breaks in the presence of child interfaces. It was not my intent to address that particular problem with this patch set and so to avoid false issues, children interfaces were removed from the mix on these machines. A wide array of hardware was also tested with this 12 machine group, covering mthca, mlx4, mlx5, and qib hardware. Patches 1 through 6 were tested without the ifconfig/rmmod/opensm loops as those particular problems were not expected to be addressed until patch 7. Pathes 7 through 9 were tested with all tests. The final, complete patch set was left running with the various tests until it had completed 257 opensm restarts, 12052 ifconfig up/ifconfig down loops, 765 10 second insmod/rmmod loops, and 1971 .1 second insmod/rmmod loops. The only observed problem was that the fast insmod/rmmod loop eventually locked up the network stack on the machine. It was stuck on a rtnl_lock deadlock, but not one related to the multicast code (and therefore outside the scope of these patches to address). There are several bits of additional locking to be fixed in the overall ipoib code in relation to insmod/rmmod races and this patch set does not attempt to address those. It merely attempts not to introduce any new issues while resolving the mcast locking issues related to bringing the interface up and down. I feel confident that it does that. Doug Ledford (9): IB/ipoib: factor out ah flushing IB/ipoib: change init
[PATCH 5/9] IB/ipoib: Use dedicated workqueues per interface
During my recent work on the rtnl lock deadlock in the IPoIB driver, I saw that even once I fixed the apparent races for a single device, as soon as that device had any children, new races popped up. It turns out that this is because no matter how well we protect against races on a single device, the fact that all devices use the same workqueue, and flush_workqueue() flushes *everything* from that workqueue means that we would also have to prevent all races between different devices (for instance, ipoib_mcast_restart_task on interface ib0 can race with ipoib_mcast_flush_dev on interface ib0.8002, resulting in a deadlock on the rtnl_lock). There are several possible solutions to this problem: Make carrier_on_task and mcast_restart_task try to take the rtnl for some set period of time and if they fail, then bail. This runs the real risk of dropping work on the floor, which can end up being its own separate kind of deadlock. Set some global flag in the driver that says some device is in the middle of going down, letting all tasks know to bail. Again, this can drop work on the floor. Or the method this patch attempts to use, which is when we bring an interface up, create a workqueue specifically for that interface, so that when we take it back down, we are flushing only those tasks associated with our interface. In addition, keep the global workqueue, but now limit it to only flush tasks. In this way, the flush tasks can always flush the device specific work queues without having deadlock issues. Signed-off-by: Doug Ledford --- drivers/infiniband/ulp/ipoib/ipoib.h | 1 + drivers/infiniband/ulp/ipoib/ipoib_cm.c| 18 +++ drivers/infiniband/ulp/ipoib/ipoib_ib.c| 6 ++--- drivers/infiniband/ulp/ipoib/ipoib_main.c | 28 +-- drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 20 - drivers/infiniband/ulp/ipoib/ipoib_verbs.c | 31 +++--- 6 files changed, 66 insertions(+), 38 deletions(-) diff --git a/drivers/infiniband/ulp/ipoib/ipoib.h b/drivers/infiniband/ulp/ipoib/ipoib.h index d7562beb542..e940cd9f847 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib.h +++ b/drivers/infiniband/ulp/ipoib/ipoib.h @@ -317,6 +317,7 @@ struct ipoib_dev_priv { struct list_head multicast_list; struct rb_root multicast_tree; + struct workqueue_struct *wq; struct delayed_work mcast_task; struct work_struct carrier_on_task; struct work_struct flush_light; diff --git a/drivers/infiniband/ulp/ipoib/ipoib_cm.c b/drivers/infiniband/ulp/ipoib/ipoib_cm.c index 933efcea0d0..56959adb6c7 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_cm.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_cm.c @@ -474,7 +474,7 @@ static int ipoib_cm_req_handler(struct ib_cm_id *cm_id, struct ib_cm_event *even } spin_lock_irq(&priv->lock); - queue_delayed_work(ipoib_workqueue, + queue_delayed_work(priv->wq, &priv->cm.stale_task, IPOIB_CM_RX_DELAY); /* Add this entry to passive ids list head, but do not re-add it * if IB_EVENT_QP_LAST_WQE_REACHED has moved it to flush list. */ @@ -576,7 +576,7 @@ void ipoib_cm_handle_rx_wc(struct net_device *dev, struct ib_wc *wc) spin_lock_irqsave(&priv->lock, flags); list_splice_init(&priv->cm.rx_drain_list, &priv->cm.rx_reap_list); ipoib_cm_start_rx_drain(priv); - queue_work(ipoib_workqueue, &priv->cm.rx_reap_task); + queue_work(priv->wq, &priv->cm.rx_reap_task); spin_unlock_irqrestore(&priv->lock, flags); } else ipoib_warn(priv, "cm recv completion event with wrid %d (> %d)\n", @@ -603,7 +603,7 @@ void ipoib_cm_handle_rx_wc(struct net_device *dev, struct ib_wc *wc) spin_lock_irqsave(&priv->lock, flags); list_move(&p->list, &priv->cm.rx_reap_list); spin_unlock_irqrestore(&priv->lock, flags); - queue_work(ipoib_workqueue, &priv->cm.rx_reap_task); + queue_work(priv->wq, &priv->cm.rx_reap_task); } return; } @@ -827,7 +827,7 @@ void ipoib_cm_handle_tx_wc(struct net_device *dev, struct ib_wc *wc) if (test_and_clear_bit(IPOIB_FLAG_INITIALIZED, &tx->flags)) { list_move(&tx->list, &priv->cm.reap_list); - queue_work(ipoib_workqueue, &priv->cm.reap_task); + queue_work(priv->wq, &priv->cm.reap_task); } clear_bit(IPOIB_FLAG_OPER_UP, &tx
[PATCH 4/9] IB/ipoib: Make the carrier_on_task race aware
We blindly assume that we can just take the rtnl lock and that will prevent races with downing this interface. Unfortunately, that's not the case. In ipoib_mcast_stop_thread() we will call flush_workqueue() in an attempt to clear out all remaining instances of ipoib_join_task. But, since this task is put on the same workqueue as the join task, the flush_workqueue waits on this thread too. But this thread is deadlocked on the rtnl lock. The better thing here is to use trylock and loop on that until we either get the lock or we see that FLAG_OPER_UP has been cleared, in which case we don't need to do anything anyway and we just return. While investigating which flag should be used, FLAG_ADMIN_UP or FLAG_OPER_UP, it was determined that FLAG_OPER_UP was the more appropriate flag to use. However, there was a mix of these two flags in use in the existing code. So while we check for that flag here as part of this race fix, also cleanup the two places that had used the less appropriate flag for their tests. Signed-off-by: Doug Ledford --- drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 25 + 1 file changed, 17 insertions(+), 8 deletions(-) diff --git a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c index eee66d13e5b..c63a598d0b4 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c @@ -353,18 +353,27 @@ void ipoib_mcast_carrier_on_task(struct work_struct *work) carrier_on_task); struct ib_port_attr attr; - /* -* Take rtnl_lock to avoid racing with ipoib_stop() and -* turning the carrier back on while a device is being -* removed. -*/ if (ib_query_port(priv->ca, priv->port, &attr) || attr.state != IB_PORT_ACTIVE) { ipoib_dbg(priv, "Keeping carrier off until IB port is active\n"); return; } - rtnl_lock(); + /* +* Take rtnl_lock to avoid racing with ipoib_stop() and +* turning the carrier back on while a device is being +* removed. However, ipoib_stop() will attempt to flush +* the workqueue while holding the rtnl lock, so loop +* on trylock until either we get the lock or we see +* FLAG_OPER_UP go away as that signals that we are bailing +* and can safely ignore the carrier on work. +*/ + while (!rtnl_trylock()) { + if (!test_bit(IPOIB_FLAG_OPER_UP, &priv->flags)) + return; + else + msleep(20); + } if (!ipoib_cm_admin_enabled(priv->dev)) dev_set_mtu(priv->dev, min(priv->mcast_mtu, priv->admin_mtu)); netif_carrier_on(priv->dev); @@ -535,7 +544,7 @@ void ipoib_mcast_join_task(struct work_struct *work) if (!priv->broadcast) { struct ipoib_mcast *broadcast; - if (!test_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags)) + if (!test_bit(IPOIB_FLAG_OPER_UP, &priv->flags)) return; broadcast = ipoib_mcast_alloc(dev, 1); @@ -882,7 +891,7 @@ void ipoib_mcast_restart_task(struct work_struct *work) ipoib_mcast_free(mcast); } - if (test_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags)) + if (test_bit(IPOIB_FLAG_OPER_UP, &priv->flags)) ipoib_mcast_start_thread(dev); } -- 2.1.0 -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 2/9] IB/ipoib: change init sequence ordering
In preparation for using per device work queues, we need to move the start of the neighbor thread task to after ipoib_ib_dev_init and move the destruction of the neighbor task to before ipoib_ib_dev_cleanup. Otherwise we will end up freeing our workqueue with work possibly still on it. Signed-off-by: Doug Ledford --- drivers/infiniband/ulp/ipoib/ipoib_main.c | 24 +--- 1 file changed, 17 insertions(+), 7 deletions(-) diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c b/drivers/infiniband/ulp/ipoib/ipoib_main.c index 58b5aa3b6f2..002ff0da9fa 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_main.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c @@ -1262,15 +1262,13 @@ int ipoib_dev_init(struct net_device *dev, struct ib_device *ca, int port) { struct ipoib_dev_priv *priv = netdev_priv(dev); - if (ipoib_neigh_hash_init(priv) < 0) - goto out; /* Allocate RX/TX "rings" to hold queued skbs */ priv->rx_ring = kzalloc(ipoib_recvq_size * sizeof *priv->rx_ring, GFP_KERNEL); if (!priv->rx_ring) { printk(KERN_WARNING "%s: failed to allocate RX ring (%d entries)\n", ca->name, ipoib_recvq_size); - goto out_neigh_hash_cleanup; + goto out; } priv->tx_ring = vzalloc(ipoib_sendq_size * sizeof *priv->tx_ring); @@ -1285,16 +1283,24 @@ int ipoib_dev_init(struct net_device *dev, struct ib_device *ca, int port) if (ipoib_ib_dev_init(dev, ca, port)) goto out_tx_ring_cleanup; + /* +* Must be after ipoib_ib_dev_init so we can allocate a per +* device wq there and use it here +*/ + if (ipoib_neigh_hash_init(priv) < 0) + goto out_dev_uninit; + return 0; +out_dev_uninit: + ipoib_ib_dev_cleanup(dev); + out_tx_ring_cleanup: vfree(priv->tx_ring); out_rx_ring_cleanup: kfree(priv->rx_ring); -out_neigh_hash_cleanup: - ipoib_neigh_hash_uninit(dev); out: return -ENOMEM; } @@ -1317,6 +1323,12 @@ void ipoib_dev_cleanup(struct net_device *dev) } unregister_netdevice_many(&head); + /* +* Must be before ipoib_ib_dev_cleanup or we delete an in use +* work queue +*/ + ipoib_neigh_hash_uninit(dev); + ipoib_ib_dev_cleanup(dev); kfree(priv->rx_ring); @@ -1324,8 +1336,6 @@ void ipoib_dev_cleanup(struct net_device *dev) priv->rx_ring = NULL; priv->tx_ring = NULL; - - ipoib_neigh_hash_uninit(dev); } static const struct header_ops ipoib_header_ops = { -- 2.1.0 -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 3/9] IB/ipoib: Consolidate rtnl_lock tasks in workqueue
The ipoib_mcast_flush_dev routine is called with the rtnl_lock held and needs to keep it held. It also needs to call flush_workqueue() to flush out any outstanding work. In the past, we've had to try and make sure that we didn't flush out any outstanding join completions because they also wanted to grab rtnl_lock() and that would deadlock. It turns out that the only thing in the join completion handler that needs this lock can be safely moved to our carrier_on_task, thereby reducing the potential for the join completion code and the flush code to deadlock against each other. Signed-off-by: Doug Ledford --- drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 8 ++-- 1 file changed, 2 insertions(+), 6 deletions(-) diff --git a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c index ffb83b5f7e8..eee66d13e5b 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c @@ -190,12 +190,6 @@ static int ipoib_mcast_join_finish(struct ipoib_mcast *mcast, spin_unlock_irq(&priv->lock); priv->tx_wr.wr.ud.remote_qkey = priv->qkey; set_qkey = 1; - - if (!ipoib_cm_admin_enabled(dev)) { - rtnl_lock(); - dev_set_mtu(dev, min(priv->mcast_mtu, priv->admin_mtu)); - rtnl_unlock(); - } } if (!test_bit(IPOIB_MCAST_FLAG_SENDONLY, &mcast->flags)) { @@ -371,6 +365,8 @@ void ipoib_mcast_carrier_on_task(struct work_struct *work) } rtnl_lock(); + if (!ipoib_cm_admin_enabled(priv->dev)) + dev_set_mtu(priv->dev, min(priv->mcast_mtu, priv->admin_mtu)); netif_carrier_on(priv->dev); rtnl_unlock(); } -- 2.1.0 -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 1/9] IB/ipoib: factor out ah flushing
Create a an ipoib_flush_ah and ipoib_stop_ah routines to use at appropriate times to flush out all remaining ah entries before we shut the device down. Because neighbors and mcast entries can each have a reference on any given ah, we must make sure to free all of those first before our ah will actually have a 0 refcount and be able to be reaped. This factoring is needed in preparation for having per-device work queues. The original per-device workqueue code resulted in the following error message: : ib_dealloc_pd failed That error was tracked down to this issue. With the changes to which workqueues were flushed when, there were no flushes of the per device workqueue after the last ah's were freed, resulting in an attempt to dealloc the pd with outstanding resources still allocated. This code puts the explicit flushes in the needed places to avoid that problem. Signed-off-by: Doug Ledford --- drivers/infiniband/ulp/ipoib/ipoib_ib.c | 46 - 1 file changed, 28 insertions(+), 18 deletions(-) diff --git a/drivers/infiniband/ulp/ipoib/ipoib_ib.c b/drivers/infiniband/ulp/ipoib/ipoib_ib.c index 72626c34817..cb02466a0eb 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_ib.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_ib.c @@ -659,6 +659,24 @@ void ipoib_reap_ah(struct work_struct *work) round_jiffies_relative(HZ)); } +static void ipoib_flush_ah(struct net_device *dev, int flush) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + + cancel_delayed_work(&priv->ah_reap_task); + if (flush) + flush_workqueue(ipoib_workqueue); + ipoib_reap_ah(&priv->ah_reap_task.work); +} + +static void ipoib_stop_ah(struct net_device *dev, int flush) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + + set_bit(IPOIB_STOP_REAPER, &priv->flags); + ipoib_flush_ah(dev, flush); +} + static void ipoib_ib_tx_timer_func(unsigned long ctx) { drain_tx_cq((struct net_device *)ctx); @@ -877,24 +895,7 @@ timeout: if (ib_modify_qp(priv->qp, &qp_attr, IB_QP_STATE)) ipoib_warn(priv, "Failed to modify QP to RESET state\n"); - /* Wait for all AHs to be reaped */ - set_bit(IPOIB_STOP_REAPER, &priv->flags); - cancel_delayed_work(&priv->ah_reap_task); - if (flush) - flush_workqueue(ipoib_workqueue); - - begin = jiffies; - - while (!list_empty(&priv->dead_ahs)) { - __ipoib_reap_ah(dev); - - if (time_after(jiffies, begin + HZ)) { - ipoib_warn(priv, "timing out; will leak address handles\n"); - break; - } - - msleep(1); - } + ipoib_flush_ah(dev, flush); ib_req_notify_cq(priv->recv_cq, IB_CQ_NEXT_COMP); @@ -1037,6 +1038,7 @@ static void __ipoib_ib_dev_flush(struct ipoib_dev_priv *priv, if (level == IPOIB_FLUSH_LIGHT) { ipoib_mark_paths_invalid(dev); ipoib_mcast_dev_flush(dev); + ipoib_flush_ah(dev, 0); } if (level >= IPOIB_FLUSH_NORMAL) @@ -1100,6 +1102,14 @@ void ipoib_ib_dev_cleanup(struct net_device *dev) ipoib_mcast_stop_thread(dev, 1); ipoib_mcast_dev_flush(dev); + /* +* All of our ah references aren't free until after +* ipoib_mcast_dev_flush(), ipoib_flush_paths, and +* the neighbor garbage collection is stopped and reaped. +* That should all be done now, so make a final ah flush. +*/ + ipoib_stop_ah(dev, 1); + ipoib_transport_dev_cleanup(dev); } -- 2.1.0 -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 6/9] IB/ipoib: No longer use flush as a parameter
Various places in the IPoIB code had a deadlock related to flushing the ipoib workqueue. Now that we have per device workqueues and a specific flush workqueue, there is no longer a deadlock issue with flushing the device specific workqueues and we can do so unilaterally. Signed-off-by: Doug Ledford --- drivers/infiniband/ulp/ipoib/ipoib.h | 8 +++--- drivers/infiniband/ulp/ipoib/ipoib_ib.c| 35 +- drivers/infiniband/ulp/ipoib/ipoib_main.c | 8 +++--- drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 18 ++--- 4 files changed, 39 insertions(+), 30 deletions(-) diff --git a/drivers/infiniband/ulp/ipoib/ipoib.h b/drivers/infiniband/ulp/ipoib/ipoib.h index e940cd9f847..9ef432ae72e 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib.h +++ b/drivers/infiniband/ulp/ipoib/ipoib.h @@ -478,10 +478,10 @@ void ipoib_ib_dev_flush_heavy(struct work_struct *work); void ipoib_pkey_event(struct work_struct *work); void ipoib_ib_dev_cleanup(struct net_device *dev); -int ipoib_ib_dev_open(struct net_device *dev, int flush); +int ipoib_ib_dev_open(struct net_device *dev); int ipoib_ib_dev_up(struct net_device *dev); -int ipoib_ib_dev_down(struct net_device *dev, int flush); -int ipoib_ib_dev_stop(struct net_device *dev, int flush); +int ipoib_ib_dev_down(struct net_device *dev); +int ipoib_ib_dev_stop(struct net_device *dev); void ipoib_pkey_dev_check_presence(struct net_device *dev); int ipoib_dev_init(struct net_device *dev, struct ib_device *ca, int port); @@ -493,7 +493,7 @@ void ipoib_mcast_send(struct net_device *dev, u8 *daddr, struct sk_buff *skb); void ipoib_mcast_restart_task(struct work_struct *work); int ipoib_mcast_start_thread(struct net_device *dev); -int ipoib_mcast_stop_thread(struct net_device *dev, int flush); +int ipoib_mcast_stop_thread(struct net_device *dev); void ipoib_mcast_dev_down(struct net_device *dev); void ipoib_mcast_dev_flush(struct net_device *dev); diff --git a/drivers/infiniband/ulp/ipoib/ipoib_ib.c b/drivers/infiniband/ulp/ipoib/ipoib_ib.c index 2a56b7a11a9..e144d07d53c 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_ib.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_ib.c @@ -659,22 +659,21 @@ void ipoib_reap_ah(struct work_struct *work) round_jiffies_relative(HZ)); } -static void ipoib_flush_ah(struct net_device *dev, int flush) +static void ipoib_flush_ah(struct net_device *dev) { struct ipoib_dev_priv *priv = netdev_priv(dev); cancel_delayed_work(&priv->ah_reap_task); - if (flush) - flush_workqueue(priv->wq); + flush_workqueue(priv->wq); ipoib_reap_ah(&priv->ah_reap_task.work); } -static void ipoib_stop_ah(struct net_device *dev, int flush) +static void ipoib_stop_ah(struct net_device *dev) { struct ipoib_dev_priv *priv = netdev_priv(dev); set_bit(IPOIB_STOP_REAPER, &priv->flags); - ipoib_flush_ah(dev, flush); + ipoib_flush_ah(dev); } static void ipoib_ib_tx_timer_func(unsigned long ctx) @@ -682,7 +681,7 @@ static void ipoib_ib_tx_timer_func(unsigned long ctx) drain_tx_cq((struct net_device *)ctx); } -int ipoib_ib_dev_open(struct net_device *dev, int flush) +int ipoib_ib_dev_open(struct net_device *dev) { struct ipoib_dev_priv *priv = netdev_priv(dev); int ret; @@ -724,7 +723,7 @@ int ipoib_ib_dev_open(struct net_device *dev, int flush) dev_stop: if (!test_and_set_bit(IPOIB_FLAG_INITIALIZED, &priv->flags)) napi_enable(&priv->napi); - ipoib_ib_dev_stop(dev, flush); + ipoib_ib_dev_stop(dev); return -1; } @@ -756,7 +755,7 @@ int ipoib_ib_dev_up(struct net_device *dev) return ipoib_mcast_start_thread(dev); } -int ipoib_ib_dev_down(struct net_device *dev, int flush) +int ipoib_ib_dev_down(struct net_device *dev) { struct ipoib_dev_priv *priv = netdev_priv(dev); @@ -765,7 +764,7 @@ int ipoib_ib_dev_down(struct net_device *dev, int flush) clear_bit(IPOIB_FLAG_OPER_UP, &priv->flags); netif_carrier_off(dev); - ipoib_mcast_stop_thread(dev, flush); + ipoib_mcast_stop_thread(dev); ipoib_mcast_dev_flush(dev); ipoib_flush_paths(dev); @@ -825,7 +824,7 @@ void ipoib_drain_cq(struct net_device *dev) local_bh_enable(); } -int ipoib_ib_dev_stop(struct net_device *dev, int flush) +int ipoib_ib_dev_stop(struct net_device *dev) { struct ipoib_dev_priv *priv = netdev_priv(dev); struct ib_qp_attr qp_attr; @@ -895,7 +894,7 @@ timeout: if (ib_modify_qp(priv->qp, &qp_attr, IB_QP_STATE)) ipoib_warn(priv, "Failed to modify QP to RESET state\n"); - ipoib_flush_ah(dev, flush); + ipoib_flush_ah(dev); ib_req_notify_cq(priv->recv_cq, IB_CQ_NEXT_COMP); @@ -919,7 +918,7 @@ int ipoib_ib_dev_init(struct net_device *dev, struct ib_device *
[PATCH 7/9] IB/ipoib: fix MCAST_FLAG_BUSY usage
Commit a9c8ba5884 ("IPoIB: Fix usage of uninitialized multicast objects") added a new flag MCAST_JOIN_STARTED, but was not very strict in how it was used. We didn't always initialize the completion struct before we set the flag, and we didn't always call complete on the completion struct from all paths that complete it. And when we did complete it, sometimes we continued to touch the mcast entry after the completion, opening us up to possible use after free issues. This made it less than totally effective, and certainly made its use confusing. And in the flush function we would use the presence of this flag to signal that we should wait on the completion struct, but we never cleared this flag, ever. In order to make things clearer and aid in resolving the rtnl deadlock bug I've been chasing, I cleaned this up a bit. 1) Remove the MCAST_JOIN_STARTED flag entirely 2) Change MCAST_FLAG_BUSY so it now only means a join is in-flight 3) Test mcast->mc directly to see if we have completed ib_sa_join_multicast (using IS_ERR_OR_NULL) 4) Make sure that before setting MCAST_FLAG_BUSY we always initialize the mcast->done completion struct 5) Make sure that before calling complete(&mcast->done), we always clear the MCAST_FLAG_BUSY bit 6) Take the mcast_mutex before we call ib_sa_multicast_join and also take the mutex in our join callback. This forces ib_sa_multicast_join to return and set mcast->mc before we process the callback. This way, our callback can safely clear mcast->mc if there is an error on the join and we will do the right thing as a result in mcast_dev_flush. 7) Because we need the mutex to synchronize mcast->mc, we can no longer call mcast_sendonly_join directly from mcast_send and instead must add sendonly join processing to the mcast_join_task 8) Make MCAST_RUN mean that we have a working mcast subsystem, not that we have a running task. We know when we need to reschedule our join task thread and don't need a flag to tell us. 9) Add a helper for rescheduling the join task thread A number of different races are resolved with these changes. These races existed with the old MCAST_FLAG_BUSY usage, the MCAST_JOIN_STARTED flag was an attempt to address them, and while it helped, a determined effort could still trip things up. One race looks something like this: Thread 1 Thread 2 ib_sa_join_multicast (as part of running restart mcast task) alloc member call callback ifconfig ib0 down wait_for_completion callback call completes wait_for_completion in mcast_dev_flush completes mcast->mc is PTR_ERR_OR_NULL so we skip ib_sa_leave_multicast return from callback return from ib_sa_join_multicast set mcast->mc = return from ib_sa_multicast We now have a permanently unbalanced join/leave issue that trips up the refcounting in core/multicast.c Another like this: Thread 1 Thread 2 Thread 3 ib_sa_multicast_join ifconfig ib0 down priv->broadcast = NULL join_complete wait_for_completion mcast->mc is not yet set, so don't clear return from ib_sa_join_multicast and set mcast->mc complete return -EAGAIN (making mcast->mc invalid) call ib_sa_multicast_leave on invalid mcast->mc, hang forever By holding the mutex around ib_sa_multicast_join and taking the mutex early in the callback, we force mcast->mc to be valid at the time we run the callback. This allows us to clear mcast->mc if there is an error and the join is going to fail. We do this before we complete the mcast. In this way, mcast_dev_flush always sees consistent state in regards to mcast->mc membership at the time that the wait_for_completion() returns. Signed-off-by: Doug Ledford --- drivers/infiniband/ulp/ipoib/ipoib.h | 11 +- drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 355 - 2 files changed, 238 insertions(+), 128 deletions(-) diff --git a/drivers/infiniband/ulp/ipoib/ipoib.h b/drivers/infiniband/ulp/ipoib/ipoib.h index 9ef432ae72e..c79dcd5ee8a 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib.h +++ b/drivers/infiniband/ulp/ipoib/ipoib.h @@ -98,9 +98,15 @@ enum { IPOIB_MCAST_FLAG_FOUND= 0, /* used in set_multicast_list */ IPOIB_MCAST_FLAG_SENDONLY = 1, - IP
[PATCH 8/9] IB/ipoib: deserialize multicast joins
Allow the ipoib layer to attempt to join all outstanding multicast groups at once. The ib_sa layer will serialize multiple attempts to join the same group, but will process attempts to join different groups in parallel. Take advantage of that. In order to make this happen, change the mcast_join_thread to loop through all needed joins, sending a join request for each one that we still need to join. There are a few special cases we handle though: 1) Don't attempt to join anything but the broadcast group until the join of the broadcast group has succeeded. 2) No longer restart the join task at the end of completion handling. If we completed successfully, we are done. The join task now needs kicked either by mcast_send or mcast_restart_task or mcast_start_thread, but should not need started anytime else except when scheduling a backoff attempt to rejoin. 3) No longer use separate join/completion routines for regular and sendonly joins, pass them all through the same routine and just do the right thing based on the SENDONLY join flag. 4) Only try to join a SENDONLY join twice, then drop the packets and quit trying. We leave the mcast group in the list so that if we get a new packet, all that we have to do is queue up the packet and restart the join task and it will automatically try to join twice and then either send or flush the queue again. Signed-off-by: Doug Ledford --- drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 250 - 1 file changed, 82 insertions(+), 168 deletions(-) diff --git a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c index 277e7ac7c4d..c670d9c2cda 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c @@ -307,111 +307,6 @@ static int ipoib_mcast_join_finish(struct ipoib_mcast *mcast, return 0; } -static int -ipoib_mcast_sendonly_join_complete(int status, - struct ib_sa_multicast *multicast) -{ - struct ipoib_mcast *mcast = multicast->context; - struct net_device *dev = mcast->dev; - struct ipoib_dev_priv *priv = netdev_priv(dev); - - /* -* We have to take the mutex to force mcast_sendonly_join to -* return from ib_sa_multicast_join and set mcast->mc to a -* valid value. Otherwise we were racing with ourselves in -* that we might fail here, but get a valid return from -* ib_sa_multicast_join after we had cleared mcast->mc here, -* resulting in mis-matched joins and leaves and a deadlock -*/ - mutex_lock(&mcast_mutex); - - /* We trap for port events ourselves. */ - if (status == -ENETRESET) { - status = 0; - goto out; - } - - if (!status) - status = ipoib_mcast_join_finish(mcast, &multicast->rec); - - if (status) { - if (mcast->logcount++ < 20) - ipoib_dbg_mcast(netdev_priv(dev), "sendonly multicast " - "join failed for %pI6, status %d\n", - mcast->mcmember.mgid.raw, status); - - /* Flush out any queued packets */ - netif_tx_lock_bh(dev); - while (!skb_queue_empty(&mcast->pkt_queue)) { - ++dev->stats.tx_dropped; - dev_kfree_skb_any(skb_dequeue(&mcast->pkt_queue)); - } - netif_tx_unlock_bh(dev); - __ipoib_mcast_schedule_join_thread(priv, mcast, 1); - } else { - mcast->backoff = 1; - mcast->delay_until = jiffies; - __ipoib_mcast_schedule_join_thread(priv, NULL, 0); - } -out: - clear_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags); - if (status) - mcast->mc = NULL; - complete(&mcast->done); - mutex_unlock(&mcast_mutex); - return status; -} - -static int ipoib_mcast_sendonly_join(struct ipoib_mcast *mcast) -{ - struct net_device *dev = mcast->dev; - struct ipoib_dev_priv *priv = netdev_priv(dev); - struct ib_sa_mcmember_rec rec = { -#if 0 /* Some SMs don't support send-only yet */ - .join_state = 4 -#else - .join_state = 1 -#endif - }; - int ret = 0; - - if (!test_bit(IPOIB_FLAG_OPER_UP, &priv->flags)) { - ipoib_dbg_mcast(priv, "device shutting down, no sendonly " - "multicast joins\n"); - clear_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags); - complete(&mcast->done); - return -ENODEV; - } - - rec.mgid = mcast->mcmember.mgid; - rec.port_gid = priv->local_
[PATCH 9/9] IB/ipoib: drop mcast_mutex usage
We needed the mcast_mutex when we had to prevent the join completion callback from having the value it stored in mcast->mc overwritten by a delayed return from ib_sa_join_multicast. By storing the return of ib_sa_join_multicast in an intermediate variable, we prevent a delayed return from ib_sa_join_multicast overwriting the valid contents of mcast->mc, and we no longer need a mutex to force the join callback to run after the return of ib_sa_join_multicast. This allows us to do away with the mutex entirely and protect our critical sections with a just a spinlock instead. This is highly desirable as there were some places where we couldn't use a mutex because the code was not allowed to sleep, and so we were currently using a mix of mutex and spinlock to protect what we needed to protect. Now we only have a spin lock and the locking complexity is greatly reduced. Signed-off-by: Doug Ledford --- drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 70 -- 1 file changed, 32 insertions(+), 38 deletions(-) diff --git a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c index c670d9c2cda..3203ebe9b10 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c @@ -55,8 +55,6 @@ MODULE_PARM_DESC(mcast_debug_level, "Enable multicast debug tracing if > 0"); #endif -static DEFINE_MUTEX(mcast_mutex); - struct ipoib_mcast_iter { struct net_device *dev; union ib_gid mgid; @@ -67,7 +65,7 @@ struct ipoib_mcast_iter { }; /* - * This should be called with the mcast_mutex held + * This should be called with the priv->lock held */ static void __ipoib_mcast_schedule_join_thread(struct ipoib_dev_priv *priv, struct ipoib_mcast *mcast, @@ -352,16 +350,6 @@ static int ipoib_mcast_join_complete(int status, "sendonly " : "", mcast->mcmember.mgid.raw, status); - /* -* We have to take the mutex to force mcast_join to -* return from ib_sa_multicast_join and set mcast->mc to a -* valid value. Otherwise we were racing with ourselves in -* that we might fail here, but get a valid return from -* ib_sa_multicast_join after we had cleared mcast->mc here, -* resulting in mis-matched joins and leaves and a deadlock -*/ - mutex_lock(&mcast_mutex); - /* We trap for port events ourselves. */ if (status == -ENETRESET) { status = 0; @@ -383,8 +371,10 @@ static int ipoib_mcast_join_complete(int status, * send out all of the non-broadcast joins */ if (mcast == priv->broadcast) { + spin_lock_irq(&priv->lock); queue_work(priv->wq, &priv->carrier_on_task); __ipoib_mcast_schedule_join_thread(priv, NULL, 0); + goto out_locked; } } else { if (mcast->logcount++ < 20) { @@ -417,16 +407,28 @@ static int ipoib_mcast_join_complete(int status, dev_kfree_skb_any(skb_dequeue(&mcast->pkt_queue)); } netif_tx_unlock_bh(dev); - } else + } else { + spin_lock_irq(&priv->lock); /* Requeue this join task with a backoff delay */ __ipoib_mcast_schedule_join_thread(priv, mcast, 1); + goto out_locked; + } } out: - clear_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags); + spin_lock_irq(&priv->lock); +out_locked: + /* +* Make sure to set mcast->mc before we clear the busy flag to avoid +* racing with code that checks for BUSY before checking mcast->mc +*/ if (status) mcast->mc = NULL; + else + mcast->mc = multicast; + clear_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags); + spin_unlock_irq(&priv->lock); complete(&mcast->done); - mutex_unlock(&mcast_mutex); + return status; } @@ -434,6 +436,7 @@ static void ipoib_mcast_join(struct net_device *dev, struct ipoib_mcast *mcast, int create) { struct ipoib_dev_priv *priv = netdev_priv(dev); + struct ib_sa_multicast *multicast; struct ib_sa_mcmember_rec rec = { .join_state = 1 }; @@ -475,18 +478,19 @@ static void ipoib_mcast_join(struct net_device *dev, struct ipoib_mcast *mcast, rec.hop_limit = priv->broadcast->mcmember.hop_limit; } - mutex_lock(&mcast_mutex);
Re: [PATCH 0/9] IB/ipoib: fixup multicast locking issues
On Sun, 2015-02-22 at 23:34 +0200, Or Gerlitz wrote: > On Sun, Feb 22, 2015 at 2:26 AM, Doug Ledford wrote: > > This is the re-ordered, squashed version of my 22 patch set that I > > posted on Feb 11. There are a few minor differences between that > > set and this one. > > Hi Doug, > > I took quick look on your git repo @ > git://github.com/dledford/linux.git and it seems not to contain this > series, can you please get that there and tell what branch to pull? It's there now, branch for-3.20-squashed. > Or. > > > They are: > > 1) Rename __ipoib_mcast_continue_join_thread to > >__ipoib_mcast_schedule_join_thread > > 2) Make __ipoib_mcast_schedule_join_thread cancel any delayed work to > >avoid us accidentally trying to queue the single work struct instance > >twice (which doesn't work) > > 3) Slight alter layout of __ipoib_mcast_schedule_join_thread. Logic > >is the same modulo #2, but indenting is reduced and readability > >increased > > 4) Switch a few instances of FLAG_ADMIN_UP to FLAG_OPER_UP > > 5) Add a couple missing spinlocks so that we always call the schedule > >helper with the spinlock held > > 6) Make sure that we only clear the BUSY flag once we have done all the > >other things we are going to do to the mcast entry, and if possible, > >only call complete after we have released the spinlock > > 7) Fix the usage of time_before_eq when we should have just used > >time_before in ipoib_mcast_join_task > > 8) Create/destroy priv->wq in a slightly different point of > >ipoib_transport_dev_init/ipoib_transport_dev_cleanup > > > > This entire patchset was intended to address the issue of ipoib > > interfaces being brought up/down in a tight loop, which will hardlock > > a standard v3.19 kernel. It succeeds at resolving that problem. In > > order to be sure this patchset does not introduce other problems, > > and in order to ensure that this rework of the patches into a new > > set does not break bisectability, this entire patchset has been > > extensively tested, starting with the first patch and going through > > the last. > > > > I used a 12 machine group plus the subnet manager to test these > > patches. > > > > 1 machine ran ifconfig up/ifconfig down in a tight loop tests > > 1 machine ran rmmod/insmod ib_ipoib in a loop with a 10 second pause > > between insmod and rmmod > > 1 machine ran rmmod/insmod ib_ipoib in a tight loop with only a .1 > > second pause between insmod and rmmod > > 9 machines that kept their interfaces up and ran iperf servers, 6 also > > ran ping6 instances to the addresses of all 12 machines, 3 ran iperf > > clients that sent data to all 9 iperf servers in an infinite loop > > 1 subnet manager machine that otherwise did not participate, but > > during testing was set to restart opensm once every 30 seconds to > > force net re-register events on all 12 machines in the group > > > > In addition to the configuration of various machines above to test > > data transfers, the IPoIB infrastructure itself contained several > > elements designed to test specific multicast capabilities. > > > > The primary P_Key, the one with the ping6 instances running on it, > > intentionally had some well known multicast groups not defined in > > order to intentionally cause failed sendonly multicast joins on > > the same device that needed to work with IPv6 pings as well as > > IPv4 multicast. > > > > One of the alternate P_Key interfaces was defined with a minimum > > rate of 56GBit/s, so all machines without 56GBit/s capability > > were unable to ever join the broadcast group on these P_Keys. > > This was done to make sure that when the broadcast group is not > > joined, no other multicast joins, sendonly or otherwise, are ever > > sent. It also was done to make sure that failed attempts to join > > the broadcast group honored the backoff delays properly. > > > > Note: both machines that were doing the insmod/rmmod loops were > > changed to not have any P_Key interfaces defined other than the > > default P_Key interface. It is known that repeated insmod/rmmod > > of the ib_ipoib interface is fragile and easily breaks in the > > presence of child interfaces. It was not my intent to address > > that particular problem with this patch set and so to avoid false > > issues, children interfaces were removed from the mix on these > > machines. > > > > A wide array of hardware was also tested with this 12 machine group, > > covering mthca, mlx4, mlx5,
Re: [PATCH 0/9] IB/ipoib: fixup multicast locking issues
On Sun, 2015-02-22 at 16:56 -0500, Doug Ledford wrote: > On Sun, 2015-02-22 at 23:34 +0200, Or Gerlitz wrote: > > On Sun, Feb 22, 2015 at 2:26 AM, Doug Ledford wrote: > > > This is the re-ordered, squashed version of my 22 patch set that I > > > posted on Feb 11. There are a few minor differences between that > > > set and this one. > > > > Hi Doug, > > > > I took quick look on your git repo @ > > git://github.com/dledford/linux.git and it seems not to contain this > > series, can you please get that there and tell what branch to pull? > > It's there now, branch for-3.20-squashed. Also, git diff for-3.20..for-3.20-squashed will show the exact differences between this 9 patch set and the previous 22 patch set. > > Or. > > > > > They are: > > > 1) Rename __ipoib_mcast_continue_join_thread to > > >__ipoib_mcast_schedule_join_thread > > > 2) Make __ipoib_mcast_schedule_join_thread cancel any delayed work to > > >avoid us accidentally trying to queue the single work struct instance > > >twice (which doesn't work) > > > 3) Slight alter layout of __ipoib_mcast_schedule_join_thread. Logic > > >is the same modulo #2, but indenting is reduced and readability > > >increased > > > 4) Switch a few instances of FLAG_ADMIN_UP to FLAG_OPER_UP > > > 5) Add a couple missing spinlocks so that we always call the schedule > > >helper with the spinlock held > > > 6) Make sure that we only clear the BUSY flag once we have done all the > > >other things we are going to do to the mcast entry, and if possible, > > >only call complete after we have released the spinlock > > > 7) Fix the usage of time_before_eq when we should have just used > > >time_before in ipoib_mcast_join_task > > > 8) Create/destroy priv->wq in a slightly different point of > > >ipoib_transport_dev_init/ipoib_transport_dev_cleanup > > > > > > This entire patchset was intended to address the issue of ipoib > > > interfaces being brought up/down in a tight loop, which will hardlock > > > a standard v3.19 kernel. It succeeds at resolving that problem. In > > > order to be sure this patchset does not introduce other problems, > > > and in order to ensure that this rework of the patches into a new > > > set does not break bisectability, this entire patchset has been > > > extensively tested, starting with the first patch and going through > > > the last. > > > > > > I used a 12 machine group plus the subnet manager to test these > > > patches. > > > > > > 1 machine ran ifconfig up/ifconfig down in a tight loop tests > > > 1 machine ran rmmod/insmod ib_ipoib in a loop with a 10 second pause > > > between insmod and rmmod > > > 1 machine ran rmmod/insmod ib_ipoib in a tight loop with only a .1 > > > second pause between insmod and rmmod > > > 9 machines that kept their interfaces up and ran iperf servers, 6 also > > > ran ping6 instances to the addresses of all 12 machines, 3 ran iperf > > > clients that sent data to all 9 iperf servers in an infinite loop > > > 1 subnet manager machine that otherwise did not participate, but > > > during testing was set to restart opensm once every 30 seconds to > > > force net re-register events on all 12 machines in the group > > > > > > In addition to the configuration of various machines above to test > > > data transfers, the IPoIB infrastructure itself contained several > > > elements designed to test specific multicast capabilities. > > > > > > The primary P_Key, the one with the ping6 instances running on it, > > > intentionally had some well known multicast groups not defined in > > > order to intentionally cause failed sendonly multicast joins on > > > the same device that needed to work with IPv6 pings as well as > > > IPv4 multicast. > > > > > > One of the alternate P_Key interfaces was defined with a minimum > > > rate of 56GBit/s, so all machines without 56GBit/s capability > > > were unable to ever join the broadcast group on these P_Keys. > > > This was done to make sure that when the broadcast group is not > > > joined, no other multicast joins, sendonly or otherwise, are ever > > > sent. It also was done to make sure that failed attempts to join > > > the broadcast group honored the backoff delays properly. > > > > > > Note: both machines that were doing the insmod/rmmod loops were > &g
Re: [PATCH 9/9] IB/ipoib: drop mcast_mutex usage
On Mon, 2015-02-23 at 18:56 +0200, Or Gerlitz wrote: > On Sun, Feb 22, 2015 at 2:27 AM, Doug Ledford wrote: > > We needed the mcast_mutex when we had to prevent the join completion > > callback from having the value it stored in mcast->mc overwritten > > downstream patches of this series (7/9 and 8/9) make pretty much heavy > usage of the mcast_mutex (e.g add/delete lines that use it), and patch > 9/9 removes it altogether.. which would be very confusing for > maintaining purposes. Is there a sane way to avoid that?! No. The changes that make dropping the mutex possible are part of patch 7. Patch 7 changes the semantics of the MCAST_FLAG_BUSY usage, and fixes some locking bugs, but that's different than wholesale changing of the locking type. If you want to preserve bisecability and be able to test the semantic changes to the FLAG_BUSY usage separate from the changes to the locking type, then they have to be separated. So, for the sake of good engineering practices and separation of distinctly different types of changes, that locking change should not be folded into patch 7. -- Doug Ledford GPG KeyID: 0E572FDD signature.asc Description: This is a digitally signed message part
Re: [PATCH v4 06/19] IB/core: Add max_mad_size to ib_device_attr
props->max_total_mcast_qp_attach = 0; > props->max_map_per_fmr = 0; > + props->max_mad_size = IB_MGMT_MAD_SIZE; > /* Owned by Userspace >* max_qp_wr, max_sge, max_sge_rd, max_cqe */ > mutex_unlock(&us_ibdev->usdev_lock); > diff --git a/include/rdma/ib_mad.h b/include/rdma/ib_mad.h > index 9c89939..5823016 100644 > --- a/include/rdma/ib_mad.h > +++ b/include/rdma/ib_mad.h > @@ -135,6 +135,7 @@ enum { > IB_MGMT_SA_DATA = 200, > IB_MGMT_DEVICE_HDR = 64, > IB_MGMT_DEVICE_DATA = 192, > + IB_MGMT_MAD_SIZE = IB_MGMT_MAD_HDR + IB_MGMT_MAD_DATA, > }; > > struct ib_mad_hdr { > diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h > index 0116e4b..64d3479 100644 > --- a/include/rdma/ib_verbs.h > +++ b/include/rdma/ib_verbs.h > @@ -210,6 +210,7 @@ struct ib_device_attr { > int sig_prot_cap; > int sig_guard_cap; > struct ib_odp_caps odp_caps; > + u32 max_mad_size; > }; > > enum ib_mtu { -- Doug Ledford GPG KeyID: 0E572FDD signature.asc Description: This is a digitally signed message part
Re: [PATCH v4 07/19] IB/mad: Convert ib_mad_private allocations from kmem_cache to kmalloc
kfree(recv); > } > > qp_info->recv_queue.count = 0; > @@ -3138,45 +3145,25 @@ static struct ib_client mad_client = { > > static int __init ib_mad_init_module(void) > { > - int ret; > - > mad_recvq_size = min(mad_recvq_size, IB_MAD_QP_MAX_SIZE); > mad_recvq_size = max(mad_recvq_size, IB_MAD_QP_MIN_SIZE); > > mad_sendq_size = min(mad_sendq_size, IB_MAD_QP_MAX_SIZE); > mad_sendq_size = max(mad_sendq_size, IB_MAD_QP_MIN_SIZE); > > - ib_mad_cache = kmem_cache_create("ib_mad", > - sizeof(struct ib_mad_private), > - 0, > - SLAB_HWCACHE_ALIGN, > - NULL); > - if (!ib_mad_cache) { > - pr_err("Couldn't create ib_mad cache\n"); > - ret = -ENOMEM; > - goto error1; > - } > - > INIT_LIST_HEAD(&ib_mad_port_list); > > if (ib_register_client(&mad_client)) { > pr_err("Couldn't register ib_mad client\n"); > - ret = -EINVAL; > - goto error2; > + return(-EINVAL); > } > > return 0; > - > -error2: > - kmem_cache_destroy(ib_mad_cache); > -error1: > - return ret; > } > > static void __exit ib_mad_cleanup_module(void) > { > ib_unregister_client(&mad_client); > - kmem_cache_destroy(ib_mad_cache); > } > > module_init(ib_mad_init_module); -- Doug Ledford GPG KeyID: 0E572FDD signature.asc Description: This is a digitally signed message part
Re: [PATCH v4 14/19] IB/core: Add IB_DEVICE_OPA_MAD_SUPPORT device cap flag
eeded in subsequent patches. It would be better to > > see > > the complete picture so it can be reviewed as a whole. > > Is there something in particular you would like to see? There are no other > patches required in the core modules for verbs applications to function. The > MTU patch only improves verbs performance. > > Ira > > > > > -- Hal > > -- > > To unsubscribe from this list: send the line "unsubscribe linux-rdma" in > > the body > > of a message to majord...@vger.kernel.org More majordomo info at > > http://vger.kernel.org/majordomo-info.html > -- > To unsubscribe from this list: send the line "unsubscribe linux-rdma" in > the body of a message to majord...@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- Doug Ledford GPG KeyID: 0E572FDD signature.asc Description: This is a digitally signed message part
Re: [PATCH 1/9] IB/ipoib: factor out ah flushing
On Thu, 2015-02-26 at 15:28 +0200, Erez Shitrit wrote: > On 2/22/2015 2:26 AM, Doug Ledford wrote: > > Create a an ipoib_flush_ah and ipoib_stop_ah routines to use at > > appropriate times to flush out all remaining ah entries before we shut > > the device down. > > > > Because neighbors and mcast entries can each have a reference on any > > given ah, we must make sure to free all of those first before our ah > > will actually have a 0 refcount and be able to be reaped. > > > > This factoring is needed in preparation for having per-device work > > queues. The original per-device workqueue code resulted in the following > > error message: > > > > : ib_dealloc_pd failed > > > > That error was tracked down to this issue. With the changes to which > > workqueues were flushed when, there were no flushes of the per device > > workqueue after the last ah's were freed, resulting in an attempt to > > dealloc the pd with outstanding resources still allocated. This code > > puts the explicit flushes in the needed places to avoid that problem. > > > > Signed-off-by: Doug Ledford > > --- > > drivers/infiniband/ulp/ipoib/ipoib_ib.c | 46 > > - > > 1 file changed, 28 insertions(+), 18 deletions(-) > > > > diff --git a/drivers/infiniband/ulp/ipoib/ipoib_ib.c > > b/drivers/infiniband/ulp/ipoib/ipoib_ib.c > > index 72626c34817..cb02466a0eb 100644 > > --- a/drivers/infiniband/ulp/ipoib/ipoib_ib.c > > +++ b/drivers/infiniband/ulp/ipoib/ipoib_ib.c > > @@ -659,6 +659,24 @@ void ipoib_reap_ah(struct work_struct *work) > >round_jiffies_relative(HZ)); > > } > > > > +static void ipoib_flush_ah(struct net_device *dev, int flush) > > +{ > > + struct ipoib_dev_priv *priv = netdev_priv(dev); > > + > > + cancel_delayed_work(&priv->ah_reap_task); > > + if (flush) > > + flush_workqueue(ipoib_workqueue); > > + ipoib_reap_ah(&priv->ah_reap_task.work); > > +} > > + > > +static void ipoib_stop_ah(struct net_device *dev, int flush) > > +{ > > + struct ipoib_dev_priv *priv = netdev_priv(dev); > > + > > + set_bit(IPOIB_STOP_REAPER, &priv->flags); > > + ipoib_flush_ah(dev, flush); > > +} > > + > > static void ipoib_ib_tx_timer_func(unsigned long ctx) > > { > > drain_tx_cq((struct net_device *)ctx); > > @@ -877,24 +895,7 @@ timeout: > > if (ib_modify_qp(priv->qp, &qp_attr, IB_QP_STATE)) > > ipoib_warn(priv, "Failed to modify QP to RESET state\n"); > > > > - /* Wait for all AHs to be reaped */ > > - set_bit(IPOIB_STOP_REAPER, &priv->flags); > > - cancel_delayed_work(&priv->ah_reap_task); > > - if (flush) > > - flush_workqueue(ipoib_workqueue); > > - > > - begin = jiffies; > > - > > - while (!list_empty(&priv->dead_ahs)) { > > - __ipoib_reap_ah(dev); > > - > > - if (time_after(jiffies, begin + HZ)) { > > - ipoib_warn(priv, "timing out; will leak address > > handles\n"); > > - break; > > - } > > - > > - msleep(1); > > - } > > + ipoib_flush_ah(dev, flush); > > > > ib_req_notify_cq(priv->recv_cq, IB_CQ_NEXT_COMP); > > > > @@ -1037,6 +1038,7 @@ static void __ipoib_ib_dev_flush(struct > > ipoib_dev_priv *priv, > > if (level == IPOIB_FLUSH_LIGHT) { > > ipoib_mark_paths_invalid(dev); > > ipoib_mcast_dev_flush(dev); > > + ipoib_flush_ah(dev, 0); > > Why do you need to call the flush function here? To remove all of the ah's that were reduced to a 0 refcount by the previous two functions prior to restarting operations. When we remove an ah, it calls ib_destroy_ah which calls all the way down into the low level driver. This was to make sure that old, stale data was removed all the way down to the card level before we started new queries for paths and ahs. > I can't see the reason to use the flush not from the stop_ah, meaning > without setting the IPOIB_STOP_REAPER, the flush can send twice the same > work. No, it can't. The ah flush routine does not search through ahs to find ones to flush. When you delete neighbors and mcasts, they release their references to ahs. When the refcount goes to 0, the put routine puts the ah on the to-be-deleted ah list. All the flush does is take that list and delete the items. If you run the flush twice,
Re: [PATCH 1/9] IB/ipoib: factor out ah flushing
On Sun, 2015-03-01 at 08:47 +0200, Erez Shitrit wrote: > On 2/26/2015 6:27 PM, Doug Ledford wrote: > > > >>> @@ -1037,6 +1038,7 @@ static void __ipoib_ib_dev_flush(struct > >>> ipoib_dev_priv *priv, > >>> if (level == IPOIB_FLUSH_LIGHT) { > >>> ipoib_mark_paths_invalid(dev); > >>> ipoib_mcast_dev_flush(dev); > >>> + ipoib_flush_ah(dev, 0); > >> Why do you need to call the flush function here? > > To remove all of the ah's that were reduced to a 0 refcount by the > > previous two functions prior to restarting operations. When we remove > > an ah, it calls ib_destroy_ah which calls all the way down into the low > > level driver. This was to make sure that old, stale data was removed > > all the way down to the card level before we started new queries for > > paths and ahs. > > Yes. but it is not needed. That depends on the card. For the modern cards (mlx4, mlx5, qib), it isn't needed but doesn't hurt either. For older cards (in particular, mthca), the driver actually frees up card resources at the time of the call. > The bug happened when the driver was removed (via modprobe -r etc.), and > there were ah's in the dead_ah list, that was fixed by you in the > function ipoib_ib_dev_cleanup, the call that you added here is not > relevant to the bug (and IMHO is not needed at all) I never said that this hunk was part of the original bug I saw before. > So, the the task of cleaning the dead_ah is already there, no need to > recall it, it will be called anyway 1 sec at the most from now. > > You can try that, take of that call, no harm or memory leak will happened. I have no doubt that it will get freed later. As I said, I never considered this particular hunk part of that original bug. But, as I point out above, there is no harm in it for any hardware, and depending on hardware it can help to make sure there isn't a shortage of resources. Given that fact, I see no reason to remove it. > >> I can't see the reason to use the flush not from the stop_ah, meaning > >> without setting the IPOIB_STOP_REAPER, the flush can send twice the same > >> work. > > No, it can't. The ah flush routine does not search through ahs to find > > ones to flush. When you delete neighbors and mcasts, they release their > > references to ahs. When the refcount goes to 0, the put routine puts > > the ah on the to-be-deleted ah list. All the flush does is take that > > list and delete the items. If you run the flush twice, the first run > > deletes all the items on the to-be-deleted list, the second run sees an > > empty list and does nothing. > > > > As for using flush versus stop: the flush function cancels any delayed > > ah_flush work so that it isn't racing with the normally scheduled > > when you call cancel_delayed_work to work that can schedule itself, it > is not help, the work can be at the middle of the run and re-schedule > itself... If it is in the middle of a run and reschedules itself, then it will schedule itself at precisely the same time we would have anyway, and we will get flushed properly, so the net result of this particular race is that we end up doing exactly what we wanted to do anyway. > > > ah_flush, then flushes the workqueue to make sure anything that might > > result in an ah getting freed is done, then flushes, then schedules a > > new delayed flush_ah work 1 second later. So, it does exactly what a > > flush should do: it removes what there is currently to remove, and in > > the case of a periodically scheduled garbage collection, schedules a new > > periodic flush at the maximum interval. > > > > It is not appropriate to call stop_ah at this point because it will > > cancel the delayed work, flush the ahs, then never reschedule the > > garbage collection. If we called stop here, we would have to call start > > later. But that's not really necessary as the flush cancels the > > scheduled work and reschedules it for a second later. > > > >>> } > >>> > >>> if (level >= IPOIB_FLUSH_NORMAL) > >>> @@ -1100,6 +1102,14 @@ void ipoib_ib_dev_cleanup(struct net_device *dev) > >>> ipoib_mcast_stop_thread(dev, 1); > >>> ipoib_mcast_dev_flush(dev); > >>> > >>> + /* > >>> + * All of our ah references aren't free until after > >>> + * ipoib_mcast_dev_flush(), ipoib_flush_paths, and > >>> + * the neighbor garbage collection is stopped and reaped. > >>> + * That
Re: [PATCH 7/9] IB/ipoib: fix MCAST_FLAG_BUSY usage
On Sun, 2015-03-01 at 11:31 +0200, Erez Shitrit wrote: > On 2/22/2015 2:27 AM, Doug Ledford wrote: > > Commit a9c8ba5884 ("IPoIB: Fix usage of uninitialized multicast > > objects") added a new flag MCAST_JOIN_STARTED, but was not very strict > > in how it was used. We didn't always initialize the completion struct > > before we set the flag, and we didn't always call complete on the > > completion struct from all paths that complete it. And when we did > > complete it, sometimes we continued to touch the mcast entry after > > the completion, opening us up to possible use after free issues. > > > > This made it less than totally effective, and certainly made its use > > confusing. And in the flush function we would use the presence of this > > flag to signal that we should wait on the completion struct, but we never > > cleared this flag, ever. > > > > In order to make things clearer and aid in resolving the rtnl deadlock > > bug I've been chasing, I cleaned this up a bit. > > > > 1) Remove the MCAST_JOIN_STARTED flag entirely > > 2) Change MCAST_FLAG_BUSY so it now only means a join is in-flight > > 3) Test mcast->mc directly to see if we have completed > > ib_sa_join_multicast (using IS_ERR_OR_NULL) > > 4) Make sure that before setting MCAST_FLAG_BUSY we always initialize > > the mcast->done completion struct > > 5) Make sure that before calling complete(&mcast->done), we always clear > > the MCAST_FLAG_BUSY bit > > 6) Take the mcast_mutex before we call ib_sa_multicast_join and also > > take the mutex in our join callback. This forces > > ib_sa_multicast_join to return and set mcast->mc before we process > > the callback. This way, our callback can safely clear mcast->mc > > if there is an error on the join and we will do the right thing as > > a result in mcast_dev_flush. > > 7) Because we need the mutex to synchronize mcast->mc, we can no > > longer call mcast_sendonly_join directly from mcast_send and > > instead must add sendonly join processing to the mcast_join_task > > 8) Make MCAST_RUN mean that we have a working mcast subsystem, not that > > we have a running task. We know when we need to reschedule our > > join task thread and don't need a flag to tell us. > > 9) Add a helper for rescheduling the join task thread > > > > A number of different races are resolved with these changes. These > > races existed with the old MCAST_FLAG_BUSY usage, the > > MCAST_JOIN_STARTED flag was an attempt to address them, and while it > > helped, a determined effort could still trip things up. > > > > One race looks something like this: > > > > Thread 1 Thread 2 > > ib_sa_join_multicast (as part of running restart mcast task) > >alloc member > >call callback > > ifconfig ib0 down > > wait_for_completion > > callback call completes > > wait_for_completion in > > mcast_dev_flush completes > >mcast->mc is PTR_ERR_OR_NULL > >so we skip ib_sa_leave_multicast > > return from callback > >return from ib_sa_join_multicast > > set mcast->mc = return from ib_sa_multicast > > > > We now have a permanently unbalanced join/leave issue that trips up the > > refcounting in core/multicast.c > > > > Another like this: > > > > Thread 1 Thread 2 Thread 3 > > ib_sa_multicast_join > > ifconfig ib0 down > > priv->broadcast = NULL > > join_complete > > wait_for_completion > >mcast->mc is not yet set, so don't clear > > return from ib_sa_join_multicast and set mcast->mc > >complete > >return -EAGAIN (making mcast->mc invalid) > > call ib_sa_multicast_leave > > on invalid mcast->mc, hang > > forever > > > > By holding the mutex around ib_sa_multicast_join and taking the mutex > > early in the callback, we force mcast->mc to be valid at the time we > > run the c
Re: [PATCH 8/9] IB/ipoib: deserialize multicast joins
On Sun, 2015-03-01 at 15:58 +0200, Erez Shitrit wrote: > On 2/22/2015 2:27 AM, Doug Ledford wrote: > > Allow the ipoib layer to attempt to join all outstanding multicast > > groups at once. The ib_sa layer will serialize multiple attempts to > > join the same group, but will process attempts to join different groups > > in parallel. Take advantage of that. > > > > In order to make this happen, change the mcast_join_thread to loop > > through all needed joins, sending a join request for each one that we > > still need to join. There are a few special cases we handle though: > > > > 1) Don't attempt to join anything but the broadcast group until the join > > of the broadcast group has succeeded. > > 2) No longer restart the join task at the end of completion handling. > > If we completed successfully, we are done. The join task now needs kicked > > either by mcast_send or mcast_restart_task or mcast_start_thread, but > > should not need started anytime else except when scheduling a backoff > > attempt to rejoin. > > 3) No longer use separate join/completion routines for regular and > > sendonly joins, pass them all through the same routine and just do the > > right thing based on the SENDONLY join flag. > > 4) Only try to join a SENDONLY join twice, then drop the packets and > > quit trying. We leave the mcast group in the list so that if we get a > > new packet, all that we have to do is queue up the packet and restart > > the join task and it will automatically try to join twice and then > > either send or flush the queue again. > > > > Signed-off-by: Doug Ledford > > --- > > drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 250 > > - > > 1 file changed, 82 insertions(+), 168 deletions(-) > > > > diff --git a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c > > b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c > > index 277e7ac7c4d..c670d9c2cda 100644 > > --- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c > > +++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c > > @@ -307,111 +307,6 @@ static int ipoib_mcast_join_finish(struct ipoib_mcast > > *mcast, > > return 0; > > } > > > > -static int > > -ipoib_mcast_sendonly_join_complete(int status, > > - struct ib_sa_multicast *multicast) > > -{ > > - struct ipoib_mcast *mcast = multicast->context; > > - struct net_device *dev = mcast->dev; > > - struct ipoib_dev_priv *priv = netdev_priv(dev); > > - > > - /* > > -* We have to take the mutex to force mcast_sendonly_join to > > -* return from ib_sa_multicast_join and set mcast->mc to a > > -* valid value. Otherwise we were racing with ourselves in > > -* that we might fail here, but get a valid return from > > -* ib_sa_multicast_join after we had cleared mcast->mc here, > > -* resulting in mis-matched joins and leaves and a deadlock > > -*/ > > - mutex_lock(&mcast_mutex); > > - > > - /* We trap for port events ourselves. */ > > - if (status == -ENETRESET) { > > - status = 0; > > - goto out; > > - } > > - > > - if (!status) > > - status = ipoib_mcast_join_finish(mcast, &multicast->rec); > > - > > - if (status) { > > - if (mcast->logcount++ < 20) > > - ipoib_dbg_mcast(netdev_priv(dev), "sendonly multicast " > > - "join failed for %pI6, status %d\n", > > - mcast->mcmember.mgid.raw, status); > > - > > - /* Flush out any queued packets */ > > - netif_tx_lock_bh(dev); > > - while (!skb_queue_empty(&mcast->pkt_queue)) { > > - ++dev->stats.tx_dropped; > > - dev_kfree_skb_any(skb_dequeue(&mcast->pkt_queue)); > > - } > > - netif_tx_unlock_bh(dev); > > - __ipoib_mcast_schedule_join_thread(priv, mcast, 1); > > - } else { > > - mcast->backoff = 1; > > - mcast->delay_until = jiffies; > > - __ipoib_mcast_schedule_join_thread(priv, NULL, 0); > > - } > > -out: > > - clear_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags); > > - if (status) > > - mcast->mc = NULL; > > - complete(&mcast->done); > > - mutex_unlock(&mcast_mutex); > > - return status; > > -} > > - &g
Re: [PATCH 1/9] IB/ipoib: factor out ah flushing
> On Mar 13, 2015, at 1:39 AM, Or Gerlitz wrote: > > On Tue, Mar 3, 2015 at 11:59 AM, Erez Shitrit > wrote: >> On 3/2/2015 5:09 PM, Doug Ledford wrote: >>> >>> On Sun, 2015-03-01 at 08:47 +0200, Erez Shitrit wrote: >>>> >>>> On 2/26/2015 6:27 PM, Doug Ledford wrote: >>>>>>> >>>>>>> @@ -1037,6 +1038,7 @@ static void __ipoib_ib_dev_flush(struct >>>>>>> ipoib_dev_priv *priv, >>>>>>>if (level == IPOIB_FLUSH_LIGHT) { >>>>>>>ipoib_mark_paths_invalid(dev); >>>>>>>ipoib_mcast_dev_flush(dev); >>>>>>> + ipoib_flush_ah(dev, 0); >>>>>> >>>>>> Why do you need to call the flush function here? >>>>> >>>>> To remove all of the ah's that were reduced to a 0 refcount by the >>>>> previous two functions prior to restarting operations. When we remove >>>>> an ah, it calls ib_destroy_ah which calls all the way down into the low >>>>> level driver. This was to make sure that old, stale data was removed >>>>> all the way down to the card level before we started new queries for >>>>> paths and ahs. >>>> >>>> Yes. but it is not needed. >>> >>> That depends on the card. For the modern cards (mlx4, mlx5, qib), it >>> isn't needed but doesn't hurt either. For older cards (in particular, >>> mthca), the driver actually frees up card resources at the time of the >>> call. >> >> >> Can you please elaborate more here, I took a look in the mthca and didn't >> see that. >> anyway, what i don't understand is why you need to do that now, the ah is >> already in the dead_ah_list so, in at the most 1 sec will be cleared and if >> the driver goes down your other hunk fixed that. > > Doug, ten days and no response from you... lets finalize the review on > this series so we have it safely done for 4.1 -- on top of it Erez > prepares a set of IPoIB fixes from our internal tree and we want that > for 4.1 too. Please address. I didn’t have much to say here. I said that mthca can have card resources freed by this call, which is backed up by this code in mthca_ah.c int mthca_destroy_ah(struct mthca_dev *dev, struct mthca_ah *ah) { switch (ah->type) { case MTHCA_AH_ON_HCA: mthca_free(&dev->av_table.alloc, (ah->avdma - dev->av_table.ddr_av_base) / MTHCA_AV_SIZE); break; I’m not entirely sure how Erez missed that, but it’s there and it’s what gets called when we destroy an ah (depending on the card of course). So, that represents one case where freeing the resources in a non-lazy fashion has a direct benefit. And there is no cited drawback to freeing the resources in a non-lazy fashion on a net event, so I don’t see what there is to discuss further on the issue. — Doug Ledford GPG Key ID: 0E572FDD signature.asc Description: Message signed with OpenPGP using GPGMail
Re: [PATCH 0/9] IB/ipoib: fixup multicast locking issues
> On Mar 13, 2015, at 1:41 AM, Or Gerlitz wrote: > > On Sun, Feb 22, 2015 at 2:26 AM, Doug Ledford wrote: >> This is the re-ordered, squashed version of my 22 patch set that I >> posted on Feb 11. There are a few minor differences between that >> set and this one. They are: > [...] > > Doug, you wrote here a very detailed listing of the changes from > earlier posts and the testing the patches went through, which is > excellent. It would be very good if you can also post few liner > telling the changes done by the series in high level, so we can have > this test as part of a "merge" commit that says in the kernel logs. OK. I would take what I had in the original message and expand upon it then: This entire patchset was intended to address the issue of ipoib interfaces being brought up/down in a tight loop, which will hardlock a standard v3.19 kernel. It succeeds at resolving that problem. In order to accomplish this goal, it reworks how the IPOIB_MCAST_FLAG_BUSY flag is used. Conceptually, that flag used to be set when we started a multicast join, and would stay set once the join was complete. This left no way to tell if the multicast join was complete or still in flight. This allowed race conditions to develop between joining multicast groups and taking an interface down. A previous attempt to resolve these race conditions used the flag IPOIB_MCAST_JOIN_STARTED, but did not succeed at fully resolving the race conditions. This patchset resolves this issue, plus a number of related issues discovered while working on this issue. The primary fix itself is patch 6/9 and a more complete description of the changes to how the IPOIB_MCAST_FLAG_BUSY flag is now used can be found in that commit log. — Doug Ledford GPG Key ID: 0E572FDD signature.asc Description: Message signed with OpenPGP using GPGMail
Re: [PATCH 1/9] IB/ipoib: factor out ah flushing
> On Mar 16, 2015, at 8:24 AM, Erez Shitrit wrote: > > On 3/15/2015 8:42 PM, Doug Ledford wrote: >> >>> Doug, ten days and no response from you... lets finalize the review on >>> this series so we have it safely done for 4.1 -- on top of it Erez >>> prepares a set of IPoIB fixes from our internal tree and we want that >>> for 4.1 too. Please address. >> I didn’t have much to say here. I said that mthca can have card resources >> freed by this call, which is backed up by this code in mthca_ah.c >> >> int mthca_destroy_ah(struct mthca_dev *dev, struct mthca_ah *ah) >> { >> switch (ah->type) { >> case MTHCA_AH_ON_HCA: >> mthca_free(&dev->av_table.alloc, >>(ah->avdma - dev->av_table.ddr_av_base) / >>MTHCA_AV_SIZE); >> break; >> >> >> I’m not entirely sure how Erez missed that, but it’s there and it’s what >> gets called when we destroy an ah (depending on the card of course). So, >> that represents one case where freeing the resources in a non-lazy fashion >> has a direct benefit. And there is no cited drawback to freeing the >> resources in a non-lazy fashion on a net event, so I don’t see what there is >> to discuss further on the issue. > sorry, but i still don't see the connection to the device type. > It will be deleted/freed with the regular flow, like it does in the rest of > the life cycle cases of the ah (in neigh_dtor, path_free, etc.), so why here > it should be directly after the event? Because it’s the right thing to do. The only reason to do lazy deletion is when there is a performance benefit to batching. There is no performance benefit to batching here. And because on certain hardware the action frees resources on the card, which are limited, doing non-lazy deletion can be beneficial. Given that there is no downside to doing the deletions in a non-lazy fashion, and that there can be an upside depending on hardware, there is no reason to stick with the lazy deletions. — Doug Ledford GPG Key ID: 0E572FDD signature.asc Description: Message signed with OpenPGP using GPGMail
Re: [PATCH 1/9] IB/ipoib: factor out ah flushing
> On Mar 16, 2015, at 8:24 AM, Erez Shitrit wrote: > > On 3/15/2015 8:42 PM, Doug Ledford wrote: >> >>> Doug, ten days and no response from you... lets finalize the review on >>> this series so we have it safely done for 4.1 -- on top of it Erez >>> prepares a set of IPoIB fixes from our internal tree and we want that >>> for 4.1 too. Please address. >> I didn’t have much to say here. I said that mthca can have card resources >> freed by this call, which is backed up by this code in mthca_ah.c >> >> int mthca_destroy_ah(struct mthca_dev *dev, struct mthca_ah *ah) >> { >>switch (ah->type) { >>case MTHCA_AH_ON_HCA: >>mthca_free(&dev->av_table.alloc, >> (ah->avdma - dev->av_table.ddr_av_base) / >> MTHCA_AV_SIZE); >>break; >> >> >> I’m not entirely sure how Erez missed that, but it’s there and it’s what >> gets called when we destroy an ah (depending on the card of course). So, >> that represents one case where freeing the resources in a non-lazy fashion >> has a direct benefit. And there is no cited drawback to freeing the >> resources in a non-lazy fashion on a net event, so I don’t see what there is >> to discuss further on the issue. > sorry, but i still don't see the connection to the device type. > It will be deleted/freed with the regular flow, like it does in the rest of > the life cycle cases of the ah (in neigh_dtor, path_free, etc.), so why here > it should be directly after the event? Because it’s the right thing to do. The only reason to do lazy deletion is when there is a performance benefit to batching. There is no performance benefit to batching here. And because on certain hardware the action frees resources on the card, which are limited, doing non-lazy deletion can be beneficial. Given that there is no downside to doing the deletions in a non-lazy fashion, and that there can be an upside depending on hardware, there is no reason to stick with the lazy deletions. — Doug Ledford GPG Key ID: 0E572FDD -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 0/2 RESEND] IB/Verbs: Use helpers to refine the checking on transport and link layer
On Wed, 2015-03-25 at 16:09 +0100, Michael Wang wrote: > My sincerely apologies for the corrupted mails, and thanks for Dan's kindly > remind :-) > > There are too many lengthy code to check the transport type of IB device, > or the link layer type of it's port, this patch set try to use some helper to > refine and save us some code. > > TODO: > Currently we inferred from the transport type and link layer type to > identify > the way of management, it will be better if we can directly get the > indicator > from vendor. > > Sean proposed one suggestion: > https://www.mail-archive.com/linux-rdma@vger.kernel.org/msg23339.html > > It may need a big work to adapt current implementation to utilize > these flags elegantly. > > Also the performance concern on query_port() need to be addressed, may be > some new callback like query_mgmt() could works. > > Michael Wang (2): > [PATCH 1/2] IB/Verbs: Use helpers to check transport and link layer > [PATCH 2/2] IB/Verbs: Use helpers to check IBoE technology > > --- > drivers/infiniband/core/agent.c |2 - > drivers/infiniband/core/cm.c |2 - > drivers/infiniband/core/cma.c | 33 > -- > drivers/infiniband/core/mad.c |6 ++--- > drivers/infiniband/core/multicast.c | 11 +++--- > drivers/infiniband/core/sa_query.c| 14 ++-- > drivers/infiniband/core/ucm.c |3 -- > drivers/infiniband/core/user_mad.c|2 - > drivers/infiniband/core/verbs.c |5 +--- > drivers/infiniband/hw/mlx4/ah.c |2 - > drivers/infiniband/hw/mlx4/cq.c |4 --- > drivers/infiniband/hw/mlx4/mad.c | 14 +++- > drivers/infiniband/hw/mlx4/main.c |8 ++- > drivers/infiniband/hw/mlx4/mlx4_ib.h |2 - > drivers/infiniband/hw/mlx4/qp.c | 21 ++- > drivers/infiniband/hw/mlx4/sysfs.c|6 + > drivers/infiniband/ulp/ipoib/ipoib_main.c |6 ++--- > include/rdma/ib_verbs.h | 30 +++ > net/sunrpc/xprtrdma/svc_rdma_recvfrom.c |3 -- > 19 files changed, 87 insertions(+), 87 deletions(-) > I think this is a step in the right direction. However, as I brought up in a different thread, I think it would be best if we start to clear up some of the funny things in this space, such as calling iWARP link layer InfiniBand. One thing we didn't discuss before is that, even if changing these items around in user space would break user space, changing them around in the kernel won't break anything except maybe Lustre (which we can inform the authors about the change so they can anticipate it and do the right thing in their code) because we can fix up our return values we pass to user space with no impact to them as it's not on a hot path. We can then emulate the old way of setting all these variables to user space while fixing them up in kernel space. So, I would suggest that we fix things up thusly: enum transport { TRANSPORT_IB=1, TRANSPORT_IWARP=2, TRANSPORT_ROCE=4, TRANSPORT_OPA=8, TRANSPORT_USNIC=10, }; #define HAS_SA(ibdev) ((ibdev)->transport & (TRANSPORT_IB|TRANSPORT_OPA)) #define HAS_JUMBO_SA(ibdev) ((ibdev)->transport & TRANSPORT_OPA)) or possibly static bool ib_dev_has_sa(struct ibv_device *ibdev) { return ibdev->transport & (TRANSPORT_IB | TRANSPORT_OPA); } If we do this, then the only thing we have to fix up to preserve ABI with user space is to make sure that any time we export an ibv_device struct and any time we import the same, we convert from our new internal representation to the old representation that user space expects. And we also need to make a few changes in the sysfs code to display the properties as things expect. But, that would allow us to fix up what I see as a problem right now, which is that we hide the information we need to know what sort of device we are working on in two different fields: the transport and the link layer. Instead, just use one field with enough variants that we can store all of the relevant information we need in that one field. This has the benefit that any comparisons that happen in hot paths will now always be a single bitwise comparison and will no longer need to hit two separate variables for two separate compares. -- Doug Ledford GPG KeyID: 0E572FDD signature.asc Description: This is a digitally signed message part
Re: [PATCH 0/2 RESEND] IB/Verbs: Use helpers to refine the checking on transport and link layer
On Thu, 2015-03-26 at 17:04 +0100, Michael Wang wrote: > Hi, Doug > > Thanks for the excellent comments :-) > > On 03/26/2015 03:09 PM, Doug Ledford wrote: > > On Wed, 2015-03-25 at 16:09 +0100, Michael Wang wrote: > >> [snip] > >> > > [snip] > > > > So, I would suggest that we fix things up thusly: > > > > enum transport { > > TRANSPORT_IB=1, > > TRANSPORT_IWARP=2, > > TRANSPORT_ROCE=4, > > TRANSPORT_OPA=8, > > TRANSPORT_USNIC=10, > > }; > > > > #define HAS_SA(ibdev) ((ibdev)->transport & (TRANSPORT_IB|TRANSPORT_OPA)) > > #define HAS_JUMBO_SA(ibdev) ((ibdev)->transport & TRANSPORT_OPA)) > > > > or possibly > > > > static bool ib_dev_has_sa(struct ibv_device *ibdev) > > { > > return ibdev->transport & (TRANSPORT_IB | TRANSPORT_OPA); > > } > > The idea sounds interesting, and here my silly questions come :-P > > So are you suggesting that we add a new bitmask 'transport' into 'struct > ib_device' > in kernel, and setup it at very beginning? > > Few more questions here is: > 1. when to setup? (maybe inside ib_register_device() before doing > client->add() callback?) I don't think "we" can set it up here. The driver's have to set it up. After all, the mlx4 driver will have to decide for itself what the port transport is and tell us, we can't tell it. > 2. how to setup? (still infer from the transport and link layer like we > currently do?) Find each point in each driver where they currently set the link layer and transport fields today, and replace that with setting the new transport bitmask instead. > 3. in case if a device has ports with different link layer type (please > correct > me if this will never happen), then only one bitmask may not be enough to > present the transport of all the ports? (maybe create a bitmask per port?) Correct, a bitmask per port. And we can remove the existing transport and link layer elements of the struct and replace it with just the new transport. Then, whenever we need to copy a struct to user space, we have a helper that looks something like this: static void inline ib_set_user_transport(struct ib_device *ibdev, struct user_ibv_device *uibdev) { switch(ibdev->port[port]->transport) { case TRANSPORT_IB: case TRANSPORT_OPA: uibdev->port[port]->link_layer = INFINIBAND; uibdev->port[port]->transport = INFINIBAND; break; case TRANSPORT_IWARP: uibdev->port[port]->link_layer = INFINIBAND; uibdev->port[port]->transport = IWARP; break; case TRANSPORT_ROCE: uibdev->port[port]->link_layer = ETHERNET; uibdev->port[port]->transport = INFINIBAND; break; case TRANSPORT_USNIC: uibdev->port[port]->link_layer = ETHERNET; uibdev->port[port]->transport = ; break; default: pr_err(ibdev, "unknown transport type %x\n", ibdev->port[port]->transport); } } That preserves the user space ABI and all user programs keep working, while we update to an internal representation that makes more sense for how things have evolved. > Regards, > Michael Wang > > > > > If we do this, then the only thing we have to fix up to preserve ABI > > with user space is to make sure that any time we export an ibv_device > > struct and any time we import the same, we convert from our new internal > > representation to the old representation that user space expects. And > > we also need to make a few changes in the sysfs code to display the > > properties as things expect. But, that would allow us to fix up what I > > see as a problem right now, which is that we hide the information we > > need to know what sort of device we are working on in two different > > fields: the transport and the link layer. Instead, just use one field > > with enough variants that we can store all of the relevant information > > we need in that one field. This has the benefit that any comparisons > > that happen in hot paths will now always be a single bitwise comparison > > and will no longer need to hit two separate variables for two separate > > compares. > > > > > > > -- Doug Ledford GPG KeyID: 0E572FDD signature.asc Description: This is a digitally signed message part
Re: [RFC PATCH 06/11] IB/Verbs: Use management helper has_sa() and cap_sa(), for sa-check
On Fri, 2015-03-27 at 12:47 -0400, ira.weiny wrote: > On Fri, Mar 27, 2015 at 04:46:11PM +0100, Michael Wang wrote: > > > > Introduce helper has_sa() and cap_sa() to help us check if an IB device > > or it's port support Subnet Administrator. > > I think these 2 should be combined. The question is if a port requires the > use > of the SA depending on the network it is connected to. > > Aren't some dual port Mellanox cards capable of doing IB on 1 port and Eth on > the other? Yes. > Do they show up as 2 devices? No. They are two ports of the same device with different link layers. See here: hca_id: mlx4_1 transport: InfiniBand (0) fw_ver: 2.31.5050 node_guid: f452:1403:007b:cba0 sys_image_guid: f452:1403:007b:cba3 vendor_id: 0x02c9 vendor_part_id: 4099 hw_ver: 0x0 board_id: MT_1090120019 phys_port_cnt: 2 port: 1 state: PORT_ACTIVE (4) max_mtu:2048 (4) active_mtu: 2048 (4) sm_lid: 2 port_lid: 2 port_lmc: 0x01 link_layer: InfiniBand port: 2 state: PORT_ACTIVE (4) max_mtu:4096 (5) active_mtu: 4096 (5) sm_lid: 0 port_lid: 0 port_lmc: 0x00 link_layer: Ethernet > Regardless I think we should define the SA access on a per port basis. Yes. > > > > Cc: Jason Gunthorpe > > Cc: Doug Ledford > > Cc: Ira Weiny > > Cc: Sean Hefty > > Signed-off-by: Michael Wang > > --- > > drivers/infiniband/core/sa_query.c | 12 ++-- > > include/rdma/ib_verbs.h| 28 > > 2 files changed, 34 insertions(+), 6 deletions(-) > > > > diff --git a/drivers/infiniband/core/sa_query.c > > b/drivers/infiniband/core/sa_query.c > > index d95d25f..89c27da 100644 > > --- a/drivers/infiniband/core/sa_query.c > > +++ b/drivers/infiniband/core/sa_query.c > > @@ -450,7 +450,7 @@ static void ib_sa_event(struct ib_event_handler > > *handler, struct ib_event *event > > struct ib_sa_port *port = > > &sa_dev->port[event->element.port_num - sa_dev->start_port]; > > > > -if (!rdma_port_ll_is_ib(handler->device, port->port_num)) > > +if (!cap_sa(handler->device, port->port_num)) > > return; > > > > spin_lock_irqsave(&port->ah_lock, flags); > > @@ -1154,7 +1154,7 @@ static void ib_sa_add_one(struct ib_device *device) > > struct ib_sa_device *sa_dev; > > int s, e, i; > > > > -if (!rdma_transport_is_ib(device)) > > +if (!has_sa(device)) > > The logic here should be: > > if (no ports of this device need sa access) > return; > > So why not eliminate this check and allow the cap_sa(s) to handle the support? > > -- Ira > > > return; > > > > if (device->node_type == RDMA_NODE_IB_SWITCH) > > @@ -1175,7 +1175,7 @@ static void ib_sa_add_one(struct ib_device *device) > > > > for (i = 0; i <= e - s; ++i) { > > spin_lock_init(&sa_dev->port[i].ah_lock); > > -if (!rdma_port_ll_is_ib(device, i + 1)) > > +if (!cap_sa(device, i + 1)) > > continue; > > > > sa_dev->port[i].sm_ah= NULL; > > @@ -1205,14 +1205,14 @@ static void ib_sa_add_one(struct ib_device *device) > > goto err; > > > > for (i = 0; i <= e - s; ++i) > > -if (rdma_port_ll_is_ib(device, i + 1)) > > +if (cap_sa(device, i + 1)) > > update_sm_ah(&sa_dev->port[i].update_task); > > > > return; > > > > err: > > while (--i >= 0) > > -if (rdma_port_ll_is_ib(device, i + 1)) > > +if (cap_sa(device, i + 1)) > > ib_unregister_mad_agent(sa_dev->port[i].agent); > > > > kfree(sa_dev); > > @@ -1233,7 +1233,7 @@ static void ib_sa_remove_one(struct ib_device *device) > >
Re: [PATCH 01/11] IB/Verbs: Use helpers to check transport and link layer
On Fri, 2015-03-27 at 16:40 +0100, Michael Wang wrote: > We have so much places to check transport type and link layer type, it's now > make sense to introduce some helpers in order to refine the lengthy code. > > This patch will introduce helpers: > rdma_transport_is_ib() > rdma_transport_is_iwarp() > rdma_port_ll_is_ib() > rdma_port_ll_is_eth() > and use them to save some code for us. If the end result is to do something like I proposed, then why take this intermediate step that just has to be backed out later? In other words, if our end goal is to have rdma_transport_is_ib() rdma_transport_is_iwarp() rdma_transport_is_roce() rdma_transport_is_opa() Then we should skip doing rdma_port_ll_is_*() as the answers to these items would be implied by rdma_transport_is_roce() and such. > Cc: Jason Gunthorpe > Cc: Doug Ledford > Cc: Ira Weiny > Cc: Sean Hefty > Signed-off-by: Michael Wang > --- > drivers/infiniband/core/agent.c | 2 +- > drivers/infiniband/core/cm.c | 2 +- > drivers/infiniband/core/cma.c | 27 --- > drivers/infiniband/core/mad.c | 6 +++--- > drivers/infiniband/core/multicast.c | 11 --- > drivers/infiniband/core/sa_query.c| 14 +++--- > drivers/infiniband/core/ucm.c | 3 +-- > drivers/infiniband/core/user_mad.c| 2 +- > drivers/infiniband/core/verbs.c | 5 ++--- > drivers/infiniband/hw/mlx4/ah.c | 2 +- > drivers/infiniband/hw/mlx4/cq.c | 4 +--- > drivers/infiniband/hw/mlx4/mad.c | 14 -- > drivers/infiniband/hw/mlx4/main.c | 8 +++- > drivers/infiniband/hw/mlx4/mlx4_ib.h | 2 +- > drivers/infiniband/hw/mlx4/qp.c | 21 +++-- > drivers/infiniband/hw/mlx4/sysfs.c| 6 ++ > drivers/infiniband/ulp/ipoib/ipoib_main.c | 6 +++--- > include/rdma/ib_verbs.h | 24 > net/sunrpc/xprtrdma/svc_rdma_recvfrom.c | 3 +-- > 19 files changed, 79 insertions(+), 83 deletions(-) > > diff --git a/drivers/infiniband/core/agent.c b/drivers/infiniband/core/agent.c > index f6d2961..27f1bec 100644 > --- a/drivers/infiniband/core/agent.c > +++ b/drivers/infiniband/core/agent.c > @@ -156,7 +156,7 @@ int ib_agent_port_open(struct ib_device *device, int > port_num) > goto error1; > } > > -if (rdma_port_get_link_layer(device, port_num) == > IB_LINK_LAYER_INFINIBAND) { > +if (rdma_port_ll_is_ib(device, port_num)) { > /* Obtain send only MAD agent for SMI QP */ > port_priv->agent[0] = ib_register_mad_agent(device, port_num, > IB_QPT_SMI, NULL, 0, > diff --git a/drivers/infiniband/core/cm.c b/drivers/infiniband/core/cm.c > index e28a494..2c72e9e 100644 > --- a/drivers/infiniband/core/cm.c > +++ b/drivers/infiniband/core/cm.c > @@ -3762,7 +3762,7 @@ static void cm_add_one(struct ib_device *ib_device) > int ret; > u8 i; > > -if (rdma_node_get_transport(ib_device->node_type) != RDMA_TRANSPORT_IB) > +if (!rdma_transport_is_ib(ib_device)) > return; > > cm_dev = kzalloc(sizeof(*cm_dev) + sizeof(*port) * > diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c > index d570030..668e955 100644 > --- a/drivers/infiniband/core/cma.c > +++ b/drivers/infiniband/core/cma.c > @@ -375,8 +375,8 @@ static int cma_acquire_dev(struct rdma_id_private > *id_priv, > listen_id_priv->id.port_num) == dev_ll) { > cma_dev = listen_id_priv->cma_dev; > port = listen_id_priv->id.port_num; > -if (rdma_node_get_transport(cma_dev->device->node_type) == > RDMA_TRANSPORT_IB && > -rdma_port_get_link_layer(cma_dev->device, port) == > IB_LINK_LAYER_ETHERNET) > +if (rdma_transport_is_ib(cma_dev->device) && > +rdma_port_ll_is_eth(cma_dev->device, port)) > ret = ib_find_cached_gid(cma_dev->device, &iboe_gid, > &found_port, NULL); > else > @@ -395,8 +395,8 @@ static int cma_acquire_dev(struct rdma_id_private > *id_priv, > listen_id_priv->id.port_num == port) > continue; > if (rdma_port_get_link_layer(cma_dev->device, port) == dev_ll) { > -if (rdma_node_get_transport(cma_dev->device->node_type) == > RDMA_TRANSPORT_IB && > -rdma_port_get_link_layer(cma_dev->device, port) == > IB_LINK_LAYER_ETHERNET) > +if (rdma_transport_is_ib(cma_dev->devic
Re: [RFC PATCH 07/11] IB/Verbs: Use management helper has_mcast() and, cap_mcast() for mcast-check
On Fri, 2015-03-27 at 16:46 +0100, Michael Wang wrote: > Introduce helper has_mcast() and cap_mcast() to help us check if an > IB device or it's port support Multicast. This probably needs reworded or rethought. In truth, *all* rdma devices are multicast capable. *BUT*, IB/OPA devices require multicast registration done the IB way (including for sendonly multicast sends), while Ethernet devices do multicast the Ethernet way. These tests are really just for IB specific multicast registration and deregistration. Call it has_mcast() and cap_mcast() is incorrect. > Cc: Jason Gunthorpe > Cc: Doug Ledford > Cc: Ira Weiny > Cc: Sean Hefty > Signed-off-by: Michael Wang > --- > drivers/infiniband/core/cma.c | 2 +- > drivers/infiniband/core/multicast.c | 8 > include/rdma/ib_verbs.h | 28 > 3 files changed, 33 insertions(+), 5 deletions(-) > > diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c > index 276fb76..cbbc85b 100644 > --- a/drivers/infiniband/core/cma.c > +++ b/drivers/infiniband/core/cma.c > @@ -3398,7 +3398,7 @@ void rdma_leave_multicast(struct rdma_cm_id *id, struct > sockaddr *addr) > ib_detach_mcast(id->qp, > &mc->multicast.ib->rec.mgid, > be16_to_cpu(mc->multicast.ib->rec.mlid)); > -if (rdma_transport_is_ib(id_priv->cma_dev->device)) { > +if (has_mcast(id_priv->cma_dev->device)) { > switch (rdma_port_get_link_layer(id->device, id->port_num)) { > case IB_LINK_LAYER_INFINIBAND: > ib_sa_free_multicast(mc->multicast.ib); > diff --git a/drivers/infiniband/core/multicast.c > b/drivers/infiniband/core/multicast.c > index 17573ff..ffeaf27 100644 > --- a/drivers/infiniband/core/multicast.c > +++ b/drivers/infiniband/core/multicast.c > @@ -780,7 +780,7 @@ static void mcast_event_handler(struct ib_event_handler > *handler, > int index; > > dev = container_of(handler, struct mcast_device, event_handler); > -if (!rdma_port_ll_is_ib(dev->device, event->element.port_num)) > +if (!cap_mcast(dev->device, event->element.port_num)) > return; > > index = event->element.port_num - dev->start_port; > @@ -807,7 +807,7 @@ static void mcast_add_one(struct ib_device *device) > int i; > int count = 0; > > -if (!rdma_transport_is_ib(device)) > +if (!has_mcast(device)) > return; > > dev = kmalloc(sizeof *dev + device->phys_port_cnt * sizeof *port, > @@ -823,7 +823,7 @@ static void mcast_add_one(struct ib_device *device) > } > > for (i = 0; i <= dev->end_port - dev->start_port; i++) { > -if (!rdma_port_ll_is_ib(device, dev->start_port + i)) > +if (!cap_mcast(device, dev->start_port + i)) > continue; > port = &dev->port[i]; > port->dev = dev; > @@ -861,7 +861,7 @@ static void mcast_remove_one(struct ib_device *device) > flush_workqueue(mcast_wq); > > for (i = 0; i <= dev->end_port - dev->start_port; i++) { > -if (rdma_port_ll_is_ib(device, dev->start_port + i)) { > +if (cap_mcast(device, dev->start_port + i)) { > port = &dev->port[i]; > deref_port(port); > wait_for_completion(&port->comp); > diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h > index fa8ffa3..e796104 100644 > --- a/include/rdma/ib_verbs.h > +++ b/include/rdma/ib_verbs.h > @@ -1823,6 +1823,19 @@ static inline int has_sa(struct ib_device *device) > } > > /** > + * has_mcast - Check if a device support Multicast. > + * > + * @device: Device to be checked > + * > + * Return 0 when a device has none port to support > + * Multicast. > + */ > +static inline int has_mcast(struct ib_device *device) > +{ > +return rdma_transport_is_ib(device); > +} > + > +/** > * cap_smi - Check if the port of device has the capability > * Subnet Management Interface. > * > @@ -1852,6 +1865,21 @@ static inline int cap_sa(struct ib_device *device, u8 > port_num) > return rdma_port_ll_is_ib(device, port_num); > } > > +/** > + * cap_mcast - Check if the port of device has the capability > + * Multicast. > + * > + * @device: Device to be checked > + * @port_num: Port number of the device > + * > + * Return 0 when port of the device don't support > + * Multicast. > + */ > +static inline int cap_mcast(struct ib_device *device, u8 port_num) > +{ > +return rdma_port_ll_is_ib(device, port_num); > +} > + > int ib_query_gid(struct ib_device *device, > u8 port_num, int index, union ib_gid *gid); > -- Doug Ledford GPG KeyID: 0E572FDD signature.asc Description: This is a digitally signed message part
Re: [RFC PATCH 08/11] IB/Verbs: Use management helper has_iwarp() for, iwarp-check
On Fri, 2015-03-27 at 16:47 +0100, Michael Wang wrote: > Introduce helper has_iwarp() to help us check if an IB device > support IWARP protocol. This is a needless redirection. Just stick with the original rdma_transport_is_iwarp(). > Cc: Jason Gunthorpe > Cc: Doug Ledford > Cc: Ira Weiny > Cc: Sean Hefty > Signed-off-by: Michael Wang > --- > include/rdma/ib_verbs.h | 13 + > net/sunrpc/xprtrdma/svc_rdma_recvfrom.c | 2 +- > 2 files changed, 14 insertions(+), 1 deletion(-) > > diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h > index e796104..0ef9cd7 100644 > --- a/include/rdma/ib_verbs.h > +++ b/include/rdma/ib_verbs.h > @@ -1836,6 +1836,19 @@ static inline int has_mcast(struct ib_device *device) > } > > /** > + * has_iwarp - Check if a device support IWARP protocol. > + * > + * @device: Device to be checked > + * > + * Return 0 when a device has none port to support > + * IWARP protocol. > + */ > +static inline int has_iwarp(struct ib_device *device) > +{ > +return rdma_transport_is_iwarp(device); > +} > + > +/** > * cap_smi - Check if the port of device has the capability > * Subnet Management Interface. > * > diff --git a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c > b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c > index a7b5891..48aeb5e 100644 > --- a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c > +++ b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c > @@ -118,7 +118,7 @@ static void rdma_build_arg_xdr(struct svc_rqst *rqstp, > > static int rdma_read_max_sge(struct svcxprt_rdma *xprt, int sge_count) > { > -if (rdma_transport_is_iwarp(xprt->sc_cm_id->device)) > +if (has_iwarp(xprt->sc_cm_id->device)) > return 1; > else > return min_t(int, sge_count, xprt->sc_max_sge); -- Doug Ledford GPG KeyID: 0E572FDD signature.asc Description: This is a digitally signed message part
Re: [RFC PATCH 06/11] IB/Verbs: Use management helper has_sa() and cap_sa(), for sa-check
On Fri, 2015-03-27 at 16:46 +0100, Michael Wang wrote: > Introduce helper has_sa() and cap_sa() to help us check if an IB device > or it's port support Subnet Administrator. There's no functional reason to have both rdma_transport_is_ib and rdma_port_ll_is_ib, just use one. Then there is also no reason for both has_sa and cap_sa. Just use one. > Cc: Jason Gunthorpe > Cc: Doug Ledford > Cc: Ira Weiny > Cc: Sean Hefty > Signed-off-by: Michael Wang > --- > drivers/infiniband/core/sa_query.c | 12 ++-- > include/rdma/ib_verbs.h| 28 > 2 files changed, 34 insertions(+), 6 deletions(-) > > diff --git a/drivers/infiniband/core/sa_query.c > b/drivers/infiniband/core/sa_query.c > index d95d25f..89c27da 100644 > --- a/drivers/infiniband/core/sa_query.c > +++ b/drivers/infiniband/core/sa_query.c > @@ -450,7 +450,7 @@ static void ib_sa_event(struct ib_event_handler *handler, > struct ib_event *event > struct ib_sa_port *port = > &sa_dev->port[event->element.port_num - sa_dev->start_port]; > > -if (!rdma_port_ll_is_ib(handler->device, port->port_num)) > +if (!cap_sa(handler->device, port->port_num)) > return; > > spin_lock_irqsave(&port->ah_lock, flags); > @@ -1154,7 +1154,7 @@ static void ib_sa_add_one(struct ib_device *device) > struct ib_sa_device *sa_dev; > int s, e, i; > > -if (!rdma_transport_is_ib(device)) > +if (!has_sa(device)) > return; > > if (device->node_type == RDMA_NODE_IB_SWITCH) > @@ -1175,7 +1175,7 @@ static void ib_sa_add_one(struct ib_device *device) > > for (i = 0; i <= e - s; ++i) { > spin_lock_init(&sa_dev->port[i].ah_lock); > -if (!rdma_port_ll_is_ib(device, i + 1)) > +if (!cap_sa(device, i + 1)) > continue; > > sa_dev->port[i].sm_ah= NULL; > @@ -1205,14 +1205,14 @@ static void ib_sa_add_one(struct ib_device *device) > goto err; > > for (i = 0; i <= e - s; ++i) > -if (rdma_port_ll_is_ib(device, i + 1)) > +if (cap_sa(device, i + 1)) > update_sm_ah(&sa_dev->port[i].update_task); > > return; > > err: > while (--i >= 0) > -if (rdma_port_ll_is_ib(device, i + 1)) > +if (cap_sa(device, i + 1)) > ib_unregister_mad_agent(sa_dev->port[i].agent); > > kfree(sa_dev); > @@ -1233,7 +1233,7 @@ static void ib_sa_remove_one(struct ib_device *device) > flush_workqueue(ib_wq); > > for (i = 0; i <= sa_dev->end_port - sa_dev->start_port; ++i) { > -if (rdma_port_ll_is_ib(device, i + 1)) { > +if (cap_sa(device, i + 1)) { > ib_unregister_mad_agent(sa_dev->port[i].agent); > if (sa_dev->port[i].sm_ah) > kref_put(&sa_dev->port[i].sm_ah->ref, free_sm_ah); > diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h > index c0a63f8..fa8ffa3 100644 > --- a/include/rdma/ib_verbs.h > +++ b/include/rdma/ib_verbs.h > @@ -1810,6 +1810,19 @@ static inline int has_cm(struct ib_device *device) > } > > /** > + * has_sa - Check if a device support Subnet Administrator. > + * > + * @device: Device to be checked > + * > + * Return 0 when a device has none port to support > + * Subnet Administrator. > + */ > +static inline int has_sa(struct ib_device *device) > +{ > +return rdma_transport_is_ib(device); > +} > + > +/** > * cap_smi - Check if the port of device has the capability > * Subnet Management Interface. > * > @@ -1824,6 +1837,21 @@ static inline int cap_smi(struct ib_device *device, u8 > port_num) > return rdma_port_ll_is_ib(device, port_num); > } > > +/** > + * cap_sa - Check if the port of device has the capability > + * Subnet Administrator. > + * > + * @device: Device to be checked > + * @port_num: Port number of the device > + * > + * Return 0 when port of the device don't support > + * Subnet Administrator. > + */ > +static inline int cap_sa(struct ib_device *device, u8 port_num) > +{ > +return rdma_port_ll_is_ib(device, port_num); > +} > + > int ib_query_gid(struct ib_device *device, > u8 port_num, int index, union ib_gid *gid); > -- Doug Ledford GPG KeyID: 0E572FDD signature.asc Description: This is a digitally signed message part