from:"Doug Ledford"

ibutils license problems + a few other minor issues

2010-02-11 Thread Doug Ledford

The ibutils tarball as found for download on the openfabrics web site
has some licensing issues that need resolved ASAP.  The tarball in its
current form is probably illegal to distribute until the items are
fixed.  So, here's what popped up in review (done by Jay Fenlason):

The COPYING files sprinkled around in the tarball are not consistent.
Some of them contain a copy of the GPLv2, some of them contain a
statement that the package is licensed under either GPLv2 in the file
COPYING or the OFA BSD license included here in.  Of course, it doesn't
work to reference a COPYING file from within the COPYING file you are
referencing and say it has the information you aren't including here.  I
would suggest that the proper place for the statement of GPLv2 or BSD is
in possibly a LICENSE file with the COPYING file being the GPLv2 as
stated.  It also seems redundant to have multiple COPYING files and
possibly also multiple LICENSE files in a single tarball, so cleanup of
unnecessary files seems in order here.

There are a number of files who have headers that indicate they are
*not* under either GPLv2 or BSD.

ibdm/replace/regex.[ch] are LGPLv2+, so need a LGPLv2+ COPYING somewhere
ibdm/replace/{memset.c,realloc.c,malloc.c} are GPLv2+ versus GPLv2

ibdm/src/fabric_sim.cpp claims to be confidential and proprietary! It
refers to LICENSE.txt for t&c. There is no LICENSE.txt under ibdm/.
There is one under ibmtsim/utils/ though.

ibmgtsim/config/{missing,config.sub,ltmain.sh,depcomp,config.guess} is
GPLv2+
ibmgtsim/config/install-sh is X license

ibmgtsim/utils/{,un}install.sh is GPLv2+

There are also a few other minor issues with the tarball as it builds:

/usr/bin/ibnlparse has no man page

/usr/share/man/man1/ibdm-ibnl-file.1.gz
/usr/share/man/man1/ibdm-topo-file.1.gz
Are in man1, but have no corresponding executables. Perhaps
they should be in a different man section?

-- 
Doug Ledford 
  GPG KeyID: CFBFF194
  http://people.redhat.com/dledford

Infiniband specific RPMs available at
  http://people.redhat.com/dledford/Infiniband



signature.asc
Description: OpenPGP digital signature

Re: [ewg] [PATCH v4] IB Core: RAW ETH support

2010-06-17 Thread Doug Ledford

On 06/16/2010 02:35 PM, Steve Wise wrote:
> Jason Gunthorpe wrote:
>> On Wed, Jun 16, 2010 at 12:12:06PM -0500, Steve Wise wrote:
>>
>>  
>>>>> Granted our dev process may not be documented, but I always assumed
>>>>> the general idea was to get changes accepted upstream, then pull
>>>>> into ofed. OFED is just a mechanism to make top-of-tree linux work
>>>>> on distro kernels. There are some exceptions, but this stuff
>>>>> shouldn't be an exception.
>>>>> 
>>
>>  
>>>> That is what many people wish for, me included, but it is not at all
>>>> what generally happens :(
>>>>
>>>> In my observation the typical flow is:
>>>>  - A patch is written, it may or may not be sent to the list
>>>>  - 'business drivers' get it slammed into OFED right away
>>>>  - A patch is finally sent for proper review
>>>>  - It is not merged, there are comments..
>>>>  - Interest in doing anything is lost because it is already in
>>>>OFED and that is all that matters, right?
>>>>  - People complain.
>>>>
>>>> For instance, the iWarp thingy we were just discussing fits this
>>>> process rather well.
>>>>   
>>
>>  
>>> You're wrong.  I started that iWARP change in 2007 on LKLM.  I
>>> proposed  a few ideas and show the pros/cons of each.  And it was
>>> NAKed 100% by  mister miller.It was then included in OFED as a
>>> last resort only  because I couldn't get any help with trying to add
>>> this upstream in any  form.  I even spent a few weeks developing a
>>> way to administor "iwarp  only" ipaddresses, but Roland didn't like
>>> that scheme for various  reasons.  So please don't mention that
>>> particular patch as a "bad  process" unless you want to argue with me
>>> some more about it.
>>> 
>>
>> Uhm, what you just described does fit my process outline:
>>
>>  #1 - Patch written, sent to LKML. Check.
>>  #3 - Patch sent for proper review - in 2007. Check.
>>  #4 - Not merged. NAK by DM. Check.
>>  #2 - 'business drivers' force it into OFED - 'last resort' ie, iWarp
>>   cards can't be used without some fix. Check.
>>  #5 - Interest is lost. Yep, this was done in 2007, and it was idle
>>   till now. Check.
>>  #6 - People Complain. Hmm. Yep. Check.
>>
>>   
> 
> Note the ordering is different.  IE I tried very hard to get the "right"
> solution designed and agreed-upon upstream.  But failed.  That's my
> bad.I did, however help with the iWARP core code including neighbour
> update net events which did go in upstream before ofed.
> 
>> Don't think I'm being critical toward only you, or singling out that
>> little iWarp patch. But it really isn't special, or different, or an
>> exception. Pick nearly any patch in OFED and someone will rush to its
>> defense with a 'we tried to follow the process and it failed, so we
>> did it anyway' argument.
>>
>> I also didn't say this is the only way that RDMA development goes,
>> lots and lots of stuff goes into mainline first, from everyone.
>>
>> Jason
>>   
> 
> 
> OFED maintainers should be more rigid, perhaps, with requiring that
> changes be accepted upstream first.  One observation is that there is no
> "OFED RDMA maintainer", aka a Roland Dreier, for the OFED code.  So each
> driver maintainer pretty much has free reign to do the right thing or
> the wrong thing...

Yep, no doubt that has an impact on things.  It's for this very reason
that our next operating system is not following OFED but instead is
using upstream as its basis.  That will be true from now on with our
products.


-- 
Doug Ledford 
  GPG KeyID: CFBFF194
  http://people.redhat.com/dledford

Infiniband specific RPMs available at
  http://people.redhat.com/dledford/Infiniband



signature.asc
Description: OpenPGP digital signature

Re: [ewg] [PATCH v4] IB Core: RAW ETH support

2010-06-17 Thread Doug Ledford

On 06/16/2010 01:12 PM, Steve Wise wrote:
> Jason Gunthorpe wrote:
>> On Wed, Jun 16, 2010 at 09:09:59AM -0500, Steve Wise wrote:
>>
>>  
>>> Granted our dev process may not be documented, but I always assumed
>>> the general idea was to get changes accepted upstream, then pull into
>>> ofed. OFED is just a mechanism to make top-of-tree linux work on
>>> distro kernels. There are some exceptions, but this stuff shouldn't
>>> be an exception.
>>> 
>>
>> That is what many people wish for, me included, but it is not at all
>> what generally happens :(
>>
>> In my observation the typical flow is:
>>  - A patch is written, it may or may not be sent to the list
>>  - 'business drivers' get it slammed into OFED right away
>>  - A patch is finally sent for proper review
>>  - It is not merged, there are comments..
>>  - Interest in doing anything is lost because it is already in
>>OFED and that is all that matters, right?
>>  - People complain.
>>
>> For instance, the iWarp thingy we were just discussing fits this
>> process rather well.
>>
>>   
> 
> You're wrong.  I started that iWARP change in 2007 on LKLM.  I proposed
> a few ideas and show the pros/cons of each.  And it was NAKed 100% by
> mister miller.It was then included in OFED as a last resort only
^
Which, of course, is the problem.  Once you have a solution besides "get
it upstream", you throw whatever you feel like into OFED instead of
whatever upstream will accept.  How long has OFED shipped the XRC stuff
now while it *still* isn't upstream?

> because I couldn't get any help with trying to add this upstream in any
> form. 

Again, OFED is part of the reason this failed.  That users had someplace
else to get "working" code besides upstream meant that you didn't have
end users putting pressure on the upstream kernel folks to accept *some*
form of solution.  So, your job was harder because there were no users
present to put pressure on mister miller or others, and then you
perpetuated the issue by caving and going back to OFED "as a last
resort".  It has become a "last resort" so often now that trying to get
things upstream first is just a sort of private joke amongst some people
I think.

> I even spent a few weeks developing a way to administor "iwarp
> only" ipaddresses, but Roland didn't like that scheme for various
> reasons.  So please don't mention that particular patch as a "bad
> process" unless you want to argue with me some more about it.
> 
> Also, the chelsio iWARP driver has 100% been upstream first, then
> ofed.   Some of us are indeed trying to do the right thing.
> 
> 

OFED just needs to go away.  It's been far too abused for far too long
and it's mere existence is hindering upstream development.  I appreciate
that you attempt to do the right thing most of the time, but it really
needs to be all of the time, and you need your users right there beside
you in order to carry the weight you need in order to get solutions
designed and accepted instead of running into the brick wall you ran
into before.

-- 
Doug Ledford 
  GPG KeyID: CFBFF194
  http://people.redhat.com/dledford

Infiniband specific RPMs available at
  http://people.redhat.com/dledford/Infiniband

signature.asc
Description: OpenPGP digital signature

Re: [PATCH 2/2] ib/cm: Bump reference count on cm_id before invoking callback

2011-02-23 Thread Doug Ledford

On Wed, 2011-02-23 at 08:17 -0800, Hefty, Sean wrote:
> When processing a SIDR REQ, the ib_cm allocates a new cm_id.  The refcount
> of the cm_id is initialized to 1.  However, cm_process_work will
> decrement the refcount after invoking all callbacks.  The result is that
> the cm_id will end up with refcount set to 0 by the end of the sidr req
> handler.
> 
> If a user tries to destroy the cm_id, the destruction will proceed, under
> the incorrect assumption that no other threads are referencing the cm_id.
> This can lead to a crash when the cm callback thread tries to access
> the cm_id.
> 
> This problem was noticed as part of a larger investigation with kernel
> crashes in the rdma_cm when running on a real time OS.
> 
> Signed-off-by: Sean Hefty 

Acked-by: Doug Ledford 

> ---
> 
>  drivers/infiniband/core/cm.c |1 +
>  1 files changed, 1 insertions(+), 0 deletions(-)
> 
> diff --git a/drivers/infiniband/core/cm.c b/drivers/infiniband/core/cm.c
> index 64e0903..1d9616b 100644
> --- a/drivers/infiniband/core/cm.c
> +++ b/drivers/infiniband/core/cm.c
> @@ -2989,6 +2989,7 @@ static int cm_sidr_req_handler(struct cm_work *work)
>   goto out; /* No match. */
>   }
>   atomic_inc(&cur_cm_id_priv->refcount);
> + atomic_inc(&cm_id_priv->refcount);
>   spin_unlock_irq(&cm.lock);
>  
>   cm_id_priv->id.cm_handler = cur_cm_id_priv->id.cm_handler;
> 
> 


-- 
Doug Ledford 
  GPG KeyID: CFBFF194
  http://people.redhat.com/dledford

Infiniband specific RPMs available at
  http://people.redhat.com/dledford/Infiniband



signature.asc
Description: This is a digitally signed message part

Re: [PATCH 1/2] rdma/cm: Fix crash in request handlers

2011-02-23 Thread Doug Ledford

On Wed, 2011-02-23 at 08:11 -0800, Hefty, Sean wrote:
> Doug Ledford and RedHat reported a crash when running the rdma_cm on a real
> time OS.  The crash has the following call trace:
> 
> cm_process_work
>cma_req_handler
>   cma_disable_callback
>   rdma_create_id
>  kzalloc
>  init_completion
>   cma_get_net_info
>   cma_save_net_info
>   cma_any_addr
>  cma_zero_addr
>   rdma_translate_ip
>  rdma_copy_addr
>   cma_acquire_dev
>  rdma_addr_get_sgid
>  ib_find_cached_gid
>  cma_attach_to_dev
>   ucma_event_handler
>  kzalloc
>  ib_copy_ah_attr_to_user
>   cma_comp
>  
> [ preempted ]
>  
> cma_write
> copy_from_user
> ucma_destroy_id
>copy_from_user
>_ucma_find_context
>ucma_put_ctx
>ucma_free_ctx
>   rdma_destroy_id
>  cma_exch
>  cma_cancel_operation
>  rdma_node_get_transport
> 
> rt_mutex_slowunlock
> bad_area_nosemaphore
> oops_enter
> 
> They were able to reproduce the crash multiple times with the following
> details:
> 
> Crash seems to always happen on the:
> mutex_unlock(&conn_id->handler_mutex);
> as conn_id looks to have been freed during this code path.
> 
> An examination of the code shows that a race exists in the request handlers.
> When a new connection request is received, the rdma_cm allocates a new
> connection identifier.  This identifier has a single reference count on it.
> If a user calls rdma_destroy_id() from another thread after receiving a
> callback, rdma_destroy_id will proceed to destroy the id and free
> the associated memory.  However, the request handlers may still be in
> the process of running.  When control returns to the request handlers,
> they can attempt to access the newly created identifiers.
> 
> Fix this by holding a reference on the newly created rdma_cm_id until
> the request handler is through accessing it.
> 
> Signed-off-by: Sean Hefty 

Acked-by: Doug Ledford 

> ---
> This is obviously serious enough to cause a crash, but the bug has also been 
> there
> for a while before anyone detected it.  I have no real opinion on whether this
> targets 2.6.38 (if that's even possible at this point) or 2.6.39.
> 
>  drivers/infiniband/core/cma.c |   15 +++
>  1 files changed, 15 insertions(+), 0 deletions(-)
> 
> diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c
> index 6884da2..e450c5a 100644
> --- a/drivers/infiniband/core/cma.c
> +++ b/drivers/infiniband/core/cma.c
> @@ -1210,6 +1210,11 @@ static int cma_req_handler(struct ib_cm_id *cm_id, 
> struct ib_cm_event *ib_event)
>   cm_id->context = conn_id;
>   cm_id->cm_handler = cma_ib_handler;
>  
> + /*
> +  * Protect against the user destroying conn_id from another thread
> +  * until we're done accessing it.
> +  */
> + atomic_inc(&conn_id->refcount);
>   ret = conn_id->id.event_handler(&conn_id->id, &event);
>   if (!ret) {
>   /*
> @@ -1222,8 +1227,10 @@ static int cma_req_handler(struct ib_cm_id *cm_id, 
> struct ib_cm_event *ib_event)
>   ib_send_cm_mra(cm_id, CMA_CM_MRA_SETTING, NULL, 0);
>   mutex_unlock(&lock);
>   mutex_unlock(&conn_id->handler_mutex);
> + cma_deref_id(conn_id);
>   goto out;
>   }
> + cma_deref_id(conn_id);
>  
>   /* Destroy the CM ID by returning a non-zero value. */
>   conn_id->cm_id.ib = NULL;
> @@ -1425,17 +1432,25 @@ static int iw_conn_req_handler(struct iw_cm_id *cm_id,
>   event.param.conn.private_data_len = iw_event->private_data_len;
>   event.param.conn.initiator_depth = attr.max_qp_init_rd_atom;
>   event.param.conn.responder_resources = attr.max_qp_rd_atom;
> +
> + /*
> +  * Protect against the user destroying conn_id from another thread
> +  * until we're done accessing it.
> +  */
> + atomic_inc(&conn_id->refcount);
>   ret = conn_id->id.event_handler(&conn_id->id, &event);
>   if (ret) {
>   /* User wants to destroy the CM ID */
>   conn_id->cm_id.iw = NULL;
>   cma_exch(conn_id, CMA_DESTROYING);
>   mutex_unlock(&conn_id->handler_mutex);
> + cma_deref_id(conn_id);
>   rdma_destroy_id(&conn_id->id);
>   goto out;
>   }
>  
>   mutex_unlock(&conn_id->handler_mutex);
> + cma_deref_id(conn_id);
>  
>  out:
>   if (dev)
> 
> 


-- 
Doug Ledford 
  GPG KeyID: CFBFF194
  http://people.redhat.com/dledford

Infiniband specific RPMs available at
  http://people.redhat.com/dledford/Infiniband



signature.asc
Description: This is a digitally signed message part

Re: [PATCH 1/2] rdma/cm: Fix crash in request handlers

2011-03-15 Thread Doug Ledford

On Mon, 2011-03-14 at 21:22 -0700, Roland Dreier wrote:
> On Mon, Mar 14, 2011 at 7:27 PM, Roland Dreier  wrote:
> > Doesn't that mean unprivileged userspace could trigger a use-after-free
> > in the kernel?  (and it might be malicious code, not buggy userspace)
> 
> >From reading the code a bit, I guess ucma is OK in this area.  I do see
> what seems like an exploitable race in ucma_create_id():
> 
>  - one thread create an id with an invalid userspace pointer
>(so the copy_to_user in ucma_create_id returns -EFAULT
>and calls rdma_destroy_id before idr_remove)
>  - another thread guess the id that is going to be returned and
>call ucma_destroy_id()
> 
> if the second thread hits the window where the cm_id is
> destroyed but the ctx is still in the idr, it can trigger a double free.
> 
> But this is a completely different bug.
> 
>  - R.

Heh, nice.  This would be a good example of why I prefer that rules such
as "cm_destroy_id/rdma_destroy_id/ucma_destroy_id can only be called
once per ID" be coded into the destroy function themselves and enforced.
As long as you are talking about user space guessing ID's, it is
patently impossible to guarantee that we won't call a destroy function
more than once.

-- 
Doug Ledford 
  GPG KeyID: CFBFF194
  http://people.redhat.com/dledford

Infiniband specific RPMs available at
  http://people.redhat.com/dledford/Infiniband



signature.asc
Description: This is a digitally signed message part

[Patch] mlx4: make mcg entry size a module parameter

2011-03-23 Thread Doug Ledford

We ran across a problem at a customer's site where the qp array in the
mcg entry was being filling up and denying further qp attaches.  In
addition, the upcoming SRIOV support is expected to increase use of this
array as well for unicast address steering (or so the comments in a
patch Yevgeny's original patch that statically increased the size of the
mcg entries stated).  But, since increasing the size of this to an
arbitrarily large number is just a waste of memory, and since we don't
know that 0x200 will be large enough for all use cases, make the option
a module parameter instead.  This has been tested at our customer's site
and solves their problem.

commit ff608ce370b49d2e5b614ff91f4e23b5deaac8a4
Author: Doug Ledford 
Date:   Wed Mar 23 12:20:47 2011 -0400

mlx4: make the size of the mcg entry a module parameter

Testing showed that the default size of 0x100 could be overrun.
Bumping to 0x200 would fix the problem, but only until we hit
the cap again, and at the expense of making memory consumption in
all scenarios worse.  So, make the size of the mcg entry a module
parameter and let those people who need to bump the size do so.

    Signed-off-by: Doug Ledford 

diff --git a/drivers/infiniband/hw/mlx4/main.c 
b/drivers/infiniband/hw/mlx4/main.c
index c7a6213..322f0af 100644
--- a/drivers/infiniband/hw/mlx4/main.c
+++ b/drivers/infiniband/hw/mlx4/main.c
@@ -158,7 +158,7 @@ static int mlx4_ib_query_device(struct ib_device *ibdev,
props->masked_atomic_cap   = IB_ATOMIC_HCA;
props->max_pkeys   = dev->dev->caps.pkey_table_len[1];
props->max_mcast_grp   = dev->dev->caps.num_mgms + 
dev->dev->caps.num_amgms;
-   props->max_mcast_qp_attach = dev->dev->caps.num_qp_per_mgm;
+   props->max_mcast_qp_attach = dev->dev->caps.num_qp_per_mcg;
props->max_total_mcast_qp_attach = props->max_mcast_qp_attach *
   props->max_mcast_grp;
props->max_map_per_fmr = (1 << (32 - ilog2(dev->dev->caps.num_mpts))) - 
1;
diff --git a/drivers/net/mlx4/fw.c b/drivers/net/mlx4/fw.c
index 5de1db8..800cb2d 100644
--- a/drivers/net/mlx4/fw.c
+++ b/drivers/net/mlx4/fw.c
@@ -434,8 +434,8 @@ int mlx4_QUERY_DEV_CAP(struct mlx4_dev *dev, struct 
mlx4_dev_cap *dev_cap)
 dev_cap->reserved_mrws, dev_cap->reserved_mtts);
mlx4_dbg(dev, "Max PDs: %d, reserved PDs: %d, reserved UARs: %d\n",
 dev_cap->max_pds, dev_cap->reserved_pds, 
dev_cap->reserved_uars);
-   mlx4_dbg(dev, "Max QP/MCG: %d, reserved MGMs: %d\n",
-dev_cap->max_pds, dev_cap->reserved_mgms);
+   mlx4_dbg(dev, "Max QP/MCG: %d, Max MCGs: %d, reserved MGMs: %d\n",
+dev_cap->max_qp_per_mcg, dev_cap->max_mcgs, 
dev_cap->reserved_mgms);
mlx4_dbg(dev, "Max CQEs: %d, max WQEs: %d, max SRQ WQEs: %d\n",
 dev_cap->max_cq_sz, dev_cap->max_qp_sz, dev_cap->max_srq_sz);
mlx4_dbg(dev, "Local CA ACK delay: %d, max MTU: %d, port width cap: 
%d\n",
diff --git a/drivers/net/mlx4/main.c b/drivers/net/mlx4/main.c
index 2765a3c..5306141 100644
--- a/drivers/net/mlx4/main.c
+++ b/drivers/net/mlx4/main.c
@@ -101,6 +101,10 @@ module_param_named(use_prio, use_prio, bool, 0444);
 MODULE_PARM_DESC(use_prio, "Enable steering by VLAN priority on ETH ports "
  "(0/1, default 0)");
 
+static int log_mcg_size = 8;
+module_param_named(log_mcg_size, log_mcg_size, int, 0444);
+MODULE_PARM_DESC(log_mcg_size, "Log2 size of MCG struct (8-11)");
+
 static int log_mtts_per_seg = ilog2(MLX4_MTT_ENTRY_PER_SEG);
 module_param_named(log_mtts_per_seg, log_mtts_per_seg, int, 0444);
 MODULE_PARM_DESC(log_mtts_per_seg, "Log2 number of MTT entries per segment 
(1-7)");
@@ -203,7 +207,14 @@ static int mlx4_dev_cap(struct mlx4_dev *dev, struct 
mlx4_dev_cap *dev_cap)
dev->caps.reserved_srqs  = dev_cap->reserved_srqs;
dev->caps.max_sq_desc_sz = dev_cap->max_sq_desc_sz;
dev->caps.max_rq_desc_sz = dev_cap->max_rq_desc_sz;
-   dev->caps.num_qp_per_mgm = MLX4_QP_PER_MGM;
+   dev->caps.max_qp_per_mcg = dev_cap->max_qp_per_mcg;
+   dev->caps.max_mcgs   = dev_cap->max_mcgs;
+   i = 0;
+   do {
+   dev->caps.mcg_entry_size = 1 << (log_mcg_size - i++);
+   dev->caps.num_qp_per_mcg = 4 * (dev->caps.mcg_entry_size / 
16 - 2);
+   } while (dev->caps.num_qp_per_mcg > dev->caps.max_qp_per_mcg);
+
/*
 * Subtract 1 from the limit because we need to allocate a
 * spare CQE so the HCA HW can tell the difference between an
@@ -642,7 +653,7 @@ static int mlx4_init_icm(struct mlx4_dev *dev, struct 
mlx4_

Re: [PATCH for-next V2 01/22] IB/core: Reserve bits in enum ib_qp_create_flags for low-level driver use

2012-09-05 Thread Doug Ledford

On 8/3/2012 4:40 AM, Jack Morgenstein wrote:
> Reserve bits 26-31 for internal use by low-level drivers. Two
> such bits are used in the mlx4 driver SRIOV IB implementation.
> 
> These enum additions guarantee that the core layer will never use
> these bits, so that low level drivers may safely make use of them.
> 
> Signed-off-by: Jack Morgenstein 
> ---
>  include/rdma/ib_verbs.h |3 +++
>  1 files changed, 3 insertions(+), 0 deletions(-)
> 
> diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
> index 07996af..46bc045 100644
> --- a/include/rdma/ib_verbs.h
> +++ b/include/rdma/ib_verbs.h
> @@ -614,6 +614,9 @@ enum ib_qp_type {
>  enum ib_qp_create_flags {
>   IB_QP_CREATE_IPOIB_UD_LSO   = 1 << 0,
>   IB_QP_CREATE_BLOCK_MULTICAST_LOOPBACK   = 1 << 1,
> + /* reserve bits 26-31 for low level drivers' internal use */
> + IB_QP_CREATE_RESERVED_START = 1 << 26,
> + IB_QP_CREATE_RESERVED_END   = 1 << 31,
>  };
>  
>  struct ib_qp_init_attr {
> 

Reserving 6 bits for driver use out of 32 seems reasonable.

Acked-by: Doug Ledford 

-- 
Doug Ledford 
  GPG KeyID: 0E572FDD
  http://people.redhat.com/dledford

Infiniband specific RPMs available at
  http://people.redhat.com/dledford/Infiniband



signature.asc
Description: OpenPGP digital signature

Re: [PATCH for-next V2 02/22] IB/core: change pkey table lookups to support full and partial membership for the same pkey

2012-09-11 Thread Doug Ledford

On 8/3/2012 4:40 AM, Jack Morgenstein wrote:
> Enhance the cached and non-cached pkey table lookups to enable limited and 
> full
> members of the same pkey to co-exist in the pkey table.
> 
> This is necessary for SRIOV to allow for a scheme where some guests would 
> have the full
> membership pkey in their virtual pkey table, where other guests on the same 
> hypervisor
> would have the limited one. In that sense, its an extension of the IBTA model 
> for
> non virtualized nodes.

OK, maybe I'm not getting something, but I'm curious why we always pick
the full pkey in preference to the partial pkey.  Shouldn't we pick the
pkey that's appropriate for the vHCA sending the message?

Also, given the rule of least surprise, don't you think it would be best
to rename this function ib_find_cached_full_or_parital_pkey and in your
next patch instead of naming it ib_find_exact_pkey just call that one
ib_find_cached_pkey?

> To accomplish this, we need both the limited and full membership pkeys to be 
> present
> in the master's (hypervisor physical port) pkey table.
> 
> The algorithm for supporting pkey tables which contain both the limited and 
> the full
> membership versions of the same pkey works as follows:
> 
> When scanning the pkey table for a 15 bit pkey:
> 
> A. If there is a full member version of that pkey anywhere
> in the table, return its index (even if a limited-member
> version of the pkey exists earlier in the table).
> 
> B. If the full member version is not in the table,
> but the limited-member version is in the table,
> return the index of the limited pkey.
> 
> Signed-off-by: Liran Liss 
> Signed-off-by: Jack Morgenstein 
> Signed-off-by: Or Gerlitz 
> ---
>  drivers/infiniband/core/cache.c  |   14 +++---
>  drivers/infiniband/core/device.c |   17 +
>  2 files changed, 24 insertions(+), 7 deletions(-)
> 
> diff --git a/drivers/infiniband/core/cache.c b/drivers/infiniband/core/cache.c
> index 9353992..0f2f2b7 100644
> --- a/drivers/infiniband/core/cache.c
> +++ b/drivers/infiniband/core/cache.c
> @@ -167,6 +167,7 @@ int ib_find_cached_pkey(struct ib_device *device,
>   unsigned long flags;
>   int i;
>   int ret = -ENOENT;
> + int partial_ix = -1;
>  
>   if (port_num < start_port(device) || port_num > end_port(device))
>   return -EINVAL;
> @@ -179,10 +180,17 @@ int ib_find_cached_pkey(struct ib_device *device,
>  
>   for (i = 0; i < cache->table_len; ++i)
>   if ((cache->table[i] & 0x7fff) == (pkey & 0x7fff)) {
> - *index = i;
> - ret = 0;
> - break;
> + if (cache->table[i] & 0x8000) {
> + *index = i;
> + ret = 0;
> + break;
> + } else
> + partial_ix = i;
>   }
> + if (ret && partial_ix >= 0) {
> + *index = partial_ix;
> + ret = 0;
> + }
>  
>   read_unlock_irqrestore(&device->cache.lock, flags);
>  
> diff --git a/drivers/infiniband/core/device.c 
> b/drivers/infiniband/core/device.c
> index e711de4..a645c68 100644
> --- a/drivers/infiniband/core/device.c
> +++ b/drivers/infiniband/core/device.c
> @@ -707,18 +707,27 @@ int ib_find_pkey(struct ib_device *device,
>  {
>   int ret, i;
>   u16 tmp_pkey;
> + int partial_ix = -1;
>  
>   for (i = 0; i < device->pkey_tbl_len[port_num - start_port(device)]; 
> ++i) {
>   ret = ib_query_pkey(device, port_num, i, &tmp_pkey);
>   if (ret)
>   return ret;
> -
>   if ((pkey & 0x7fff) == (tmp_pkey & 0x7fff)) {
> - *index = i;
> - return 0;
> + /* if there is full-member pkey take it.*/
> + if (tmp_pkey & 0x8000) {
> + *index = i;
> + return 0;
> +     }
> + if (partial_ix < 0)
> + partial_ix = i;
>   }
>   }
> -
> + /*no full-member, if exists take the limited*/
> + if (partial_ix >= 0) {
> + *index = partial_ix;
> + return 0;
> + }
>   return -ENOENT;
>  }
>  EXPORT_SYMBOL(ib_find_pkey);
> 


-- 
Doug Ledford 
  GPG KeyID: 0E572FDD
  http://people.redhat.com/dledford

Infiniband specific RPMs available at
  http://people.redhat.com/dledford/Infiniband



signature.asc
Description: OpenPGP digital signature

Re: [PATCH for-next V2 03/22] IB/core: Add ib_find_exact_cached_pkey() to search for 16-bit pkey match

2012-09-11 Thread Doug Ledford

On 8/3/2012 4:40 AM, Jack Morgenstein wrote:
> When port pkey table potentially contains both full and partial
> membership copies for the same pkey, we need a function to find
> the exact (16-bit) pkey index.

The code on this patch is fine, just see my previous email about the
function naming...

> This is particularly necessary when the master forwards QP1 MADS
> sent by guests.  If the guest has sent the MAD with a limited
> membership pkey, we wish to forward the MAD using the same limited
> membership pkey.  Since master may have both the limited and
> the full member pkeys in its table, we must make sure to retrieve
> the limited membership pkey in this case.
> 
> This requires the 16-bit pkey lookup function (which includes the
> membership bit).
> 
> Signed-off-by: Jack Morgenstein 
> Signed-off-by: Or Gerlitz 
> ---
>  drivers/infiniband/core/cache.c |   32 
>  include/rdma/ib_cache.h |   16 
>  2 files changed, 48 insertions(+), 0 deletions(-)
> 
> diff --git a/drivers/infiniband/core/cache.c b/drivers/infiniband/core/cache.c
> index 0f2f2b7..d8a8c83 100644
> --- a/drivers/infiniband/core/cache.c
> +++ b/drivers/infiniband/core/cache.c
> @@ -198,6 +198,38 @@ int ib_find_cached_pkey(struct ib_device *device,
>  }
>  EXPORT_SYMBOL(ib_find_cached_pkey);
>  
> +int ib_find_exact_cached_pkey(struct ib_device *device,
> +   u8port_num,
> +   u16   pkey,
> +   u16  *index)
> +{
> + struct ib_pkey_cache *cache;
> + unsigned long flags;
> + int i;
> + int ret = -ENOENT;
> +
> + if (port_num < start_port(device) || port_num > end_port(device))
> + return -EINVAL;
> +
> + read_lock_irqsave(&device->cache.lock, flags);
> +
> + cache = device->cache.pkey_cache[port_num - start_port(device)];
> +
> + *index = -1;
> +
> + for (i = 0; i < cache->table_len; ++i)
> + if (cache->table[i] == pkey) {
> + *index = i;
> + ret = 0;
> + break;
> + }
> +
> + read_unlock_irqrestore(&device->cache.lock, flags);
> +
> + return ret;
> +}
> +EXPORT_SYMBOL(ib_find_exact_cached_pkey);
> +
>  int ib_get_cached_lmc(struct ib_device *device,
> u8port_num,
> u8*lmc)
> diff --git a/include/rdma/ib_cache.h b/include/rdma/ib_cache.h
> index 00a2b8e..ad9a3c2 100644
> --- a/include/rdma/ib_cache.h
> +++ b/include/rdma/ib_cache.h
> @@ -101,6 +101,22 @@ int ib_find_cached_pkey(struct ib_device*device,
>   u16 *index);
>  
>  /**
> + * ib_find_exact_cached_pkey - Returns the PKey table index where a specified
> + *   PKey value occurs. Comparison uses the FULL 16 bits (incl membership 
> bit)
> + * @device: The device to query.
> + * @port_num: The port number of the device to search for the PKey.
> + * @pkey: The PKey value to search for.
> + * @index: The index into the cached PKey table where the PKey was found.
> + *
> + * ib_find_exact_cached_pkey() searches the specified PKey table in
> + * the local software cache.
> + */
> +int ib_find_exact_cached_pkey(struct ib_device*device,
> +   u8   port_num,
> +       u16  pkey,
> +   u16 *index);
> +
> +/**
>   * ib_get_cached_lmc - Returns a cached lmc table entry
>   * @device: The device to query.
>   * @port_num: The port number of the device to query.
> 


-- 
Doug Ledford 
  GPG KeyID: 0E572FDD
  http://people.redhat.com/dledford

Infiniband specific RPMs available at
  http://people.redhat.com/dledford/Infiniband



signature.asc
Description: OpenPGP digital signature

Re: [PATCH for-next V2 04/22] IB/mlx4: SRIOV IB context objects and proxy/tunnel sqp support

2012-09-11 Thread Doug Ledford

On 8/3/2012 4:40 AM, Jack Morgenstein wrote:
> 1. Introduce the basic sriov parvirtualization context objects
>for multiplexing and demultiplexing MADs.
> 2. Introduce support for the new proxy and tunnel QP types.
> 
> This patch introduces the objects required by the master
> for managing QP paravirtualization for guests.
> 
> struct mlx4_ib_sriov{} is created by the master only.
> It is a container for the following:
> 1. All the info required by the PPF to multiplex and de-multiplex MADs
>(including those from the PF). (struct mlx4_ib_demux_ctx demux)

OK, so can we have at least a single reference to the various
abbreviations before using them exclusively?  I know PF and PPF may be
common, but it might be nice that they were used once in full form
before abbreviated in commit messages.

> 2. All the info required to manage alias GUIDs (i.e., the GUID at
>index 0 that each guest perceives.  In fact, this is not the
>GUID which is actually at index 0, but is, in fact, the GUID
>which is at index[] in the physical table.

OK, this has been one of the things that has made reviewing this
difficult.  I freely admit that I've steadfastly ignored SRIOV for as
long as I can, so maybe this is just me.  But, in the context of this
driver, how am I supposed to know which code paths will be on the host
and which on the guest?

Also, I note that you do math every time you want to know if you are on
a parent device or a virtual device.  Do you really want to do math all
the time, or would it be better to save off your status on device init
and just refer to that when you would do math in this patch?

-- 
Doug Ledford 
  GPG KeyID: 0E572FDD
  http://people.redhat.com/dledford

Infiniband specific RPMs available at
  http://people.redhat.com/dledford/Infiniband

signature.asc
Description: OpenPGP digital signature

Re: [PATCH for-next V2 03/22] IB/core: Add ib_find_exact_cached_pkey() to search for 16-bit pkey match

2012-09-11 Thread Doug Ledford

On 8/3/2012 4:40 AM, Jack Morgenstein wrote:
> When port pkey table potentially contains both full and partial
> membership copies for the same pkey, we need a function to find
> the exact (16-bit) pkey index.
> 
> This is particularly necessary when the master forwards QP1 MADS
> sent by guests.  If the guest has sent the MAD with a limited
> membership pkey, we wish to forward the MAD using the same limited
> membership pkey.  Since master may have both the limited and
> the full member pkeys in its table, we must make sure to retrieve
> the limited membership pkey in this case.
> 
> This requires the 16-bit pkey lookup function (which includes the
> membership bit).

As a second note, I would like to know why Intel (previously QLogic)
does not use these functions in their driver and what it would take to
get all drivers to use the functions.  Do we need to add more to them?
In my opinion these should be generally useful and used by all drivers.
 Mike?

-- 
Doug Ledford 
  GPG KeyID: 0E572FDD
  http://people.redhat.com/dledford

Infiniband specific RPMs available at
  http://people.redhat.com/dledford/Infiniband



signature.asc
Description: OpenPGP digital signature

Re: [PATCH for-next V2 03/22] IB/core: Add ib_find_exact_cached_pkey() to search for 16-bit pkey match

2012-09-11 Thread Doug Ledford

On 9/11/2012 3:07 PM, Roland Dreier wrote:
> On Tue, Sep 11, 2012 at 10:12 AM, Doug Ledford  wrote:
>> As a second note, I would like to know why Intel (previously QLogic)
>> does not use these functions in their driver and what it would take to
>> get all drivers to use the functions.  Do we need to add more to them?
>> In my opinion these should be generally useful and used by all drivers.
> 
> Use which functions?  The P_Key lookup functions?
> 
> What would a low-level driver use them for?  I thought these are for
> use by upper-level protocols.

Well, at this point, the mlx4 driver uses them, the rdmacm kernel driver
uses them, and both QLogic/Intel drivers have their own internal pkey
table implementation.  So, it isn't so much upper layer as it is drivers.


-- 
Doug Ledford 
  GPG KeyID: 0E572FDD
  http://people.redhat.com/dledford

Infiniband specific RPMs available at
  http://people.redhat.com/dledford/Infiniband



signature.asc
Description: OpenPGP digital signature

Re: [PATCH for-next V2 03/22] IB/core: Add ib_find_exact_cached_pkey() to search for 16-bit pkey match

2012-09-11 Thread Doug Ledford

On 9/11/2012 4:43 PM, Roland Dreier wrote:
> On Tue, Sep 11, 2012 at 1:34 PM, Doug Ledford  wrote:
>> Well, at this point, the mlx4 driver uses them, the rdmacm kernel driver
>> uses them, and both QLogic/Intel drivers have their own internal pkey
>> table implementation.  So, it isn't so much upper layer as it is drivers.
> 
> rdmacm is an upper-level protocol (it's above the midlayer / hardware
> abstraction).

Yeah, I know.  My point wasn't that it was a low level item, just that
it's the only upper layer consumer that I saw.

> mlx4 and mthca look up P_Keys because of internal details of how they
> send MADs, and really they should move to maintaining their own P_Key
> table too.

Why not make the routines useful for all users instead of multiple
implementations of the same thing?


-- 
Doug Ledford 
  GPG KeyID: 0E572FDD
  http://people.redhat.com/dledford

Infiniband specific RPMs available at
  http://people.redhat.com/dledford/Infiniband



signature.asc
Description: OpenPGP digital signature

Re: [PATCH for-next V2 02/22] IB/core: change pkey table lookups to support full and partial membership for the same pkey

2012-09-12 Thread Doug Ledford

On 09/12/2012 03:56 AM, Jack Morgenstein wrote:
> On the Hypervisor, however, we assume that if both versions of the pkey are 
> in its pkey table,
> then for its own infiniband operation (as opposed to performing its pkey 
> virtualizing function),
> it should operate with the highest membership type in its table for a given 
> 15-bit pkey.

That's what I was looking for.  So, how can you know this assumption is
correct?  It seems to me that if someone wanted to restrict membership
of the hypervisor as part of a security lockdown, then give full
membership to a guest because that guest is some high security, single
task guest, then this assumption would break things (the user would be
able to assign the full membership key to the guest OK, but regardless
of how they wanted the hypervisor to be subscribed to that particular
pkey, it would always get the full membership from the guest).

>> Shouldn't we pick the
>> pkey that's appropriate for the vHCA sending the message?
> 
> We do. When QPs on the guests are created, the modify-qp commands are not 
> executed on the guest,
> but rather are passed to the PPF for processing. The PPF replaces the 
> guest-provided virtual pkey-index
> value with the appropriate physical pkey-index value.  See procedure 
> "update_pkey_index" in file
> resource_tracker.c, and all the places it is called (i.e., in the wrapper 
> functions for the various
> modify-qp firmware commands).

That's fine for the guest, but I don't see how this solves the
assumption issue.  My concern is that I could see a valid scenario that
a user might want to implement for security reasons that I think makes
your assumption above incorrect.

-- 
Doug Ledford 
  GPG KeyID: 0E572FDD
  http://people.redhat.com/dledford

signature.asc
Description: OpenPGP digital signature

Problem running rping over Intel adapters

2012-10-03 Thread Doug Ledford

We ran into this problem when testing rping over Intel/QLogic hardware:

[root@rdmaperf3 ~]# rping -s -a 172.31.2.103 -v
wait for CONNECTED state 10
connect error -1
cma event RDMA_CM_EVENT_REJECTED, error 28
[root@rdmaperf3 ~]#


[root@rdmaperf8 ~]# rping -c -a 172.31.2.103 -v -C 5
cma event RDMA_CM_EVENT_CONNECT_ERROR, error -1
wait for CONNECTED state 4
connect error -1
[root@rdmaperf8 ~]#

Turns out this is because of a couple things:

1) rping, on the client side, clears the conn_params for the newly to be
attempted connection, then sets:

conn_param.responder_resources = 1;
conn_param.initiator_depth = 1;
conn_param.retry_count = 10;

On the accept side, rping clears the conn_params and then sets just the
responder_resources and initiator_depth, without even checking the
incoming requested conn_param values from the incoming cm_id.  So, OK,
you can get away with that since this is a simple test program, but
still not "best programming practices".  However, the important part
here is the retry_count of 10.  That won't work on Intel/QLogic hardware.

2) the qib driver enforces a maximum of 7 for retry_count.  I don't see
anything in the spec that specifies a maximum for this entry, and in
particular I know it doesn't call out for 7 to mean infinite retries
like it does for rnr_retry_count.

I don't think the spec really cares how we solve this, and I don't think
there is a hard limit of 7 for the retry_count like the qib driver
enforces.  On the other hand, the spec doesn't call out a limit on the
retry_count but I would assume each driver has the option to implement
their own "reasonable, implementation defined" limit in a case like this.

So, do we make qib more liberal in its acceptance of retry_count or do
we fix rping to use a smaller number?  Matters not to me...

-- 
Doug Ledford 
  GPG KeyID: 0E572FDD
  http://people.redhat.com/dledford

Infiniband specific RPMs available at
  http://people.redhat.com/dledford/Infiniband



signature.asc
Description: OpenPGP digital signature

Re: Problem running rping over Intel adapters

2012-10-03 Thread Doug Ledford

On 10/3/2012 5:50 PM, Hefty, Sean wrote:
>> 1) rping, on the client side, clears the conn_params for the newly to be
>> attempted connection, then sets:
>>
>>  conn_param.responder_resources = 1;
>>  conn_param.initiator_depth = 1;
>>  conn_param.retry_count = 10;
> 
> 
>> 2) the qib driver enforces a maximum of 7 for retry_count.  I don't see
>> anything in the spec that specifies a maximum for this entry, and in
>> particular I know it doesn't call out for 7 to mean infinite retries
>> like it does for rnr_retry_count.
> 
> On IB, retry count is a 3-bit value.  See the CM REQ message.
> 
> The kernel rdma cm should probably check for a valid value and reduce it 
> accordingly.

Then it's also fair to say that right now rping passes a known invalid
value and needs fixed ;-)


-- 
Doug Ledford 
  GPG KeyID: 0E572FDD
  http://people.redhat.com/dledford

Infiniband specific RPMs available at
  http://people.redhat.com/dledford/Infiniband



signature.asc
Description: OpenPGP digital signature

Re: Problem running rping over Intel adapters

2012-10-03 Thread Doug Ledford

On 10/3/2012 6:02 PM, Hefty, Sean wrote:
>>>> 1) rping, on the client side, clears the conn_params for the 
>>>> newly to be attempted connection, then sets:
>>>> 
>>>> conn_param.responder_resources = 1; conn_param.initiator_depth 
>>>> = 1; conn_param.retry_count = 10;
>>> 
>>> 
>>>> 2) the qib driver enforces a maximum of 7 for retry_count.  I 
>>>> don't see anything in the spec that specifies a maximum for 
>>>> this entry, and in particular I know it doesn't call out for 7 
>>>> to mean infinite retries like it does for rnr_retry_count.
>>> 
>>> On IB, retry count is a 3-bit value.  See the CM REQ message.
>>> 
>>> The kernel rdma cm should probably check for a valid value and 
>>> reduce it
>> accordingly.
>> 
>> Then it's also fair to say that right now rping passes a known 
>> invalid value and needs fixed ;-)
> 
> For IB, but I wasn't sure about iWarp.  It doesn't look like it's 
> used there though.

I would hope it's not used on iWARPit's running over it's
own link layer where this is neither defined nor applicable.

> If not, I'll apply a patch to rping, plus submit a fix for the
> kernel to validate the value.




-- 
Doug Ledford 
  GPG KeyID: 0E572FDD
  http://people.redhat.com/dledford

Infiniband specific RPMs available at
  http://people.redhat.com/dledford/Infiniband



signature.asc
Description: OpenPGP digital signature

[Patch libibverbs 0/5] Minor updates

2012-10-14 Thread Doug Ledford

This is a short series of fixes for minor issues.  The most important
being a memcpy buffer overrun on ppc arch for the compatibility wrapper.
The rest are usability fixes for the example programs.

Doug Ledford (5):
  Add an error when the user specifies an invalid port
  Don't allow port == 0 as an option
  Fix the compatibility wrapper on ppc
  Don't print link phys state on iWARP
  Don't try to send UD messages larger than MTU

 examples/devinfo.c |   15 +++
 examples/ud_pingpong.c |   14 ++
 src/compat-1_0.c   |   42 --
 3 files changed, 65 insertions(+), 6 deletions(-)

-- 
1.7.7.6

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[Patch libibverbs 1/5] Add an error when the user specifies an invalid port

2012-10-14 Thread Doug Ledford

Signed-off-by: Doug Ledford 
---
 examples/devinfo.c |8 +++-
 1 files changed, 7 insertions(+), 1 deletions(-)

diff --git a/examples/devinfo.c b/examples/devinfo.c
index 7dc0463..b1a3b2e 100644
--- a/examples/devinfo.c
+++ b/examples/devinfo.c
@@ -218,10 +218,16 @@ static int print_hca_cap(struct ibv_device *ib_dev, 
uint8_t ib_port)
goto cleanup;
}
if (ibv_query_device(ctx, &device_attr)) {
-   fprintf(stderr, "Failed to query device props");
+   fprintf(stderr, "Failed to query device props\n");
rc = 2;
goto cleanup;
}
+   if (ib_port && ib_port > device_attr.phys_port_cnt) {
+   fprintf(stderr, "Invalid port requested for device\n");
+   /* rc = 3 is taken by failure to clean up */
+   rc = 4;
+   goto cleanup;
+   }
 
printf("hca_id:\t%s\n", ibv_get_device_name(ib_dev));
printf("\ttransport:\t\t\t%s (%d)\n",
-- 
1.7.7.6

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[Patch libibverbs 4/5] Don't print link phys state on iWARP

2012-10-14 Thread Doug Ledford

The physical link state on iWARP transports has no meaning, so
don't print it out at all.

Signed-off-by: Doug Ledford 
---
 examples/devinfo.c |5 +++--
 1 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/examples/devinfo.c b/examples/devinfo.c
index d6e9218..ff078e4 100644
--- a/examples/devinfo.c
+++ b/examples/devinfo.c
@@ -327,8 +327,9 @@ static int print_hca_cap(struct ibv_device *ib_dev, uint8_t 
ib_port)
   width_str(port_attr.active_width), 
port_attr.active_width);
printf("\t\t\tactive_speed:\t\t%s (%d)\n",
   speed_str(port_attr.active_speed), 
port_attr.active_speed);
-   printf("\t\t\tphys_state:\t\t%s (%d)\n",
-  port_phy_state_str(port_attr.phys_state), 
port_attr.phys_state);
+   if (ib_dev->transport_type == IBV_TRANSPORT_IB)
+   printf("\t\t\tphys_state:\t\t%s (%d)\n",
+  
port_phy_state_str(port_attr.phys_state), port_attr.phys_state);
 
if (print_all_port_gids(ctx, port, 
port_attr.gid_tbl_len))
goto cleanup;
-- 
1.7.7.6

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[Patch libibverbs 3/5] Fix the compatibility wrapper on ppc

2012-10-14 Thread Doug Ledford

The ppc arch packs the work request struct 1.0 in such a way that
a straight memcpy won't work.  Instead, break the copy out into
chunks whenever the sizes don't match for given portions of the
struct.

Found by built in gcc memcpy buffer overflow checks.

Help on the right fix provided by Jakub Jelinek from the gcc
team inside Red Hat.

Signed-off-by: Doug Ledford 
---
 src/compat-1_0.c |   42 --
 1 files changed, 40 insertions(+), 2 deletions(-)

diff --git a/src/compat-1_0.c b/src/compat-1_0.c
index fff6412..36884bb 100644
--- a/src/compat-1_0.c
+++ b/src/compat-1_0.c
@@ -353,8 +353,46 @@ static int post_send_wrapper_1_0(struct ibv_qp_1_0 *qp, 
struct ibv_send_wr_1_0 *
real_wr->wr_id = w->wr_id;
real_wr->next  = NULL;
 
-   memcpy(&real_wr->sg_list, &w->sg_list,
-  sizeof *w - offsetof(struct ibv_send_wr, sg_list));
+#define TEST_SIZE_2_POINT(f1, f2) \
+((offsetof(struct ibv_send_wr, f1) - offsetof(struct ibv_send_wr, f2)) \
+== offsetof(struct ibv_send_wr_1_0, f1) - offsetof(struct ibv_send_wr_1_0, f2))
+#define TEST_SIZE_TO_END(f1) \
+((sizeof(struct ibv_send_wr) - offsetof(struct ibv_send_wr, f1)) == \
+ (sizeof(struct ibv_send_wr_1_0) - offsetof(struct ibv_send_wr_1_0, f1)))
+
+   if (TEST_SIZE_TO_END (sg_list))
+   memcpy(&real_wr->sg_list, &w->sg_list, sizeof *real_wr
+  - offsetof(struct ibv_send_wr, sg_list));
+   else if (TEST_SIZE_2_POINT (imm_data, sg_list) &&
+TEST_SIZE_TO_END (wr)) {
+   /* we have alignment up to wr, but padding between
+* imm_data and wr, and we know wr itself is the
+* same size */
+   memcpy(&real_wr->sg_list, &w->sg_list,
+  offsetof(struct ibv_send_wr, imm_data) -
+  offsetof(struct ibv_send_wr, sg_list) +
+  sizeof real_wr->imm_data);
+   memcpy(&real_wr->wr, &w->wr, sizeof real_wr->wr);
+   } else {
+   real_wr->sg_list = w->sg_list;
+   real_wr->num_sge = w->num_sge;
+   real_wr->opcode = w->opcode;
+   real_wr->send_flags = w->send_flags;
+   real_wr->imm_data = w->imm_data;
+   if (TEST_SIZE_TO_END (wr))
+   memcpy(&real_wr->wr, &w->wr,
+  sizeof real_wr->wr);
+   else {
+   real_wr->wr.atomic.remote_addr = 
+   w->wr.atomic.remote_addr;
+   real_wr->wr.atomic.compare_add = 
+   w->wr.atomic.compare_add;
+   real_wr->wr.atomic.swap = 
+   w->wr.atomic.swap;
+   real_wr->wr.atomic.rkey = 
+   w->wr.atomic.rkey;
+   }
+   }
 
if (is_ud)
real_wr->wr.ud.ah = w->wr.ud.ah->real_ah;
-- 
1.7.7.6

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[Patch libibverbs 2/5] Don't allow port == 0 as an option

2012-10-14 Thread Doug Ledford

We special case port == 0 to mean all ports, and it's the default,
so if a user passes in 0, they likely meant 1 instead.  Throw an
error because they probably didn't mean to specify the default
behavior of scan all ports.  Path of least surprise and all that.

Signed-off-by: Doug Ledford 
---
 examples/devinfo.c |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/examples/devinfo.c b/examples/devinfo.c
index b1a3b2e..d6e9218 100644
--- a/examples/devinfo.c
+++ b/examples/devinfo.c
@@ -385,7 +385,7 @@ int main(int argc, char *argv[])
 
case 'i':
ib_port = strtol(optarg, NULL, 0);
-   if (ib_port < 0) {
+   if (ib_port <= 0) {
usage(argv[0]);
return 1;
}
-- 
1.7.7.6

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[Patch libibverbs 5/5] Don't try to send UD messages larger than MTU

2012-10-14 Thread Doug Ledford

The UD protocol doesn't support message sizes larger than the path
MTU.  We don't go so far as to check path MTU, but we do check port
MTU.  This prevents failed runs of the pingpong_ud program with large
MTUs.

Signed-off-by: Doug Ledford 
---
 examples/ud_pingpong.c |   14 ++
 1 files changed, 14 insertions(+), 0 deletions(-)

diff --git a/examples/ud_pingpong.c b/examples/ud_pingpong.c
index 71b152d..9a1ae13 100644
--- a/examples/ud_pingpong.c
+++ b/examples/ud_pingpong.c
@@ -323,6 +323,20 @@ static struct pingpong_context *pp_init_ctx(struct 
ibv_device *ib_dev, int size,
ibv_get_device_name(ib_dev));
return NULL;
}
+   {
+   struct ibv_port_attr port_info = { 0 };
+   int mtu;
+
+   if (ibv_query_port(ctx->context, port, &port_info)) {
+   fprintf(stderr, "Unable to query port info for port 
%d\n", port);
+   return NULL;
+   }
+   mtu = 1 << (port_info.active_mtu + 7);
+   if (size > mtu) {
+   fprintf(stderr, "Requested size larger than port MTU 
(%d)\n", mtu);
+   return NULL;
+   }
+   }
 
if (use_event) {
ctx->channel = ibv_create_comp_channel(ctx->context);
-- 
1.7.7.6

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[Patch] ib/mlx4: Fix build error on platforms where UL is not 64bit

2012-10-16 Thread Doug Ledford

Line 110 uses UL as a compiler cast for the 0x constant, but
it's not large enough to hold a 64bit value on 32bit arches.
Use ULL instead.

Signed-off-by: Doug Ledford 
---
 drivers/infiniband/hw/mlx4/alias_GUID.c |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/drivers/infiniband/hw/mlx4/alias_GUID.c 
b/drivers/infiniband/hw/mlx4/alias_GUID.c
index d2fb38d..cb790a8 100644
--- a/drivers/infiniband/hw/mlx4/alias_GUID.c
+++ b/drivers/infiniband/hw/mlx4/alias_GUID.c
@@ -107,7 +107,7 @@ static __be64 get_cached_alias_guid(struct mlx4_ib_dev 
*dev, int port, int index
 {
if (index >= NUM_ALIAS_GUID_PER_PORT) {
pr_err("%s: ERROR: asked for index:%d\n", __func__, index);
-   return  (__force __be64) ((u64) 0xUL);
+   return  (__force __be64) ((u64) 0xULL);
}
return *(__be64 *)&dev->sriov.demux[port - 1].guid_cache[index];
 }
-- 
1.7.7.6

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 3/4 for opensm] /etc/init.d/opensmd: Improve systemd integration

2012-10-24 Thread Doug Ledford

On 10/24/2012 9:27 AM, Alex Netes wrote:
> Hi Bart,
> 
> On 16:44 Fri 21 Sep , Bart Van Assche wrote:
>> On recent SLES and openSUSE systems it is necessary to read the
>> file /lib/lsb/init-functions instead of /etc/rc.status for proper
>> integration with systemd. Certain implementations of killproc
>> need a pidfile so make sure that one gets created. Also, correct
>> the opensm service dependency list.
>>
>> Signed-off-by: Bart Van Assche 
>> Cc: Doug Ledford 
>> ---
>>  configure.in  |6 ++
>>  scripts/opensm.init.in|   26 ++
>>  scripts/redhat-opensm.init.in |2 ++
>>  3 files changed, 26 insertions(+), 8 deletions(-)
>>
>> diff --git a/configure.in b/configure.in
>> index f660a5b..4515ae2 100644
>> --- a/configure.in
>> +++ b/configure.in
>> @@ -11,6 +11,12 @@ AM_INIT_AUTOMAKE
>>  AC_SUBST(RELEASE, ${RELEASE:-unknown})
>>  AC_SUBST(TARBALL, ${TARBALL:-${PACKAGE}-${VERSION}.tar.gz})
>>  
>> +default_rdma_service=openibd
>> +AC_ARG_WITH([rdma_service],
>> +AC_HELP_STRING([--with-rdma-service=name],
>> +   [name of the RDMA service: "rdma" when using 
>> /etc/init.d/rdma to start RDMA services; "openibd" when using 
>> /etc/init.d/openibd to start RDMA services 
>> [default=${default_rdma_service}]]))
>> +AC_SUBST(RDMA_SERVICE, ${with_rdma_service:-${default_rdma_service}})
>> +
> 
> There is also an option to run opensm with ibsim. In that case neither rdma
> nor openibd should run.

ibsim is unlikely to be used by end users or system administrators in a
normal startup environment.  I doubt it needs to be accounted for in the
init script.  Most users simply don't even know enough about ibsim to be
able to use it at all.

>>  if { rpm -q sles-release || rpm -q openSUSE-release; } >/dev/null 2>&1; then
>> default_start="2 3 5"
>> default_stop="0 1 4 6"
>> diff --git a/scripts/opensm.init.in b/scripts/opensm.init.in
>> index 007dae4..ddd1d6b 100644
>> --- a/scripts/opensm.init.in
>> +++ b/scripts/opensm.init.in
>> @@ -7,8 +7,8 @@
>>  #
>>  ### BEGIN INIT INFO
>>  # Provides: opensm
>> -# Required-Start: $syslog
>> -# Required-Stop:
>> +# Required-Start: $syslog @RDMA_SERVICE@
>> +# Required-Stop: $syslog @RDMA_SERVICE@
>>  # Default-Start: @DEFAULT_START@
>>  # Default-Stop: @DEFAULT_STOP@
>>  # Description:  Manage OpenSM
>> @@ -42,14 +42,22 @@
>>  
>>  prefix=@prefix@
>>  exec_prefix=@exec_prefix@
>> +pidfile=/var/run/opensm.pid
>>  
>>  # Source function library.
>>  if [[ -s /etc/init.d/functions ]]; then
>> +# RHEL / CentOS / SL / Fedora.
>>  . /etc/init.d/functions
>>  rc_status() { :; }
>>  rc_exit() { exit $RETVAL; }
>> -fi
>> -if [[ -s /etc/rc.status ]]; then
>> +elif [[ -s /lib/lsb/init-functions ]]; then
>> +# SLES / openSuSE / Debian.
>> +. /lib/lsb/init-functions
>> +rc_exit() { exit $RETVAL; }
>> +success() { log_success_msg; }
>> +failure() { log_failure_msg; }
>> +elif [[ -s /etc/rc.status ]]; then
>> +# Older SuSE systems.
>>  . /etc/rc.status
>>  failure() { rc_status -v; }
>>  success() { rc_status -v; }
>> @@ -61,10 +69,13 @@ if [[ -s $CONFIG ]]; then
>>  fi
>>  
>>  start () {
>> +if [ -e $pidfile ]; then
> 
> On opensm segfault (happens one in a while :), pidfile won't be removed, so
> you won't be able to start the opensm again. I guess that same thing can
> happen on warm reboot.

The stop action in the script should handle cleanup for you.  If it's
called, and a pid file exists, but the pid is not running, it should
remove the pid file and the subsystem lock file so that a clean start works.

>> +echo Already started
>> +return 1
>> +fi
> 
> -- Alex
> --
> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 


-- 
Doug Ledford 
  GPG KeyID: 0E572FDD
  http://people.redhat.com/dledford

Infiniband specific RPMs available at
  http://people.redhat.com/dledford/Infiniband



signature.asc
Description: OpenPGP digital signature

Re: [PATCH 3/4 for opensm] /etc/init.d/opensmd: Improve systemd integration

2012-10-24 Thread Doug Ledford

On 10/24/2012 10:45 AM, Bart Van Assche wrote:
> On 10/24/12 16:33, Doug Ledford wrote:
>> On 10/24/2012 9:27 AM, Alex Netes wrote:
>>> On 16:44 Fri 21 Sep , Bart Van Assche wrote:
>>>>   start () {
>>>> +if [ -e $pidfile ]; then
>>>
>>> On opensm segfault (happens one in a while :), pidfile won't be
>>> removed, so
>>> you won't be able to start the opensm again. I guess that same thing can
>>> happen on warm reboot.
>>
>> The stop action in the script should handle cleanup for you.  If it's
>> called, and a pid file exists, but the pid is not running, it should
>> remove the pid file and the subsystem lock file so that a clean start
>> works.
> 
> Hello Doug,
> 
> Do we really need the lock file ? On some Linux systems (Ubuntu) the
> /var/lock/subsys directory does not even exist.
> 
> Bart.
> 

It's an old holdover from LSB compliance.  *I* don't personally care one
way or another, but it's supposed to be there for ancient scripts that
look there to find out if a service is running.  Probably none of those
ancient scripts look for opensm as it wasn't around back when they were
written and the "new and improved" way of checking a service in LSB is
to run the init script with the status command.  But, that's why I have
it there.

-- 
Doug Ledford 
  GPG KeyID: 0E572FDD
  http://people.redhat.com/dledford

Infiniband specific RPMs available at
  http://people.redhat.com/dledford/Infiniband



signature.asc
Description: OpenPGP digital signature

Re: [PATCH] opensm/configure.in: Remove Default-Start from opensmd init script

2013-01-30 Thread Doug Ledford

On 01/30/13 03:59, Bart Van Assche wrote:
> On 01/29/13 18:18, Alex Netes wrote:
>> During opensm RPM packaging, `chkconfig --add opensmd` is called.
>> `chkconfig --add` creates the appropriate entry as specified by the
>> default values in the init script. Having opensmd run by default on boot
>> isn't desired.
>>
>> Signed-off-by: Alex Netes 
>> ---
>>   configure.in | 5 +++--
>>   1 file changed, 3 insertions(+), 2 deletions(-)
>>
>> diff --git a/configure.in b/configure.in
>> index 4515ae2..f798be2 100644
>> --- a/configure.in
>> +++ b/configure.in
>> @@ -18,12 +18,13 @@ AC_ARG_WITH([rdma_service],
>>   AC_SUBST(RDMA_SERVICE, ${with_rdma_service:-${default_rdma_service}})
>>
>>   if { rpm -q sles-release || rpm -q openSUSE-release; } >/dev/null
>> 2>&1; then
>> -   default_start="2 3 5"
>>  default_stop="0 1 4 6"
>>   else
>> -   default_start="2 3 4 5"
>>  default_stop="0 1 6"
>>   fi
>> +
>> +default_start="null"
>> +
>>   AC_SUBST(DEFAULT_START, $default_start)
>>   AC_SUBST(DEFAULT_STOP, $default_stop)
> 
> Sorry but this patch doesn't make sense to me. This patch will prevent
> anyone to enable opensm to run during boot via chkconfig.

It does not.  It means that chkconfig add does not create any start
links, just all kill links.  To actually enabled it, chkconfig --level
2345 opensmd on will still work.  Which was the intent: leave it off by
default, enabled by the admin.

> How about
> replacing the above by the (untested) patch below ?
> 
> iff --git a/opensm.spec.in b/opensm.spec.in
> index 6ae525b..2325b70 100644
> --- a/opensm.spec.in
> +++ b/opensm.spec.in
> @@ -107,13 +107,6 @@ rm -rf $RPM_BUILD_ROOT
> 
>  %post
>  if [ $1 = 1 ]; then
> -if [ -e /sbin/chkconfig ]; then
> -/sbin/chkconfig --add opensmd
> -elif [ -e /usr/sbin/update-rc.d ]; then
> -/usr/sbin/update-rc.d opensmd defaults
> -else
> -/usr/lib/lsb/install_initd /etc/init.d/opensmd
> -fi
>  if type systemctl >/dev/null 2>&1; then
>  systemctl --system daemon-reload
>  fi

This is not the correct fix.  It leaves the init script unadded to the
system, which means chkconfig --list will not show it properly until the
user manually adds it later.


-- 
Doug Ledford 
  GPG KeyID: 0E572FDD
  http://people.redhat.com/dledford

Infiniband specific RPMs available at
  http://people.redhat.com/dledford/Infiniband
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] opensm/configure.in: Remove Default-Start from opensmd init script

2013-01-30 Thread Doug Ledford

On 1/30/2013 11:00 AM, Bart Van Assche wrote:
> On 01/30/13 16:43, Doug Ledford wrote:
>> On 01/30/13 03:59, Bart Van Assche wrote:
>>> On 01/29/13 18:18, Alex Netes wrote:
>>>> During opensm RPM packaging, `chkconfig --add opensmd` is called.
>>>> `chkconfig --add` creates the appropriate entry as specified by the
>>>> default values in the init script. Having opensmd run by default on
>>>> boot
>>>> isn't desired.
>>>>
>>>> Signed-off-by: Alex Netes 
>>>> ---
>>>>configure.in | 5 +++--
>>>>1 file changed, 3 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/configure.in b/configure.in
>>>> index 4515ae2..f798be2 100644
>>>> --- a/configure.in
>>>> +++ b/configure.in
>>>> @@ -18,12 +18,13 @@ AC_ARG_WITH([rdma_service],
>>>>AC_SUBST(RDMA_SERVICE,
>>>> ${with_rdma_service:-${default_rdma_service}})
>>>>
>>>>if { rpm -q sles-release || rpm -q openSUSE-release; } >/dev/null
>>>> 2>&1; then
>>>> -   default_start="2 3 5"
>>>>   default_stop="0 1 4 6"
>>>>else
>>>> -   default_start="2 3 4 5"
>>>>   default_stop="0 1 6"
>>>>fi
>>>> +
>>>> +default_start="null"
>>>> +
>>>>AC_SUBST(DEFAULT_START, $default_start)
>>>>AC_SUBST(DEFAULT_STOP, $default_stop)
>>>
>>> Sorry but this patch doesn't make sense to me. This patch will prevent
>>> anyone to enable opensm to run during boot via chkconfig.
>>
>> It does not.  It means that chkconfig add does not create any start
>> links, just all kill links.  To actually enabled it, chkconfig --level
>> 2345 opensmd on will still work.  Which was the intent: leave it off by
>> default, enabled by the admin.
>>
>>> How about
>>> replacing the above by the (untested) patch below ?
>>>
>>> iff --git a/opensm.spec.in b/opensm.spec.in
>>> index 6ae525b..2325b70 100644
>>> --- a/opensm.spec.in
>>> +++ b/opensm.spec.in
>>> @@ -107,13 +107,6 @@ rm -rf $RPM_BUILD_ROOT
>>>
>>>   %post
>>>   if [ $1 = 1 ]; then
>>> -if [ -e /sbin/chkconfig ]; then
>>> -/sbin/chkconfig --add opensmd
>>> -elif [ -e /usr/sbin/update-rc.d ]; then
>>> -/usr/sbin/update-rc.d opensmd defaults
>>> -else
>>> -/usr/lib/lsb/install_initd /etc/init.d/opensmd
>>> -fi
>>>   if type systemctl >/dev/null 2>&1; then
>>>   systemctl --system daemon-reload
>>>   fi
>>
>> This is not the correct fix.  It leaves the init script unadded to the
>> system, which means chkconfig --list will not show it properly until the
>> user manually adds it later.
> 
> Which convention is followed for other packages ? This is what I found in
> the Fedora 18 iscsi-initiator-utils package
> (http://be.mirror.eurid.eu/fedora/linux/releases/18/Fedora/source/SRPMS/i/iscsi-initiator-utils-6.2.0.872-19.fc18.src.rpm):
> 
> * iscsid.init: Default-Start: 3 4 5
> * iscsi-initiator-utils.spec:

Okay, first off, any package that still uses the SysV initscripts as of
Fedora 18 is not what I would call a package that is keeping up with the
Fedora packaging guidelines or Fedora technologies.  As such, I'm not
really sure you want to use it as an example of a good package.
However, that being said, you will note in this spec file that the iSCSI
initiator package does exactly what you removed, or suggested be
removed, from the opensmd spec file.  It unilaterally adds the
initscript to the system.  The default start/stop settings are
different, but the add action is the same.

All initscripts should be added to the system, regardless of their
default start/stop settings, and the default-start and default-stop
should be used to control *how* they are added by default, and chkconfig
--level .*  [on|off] should be used to control whether or
not they are on or off differently than their default settings.

> %post
> /sbin/ldconfig
> if [ "$1" -eq "1" ]; then
> if [ ! -f %{_sysconfdir}/iscsi/initiatorname.iscsi ]; then
> echo "InitiatorName=`/sbin/iscsi-iname`" >
> %{_sysconfdir}/iscsi/initiatorname.iscsi
> fi
> /sbin/chkconfig --add iscsid
> /sbin/chkconfig --add iscsi
> fi
> 
> Bart.




signature.asc
Description: OpenPGP digital signature

Re: [PATCH] opensm/configure.in: Remove Default-Start from opensmd init script

2013-01-30 Thread Doug Ledford

On 1/30/2013 2:12 PM, Bart Van Assche wrote:
> On 01/30/13 18:48, Doug Ledford wrote:
>> On 1/30/2013 11:00 AM, Bart Van Assche wrote:
>>> Which convention is followed for other packages ? This is what I
>>> found in
>>> the Fedora 18 iscsi-initiator-utils package
>>> (http://be.mirror.eurid.eu/fedora/linux/releases/18/Fedora/source/SRPMS/i/iscsi-initiator-utils-6.2.0.872-19.fc18.src.rpm):
>>>
>>>
>>> * iscsid.init: Default-Start: 3 4 5
>>> * iscsi-initiator-utils.spec:
>>
>> Okay, first off, any package that still uses the SysV initscripts as of
>> Fedora 18 is not what I would call a package that is keeping up with the
>> Fedora packaging guidelines or Fedora technologies.  As such, I'm not
>> really sure you want to use it as an example of a good package.
>> However, that being said, you will note in this spec file that the iSCSI
>> initiator package does exactly what you removed, or suggested be
>> removed, from the opensmd spec file.  It unilaterally adds the
>> initscript to the system.  The default start/stop settings are
>> different, but the add action is the same.
>>
>> All initscripts should be added to the system, regardless of their
>> default start/stop settings, and the default-start and default-stop
>> should be used to control *how* they are added by default, and chkconfig
>> --level .*  [on|off] should be used to control whether or
>> not they are on or off differently than their default settings.
> 
> Thanks for the detailed reply. Regarding the purpose of the patch at the
> start of this thread: do you know whether it is LSB-compliant to use
> "Default-Start: null" or should "Default-Start" be left out entirely in
> order not to create the start links ? See e.g.
> http://refspecs.linuxbase.org/LSB_3.2.0/LSB-Core-generic/LSB-Core-generic/initscrcomconv.html.
> 
> 
> Bart.

I've always used "Default-start:" without any listed levels, but I
haven't really tested it either and the spec is not specific about this
particular possible configuration.




signature.asc
Description: OpenPGP digital signature

Re: [PATCH] opensm/configure.in: Remove Default-Start from opensmd init script

2013-01-31 Thread Doug Ledford

On 01/31/13 02:21, Alex Netes wrote:
> On 14:24 Wed 30 Jan , Doug Ledford wrote:
>> On 1/30/2013 2:12 PM, Bart Van Assche wrote:
>>> On 01/30/13 18:48, Doug Ledford wrote:
>>>> On 1/30/2013 11:00 AM, Bart Van Assche wrote:
>>>>> Which convention is followed for other packages ? This is what I
>>>>> found in
>>>>> the Fedora 18 iscsi-initiator-utils package
>>>>> (http://be.mirror.eurid.eu/fedora/linux/releases/18/Fedora/source/SRPMS/i/iscsi-initiator-utils-6.2.0.872-19.fc18.src.rpm):
>>>>>
>>>>>
>>>>> * iscsid.init: Default-Start: 3 4 5
>>>>> * iscsi-initiator-utils.spec:
>>>>
>>>> Okay, first off, any package that still uses the SysV initscripts as of
>>>> Fedora 18 is not what I would call a package that is keeping up with the
>>>> Fedora packaging guidelines or Fedora technologies.  As such, I'm not
>>>> really sure you want to use it as an example of a good package.
>>>> However, that being said, you will note in this spec file that the iSCSI
>>>> initiator package does exactly what you removed, or suggested be
>>>> removed, from the opensmd spec file.  It unilaterally adds the
>>>> initscript to the system.  The default start/stop settings are
>>>> different, but the add action is the same.
>>>>
>>>> All initscripts should be added to the system, regardless of their
>>>> default start/stop settings, and the default-start and default-stop
>>>> should be used to control *how* they are added by default, and chkconfig
>>>> --level .*  [on|off] should be used to control whether or
>>>> not they are on or off differently than their default settings.
>>>
>>> Thanks for the detailed reply. Regarding the purpose of the patch at the
>>> start of this thread: do you know whether it is LSB-compliant to use
>>> "Default-Start: null" or should "Default-Start" be left out entirely in
>>> order not to create the start links ? See e.g.
>>> http://refspecs.linuxbase.org/LSB_3.2.0/LSB-Core-generic/LSB-Core-generic/initscrcomconv.html.
>>>
>>>
>>> Bart.
>>
>> I've always used "Default-start:" without any listed levels, but I
>> haven't really tested it either and the spec is not specific about this
>> particular possible configuration.
>>
>>
> 
> It's how the script was defined before. The behavior on RHEL and SLES is
> different when not specifying "Default-start:". On RHEL, `chkconfig opensmd 
> on`
> adds the service to the default runlevels: 2 3 4 5. While on SLES it doesn't.

That sounds like a bug in SLES to be honest.  chkconfig opensmd on
without --levels should follow the Default-Start: item (just like
chkconfig add opensmd).


-- 
Doug Ledford 
  GPG KeyID: 0E572FDD
  http://people.redhat.com/dledford

Infiniband specific RPMs available at
  http://people.redhat.com/dledford/Infiniband
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH] ipoib: fix hard_header return value

2013-03-26 Thread Doug Ledford

If you have a patched up dhcp server (and dhclient), they will use
AF_PACKET/SOCK_DGRAM pair to send dhcp packets over IPoIB.  This has
worked since forever if you use OFED kernels or one of the distribution
kernels.  However, when testing an upstream kernel, it has been broken
for a very long time (I tested 2.6.34, 2.6.38, 3.0, 3.1, 3.8, HEAD).

It turns out that the hard_header routine in ipoib is not following
the API and is returning 0 even when it pushed data onto the skb.  This
then causes af_packet.c to overwrite the header just pushed with data
from user space.  This header is immediately referenced in the
ipoib_start_xmit routine, so I'm wondering how this ever worked in
distro/ofed kernels that also have this bug, but fixing the bug here
makes things work in upstream kernels.

Signed-off-by: Doug Ledford 
---
 drivers/infiniband/ulp/ipoib/ipoib_main.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c 
b/drivers/infiniband/ulp/ipoib/ipoib_main.c
index 8534afd..31dd2a7 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_main.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c
@@ -828,7 +828,7 @@ static int ipoib_hard_header(struct sk_buff *skb,
 */
memcpy(cb->hwaddr, daddr, INFINIBAND_ALEN);
 
-   return 0;
+   return sizeof *header;
 }
 
 static void ipoib_set_mcast_list(struct net_device *dev)
-- 
1.8.1.2

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] ipoib: fix hard_header return value

2013-03-26 Thread Doug Ledford

On 03/26/2013 12:46 PM, Roland Dreier wrote:
>> Checkpatch recommends since some time to use sizeof(e) instead of sizeof e,
>> isn't it ?
> 
> I actually prefer "sizeof e" since sizeof is an operator, not a
> function.  "sizeof(e)" looks just as silly as "return(e)" to me.
> 
> I'll apply this patch soon, it's a good catch.
> 

Cool, thanks!

-- 
Doug Ledford 
  GPG KeyID: 0E572FDD
  http://people.redhat.com/dledford




signature.asc
Description: OpenPGP digital signature

Re: [PATCH] ipoib: fix hard_header return value

2013-03-27 Thread Doug Ledford

On 03/26/2013 02:16 PM, Jason Gunthorpe wrote:
> On Tue, Mar 26, 2013 at 09:46:28AM -0700, Roland Dreier wrote:
>>> Checkpatch recommends since some time to use sizeof(e) instead of sizeof e,
>>> isn't it ?
>>
>> I actually prefer "sizeof e" since sizeof is an operator, not a
>> function.  "sizeof(e)" looks just as silly as "return(e)" to me.
> 
> Sizeof is used as an expression, return is not.
> 
> They have different precedence rules:
>  return e + 1; // == return (e + 1)
> vs
>  sizeof e + 1;  // == sizeof(e) + 1
> 
> Or weirder:
>  return (void *)x; // OK
> vs
>  sizeof (void *)x; // <-- compile error
> 
> Coding sizeof as a function call frees the reader from having to
> check/know the obscure precedence rules for sizeof.

Personally, I couldn't care less.  But about 5 lines outside of the
context of this patch in the same function is another use of that same
expression.  The real reason I used the form I did was to match the
style of that other instance.  I prepared a separate patch to convert
both of them to the checkpatch preferred style until I saw Roland say he
didn't want it.  But what I wouldn't have done is submitted a patch that
caused a single function to have two different styles within 8 lines of
each other just because some nanny state nag script tells me too.

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 2/2] Ad IB_MTU_1500|9000 enums.

2013-04-19 Thread Doug Ledford

On 04/19/2013 11:24 AM, Jeff Squyres (jsquyres) wrote:
> On Apr 12, 2013, at 11:40 AM, Jeff Squyres (jsquyres)  
> wrote:
> 
>>> As an aside I like the use of RDMA_MTU_* for these values.  Again to 
>>> distinguish them from the IBTA values.  But I know that is poor form.
>>
>> So what's the right way to move forward on this?  Is it this:
>>
>> enum ib_mtu {
>>   IB_MTU_256  = 1,
>>   IB_MTU_512  = 2,
>>   IB_MTU_1024 = 3,
>>   IB_MTU_2048 = 4,
>>   IB_MTU_4096 = 5,
>>   RDMA_MTU_1500 = 1500,
>>   RDMA_MTU_9000 =9000
>> };
> 
> 
> Bump.
> 

This seems reasonable, but still concerns me a bit.  The original
version was flat out wrong because you can't re-arrange any exposed enum
like this without requiring that all user space apps be recompiled.
This is especially true because ibv_mtu_enum_to_int is an inline, so any
compiled programs won't have been updated by a shared library update,
yet the new enum values can end up showing up, so the apps break when
they start seeing -1 as the return value, or when they interpret 1500 as
2048, etc.

I wonder if the correct way forward isn't to leave these two functions
alone (as inlines you can't change them without a recompile anyway).
However, I would officially change the documentation to specify that
both enum ib_mtu in the queue_pair structs and the result of
ib_mtu_enum_to_int is not exact on non-IB link layers and is deprecated
in favor of a new function: ib_qp_mtu() (or similar) that takes a queue
pair and returns the real mtu for that queue pair based on params and
device the queue pair is on (I could also see it being
ibv_dev_mtu(struct ibv_device *) instead if you want to query the mtu
before you make a queue pair instead).  Then for both IBoE and USNIC
devices, I would just store the actual MTU in the internal ibv device
struct and return that when requested.

This maintains 100% back compatibility with existing apps but allows a
path for people to get better performance/utilization by using the new
function instead of the old one.

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 2/2] Ad IB_MTU_1500|9000 enums.

2013-04-22 Thread Doug Ledford

On 04/20/2013 07:29 PM, Hefty, Sean wrote:
>> This seems reasonable, but still concerns me a bit.  The original 
>> version was flat out wrong because you can't re-arrange any exposed
>> enum like this without requiring that all user space apps be
>> recompiled. This is especially true because ibv_mtu_enum_to_int is
>> an inline
> 
> ib_mtu_enum_to_int() is a kernel function, not user space, so I think
> we're fine here, unless you're concerned about drivers built out of
> tree.

Well, although *I* might have to worry about out of kernel drivers, I
wouldn't suggest such for upstream.  However, for some reason I had it
in my mind when I was reading the patch that it was against libibverbs.
 That's what I get for staying up late and reviewing when I'm tired :-/


--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 2/2] Ad IB_MTU_1500|9000 enums.

2013-04-22 Thread Doug Ledford

On 04/22/2013 02:00 PM, Jeff Squyres (jsquyres) wrote:

> 2. Change all instances of ib_mtu/ibv_mtu to an int.  Code such as 
> "switch(mtu) case IBV_MTU_1024: ..." will need to be updated to 
> "switch(mtu) case 1024: ...".
> 
> PRO: solves the problem for all MTU values
> PRO: eliminates the num-to-int translation functions
> CON: much driver code will need to be updated per above, and also
> update logic checking for out-of-bounds MTU calues
> CON: similarly, userspace apps will need to be updated; it might be
> worthwhile to bump libibverbs to 2.x, and then intentionally change
> the MTU field names in ibv_port_attr and ibv_qp_attr so that apps
> using those fields will fail to compile with libibverbs 2.x
> (and therefore forcibly realize they need to adapt to the new
> int MTU values)

I was actually thinking that an ibverbs API version 2.0 might be an
interesting way to go.  The proliferation of non-IB link layers
providing the verbs API make some of the original assumptions of IB link
layer in the original API obsolete.  But, if we were to do that, I'd
take some time to really think the issue over and try to catch all of
the needed updates in one go.

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] ipoib: fix hard_header return value

2013-04-23 Thread Doug Ledford

On 04/01/13 17:25, Or Gerlitz wrote:
> Doug Ledford  wrote:
>> If you have a patched up dhcp server (and dhclient),
> 
> Could you be more specific, I assume you refer to the ISC dhcp bits,
> which version and which patches?

Any version of dhcp server, and the improved-xid and lpf-ib patches
primarily.  We've carried those patches forever, but as far as I know,
they still have not been taken by ISC.  Without them, dhcp server will
only work with a cooked socket interface.  You can't use raw as the
socket type when compiling or else it won't work on IB.  With the
patches, you can enable the raw socket method, and on IB it will fall
back to use PACKET instead.

> AFAIK they don't give you access to
> their source repo but rather only to drops plus possibly patches which
> is a bit more tedious to follow, oh wel...
> 
>> they will use AF_PACKET/SOCK_DGRAM pair to send dhcp packets over IPoIB.
> 
> 
>> This has worked since forever if you use OFED kernels or one of the 
>> distribution
>> kernels.  However, when testing an upstream kernel, it has been broken
>> for a very long time (I tested 2.6.34, 2.6.38, 3.0, 3.1, 3.8, HEAD).
> 
> IMO doesn't seem relevant to the upstream commit message

I disagree.  I don't buy the whole "we are upstream, nothing else
matters or is relevant" philosophy.  The truth of the matter is that
there is essentially a fork between upstream and OFED.  I plan on
spending some time bringing some of the relevant fixes present in OFED
and not upstream back to upstream.  In the context of attempting to
manually merge some of that fork back together, I see no reason
whatsoever to ignore relevant historical information during that process.

>> It turns out that the hard_header routine in ipoib is not following
>> the API and is returning 0 even when it pushed data onto the skb.  This
>> then causes af_packet.c to overwrite the header just pushed with data
>> from user space.  This header is immediately referenced in the
>> ipoib_start_xmit routine
> 
> cool, I assume we want this fix to go for -stable after spending some
> time upstream, e.g probably by the time 3.9 is released and some more
> testing is done on the patch (I'll advocate for that @ MLNX, copied
> some folks now) get that to -stable too.

Yes, it can go to -stable.  But, given that no one has noticed before
now that it doesn't work, I'm guessing not many people are using
straight upstream (which is something that needs fixed IMO).

> Erez, in the code we use internally which is based on upstream 3.7 do
> we have DHCP/IPoIB working without this patch?
> 
>> so I'm wondering how this ever worked in
>> distro/ofed kernels that also have this bug, but fixing the bug here
>> makes things work in upstream kernels.
> 
> same for the last three lines
> 
> 
> Signed-off-by: Doug Ledford 
> ---
>  drivers/infiniband/ulp/ipoib/
> ipoib_main.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c
> b/drivers/infiniband/ulp/ipoib/ipoib_main.c
> index 8534afd..31dd2a7 100644
> --- a/drivers/infiniband/ulp/ipoib/ipoib_main.c
> +++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c
> @@ -828,7 +828,7 @@ static int ipoib_hard_header(struct sk_buff *skb,
>  */
> memcpy(cb->hwaddr, daddr, INFINIBAND_ALEN);
> 
> -   return 0;
> +   return sizeof *header;
>  }
> 
>  static void ipoib_set_mcast_list(struct net_device *dev)
> 


-- 
Doug Ledford 
  GPG KeyID: 0E572FDD
  http://people.redhat.com/dledford

Infiniband specific RPMs available at
  http://people.redhat.com/dledford/Infiniband
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 1/2] libibverbs: Use autoreconf in autogen.sh

2013-05-01 Thread Doug Ledford

On 04/30/2013 10:29 AM, Jeff Squyres (jsquyres) wrote:
> Bump bump.  :-)

This is fine with me, however, I think you also need to bump the
autotools version to the latest upstream.  The automated checkers in our
build environment is spitting out errors about a number of upstream
packages where the autotools used to configure the package does not
include proper arm support.  The latest autotools bring in all of the
forthcoming arm variants.  So I would like to see both of these things done.

> On Apr 25, 2013, at 11:38 AM, Jeff Squyres (jsquyres)  
> wrote:
> 
>> Bump.
>>
>> On Apr 22, 2013, at 1:41 PM, Jeff Squyres  wrote:
>>
>>> The old sequence of Autotools commands listed in autogen.sh is no
>>> longer correct.  Instead, just use the single "autoreconf" command,
>>> which will invoke all the Right Autotools commands in the correct
>>> order.
>>>
>>> Signed-off-by: Jeff Squyres 
>>> ---
>>> autogen.sh | 6 +-
>>> 1 file changed, 1 insertion(+), 5 deletions(-)
>>>
>>> diff --git a/autogen.sh b/autogen.sh
>>> index fd47839..6c9233e 100755
>>> --- a/autogen.sh
>>> +++ b/autogen.sh
>>> @@ -1,8 +1,4 @@
>>> #! /bin/sh
>>>
>>> set -x
>>> -aclocal -I config
>>> -libtoolize --force --copy
>>> -autoheader
>>> -automake --foreign --add-missing --copy
>>> -autoconf
>>> +autoreconf -ifv -I config
>>> -- 
>>> 1.8.1.1
>>>
>>
>>
>> -- 
>> Jeff Squyres
>> jsquy...@cisco.com
>> For corporate legal information go to: 
>> http://www.cisco.com/web/about/doing_business/legal/cri/
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
>> the body of a message to majord...@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> 

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: libibverbs / libmlx4 release

2013-05-20 Thread Doug Ledford

On 05/20/2013 12:49 PM, Roland Dreier wrote:
> On Mon, May 20, 2013 at 5:37 AM, Or Gerlitz  wrote:
>> Following what we discussed last week during the Linux Foundation EU summit, 
>> I think it would be good to follow what you said and have a point release 
>> for libibverbs and libmlx4 before we pull in the verbs extensions framework 
>> and features that use it (XRC, Flow-Steering, etc more fun).
>>
>> I mentioned to you that we have some more libmlx4 patches, but its totally 
>> OK for us to submit them after that release, makes sense?
> 
> That's fine, I'll do the releases this week.

If you haven't already done so, can you please update to the latest
autotools and reroll the autotools files when you do the release?

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: BUG: unable to handle kernel paging request at 0000000000070a78 IPoIB

2013-05-23 Thread Doug Ledford

On 05/23/2013 11:38 AM, Jack Wang wrote:
> Tainted: G   O 3.4.23-pserver-hotfix+ #109 System manufacturer
 ^^^

I would try a newer kernel.  There are a couple known issues fixed since
this kernel (including a memory corrupter that was involved with
neighbor list handling, and some of your traces look vaguely familiar to
that old failuer).



-- 
Doug Ledford 
  GPG KeyID: 0E572FDD
  http://people.redhat.com/dledford




signature.asc
Description: OpenPGP digital signature

Re: BUG: unable to handle kernel paging request at 0000000000070a78 IPoIB

2013-05-23 Thread Doug Ledford

On 05/23/2013 02:53 PM, Jack Wang wrote:
> On 2013年05月23日 19:41, Doug Ledford wrote:
>> On 05/23/2013 11:38 AM, Jack Wang wrote:
>>> Tainted: G   O 3.4.23-pserver-hotfix+ #109 System manufacturer
>>  ^^^
>>
>> I would try a newer kernel.  There are a couple known issues fixed since
>> this kernel (including a memory corrupter that was involved with
>> neighbor list handling, and some of your traces look vaguely familiar to
>> that old failuer).
>>
>>
>>
> 
> Thanks Doug for reply, I tried branch rdma-for-linus, It panic in other
> places.
> 
> Could you point me which commit do you mean exactly?
> 
> Regards,
> Jack
> 

Just try the official v3.9 kernel from Linus and see how it does.  A
'git checkout v3.9' will do the trick.

-- 
Doug Ledford 
  GPG KeyID: 0E572FDD
  http://people.redhat.com/dledford




signature.asc
Description: OpenPGP digital signature

[Patch infiniband-diags:perfquery] Loop through all local HCAs/ports

2011-07-29 Thread Doug Ledford

The -a mode of perfquery is intended to loop through all ports on a 
single HCA and provide aggregated output across all ports.


The -l mode is intended to loop through all ports of a single HCA and 
output non-aggregated data.


Neither mode addresses a machine with more than one HCA.  Furthermore, I 
found both -a and -l failed to loop properly on my Mellanox adapter (it 
would read the first port and error out trying to read the second).


So, I wrote a new switch, -H, that loops through all ports on all HCAs 
in the system.  Because of how it's implemented, it gets around the 
problem that both -a and -l had on my machine when dealing with the 
second Mellanox port.  It, however, does not do aggregated output 
because each HCA/port combination is treated as its own device.


I forgot to update the man page though.  If the current infiniband-diags 
maintainer wants it, I can add that (that's assuming the base patch is 
acceptable).  I think Ira is doing that now, right?


Anyway, attached is the patch.
diff -up infiniband-diags-1.5.8/src/perfquery.c.hcas 
infiniband-diags-1.5.8/src/perfquery.c
--- infiniband-diags-1.5.8/src/perfquery.c.hcas 2011-02-16 05:13:21.0 
-0500
+++ infiniband-diags-1.5.8/src/perfquery.c  2011-07-29 17:17:41.599262511 
-0400
@@ -347,7 +347,7 @@ static void reset_counters(int extended,
 }
 
 static int reset, reset_only, all_ports, loop_ports, port, extended, xmt_sl,
-rcv_sl, xmt_disc, rcv_err, smpl_ctl;
+rcv_sl, xmt_disc, rcv_err, smpl_ctl, all_hcas;
 
 static void common_func(ib_portid_t * portid, int port_num, int mask,
unsigned query, unsigned reset,
@@ -447,12 +447,16 @@ static int process_opt(void *context, in
case 'R':
reset_only++;
break;
+   case 'H':
+   all_hcas++;
+   break;
default:
return -1;
}
return 0;
 }
 
+#define MAX_NAMES 10
 int main(int argc, char **argv)
 {
int mgmt_classes[4] = { IB_SMI_CLASS, IB_SMI_DIRECT_CLASS, IB_SA_CLASS,
@@ -467,6 +471,10 @@ int main(int argc, char **argv)
int start_port = 1;
int enhancedport0;
int i;
+   int names;
+   int cur_name = 0;
+   char name_list[10][UMAD_CA_NAME_LEN];
+   umad_ca_t ca;
 
const struct ibdiag_opt opts[] = {
{"extended", 'x', 0, NULL, "show extended port counters"},
@@ -476,6 +484,7 @@ int main(int argc, char **argv)
{"rcverr", 'E', 0, NULL, "show Rcv Error Details"},
{"smplctl", 'c', 0, NULL, "show samples control"},
{"all_ports", 'a', 0, NULL, "show aggregated counters"},
+   {"all_hcas", 'H', 0, NULL, "iterate through all local HCAs and 
ports"},
{"loop_ports", 'l', 0, NULL, "iterate through each port"},
{"reset_after_read", 'r', 0, NULL, "reset counters after read"},
{"Reset_only", 'R', 0, NULL, "only reset counters"},
@@ -484,10 +493,12 @@ int main(int argc, char **argv)
char usage_args[] = " [ [[port] [reset_mask]]]";
const char *usage_examples[] = {
"\t\t# read local port's performance counters",
+   "-H\t\t# read performance counters on all local HCAs/ports",
"32 1\t\t# read performance counters from lid 32, port 1",
"-x 32 1\t# read extended performance counters from lid 32, 
port 1",
"-a 32\t\t# read performance counters from lid 32, all ports",
"-r 32 1\t# read performance counters and reset",
+   "-r -H\t\t# read and reset counters on all local HCAs/ports",
"-x -r 32 1\t# read extended performance counters and reset",
"-R 0x20 1\t# reset performance counters of port 1 only",
"-x -R 0x20 1\t# reset extended performance counters of port 1 
only",
@@ -503,118 +514,173 @@ int main(int argc, char **argv)
argc -= optind;
argv += optind;
 
+   if (all_hcas && argc > 0)
+   IBERROR("Invalid input: all_hcas and any port/lid/guid are "
+   "not allowed together.");
+   if (all_hcas && (loop_ports || all_ports || (port==ALL_PORTS)))
+   IBERROR("Invalid input: all_hcas already goes over all ports, "
+   "but in a way that's incompatible with the all_ports "
+   "option or variants.");
+
if (argc > 1)
port = strtoul(argv[1], 0, 0);
if (argc > 2)
mask = strtoul(argv[2], 0, 0);
 
-   srcport = mad_rpc_open_port(ibd_ca, ibd_ca_port, mgmt_classes, 4);
-   if (!srcport)
-   IBERROR("Failed to open '%s' port '%d'", ibd_ca, ibd_ca_port);
-
-   if (argc) {
-   if (ib_resolve_portid_str_via(&portid, argv[0], ibd_dest_type,
- ibd_sm_id, srcport) < 0)
-   IBERROR("can't resolve

Re: [Patch infiniband-diags:perfquery] Loop through all local HCAs/ports

2011-07-29 Thread Doug Ledford


On 07/29/2011 05:57 PM, Doug Ledford wrote:
+   if (umad_get_ca(name_list[cur_name], &ca))
+   /* We're done, the next name was
+* empty, just exit gracefully
+*/
+   exit(0);

In retrosight, this may not exit quietly.  I can't tell because the last 
adapter in my system causes the program to exit noisily right now for 
totally unrelated reasons, so if I got that fixed, I may find this isn't 
as quite of an exit as I thought.  This may need to be changed to a 
check for a NULL name instead, on NULL exit quietly, and if non-NULL 
then attempt to get it, and then on failure be noisy.


--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [Patch infiniband-diags:perfquery] Loop through all local HCAs/ports

2011-07-29 Thread Doug Ledford


On 07/29/2011 06:09 PM, Jason Gunthorpe wrote:

On Fri, Jul 29, 2011 at 05:57:57PM -0400, Doug Ledford wrote:

The -a mode of perfquery is intended to loop through all ports on a
single HCA and provide aggregated output across all ports.

The -l mode is intended to loop through all ports of a single HCA
and output non-aggregated data.


Actually, none of these modes are intended to support HCA scenarios,
they are all only for switches.


Well, they attempt to work on local HCAs if you don't specify a switch 
lid to query.  So, intended or not, they are already being attempted to 
be used in this fashion out in the field.



Not sure what I think of this, is dumping counters on all local HCA
ports really that interesting? Would this be better done by doing
something fancy with nodeGUID so at least all ports on remote HCAs can
be dumped too?


The request came in from one of our partners who wanted it for tracking 
performance stats specifically on the local machine.


--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [Patch infiniband-diags:perfquery] Loop through all local HCAs/ports

2011-07-30 Thread Doug Ledford

On 7/29/2011 6:26 PM, Jason Gunthorpe wrote:
>>> Not sure what I think of this, is dumping counters on all local HCA
>>> ports really that interesting? Would this be better done by doing
>>> something fancy with nodeGUID so at least all ports on remote HCAs can
>>> be dumped too?
>>
>> The request came in from one of our partners who wanted it for
>> tracking performance stats specifically on the local machine.
> 
> Well, I think it would be best to make this work generally which is
> fairly hard, unfortunately.
>  - If the destination is the local HCA then you have to iterate over
>all local ports with matching node GUIDs by opening devices
>  - If the destination is a remote HCA then you have to query the SA
>for all nodes with a matching GUID and iterate over them. Ira has
>been working on some common code for this..
>  - Maybe you want to cross product and try to query the SA attached to
>all local end ports to try and find all ports. That would actually be
>very useful for many situations I know of...
> 
> Could you make your patch just do #1 and continue to misbehave for
> the other cases?

Well, it sort of already does.  The patch looks bigger than it is
because I had to re-indent in order to put the main functional code
inside of a loop.  For the most part, the loop is unchanged.  Just apply
the patch then look at what we do when all_hcas is non-0 right before
the do { and right before the } while (ibd_ca);  It iterates over all
the hca names that libibumad gives us, and then loops over all the ports
as returned by libibumad's umad_get_ca().


-- 
Doug Ledford 
  GPG KeyID: CFBFF194
  http://people.redhat.com/dledford

Infiniband specific RPMs available at
  http://people.redhat.com/dledford/Infiniband



signature.asc
Description: OpenPGP digital signature

[Patch libibverbs-1.1.5] Fix broken compat assumption on ppc arch

2011-08-02 Thread Doug Ledford

In the compat-1.0 wrapper, the post_send wrapper assumes that the layout 
of the tail end of the ibv_send_wr struct is the same in version 1.0 and 
1.1.  Unfortunately, padding of the struct makes this untrue depending 
on the architecture.  In my case, it was ppc that was failing to compile 
because we were triggering the gcc built in memcpy check.  We were 
scribbling past the end of the structure of the new structure because 
the old structure has alignment padding that the new one doesn't.  There 
are two spots that are padded in the version 1.0 ibv_send_wr struct, 4 
bytes before sg_list, and 4 more bytes before wr.


So this patch conditionalizes whether or not we use memcpy on the basis 
of data that will be reduced to constants at compile time, so even 
though there is an if statement, it gets reduced to just the proper 
method for the given arch at compile time.  And the patch will attempt a 
simple memcpy if possible, or two memcpy's in the case that wr needed 
padding to be aligned (as is the case on ppc), and fall back to field 
based copy just in case this is compiled on some arch with really odd 
field alignment characteristics.


Signed-off-by: Doug Ledford 
diff -up libibverbs-1.1.5/src/compat-1_0.c.memcpy 
libibverbs-1.1.5/src/compat-1_0.c
--- libibverbs-1.1.5/src/compat-1_0.c.memcpy2011-06-28 17:13:45.0 
-0400
+++ libibverbs-1.1.5/src/compat-1_0.c   2011-08-02 12:07:20.114511405 -0400
@@ -350,8 +350,46 @@ static int post_send_wrapper_1_0(struct 
real_wr->wr_id = w->wr_id;
real_wr->next  = NULL;
 
-   memcpy(&real_wr->sg_list, &w->sg_list,
-  sizeof *w - offsetof(struct ibv_send_wr, sg_list));
+#define TEST_SIZE_2_POINT(f1, f2) \
+((offsetof(struct ibv_send_wr, f1) - offsetof(struct ibv_send_wr, f2)) \
+== offsetof(struct ibv_send_wr_1_0, f1) - offsetof(struct ibv_send_wr_1_0, f2))
+#define TEST_SIZE_TO_END(f1) \
+((sizeof(struct ibv_send_wr) - offsetof(struct ibv_send_wr, f1)) == \
+ (sizeof(struct ibv_send_wr_1_0) - offsetof(struct ibv_send_wr_1_0, f1)))
+
+   if (TEST_SIZE_TO_END (sg_list))
+   memcpy(&real_wr->sg_list, &w->sg_list, sizeof *real_wr
+  - offsetof(struct ibv_send_wr, sg_list));
+   else if (TEST_SIZE_2_POINT (imm_data, sg_list) &&
+TEST_SIZE_TO_END (wr)) {
+   /* we have alignment up to wr, but padding between
+* imm_data and wr, and we know wr itself is the
+* same size */
+   memcpy(&real_wr->sg_list, &w->sg_list,
+  offsetof(struct ibv_send_wr, imm_data) -
+  offsetof(struct ibv_send_wr, sg_list) +
+  sizeof real_wr->imm_data);
+   memcpy(&real_wr->wr, &w->wr, sizeof real_wr->wr);
+   } else {
+   real_wr->sg_list = w->sg_list;
+   real_wr->num_sge = w->num_sge;
+   real_wr->opcode = w->opcode;
+   real_wr->send_flags = w->send_flags;
+   real_wr->imm_data = w->imm_data;
+   if (TEST_SIZE_TO_END (wr))
+   memcpy(&real_wr->wr, &w->wr,
+  sizeof real_wr->wr);
+   else {
+   real_wr->wr.atomic.remote_addr = 
+   w->wr.atomic.remote_addr;
+   real_wr->wr.atomic.compare_add = 
+   w->wr.atomic.compare_add;
+   real_wr->wr.atomic.swap = 
+   w->wr.atomic.swap;
+   real_wr->wr.atomic.rkey = 
+   w->wr.atomic.rkey;
+   }
+   }
 
if (is_ud)
real_wr->wr.ud.ah = w->wr.ud.ah->real_ah;

Re: Building 3.1-rc9 in kernel infiniband support with OFED libraries

2011-10-14 Thread Doug Ledford

- Original Message -
> On Thu, Oct 13, 2011 at 10:24 AM, Christoph Lameter 
> wrote:
> >> Clean reviewed patches for this are where?
> >
> > They are in OFED-1.5.3.1 so they were already released.
> 
> OFED is neither clean nor reviewed.  Really.  The stuff in OFED
> always
> needs a bunch
> of review before it is suitable for upstream.  Ignoring ABI stability
> is just one problem
> that OFED code typically has.

Indeed.  Because it is in OFED means it is in testing, nothing more.
A better acronym for OFED is Open Fabrics Experimental Distribution.
These issues are the primary reason we don't follow OFED in RHEL6.

-- 
Doug Ledford 
  GPG KeyID: 0E572FDD
  http://people.redhat.com/dledford

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Building 3.1-rc9 in kernel infiniband support with OFED libraries

2011-10-14 Thread Doug Ledford

- Original Message -
> On Tue, Oct 11, 2011 at 7:39 PM, Jason Gunthorpe
>  wrote:
> > On Tue, Oct 11, 2011 at 09:02:41AM -0500, Christoph Lameter wrote:
> >> Has XRC support not been merged? How can I build the OFED
> >> libraries
> >> against Linux 3.1? I'd really like to get rid of the OFED kernel
> >> tree
> >> nightmare.
> >
> > You have to use upstream libraries with upstream kernels. Be warned
> > that the OFED libraries of the same SONAME are not ABI compatible
> > with
> > upstream libraries.
> 
> Why is the OFED libibverbs library binary incompatible with the
> non-OFED libibverbs library ?

Well, as an example:

Upstream (Roland's) libibverbs (and this may not be entirely clean,
I do have the original RoCE support in here, and plan to switch over
to IBoE when it's available): /usr/include/infiniband/verbs.h

struct ibv_srq {
struct ibv_context *context;
void   *srq_context;
struct ibv_pd  *pd;
uint32_thandle;

pthread_mutex_t mutex;
pthread_cond_t  cond;
uint32_tevents_completed;
};

>From OFED's libibverbs: /usr/include/infiniband/verbs.h

struct ibv_srq {
struct ibv_context *context;
void   *srq_context;
struct ibv_pd  *pd;
uint32_thandle;

uint32_tevents_completed;

uint32_txrc_srq_num;
struct ibv_xrc_domain  *xrc_domain;
struct ibv_cq  *xrc_cq;

pthread_mutex_t mutex;
pthread_cond_t  cond;
};

So, while they added a few new things, they also reordered existing
items.  This is a public structure passed back to the app and which
the app might read any of the items if they are interested.  So, with
a struct change like this you would normally require an update to
the library symbol map and have a compat wrapper so that binaries
compiled with either the old or new struct definition can link
against the library.  However, here is the library symbol maps for
ibv_create_srq in each library source:

Roland's: ibv_create_srq{IBVERBS_1.0}
OFED's: ibv_create_srq{IBVERBS_1.0}

This is despite the fact that whenever an ibverbs interface has
needed an update, Roland has done the right thing and updated
the symbol version to the IBVERBS_1.1 space in the map and
created a back compat wrapper in the 1.0 space.  So it's not even
like the OFED folks didn't have examples of how to do this
properly staring them in the face, they just didn't care to do it
properly.

> Why hasn't XRC support been implemented
> in the OFED libibverbs library such that applications built against
> the upstream libibverbs headers also work with the latest OFED
> version
> of that library ?

Because OFED exists.  That's the *real* reason they don't bother to
write this stuff right in the first place.  OFED is released as one
single unit, and there is no guarantee of API or ABI compatibility
from one release to the next.  And there is no guarantee of API/ABI
compatibility between any given OFED released library and any other
upstream library (aka, no guarantee that the OFED libibverbs will
work with the upstream librdmacm).  If something doesn't work,
then their answer is "Upgrade to the latest OFED on all machines,
and recompile your apps as needed".  If you had your own tree and
didn't have to answer to any upstream review, especially on
controversial new features or difficult designs, would you bother?

-- 
Doug Ledford 
  GPG KeyID: 0E572FDD
  http://people.redhat.com/dledford

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Building 3.1-rc9 in kernel infiniband support with OFED libraries

2011-10-14 Thread Doug Ledford

- Original Message -
> On Wed, Oct 12, 2011 at 9:32 AM, Wendy Cheng
>  wrote:
> > The OFED package itself does include XRC support. The issue here
> > (my
> > guess) is that its build script needs to understand the running
> > system's kernel version to decide what should be pulled (from the
> > source). Linux 3.1 could be too new for OFED build script to make a
> > correct decision. Nevertheless, mix-matching OFED modules/libraries
> > is
> > a *bad* idea.
> 
> No.  The same userspace build should work with all kernel versions.

Wendy's referring to something other than what you are thinking.  The
same libibverbs user space build should work on all kernels going back
a long way, except when you are talking about OFED their libibverbs
is hard coded to assume XRC support and fail if it isn't present, so
an OFED libibverbs won't really work without also the OFED kernel
module stack.  The script Wendy referred to is the script that checks
the running kernel's version in order to determine which backport
patches need applied to the ofa_kernel source tree in order to build
the OFED kernel modules for your running kernel.  Without that
ofa_kernel build, the OFED libibverbs will indeed fail to run on
the running kernel.  And that script hasn't been updated to support
version 3.x kernels last I checked, so she's right, the script itself
doesn't recognize the running kernel version, so ofa_kernel modules
don't get built, so OFED libibverbs won't work anyway.  So, she's
absolutely right, unless you want to start ripping hard coded
assumptions about the existence of XRC support out of things like
OFED's libibverbs, then out of qperf and a number of their other
various packages, then you have to pair the OFED kernel modules and
user space packages, they can not be separated.

-- 
Doug Ledford 
  GPG KeyID: 0E572FDD
  http://people.redhat.com/dledford

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [ewg] [RFC] – Proposal for new process for OFED releases

2011-12-05 Thread Doug Ledford

On 12/01/2011 02:53 PM, Tziporet Koren wrote:
> 
> We propose a new process for the OFED releases starting from next OFED 
> release:
> - OFED content will be the relevant kernel.org modules and user space 
> released packages
> - OFED will offer only backports to the distros  (no fixes)
> - OFED package will be used for easy installation of all packages in a 
> friendly manner

Yay!!!  I'm all in favor.


-- 
Doug Ledford 
  GPG KeyID: 0E572FDD

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[Patch] mlx4: fix ppc64 build issue

2012-02-20 Thread Doug Ledford

This may not be an issue with current gcc, I simply haven't tried it
to know.  But with the gcc shipped in rhel6 (gcc 4.4.6), you can't
both define a function as static and also EXPORT_SYMBOL that function
on ppc.  This fixes that in the mlx4 driver.

-- 
Doug Ledford 
  GPG KeyID: 0E572FDD
  http://people.redhat.com/dledford

commit 5e3ddfd1616295b47785ab81e330177cb962070a
Author: Doug Ledford 
Date:   Fri Feb 17 10:35:29 2012 -0500

mlx4: fix build error

powerpc is picky, if you set a function to static and then try to
EXPORT_SYMBOL_GPL that function, you get this error message:

drivers/net/ethernet/mellanox/mlx4/fw.c:687: error:
__ksymtab_mlx4_QUERY_PORT causes a section type conflict

Signed-off-by: Doug Ledford 

diff --git a/drivers/net/ethernet/mellanox/mlx4/fw.c b/drivers/net/ethernet/mellanox/mlx4/fw.c
index 8a21e10..9ea7cab 100644
--- a/drivers/net/ethernet/mellanox/mlx4/fw.c
+++ b/drivers/net/ethernet/mellanox/mlx4/fw.c
@@ -685,7 +685,7 @@ int mlx4_QUERY_PORT_wrapper(struct mlx4_dev *dev, int slave,
 	return err;
 }
 
-static int mlx4_QUERY_PORT(struct mlx4_dev *dev, void *ptr, u8 port)
+int mlx4_QUERY_PORT(struct mlx4_dev *dev, void *ptr, u8 port)
 {
 	struct mlx4_cmd_mailbox *outbox = ptr;
 
diff --git a/drivers/net/ethernet/mellanox/mlx4/mr.c b/drivers/net/ethernet/mellanox/mlx4/mr.c
index 8deeef9..25a80d7 100644
--- a/drivers/net/ethernet/mellanox/mlx4/mr.c
+++ b/drivers/net/ethernet/mellanox/mlx4/mr.c
@@ -304,7 +304,7 @@ static int mlx4_HW2SW_MPT(struct mlx4_dev *dev, struct mlx4_cmd_mailbox *mailbox
 			MLX4_CMD_TIME_CLASS_B, MLX4_CMD_WRAPPED);
 }
 
-static int mlx4_mr_reserve_range(struct mlx4_dev *dev, int cnt, int align,
+int mlx4_mr_reserve_range(struct mlx4_dev *dev, int cnt, int align,
 			  u32 *base_mridx)
 {
 	struct mlx4_priv *priv = mlx4_priv(dev);
@@ -320,14 +320,14 @@ static int mlx4_mr_reserve_range(struct mlx4_dev *dev, int cnt, int align,
 }
 EXPORT_SYMBOL_GPL(mlx4_mr_reserve_range);
 
-static void mlx4_mr_release_range(struct mlx4_dev *dev, u32 base_mridx, int cnt)
+void mlx4_mr_release_range(struct mlx4_dev *dev, u32 base_mridx, int cnt)
 {
 	struct mlx4_priv *priv = mlx4_priv(dev);
 	mlx4_bitmap_free_range(&priv->mr_table.mpt_bitmap, base_mridx, cnt);
 }
 EXPORT_SYMBOL_GPL(mlx4_mr_release_range);
 
-static int mlx4_mr_alloc_reserved(struct mlx4_dev *dev, u32 mridx, u32 pd,
+int mlx4_mr_alloc_reserved(struct mlx4_dev *dev, u32 mridx, u32 pd,
 			   u64 iova, u64 size, u32 access, int npages,
 			   int page_shift, struct mlx4_mr *mr)
 {
@@ -457,7 +457,7 @@ int mlx4_mr_alloc(struct mlx4_dev *dev, u32 pd, u64 iova, u64 size, u32 access,
 }
 EXPORT_SYMBOL_GPL(mlx4_mr_alloc);
 
-static void mlx4_mr_free_reserved(struct mlx4_dev *dev, struct mlx4_mr *mr)
+void mlx4_mr_free_reserved(struct mlx4_dev *dev, struct mlx4_mr *mr)
 {
 	int err;
 
@@ -852,7 +852,7 @@ err_free:
 }
 EXPORT_SYMBOL_GPL(mlx4_fmr_alloc);
 
-static int mlx4_fmr_alloc_reserved(struct mlx4_dev *dev, u32 mridx,
+int mlx4_fmr_alloc_reserved(struct mlx4_dev *dev, u32 mridx,
 			u32 pd, u32 access, int max_pages,
 			int max_maps, u8 page_shift, struct mlx4_fmr *fmr)
 {
@@ -954,7 +954,7 @@ int mlx4_fmr_free(struct mlx4_dev *dev, struct mlx4_fmr *fmr)
 }
 EXPORT_SYMBOL_GPL(mlx4_fmr_free);
 
-static int mlx4_fmr_free_reserved(struct mlx4_dev *dev, struct mlx4_fmr *fmr)
+int mlx4_fmr_free_reserved(struct mlx4_dev *dev, struct mlx4_fmr *fmr)
 {
 	if (fmr->maps)
 		return -EBUSY;

[Patch] iser: free ib connection resources in the proper place

2012-02-25 Thread Doug Ledford

We allocate the login dma buffers in iser_verbs.c as part of
alloc_ib_conn_resources(), however we are freeing them in
iser_initiator.c as part of iser_free_rx_descriptors().  This is
needlessly confusing.  We have an alloc_rx_descriptors() and it doesn't
alloc something that the free_rx_descriptors() frees, and we have an
alloc_ib_conn_resources() that allocs something not freed by
free_ib_conn_resources().  Clean that up.

Signed-off-by: Doug Ledford 
---
 drivers/infiniband/ulp/iser/iser_initiator.c |   12 
 drivers/infiniband/ulp/iser/iser_verbs.c |   12 
 2 files changed, 12 insertions(+), 12 deletions(-)

diff --git a/drivers/infiniband/ulp/iser/iser_initiator.c 
b/drivers/infiniband/ulp/iser/iser_initiator.c
index a607542..622e985 100644
--- a/drivers/infiniband/ulp/iser/iser_initiator.c
+++ b/drivers/infiniband/ulp/iser/iser_initiator.c
@@ -220,18 +220,6 @@ void iser_free_rx_descriptors(struct iser_conn *ib_conn)
struct iser_rx_desc *rx_desc;
struct iser_device *device = ib_conn->device;
 
-   if (ib_conn->login_buf) {
-   if (ib_conn->login_req_dma)
-   ib_dma_unmap_single(device->ib_device,
-   ib_conn->login_req_dma,
-   ISCSI_DEF_MAX_RECV_SEG_LEN, DMA_TO_DEVICE);
-   if (ib_conn->login_resp_dma)
-   ib_dma_unmap_single(device->ib_device,
-   ib_conn->login_resp_dma,
-   ISER_RX_LOGIN_SIZE, DMA_FROM_DEVICE);
-   kfree(ib_conn->login_buf);
-   }
-
if (!ib_conn->rx_descs)
return;
 
diff --git a/drivers/infiniband/ulp/iser/iser_verbs.c 
b/drivers/infiniband/ulp/iser/iser_verbs.c
index e28877c..46ba982 100644
--- a/drivers/infiniband/ulp/iser/iser_verbs.c
+++ b/drivers/infiniband/ulp/iser/iser_verbs.c
@@ -274,6 +274,18 @@ static int iser_free_ib_conn_res(struct iser_conn 
*ib_conn, int can_destroy_id)
ib_conn->cma_id   = NULL;
kfree(ib_conn->page_vec);
 
+   if (ib_conn->login_buf) {
+   if (ib_conn->login_req_dma)
+   ib_dma_unmap_single(device->ib_device,
+   ib_conn->login_req_dma,
+   ISCSI_DEF_MAX_RECV_SEG_LEN, DMA_TO_DEVICE);
+   if (ib_conn->login_resp_dma)
+   ib_dma_unmap_single(device->ib_device,
+   ib_conn->login_resp_dma,
+   ISER_RX_LOGIN_SIZE, DMA_FROM_DEVICE);
+   kfree(ib_conn->login_buf);
+   }
+
return 0;
 }
 
-- 
1.7.6.5

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

ibacm fixes/updates

2012-02-28 Thread Doug Ledford

Hi Sean,

We were asked to integrate this into our product for scalability
purposes.  While working on that, I ran across a number of issues that
needed fixed up (typical for a new package).  So, here's a number of
patches and fix ups for those things.

First, the package is ibacm but the binary is ib_acm.  Trust me, it's
best to pick one of the other and stick with it.  Especially since you
used ibacm for the log and pid files too.  In my build, I went ahead and
changed the binary name to match the package and log and pid file names.
 I patched the Makefile.am and reran the autoconf tools to make this
happen.  I've attached that patch.

In the package spec file don't differentiate between %{ver} and
%{version}, these should always be the same and you should never need to
use anything other than %{version}.  Your spec file had a srv component,
that I suspect was intended to hold the init script, but no init script
was ever written and the section was incomplete (so your spec file
wouldn't even build).  I rewrote the spec file so that the base
component is the service.  In truth, for this you really want it to be a
service and on by default if installed.  I've attached my spec file.
I've also attached the init script I made.

By default ibacm expects to find its configuration files in /etc/ibacm.
 This adds to the proliferation of directories in /etc/ needlessly.  We
already have a number of RDMA related directories to choose from
depending on your install (OFED == /etc/ofed or /etc/openib in the old
days, RHEL5 == /etc/openib from the old days, RHEL6 and Fedora are
/etc/rdma, don't know what SuSE uses).  Would be best if these files
were in the same place as the other RDMA related files.  And to make
that happen easily, it would be best if they picked up on %{sysconfdir}
from the ./configure invocation in order to set the directory.  I used a
static patch for now because I'm behind on my work, so updating the
configure script was more than I had time to do.  I'm not bothering to
include my patch as it needs done the other way so my patch is purely a
stopgap that should not see the time of day ;-)

OK, gotta run.  Would like to see this stuff picked up, especially the
init stuff.  Getting it in your release before we have one init script,
SuSE another, and OFED another will help keep things uniform.

-- 
Doug Ledford 
  GPG KeyID: 0E572FDD
  http://people.redhat.com/dledford

#!/bin/bash
#
# Bring up/down the ibacm daemon
#
# chkconfig: 2345 25 75
# description: Starts/Stops InfiniBand ACM service
#
### BEGIN INIT INFO
# Provides:   ibacm
# Default-Start: 2 3 4 5
# Default-Stop: 0 1 6
# Required-Start: rdma $network
# Required-Stop: rdma $network
# Should-Start: opensm
# Should-Stop: opensm
# Short-Description: Starts and stops the InfiniBand ACM service
# Description: The InfiniBand ACM service provides a user space implementation
#   of someting resembling an ARP cache for InfiniBand SA queries and
#   host route lookups.
### END INIT INFO

pidfile=/var/run/ibacm.pid
subsys=/var/lock/subsys/ibacm
prog=/usr/sbin/ibacm

. /etc/rc.d/init.d/functions

start()
{
echo -n "Starting ibacm daemon:"

daemon $prog
RC=$?
[ $RC -eq 0 ] && touch $subsys
echo
return $RC
}

stop()
{
echo -n "Stopping ibacm daemon:"

killproc -p $pidfile $prog
RC=$?
rm -f $subsys
echo
return $RC
}

status()
{
if [ ! -f $subsys -a ! -f $pidfile ]; then
return 3
fi
if [ -f $pidfile ]; then
checkpid `cat $pidfile`
return $?
fi
if [ -f $subsys ]; then
return 2
fi
}

restart ()
{
stop
start
}

condrestart ()
{
[ -e $subsys ] && restart || return 0
}

usage ()
{
echo
echo "Usage: `basename $0` 
{start|stop|restart|condrestart|try-restart|force-reload|status}"
echo
return 2
}

case $1 in
start|stop|restart|condrestart|try-restart|force-reload)
[ `id -u` != "0" ] && exit 4 ;;
esac

case $1 in
start) start; RC=$? ;;
stop) stop; RC=$? ;;
restart) restart; RC=$? ;;
reload) RC=3 ;;
condrestart) condrestart; RC=$? ;;
try-restart) condrestart; RC=$? ;;
force-reload) condrestart; RC=$? ;;
status) status; RC=$? ;;
*) usage; RC=$? ;;
esac

exit $RC
Name: ibacm
Version: 1.0.5
Release: 1%{?dist}
Summary: InfiniBand Communication Manager Assistant
Group: System Environment/Daemons
License: GPLv2 or BSD
Url: http://www.openfabrics.org/
Source: http://www.openfabrics.org/downloads/rdmacm/%{name}-%{version}.tar.gz
Source1: ibacm.init
Patch0: ibacm-1.0.5-make.patch
Patch1: ibacm-1.0.5-conf.patch
BuildRoot: %{_tmppath}/%{name}-%{version}-%{release}-root-%(%{__id_u} -n)
BuildRequires: libibverbs-devel >= 1.1-1, autoconf, libtool, libibumad-devel
Requires(post): chkconfig
Requires(preun): chkconfig
ExcludeArch: s390 s390x

%descripti

Re: ibacm fixes/updates

2012-02-28 Thread Doug Ledford

On 02/28/2012 04:08 PM, Doug Ledford wrote:
> By default ibacm expects to find its configuration files in /etc/ibacm.
>  This adds to the proliferation of directories in /etc/ needlessly.  We
> already have a number of RDMA related directories to choose from
> depending on your install (OFED == /etc/ofed or /etc/openib in the old
> days, RHEL5 == /etc/openib from the old days, RHEL6 and Fedora are
> /etc/rdma, don't know what SuSE uses).  Would be best if these files
> were in the same place as the other RDMA related files.  And to make
> that happen easily, it would be best if they picked up on %{sysconfdir}
> from the ./configure invocation in order to set the directory.  I used a
> static patch for now because I'm behind on my work, so updating the
> configure script was more than I had time to do.  I'm not bothering to
> include my patch as it needs done the other way so my patch is purely a
> stopgap that should not see the time of day ;-)

Oh, and another issue, the pid and lock files are not in the LSB
standard locations:

Pid file: /var/run/
Lock file: /var/lock/subsys/  (this is created by the init script, and
it is doing the right thing here)

Right now the program creates the pid file in /var/lock (or
/var/run/lock, don't know which and they are linked so it shows up both
places), but not in /var/run which is the standard location for pid
files.  The port file can stay where it is.


-- 
Doug Ledford 
  GPG KeyID: 0E572FDD
  http://people.redhat.com/dledford




signature.asc
Description: OpenPGP digital signature

[Patch opensm] Allow for easily configuring multiple fabrics on one opensm server

2012-02-28 Thread Doug Ledford

There are two things that stand in the way of opensm being run on
redundant fabrics easily:

1) The opensm init script only starts one instance of opensm and opensm
will only work on one fabric per instance
2) Even if you start multiple instances, you have to hand modify config
files for each instance and then when you upgrade the opensm rpm you
either loose your modifications or loose getting new default settings

I worked around both of these issues, I've attached the files I used to
do so.

First, I have an opensm init script that allows starting multiple opensm
instances.  It supports configuring this in one of two ways:

1) Create multiple opensm.conf files, each with a numbered suffix (so
opensm.conf.1, opensm.conf.2, etc.) and it will start one opensm
instance per config file.  This allows an admin to copy the default
config over and edit the things they need, and on rpm upgrade there will
be a new default opensm.conf file so they can diff between their edited
version and the new default and see if there are changes they need to
bring back in.  This also allows for complete flexibility in setting up
the different fabrics, for instance you could use one type of routing on
one and a totally different type on the others.

2) Edit the file /etc/sysconfig/opensm and define more than one GUID in
the GUIDs variable.  This will cause the opensm init script to
automatically start one instance per GUID, passing the GUID in on the
command line.

For the most part, this works well.  However, openmpi in particular
doesn't like you to have physically separate fabrics that have the same
subnet_prefix, and you can't specify a subnet_prefix on the command line
to go along with the GUIDs.  So I wrote a patch for that and made the
init script unilaterally increment the subnet prefix for each different
GUID it's attaching to.

All in all, we use the attached opensm file in /etc/sysconfig as the
standard place you put options belonging to an init script, we have the
opensm init script, the subnet_prefix patch I wrote, and with those
combined things work quite well.

However, I will note that our init script does not (and will not ever)
play the passwordless root ssh stuff that upstream does.  This is
considered a serious security risk on side.  The idea that a customer
(let's say a wall street bank) should set up passwordless root ssh on
their cluster that's a backend to their web farm?  Oh hell no...

I might recommend that it is long since past time for that particular
misfeature of the upstream opensm init script to be done away with.
Personally, I would simply recommend that on failover from a primary to
a backup that it simply scan the fabric and build a "current guid2lid"
map from what it finds, then start updating from there.  Or something
like that.  But passwordless ssh...bleh.

Oh, and while I've got your ear...is there a good reason the opensm libs
have been soname bumping so frequently?  Is it not possible to extend
the APIs without soname bumps quite so often?

-- 
Doug Ledford 
  GPG KeyID: 0E572FDD
  http://people.redhat.com/dledford

#!/bin/bash
#
# Bring up/down opensm
#
# chkconfig: - 15 85
# description: Activates/Deactivates InfiniBand Subnet Manager
# config: /etc/ofed/opensm.conf
#
### BEGIN INIT INFO
# Provides:   opensm
# Default-Stop: 0 1 2 3 4 5 6
# Required-Start: rdma
# Required-Stop: rdma
# Short-Description: Starts/Stops the InfiniBand Subnet Manager
# Description: Starts/Stops the InfiniBand Subnet Manager
### END INIT INFO

. /etc/rc.d/init.d/functions

prog=/usr/sbin/opensm
PID_FILE=/var/run/opensm.pid
[ -f /etc/sysconfig/opensm ] && . /etc/sysconfig/opensm

[ -n "$PRIORITY" ] && prio="-p $PRIORITY"

ACTION=$1

start()
{
local OSM_PID=
if [ -z "$GUIDS" ]; then
CONFIGS=""
CONFIG_CNT=0
for conf in /etc/rdma/opensm.conf.[0-9]*; do
CONFIGS="$CONFIGS $conf"
let CONFIG_CNT++
done
else
GUID_CNT=0
for guid in $GUIDS; do
let GUID_CNT++
done
fi
[ -f /var/lock/subsys/opensm ] && return 0
# Start opensm
echo -n "Starting IB Subnet Manager: "
[ -n "$PRIORITY" ] && echo -n "Priority=$PRIORITY "
[ -n "$GUIDS" ] && echo -n "$GUID_CNT guids "
[ -n "$CONFIGS" ] && echo -n "$CONFIG_CNT instances "
if [ -n "$GUIDS" ]; then
SUBNET_COUNT=0
for guid in $GUIDS; do
SUBNET_PREFIX=`printf "0xfe8000%02d" $SUBNET_COUNT`
$prog -B $prio -g $guid --subnet_prefix $SUBNET_PREFIX >/dev/null 
2>&1
let SUBNET_COUNT++
done
elif [ -n "$CONFIGS" ]; then
for config in $CONFIGS; do
$prog -B $prio -F $config >/dev/null 2&g

Re: [Patch opensm] Allow for easily configuring multiple fabrics on one opensm server

2012-02-29 Thread Doug Ledford

On 02/29/2012 02:22 PM, Ira Weiny wrote:
> Doug,
> 
> First thanks for this.  Some comments below.
> 
> On Wed, 29 Feb 2012 00:01:16 -0500
> Doug Ledford  wrote:
> 
>> There are two things that stand in the way of opensm being run on
>> redundant fabrics easily:
>>
>> 1) The opensm init script only starts one instance of opensm and opensm
>> will only work on one fabric per instance
>> 2) Even if you start multiple instances, you have to hand modify config
>> files for each instance and then when you upgrade the opensm rpm you
>> either loose your modifications or loose getting new default settings
>>
>> I worked around both of these issues, I've attached the files I used to
>> do so.
>>
>> First, I have an opensm init script that allows starting multiple opensm
>> instances.  It supports configuring this in one of two ways:
>>
>> 1) Create multiple opensm.conf files, each with a numbered suffix (so
>> opensm.conf.1, opensm.conf.2, etc.) and it will start one opensm
>> instance per config file.  This allows an admin to copy the default
>> config over and edit the things they need, and on rpm upgrade there will
>> be a new default opensm.conf file so they can diff between their edited
>> version and the new default and see if there are changes they need to
>> bring back in.  This also allows for complete flexibility in setting up
>> the different fabrics, for instance you could use one type of routing on
>> one and a totally different type on the others.
>>
>> 2) Edit the file /etc/sysconfig/opensm and define more than one GUID in
>> the GUIDs variable.  This will cause the opensm init script to
>> automatically start one instance per GUID, passing the GUID in on the
>> command line.
> 
> I know you are going for ease of use here, which is good, however, I worry 
> about this file becoming a redefinition of opensm.conf.

Hehehe, I don't think you'll ever have to worry about that.  You have
looked at opensm.conf in recent times I take it?  Replacing that with
command line options in a shell startup script isn't reasonable.

However, if you are going to run a redundant fabric setup, then the two
things you *know* you will have to set are the guid and subnet_prefix
(assuming you want to use openmpi).  If you are going to run
master/slave setup, then the one thing you *know* you will have to set
is the priority.  Supporting setting those items in an init script is
reasonable.  Beyond that, I would agree, you should just edit the config
files.


-- 
Doug Ledford 
  GPG KeyID: 0E572FDD
  http://people.redhat.com/dledford




signature.asc
Description: OpenPGP digital signature

Re: [Patch opensm] Allow for easily configuring multiple fabrics on one opensm server

2012-03-01 Thread Doug Ledford

On 02/29/2012 09:15 PM, Brian Ginsbach wrote:
> On Wed, Feb 29, 2012 at 02:47:00PM -0500, Doug Ledford wrote:
>> On 02/29/2012 02:22 PM, Ira Weiny wrote:
>>> Doug,
>>>
>>> First thanks for this.  Some comments below.
>>>
>>> On Wed, 29 Feb 2012 00:01:16 -0500
>>> Doug Ledford  wrote:
>>>
>>>> There are two things that stand in the way of opensm being run on
>>>> redundant fabrics easily:
>>>>
>>>> 1) The opensm init script only starts one instance of opensm and opensm
>>>> will only work on one fabric per instance
>>>> 2) Even if you start multiple instances, you have to hand modify config
>>>> files for each instance and then when you upgrade the opensm rpm you
>>>> either loose your modifications or loose getting new default settings
>>>>
>>>> I worked around both of these issues, I've attached the files I used to
>>>> do so.
>>>>
>>>> First, I have an opensm init script that allows starting multiple opensm
>>>> instances.  It supports configuring this in one of two ways:
>>>>
>>>> 1) Create multiple opensm.conf files, each with a numbered suffix (so
>>>> opensm.conf.1, opensm.conf.2, etc.) and it will start one opensm
>>>> instance per config file.  This allows an admin to copy the default
>>>> config over and edit the things they need, and on rpm upgrade there will
>>>> be a new default opensm.conf file so they can diff between their edited
>>>> version and the new default and see if there are changes they need to
>>>> bring back in.  This also allows for complete flexibility in setting up
>>>> the different fabrics, for instance you could use one type of routing on
>>>> one and a totally different type on the others.
>>>>
>>>> 2) Edit the file /etc/sysconfig/opensm and define more than one GUID in
>>>> the GUIDs variable.  This will cause the opensm init script to
>>>> automatically start one instance per GUID, passing the GUID in on the
>>>> command line.
>>>
>>> I know you are going for ease of use here, which is good, however, I worry 
>>> about this file becoming a redefinition of opensm.conf.
>>
>> Hehehe, I don't think you'll ever have to worry about that.  You have
>> looked at opensm.conf in recent times I take it?  Replacing that with
>> command line options in a shell startup script isn't reasonable.
>>
>> However, if you are going to run a redundant fabric setup, then the two
>> things you *know* you will have to set are the guid and subnet_prefix
>> (assuming you want to use openmpi).  If you are going to run
> 
> Assuming you are doing this for openmpi.  The subnet_prefix should
> not be needed if the separate subnets are for disjoint networks
> (mpi and storage) or multiple storage networks.

True enough, but that's why I said openmpi.  It is, after all, a primary
IB fabric user.

>> master/slave setup, then the one thing you *know* you will have to set
>> is the priority.  Supporting setting those items in an init script is
>> reasonable.  Beyond that, I would agree, you should just edit the config
>> files.
>>
> 
> Not everything can be done in the config files.  I'm not sure that
> it is a good idea to have every opensm instance using the same
> temporary and cache directories (OSM_TMP_DIR and OSM_CACHE_DIR
> environment variables).  Seems like these fall into the *know* you
> will have to set category.

Unless opensm is smart enough to allow more than one instance to open
the same log file and interleave their log messages successfully.
Temporary files or cache files could do something like use a pid suffix
if need be.  But yes, I see your point.  Opensm has lots of junk it
likes to put on the drive :-/

> You'd also want to make sure that other potentially very useful
> things are configured in the config files (e.g. log_file and
> log_prefix).  Aren't these also things you *know* you will have to
> set.

I would say we are simply getting to the point where we *know* we need
opensm to handle more than one fabric from a single instance ;-)

-- 
Doug Ledford 
  GPG KeyID: 0E572FDD
  http://people.redhat.com/dledford




signature.asc
Description: OpenPGP digital signature

Re: [Patch opensm] Allow for easily configuring multiple fabrics on one opensm server

2012-03-02 Thread Doug Ledford

On 3/2/2012 5:30 AM, Alex Netes wrote:
> On 11:22 Wed 29 Feb , Ira Weiny wrote:
>> Doug,
>>
>> First thanks for this.  Some comments below.
>>
>> On Wed, 29 Feb 2012 00:01:16 -0500
>> Doug Ledford  wrote:
>>
>>> There are two things that stand in the way of opensm being run on
>>> redundant fabrics easily:
>>>
>>> 1) The opensm init script only starts one instance of opensm and opensm
>>> will only work on one fabric per instance
>>> 2) Even if you start multiple instances, you have to hand modify config
>>> files for each instance and then when you upgrade the opensm rpm you
>>> either loose your modifications or loose getting new default settings
>>>
>>> I worked around both of these issues, I've attached the files I used to
>>> do so.
>>>
>>> First, I have an opensm init script that allows starting multiple opensm
>>> instances.  It supports configuring this in one of two ways:
>>>
>>> 1) Create multiple opensm.conf files, each with a numbered suffix (so
>>> opensm.conf.1, opensm.conf.2, etc.) and it will start one opensm
>>> instance per config file.  This allows an admin to copy the default
>>> config over and edit the things they need, and on rpm upgrade there will
>>> be a new default opensm.conf file so they can diff between their edited
>>> version and the new default and see if there are changes they need to
>>> bring back in.  This also allows for complete flexibility in setting up
>>> the different fabrics, for instance you could use one type of routing on
>>> one and a totally different type on the others.
>>>
>>> 2) Edit the file /etc/sysconfig/opensm and define more than one GUID in
>>> the GUIDs variable.  This will cause the opensm init script to
>>> automatically start one instance per GUID, passing the GUID in on the
>>> command line.
>>
>> I know you are going for ease of use here, which is good, however, I worry 
>> about this file becoming a redefinition of opensm.conf.
>>
>>>
>>> For the most part, this works well.  However, openmpi in particular
>>> doesn't like you to have physically separate fabrics that have the same
>>> subnet_prefix, and you can't specify a subnet_prefix on the command line
>>> to go along with the GUIDs.  So I wrote a patch for that and made the
>>> init script unilaterally increment the subnet prefix for each different
>>> GUID it's attaching to.
>>
>> If you only allow option 1 above this takes care of itself by making the 
>> admin configure his subnet prefixes in each config file as appropriate.  The 
>> only down side is the loss of new configuration options as you upgrade.  
>> However, that is probably better taken care of by a default config file with 
>> each package.  I mentioned this to Sasha years back and was denied since 
>> "you can always generate a new one with '-c'".  :-(
>>
>> Alex would a default config file be acceptable?  It would mean more work on 
>> your part.
>>
> 
> What the default opensm.conf would be used for? Just as a reference to the
> default values?

No, he's referring to having a default config file that is parsed, then
an override config file that is parsed where you only put options you
want to update in the override config file.  That way you could have,
for instance, a default opensm.conf in the normal location and totally
unedited so that it gets updated with each update of the opensm rpm,
then you could create an opensm.conf.1 that is empty except for just a
guid setting, a subnet_prefix setting, maybe a cache dir setting, etc.
In that way, if say the default routing engine gets a new option in the
future, your override config file won't already be populated with the
old stuff.  It's a means of inheritance that is functionally identical
to specifying all this stuff on the command line, but doesn't require a
huge command line or a complex init script.


-- 
Doug Ledford 
  GPG KeyID: 0E572FDD
  http://people.redhat.com/dledford

Infiniband specific RPMs available at
  http://people.redhat.com/dledford/Infiniband



signature.asc
Description: OpenPGP digital signature

Re: [Patch opensm] Allow for easily configuring multiple fabrics on one opensm server

2012-03-02 Thread Doug Ledford

On 3/2/2012 10:31 AM, Doug Ledford wrote:
> On 3/2/2012 5:30 AM, Alex Netes wrote:
>> What the default opensm.conf would be used for? Just as a reference to the
>> default values?
> 
> No, he's referring to having a default config file that is parsed, then
> an override config file that is parsed where you only put options you
> want to update in the override config file.  That way you could have,
> for instance, a default opensm.conf in the normal location and totally
> unedited so that it gets updated with each update of the opensm rpm,
> then you could create an opensm.conf.1 that is empty except for just a
> guid setting, a subnet_prefix setting, maybe a cache dir setting, etc.
> In that way, if say the default routing engine gets a new option in the
> future, your override config file won't already be populated with the
> old stuff.  It's a means of inheritance that is functionally identical
> to specifying all this stuff on the command line, but doesn't require a
> huge command line or a complex init script.

And for what it's worth, it could be as simply done as the attached
(untested, but compiled) patch.

-- 
Doug Ledford 
  GPG KeyID: 0E572FDD
  http://people.redhat.com/dledford

Infiniband specific RPMs available at
  http://people.redhat.com/dledford/Infiniband
diff -up opensm-3.3.13/opensm/main.c.config opensm-3.3.13/opensm/main.c
--- opensm-3.3.13/opensm/main.c.config  2012-03-02 10:35:26.783996345 -0500
+++ opensm-3.3.13/opensm/main.c 2012-03-02 10:46:33.471939369 -0500
@@ -131,6 +131,13 @@ static void show_usage(void)
   "  The name of the OpenSM config file. When not 
specified\n"
   "  " OSM_DEFAULT_CONFIG_FILE
   " will be used (if exists).\n\n");
+   printf("--extra-config, -E \n"
+  "  The name of an OpenSM config file used to over ride\n"
+  "  the entries in the primary config file.  This is\n"
+  "  useful when you have more than one opensm instance\n"
+  "  to manage and you want them all to have a central,\n"
+  "  shared set of options and you want a second, 
smaller\n"
+  "  config file to hold their fabric specific 
options.\n\n");
printf("--create-config, -c \n"
   "  OpenSM will dump its configuration to the specified 
file and exit.\n"
   "  This is a way to generate OpenSM configuration file 
template.\n\n");
@@ -569,10 +576,10 @@ int main(int argc, char *argv[])
boolean_t run_once_flag = FALSE;
int32_t vendor_debug = 0;
int next_option;
-   char *conf_template = NULL, *config_file = NULL;
+   char *conf_template, *config_file, *extra_config_file;
uint32_t val;
const char *const short_option =
-   
"F:c:i:w:O:f:ed:D:g:l:L:s:t:a:u:m:X:R:zM:U:S:P:Y:ANBIQvVhoryxp:n:q:k:C:G:H:";
+   
"F:E:c:i:w:O:f:ed:D:g:l:L:s:t:a:u:m:X:R:zM:U:S:P:Y:ANBIQvVhoryxp:n:q:k:C:G:H:";
 
/*
   In the array below, the 2nd parameter specifies the number
@@ -584,6 +591,7 @@ int main(int argc, char *argv[])
const struct option long_option[] = {
{"version", 0, NULL, 12},
{"config", 1, NULL, 'F'},
+   {"extra-config", 1, NULL, 'E'},
{"create-config", 1, NULL, 'c'},
{"debug", 1, NULL, 'd'},
{"guid", 1, NULL, 'g'},
@@ -647,6 +655,7 @@ int main(int argc, char *argv[])
{"torus_config", 1, NULL, 10},
{NULL, 0, NULL, 0}  /* Required at the end of the array */
};
+   conf_template = config_file = extra_config_file = NULL;
 
/* force stdout to be line-buffered */
setvbuf(stdout, NULL, _IOLBF, BUFSIZ);
@@ -672,6 +681,11 @@ int main(int argc, char *argv[])
config_file = optarg;
printf("Config file is `%s`:\n", config_file);
break;
+   case 'E':
+   extra_config_file = optarg;
+   printf("Extra Config file is `%s`:\n",
+  extra_config_file);
+   break;
default:
break;
}
@@ -687,6 +701,11 @@ int main(int argc, char *argv[])
if (osm_subn_parse_conf_file(config_file, &opt) < 0)
printf("\nFail to parse config file \'%s\'\n", config_file);
 
+   if (extra_config_file)
+   if (osm_subn_parse_conf_file(extra_config_file, &opt) < 0)
+   printf("\nFailed to parse extra config file `%s`\n",
+  extra_config_file);
+
printf("Command Line Arguments:\n");
do {
next_option = getopt_long_only(argc, argv, short_option,


signature.asc
Description: OpenPGP digital signature

Re: [Patch opensm] Allow for easily configuring multiple fabrics on one opensm server

2012-03-05 Thread Doug Ledford

- Original Message -
> On 3/1/2012 8:31 AM, Doug Ledford wrote:
> > I would say we are simply getting to the point where we *know* we
> > need
> > opensm to handle more than one fabric from a single instance ;-)
> 
> Why does a single OpenSM need to handle multiple subnets/fabrics ?
> What's the issue with running multiple OpenSMs with each handling
> it's
> own subnet/fabric ?

Because it wasn't built with the idea in mind of sharing logs or cache
directories, etc.  Not to mention that the proliferation of files when
you start multiple instances of opensm is pretty insane (have you
counted how many different files opensm can be configured to require
now a days...).

-- 
Doug Ledford 
  GPG KeyID: 0E572FDD
  http://people.redhat.com/dledford

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [Patch opensm] Allow for easily configuring multiple fabrics on one opensm server

2012-03-05 Thread Doug Ledford

- Original Message -
> On 3/5/2012 10:28 AM, Doug Ledford wrote:
> > - Original Message -
> >> On 3/1/2012 8:31 AM, Doug Ledford wrote:
> >>> I would say we are simply getting to the point where we *know* we
> >>> need
> >>> opensm to handle more than one fabric from a single instance ;-)
> >>
> >> Why does a single OpenSM need to handle multiple subnets/fabrics ?
> >> What's the issue with running multiple OpenSMs with each handling
> >> it's
> >> own subnet/fabric ?
> > 
> > Because it wasn't built with the idea in mind of sharing logs or
> > cache
> > directories, etc.
> 
> Yes, each instance has it's own configuration and output files.
> 
> > Not to mention that the proliferation of files when
> > you start multiple instances of opensm is pretty insane (have you
> > counted how many different files opensm can be configured to
> > require
> > now a days...).
> 
> I'm not disagreeing with the fact that there are numerous config
> files
> (the main config file is already large enough without the myriad of
> features with their own separate config files) but different subnets
> have totally different configuration requirements. It's not just the
> subnet prefix, SM priority, and the few other config items that you
> identified.

In some cases, and in other cases the different subnets are perfect mirrors
of each other (redundant, identical fabrics).

> So what would make this better in your mind ?

If I got to wave a magic wand, I would say it's time that opensm
management started getting simpler, not more complex.  With all of
the various options for routing engines and QoS and partitions, it's
gotten to where you need the equivalent of the old Cisco CNA in order
to configure opensm ;-)


-- 
Doug Ledford 
  GPG KeyID: 0E572FDD
  http://people.redhat.com/dledford

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC] Proposal to change Node Description naming scheme for HCA's

2012-04-01 Thread Doug Ledford

I'm fine with this change.  If people agree, I'll "make it so" in our products.

Sent from my ASUS Eee Pad

Ira Weiny  wrote:

First, a question: what package installs the openibd script in OFED?  For the
life of me I can't find this script in 1.5.4.1 or 3.2 ...  :-/  [*]

Right now the "standard" for node description is, AFAIK, " HCA-",
where num is simply a counter for the HCA's as they are found in
/sys/class/infiniband.

The problem is resolving this "random" HCA number to an actual HCA on the
host.  I thought about including the node description in ibstat but that seems
a bit short sighted.  I think the better solution would be to append the hca
name (ie mlx4_X, qibX, etc) to the hostname for the Node Description.

Hacking the RHEL start up script is really easy to do this and results in nice
names on the fabric which are easily resolved by the infiniband-diags,
ibverbs, and perftest utilities.

bash-4.1# ibhosts
Ca  : 0x0002c90300325280 ports 1 "ending mlx4_2"
Ca  : 0x00117579da38 ports 1 "happy qib0"
Ca  : 0x0002c90300108f2e ports 1 "ending mlx4_1"
Ca  : 0x00117577d90e ports 2 "ending qib0"
Ca  : 0x0002c903004bebda ports 2 "happy mlx4_0"
bash-4.1# hostname
happy
bash-4.1# ibstat mlx4_0
CA 'mlx4_0'
CA type: MT26428
Number of ports: 2
Firmware version: 2.8.600
...
bash-4.1# ibv_rc_pingpong -d mlx4_0
  local address:  LID 0x0008, QPN 0x16004a, PSN 0x2e8316
...
bash-4.1# rdma_bw -d mlx4_0
6089: | port=18515 | ib_port=1 | size=65536 | tx_depth=100 | sl=0 | iters=1000 
| duplex=0 | cma=0 |
...

I realize this is really a distro thing but it would be nice if we could agree
to change the current "standard".

I can send a patch for RHEL and OFED (if I someone can point me to the openibd
script or srpm).

Thoughts?
Ira


[*] Last I knew openibd does the same as RHEL's rdma start up script in this 
regard.

-- 
Ira Weiny
Member of Technical Staff
Lawrence Livermore National Lab
925-423-8008
wei...@llnl.gov
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: ibacm fixes/updates

2012-04-05 Thread Doug Ledford

On 04/02/2012 04:49 PM, Hefty, Sean wrote:
>> By default ibacm expects to find its configuration files in /etc/ibacm.
>>  This adds to the proliferation of directories in /etc/ needlessly.  We
>> already have a number of RDMA related directories to choose from
>> depending on your install (OFED == /etc/ofed or /etc/openib in the old
>> days, RHEL5 == /etc/openib from the old days, RHEL6 and Fedora are
>> /etc/rdma, don't know what SuSE uses).  Would be best if these files
>> were in the same place as the other RDMA related files.  And to make
>> that happen easily, it would be best if they picked up on %{sysconfdir}
>> from the ./configure invocation in order to set the directory.  I used a
>> static patch for now because I'm behind on my work, so updating the
>> configure script was more than I had time to do.  I'm not bothering to
>> include my patch as it needs done the other way so my patch is purely a
>> stopgap that should not see the time of day ;-)
> 
> I've moved the config files to /etc/rdma.  I'm still trying to figure out how 
> to make them use %{sysconfdir}, so I can avoid hard coding any paths.
> 
> Note that the config files are NOT needed by default.  If the option config 
> file (acm_opts.cfg) is not found, the ibacm service will simply use the 
> default settings.  The default values are the same as defined in the sample 
> option file.
> 
> The acm_addr.cfg file included with the package is a sample file only.  The 
> address file should be generated by running the ib_acme program on the same 
> node.  Copying the acm_addr.cfg file included with the package will not work, 
> since the addresses must be unique per node.  If no address config file is 
> found, the ibacm service will automatically call ib_acme to generate one.
> 
> So... I think you want to remove the following install lines from your 
> makefile.am patch.
> 
> +install-exec-hook:
> + if ! test -d $(DESTDIR)$(sysconfdir); then \
> + mkdir -p $(DESTDIR)$(sysconfdir); \
> + fi; \
> + install -m 644 acm_opts.cfg $(DESTDIR)$(sysconfdir)/acm_opts.cfg; \
> + install -m 644 acm_addr.cfg $(DESTDIR)$(sysconfdir)/acm_addr.cfg;

OK, will do.


-- 
Doug Ledford 
  GPG KeyID: 0E572FDD
  http://people.redhat.com/dledford




signature.asc
Description: OpenPGP digital signature

Re: [PATCH] ibacm: Fixes to ACM package to support distros

2012-04-05 Thread Doug Ledford

On 04/03/2012 03:10 PM, Hefty, Sean wrote:
> Set of changes to fixup the ibacm package for inclusion into RedHat 6.
> Changes are based on feedback from Doug Ledford .
> These are primarily changes to the build files, along with name changes
> to the man pages and sample configuration files.
> 
> Rename the ib_acm service to match the package name, ibacm.
> 
> Rename the ibacm configuration files to use the prefix 'ibacm' instead
> of 'acm'.  The new sample files are 'ibacm_addr.cfg' and 'ibacm_opts.cfg'.
> 
> Move location of ACM lock and configuration files and ibacm.pid
> files.  They are currently in non-standard locations.
> 
> Modify ibacm and ib_acme to use $sysconfdir and $bindir configure 
> values.  The ibacm_addr.cfg and ibacm_opt.cfg files will now be 
> read/written to $sysconfdir/rdma by default.  And ibacm will execute
> $bindir/ib_acme if it needs to create the ibacm_addr.cfg file.  Without
> $bindir, the ibacm service can fail to launch ib_acme when started
> from an init script.
> 
> Add init script as part of install.  The init script is installed
> into $sysconfdir/init.d.
> 
> Fixup man pages based on changes.
> 
> Signed-off-by: Doug Ledford 
> Signed-off-by: Sean Hefty 
> ---
>  Makefile.am|   30 +
>  acm_addr.cfg   |   24 --
>  acm_opts.cfg   |  130 
> 
>  ibacm.init |  102 
>  ibacm.spec.in  |   68 ++---
>  ibacm_addr.cfg |   24 ++
>  ibacm_opts.cfg |  130 
> 
>  linux/osd.h|   12 -
>  man/ib_acm.1   |  130 
> 
>  man/ib_acm.7   |   33 --
>  man/ib_acme.1  |   14 +++---
>  man/ibacm.1|  130 
> 
>  man/ibacm.7|   31 +
>  src/acm.c  |   17 ---
>  src/acme.c |   16 +++
>  15 files changed, 523 insertions(+), 368 deletions(-)
>  delete mode 100644 acm_addr.cfg
>  delete mode 100644 acm_opts.cfg
>  create mode 100644 ibacm.init
>  create mode 100644 ibacm_addr.cfg
>  create mode 100644 ibacm_opts.cfg
>  delete mode 100644 man/ib_acm.1
>  delete mode 100644 man/ib_acm.7
>  create mode 100644 man/ibacm.1
>  create mode 100644 man/ibacm.7

Thanks for this Sean.  A few last tweaks inline below:

> diff --git a/Makefile.am b/Makefile.am
> index 503ad72..58e3415 100644
> --- a/Makefile.am
> +++ b/Makefile.am
> @@ -1,12 +1,12 @@
>  INCLUDES = -I$(srcdir)/include -I$(srcdir)/linux
>  
> -AM_CFLAGS = -g -Wall -D_GNU_SOURCE
> +AM_CFLAGS = -g -Wall -D_GNU_SOURCE -DSYSCONFDIR=\"${sysconfdir}\" 
> -DBINDIR=\"$(bindir)\"

OK, so you're making is so that the --sysconfdir option to configure
works now, and works via always passing sysconfdir on the build command
line (versus the other option which is to put it into config.h).  Cool.
This is a fine solution IMO.


> -EXTRA_DIST = src/acm_mad.h src/libacm.h \
> -  linux/osd.h linux/dlist.h ibacm.spec.in $(man_MANS) acm_opts.cfg \
> -  acm_addr.cfg
> +EXTRA_DIST = src/acm_mad.h src/libacm.h ibacm.init \
> +  linux/osd.h linux/dlist.h ibacm.spec.in $(man_MANS) ibacm_opts.cfg 
> \
> +  ibacm_addr.cfg
> +
> +install-exec-hook:
> + if ! test -d $(DESTDIR)$(sysconfdir); then \
> + mkdir -p $(DESTDIR)$(sysconfdir); \
> + fi; \
> + if ! test -d $(DESTDIR)$(sysconfdir)/rdma; then \
> + mkdir -p $(DESTDIR)$(sysconfdir)/rdma; \
> + fi; \
> + if ! test -d $(DESTDIR)$(sysconfdir)/init.d; then \
> + mkdir -p $(DESTDIR)$(sysconfdir)/init.d; \
> + fi; \
> + install -m 755 ibacm.init $(DESTDIR)$(sysconfdir)/init.d/ibacmd;

I'm curious why you didn't just do install -D -m 755 ibacm.init
$(DESTDIR)$(sysconfdir)/init.d/ibacmd instead of all the individual
calls to mkdir.

Also, a minor nit, but along the lines of the whole naming thing, I
would name the init script ibacm since the program is ibacm.  Usually
when the program name is used to name the init script, they match
exactly (when possible, but it's a minor issue so I wouldn't change it
once the package has been in use for a while, but that's not the case
here, it's a new package, so we can do that little cleanup with no cost
up until the package gets in wide spread use).


> +PATH=$PATH:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin

I'm not very keen on including /usr/local in the search path on system
binaries.  It can be done, but we would patch it out (and I imagine SuSE
wo

Re: [RFC] Proposal to change Node Description naming scheme for HCA's

2012-04-05 Thread Doug Ledford

On 04/04/2012 01:12 PM, Jason Gunthorpe wrote:
> On Tue, Apr 03, 2012 at 07:14:42PM -0700, Ira Weiny wrote:
>> On Mon, 2 Apr 2012 12:45:45 -0600
>> Jason Gunthorpe  wrote:
>>
>>> On Mon, Apr 02, 2012 at 03:27:35PM +, Heinz, Michael William wrote:
>>>
>>>> Any ideas on how we could solve the hostname problem while we're
>>>> changing the description?
>>>
>>> The node description needs to be set from the DCHP notifier script
>>> chain (eg /etc/network/if-up.d/ on Debian) and also from a udev rule
>>> triggered on device insertion.
>>
>> Jason, I'm confused.  Do you mean DHCP?
> 
> Yes..
> 
>> It seems are you indicating you would like to see if[up/down] bring
>> ports up/down like they do for IP?
> 
> No..
> 
>> On my ubuntu box (the closest thing I have to Debian) it looks like
>> ifupdown owns /etc/network/if-up.d and that is not specific to DHCP.
>> I don't think DHCP should be required for IB.
> 
> if-up.d/ is a '.d' directory, the idea is individual packages drop
> their script files in the directory and the system runs them at
> defined times. No package owns the directory.
> 
> The purpose of placing a hook here is this call path is used when the
> hostname changes via DHCP so you can have a chance to reset the node
> description.
> 
>> Using udev to set this __seems__ like a better idea although I have
>> not prototyped it.
> 
> The purpose of the udev hook path is to set the node description on
> initial device insertion, which may be before, or after the DHCP
> process completes, in such cases.

Well, a udev rule is guaranteed to be before dhcp completes *on that
device*.  It might be that dhcp has completed on another device and that
the other device is actually where the hostname is pulled, but the udev
rule will always be before this given device in the rule has completed dhcp.

> It may also be before or after the openibd script..

Non hot-plug events are going to end up always being before the openibd
script.  This is merely a side effect of the fact that udev start is
called before the normal init script processing is started (this all
changes somewhat in the systemd world though).

> Having init.d
> scripts depend on the ordering of hardware discovery and module
> loading is considered sketchy these days with all the parallel boot
> fancyness and what not.

I would agree with that.  However...

> IMHO we should have a udev rule that also loads the higher level IB
> modules when any RDMA capable device is discovered, including the mlx4
> IB layer, uverbs, umad, etc. This way systems that have RDMA will load
> the right modules and systems that don't, won't. Fully supporting hot
> plug, of course.

One of the niceties of the rdma/openibd init script is the ability to
completely reload the stack.  That goes away if we switch to a udev load
of the stack (well, unless you now create a script in /sbin and have the
udev rule call that script, but the script should no longer be in the
initrddir if it's not an init script).

> This would broadly eliminate the openibd script, integrate more
> correctly with the modern distro world, be better prepared for systemd
> and just be an overall better example for distros to follow :)

This is mostly true, but you do have to sacrifice the one item I listed
above.

>>> It should probably not be set from the openibd script..
>>
>> I agree there might be better ways but I am not sure I follow your
>> proposal.  Furthermore, I don't know that a start up script of some
>> sort is all that evil.
>>
>> Finally, I think Michael brings up a good point about which package
>> should own any such scripts.
> 
> udev is like if-up.d/, there is a rules directory packages can drop
> hook scripts into that run at the appropriate time.

Correct.  On Red Hat Enterprise Linux I could have the rdma package drop
in an /sbin/rdma script that would bring the stack up (and possibly
reload it, but I'm a bit sketchy on that idea given that this would no
longer be an init script but something else), have it drop a small
script in /etc/dhcp/dhclient.d/ to set the node description after dhcp
completes (the script in /sbin would also have to set the node
description from whatever information is available on load just in case
the machine doesn't even use dhcp though), have it drop a rules file in
/etc/udev.d/rules.d for bringing up the stack on device discovery (this
one is a bit tricky though, you basically have to match against all
possible RDMA devices and bring up the stack on the presence of any one
of them, and your script that you call to bring the stack up needs to be
safe against parallel invocations), and then also

Re: [PATCH] ibacm: Fixes to ACM package to support distros

2012-04-05 Thread Doug Ledford

On 04/05/2012 03:10 PM, Hefty, Sean wrote:
> I've incorporated most of the changes, but see below.
> 
>>> -EXTRA_DIST = src/acm_mad.h src/libacm.h \ - linux/osd.h
>>> linux/dlist.h ibacm.spec.in $(man_MANS) acm_opts.cfg \ -
>>> acm_addr.cfg +EXTRA_DIST = src/acm_mad.h src/libacm.h ibacm.init
>>> \ +  linux/osd.h linux/dlist.h ibacm.spec.in $(man_MANS)
>>> ibacm_opts.cfg
>> \
>>> +ibacm_addr.cfg + +install-exec-hook: + if ! test -d
>>> $(DESTDIR)$(sysconfdir); then \ +   mkdir -p
>>> $(DESTDIR)$(sysconfdir); \ +fi; \ + if ! test -d
>>> $(DESTDIR)$(sysconfdir)/rdma; then \ +  mkdir -p
>>> $(DESTDIR)$(sysconfdir)/rdma; \ +   fi; \ + if ! test -d
>>> $(DESTDIR)$(sysconfdir)/init.d; then \ +mkdir -p
>>> $(DESTDIR)$(sysconfdir)/init.d; \ + fi; \ + install -m 755
>>> ibacm.init $(DESTDIR)$(sysconfdir)/init.d/ibacmd;
>> 
>> I'm curious why you didn't just do install -D -m 755 ibacm.init 
>> $(DESTDIR)$(sysconfdir)/init.d/ibacmd instead of all the
>> individual calls to mkdir.
> 
> ignorance on the existence of the '-D' option  :)
> 
>>> +PATH=$PATH:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
>>
>>
>>> 
I'm not very keen on including /usr/local in the search path on system
>> binaries.  It can be done, but we would patch it out (and I imagine
>> SuSE would too).  It's a support issue.  Having /usr/local override
>> the system installed binaries means that you can end up with
>> strange things happening, and support scratching their head and
>> going "what the hell is causing that" and in the end it's because a
>> different binary than the one RPM installed is actually running.
> 
> I create a .tar.gz package using 'make dist', copy it to another
> system, then install it using 'configure && make install'.  When I do
> that, sysconfdir defaults to /usr/local/etc, sbindir /usr/local/sbin,
> and bindir to /usr/local/bin.  I added /usr/local to PATH, so that
> the init script would work.  Otherwise I get a 'command not found'
> error when running the script.  On a side note, my ibacm
> configuration files end up in /usr/local/etc/rdma.

configure --prefix=/usr --sysconfdir=/etc && make && make install

> To say that I'm not sure what autotools and the init scripts are
> doing is an understatement...  The init script finds ibacm in
> /usr/sbin,

Which means some time it got installed there, either by an rpm or
something else.

> but not /usr/local/sbin where it ended up being installed.
> I'm not sure how to fix that.

Do an actual rpm build and install from the rpm.  Or use the command I
listed above to get the configure directories correct (configure
defaults to /usr/local).  But it's generally best to have your files
tracked by rpm.  It makes upgrades and uninstalls work much more
smoothly.  And the rpmbuild program and the macros in the spec file
processing get all your directories correct for you too.  Assuming you
have an rpm based system that you are running from of course.  In that
case, it's fairly simple to do something like the following:

make dist
rpmbuild --rebuild 
rpm -Uvh ~/rpmbuild/RPMS//

You might need this in your ~/.rpmmacros file:

%_topdir %(echo $HOME)/rpmbuild


-- 
Doug Ledford 
  GPG KeyID: 0E572FDD
  http://people.redhat.com/dledford




signature.asc
Description: OpenPGP digital signature

Re: [PATCH] ibacm: Fixes to ACM package to support distros

2012-04-17 Thread Doug Ledford

On 04/05/2012 04:40 PM, Jason Gunthorpe wrote:
> On Thu, Apr 05, 2012 at 07:10:25PM +, Hefty, Sean wrote:
>>>> +PATH=$PATH:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
>>>
>>> I'm not very keen on including /usr/local in the search path on system
>  
>> To say that I'm not sure what autotools and the init scripts are
>> doing is an understatement...  The init script finds ibacm in
>> /usr/sbin, but not /usr/local/sbin where it ended up being
>> installed.  I'm not sure how to fix that.
> 
> In this instance you need to have configure variable substitute your
> init script as well. You should not manipulate PATH from an init.d
> script, and you should use an absolute path to the daemon to avoid
> contamination from the invoking user's environment.
> 
> So you want:
> 
> %prefix$/bin/IBACM_BIN -foo
> 
> (use start-stop-daemon if appropriate for RHEL, I'm not sure what
> their conventions are...)
> 
> There might be a better choice than %prefix% for this, check the
> config.sub output for a better substitution.
> 
> Also, these days it would be very forward-thinking of you to
> contemplate how this will work in a systemd world. Make sure it can be
> demand started, have a daemon that has a running mode that is systemd
> compatible, etc. I think Doug was saying that the latest FC has IB and
> systemd together?

The latest FC has IB and systemd, but IB still uses SysV init scripts
and systemd does the emulation to start IB.  It's on my list of things
ToDo to get it converted to native systemd.  My seemingly never ending
ToDo list :-/


-- 
Doug Ledford 
  GPG KeyID: 0E572FDD
  http://people.redhat.com/dledford




signature.asc
Description: OpenPGP digital signature

Re: [PATCH v2] ibacm: Fixes to ACM package to support distros

2012-04-17 Thread Doug Ledford

On 04/06/2012 01:12 PM, Hefty, Sean wrote:
> ibacm: Fixes to ACM package to support distros
> 
> From: Sean Hefty 
> 
> Set of changes to fixup the ibacm package for inclusion into RedHat 6.
> Changes are based on feedback from Doug Ledford .
> These are primarily changes to the build files, along with name changes
> to the man pages and sample configuration files.
> 
> Rename the ib_acm service to match the package name, ibacm.
> 
> Rename the ibacm configuration files to use the prefix 'ibacm' instead
> of 'acm'.  The new sample files are 'ibacm_addr.cfg' and 'ibacm_opts.cfg'.
> 
> Move location of ACM lock and configuration files and ibacm.pid
> files.  They are currently in non-standard locations.
> 
> Modify ibacm and ib_acme to use $sysconfdir, $bindir, and rdmadir
> configure values.  The ibacm_addr.cfg and ibacm_opt.cfg files will now be
> read/written to $sysconfdir/$rdmadir by default, with rdmadir defaulting
> to 'rdma' if not specified..  And ibacm will execute
> $bindir/ib_acme if it needs to create the ibacm_addr.cfg file.  Without
> $bindir, the ibacm service can fail to launch ib_acme when started
> from an init script.
> 
> Add init script as part of install.  The init script is installed
> into $sysconfdir/init.d.  The init script is processed by configure,
> so that it executes the correct ibacm service that was installed.
> 
> Fixup man pages based on changes.
> 
> Signed-off-by: Doug Ledford 
> Signed-off-by: Sean Hefty 
> ---
> Changes from v1:
> 
> Removed mkdir calls from Makefile and replaced them with -D install option.
> 
> Removed setting PATH in init script.  Init script is now processed by 
> configure
> to handle the case where the daemon is not installed in /usr/sbin.
> Renamed init script to match package / daemon name.
> 
> User can now specify the directory for configuration files by setting
> rdmadir= when running configure.  By default rdmadir=rdma.
> rdmadir is passed to the compiler as a define.
> 
> Fixed other minor issues, like name changes in man pages.

This looks good enough to me Sean.  Ack to the version 2 patch set.


-- 
Doug Ledford 
  GPG KeyID: 0E572FDD
  http://people.redhat.com/dledford




signature.asc
Description: OpenPGP digital signature

[PATCH 0/9] IB/ipoib: fixup multicast locking issues

2015-02-21 Thread Doug Ledford

This is the re-ordered, squashed version of my 22 patch set that I
posted on Feb 11.  There are a few minor differences between that
set and this one.  They are:

1) Rename __ipoib_mcast_continue_join_thread to
   __ipoib_mcast_schedule_join_thread
2) Make __ipoib_mcast_schedule_join_thread cancel any delayed work to
   avoid us accidentally trying to queue the single work struct instance
   twice (which doesn't work)
3) Slight alter layout of __ipoib_mcast_schedule_join_thread.  Logic
   is the same modulo #2, but indenting is reduced and readability
   increased
4) Switch a few instances of FLAG_ADMIN_UP to FLAG_OPER_UP
5) Add a couple missing spinlocks so that we always call the schedule
   helper with the spinlock held
6) Make sure that we only clear the BUSY flag once we have done all the
   other things we are going to do to the mcast entry, and if possible,
   only call complete after we have released the spinlock
7) Fix the usage of time_before_eq when we should have just used
   time_before in ipoib_mcast_join_task
8) Create/destroy priv->wq in a slightly different point of
   ipoib_transport_dev_init/ipoib_transport_dev_cleanup

This entire patchset was intended to address the issue of ipoib
interfaces being brought up/down in a tight loop, which will hardlock
a standard v3.19 kernel.  It succeeds at resolving that problem.  In
order to be sure this patchset does not introduce other problems,
and in order to ensure that this rework of the patches into a new
set does not break bisectability, this entire patchset has been
extensively tested, starting with the first patch and going through
the last.

I used a 12 machine group plus the subnet manager to test these
patches.

1 machine ran ifconfig up/ifconfig down in a tight loop tests
1 machine ran rmmod/insmod ib_ipoib in a loop with a 10 second pause
  between insmod and rmmod
1 machine ran rmmod/insmod ib_ipoib in a tight loop with only a .1
  second pause between insmod and rmmod
9 machines that kept their interfaces up and ran iperf servers, 6 also
  ran ping6 instances to the addresses of all 12 machines, 3 ran iperf
  clients that sent data to all 9 iperf servers in an infinite loop
1 subnet manager machine that otherwise did not participate, but
  during testing was set to restart opensm once every 30 seconds to
  force net re-register events on all 12 machines in the group

In addition to the configuration of various machines above to test
data transfers, the IPoIB infrastructure itself contained several
elements designed to test specific multicast capabilities.

The primary P_Key, the one with the ping6 instances running on it,
intentionally had some well known multicast groups not defined in
order to intentionally cause failed sendonly multicast joins on
the same device that needed to work with IPv6 pings as well as
IPv4 multicast.

One of the alternate P_Key interfaces was defined with a minimum
rate of 56GBit/s, so all machines without 56GBit/s capability
were unable to ever join the broadcast group on these P_Keys.
This was done to make sure that when the broadcast group is not
joined, no other multicast joins, sendonly or otherwise, are ever
sent.  It also was done to make sure that failed attempts to join
the broadcast group honored the backoff delays properly.

Note: both machines that were doing the insmod/rmmod loops were
changed to not have any P_Key interfaces defined other than the
default P_Key interface.  It is known that repeated insmod/rmmod
of the ib_ipoib interface is fragile and easily breaks in the
presence of child interfaces.  It was not my intent to address
that particular problem with this patch set and so to avoid false
issues, children interfaces were removed from the mix on these
machines.

A wide array of hardware was also tested with this 12 machine group,
covering mthca, mlx4, mlx5, and qib hardware.

Patches 1 through 6 were tested without the ifconfig/rmmod/opensm
loops as those particular problems were not expected to be addressed
until patch 7.  Pathes 7 through 9 were tested with all tests.

The final, complete patch set was left running with the various
tests until it had completed 257 opensm restarts, 12052
ifconfig up/ifconfig down loops, 765 10 second insmod/rmmod loops,
and 1971 .1 second insmod/rmmod loops.  The only observed problem
was that the fast insmod/rmmod loop eventually locked up the
network stack on the machine.  It was stuck on a rtnl_lock deadlock,
but not one related to the multicast code (and therefore outside
the scope of these patches to address).  There are several bits of
additional locking to be fixed in the overall ipoib code in relation
to insmod/rmmod races and this patch set does not attempt to address
those.  It merely attempts not to introduce any new issues while
resolving the mcast locking issues related to bringing the interface
up and down.  I feel confident that it does that.

Doug Ledford (9):
  IB/ipoib: factor out ah flushing
  IB/ipoib: change init

[PATCH 5/9] IB/ipoib: Use dedicated workqueues per interface

2015-02-21 Thread Doug Ledford

During my recent work on the rtnl lock deadlock in the IPoIB driver, I
saw that even once I fixed the apparent races for a single device, as
soon as that device had any children, new races popped up.  It turns
out that this is because no matter how well we protect against races
on a single device, the fact that all devices use the same workqueue,
and flush_workqueue() flushes *everything* from that workqueue means
that we would also have to prevent all races between different devices
(for instance, ipoib_mcast_restart_task on interface ib0 can race with
ipoib_mcast_flush_dev on interface ib0.8002, resulting in a deadlock on
the rtnl_lock).

There are several possible solutions to this problem:

Make carrier_on_task and mcast_restart_task try to take the rtnl for
some set period of time and if they fail, then bail.  This runs the
real risk of dropping work on the floor, which can end up being its
own separate kind of deadlock.

Set some global flag in the driver that says some device is in the
middle of going down, letting all tasks know to bail.  Again, this can
drop work on the floor.

Or the method this patch attempts to use, which is when we bring an
interface up, create a workqueue specifically for that interface, so
that when we take it back down, we are flushing only those tasks
associated with our interface.  In addition, keep the global
workqueue, but now limit it to only flush tasks.  In this way, the
flush tasks can always flush the device specific work queues without
having deadlock issues.

Signed-off-by: Doug Ledford 
---
 drivers/infiniband/ulp/ipoib/ipoib.h   |  1 +
 drivers/infiniband/ulp/ipoib/ipoib_cm.c| 18 +++
 drivers/infiniband/ulp/ipoib/ipoib_ib.c|  6 ++---
 drivers/infiniband/ulp/ipoib/ipoib_main.c  | 28 +--
 drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 20 -
 drivers/infiniband/ulp/ipoib/ipoib_verbs.c | 31 +++---
 6 files changed, 66 insertions(+), 38 deletions(-)

diff --git a/drivers/infiniband/ulp/ipoib/ipoib.h 
b/drivers/infiniband/ulp/ipoib/ipoib.h
index d7562beb542..e940cd9f847 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib.h
+++ b/drivers/infiniband/ulp/ipoib/ipoib.h
@@ -317,6 +317,7 @@ struct ipoib_dev_priv {
struct list_head multicast_list;
struct rb_root multicast_tree;
 
+   struct workqueue_struct *wq;
struct delayed_work mcast_task;
struct work_struct carrier_on_task;
struct work_struct flush_light;
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_cm.c 
b/drivers/infiniband/ulp/ipoib/ipoib_cm.c
index 933efcea0d0..56959adb6c7 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_cm.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_cm.c
@@ -474,7 +474,7 @@ static int ipoib_cm_req_handler(struct ib_cm_id *cm_id, 
struct ib_cm_event *even
}
 
spin_lock_irq(&priv->lock);
-   queue_delayed_work(ipoib_workqueue,
+   queue_delayed_work(priv->wq,
   &priv->cm.stale_task, IPOIB_CM_RX_DELAY);
/* Add this entry to passive ids list head, but do not re-add it
 * if IB_EVENT_QP_LAST_WQE_REACHED has moved it to flush list. */
@@ -576,7 +576,7 @@ void ipoib_cm_handle_rx_wc(struct net_device *dev, struct 
ib_wc *wc)
spin_lock_irqsave(&priv->lock, flags);
list_splice_init(&priv->cm.rx_drain_list, 
&priv->cm.rx_reap_list);
ipoib_cm_start_rx_drain(priv);
-   queue_work(ipoib_workqueue, &priv->cm.rx_reap_task);
+   queue_work(priv->wq, &priv->cm.rx_reap_task);
spin_unlock_irqrestore(&priv->lock, flags);
} else
ipoib_warn(priv, "cm recv completion event with wrid %d 
(> %d)\n",
@@ -603,7 +603,7 @@ void ipoib_cm_handle_rx_wc(struct net_device *dev, struct 
ib_wc *wc)
spin_lock_irqsave(&priv->lock, flags);
list_move(&p->list, &priv->cm.rx_reap_list);
spin_unlock_irqrestore(&priv->lock, flags);
-   queue_work(ipoib_workqueue, 
&priv->cm.rx_reap_task);
+   queue_work(priv->wq, &priv->cm.rx_reap_task);
}
return;
}
@@ -827,7 +827,7 @@ void ipoib_cm_handle_tx_wc(struct net_device *dev, struct 
ib_wc *wc)
 
if (test_and_clear_bit(IPOIB_FLAG_INITIALIZED, &tx->flags)) {
list_move(&tx->list, &priv->cm.reap_list);
-   queue_work(ipoib_workqueue, &priv->cm.reap_task);
+   queue_work(priv->wq, &priv->cm.reap_task);
}
 
clear_bit(IPOIB_FLAG_OPER_UP, &tx

[PATCH 4/9] IB/ipoib: Make the carrier_on_task race aware

2015-02-21 Thread Doug Ledford

We blindly assume that we can just take the rtnl lock and that will
prevent races with downing this interface.  Unfortunately, that's not
the case.  In ipoib_mcast_stop_thread() we will call flush_workqueue()
in an attempt to clear out all remaining instances of ipoib_join_task.
But, since this task is put on the same workqueue as the join task,
the flush_workqueue waits on this thread too.  But this thread is
deadlocked on the rtnl lock.  The better thing here is to use trylock
and loop on that until we either get the lock or we see that
FLAG_OPER_UP has been cleared, in which case we don't need to do
anything anyway and we just return.

While investigating which flag should be used, FLAG_ADMIN_UP or
FLAG_OPER_UP, it was determined that FLAG_OPER_UP was the more
appropriate flag to use.  However, there was a mix of these two flags in
use in the existing code.  So while we check for that flag here as part
of this race fix, also cleanup the two places that had used the less
appropriate flag for their tests.

Signed-off-by: Doug Ledford 
---
 drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 25 +
 1 file changed, 17 insertions(+), 8 deletions(-)

diff --git a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c 
b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
index eee66d13e5b..c63a598d0b4 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
@@ -353,18 +353,27 @@ void ipoib_mcast_carrier_on_task(struct work_struct *work)
   carrier_on_task);
struct ib_port_attr attr;
 
-   /*
-* Take rtnl_lock to avoid racing with ipoib_stop() and
-* turning the carrier back on while a device is being
-* removed.
-*/
if (ib_query_port(priv->ca, priv->port, &attr) ||
attr.state != IB_PORT_ACTIVE) {
ipoib_dbg(priv, "Keeping carrier off until IB port is 
active\n");
return;
}
 
-   rtnl_lock();
+   /*
+* Take rtnl_lock to avoid racing with ipoib_stop() and
+* turning the carrier back on while a device is being
+* removed.  However, ipoib_stop() will attempt to flush
+* the workqueue while holding the rtnl lock, so loop
+* on trylock until either we get the lock or we see
+* FLAG_OPER_UP go away as that signals that we are bailing
+* and can safely ignore the carrier on work.
+*/
+   while (!rtnl_trylock()) {
+   if (!test_bit(IPOIB_FLAG_OPER_UP, &priv->flags))
+   return;
+   else
+   msleep(20);
+   }
if (!ipoib_cm_admin_enabled(priv->dev))
dev_set_mtu(priv->dev, min(priv->mcast_mtu, priv->admin_mtu));
netif_carrier_on(priv->dev);
@@ -535,7 +544,7 @@ void ipoib_mcast_join_task(struct work_struct *work)
if (!priv->broadcast) {
struct ipoib_mcast *broadcast;
 
-   if (!test_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags))
+   if (!test_bit(IPOIB_FLAG_OPER_UP, &priv->flags))
return;
 
broadcast = ipoib_mcast_alloc(dev, 1);
@@ -882,7 +891,7 @@ void ipoib_mcast_restart_task(struct work_struct *work)
ipoib_mcast_free(mcast);
}
 
-   if (test_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags))
+   if (test_bit(IPOIB_FLAG_OPER_UP, &priv->flags))
ipoib_mcast_start_thread(dev);
 }
 
-- 
2.1.0

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH 2/9] IB/ipoib: change init sequence ordering

2015-02-21 Thread Doug Ledford

In preparation for using per device work queues, we need to move the
start of the neighbor thread task to after ipoib_ib_dev_init and move
the destruction of the neighbor task to before ipoib_ib_dev_cleanup.
Otherwise we will end up freeing our workqueue with work possibly
still on it.

Signed-off-by: Doug Ledford 
---
 drivers/infiniband/ulp/ipoib/ipoib_main.c | 24 +---
 1 file changed, 17 insertions(+), 7 deletions(-)

diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c 
b/drivers/infiniband/ulp/ipoib/ipoib_main.c
index 58b5aa3b6f2..002ff0da9fa 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_main.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c
@@ -1262,15 +1262,13 @@ int ipoib_dev_init(struct net_device *dev, struct 
ib_device *ca, int port)
 {
struct ipoib_dev_priv *priv = netdev_priv(dev);
 
-   if (ipoib_neigh_hash_init(priv) < 0)
-   goto out;
/* Allocate RX/TX "rings" to hold queued skbs */
priv->rx_ring = kzalloc(ipoib_recvq_size * sizeof *priv->rx_ring,
GFP_KERNEL);
if (!priv->rx_ring) {
printk(KERN_WARNING "%s: failed to allocate RX ring (%d 
entries)\n",
   ca->name, ipoib_recvq_size);
-   goto out_neigh_hash_cleanup;
+   goto out;
}
 
priv->tx_ring = vzalloc(ipoib_sendq_size * sizeof *priv->tx_ring);
@@ -1285,16 +1283,24 @@ int ipoib_dev_init(struct net_device *dev, struct 
ib_device *ca, int port)
if (ipoib_ib_dev_init(dev, ca, port))
goto out_tx_ring_cleanup;
 
+   /*
+* Must be after ipoib_ib_dev_init so we can allocate a per
+* device wq there and use it here
+*/
+   if (ipoib_neigh_hash_init(priv) < 0)
+   goto out_dev_uninit;
+
return 0;
 
+out_dev_uninit:
+   ipoib_ib_dev_cleanup(dev);
+
 out_tx_ring_cleanup:
vfree(priv->tx_ring);
 
 out_rx_ring_cleanup:
kfree(priv->rx_ring);
 
-out_neigh_hash_cleanup:
-   ipoib_neigh_hash_uninit(dev);
 out:
return -ENOMEM;
 }
@@ -1317,6 +1323,12 @@ void ipoib_dev_cleanup(struct net_device *dev)
}
unregister_netdevice_many(&head);
 
+   /*
+* Must be before ipoib_ib_dev_cleanup or we delete an in use
+* work queue
+*/
+   ipoib_neigh_hash_uninit(dev);
+
ipoib_ib_dev_cleanup(dev);
 
kfree(priv->rx_ring);
@@ -1324,8 +1336,6 @@ void ipoib_dev_cleanup(struct net_device *dev)
 
priv->rx_ring = NULL;
priv->tx_ring = NULL;
-
-   ipoib_neigh_hash_uninit(dev);
 }
 
 static const struct header_ops ipoib_header_ops = {
-- 
2.1.0

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH 3/9] IB/ipoib: Consolidate rtnl_lock tasks in workqueue

2015-02-21 Thread Doug Ledford

The ipoib_mcast_flush_dev routine is called with the rtnl_lock held and
needs to keep it held.  It also needs to call flush_workqueue() to flush
out any outstanding work.  In the past, we've had to try and make sure
that we didn't flush out any outstanding join completions because they
also wanted to grab rtnl_lock() and that would deadlock.  It turns out
that the only thing in the join completion handler that needs this lock
can be safely moved to our carrier_on_task, thereby reducing the
potential for the join completion code and the flush code to deadlock
against each other.

Signed-off-by: Doug Ledford 
---
 drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 8 ++--
 1 file changed, 2 insertions(+), 6 deletions(-)

diff --git a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c 
b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
index ffb83b5f7e8..eee66d13e5b 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
@@ -190,12 +190,6 @@ static int ipoib_mcast_join_finish(struct ipoib_mcast 
*mcast,
spin_unlock_irq(&priv->lock);
priv->tx_wr.wr.ud.remote_qkey = priv->qkey;
set_qkey = 1;
-
-   if (!ipoib_cm_admin_enabled(dev)) {
-   rtnl_lock();
-   dev_set_mtu(dev, min(priv->mcast_mtu, priv->admin_mtu));
-   rtnl_unlock();
-   }
}
 
if (!test_bit(IPOIB_MCAST_FLAG_SENDONLY, &mcast->flags)) {
@@ -371,6 +365,8 @@ void ipoib_mcast_carrier_on_task(struct work_struct *work)
}
 
rtnl_lock();
+   if (!ipoib_cm_admin_enabled(priv->dev))
+   dev_set_mtu(priv->dev, min(priv->mcast_mtu, priv->admin_mtu));
netif_carrier_on(priv->dev);
rtnl_unlock();
 }
-- 
2.1.0

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH 1/9] IB/ipoib: factor out ah flushing

2015-02-21 Thread Doug Ledford

Create a an ipoib_flush_ah and ipoib_stop_ah routines to use at
appropriate times to flush out all remaining ah entries before we shut
the device down.

Because neighbors and mcast entries can each have a reference on any
given ah, we must make sure to free all of those first before our ah
will actually have a 0 refcount and be able to be reaped.

This factoring is needed in preparation for having per-device work
queues.  The original per-device workqueue code resulted in the following
error message:

: ib_dealloc_pd failed

That error was tracked down to this issue.  With the changes to which
workqueues were flushed when, there were no flushes of the per device
workqueue after the last ah's were freed, resulting in an attempt to
dealloc the pd with outstanding resources still allocated.  This code
puts the explicit flushes in the needed places to avoid that problem.

Signed-off-by: Doug Ledford 
---
 drivers/infiniband/ulp/ipoib/ipoib_ib.c | 46 -
 1 file changed, 28 insertions(+), 18 deletions(-)

diff --git a/drivers/infiniband/ulp/ipoib/ipoib_ib.c 
b/drivers/infiniband/ulp/ipoib/ipoib_ib.c
index 72626c34817..cb02466a0eb 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_ib.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_ib.c
@@ -659,6 +659,24 @@ void ipoib_reap_ah(struct work_struct *work)
   round_jiffies_relative(HZ));
 }
 
+static void ipoib_flush_ah(struct net_device *dev, int flush)
+{
+   struct ipoib_dev_priv *priv = netdev_priv(dev);
+
+   cancel_delayed_work(&priv->ah_reap_task);
+   if (flush)
+   flush_workqueue(ipoib_workqueue);
+   ipoib_reap_ah(&priv->ah_reap_task.work);
+}
+
+static void ipoib_stop_ah(struct net_device *dev, int flush)
+{
+   struct ipoib_dev_priv *priv = netdev_priv(dev);
+
+   set_bit(IPOIB_STOP_REAPER, &priv->flags);
+   ipoib_flush_ah(dev, flush);
+}
+
 static void ipoib_ib_tx_timer_func(unsigned long ctx)
 {
drain_tx_cq((struct net_device *)ctx);
@@ -877,24 +895,7 @@ timeout:
if (ib_modify_qp(priv->qp, &qp_attr, IB_QP_STATE))
ipoib_warn(priv, "Failed to modify QP to RESET state\n");
 
-   /* Wait for all AHs to be reaped */
-   set_bit(IPOIB_STOP_REAPER, &priv->flags);
-   cancel_delayed_work(&priv->ah_reap_task);
-   if (flush)
-   flush_workqueue(ipoib_workqueue);
-
-   begin = jiffies;
-
-   while (!list_empty(&priv->dead_ahs)) {
-   __ipoib_reap_ah(dev);
-
-   if (time_after(jiffies, begin + HZ)) {
-   ipoib_warn(priv, "timing out; will leak address 
handles\n");
-   break;
-   }
-
-   msleep(1);
-   }
+   ipoib_flush_ah(dev, flush);
 
ib_req_notify_cq(priv->recv_cq, IB_CQ_NEXT_COMP);
 
@@ -1037,6 +1038,7 @@ static void __ipoib_ib_dev_flush(struct ipoib_dev_priv 
*priv,
if (level == IPOIB_FLUSH_LIGHT) {
ipoib_mark_paths_invalid(dev);
ipoib_mcast_dev_flush(dev);
+   ipoib_flush_ah(dev, 0);
}
 
if (level >= IPOIB_FLUSH_NORMAL)
@@ -1100,6 +1102,14 @@ void ipoib_ib_dev_cleanup(struct net_device *dev)
ipoib_mcast_stop_thread(dev, 1);
ipoib_mcast_dev_flush(dev);
 
+   /*
+* All of our ah references aren't free until after
+* ipoib_mcast_dev_flush(), ipoib_flush_paths, and
+* the neighbor garbage collection is stopped and reaped.
+* That should all be done now, so make a final ah flush.
+*/
+   ipoib_stop_ah(dev, 1);
+
ipoib_transport_dev_cleanup(dev);
 }
 
-- 
2.1.0

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH 6/9] IB/ipoib: No longer use flush as a parameter

2015-02-21 Thread Doug Ledford

Various places in the IPoIB code had a deadlock related to flushing
the ipoib workqueue.  Now that we have per device workqueues and a
specific flush workqueue, there is no longer a deadlock issue with
flushing the device specific workqueues and we can do so unilaterally.

Signed-off-by: Doug Ledford 
---
 drivers/infiniband/ulp/ipoib/ipoib.h   |  8 +++---
 drivers/infiniband/ulp/ipoib/ipoib_ib.c| 35 +-
 drivers/infiniband/ulp/ipoib/ipoib_main.c  |  8 +++---
 drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 18 ++---
 4 files changed, 39 insertions(+), 30 deletions(-)

diff --git a/drivers/infiniband/ulp/ipoib/ipoib.h 
b/drivers/infiniband/ulp/ipoib/ipoib.h
index e940cd9f847..9ef432ae72e 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib.h
+++ b/drivers/infiniband/ulp/ipoib/ipoib.h
@@ -478,10 +478,10 @@ void ipoib_ib_dev_flush_heavy(struct work_struct *work);
 void ipoib_pkey_event(struct work_struct *work);
 void ipoib_ib_dev_cleanup(struct net_device *dev);
 
-int ipoib_ib_dev_open(struct net_device *dev, int flush);
+int ipoib_ib_dev_open(struct net_device *dev);
 int ipoib_ib_dev_up(struct net_device *dev);
-int ipoib_ib_dev_down(struct net_device *dev, int flush);
-int ipoib_ib_dev_stop(struct net_device *dev, int flush);
+int ipoib_ib_dev_down(struct net_device *dev);
+int ipoib_ib_dev_stop(struct net_device *dev);
 void ipoib_pkey_dev_check_presence(struct net_device *dev);
 
 int ipoib_dev_init(struct net_device *dev, struct ib_device *ca, int port);
@@ -493,7 +493,7 @@ void ipoib_mcast_send(struct net_device *dev, u8 *daddr, 
struct sk_buff *skb);
 
 void ipoib_mcast_restart_task(struct work_struct *work);
 int ipoib_mcast_start_thread(struct net_device *dev);
-int ipoib_mcast_stop_thread(struct net_device *dev, int flush);
+int ipoib_mcast_stop_thread(struct net_device *dev);
 
 void ipoib_mcast_dev_down(struct net_device *dev);
 void ipoib_mcast_dev_flush(struct net_device *dev);
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_ib.c 
b/drivers/infiniband/ulp/ipoib/ipoib_ib.c
index 2a56b7a11a9..e144d07d53c 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_ib.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_ib.c
@@ -659,22 +659,21 @@ void ipoib_reap_ah(struct work_struct *work)
   round_jiffies_relative(HZ));
 }
 
-static void ipoib_flush_ah(struct net_device *dev, int flush)
+static void ipoib_flush_ah(struct net_device *dev)
 {
struct ipoib_dev_priv *priv = netdev_priv(dev);
 
cancel_delayed_work(&priv->ah_reap_task);
-   if (flush)
-   flush_workqueue(priv->wq);
+   flush_workqueue(priv->wq);
ipoib_reap_ah(&priv->ah_reap_task.work);
 }
 
-static void ipoib_stop_ah(struct net_device *dev, int flush)
+static void ipoib_stop_ah(struct net_device *dev)
 {
struct ipoib_dev_priv *priv = netdev_priv(dev);
 
set_bit(IPOIB_STOP_REAPER, &priv->flags);
-   ipoib_flush_ah(dev, flush);
+   ipoib_flush_ah(dev);
 }
 
 static void ipoib_ib_tx_timer_func(unsigned long ctx)
@@ -682,7 +681,7 @@ static void ipoib_ib_tx_timer_func(unsigned long ctx)
drain_tx_cq((struct net_device *)ctx);
 }
 
-int ipoib_ib_dev_open(struct net_device *dev, int flush)
+int ipoib_ib_dev_open(struct net_device *dev)
 {
struct ipoib_dev_priv *priv = netdev_priv(dev);
int ret;
@@ -724,7 +723,7 @@ int ipoib_ib_dev_open(struct net_device *dev, int flush)
 dev_stop:
if (!test_and_set_bit(IPOIB_FLAG_INITIALIZED, &priv->flags))
napi_enable(&priv->napi);
-   ipoib_ib_dev_stop(dev, flush);
+   ipoib_ib_dev_stop(dev);
return -1;
 }
 
@@ -756,7 +755,7 @@ int ipoib_ib_dev_up(struct net_device *dev)
return ipoib_mcast_start_thread(dev);
 }
 
-int ipoib_ib_dev_down(struct net_device *dev, int flush)
+int ipoib_ib_dev_down(struct net_device *dev)
 {
struct ipoib_dev_priv *priv = netdev_priv(dev);
 
@@ -765,7 +764,7 @@ int ipoib_ib_dev_down(struct net_device *dev, int flush)
clear_bit(IPOIB_FLAG_OPER_UP, &priv->flags);
netif_carrier_off(dev);
 
-   ipoib_mcast_stop_thread(dev, flush);
+   ipoib_mcast_stop_thread(dev);
ipoib_mcast_dev_flush(dev);
 
ipoib_flush_paths(dev);
@@ -825,7 +824,7 @@ void ipoib_drain_cq(struct net_device *dev)
local_bh_enable();
 }
 
-int ipoib_ib_dev_stop(struct net_device *dev, int flush)
+int ipoib_ib_dev_stop(struct net_device *dev)
 {
struct ipoib_dev_priv *priv = netdev_priv(dev);
struct ib_qp_attr qp_attr;
@@ -895,7 +894,7 @@ timeout:
if (ib_modify_qp(priv->qp, &qp_attr, IB_QP_STATE))
ipoib_warn(priv, "Failed to modify QP to RESET state\n");
 
-   ipoib_flush_ah(dev, flush);
+   ipoib_flush_ah(dev);
 
ib_req_notify_cq(priv->recv_cq, IB_CQ_NEXT_COMP);
 
@@ -919,7 +918,7 @@ int ipoib_ib_dev_init(struct net_device *dev, struct 
ib_device *

[PATCH 7/9] IB/ipoib: fix MCAST_FLAG_BUSY usage

2015-02-21 Thread Doug Ledford

Commit a9c8ba5884 ("IPoIB: Fix usage of uninitialized multicast
objects") added a new flag MCAST_JOIN_STARTED, but was not very strict
in how it was used.  We didn't always initialize the completion struct
before we set the flag, and we didn't always call complete on the
completion struct from all paths that complete it.  And when we did
complete it, sometimes we continued to touch the mcast entry after
the completion, opening us up to possible use after free issues.

This made it less than totally effective, and certainly made its use
confusing.  And in the flush function we would use the presence of this
flag to signal that we should wait on the completion struct, but we never
cleared this flag, ever.

In order to make things clearer and aid in resolving the rtnl deadlock
bug I've been chasing, I cleaned this up a bit.

 1) Remove the MCAST_JOIN_STARTED flag entirely
 2) Change MCAST_FLAG_BUSY so it now only means a join is in-flight
 3) Test mcast->mc directly to see if we have completed
ib_sa_join_multicast (using IS_ERR_OR_NULL)
 4) Make sure that before setting MCAST_FLAG_BUSY we always initialize
the mcast->done completion struct
 5) Make sure that before calling complete(&mcast->done), we always clear
the MCAST_FLAG_BUSY bit
 6) Take the mcast_mutex before we call ib_sa_multicast_join and also
take the mutex in our join callback.  This forces
ib_sa_multicast_join to return and set mcast->mc before we process
the callback.  This way, our callback can safely clear mcast->mc
if there is an error on the join and we will do the right thing as
a result in mcast_dev_flush.
 7) Because we need the mutex to synchronize mcast->mc, we can no
longer call mcast_sendonly_join directly from mcast_send and
instead must add sendonly join processing to the mcast_join_task
 8) Make MCAST_RUN mean that we have a working mcast subsystem, not that
we have a running task.  We know when we need to reschedule our
join task thread and don't need a flag to tell us.
 9) Add a helper for rescheduling the join task thread

A number of different races are resolved with these changes.  These
races existed with the old MCAST_FLAG_BUSY usage, the
MCAST_JOIN_STARTED flag was an attempt to address them, and while it
helped, a determined effort could still trip things up.

One race looks something like this:

Thread 1 Thread 2
ib_sa_join_multicast (as part of running restart mcast task)
  alloc member
  call callback
 ifconfig ib0 down
 wait_for_completion
callback call completes
 wait_for_completion in
 mcast_dev_flush completes
   mcast->mc is PTR_ERR_OR_NULL
   so we skip ib_sa_leave_multicast
return from callback
  return from ib_sa_join_multicast
set mcast->mc = return from ib_sa_multicast

We now have a permanently unbalanced join/leave issue that trips up the
refcounting in core/multicast.c

Another like this:

Thread 1   Thread 2 Thread 3
ib_sa_multicast_join
ifconfig ib0 down
priv->broadcast = NULL
   join_complete
wait_for_completion
   mcast->mc is not yet set, so don't clear
return from ib_sa_join_multicast and set mcast->mc
   complete
   return -EAGAIN (making mcast->mc invalid)
call ib_sa_multicast_leave
on invalid mcast->mc, hang
forever

By holding the mutex around ib_sa_multicast_join and taking the mutex
early in the callback, we force mcast->mc to be valid at the time we
run the callback.  This allows us to clear mcast->mc if there is an
error and the join is going to fail.  We do this before we complete
the mcast.  In this way, mcast_dev_flush always sees consistent state
in regards to mcast->mc membership at the time that the
wait_for_completion() returns.

Signed-off-by: Doug Ledford 
---
 drivers/infiniband/ulp/ipoib/ipoib.h   |  11 +-
 drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 355 -
 2 files changed, 238 insertions(+), 128 deletions(-)

diff --git a/drivers/infiniband/ulp/ipoib/ipoib.h 
b/drivers/infiniband/ulp/ipoib/ipoib.h
index 9ef432ae72e..c79dcd5ee8a 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib.h
+++ b/drivers/infiniband/ulp/ipoib/ipoib.h
@@ -98,9 +98,15 @@ enum {
 
IPOIB_MCAST_FLAG_FOUND= 0,  /* used in set_multicast_list */
IPOIB_MCAST_FLAG_SENDONLY = 1,
-   IP

[PATCH 8/9] IB/ipoib: deserialize multicast joins

2015-02-21 Thread Doug Ledford

Allow the ipoib layer to attempt to join all outstanding multicast
groups at once.  The ib_sa layer will serialize multiple attempts to
join the same group, but will process attempts to join different groups
in parallel.  Take advantage of that.

In order to make this happen, change the mcast_join_thread to loop
through all needed joins, sending a join request for each one that we
still need to join.  There are a few special cases we handle though:

1) Don't attempt to join anything but the broadcast group until the join
of the broadcast group has succeeded.
2) No longer restart the join task at the end of completion handling.
If we completed successfully, we are done.  The join task now needs kicked
either by mcast_send or mcast_restart_task or mcast_start_thread, but
should not need started anytime else except when scheduling a backoff
attempt to rejoin.
3) No longer use separate join/completion routines for regular and
sendonly joins, pass them all through the same routine and just do the
right thing based on the SENDONLY join flag.
4) Only try to join a SENDONLY join twice, then drop the packets and
quit trying.  We leave the mcast group in the list so that if we get a
new packet, all that we have to do is queue up the packet and restart
the join task and it will automatically try to join twice and then
either send or flush the queue again.

Signed-off-by: Doug Ledford 
---
 drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 250 -
 1 file changed, 82 insertions(+), 168 deletions(-)

diff --git a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c 
b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
index 277e7ac7c4d..c670d9c2cda 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
@@ -307,111 +307,6 @@ static int ipoib_mcast_join_finish(struct ipoib_mcast 
*mcast,
return 0;
 }
 
-static int
-ipoib_mcast_sendonly_join_complete(int status,
-  struct ib_sa_multicast *multicast)
-{
-   struct ipoib_mcast *mcast = multicast->context;
-   struct net_device *dev = mcast->dev;
-   struct ipoib_dev_priv *priv = netdev_priv(dev);
-
-   /*
-* We have to take the mutex to force mcast_sendonly_join to
-* return from ib_sa_multicast_join and set mcast->mc to a
-* valid value.  Otherwise we were racing with ourselves in
-* that we might fail here, but get a valid return from
-* ib_sa_multicast_join after we had cleared mcast->mc here,
-* resulting in mis-matched joins and leaves and a deadlock
-*/
-   mutex_lock(&mcast_mutex);
-
-   /* We trap for port events ourselves. */
-   if (status == -ENETRESET) {
-   status = 0;
-   goto out;
-   }
-
-   if (!status)
-   status = ipoib_mcast_join_finish(mcast, &multicast->rec);
-
-   if (status) {
-   if (mcast->logcount++ < 20)
-   ipoib_dbg_mcast(netdev_priv(dev), "sendonly multicast "
-   "join failed for %pI6, status %d\n",
-   mcast->mcmember.mgid.raw, status);
-
-   /* Flush out any queued packets */
-   netif_tx_lock_bh(dev);
-   while (!skb_queue_empty(&mcast->pkt_queue)) {
-   ++dev->stats.tx_dropped;
-   dev_kfree_skb_any(skb_dequeue(&mcast->pkt_queue));
-   }
-   netif_tx_unlock_bh(dev);
-   __ipoib_mcast_schedule_join_thread(priv, mcast, 1);
-   } else {
-   mcast->backoff = 1;
-   mcast->delay_until = jiffies;
-   __ipoib_mcast_schedule_join_thread(priv, NULL, 0);
-   }
-out:
-   clear_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags);
-   if (status)
-   mcast->mc = NULL;
-   complete(&mcast->done);
-   mutex_unlock(&mcast_mutex);
-   return status;
-}
-
-static int ipoib_mcast_sendonly_join(struct ipoib_mcast *mcast)
-{
-   struct net_device *dev = mcast->dev;
-   struct ipoib_dev_priv *priv = netdev_priv(dev);
-   struct ib_sa_mcmember_rec rec = {
-#if 0  /* Some SMs don't support send-only yet */
-   .join_state = 4
-#else
-   .join_state = 1
-#endif
-   };
-   int ret = 0;
-
-   if (!test_bit(IPOIB_FLAG_OPER_UP, &priv->flags)) {
-   ipoib_dbg_mcast(priv, "device shutting down, no sendonly "
-   "multicast joins\n");
-   clear_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags);
-   complete(&mcast->done);
-   return -ENODEV;
-   }
-
-   rec.mgid = mcast->mcmember.mgid;
-   rec.port_gid = priv->local_

[PATCH 9/9] IB/ipoib: drop mcast_mutex usage

2015-02-21 Thread Doug Ledford

We needed the mcast_mutex when we had to prevent the join completion
callback from having the value it stored in mcast->mc overwritten
by a delayed return from ib_sa_join_multicast.  By storing the return
of ib_sa_join_multicast in an intermediate variable, we prevent a
delayed return from ib_sa_join_multicast overwriting the valid
contents of mcast->mc, and we no longer need a mutex to force the
join callback to run after the return of ib_sa_join_multicast.  This
allows us to do away with the mutex entirely and protect our critical
sections with a just a spinlock instead.  This is highly desirable
as there were some places where we couldn't use a mutex because the
code was not allowed to sleep, and so we were currently using a mix
of mutex and spinlock to protect what we needed to protect.  Now we
only have a spin lock and the locking complexity is greatly reduced.

Signed-off-by: Doug Ledford 
---
 drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 70 --
 1 file changed, 32 insertions(+), 38 deletions(-)

diff --git a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c 
b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
index c670d9c2cda..3203ebe9b10 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
@@ -55,8 +55,6 @@ MODULE_PARM_DESC(mcast_debug_level,
 "Enable multicast debug tracing if > 0");
 #endif
 
-static DEFINE_MUTEX(mcast_mutex);
-
 struct ipoib_mcast_iter {
struct net_device *dev;
union ib_gid   mgid;
@@ -67,7 +65,7 @@ struct ipoib_mcast_iter {
 };
 
 /*
- * This should be called with the mcast_mutex held
+ * This should be called with the priv->lock held
  */
 static void __ipoib_mcast_schedule_join_thread(struct ipoib_dev_priv *priv,
   struct ipoib_mcast *mcast,
@@ -352,16 +350,6 @@ static int ipoib_mcast_join_complete(int status,
"sendonly " : "",
mcast->mcmember.mgid.raw, status);
 
-   /*
-* We have to take the mutex to force mcast_join to
-* return from ib_sa_multicast_join and set mcast->mc to a
-* valid value.  Otherwise we were racing with ourselves in
-* that we might fail here, but get a valid return from
-* ib_sa_multicast_join after we had cleared mcast->mc here,
-* resulting in mis-matched joins and leaves and a deadlock
-*/
-   mutex_lock(&mcast_mutex);
-
/* We trap for port events ourselves. */
if (status == -ENETRESET) {
status = 0;
@@ -383,8 +371,10 @@ static int ipoib_mcast_join_complete(int status,
 * send out all of the non-broadcast joins
 */
if (mcast == priv->broadcast) {
+   spin_lock_irq(&priv->lock);
queue_work(priv->wq, &priv->carrier_on_task);
__ipoib_mcast_schedule_join_thread(priv, NULL, 0);
+   goto out_locked;
}
} else {
if (mcast->logcount++ < 20) {
@@ -417,16 +407,28 @@ static int ipoib_mcast_join_complete(int status,

dev_kfree_skb_any(skb_dequeue(&mcast->pkt_queue));
}
netif_tx_unlock_bh(dev);
-   } else
+   } else {
+   spin_lock_irq(&priv->lock);
/* Requeue this join task with a backoff delay */
__ipoib_mcast_schedule_join_thread(priv, mcast, 1);
+   goto out_locked;
+   }
}
 out:
-   clear_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags);
+   spin_lock_irq(&priv->lock);
+out_locked:
+   /*
+* Make sure to set mcast->mc before we clear the busy flag to avoid
+* racing with code that checks for BUSY before checking mcast->mc
+*/
if (status)
mcast->mc = NULL;
+   else
+   mcast->mc = multicast;
+   clear_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags);
+   spin_unlock_irq(&priv->lock);
complete(&mcast->done);
-   mutex_unlock(&mcast_mutex);
+
return status;
 }
 
@@ -434,6 +436,7 @@ static void ipoib_mcast_join(struct net_device *dev, struct 
ipoib_mcast *mcast,
 int create)
 {
struct ipoib_dev_priv *priv = netdev_priv(dev);
+   struct ib_sa_multicast *multicast;
struct ib_sa_mcmember_rec rec = {
.join_state = 1
};
@@ -475,18 +478,19 @@ static void ipoib_mcast_join(struct net_device *dev, 
struct ipoib_mcast *mcast,
rec.hop_limit = priv->broadcast->mcmember.hop_limit;
}
 
-   mutex_lock(&mcast_mutex);

Re: [PATCH 0/9] IB/ipoib: fixup multicast locking issues

2015-02-22 Thread Doug Ledford

On Sun, 2015-02-22 at 23:34 +0200, Or Gerlitz wrote:
> On Sun, Feb 22, 2015 at 2:26 AM, Doug Ledford  wrote:
> > This is the re-ordered, squashed version of my 22 patch set that I
> > posted on Feb 11.  There are a few minor differences between that
> > set and this one.
> 
> Hi Doug,
> 
> I took quick look on your git repo @
> git://github.com/dledford/linux.git and it seems not to contain this
> series, can you please get that there and tell what branch to pull?

It's there now, branch for-3.20-squashed.

> Or.
> 
> > They are:
> > 1) Rename __ipoib_mcast_continue_join_thread to
> >__ipoib_mcast_schedule_join_thread
> > 2) Make __ipoib_mcast_schedule_join_thread cancel any delayed work to
> >avoid us accidentally trying to queue the single work struct instance
> >twice (which doesn't work)
> > 3) Slight alter layout of __ipoib_mcast_schedule_join_thread.  Logic
> >is the same modulo #2, but indenting is reduced and readability
> >increased
> > 4) Switch a few instances of FLAG_ADMIN_UP to FLAG_OPER_UP
> > 5) Add a couple missing spinlocks so that we always call the schedule
> >helper with the spinlock held
> > 6) Make sure that we only clear the BUSY flag once we have done all the
> >other things we are going to do to the mcast entry, and if possible,
> >only call complete after we have released the spinlock
> > 7) Fix the usage of time_before_eq when we should have just used
> >time_before in ipoib_mcast_join_task
> > 8) Create/destroy priv->wq in a slightly different point of
> >ipoib_transport_dev_init/ipoib_transport_dev_cleanup
> >
> > This entire patchset was intended to address the issue of ipoib
> > interfaces being brought up/down in a tight loop, which will hardlock
> > a standard v3.19 kernel.  It succeeds at resolving that problem.  In
> > order to be sure this patchset does not introduce other problems,
> > and in order to ensure that this rework of the patches into a new
> > set does not break bisectability, this entire patchset has been
> > extensively tested, starting with the first patch and going through
> > the last.
> >
> > I used a 12 machine group plus the subnet manager to test these
> > patches.
> >
> > 1 machine ran ifconfig up/ifconfig down in a tight loop tests
> > 1 machine ran rmmod/insmod ib_ipoib in a loop with a 10 second pause
> >   between insmod and rmmod
> > 1 machine ran rmmod/insmod ib_ipoib in a tight loop with only a .1
> >   second pause between insmod and rmmod
> > 9 machines that kept their interfaces up and ran iperf servers, 6 also
> >   ran ping6 instances to the addresses of all 12 machines, 3 ran iperf
> >   clients that sent data to all 9 iperf servers in an infinite loop
> > 1 subnet manager machine that otherwise did not participate, but
> >   during testing was set to restart opensm once every 30 seconds to
> >   force net re-register events on all 12 machines in the group
> >
> > In addition to the configuration of various machines above to test
> > data transfers, the IPoIB infrastructure itself contained several
> > elements designed to test specific multicast capabilities.
> >
> > The primary P_Key, the one with the ping6 instances running on it,
> > intentionally had some well known multicast groups not defined in
> > order to intentionally cause failed sendonly multicast joins on
> > the same device that needed to work with IPv6 pings as well as
> > IPv4 multicast.
> >
> > One of the alternate P_Key interfaces was defined with a minimum
> > rate of 56GBit/s, so all machines without 56GBit/s capability
> > were unable to ever join the broadcast group on these P_Keys.
> > This was done to make sure that when the broadcast group is not
> > joined, no other multicast joins, sendonly or otherwise, are ever
> > sent.  It also was done to make sure that failed attempts to join
> > the broadcast group honored the backoff delays properly.
> >
> > Note: both machines that were doing the insmod/rmmod loops were
> > changed to not have any P_Key interfaces defined other than the
> > default P_Key interface.  It is known that repeated insmod/rmmod
> > of the ib_ipoib interface is fragile and easily breaks in the
> > presence of child interfaces.  It was not my intent to address
> > that particular problem with this patch set and so to avoid false
> > issues, children interfaces were removed from the mix on these
> > machines.
> >
> > A wide array of hardware was also tested with this 12 machine group,
> > covering mthca, mlx4, mlx5,

Re: [PATCH 0/9] IB/ipoib: fixup multicast locking issues

2015-02-22 Thread Doug Ledford

On Sun, 2015-02-22 at 16:56 -0500, Doug Ledford wrote:
> On Sun, 2015-02-22 at 23:34 +0200, Or Gerlitz wrote:
> > On Sun, Feb 22, 2015 at 2:26 AM, Doug Ledford  wrote:
> > > This is the re-ordered, squashed version of my 22 patch set that I
> > > posted on Feb 11.  There are a few minor differences between that
> > > set and this one.
> > 
> > Hi Doug,
> > 
> > I took quick look on your git repo @
> > git://github.com/dledford/linux.git and it seems not to contain this
> > series, can you please get that there and tell what branch to pull?
> 
> It's there now, branch for-3.20-squashed.

Also, git diff for-3.20..for-3.20-squashed will show the exact
differences between this 9 patch set and the previous 22 patch set.

> > Or.
> > 
> > > They are:
> > > 1) Rename __ipoib_mcast_continue_join_thread to
> > >__ipoib_mcast_schedule_join_thread
> > > 2) Make __ipoib_mcast_schedule_join_thread cancel any delayed work to
> > >avoid us accidentally trying to queue the single work struct instance
> > >twice (which doesn't work)
> > > 3) Slight alter layout of __ipoib_mcast_schedule_join_thread.  Logic
> > >is the same modulo #2, but indenting is reduced and readability
> > >increased
> > > 4) Switch a few instances of FLAG_ADMIN_UP to FLAG_OPER_UP
> > > 5) Add a couple missing spinlocks so that we always call the schedule
> > >helper with the spinlock held
> > > 6) Make sure that we only clear the BUSY flag once we have done all the
> > >other things we are going to do to the mcast entry, and if possible,
> > >only call complete after we have released the spinlock
> > > 7) Fix the usage of time_before_eq when we should have just used
> > >time_before in ipoib_mcast_join_task
> > > 8) Create/destroy priv->wq in a slightly different point of
> > >ipoib_transport_dev_init/ipoib_transport_dev_cleanup
> > >
> > > This entire patchset was intended to address the issue of ipoib
> > > interfaces being brought up/down in a tight loop, which will hardlock
> > > a standard v3.19 kernel.  It succeeds at resolving that problem.  In
> > > order to be sure this patchset does not introduce other problems,
> > > and in order to ensure that this rework of the patches into a new
> > > set does not break bisectability, this entire patchset has been
> > > extensively tested, starting with the first patch and going through
> > > the last.
> > >
> > > I used a 12 machine group plus the subnet manager to test these
> > > patches.
> > >
> > > 1 machine ran ifconfig up/ifconfig down in a tight loop tests
> > > 1 machine ran rmmod/insmod ib_ipoib in a loop with a 10 second pause
> > >   between insmod and rmmod
> > > 1 machine ran rmmod/insmod ib_ipoib in a tight loop with only a .1
> > >   second pause between insmod and rmmod
> > > 9 machines that kept their interfaces up and ran iperf servers, 6 also
> > >   ran ping6 instances to the addresses of all 12 machines, 3 ran iperf
> > >   clients that sent data to all 9 iperf servers in an infinite loop
> > > 1 subnet manager machine that otherwise did not participate, but
> > >   during testing was set to restart opensm once every 30 seconds to
> > >   force net re-register events on all 12 machines in the group
> > >
> > > In addition to the configuration of various machines above to test
> > > data transfers, the IPoIB infrastructure itself contained several
> > > elements designed to test specific multicast capabilities.
> > >
> > > The primary P_Key, the one with the ping6 instances running on it,
> > > intentionally had some well known multicast groups not defined in
> > > order to intentionally cause failed sendonly multicast joins on
> > > the same device that needed to work with IPv6 pings as well as
> > > IPv4 multicast.
> > >
> > > One of the alternate P_Key interfaces was defined with a minimum
> > > rate of 56GBit/s, so all machines without 56GBit/s capability
> > > were unable to ever join the broadcast group on these P_Keys.
> > > This was done to make sure that when the broadcast group is not
> > > joined, no other multicast joins, sendonly or otherwise, are ever
> > > sent.  It also was done to make sure that failed attempts to join
> > > the broadcast group honored the backoff delays properly.
> > >
> > > Note: both machines that were doing the insmod/rmmod loops were
> &g

Re: [PATCH 9/9] IB/ipoib: drop mcast_mutex usage

2015-02-23 Thread Doug Ledford

On Mon, 2015-02-23 at 18:56 +0200, Or Gerlitz wrote:
> On Sun, Feb 22, 2015 at 2:27 AM, Doug Ledford  wrote:
> > We needed the mcast_mutex when we had to prevent the join completion
> > callback from having the value it stored in mcast->mc overwritten
> 
> downstream patches of this series (7/9 and 8/9) make pretty much heavy
> usage of the mcast_mutex (e.g add/delete lines that use it), and patch
> 9/9 removes it altogether.. which would be very confusing for
> maintaining purposes. Is there a sane way to avoid that?!

No.  The changes that make dropping the mutex possible are part of patch
7.  Patch 7 changes the semantics of the MCAST_FLAG_BUSY usage, and
fixes some locking bugs, but that's different than wholesale changing of
the locking type.  If you want to preserve bisecability and be able to
test the semantic changes to the FLAG_BUSY usage separate from the
changes to the locking type, then they have to be separated.  So, for
the sake of good engineering practices and separation of distinctly
different types of changes, that locking change should not be folded
into patch 7.

-- 
Doug Ledford 
  GPG KeyID: 0E572FDD

signature.asc
Description: This is a digitally signed message part

Re: [PATCH v4 06/19] IB/core: Add max_mad_size to ib_device_attr

2015-02-24 Thread Doug Ledford

  props->max_total_mcast_qp_attach = 0;
>   props->max_map_per_fmr = 0;
> + props->max_mad_size = IB_MGMT_MAD_SIZE;
>   /* Owned by Userspace
>* max_qp_wr, max_sge, max_sge_rd, max_cqe */
>   mutex_unlock(&us_ibdev->usdev_lock);
> diff --git a/include/rdma/ib_mad.h b/include/rdma/ib_mad.h
> index 9c89939..5823016 100644
> --- a/include/rdma/ib_mad.h
> +++ b/include/rdma/ib_mad.h
> @@ -135,6 +135,7 @@ enum {
>   IB_MGMT_SA_DATA = 200,
>   IB_MGMT_DEVICE_HDR = 64,
>   IB_MGMT_DEVICE_DATA = 192,
> + IB_MGMT_MAD_SIZE = IB_MGMT_MAD_HDR + IB_MGMT_MAD_DATA,
>  };
>  
>  struct ib_mad_hdr {
> diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
> index 0116e4b..64d3479 100644
> --- a/include/rdma/ib_verbs.h
> +++ b/include/rdma/ib_verbs.h
> @@ -210,6 +210,7 @@ struct ib_device_attr {
>   int sig_prot_cap;
>   int sig_guard_cap;
>   struct ib_odp_caps  odp_caps;
> + u32 max_mad_size;
>  };
>  
>  enum ib_mtu {


-- 
Doug Ledford 
  GPG KeyID: 0E572FDD




signature.asc
Description: This is a digitally signed message part

Re: [PATCH v4 07/19] IB/mad: Convert ib_mad_private allocations from kmem_cache to kmalloc

2015-02-24 Thread Doug Ledford

  kfree(recv);
>   }
>  
>   qp_info->recv_queue.count = 0;
> @@ -3138,45 +3145,25 @@ static struct ib_client mad_client = {
>  
>  static int __init ib_mad_init_module(void)
>  {
> - int ret;
> -
>   mad_recvq_size = min(mad_recvq_size, IB_MAD_QP_MAX_SIZE);
>   mad_recvq_size = max(mad_recvq_size, IB_MAD_QP_MIN_SIZE);
>  
>   mad_sendq_size = min(mad_sendq_size, IB_MAD_QP_MAX_SIZE);
>   mad_sendq_size = max(mad_sendq_size, IB_MAD_QP_MIN_SIZE);
>  
> - ib_mad_cache = kmem_cache_create("ib_mad",
> -      sizeof(struct ib_mad_private),
> -  0,
> -  SLAB_HWCACHE_ALIGN,
> -  NULL);
> - if (!ib_mad_cache) {
> - pr_err("Couldn't create ib_mad cache\n");
> - ret = -ENOMEM;
> - goto error1;
> - }
> -
>   INIT_LIST_HEAD(&ib_mad_port_list);
>  
>   if (ib_register_client(&mad_client)) {
>   pr_err("Couldn't register ib_mad client\n");
> - ret = -EINVAL;
> - goto error2;
> + return(-EINVAL);
>   }
>  
>   return 0;
> -
> -error2:
> - kmem_cache_destroy(ib_mad_cache);
> -error1:
> - return ret;
>  }
>  
>  static void __exit ib_mad_cleanup_module(void)
>  {
>   ib_unregister_client(&mad_client);
> - kmem_cache_destroy(ib_mad_cache);
>  }
>  
>  module_init(ib_mad_init_module);


-- 
Doug Ledford 
  GPG KeyID: 0E572FDD




signature.asc
Description: This is a digitally signed message part

Re: [PATCH v4 14/19] IB/core: Add IB_DEVICE_OPA_MAD_SUPPORT device cap flag

2015-02-25 Thread Doug Ledford

eeded in subsequent patches. It would be better to 
> > see
> > the complete picture so it can be reviewed as a whole.
> 
> Is there something in particular you would like to see?  There are no other 
> patches required in the core modules for verbs applications to function.  The 
> MTU patch only improves verbs performance.
> 
> Ira 
> 
> > 
> > -- Hal
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-rdma" in 
> > the body
> > of a message to majord...@vger.kernel.org More majordomo info at
> > http://vger.kernel.org/majordomo-info.html
> --
> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


-- 
Doug Ledford 
  GPG KeyID: 0E572FDD




signature.asc
Description: This is a digitally signed message part

Re: [PATCH 1/9] IB/ipoib: factor out ah flushing

2015-02-26 Thread Doug Ledford

On Thu, 2015-02-26 at 15:28 +0200, Erez Shitrit wrote:
> On 2/22/2015 2:26 AM, Doug Ledford wrote:
> > Create a an ipoib_flush_ah and ipoib_stop_ah routines to use at
> > appropriate times to flush out all remaining ah entries before we shut
> > the device down.
> >
> > Because neighbors and mcast entries can each have a reference on any
> > given ah, we must make sure to free all of those first before our ah
> > will actually have a 0 refcount and be able to be reaped.
> >
> > This factoring is needed in preparation for having per-device work
> > queues.  The original per-device workqueue code resulted in the following
> > error message:
> >
> > : ib_dealloc_pd failed
> >
> > That error was tracked down to this issue.  With the changes to which
> > workqueues were flushed when, there were no flushes of the per device
> > workqueue after the last ah's were freed, resulting in an attempt to
> > dealloc the pd with outstanding resources still allocated.  This code
> > puts the explicit flushes in the needed places to avoid that problem.
> >
> > Signed-off-by: Doug Ledford 
> > ---
> >   drivers/infiniband/ulp/ipoib/ipoib_ib.c | 46 
> > -
> >   1 file changed, 28 insertions(+), 18 deletions(-)
> >
> > diff --git a/drivers/infiniband/ulp/ipoib/ipoib_ib.c 
> > b/drivers/infiniband/ulp/ipoib/ipoib_ib.c
> > index 72626c34817..cb02466a0eb 100644
> > --- a/drivers/infiniband/ulp/ipoib/ipoib_ib.c
> > +++ b/drivers/infiniband/ulp/ipoib/ipoib_ib.c
> > @@ -659,6 +659,24 @@ void ipoib_reap_ah(struct work_struct *work)
> >round_jiffies_relative(HZ));
> >   }
> >   
> > +static void ipoib_flush_ah(struct net_device *dev, int flush)
> > +{
> > +   struct ipoib_dev_priv *priv = netdev_priv(dev);
> > +
> > +   cancel_delayed_work(&priv->ah_reap_task);
> > +   if (flush)
> > +   flush_workqueue(ipoib_workqueue);
> > +   ipoib_reap_ah(&priv->ah_reap_task.work);
> > +}
> > +
> > +static void ipoib_stop_ah(struct net_device *dev, int flush)
> > +{
> > +   struct ipoib_dev_priv *priv = netdev_priv(dev);
> > +
> > +   set_bit(IPOIB_STOP_REAPER, &priv->flags);
> > +   ipoib_flush_ah(dev, flush);
> > +}
> > +
> >   static void ipoib_ib_tx_timer_func(unsigned long ctx)
> >   {
> > drain_tx_cq((struct net_device *)ctx);
> > @@ -877,24 +895,7 @@ timeout:
> > if (ib_modify_qp(priv->qp, &qp_attr, IB_QP_STATE))
> > ipoib_warn(priv, "Failed to modify QP to RESET state\n");
> >   
> > -   /* Wait for all AHs to be reaped */
> > -   set_bit(IPOIB_STOP_REAPER, &priv->flags);
> > -   cancel_delayed_work(&priv->ah_reap_task);
> > -   if (flush)
> > -   flush_workqueue(ipoib_workqueue);
> > -
> > -   begin = jiffies;
> > -
> > -   while (!list_empty(&priv->dead_ahs)) {
> > -   __ipoib_reap_ah(dev);
> > -
> > -   if (time_after(jiffies, begin + HZ)) {
> > -   ipoib_warn(priv, "timing out; will leak address 
> > handles\n");
> > -   break;
> > -   }
> > -
> > -   msleep(1);
> > -   }
> > +   ipoib_flush_ah(dev, flush);
> >   
> > ib_req_notify_cq(priv->recv_cq, IB_CQ_NEXT_COMP);
> >   
> > @@ -1037,6 +1038,7 @@ static void __ipoib_ib_dev_flush(struct 
> > ipoib_dev_priv *priv,
> > if (level == IPOIB_FLUSH_LIGHT) {
> > ipoib_mark_paths_invalid(dev);
> > ipoib_mcast_dev_flush(dev);
> > +   ipoib_flush_ah(dev, 0);
> 
> Why do you need to call the flush function here?

To remove all of the ah's that were reduced to a 0 refcount by the
previous two functions prior to restarting operations.  When we remove
an ah, it calls ib_destroy_ah which calls all the way down into the low
level driver.  This was to make sure that old, stale data was removed
all the way down to the card level before we started new queries for
paths and ahs.

> I can't see the reason to use the flush not from the stop_ah, meaning 
> without setting the IPOIB_STOP_REAPER, the flush can send twice the same 
> work.

No, it can't.  The ah flush routine does not search through ahs to find
ones to flush.  When you delete neighbors and mcasts, they release their
references to ahs.  When the refcount goes to 0, the put routine puts
the ah on the to-be-deleted ah list.  All the flush does is take that
list and delete the items.  If you run the flush twice,

Re: [PATCH 1/9] IB/ipoib: factor out ah flushing

2015-03-02 Thread Doug Ledford

On Sun, 2015-03-01 at 08:47 +0200, Erez Shitrit wrote:
> On 2/26/2015 6:27 PM, Doug Ledford wrote:
> >
> >>> @@ -1037,6 +1038,7 @@ static void __ipoib_ib_dev_flush(struct 
> >>> ipoib_dev_priv *priv,
> >>>   if (level == IPOIB_FLUSH_LIGHT) {
> >>>   ipoib_mark_paths_invalid(dev);
> >>>   ipoib_mcast_dev_flush(dev);
> >>> + ipoib_flush_ah(dev, 0);
> >> Why do you need to call the flush function here?
> > To remove all of the ah's that were reduced to a 0 refcount by the
> > previous two functions prior to restarting operations.  When we remove
> > an ah, it calls ib_destroy_ah which calls all the way down into the low
> > level driver.  This was to make sure that old, stale data was removed
> > all the way down to the card level before we started new queries for
> > paths and ahs.
> 
> Yes. but it is not needed.

That depends on the card.  For the modern cards (mlx4, mlx5, qib), it
isn't needed but doesn't hurt either.  For older cards (in particular,
mthca), the driver actually frees up card resources at the time of the
call.

> The bug happened when the driver was removed (via modprobe -r etc.), and 
> there were ah's in the dead_ah list, that was fixed by you in the 
> function ipoib_ib_dev_cleanup, the call that you added here is not 
> relevant to the bug (and IMHO is not needed at all)

I never said that this hunk was part of the original bug I saw before.

> So, the the task of cleaning the dead_ah is already there, no need to 
> recall it, it will be called anyway 1 sec at the most from now.
> 
> You can try that, take of that call, no harm or memory leak will happened.

I have no doubt that it will get freed later.  As I said, I never
considered this particular hunk part of that original bug.  But, as I
point out above, there is no harm in it for any hardware, and depending
on hardware it can help to make sure there isn't a shortage of
resources.  Given that fact, I see no reason to remove it.

> >> I can't see the reason to use the flush not from the stop_ah, meaning
> >> without setting the IPOIB_STOP_REAPER, the flush can send twice the same
> >> work.
> > No, it can't.  The ah flush routine does not search through ahs to find
> > ones to flush.  When you delete neighbors and mcasts, they release their
> > references to ahs.  When the refcount goes to 0, the put routine puts
> > the ah on the to-be-deleted ah list.  All the flush does is take that
> > list and delete the items.  If you run the flush twice, the first run
> > deletes all the items on the to-be-deleted list, the second run sees an
> > empty list and does nothing.
> >
> > As for using flush versus stop: the flush function cancels any delayed
> > ah_flush work so that it isn't racing with the normally scheduled
> 
> when you call cancel_delayed_work to work that can schedule itself, it 
> is not help, the work can be at the middle of the run and re-schedule 
> itself...

If it is in the middle of a run and reschedules itself, then it will
schedule itself at precisely the same time we would have anyway, and we
will get flushed properly, so the net result of this particular race is
that we end up doing exactly what we wanted to do anyway.

> 
> > ah_flush, then flushes the workqueue to make sure anything that might
> > result in an ah getting freed is done, then flushes, then schedules a
> > new delayed flush_ah work 1 second later.  So, it does exactly what a
> > flush should do: it removes what there is currently to remove, and in
> > the case of a periodically scheduled garbage collection, schedules a new
> > periodic flush at the maximum interval.
> >
> > It is not appropriate to call stop_ah at this point because it will
> > cancel the delayed work, flush the ahs, then never reschedule the
> > garbage collection.  If we called stop here, we would have to call start
> > later.  But that's not really necessary as the flush cancels the
> > scheduled work and reschedules it for a second later.
> >
> >>>   }
> >>>
> >>>   if (level >= IPOIB_FLUSH_NORMAL)
> >>> @@ -1100,6 +1102,14 @@ void ipoib_ib_dev_cleanup(struct net_device *dev)
> >>>   ipoib_mcast_stop_thread(dev, 1);
> >>>   ipoib_mcast_dev_flush(dev);
> >>>
> >>> + /*
> >>> +  * All of our ah references aren't free until after
> >>> +  * ipoib_mcast_dev_flush(), ipoib_flush_paths, and
> >>> +  * the neighbor garbage collection is stopped and reaped.
> >>> +  * That

Re: [PATCH 7/9] IB/ipoib: fix MCAST_FLAG_BUSY usage

2015-03-02 Thread Doug Ledford

On Sun, 2015-03-01 at 11:31 +0200, Erez Shitrit wrote:
> On 2/22/2015 2:27 AM, Doug Ledford wrote:
> > Commit a9c8ba5884 ("IPoIB: Fix usage of uninitialized multicast
> > objects") added a new flag MCAST_JOIN_STARTED, but was not very strict
> > in how it was used.  We didn't always initialize the completion struct
> > before we set the flag, and we didn't always call complete on the
> > completion struct from all paths that complete it.  And when we did
> > complete it, sometimes we continued to touch the mcast entry after
> > the completion, opening us up to possible use after free issues.
> >
> > This made it less than totally effective, and certainly made its use
> > confusing.  And in the flush function we would use the presence of this
> > flag to signal that we should wait on the completion struct, but we never
> > cleared this flag, ever.
> >
> > In order to make things clearer and aid in resolving the rtnl deadlock
> > bug I've been chasing, I cleaned this up a bit.
> >
> >   1) Remove the MCAST_JOIN_STARTED flag entirely
> >   2) Change MCAST_FLAG_BUSY so it now only means a join is in-flight
> >   3) Test mcast->mc directly to see if we have completed
> >  ib_sa_join_multicast (using IS_ERR_OR_NULL)
> >   4) Make sure that before setting MCAST_FLAG_BUSY we always initialize
> >  the mcast->done completion struct
> >   5) Make sure that before calling complete(&mcast->done), we always clear
> >  the MCAST_FLAG_BUSY bit
> >   6) Take the mcast_mutex before we call ib_sa_multicast_join and also
> >  take the mutex in our join callback.  This forces
> >  ib_sa_multicast_join to return and set mcast->mc before we process
> >  the callback.  This way, our callback can safely clear mcast->mc
> >  if there is an error on the join and we will do the right thing as
> >  a result in mcast_dev_flush.
> >   7) Because we need the mutex to synchronize mcast->mc, we can no
> >  longer call mcast_sendonly_join directly from mcast_send and
> >  instead must add sendonly join processing to the mcast_join_task
> >   8) Make MCAST_RUN mean that we have a working mcast subsystem, not that
> >  we have a running task.  We know when we need to reschedule our
> >  join task thread and don't need a flag to tell us.
> >   9) Add a helper for rescheduling the join task thread
> >
> > A number of different races are resolved with these changes.  These
> > races existed with the old MCAST_FLAG_BUSY usage, the
> > MCAST_JOIN_STARTED flag was an attempt to address them, and while it
> > helped, a determined effort could still trip things up.
> >
> > One race looks something like this:
> >
> > Thread 1 Thread 2
> > ib_sa_join_multicast (as part of running restart mcast task)
> >alloc member
> >call callback
> >   ifconfig ib0 down
> >  wait_for_completion
> >  callback call completes
> >   wait_for_completion in
> >  mcast_dev_flush completes
> >mcast->mc is PTR_ERR_OR_NULL
> >so we skip ib_sa_leave_multicast
> >  return from callback
> >return from ib_sa_join_multicast
> > set mcast->mc = return from ib_sa_multicast
> >
> > We now have a permanently unbalanced join/leave issue that trips up the
> > refcounting in core/multicast.c
> >
> > Another like this:
> >
> > Thread 1   Thread 2 Thread 3
> > ib_sa_multicast_join
> >  ifconfig ib0 down
> > priv->broadcast = NULL
> > join_complete
> > wait_for_completion
> >mcast->mc is not yet set, so don't clear
> > return from ib_sa_join_multicast and set mcast->mc
> >complete
> >return -EAGAIN (making mcast->mc invalid)
> > call ib_sa_multicast_leave
> > on invalid mcast->mc, hang
> > forever
> >
> > By holding the mutex around ib_sa_multicast_join and taking the mutex
> > early in the callback, we force mcast->mc to be valid at the time we
> > run the c

Re: [PATCH 8/9] IB/ipoib: deserialize multicast joins

2015-03-02 Thread Doug Ledford

On Sun, 2015-03-01 at 15:58 +0200, Erez Shitrit wrote:
> On 2/22/2015 2:27 AM, Doug Ledford wrote:
> > Allow the ipoib layer to attempt to join all outstanding multicast
> > groups at once.  The ib_sa layer will serialize multiple attempts to
> > join the same group, but will process attempts to join different groups
> > in parallel.  Take advantage of that.
> >
> > In order to make this happen, change the mcast_join_thread to loop
> > through all needed joins, sending a join request for each one that we
> > still need to join.  There are a few special cases we handle though:
> >
> > 1) Don't attempt to join anything but the broadcast group until the join
> > of the broadcast group has succeeded.
> > 2) No longer restart the join task at the end of completion handling.
> > If we completed successfully, we are done.  The join task now needs kicked
> > either by mcast_send or mcast_restart_task or mcast_start_thread, but
> > should not need started anytime else except when scheduling a backoff
> > attempt to rejoin.
> > 3) No longer use separate join/completion routines for regular and
> > sendonly joins, pass them all through the same routine and just do the
> > right thing based on the SENDONLY join flag.
> > 4) Only try to join a SENDONLY join twice, then drop the packets and
> > quit trying.  We leave the mcast group in the list so that if we get a
> > new packet, all that we have to do is queue up the packet and restart
> > the join task and it will automatically try to join twice and then
> > either send or flush the queue again.
> >
> > Signed-off-by: Doug Ledford 
> > ---
> >   drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 250 
> > -
> >   1 file changed, 82 insertions(+), 168 deletions(-)
> >
> > diff --git a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c 
> > b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
> > index 277e7ac7c4d..c670d9c2cda 100644
> > --- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
> > +++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
> > @@ -307,111 +307,6 @@ static int ipoib_mcast_join_finish(struct ipoib_mcast 
> > *mcast,
> > return 0;
> >   }
> >   
> > -static int
> > -ipoib_mcast_sendonly_join_complete(int status,
> > -  struct ib_sa_multicast *multicast)
> > -{
> > -   struct ipoib_mcast *mcast = multicast->context;
> > -   struct net_device *dev = mcast->dev;
> > -   struct ipoib_dev_priv *priv = netdev_priv(dev);
> > -
> > -   /*
> > -* We have to take the mutex to force mcast_sendonly_join to
> > -* return from ib_sa_multicast_join and set mcast->mc to a
> > -* valid value.  Otherwise we were racing with ourselves in
> > -* that we might fail here, but get a valid return from
> > -* ib_sa_multicast_join after we had cleared mcast->mc here,
> > -* resulting in mis-matched joins and leaves and a deadlock
> > -*/
> > -   mutex_lock(&mcast_mutex);
> > -
> > -   /* We trap for port events ourselves. */
> > -   if (status == -ENETRESET) {
> > -   status = 0;
> > -   goto out;
> > -   }
> > -
> > -   if (!status)
> > -   status = ipoib_mcast_join_finish(mcast, &multicast->rec);
> > -
> > -   if (status) {
> > -   if (mcast->logcount++ < 20)
> > -   ipoib_dbg_mcast(netdev_priv(dev), "sendonly multicast "
> > -   "join failed for %pI6, status %d\n",
> > -   mcast->mcmember.mgid.raw, status);
> > -
> > -   /* Flush out any queued packets */
> > -   netif_tx_lock_bh(dev);
> > -   while (!skb_queue_empty(&mcast->pkt_queue)) {
> > -   ++dev->stats.tx_dropped;
> > -   dev_kfree_skb_any(skb_dequeue(&mcast->pkt_queue));
> > -   }
> > -   netif_tx_unlock_bh(dev);
> > -   __ipoib_mcast_schedule_join_thread(priv, mcast, 1);
> > -   } else {
> > -   mcast->backoff = 1;
> > -   mcast->delay_until = jiffies;
> > -   __ipoib_mcast_schedule_join_thread(priv, NULL, 0);
> > -   }
> > -out:
> > -   clear_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags);
> > -   if (status)
> > -   mcast->mc = NULL;
> > -   complete(&mcast->done);
> > -   mutex_unlock(&mcast_mutex);
> > -   return status;
> > -}
> > -
&g

Re: [PATCH 1/9] IB/ipoib: factor out ah flushing

2015-03-15 Thread Doug Ledford


> On Mar 13, 2015, at 1:39 AM, Or Gerlitz  wrote:
> 
> On Tue, Mar 3, 2015 at 11:59 AM, Erez Shitrit  
> wrote:
>> On 3/2/2015 5:09 PM, Doug Ledford wrote:
>>> 
>>> On Sun, 2015-03-01 at 08:47 +0200, Erez Shitrit wrote:
>>>> 
>>>> On 2/26/2015 6:27 PM, Doug Ledford wrote:
>>>>>>> 
>>>>>>> @@ -1037,6 +1038,7 @@ static void __ipoib_ib_dev_flush(struct
>>>>>>> ipoib_dev_priv *priv,
>>>>>>>if (level == IPOIB_FLUSH_LIGHT) {
>>>>>>>ipoib_mark_paths_invalid(dev);
>>>>>>>ipoib_mcast_dev_flush(dev);
>>>>>>> +   ipoib_flush_ah(dev, 0);
>>>>>> 
>>>>>> Why do you need to call the flush function here?
>>>>> 
>>>>> To remove all of the ah's that were reduced to a 0 refcount by the
>>>>> previous two functions prior to restarting operations.  When we remove
>>>>> an ah, it calls ib_destroy_ah which calls all the way down into the low
>>>>> level driver.  This was to make sure that old, stale data was removed
>>>>> all the way down to the card level before we started new queries for
>>>>> paths and ahs.
>>>> 
>>>> Yes. but it is not needed.
>>> 
>>> That depends on the card.  For the modern cards (mlx4, mlx5, qib), it
>>> isn't needed but doesn't hurt either.  For older cards (in particular,
>>> mthca), the driver actually frees up card resources at the time of the
>>> call.
>> 
>> 
>> Can you please elaborate more here, I took a look in the mthca and didn't
>> see that.
>> anyway, what i don't understand is why you need to do that now, the ah is
>> already in the dead_ah_list so, in at the most 1 sec will be cleared and if
>> the driver goes down your other hunk fixed that.
> 
> Doug, ten days and no response from you... lets finalize the review on
> this series so we have it safely done for 4.1 -- on top of it Erez
> prepares a set of IPoIB fixes from our internal tree and we want that
> for 4.1 too. Please address.

I didn’t have much to say here.  I said that mthca can have card resources 
freed by this call, which is backed up by this code in mthca_ah.c

int mthca_destroy_ah(struct mthca_dev *dev, struct mthca_ah *ah)
{
switch (ah->type) {
case MTHCA_AH_ON_HCA:
mthca_free(&dev->av_table.alloc,
   (ah->avdma - dev->av_table.ddr_av_base) /
   MTHCA_AV_SIZE);
break;


I’m not entirely sure how Erez missed that, but it’s there and it’s what gets 
called when we destroy an ah (depending on the card of course).  So, that 
represents one case where freeing the resources in a non-lazy fashion has a 
direct benefit.  And there is no cited drawback to freeing the resources in a 
non-lazy fashion on a net event, so I don’t see what there is to discuss 
further on the issue.

—
Doug Ledford 
GPG Key ID: 0E572FDD







signature.asc
Description: Message signed with OpenPGP using GPGMail

Re: [PATCH 0/9] IB/ipoib: fixup multicast locking issues

2015-03-15 Thread Doug Ledford

> On Mar 13, 2015, at 1:41 AM, Or Gerlitz  wrote:
> 
> On Sun, Feb 22, 2015 at 2:26 AM, Doug Ledford  wrote:
>> This is the re-ordered, squashed version of my 22 patch set that I
>> posted on Feb 11.  There are a few minor differences between that
>> set and this one.  They are:
> [...]
> 
> Doug, you wrote here a very detailed listing of the changes from
> earlier posts and the testing the patches went through, which is
> excellent. It would be very good if you can also post few liner
> telling the changes done by the series in high level, so we can have
> this test as part of a "merge" commit that says in the kernel logs.

OK.  I would take what I had in the original message and expand upon it then:

This entire patchset was intended to address the issue of ipoib
interfaces being brought up/down in a tight loop, which will hardlock
a standard v3.19 kernel.  It succeeds at resolving that problem.

In order to accomplish this goal, it reworks how the IPOIB_MCAST_FLAG_BUSY flag 
is used.  Conceptually, that flag used to be set when we started a multicast 
join, and would stay set once the join was complete.  This left no way to tell 
if the multicast join was complete or still in flight.  This allowed race 
conditions to develop between joining multicast groups and taking an interface 
down.  A previous attempt to resolve these race conditions used the flag 
IPOIB_MCAST_JOIN_STARTED, but did not succeed at fully resolving the race 
conditions.  This patchset resolves this issue, plus a number of related issues 
discovered while working on this issue.  The primary fix itself is patch 6/9 
and a more complete description of the changes to how the IPOIB_MCAST_FLAG_BUSY 
flag is now used can be found in that commit log.

—
Doug Ledford 
GPG Key ID: 0E572FDD

signature.asc
Description: Message signed with OpenPGP using GPGMail

Re: [PATCH 1/9] IB/ipoib: factor out ah flushing

2015-03-16 Thread Doug Ledford


> On Mar 16, 2015, at 8:24 AM, Erez Shitrit  wrote:
> 
> On 3/15/2015 8:42 PM, Doug Ledford wrote:
>> 
>>> Doug, ten days and no response from you... lets finalize the review on
>>> this series so we have it safely done for 4.1 -- on top of it Erez
>>> prepares a set of IPoIB fixes from our internal tree and we want that
>>> for 4.1 too. Please address.
>> I didn’t have much to say here.  I said that mthca can have card resources 
>> freed by this call, which is backed up by this code in mthca_ah.c
>> 
>> int mthca_destroy_ah(struct mthca_dev *dev, struct mthca_ah *ah)
>> {
>> switch (ah->type) {
>> case MTHCA_AH_ON_HCA:
>> mthca_free(&dev->av_table.alloc,
>>(ah->avdma - dev->av_table.ddr_av_base) /
>>MTHCA_AV_SIZE);
>> break;
>> 
>> 
>> I’m not entirely sure how Erez missed that, but it’s there and it’s what 
>> gets called when we destroy an ah (depending on the card of course).  So, 
>> that represents one case where freeing the resources in a non-lazy fashion 
>> has a direct benefit.  And there is no cited drawback to freeing the 
>> resources in a non-lazy fashion on a net event, so I don’t see what there is 
>> to discuss further on the issue.
> sorry, but i still don't see the connection to the device type.
> It will be deleted/freed with the regular flow, like it does in the rest of 
> the life cycle cases of the ah (in neigh_dtor, path_free, etc.), so why here 
> it should be directly after the event?

Because it’s the right thing to do.  The only reason to do lazy deletion is 
when there is a performance benefit to batching.  There is no performance 
benefit to batching here.  And because on certain hardware the action frees 
resources on the card, which are limited, doing non-lazy deletion can be 
beneficial.  Given that there is no downside to doing the deletions in a 
non-lazy fashion, and that there can be an upside depending on hardware, there 
is no reason to stick with the lazy deletions.

—
Doug Ledford 
GPG Key ID: 0E572FDD







signature.asc
Description: Message signed with OpenPGP using GPGMail

Re: [PATCH 1/9] IB/ipoib: factor out ah flushing

2015-03-16 Thread Doug Ledford


> On Mar 16, 2015, at 8:24 AM, Erez Shitrit  wrote:
> 
> On 3/15/2015 8:42 PM, Doug Ledford wrote:
>> 
>>> Doug, ten days and no response from you... lets finalize the review on
>>> this series so we have it safely done for 4.1 -- on top of it Erez
>>> prepares a set of IPoIB fixes from our internal tree and we want that
>>> for 4.1 too. Please address.
>> I didn’t have much to say here.  I said that mthca can have card resources 
>> freed by this call, which is backed up by this code in mthca_ah.c
>> 
>> int mthca_destroy_ah(struct mthca_dev *dev, struct mthca_ah *ah)
>> {
>>switch (ah->type) {
>>case MTHCA_AH_ON_HCA:
>>mthca_free(&dev->av_table.alloc,
>>   (ah->avdma - dev->av_table.ddr_av_base) /
>>   MTHCA_AV_SIZE);
>>break;
>> 
>> 
>> I’m not entirely sure how Erez missed that, but it’s there and it’s what 
>> gets called when we destroy an ah (depending on the card of course).  So, 
>> that represents one case where freeing the resources in a non-lazy fashion 
>> has a direct benefit.  And there is no cited drawback to freeing the 
>> resources in a non-lazy fashion on a net event, so I don’t see what there is 
>> to discuss further on the issue.
> sorry, but i still don't see the connection to the device type.
> It will be deleted/freed with the regular flow, like it does in the rest of 
> the life cycle cases of the ah (in neigh_dtor, path_free, etc.), so why here 
> it should be directly after the event?

Because it’s the right thing to do.  The only reason to do lazy deletion is 
when there is a performance benefit to batching.  There is no performance 
benefit to batching here.  And because on certain hardware the action frees 
resources on the card, which are limited, doing non-lazy deletion can be 
beneficial.  Given that there is no downside to doing the deletions in a 
non-lazy fashion, and that there can be an upside depending on hardware, there 
is no reason to stick with the lazy deletions.

—
Doug Ledford 
GPG Key ID: 0E572FDD





--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 0/2 RESEND] IB/Verbs: Use helpers to refine the checking on transport and link layer

2015-03-26 Thread Doug Ledford

On Wed, 2015-03-25 at 16:09 +0100, Michael Wang wrote:
> My sincerely apologies for the corrupted mails, and thanks for Dan's kindly
> remind :-)
> 
> There are too many lengthy code to check the transport type of IB device,
> or the link layer type of it's port, this patch set try to use some helper to
> refine and save us some code.
> 
> TODO:
> Currently we inferred from the transport type and link layer type to 
> identify
> the way of management, it will be better if we can directly get the 
> indicator
> from vendor.
> 
> Sean proposed one suggestion:
> https://www.mail-archive.com/linux-rdma@vger.kernel.org/msg23339.html
> 
> It may need a big work to adapt current implementation to utilize
> these flags elegantly.
> 
> Also the performance concern on query_port() need to be addressed, may be
> some new callback like query_mgmt() could works.
> 
> Michael Wang (2):
> [PATCH 1/2] IB/Verbs: Use helpers to check transport and link layer
> [PATCH 2/2] IB/Verbs: Use helpers to check IBoE technology
> 
> ---
>  drivers/infiniband/core/agent.c   |2 -
>  drivers/infiniband/core/cm.c  |2 -
>  drivers/infiniband/core/cma.c |   33 
> --
>  drivers/infiniband/core/mad.c |6 ++---
>  drivers/infiniband/core/multicast.c   |   11 +++---
>  drivers/infiniband/core/sa_query.c|   14 ++--
>  drivers/infiniband/core/ucm.c |3 --
>  drivers/infiniband/core/user_mad.c|2 -
>  drivers/infiniband/core/verbs.c   |5 +---
>  drivers/infiniband/hw/mlx4/ah.c   |2 -
>  drivers/infiniband/hw/mlx4/cq.c   |4 ---
>  drivers/infiniband/hw/mlx4/mad.c  |   14 +++-
>  drivers/infiniband/hw/mlx4/main.c |8 ++-
>  drivers/infiniband/hw/mlx4/mlx4_ib.h  |2 -
>  drivers/infiniband/hw/mlx4/qp.c   |   21 ++-
>  drivers/infiniband/hw/mlx4/sysfs.c|6 +
>  drivers/infiniband/ulp/ipoib/ipoib_main.c |6 ++---
>  include/rdma/ib_verbs.h   |   30 +++
>  net/sunrpc/xprtrdma/svc_rdma_recvfrom.c   |3 --
>  19 files changed, 87 insertions(+), 87 deletions(-)
> 

I think this is a step in the right direction.  However, as I brought up
in a different thread, I think it would be best if we start to clear up
some of the funny things in this space, such as calling iWARP link layer
InfiniBand.

One thing we didn't discuss before is that, even if changing these items
around in user space would break user space, changing them around in the
kernel won't break anything except maybe Lustre (which we can inform the
authors about the change so they can anticipate it and do the right
thing in their code) because we can fix up our return values we pass to
user space with no impact to them as it's not on a hot path.  We can
then emulate the old way of setting all these variables to user space
while fixing them up in kernel space.

So, I would suggest that we fix things up thusly:

enum transport {
TRANSPORT_IB=1,
TRANSPORT_IWARP=2,
TRANSPORT_ROCE=4,
TRANSPORT_OPA=8,
TRANSPORT_USNIC=10,
};

#define HAS_SA(ibdev) ((ibdev)->transport & (TRANSPORT_IB|TRANSPORT_OPA))
#define HAS_JUMBO_SA(ibdev) ((ibdev)->transport & TRANSPORT_OPA))

or possibly

static bool ib_dev_has_sa(struct ibv_device *ibdev)
{
return ibdev->transport & (TRANSPORT_IB | TRANSPORT_OPA);
}

If we do this, then the only thing we have to fix up to preserve ABI
with user space is to make sure that any time we export an ibv_device
struct and any time we import the same, we convert from our new internal
representation to the old representation that user space expects.  And
we also need to make a few changes in the sysfs code to display the
properties as things expect.  But, that would allow us to fix up what I
see as a problem right now, which is that we hide the information we
need to know what sort of device we are working on in two different
fields: the transport and the link layer.  Instead, just use one field
with enough variants that we can store all of the relevant information
we need in that one field.  This has the benefit that any comparisons
that happen in hot paths will now always be a single bitwise comparison
and will no longer need to hit two separate variables for two separate
compares.

-- 
Doug Ledford 
  GPG KeyID: 0E572FDD

signature.asc
Description: This is a digitally signed message part

Re: [PATCH 0/2 RESEND] IB/Verbs: Use helpers to refine the checking on transport and link layer

2015-03-26 Thread Doug Ledford

On Thu, 2015-03-26 at 17:04 +0100, Michael Wang wrote:
> Hi, Doug
> 
> Thanks for the excellent comments :-)
> 
> On 03/26/2015 03:09 PM, Doug Ledford wrote:
> > On Wed, 2015-03-25 at 16:09 +0100, Michael Wang wrote:
> >> [snip]
> >>
> > [snip]
> >
> > So, I would suggest that we fix things up thusly:
> >
> > enum transport {
> > TRANSPORT_IB=1,
> > TRANSPORT_IWARP=2,
> > TRANSPORT_ROCE=4,
> > TRANSPORT_OPA=8,
> > TRANSPORT_USNIC=10,
> > };
> >
> > #define HAS_SA(ibdev) ((ibdev)->transport & (TRANSPORT_IB|TRANSPORT_OPA))
> > #define HAS_JUMBO_SA(ibdev) ((ibdev)->transport & TRANSPORT_OPA))
> >
> > or possibly
> >
> > static bool ib_dev_has_sa(struct ibv_device *ibdev)
> > {
> > return ibdev->transport & (TRANSPORT_IB | TRANSPORT_OPA);
> > }
> 
> The idea sounds interesting, and here my silly questions come :-P
> 
> So are you suggesting that we add a new bitmask 'transport' into 'struct 
> ib_device'
> in kernel, and setup it at very beginning?
> 
> Few more questions here is:
> 1. when to setup? (maybe inside ib_register_device() before doing 
> client->add() callback?)

I don't think "we" can set it up here.  The driver's have to set it up.
After all, the mlx4 driver will have to decide for itself what the port
transport is and tell us, we can't tell it.

> 2. how to setup? (still infer from the transport and link layer like we 
> currently do?)

Find each point in each driver where they currently set the link layer
and transport fields today, and replace that with setting the new
transport bitmask instead.

> 3. in case if a device has ports with different link layer type (please 
> correct
> me if this will never happen), then only one bitmask may not be enough to
> present the transport of all the ports? (maybe create a bitmask per port?)

Correct, a bitmask per port.  And we can remove the existing transport
and link layer elements of the struct and replace it with just the new
transport.  Then, whenever we need to copy a struct to user space, we
have a helper that looks something like this:

static void inline ib_set_user_transport(struct ib_device *ibdev,
 struct user_ibv_device *uibdev)
{
switch(ibdev->port[port]->transport) {
case TRANSPORT_IB:
case TRANSPORT_OPA:
uibdev->port[port]->link_layer = INFINIBAND;
uibdev->port[port]->transport = INFINIBAND;
break;
case TRANSPORT_IWARP:
uibdev->port[port]->link_layer = INFINIBAND;
uibdev->port[port]->transport = IWARP;
break;
case TRANSPORT_ROCE:
uibdev->port[port]->link_layer = ETHERNET;
uibdev->port[port]->transport = INFINIBAND;
break;
case TRANSPORT_USNIC:
uibdev->port[port]->link_layer = ETHERNET;
uibdev->port[port]->transport = ;
break;
default:
pr_err(ibdev, "unknown transport type %x\n",
   ibdev->port[port]->transport);
}
}

That preserves the user space ABI and all user programs keep working,
while we update to an internal representation that makes more sense for
how things have evolved.

> Regards,
> Michael Wang
> 
> >
> > If we do this, then the only thing we have to fix up to preserve ABI
> > with user space is to make sure that any time we export an ibv_device
> > struct and any time we import the same, we convert from our new internal
> > representation to the old representation that user space expects.  And
> > we also need to make a few changes in the sysfs code to display the
> > properties as things expect.  But, that would allow us to fix up what I
> > see as a problem right now, which is that we hide the information we
> > need to know what sort of device we are working on in two different
> > fields: the transport and the link layer.  Instead, just use one field
> > with enough variants that we can store all of the relevant information
> > we need in that one field.  This has the benefit that any comparisons
> > that happen in hot paths will now always be a single bitwise comparison
> > and will no longer need to hit two separate variables for two separate
> > compares.
> >
> >
> >
> 


-- 
Doug Ledford 
  GPG KeyID: 0E572FDD




signature.asc
Description: This is a digitally signed message part

Re: [RFC PATCH 06/11] IB/Verbs: Use management helper has_sa() and cap_sa(), for sa-check

2015-03-27 Thread Doug Ledford

On Fri, 2015-03-27 at 12:47 -0400, ira.weiny wrote:
> On Fri, Mar 27, 2015 at 04:46:11PM +0100, Michael Wang wrote:
> > 
> > Introduce helper has_sa() and cap_sa() to help us check if an IB device
> > or it's port support Subnet Administrator.
> 
> I think these 2 should be combined.  The question is if a port requires the 
> use
> of the SA depending on the network it is connected to.
> 
> Aren't some dual port Mellanox cards capable of doing IB on 1 port and Eth on
> the other?

Yes.

>   Do they show up as 2 devices?

No.  They are two ports of the same device with different link layers.
See here:

hca_id: mlx4_1
transport:  InfiniBand (0)
fw_ver: 2.31.5050
node_guid:  f452:1403:007b:cba0
sys_image_guid: f452:1403:007b:cba3
vendor_id:  0x02c9
vendor_part_id: 4099
hw_ver: 0x0
board_id:   MT_1090120019
phys_port_cnt:  2
port:   1
state:  PORT_ACTIVE (4)
max_mtu:2048 (4)
active_mtu: 2048 (4)
sm_lid: 2
port_lid:   2
port_lmc:   0x01
link_layer: InfiniBand

port:   2
state:  PORT_ACTIVE (4)
max_mtu:4096 (5)
active_mtu: 4096 (5)
sm_lid: 0
port_lid:   0
port_lmc:   0x00
link_layer: Ethernet



> Regardless I think we should define the SA access on a per port basis.

Yes.

> > 
> > Cc: Jason Gunthorpe 
> > Cc: Doug Ledford 
> > Cc: Ira Weiny 
> > Cc: Sean Hefty 
> > Signed-off-by: Michael Wang 
> > ---
> >  drivers/infiniband/core/sa_query.c | 12 ++--
> >  include/rdma/ib_verbs.h| 28 
> >  2 files changed, 34 insertions(+), 6 deletions(-)
> > 
> > diff --git a/drivers/infiniband/core/sa_query.c 
> > b/drivers/infiniband/core/sa_query.c
> > index d95d25f..89c27da 100644
> > --- a/drivers/infiniband/core/sa_query.c
> > +++ b/drivers/infiniband/core/sa_query.c
> > @@ -450,7 +450,7 @@ static void ib_sa_event(struct ib_event_handler 
> > *handler, struct ib_event *event
> >  struct ib_sa_port *port =
> >  &sa_dev->port[event->element.port_num - sa_dev->start_port];
> >  
> > -if (!rdma_port_ll_is_ib(handler->device, port->port_num))
> > +if (!cap_sa(handler->device, port->port_num))
> >  return;
> >  
> >  spin_lock_irqsave(&port->ah_lock, flags);
> > @@ -1154,7 +1154,7 @@ static void ib_sa_add_one(struct ib_device *device)
> >  struct ib_sa_device *sa_dev;
> >  int s, e, i;
> >  
> > -if (!rdma_transport_is_ib(device))
> > +if (!has_sa(device))
> 
> The logic here should be:
> 
> if (no ports of this device need sa access)
>   return;
> 
> So why not eliminate this check and allow the cap_sa(s) to handle the support?
> 
> -- Ira
> 
> >  return;
> >  
> >  if (device->node_type == RDMA_NODE_IB_SWITCH)
> > @@ -1175,7 +1175,7 @@ static void ib_sa_add_one(struct ib_device *device)
> >  
> >  for (i = 0; i <= e - s; ++i) {
> >  spin_lock_init(&sa_dev->port[i].ah_lock);
> > -if (!rdma_port_ll_is_ib(device, i + 1))
> > +if (!cap_sa(device, i + 1))
> >  continue;
> >  
> >  sa_dev->port[i].sm_ah= NULL;
> > @@ -1205,14 +1205,14 @@ static void ib_sa_add_one(struct ib_device *device)
> >  goto err;
> >  
> >  for (i = 0; i <= e - s; ++i)
> > -if (rdma_port_ll_is_ib(device, i + 1))
> > +if (cap_sa(device, i + 1))
> >  update_sm_ah(&sa_dev->port[i].update_task);
> >  
> >  return;
> >  
> >  err:
> >  while (--i >= 0)
> > -if (rdma_port_ll_is_ib(device, i + 1))
> > +if (cap_sa(device, i + 1))
> >  ib_unregister_mad_agent(sa_dev->port[i].agent);
> >  
> >  kfree(sa_dev);
> > @@ -1233,7 +1233,7 @@ static void ib_sa_remove_one(struct ib_device *device)
> >

Re: [PATCH 01/11] IB/Verbs: Use helpers to check transport and link layer

2015-03-30 Thread Doug Ledford

On Fri, 2015-03-27 at 16:40 +0100, Michael Wang wrote:
> We have so much places to check transport type and link layer type, it's now
> make sense to introduce some helpers in order to refine the lengthy code.
> 
> This patch will introduce helpers:
> rdma_transport_is_ib()
> rdma_transport_is_iwarp()
> rdma_port_ll_is_ib()
> rdma_port_ll_is_eth()
> and use them to save some code for us.

If the end result is to do something like I proposed, then why take this
intermediate step that just has to be backed out later?

In other words, if our end goal is to have

rdma_transport_is_ib()
rdma_transport_is_iwarp()
rdma_transport_is_roce()
rdma_transport_is_opa()

Then we should skip doing rdma_port_ll_is_*() as the answers to these
items would be implied by rdma_transport_is_roce() and such.

> Cc: Jason Gunthorpe 
> Cc: Doug Ledford 
> Cc: Ira Weiny 
> Cc: Sean Hefty 
> Signed-off-by: Michael Wang 
> ---
>  drivers/infiniband/core/agent.c   |  2 +-
>  drivers/infiniband/core/cm.c  |  2 +-
>  drivers/infiniband/core/cma.c | 27 ---
>  drivers/infiniband/core/mad.c |  6 +++---
>  drivers/infiniband/core/multicast.c   | 11 ---
>  drivers/infiniband/core/sa_query.c| 14 +++---
>  drivers/infiniband/core/ucm.c |  3 +--
>  drivers/infiniband/core/user_mad.c|  2 +-
>  drivers/infiniband/core/verbs.c   |  5 ++---
>  drivers/infiniband/hw/mlx4/ah.c   |  2 +-
>  drivers/infiniband/hw/mlx4/cq.c   |  4 +---
>  drivers/infiniband/hw/mlx4/mad.c  | 14 --
>  drivers/infiniband/hw/mlx4/main.c |  8 +++-
>  drivers/infiniband/hw/mlx4/mlx4_ib.h  |  2 +-
>  drivers/infiniband/hw/mlx4/qp.c   | 21 +++--
>  drivers/infiniband/hw/mlx4/sysfs.c|  6 ++
>  drivers/infiniband/ulp/ipoib/ipoib_main.c |  6 +++---
>  include/rdma/ib_verbs.h   | 24 
>  net/sunrpc/xprtrdma/svc_rdma_recvfrom.c   |  3 +--
>  19 files changed, 79 insertions(+), 83 deletions(-)
> 
> diff --git a/drivers/infiniband/core/agent.c b/drivers/infiniband/core/agent.c
> index f6d2961..27f1bec 100644
> --- a/drivers/infiniband/core/agent.c
> +++ b/drivers/infiniband/core/agent.c
> @@ -156,7 +156,7 @@ int ib_agent_port_open(struct ib_device *device, int 
> port_num)
>  goto error1;
>  }
>  
> -if (rdma_port_get_link_layer(device, port_num) == 
> IB_LINK_LAYER_INFINIBAND) {
> +if (rdma_port_ll_is_ib(device, port_num)) {
>  /* Obtain send only MAD agent for SMI QP */
>  port_priv->agent[0] = ib_register_mad_agent(device, port_num,
>  IB_QPT_SMI, NULL, 0,
> diff --git a/drivers/infiniband/core/cm.c b/drivers/infiniband/core/cm.c
> index e28a494..2c72e9e 100644
> --- a/drivers/infiniband/core/cm.c
> +++ b/drivers/infiniband/core/cm.c
> @@ -3762,7 +3762,7 @@ static void cm_add_one(struct ib_device *ib_device)
>  int ret;
>  u8 i;
>  
> -if (rdma_node_get_transport(ib_device->node_type) != RDMA_TRANSPORT_IB)
> +if (!rdma_transport_is_ib(ib_device))
>  return;
>  
>  cm_dev = kzalloc(sizeof(*cm_dev) + sizeof(*port) *
> diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c
> index d570030..668e955 100644
> --- a/drivers/infiniband/core/cma.c
> +++ b/drivers/infiniband/core/cma.c
> @@ -375,8 +375,8 @@ static int cma_acquire_dev(struct rdma_id_private 
> *id_priv,
>   listen_id_priv->id.port_num) == dev_ll) {
>  cma_dev = listen_id_priv->cma_dev;
>  port = listen_id_priv->id.port_num;
> -if (rdma_node_get_transport(cma_dev->device->node_type) == 
> RDMA_TRANSPORT_IB &&
> -rdma_port_get_link_layer(cma_dev->device, port) == 
> IB_LINK_LAYER_ETHERNET)
> +if (rdma_transport_is_ib(cma_dev->device) &&
> +rdma_port_ll_is_eth(cma_dev->device, port))
>  ret = ib_find_cached_gid(cma_dev->device, &iboe_gid,
>   &found_port, NULL);
>  else
> @@ -395,8 +395,8 @@ static int cma_acquire_dev(struct rdma_id_private 
> *id_priv,
>  listen_id_priv->id.port_num == port)
>  continue;
>  if (rdma_port_get_link_layer(cma_dev->device, port) == dev_ll) {
> -if (rdma_node_get_transport(cma_dev->device->node_type) == 
> RDMA_TRANSPORT_IB &&
> -rdma_port_get_link_layer(cma_dev->device, port) == 
> IB_LINK_LAYER_ETHERNET)
> +if (rdma_transport_is_ib(cma_dev->devic

Re: [RFC PATCH 07/11] IB/Verbs: Use management helper has_mcast() and, cap_mcast() for mcast-check

2015-03-30 Thread Doug Ledford

On Fri, 2015-03-27 at 16:46 +0100, Michael Wang wrote:
> Introduce helper has_mcast() and cap_mcast() to help us check if an
> IB device or it's port support Multicast.

This probably needs reworded or rethought.  In truth, *all* rdma devices
are multicast capable.  *BUT*, IB/OPA devices require multicast
registration done the IB way (including for sendonly multicast sends),
while Ethernet devices do multicast the Ethernet way.  These tests are
really just for IB specific multicast registration and deregistration.
Call it has_mcast() and cap_mcast() is incorrect.

> Cc: Jason Gunthorpe 
> Cc: Doug Ledford 
> Cc: Ira Weiny 
> Cc: Sean Hefty 
> Signed-off-by: Michael Wang 
> ---
>  drivers/infiniband/core/cma.c   |  2 +-
>  drivers/infiniband/core/multicast.c |  8 
>  include/rdma/ib_verbs.h | 28 
>  3 files changed, 33 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c
> index 276fb76..cbbc85b 100644
> --- a/drivers/infiniband/core/cma.c
> +++ b/drivers/infiniband/core/cma.c
> @@ -3398,7 +3398,7 @@ void rdma_leave_multicast(struct rdma_cm_id *id, struct 
> sockaddr *addr)
>  ib_detach_mcast(id->qp,
>  &mc->multicast.ib->rec.mgid,
>  be16_to_cpu(mc->multicast.ib->rec.mlid));
> -if (rdma_transport_is_ib(id_priv->cma_dev->device)) {
> +if (has_mcast(id_priv->cma_dev->device)) {
>  switch (rdma_port_get_link_layer(id->device, id->port_num)) {
>  case IB_LINK_LAYER_INFINIBAND:
>  ib_sa_free_multicast(mc->multicast.ib);
> diff --git a/drivers/infiniband/core/multicast.c 
> b/drivers/infiniband/core/multicast.c
> index 17573ff..ffeaf27 100644
> --- a/drivers/infiniband/core/multicast.c
> +++ b/drivers/infiniband/core/multicast.c
> @@ -780,7 +780,7 @@ static void mcast_event_handler(struct ib_event_handler 
> *handler,
>  int index;
>  
>  dev = container_of(handler, struct mcast_device, event_handler);
> -if (!rdma_port_ll_is_ib(dev->device, event->element.port_num))
> +if (!cap_mcast(dev->device, event->element.port_num))
>  return;
>  
>  index = event->element.port_num - dev->start_port;
> @@ -807,7 +807,7 @@ static void mcast_add_one(struct ib_device *device)
>  int i;
>  int count = 0;
>  
> -if (!rdma_transport_is_ib(device))
> +if (!has_mcast(device))
>  return;
>  
>  dev = kmalloc(sizeof *dev + device->phys_port_cnt * sizeof *port,
> @@ -823,7 +823,7 @@ static void mcast_add_one(struct ib_device *device)
>  }
>  
>  for (i = 0; i <= dev->end_port - dev->start_port; i++) {
> -if (!rdma_port_ll_is_ib(device, dev->start_port + i))
> +if (!cap_mcast(device, dev->start_port + i))
>  continue;
>  port = &dev->port[i];
>  port->dev = dev;
> @@ -861,7 +861,7 @@ static void mcast_remove_one(struct ib_device *device)
>  flush_workqueue(mcast_wq);
>  
>  for (i = 0; i <= dev->end_port - dev->start_port; i++) {
> -if (rdma_port_ll_is_ib(device, dev->start_port + i)) {
> +if (cap_mcast(device, dev->start_port + i)) {
>  port = &dev->port[i];
>  deref_port(port);
>  wait_for_completion(&port->comp);
> diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
> index fa8ffa3..e796104 100644
> --- a/include/rdma/ib_verbs.h
> +++ b/include/rdma/ib_verbs.h
> @@ -1823,6 +1823,19 @@ static inline int has_sa(struct ib_device *device)
>  }
>  
>  /**
> + * has_mcast - Check if a device support Multicast.
> + *
> + * @device: Device to be checked
> + *
> + * Return 0 when a device has none port to support
> + * Multicast.
> + */
> +static inline int has_mcast(struct ib_device *device)
> +{
> +return rdma_transport_is_ib(device);
> +}
> +
> +/**
>   * cap_smi - Check if the port of device has the capability
>   * Subnet Management Interface.
>   *
> @@ -1852,6 +1865,21 @@ static inline int cap_sa(struct ib_device *device, u8 
> port_num)
>  return rdma_port_ll_is_ib(device, port_num);
>  }
>  
> +/**
> + * cap_mcast - Check if the port of device has the capability
> + * Multicast.
> + *
> + * @device: Device to be checked
> + * @port_num: Port number of the device
> + *
> + * Return 0 when port of the device don't support
> + * Multicast.
> + */
> +static inline int cap_mcast(struct ib_device *device, u8 port_num)
> +{
> +return rdma_port_ll_is_ib(device, port_num);
> +}
> +
>  int ib_query_gid(struct ib_device *device,
>   u8 port_num, int index, union ib_gid *gid);
>  


-- 
Doug Ledford 
  GPG KeyID: 0E572FDD




signature.asc
Description: This is a digitally signed message part

Re: [RFC PATCH 08/11] IB/Verbs: Use management helper has_iwarp() for, iwarp-check

2015-03-30 Thread Doug Ledford

On Fri, 2015-03-27 at 16:47 +0100, Michael Wang wrote:
> Introduce helper has_iwarp() to help us check if an IB device
> support IWARP protocol.

This is a needless redirection.  Just stick with the original
rdma_transport_is_iwarp().

> Cc: Jason Gunthorpe 
> Cc: Doug Ledford 
> Cc: Ira Weiny 
> Cc: Sean Hefty 
> Signed-off-by: Michael Wang 
> ---
>  include/rdma/ib_verbs.h | 13 +
>  net/sunrpc/xprtrdma/svc_rdma_recvfrom.c |  2 +-
>  2 files changed, 14 insertions(+), 1 deletion(-)
> 
> diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
> index e796104..0ef9cd7 100644
> --- a/include/rdma/ib_verbs.h
> +++ b/include/rdma/ib_verbs.h
> @@ -1836,6 +1836,19 @@ static inline int has_mcast(struct ib_device *device)
>  }
>  
>  /**
> + * has_iwarp - Check if a device support IWARP protocol.
> + *
> + * @device: Device to be checked
> + *
> + * Return 0 when a device has none port to support
> + * IWARP protocol.
> + */
> +static inline int has_iwarp(struct ib_device *device)
> +{
> +return rdma_transport_is_iwarp(device);
> +}
> +
> +/**
>   * cap_smi - Check if the port of device has the capability
>   * Subnet Management Interface.
>   *
> diff --git a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c 
> b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
> index a7b5891..48aeb5e 100644
> --- a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
> +++ b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
> @@ -118,7 +118,7 @@ static void rdma_build_arg_xdr(struct svc_rqst *rqstp,
>  
>  static int rdma_read_max_sge(struct svcxprt_rdma *xprt, int sge_count)
>  {
> -if (rdma_transport_is_iwarp(xprt->sc_cm_id->device))
> +if (has_iwarp(xprt->sc_cm_id->device))
>  return 1;
>  else
>  return min_t(int, sge_count, xprt->sc_max_sge);


-- 
Doug Ledford 
  GPG KeyID: 0E572FDD




signature.asc
Description: This is a digitally signed message part

Re: [RFC PATCH 06/11] IB/Verbs: Use management helper has_sa() and cap_sa(), for sa-check

2015-03-30 Thread Doug Ledford

On Fri, 2015-03-27 at 16:46 +0100, Michael Wang wrote:
> Introduce helper has_sa() and cap_sa() to help us check if an IB device
> or it's port support Subnet Administrator.

There's no functional reason to have both rdma_transport_is_ib and
rdma_port_ll_is_ib, just use one.  Then there is also no reason for both
has_sa and cap_sa.  Just use one.

> Cc: Jason Gunthorpe 
> Cc: Doug Ledford 
> Cc: Ira Weiny 
> Cc: Sean Hefty 
> Signed-off-by: Michael Wang 
> ---
>  drivers/infiniband/core/sa_query.c | 12 ++--
>  include/rdma/ib_verbs.h| 28 
>  2 files changed, 34 insertions(+), 6 deletions(-)
> 
> diff --git a/drivers/infiniband/core/sa_query.c 
> b/drivers/infiniband/core/sa_query.c
> index d95d25f..89c27da 100644
> --- a/drivers/infiniband/core/sa_query.c
> +++ b/drivers/infiniband/core/sa_query.c
> @@ -450,7 +450,7 @@ static void ib_sa_event(struct ib_event_handler *handler, 
> struct ib_event *event
>  struct ib_sa_port *port =
>  &sa_dev->port[event->element.port_num - sa_dev->start_port];
>  
> -if (!rdma_port_ll_is_ib(handler->device, port->port_num))
> +if (!cap_sa(handler->device, port->port_num))
>  return;
>  
>  spin_lock_irqsave(&port->ah_lock, flags);
> @@ -1154,7 +1154,7 @@ static void ib_sa_add_one(struct ib_device *device)
>  struct ib_sa_device *sa_dev;
>  int s, e, i;
>  
> -if (!rdma_transport_is_ib(device))
> +if (!has_sa(device))
>  return;
>  
>  if (device->node_type == RDMA_NODE_IB_SWITCH)
> @@ -1175,7 +1175,7 @@ static void ib_sa_add_one(struct ib_device *device)
>  
>  for (i = 0; i <= e - s; ++i) {
>  spin_lock_init(&sa_dev->port[i].ah_lock);
> -if (!rdma_port_ll_is_ib(device, i + 1))
> +if (!cap_sa(device, i + 1))
>  continue;
>  
>  sa_dev->port[i].sm_ah= NULL;
> @@ -1205,14 +1205,14 @@ static void ib_sa_add_one(struct ib_device *device)
>  goto err;
>  
>  for (i = 0; i <= e - s; ++i)
> -if (rdma_port_ll_is_ib(device, i + 1))
> +if (cap_sa(device, i + 1))
>  update_sm_ah(&sa_dev->port[i].update_task);
>  
>  return;
>  
>  err:
>  while (--i >= 0)
> -if (rdma_port_ll_is_ib(device, i + 1))
> +if (cap_sa(device, i + 1))
>  ib_unregister_mad_agent(sa_dev->port[i].agent);
>  
>  kfree(sa_dev);
> @@ -1233,7 +1233,7 @@ static void ib_sa_remove_one(struct ib_device *device)
>  flush_workqueue(ib_wq);
>  
>  for (i = 0; i <= sa_dev->end_port - sa_dev->start_port; ++i) {
> -if (rdma_port_ll_is_ib(device, i + 1)) {
> +if (cap_sa(device, i + 1)) {
>  ib_unregister_mad_agent(sa_dev->port[i].agent);
>  if (sa_dev->port[i].sm_ah)
>  kref_put(&sa_dev->port[i].sm_ah->ref, free_sm_ah);
> diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
> index c0a63f8..fa8ffa3 100644
> --- a/include/rdma/ib_verbs.h
> +++ b/include/rdma/ib_verbs.h
> @@ -1810,6 +1810,19 @@ static inline int has_cm(struct ib_device *device)
>  }
>  
>  /**
> + * has_sa - Check if a device support Subnet Administrator.
> + *
> + * @device: Device to be checked
> + *
> + * Return 0 when a device has none port to support
> + * Subnet Administrator.
> + */
> +static inline int has_sa(struct ib_device *device)
> +{
> +return rdma_transport_is_ib(device);
> +}
> +
> +/**
>   * cap_smi - Check if the port of device has the capability
>   * Subnet Management Interface.
>   *
> @@ -1824,6 +1837,21 @@ static inline int cap_smi(struct ib_device *device, u8 
> port_num)
>  return rdma_port_ll_is_ib(device, port_num);
>  }
>  
> +/**
> + * cap_sa - Check if the port of device has the capability
> + * Subnet Administrator.
> + *
> + * @device: Device to be checked
> + * @port_num: Port number of the device
> + *
> + * Return 0 when port of the device don't support
> + * Subnet Administrator.
> + */
> +static inline int cap_sa(struct ib_device *device, u8 port_num)
> +{
> +return rdma_port_ll_is_ib(device, port_num);
> +}
> +
>  int ib_query_gid(struct ib_device *device,
>   u8 port_num, int index, union ib_gid *gid);
>  


-- 
Doug Ledford 
  GPG KeyID: 0E572FDD




signature.asc
Description: This is a digitally signed message part

1 2 3 4 5 6 7 8 >

1 - 100 of 778 matches

Mail list logo