RE: [PATCHv3 TRIVIAL] IB/core: Documentation fix to ib_mad_snoop_handler in the MAD header file

2016-01-05 Thread Hefty, Sean
> In ib_mad.h, ib_mad_snoop_handler uses send_buf rather than send_wr

The MAD snooping should be removed from the mad stack.  There are no in tree 
users and the only attempt at adding one was rejected.


RE: [PATCHv3 TRIVIAL] IB/core: Documentation fix to ib_mad_snoop_handler in the MAD header file

2016-01-05 Thread Hefty, Sean
> > There are no in tree users and the only attempt at adding one was
> rejected.
> 
> There are no in tree users of this but there is your madeye tool (which
> is out of tree). This is still a useful debug tool for MADs and there
> are people who still use that.

It's an out of tree tool.  Maintain the snooping hooks into the mad layer out 
of tree as well.


RE: [PATCH 22/37] IB/rdmavt: Add queue pair data structure to rdmavt

2015-12-10 Thread Hefty, Sean
> >> +struct rvt_rwqe {
> >> +  u64 wr_id;
> >> +  u8 num_sge;
> >> +  struct ib_sge sg_list[0];
> >> +};
> >> +
> >> +/*
> >> + * This structure is used to contain the head pointer, tail pointer,
> >> + * and receive work queue entries as a single memory allocation so
> >> + * it can be mmap'ed into user space.
> >> + * Note that the wq array elements are variable size so you can't
> >> + * just index into the array to get the N'th element;
> >> + * use get_rwqe_ptr() instead.
> >
> >Can you add/use an entry_size field?
> 
> I think we could work something like that, however what we have in
> qib/hfi1
> also works.  Any reason we really should change this?

I did not check to see what the drivers do.  Using entry_size is 
straightforward, may provide the best performance, and can be done in common 
code, versus adding callbacks to all users. 
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [PATCH 03/37] IB/rdmavt: Add protection domain to rdmavt.

2015-12-10 Thread Hefty, Sean
> > How about I reword the comment to say that while we could continue
> > allocating PDs being only constrained by system resources, the spec says
> > there is a limit so we enforce that?
> 
> I'm ok with that, Sean?

That's fine
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [PATCH V2 1/1] [PATCH 1/1] Ibacm: default pkey for partitioned fabrics

2015-12-10 Thread Hefty, Sean
Thanks - applied
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [PATCH 1/1] Ibacm: default pkey for partitioned fabrics

2015-12-09 Thread Hefty, Sean
> > I mean that the user needs to investigate why the fabric is not working
> out of box.
> 
> My point is that an educated admin should _know_ to configure in these
> cases and that debug is only when things are broken not by default in
> these more complex cases. This means the limitations of the out of box
> approach needs to be explained in the docs.

When IP addresses are used, the corresponding pkey is used.  The issue this 
patch is addressing is the mapping of 'hostnames' to pkeys, as show in this 
ibacm_addr.cfg example:

#Name  device port pkey
cst-lin0   mlx4_0 1default
cst-lin0-1 mlx4_0 1default
cst-lin0-2 mlx4_0 2default

Currently, 'default' is hard-coded to a pkey of 0x.  The intent is to 
define a better default value.  Kaike has suggested this be the new default:

1. Find the first non-management full-member pkey;
2. If it fails, find pkey 0x;
3. If pkey 0x is not available, use the first pkey.

Is there better alternative for what default should be?  Jason was suggesting 
use pkey[0], which seems less robust in theory, but is simple and may cover the 
vast majority of real use cases.

The fewer cases where manual configuration is necessary, the fewer emails I 
receive, and the better off I am.  :)

- Sean


RE: [PATCH 1/1] Ibacm: default pkey for partitioned fabrics

2015-12-09 Thread Hefty, Sean
> pkey[0] at least has the logic that the admin will configure things so
> that the default ipoib device reaches the broadest audiance makes the
> most sense to me. That is what most sites I've seen want to do.

Kaike, will pkey[0] work in the configurations that you're targeting with this 
change?

This seems like a very simple solution that's better than what we have now. 

> I'm not quite sure what the acm algorithm is, but can't it just figure
> out the pkey from the IP routing? Ie if you have an IP address to
> resolve a few netlink queries will tell you what pkey to use, and that
> is where the acm multicast should go?

When an IP address is used, it uses the correct pkey (based on routing data).  
I mentioned this in a separate email, but this addresses the case when 
hostnames are used and ipoib may not be involved.

- Sean
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [PATCH 1/1] Ibacm: default pkey for partitioned fabrics

2015-12-09 Thread Hefty, Sean
> >> Example: Compute nodes are assigned pkeys 0x8000 and 0x7fff.  A node
> >> running the job scheduler has pkeys 0x and 0x8000 (maybe it's
> >> also the backup SA).  Ibacm would need to select pkey 0x8000 for
> >> communication.
> >
> > I've also seen the reverse, eg 0x is used for default ipoib
> > communication and 0x8001 is assigned to only some nodes as a child
> > vlan.
> 
> That's what I use internally in our test lab.
> 
> > Choosing 0x8001 in that case won't work either.
> 
> Nope.  The suggestion here would break our setup.  Well, *would* being
> the operative word where what I mean is would if we used ibacm.  But
> that's a story for another email...

In this case, Kaike, please change the default to just be pkey[0].

- Sean
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [PATCH 1/1] Ibacm: default pkey for partitioned fabrics

2015-12-09 Thread Hefty, Sean
> > 1. Find the first non-management full-member pkey;

I.e. a pkey with the high-order bit set that is not 0x

> > 2. If it fails, find pkey 0x;
> 
> Order of 1 and 2 depends on use models for full default partition and
> other partitions. Reversing 1 and 2 (full default partition first) would
> handle the most common use models and handles Jason's case.
> 
> The only common case that I'm aware of where that might fall down is in
> the virtualized case. I'm not sure what policy is best there and would
> need to think about that scenario some more (and there is more
> fundamental issue with ACM in those environments).

Offline I had asked about reversing 1 and 2.  The reasoning given was that 
ibacm could be running on a 'management' node, but needed to communicate with 
compute nodes.
 
IOW, the occurrence of a 'non-management, full member' pkey in the pkey table 
strongly indicates that a node is being used in a secure environment, and ibacm 
should prefer using that pkey over 0x.

Example: Compute nodes are assigned pkeys 0x8000 and 0x7fff.  A node running 
the job scheduler has pkeys 0x and 0x8000 (maybe it's also the backup SA).  
Ibacm would need to select pkey 0x8000 for communication.

This seems like a reasonable argument to me.

- Sean


RE: [PATCH 03/37] IB/rdmavt: Add protection domain to rdmavt.

2015-12-08 Thread Hefty, Sean
> > > +struct ib_pd *rvt_alloc_pd(struct ib_device *ibdev,
> > > +struct ib_ucontext *context,
> > > +struct ib_udata *udata)
> > > +{
> > > + struct rvt_dev_info *dev = ib_to_rvt(ibdev);
> > > + struct rvt_pd *pd;
> > > + struct ib_pd *ret;
> > > +
> > > + pd = kmalloc(sizeof(*pd), GFP_KERNEL);
> > > + if (!pd) {
> > > + ret = ERR_PTR(-ENOMEM);
> > > + goto bail;
> > > + }
> > > + /*
> > > +  * This is actually totally arbitrary.  Some correctness tests
> > > +  * assume there's a maximum number of PDs that can be allocated.
> > > +  * We don't actually have this limit, but we fail the test if
> > > +  * we allow allocations of more than we report for this value.
> > > +  */
> >
> > Why not trap this in user space, rather than forcing the kernel to
> support some test program?
> >
> 
> What do you mean "trap this in user space"?  This code is not supporting
> any
> particular test program.
> 
> If users try and allocate more protection domains then are advertised then
> an
> error should be returned.
> 
> Perhaps the comment should be removed if it is confusing?

There is no theoretical limit on the number of PDs, or likely most other 
resources.  So why define and enforce an arbitrary limit?  The justification 
given was that some test program would fail.  If there's a concern about that 
test program passing, then enforce the limit in user space.

I didn't find the comment confusing.  Just the choice to let a test app drive 
the design.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [PATCH 1/1] Ibacm: default pkey for partitioned fabrics

2015-12-08 Thread Hefty, Sean
> > +* Determine the default pkey for parsing address file as well.
> > +*   order of preference: first full-member non-management pkey,
> > +*   0x, first pkey.
> > +*/
> 
> This really should just be the 0 index pkey, which exactly matches how
> IPoIB determines the default pkey, which is what matters when talking
> rdmacm..

Ibacm currently hard-codes the 'default' pkey to 0x for ibacm <-> ibacm 
communication.  If there's no disagreement to switching to pkey[0], I'm fine 
with that.  I did not realize that ipoib uses this same default.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [PATCH 01/37] IB/rdmavt: Create module framework and handle driver registration

2015-12-08 Thread Hefty, Sean
> >> +#define RDMAVT_DRIVER_VERSION "0.1"
> >Do we really need driver version?
> 
> Based on these patches thus far. Probably not. We do still have a bit more
> development to go here, I added it more as a just in case but am not
> steadfast in keeping it.
> 
> What's the general rule here? Should we not assign a version if we aren't
> explicitly doing anything with it yet, but may at some point?

A driver version doesn't make sense.  Use the kernel version. 
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [PATCH 03/37] IB/rdmavt: Add protection domain to rdmavt.

2015-12-07 Thread Hefty, Sean
> +struct ib_pd *rvt_alloc_pd(struct ib_device *ibdev,
> +struct ib_ucontext *context,
> +struct ib_udata *udata)
> +{
> + struct rvt_dev_info *dev = ib_to_rvt(ibdev);
> + struct rvt_pd *pd;
> + struct ib_pd *ret;
> +
> + pd = kmalloc(sizeof(*pd), GFP_KERNEL);
> + if (!pd) {
> + ret = ERR_PTR(-ENOMEM);
> + goto bail;
> + }
> + /*
> +  * This is actually totally arbitrary.  Some correctness tests
> +  * assume there's a maximum number of PDs that can be allocated.
> +  * We don't actually have this limit, but we fail the test if
> +  * we allow allocations of more than we report for this value.
> +  */

Why not trap this in user space, rather than forcing the kernel to support some 
test program?

N�r��yb�X��ǧv�^�)޺{.n�+{��ٚ�{ay�ʇڙ�,j��f���h���z��w���
���j:+v���w�j�mzZ+�ݢj"��!�i

RE: [PATCH 00/37] Add rdma verbs transport library

2015-12-07 Thread Hefty, Sean
> The following series implements rdmavt. This is the rdma verbs transport
> software library

This is going to be an annoying request, but I would rather see this named the 
IB transport library, unless this framework is usable over iWarp as well.
N�r��yb�X��ǧv�^�)޺{.n�+{��ٚ�{ay�ʇڙ�,j��f���h���z��w���
���j:+v���w�j�mzZ+�ݢj"��!�i

RE: [PATCH 06/37] IB/rdmavt: Add query and modify device stubs

2015-12-07 Thread Hefty, Sean
> +static int rvt_query_device(struct ib_device *ibdev,
> + struct ib_device_attr *props,
> + struct ib_udata *uhw)
> +{
> + /*
> +  * Return rvt_dev_info.props contents
> +  */
> + return -EINVAL;

ENOSYS on all of the function templates.  This and subsequent patches.

N�r��yb�X��ǧv�^�)޺{.n�+{��ٚ�{ay�ʇڙ�,j��f���h���z��w���
���j:+v���w�j�mzZ+�ݢj"��!�i

RE: [PATCH 01/37] IB/rdmavt: Create module framework and handle driver registration

2015-12-07 Thread Hefty, Sean
> @@ -0,0 +1,89 @@
> +/*
> + *
> + * This file is provided under a dual BSD/GPLv2 license.  When using or
> + * redistributing this file, you may do so under either license.
> + *
> + * GPL LICENSE SUMMARY
> + *
> + * Copyright(c) 2015 Intel Corporation.

I'm guessing that the GPL license text does not have an Intel copyright.

> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of version 2 of the GNU General Public License as
> + * published by the Free Software Foundation.
> + *
> + * This program is distributed in the hope that it will be useful, but
> + * WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> + * General Public License for more details.
> + *
> + * BSD LICENSE
> + *
> + * Copyright(c) 2015 Intel Corporation.

Or the BSD license either.

Please move the copyright notices to the top of the comment block in this and 
other files.

> + *
> + * Redistribution and use in source and binary forms, with or without
> + * modification, are permitted provided that the following conditions
> + * are met:
> + *
> + *  - Redistributions of source code must retain the above copyright
> + *notice, this list of conditions and the following disclaimer.
> + *  - Redistributions in binary form must reproduce the above copyright
> + *notice, this list of conditions and the following disclaimer in
> + *the documentation and/or other materials provided with the
> + *distribution.
> + *  - Neither the name of Intel Corporation nor the names of its
> + *contributors may be used to endorse or promote products derived
> + *from this software without specific prior written permission.
> + *
> + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
> + * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
> + * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
> + * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
> + * OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
> + * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
> + * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
> + * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
> + * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
> + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
> + * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
> + *
> + */
N�r��yb�X��ǧv�^�)޺{.n�+{��ٚ�{ay�ʇڙ�,j��f���h���z��w���
���j:+v���w�j�mzZ+�ݢj"��!�i

RE: [PATCH 05/37] IB/rdmavt: Macroize override checks during driver registration

2015-12-07 Thread Hefty, Sean
> +/*
> + * Check driver override. If driver passes a value use it, otherwise we
> use our
> + * own value.
> + */
> +#define CDR(rdi, x) \
> + rdi->ibdev.x = rdi->ibdev.x ? : rvt_ ##x

This is an extremely obscure name.

No one will be able to look at this:

> + CDR(rdi, alloc_pd);
> + CDR(rdi, dealloc_pd);

And have a clue what is happening without searching for the macro.


RE: [PATCH 21/37] IB/rdmavt: Move MR datastructures into rvt

2015-12-07 Thread Hefty, Sean
>  include/rdma/rdma_vt.h |   53
> 
>  1 files changed, 53 insertions(+), 0 deletions(-)
> 
> diff --git a/include/rdma/rdma_vt.h b/include/rdma/rdma_vt.h
> index 5112dd7..39a0737 100644
> --- a/include/rdma/rdma_vt.h
> +++ b/include/rdma/rdma_vt.h
> @@ -59,6 +59,56 @@
>  #include "ib_verbs.h"
> 
>  /*
> + * For Memory Regions. This stuff should probably be moved into
> rdmavt/mr.h once
> + * drivers no longer need access to the MR directly.
> + */
> +
> +/*
> + * A segment is a linear region of low physical memory.
> + * Used by the verbs layer.
> + */
> +struct rvt_seg {
> + void *vaddr;
> + size_t length;
> +};
> +
> +/* The number of rvt_segs that fit in a page. */
> +#define RVT_SEGSZ (PAGE_SIZE / sizeof(struct rvt_seg))
> +
> +struct rvt_segarray {
> + struct rvt_seg segs[RVT_SEGSZ];
> +};
> +
> +struct rvt_mregion {
> + struct ib_pd *pd;   /* shares refcnt of ibmr.pd */
> + u64 user_base;  /* User's address for this region */
> + u64 iova;   /* IB start address of this region */
> + size_t length;
> + u32 lkey;
> + u32 offset; /* offset (bytes) to start of region */
> + int access_flags;
> + u32 max_segs;   /* number of rvt_segs in all the arrays */
> + u32 mapsz;  /* size of the map array */
> + u8  page_shift; /* 0 - non unform/non powerof2 sizes */
> + u8  lkey_published; /* in global table */

Without looking ahead in the patch series, won't the access_flags indicate this?

> + struct completion comp; /* complete when refcount goes to zero */
> + atomic_t refcount;
> + struct rvt_segarray *map[0];/* the segments */
> +};
> +
> +#define RVT_MAX_LKEY_TABLE_BITS 23
> +
> +struct rvt_lkey_table {
> + spinlock_t lock; /* protect changes in this struct */
> + u32 next;   /* next unused index (speeds search) */
> + u32 gen;/* generation count */
> + u32 max;/* size of the table */
> + struct rvt_mregion __rcu **table;
> +};
> +
> +/* End Memmory Region */
> +
> +/*
>   * Things that are driver specific, module parameters in hfi1 and qib
>   */
>  struct rvt_driver_params {
> @@ -125,6 +175,9 @@ struct rvt_dev_info {
>   /* Driver specific properties */
>   struct rvt_driver_params dparms;
> 
> + struct rvt_mregion __rcu *dma_mr;
> + struct rvt_lkey_table lk_table;

Go crazy here and add the 'ey' into the field name.



RE: [PATCH 22/37] IB/rdmavt: Add queue pair data structure to rdmavt

2015-12-07 Thread Hefty, Sean
>  drivers/infiniband/sw/rdmavt/qp.h |5 -
>  include/rdma/rdma_vt.h|  233
> +
>  2 files changed, 233 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/infiniband/sw/rdmavt/qp.h
> b/drivers/infiniband/sw/rdmavt/qp.h
> index 4e4709f..c80d326 100644
> --- a/drivers/infiniband/sw/rdmavt/qp.h
> +++ b/drivers/infiniband/sw/rdmavt/qp.h
> @@ -53,11 +53,6 @@
> 
>  #include 
> 
> -struct rvt_qp {
> - struct ib_qp *ibqp;
> - /* Other stuff */
> -};
> -
>  struct ib_qp *rvt_create_qp(struct ib_pd *ibpd,
>   struct ib_qp_init_attr *init_attr,
>   struct ib_udata *udata);
> diff --git a/include/rdma/rdma_vt.h b/include/rdma/rdma_vt.h
> index 39a0737..8d3a41a 100644
> --- a/include/rdma/rdma_vt.h
> +++ b/include/rdma/rdma_vt.h
> @@ -109,6 +109,239 @@ struct rvt_lkey_table {
>  /* End Memmory Region */
> 
>  /*
> + * Things needed for the Queue Pair definition. Like the MR stuff above
> the
> + * following should probably get moved to qp.h once drivers stop trying
> to make
> + * and manipulate thier own QPs. For the few instnaces where a driver may
> need
> + * to look into a queue pair there should be a pointer to a driver
> priavte data
> + * structure that they can look at.
> + */
> +
> +/*
> + * These keep track of the copy progress within a memory region.
> + * Used by the verbs layer.
> + */
> +struct rvt_sge {
> + struct rvt_mregion *mr;
> + void *vaddr;/* kernel virtual address of segment */
> + u32 sge_length; /* length of the SGE */
> + u32 length; /* remaining length of the segment */
> + u16 m;  /* current index: mr->map[m] */

Rename to cur_map?

> + u16 n;  /* current index: mr->map[m]->segs[n] */

cur_seg?

> +};
> +
> +/*
> + * Send work request queue entry.
> + * The size of the sg_list is determined when the QP is created and
> stored
> + * in qp->s_max_sge.
> + */
> +struct rvt_swqe {
> + union {
> + struct ib_send_wr wr;   /* don't use wr.sg_list */
> + struct ib_ud_wr ud_wr;
> + struct ib_reg_wr reg_wr;
> + struct ib_rdma_wr rdma_wr;
> + struct ib_atomic_wr atomic_wr;
> + };
> + u32 psn;/* first packet sequence number */
> + u32 lpsn;   /* last packet sequence number */
> + u32 ssn;/* send sequence number */
> + u32 length; /* total length of data in sg_list */
> + struct rvt_sge sg_list[0];
> +};
> +
> +/*
> + * Receive work request queue entry.
> + * The size of the sg_list is determined when the QP (or SRQ) is created
> + * and stored in qp->r_rq.max_sge (or srq->rq.max_sge).
> + */
> +struct rvt_rwqe {
> + u64 wr_id;
> + u8 num_sge;
> + struct ib_sge sg_list[0];
> +};
> +
> +/*
> + * This structure is used to contain the head pointer, tail pointer,
> + * and receive work queue entries as a single memory allocation so
> + * it can be mmap'ed into user space.
> + * Note that the wq array elements are variable size so you can't
> + * just index into the array to get the N'th element;
> + * use get_rwqe_ptr() instead.

Can you add/use an entry_size field?


> + */
> +struct rvt_rwq {
> + u32 head;   /* new work requests posted to the head */
> + u32 tail;   /* receives pull requests from here. */
> + struct rvt_rwqe wq[0];
> +};
> +
> +struct rvt_rq {
> + struct rvt_rwq *wq;
> + u32 size;   /* size of RWQE array */
> + u8 max_sge;
> + /* protect changes in this struct */
> + spinlock_t lock cacheline_aligned_in_smp;
> +};
N�r��yb�X��ǧv�^�)޺{.n�+{��ٚ�{ay�ʇڙ�,j��f���h���z��w���
���j:+v���w�j�mzZ+�ݢj"��!�i

RE: [PATCH] IB/mad: In validate_mad, validate CM method and attribute

2015-11-12 Thread Hefty, Sean
> > +   /* CM attributes other than ClassPortInfo only use Send method
> */
> > +   if (mad_hdr->mgmt_class == IB_MGMT_CLASS_CM) {
> > +   if (mad_hdr->attr_id != IB_MGMT_CLASSPORTINFO_ATTR_ID) {
> > +   if (mad_hdr->method != IB_MGMT_METHOD_SEND)
> > +   goto out;
> > +   } else if (mad_hdr->method != IB_MGMT_METHOD_GET_RESP)
> > +   goto out;
> > +   }
> 
> Doesn't this invalidate a CM Get(ClassPortInfo) mad?

I believe this does.  I think you could remove the else if clause and let the 
received MAD get passed to the CM.  It would be dropped there as unsupported.  
The net result is likely the same.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [PATCH] IB/mad: In validate_mad, validate CM method and attribute

2015-11-12 Thread Hefty, Sean
> IIRC responding to Get(CPI) is mandatory?

Maybe the drivers are responding?  I don't believe that the CM does.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [PATCH] IB/mad: In validate_mad, validate CM method and attribute

2015-11-12 Thread Hefty, Sean
> Receipt of CM MAD with response method for other than ClassPortInfo
> attribute is invalid.
> 
> CM attributes other than ClassPortInfo use send method only
> and GetResp is valid for ClassPortInfo attribute.
> Note also that the CM ClassPortInfo is not currently supported.
> 
> The SRP initiator does not maintain a timeout policy for CM connect
> requests relies on the CM layer to do that. The result was that
> the SRP initiator hung as the connect request never completed.
> 
> A new SRP target has been observed to respond to Send CM REQ
> with GetResp of CM REQ with bad status. This is non conformant
> with IBA spec but exposes a vulnerability in the current MAD/CM
> code which will respond to the incoming GetResp of CM REQ as if
> it was a valid incoming Send of CM REQ rather than tossing
> this on the floor. It also causes the MAD layer not to
> retransmit the original REQ even though it has not received a REP.
> 
> Reviewed-by: Sagi Grimberg 
> Signed-off-by: Hal Rosenstock 

Reviewed-by: Sean Hefty 


RE: [PATCH libibcm] cmpost.c: Handle ibv_get_device_list returning no IB devices in init()

2015-11-05 Thread Hefty, Sean
Merged - thanks.

This is the first patch against the libibcm in over 4 years.  Is there a reason 
why this is being used instead of the librdmacm?  I ask because I assumed that 
the libibcm was basically deprecated.  The last release was over 6 years ago.

- Sean


RE: [PATCH rdma-RC] IB/cm: Fix rb-tree duplicate free and use-after-free

2015-10-26 Thread Hefty, Sean
> > Sean, I need to close on this patch.  What is your position after
> > Matan's explanation?
> >
> 
> Absent an objection from Sean, I've pulled this in.  A use after free
> bug is a pretty serious issue, and you've listed an error flow that
> triggers it.  The only thing bugging me is that this code is 10+ years
> old and this didn't show up until now, which makes me think that some
> recent change is the cause of this.  I've made note of that fact in my
> tag commit and I think this warrants further examination in the next
> kernel cycle.  But since we are so close to out of time on 4.3, I deemed
> it better to fix the use after free issue, even if it isn't necessarily
> the perfect fix, than leave that hanging about.

I was out last week.  I think one of the reasons that this bug hasn't shown up 
is that very few apps use UD QPs, and those that do likely exchange QP 
information using some other out of band mechanism.

- Sean
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [PATCH rdma-RC] IB/cm: Fix sleeping while atomic when creating AH from WC

2015-10-20 Thread Hefty, Sean
> Today, cm assumes paths are reversible primary_path->reversible = 1.

I can't quickly find a link, but I believe all CM MADs are reversible, per the 
spec.


RE: [PATCH rdma-RC] IB/cm: Fix sleeping while atomic when creating AH from WC

2015-10-15 Thread Hefty, Sean
> >> ib_create_ah_from_wc needs to resolve the DMAC in order to create the
> >> AH (this may result sending an ARP and waiting for response).
> >> CM uses this function (which is now sleepable).
> >
> > This is a significant change to the CM.  The CM calls are invoked
> assuming that they return relatively quickly.  They're invoked from
> callbacks and internally.  Having the calls now wait for an ARP response
> requires that this be re-architected, so the calling thread doesn't go out
> to lunch for several seconds.
> 
> Agree - this is a significant change, but it was done a long time ago
> (at v4.3 if I recall). When we need to send a message we need to

We're at 4.3-rc5?

> figure out the destination MAC. Even the passive side needs to do that
> as some vendors don't report the source MAC of the packet in their wc.
> Even if they did, since IP based addressing is rout-able by its
> nature, it should follow the networking stack rules. Some crazy
> configurations could force sending responses to packets that came from
> router1 to router2 - so we have no choice than resolving the DMAC at
> every side.

Ib_create_ah_from_wc is broken.   It is now an asynchronous operation, only the 
call itself was left as synchronous.  We can't block kernel threads for a 
minute, or however long ARP takes to resolve.  The call itself must change to 
be async, and all users of it updated to allocate some request, queue it, and 
handle all race conditions that result -- such as state changes or destruction 
of the work that caused the request to be initiated.


RE: [PATCH rdma-RC] IB/cm: Fix sleeping while atomic when creating AH from WC

2015-10-13 Thread Hefty, Sean
> On Mon, Oct 12, 2015 at 7:42 PM, Hefty, Sean <sean.he...@intel.com> wrote:
> >> When IP based addressing was introduced, ib_create_ah_from_wc was
> >> changed in order to support a suitable AH. Since this AH should
> >> now contains the DMAC (which isn't a simple derivative of the GID).
> >> In order to find the DMAC, an ARP should sometime be sent. This ARP
> >> is a sleeping context.
> >
> > Wait - are you saying that the CM may now be waiting for an ARP response
> before it can send a message?
> >
> 
> ib_create_ah_from_wc needs to resolve the DMAC in order to create the
> AH (this may result sending an ARP and waiting for response).
> CM uses this function (which is now sleepable).

This is a significant change to the CM.  The CM calls are invoked assuming that 
they return relatively quickly.  They're invoked from callbacks and internally. 
 Having the calls now wait for an ARP response requires that this be 
re-architected, so the calling thread doesn't go out to lunch for several 
seconds.

- Sean


RE: [PATCH rdma-RC] IB/cm: Fix sleeping while atomic when creating AH from WC

2015-10-12 Thread Hefty, Sean
> When IP based addressing was introduced, ib_create_ah_from_wc was
> changed in order to support a suitable AH. Since this AH should
> now contains the DMAC (which isn't a simple derivative of the GID).
> In order to find the DMAC, an ARP should sometime be sent. This ARP
> is a sleeping context.

Wait - are you saying that the CM may now be waiting for an ARP response before 
it can send a message?

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [PATCH rdma-RC] IB/cm: Fix rb-tree duplicate free and use-after-free

2015-10-12 Thread Hefty, Sean
> ib_send_cm_sidr_rep could sometimes erase the node from the sidr
> (depending on errors in the process). Since ib_send_cm_sidr_rep is
> called both from cm_sidr_req_handler and cm_destroy_id, cm_id_priv

This should clarify that it is the app calling from the callback, and not a 
direct call from the cm_sidr_req_handler.

> could be either erased from the rb_tree twice or not erased at all.

In an error case, I can see why it would be left in the rbtree, but I don't see 
how it can be removed twice.


> Fixing that by making sure it's erased only once before freeing
> cm_id_priv.
> 
> Fixes: a977049dacde ('[PATCH] IB: Add the kernel CM implementation')
> Signed-off-by: Doron Tsur 
> Signed-off-by: Matan Barak 
> ---
> 
> Hi Doug,
> This patch fixes a bug in the CM. In some flow, rb-tree could be
> freed twice or used after it was freed. This bug was picked by
> our regression tests and this fix was verified.
> 
> Thanks,
> Doron and Matan
> 
>  drivers/infiniband/core/cm.c | 10 +-
>  1 file changed, 9 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/infiniband/core/cm.c b/drivers/infiniband/core/cm.c
> index f5cf1c4..56ff0f3 100644
> --- a/drivers/infiniband/core/cm.c
> +++ b/drivers/infiniband/core/cm.c
> @@ -844,6 +844,11 @@ retest:
>   case IB_CM_SIDR_REQ_RCVD:
>   spin_unlock_irq(_id_priv->lock);
>   cm_reject_sidr_req(cm_id_priv, IB_SIDR_REJECT);
> + spin_lock_irq();
> + if (!RB_EMPTY_NODE(_id_priv->sidr_id_node))
> + rb_erase(_id_priv->sidr_id_node,
> +  _sidr_table);
> + spin_unlock_irq();

We should be able to use a return value from cm_reject_sidr_req() -- passed 
through from ib_send_cm_sidr_rep() to determine if the id was removed from the 
tree.

>   break;
>   case IB_CM_REQ_SENT:
>   case IB_CM_MRA_REQ_RCVD:
> @@ -3210,7 +3215,10 @@ int ib_send_cm_sidr_rep(struct ib_cm_id *cm_id,
>   spin_unlock_irqrestore(_id_priv->lock, flags);
> 
>   spin_lock_irqsave(, flags);
> - rb_erase(_id_priv->sidr_id_node, _sidr_table);
> + if (!RB_EMPTY_NODE(_id_priv->sidr_id_node)) {
> + rb_erase(_id_priv->sidr_id_node, _sidr_table);
> + RB_CLEAR_NODE(_id_priv->sidr_id_node);
> + }
>   spin_unlock_irqrestore(, flags);

Something is very wrong in this function if the id is not in the tree at this 
point.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [PATCH librdmacm] examples: Use gai_strerror rather than perror for [rdma_]getaddrinfo failures

2015-10-06 Thread Hefty, Sean
Thanks - merged all 3 
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: libmlx4 and libmlx5 git trees? Who is handling those?

2015-09-28 Thread Hefty, Sean
> Anyway, I had intended, and I've started on, making a change in how
> libibverbs is done anyway.  The idea that a new release is just a script
> that throws a version on and we go is naive.  I *will* be doing
> pre-release rc tarballs and there will be testing and there will be a
> release process that helps to make sure we don't see things like what we
> saw with librdmacm (sorry to pick on you Sean): 1.0.17/1.0.17.1,
> 1.0.18/1.0.18.1, and 1.0.19/1.0.19.1.

I don’t feel picked on.  :)

The issue is that basically no one tests the releases of the packages.  They 
only test the OFED bundle, and OFED doesn't want to pick up anything that's not 
already a release. 
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: libmlx4 and libmlx5 git trees? Who is handling those?

2015-09-28 Thread Hefty, Sean
> >The issue is that basically no one tests the releases of the packages.
> 
> We assume that the maintainer of the package tests it before they release
> it :)

No one has access to that many pieces of hardware and system configurations.  
The "community" can't dump the responsibility for testing onto a single person.

OFA has always had the problem of testing long after the fact -- months, if not 
years, after changes are merged upstream (kernel or user space).

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [PATCH] infiniband:core:Fix error handling in the function cm_lap_handler

2015-09-18 Thread Hefty, Sean
> diff --git a/drivers/infiniband/core/cm.c b/drivers/infiniband/core/cm.c
> index ea4db9c..8fab3a7 100644
> --- a/drivers/infiniband/core/cm.c
> +++ b/drivers/infiniband/core/cm.c
> @@ -2812,7 +2812,12 @@ static int cm_lap_handler(struct cm_work *work)
>   cm_init_av_for_response(work->port, work->mad_recv_wc->wc,
>   work->mad_recv_wc->recv_buf.grh,
>   _id_priv->av);
> - cm_init_av_by_path(param->alternate_path, _id_priv->alt_av);
> + ret = cm_init_av_by_path(param->alternate_path, _id_priv-
> >alt_av);
> + if (ret) {
> + spin_unlock_irq(_id_priv->lock);
> + cm_deref_id(cm_id_priv);
> + }

How can this be right?  On a failure, we release the spinlock and the reference 
on the cm_id, then continue on with the processing?

> +
>   ret = atomic_inc_and_test(_id_priv->work_count);
>   if (!ret)
>   list_add_tail(>list, _id_priv->work_list);
> --
> 2.1.4

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [PATCH 0/7] devcg: device cgroup extension for rdma resource

2015-09-11 Thread Hefty, Sean
> > Trying to limit the number of QPs that an app can allocate,
> > therefore, just limits how much of the address space an app can use.
> > There's no clear link between QP limits and HW resource limits,
> > unless you assume a very specific underlying implementation.
> 
> Isn't that the point though? We have several vendors with hardware
> that does impose hard limits on specific resources. There is no way to
> avoid that, and ultimately, those exact HW resources need to be
> limited.

My point is that limiting the number of QPs that an app can allocate doesn't 
necessarily mean anything.  Is allocating 1000 QPs with 1 entry each better or 
worse than 1 QP with 10,000 entries?  Who knows?

> If we want to talk about abstraction, then I'd suggest something very
> general and simple - two limits:
>  '% of the RDMA hardware resource pool' (per device or per ep?)
>  'bytes of kernel memory for RDMA structures' (all devices)

Yes - this makes more sense to me.

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [PATCH 0/7] devcg: device cgroup extension for rdma resource

2015-09-11 Thread Hefty, Sean
> So, the existence of resource limitations is fine.  That's what we
> deal with all the time.  The problem usually with this sort of
> interfaces which expose implementation details to users directly is
> that it severely limits engineering manuevering space.  You usually
> want your users to express their intentions and a mechanism to
> arbitrate resources to satisfy those intentions (and in a way more
> graceful than "we can't, maybe try later?"); otherwise, implementing
> any sort of high level resource distribution scheme becomes painful
> and usually the only thing possible is preventing runaway disasters -
> you don't wanna pin unused resource permanently if there actually is
> contention around it, so usually all you can do with hard limits is
> overcommiting limits so that it at least prevents disasters.

I agree with Tejun that this proposal is at the wrong level of abstraction.

If you look at just trying to limit QPs, it's not clear what that attempts to 
accomplish.  Conceptually, a QP is little more than an addressable endpoint.  
It may or may not map to HW resources (for Intel NICs it does not).  Even when 
HW resources do back the QP, the hardware is limited by how many QPs can 
realistically be active at any one time, based on how much caching is available 
in the NIC.

Trying to limit the number of QPs that an app can allocate, therefore, just 
limits how much of the address space an app can use.  There's no clear link 
between QP limits and HW resource limits, unless you assume a very specific 
underlying implementation.

- Sean
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: RDMA/CM and multiple QPs

2015-09-10 Thread Hefty, Sean
> right now RDMA/CM works on a QP basis, but seems very awakward if you
> want multiple QPs as part of a single logical device, which will be
> useful for a lot of modern protocols.  For example we will need to check
> in the CM handler that we're not getting a different ib_device if we
> want to apply the device limit in any sort of global scope, and it's
> generally very hard to get a struct ib_device that can be used as
> a driver model parent.
> 
> Is there any interest in trying to add an API to the CM to do a single
> address resolution and allocate multiple QPs with these checks in
> place?

IMO, you want a completely different level of abstraction.  One not based on a 
specific hardware implementation.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [PATCH 0/7] devcg: device cgroup extension for rdma resource

2015-09-10 Thread Hefty, Sean
> > In past there has been similar comment to have dedicated cgroup
> > controller for RDMA instead of merging with device cgroup.
> > I am ok with both the approach, however I prefer to utilize device
> > controller instead of spinning of new controller for new devices
> > category.
> > I anticipate more such need would arise and for new device category,
> > it might not be worth to have new cgroup controller.
> > RapidIO though very less popular and upcoming PCIe are on horizon to
> > offer similar benefits as that of RDMA and in future having one
> > controller for each of them again would not be right approach.
> >
> > I certainly seek your and others inputs in this email thread here
> whether
> > (a) to continue to extend device cgroup (which support character,
> > block devices white list) and now RDMA devices
> > or
> > (b) to spin of new controller, if so what are the compelling reasons
> > that it can provide compare to extension.
> 
> I'm doubtful that these things are gonna be mainstream w/o building up
> higher level abstractions on top and if we ever get there we won't be
> talking about MR or CQ or whatever.  Also, whatever next-gen is
> unlikely to have enough commonalities when the proposed resource knobs
> are this low level, so let's please keep it separate, so that if/when
> this goes out of fashion for one reason or another, the controller can
> silently wither away too.

As an attempt to abstract the hardware resources only, what these devices are 
exposing to apps can be viewed as command queues (RDMA QPs and SRQs), 
notification queues (RDMA CQs and EQs), and space in the device cache and 
allocated memory (RDMA MRs and AHs, maybe PDs).

If one wanted a higher level of abstraction, associations exist between these 
resources.  For example, command queues feed into notification queues.  Address 
handles are required resources to use an unconnected queue pair.

- Sean
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [PATCH] infiniband:core:Fix assumation that the function ib_send_cm_drep never fails in rdma_disconnect

2015-09-08 Thread Hefty, Sean
> This fixes the incorrect assumation that the function ib_send_cm_drep
> always runs successfully and never fails in the function rdma_disconnect
> by making this call be assigned the variable used to return to callers
> of rdma_disconnect in order to make sure that a error code is returned
> if a failure occurs with this call in order for the caller to handle.

It's not assuming that it doesn't fail.  It's saying that it doesn't care that 
it does.  The call to send a DREP is optimistic and correct in most 
circumstances.  The app is done at this point, and the connection has been 
disconnected.

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [PATCH] infiniband:core:Fix mutex locking to lock before unlocking in the function cma_ib_mc_handler

2015-09-08 Thread Hefty, Sean
> This fixes mutex locking to lock before unlocking in the function
> cma_ib_mc_handler for the mutex handler_mutex as part of the pointer
> id_priv before unlocking it after the critical region for event handler
> work and execution in order to have actual proper concurrent execution
> protection around this critical region in the function cma_ib_mc_handler.

The call cma_disable_callback() acquires the mutex.

> ---
>  drivers/infiniband/core/cma.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c
> index 143ded2..7b89195 100644
> --- a/drivers/infiniband/core/cma.c
> +++ b/drivers/infiniband/core/cma.c
> @@ -3158,6 +3158,7 @@ static int cma_ib_mc_handler(int status, struct
> ib_sa_multicast *multicast)
>   } else
>   event.event = RDMA_CM_EVENT_MULTICAST_ERROR;
> 
> + mutex_lock(_priv->handler_mutex);

This mutex is already held at this point.

>   ret = id_priv->id.event_handler(_priv->id, );
>   if (ret) {
>   cma_exch(id_priv, RDMA_CM_DESTROYING);
> --
> 2.1.4

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [PATCH] infiniband:Fix error handling in the function cm_init_av_by_path

2015-09-08 Thread Hefty, Sean
> This fixes error handling in the function cm_init_av_by_path
> to properly check if the internal call to the function
> ib_init_ah_from_path has failed by returning a error code and
> if so return immediately to the caller with this error code in
> order to signal/allow the caller to handle the error in its own
> intended error paths.
> 
> Signed-off-by: Nicholas Krause 

Acked-by: Sean Hefty 

> ---
>  drivers/infiniband/core/cm.c | 6 --
>  1 file changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/infiniband/core/cm.c b/drivers/infiniband/core/cm.c
> index 3a972eb..cb021fd 100644
> --- a/drivers/infiniband/core/cm.c
> +++ b/drivers/infiniband/core/cm.c
> @@ -379,8 +379,10 @@ static int cm_init_av_by_path(struct ib_sa_path_rec
> *path, struct cm_av *av)
>   return ret;
> 
>   av->port = port;
> - ib_init_ah_from_path(cm_dev->ib_device, port->port_num, path,
> -  >ah_attr);
> + ret = ib_init_ah_from_path(cm_dev->ib_device, port->port_num, path,
> +>ah_attr);
> + if (ret)
> + return ret;
>   av->timeout = path->packet_life_time + 1;
>   memcpy(av->smac, path->smac, sizeof(av->smac));
> 
> --
> 2.1.4

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [PATCH] infiniband:core:Fix error handling in the function join_handler

2015-09-08 Thread Hefty, Sean
> diff --git a/drivers/infiniband/core/multicast.c
> b/drivers/infiniband/core/multicast.c
> index 2cb865c..9284337 100644
> --- a/drivers/infiniband/core/multicast.c
> +++ b/drivers/infiniband/core/multicast.c
> @@ -526,8 +526,9 @@ static void join_handler(int status, struct
> ib_sa_mcmember_rec *rec,
>   process_join_error(group, status);
>   else {
>   int mgids_changed, is_mgid0;
> - ib_find_pkey(group->port->dev->device, group->port->port_num,
> -  be16_to_cpu(rec->pkey), _index);
> + if (ib_find_pkey(group->port->dev->device, group->port-
> >port_num,
> +  be16_to_cpu(rec->pkey), _index))
> + return;

We can't just abort here.  Apps are waiting on a callback to their join 
operation, and the SA believes that this port is part of the multicast group.

If we don't get a valid pkey, the group will be left set to 
'MCAST_INVALID_PKEY_INDEX'. 

- Sean

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [PATCH v6 0/4] Add network namespace support in the RDMA-CM

2015-08-27 Thread Hefty, Sean
This series looks reasonable to me
--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [PATCH 1/3] IB/uverbs: reject invalid or unknown opcodes

2015-08-19 Thread Hefty, Sean
 AFAIK, this path is rarely (never?) actually used. I think all the
 drivers we have can post directly from userspace.

I didn't think the ipath or qib drivers post from userspace.
--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [PATCH] IB/ucma: Fix theoretical user triggered use-after-free

2015-08-06 Thread Hefty, Sean
 Something like this:
 
 CPU A CPU B
   
 ucma_destroy_id()
  wait_for_completion()
   .. anything
 ucma_put_ctx()
   complete()
  .. continues ...
   ucma_leave_multicast()
mutex_lock(mut)
  atomic_inc(ctx-ref)
mutex_unlock(mut)
  ucma_free_ctx()
   ucma_cleanup_multicast()
mutex_lock(mut)
  kfree(mc)
rdma_leave_multicast(mc-ctx-cm_id,..
 
 Fix it by latching the ref at 0. Once it goes to 0 mc and ctx cannot
 leave the mutex(mut) protection.
 
 The other atomic_inc in ucma_get_ctx is OK because mutex(mut) protects
 it from racing with ucma_destroy_id.
 
 Signed-off-by: Jason Gunthorpe jguntho...@obsidianresearch.com

Acked-by: Sean Hefty sean.he...@intel.com

 ---
  drivers/infiniband/core/ucma.c | 6 +++---
  1 file changed, 3 insertions(+), 3 deletions(-)
 
 diff --git a/drivers/infiniband/core/ucma.c
 b/drivers/infiniband/core/ucma.c
 index 29b21213ea75..acac9eafdbf6 100644
 --- a/drivers/infiniband/core/ucma.c
 +++ b/drivers/infiniband/core/ucma.c
 @@ -1321,10 +1321,10 @@ static ssize_t ucma_leave_multicast(struct
 ucma_file *file,
   mc = ERR_PTR(-ENOENT);
   else if (mc-ctx-file != file)
   mc = ERR_PTR(-EINVAL);
 - else {
 + else if (!atomic_inc_not_zero(mc-ctx-ref))
 + mc = ERR_PTR(-ENXIO);
 + else
   idr_remove(multicast_idr, mc-id);
 - atomic_inc(mc-ctx-ref);
 - }
   mutex_unlock(mut);
 
   if (IS_ERR(mc)) {

--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [PATCH for-next 1/2] IB/core: Add support for RX/TX checksum offload capabilities report

2015-08-05 Thread Hefty, Sean
 +enum ib_csum_cap_flags {
 + IB_CSUM_RX_TCP_UDP   = 1  0,
 + IB_CSUM_RX_IP_HDR= 1  1,
 + IB_CSUM_TX_TCP_UDP   = 1  2,
 + IB_CSUM_TX_IP_HDR= 1  3
 +};

TPC and UDP should be separate flags.
--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [PATCH for-next 1/2] IB/core: Add support for RX/TX checksum offload capabilities report

2015-08-05 Thread Hefty, Sean
   +enum ib_csum_cap_flags {
   + IB_CSUM_RX_TCP_UDP   = 1  0,
   + IB_CSUM_RX_IP_HDR= 1  1,
   + IB_CSUM_TX_TCP_UDP   = 1  2,
   + IB_CSUM_TX_IP_HDR= 1  3
   +};
 
  TPC and UDP should be separate flags.
 
 Can you explain why?

For the same reason that you didn't include *all* L4 headers.

 What we are advertising here is offloads for L3 and L4 checksums, why
 should it be per protocol?

Because UDP and TCP have different headers, and it's entirely possible for a 
NIC (e.g. usnic) to support offloading one but not the other.
--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [RFC] split struct ib_send_wr

2015-08-04 Thread Hefty, Sean
 please take a look at my RFC patch here:
 
   http://git.infradead.org/users/hch/scsi.git/commitdiff/751774250b71d
 a83a26ba8584cff70f5e7bb7b1e
 
 the commit contains my explanation, but apparently the patch is too
 large for the list limit and didn't make it through.

This looks like a reasonable start.  It may help with feedback if you could 
just post the changes to ib_verbs.h.
--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [PATCH v2 05/13] IB/cm: Share listening CM IDs

2015-07-27 Thread Hefty, Sean
 +/**
 + * Create a new listening ib_cm_id and listen on the given service ID.
 + *
 + * If there's an existing ID listening on that same device and service
 ID,
 + * return it.
 + *
 + * @device: Device associated with the cm_id.  All related communication
 will
 + * be associated with the specified device.
 + * @cm_handler: Callback invoked to notify the user of CM events.
 + * @service_id: Service identifier matched against incoming connection
 + *   and service ID resolution requests.  The service ID should be
 specified
 + *   network-byte order.  If set to IB_CM_ASSIGN_SERVICE_ID, the CM will
 + *   assign a service ID to the caller.
 + * @service_mask: Mask applied to service ID used to listen across a
 + *   range of service IDs.  If set to 0, the service ID is matched
 + *   exactly.  This parameter is ignored if %service_id is set to
 + *   IB_CM_ASSIGN_SERVICE_ID.
 + *
 + * Callers should call ib_destroy_cm_id when done with the listener ID.
 + */
 +struct ib_cm_id *ib_cm_insert_listen(struct ib_device *device,
 +  ib_cm_handler cm_handler,
 +  __be64 service_id, __be64 service_mask)
 +{
 + struct cm_id_private *cm_id_priv;
 + struct ib_cm_id *cm_id;
 + unsigned long flags;
 + int err = 0;
 +
 + /* Create an ID in advance, since the creation may sleep */
 + cm_id = ib_create_cm_id(device, cm_handler, NULL);
 + if (IS_ERR(cm_id))
 + return cm_id;
 +
 + spin_lock_irqsave(cm.lock, flags);
 +
 + if (service_id == IB_CM_ASSIGN_SERVICE_ID)
 + goto new_id;
 +
 + /* Find an existing ID */
 + cm_id_priv = cm_find_listen(device, service_id, NULL);
 + if (cm_id_priv) {

The service_mask is being ignored through this code path.


 + if (cm_id-cm_handler != cm_handler || cm_id-context) {
 + /* Sharing an ib_cm_id with different handlers is not
 +  * supported */
 + spin_unlock_irqrestore(cm.lock, flags);
 + return ERR_PTR(-EINVAL);
 + }
 + atomic_inc(cm_id_priv-refcount);
 + ++cm_id_priv-listen_sharecount;
 + spin_unlock_irqrestore(cm.lock, flags);
 +
 + ib_destroy_cm_id(cm_id);
 + cm_id = cm_id_priv-id;
 + return cm_id;
 + }
 +
 +new_id:
 + /* Use newly created ID */
 + err = __ib_cm_listen(cm_id, service_id, service_mask, NULL, false);
 +
 + spin_unlock_irqrestore(cm.lock, flags);
 +
 + if (err) {
 + ib_destroy_cm_id(cm_id);
 + return ERR_PTR(err);
 + }
 + return cm_id;
 +}
 +EXPORT_SYMBOL(ib_cm_insert_listen);
 +
  static __be64 cm_form_tid(struct cm_id_private *cm_id_priv,
 enum cm_msg_sequence msg_seq)
  {

--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [PATCH v2 07/13] IB/cma: Helper functions to access port space IDRs

2015-07-27 Thread Hefty, Sean
 Add helper functions to access the IDRs by port-space and port number.
 
 Pass around the port-space enum in cma.c instead of using pointers to
 port-space IDRs.

Why?

  drivers/infiniband/core/cma.c | 81 --
 -
  1 file changed, 60 insertions(+), 21 deletions(-)

This change adds code, and AFAICT makes it less efficient.  What purpose does 
it serve?
--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [PATCH v2 09/13] IB/cma: Add net_dev and private data checks to RDMA CM

2015-07-27 Thread Hefty, Sean
 @@ -1040,6 +1040,181 @@ static int cma_save_net_info(struct sockaddr
 *src_addr,
   return cma_save_ip_info(src_addr, dst_addr, ib_event, service_id);
  }
 
 +struct cma_req_info {
 + struct ib_device *device;
 + int port;
 + const union ib_gid *local_gid;

The use of a pointer to some gid inside some other structure looks 
questionable.  There's no reference count held on another structure.  We would 
be better off copying the GID or setting an index into the device GID table.

 + __be64 service_id;
 + u16 pkey;
 +};

Please relocate to the top of the source file to group it with other structure 
definitions.
--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [PATCH V3 1/5] RDMA/core: Transport-independent access flags

2015-07-23 Thread Hefty, Sean
 The lkey is possibly useful if someone wants to do single op transfers
 larger than the S/G limit of the SQE. I haven't noticed any ULPs doing
 that..

That changes how the buffer is identified, which gets back to my question of 
are we identifying local buffers by address or through some sort of 
iova/tag/descriptor/mr/whatever.  Do we have this right in the API?
--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [PATCH V3 1/5] RDMA/core: Transport-independent access flags

2015-07-23 Thread Hefty, Sean
  This may be a nit, but IMO, the use of the term 'rkey' versus 'stag'
 matters.  They convey different ways of finding a data
 buffer.  For
  example, do you locate a buffer using the stag, then verify that the
 offset + length fits into the target buffer?
 
 Yes.  HW always uses the stag to locate a record that contains the stag
 state (valid or invalid), the access flags, the 8b key, the
 va_base, length, PBL describing the host pages, etc.  HW validates all
 that before using the buffer.  NOTE: An stag of 0 is the
 special local-dma-lkey which HW treats differently: If the stag is 0, then
 the address in the SGE is the bus/dma address itself and
 no lookup of a MR/PBL/etc is needed.   Stag 0 can ONLY be used by kernel
 users and MUST never be accepted/used from an ingress
 packet and MUST never be emitted on the wire in a READ or WRITE.
 
  Or do you locate the buffer
  by address, then verify that the key matches?
 
 
 This is never done.

These were more rhetorical questions to highlight that stag is a better choice 
for the name than rkey.  Lkey seems better than stag, though I'd rather see 
lkey removed completely.  I don't see where usnic, ipath, qib, or opa 
technically need an lkey at all.
--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: rdma_getaddrinfo and GUID

2015-07-23 Thread Hefty, Sean
 I'm use
 cat /sys/class/infiniband/mlx4_0/ports/1/gids/0
 
 fe80::::0002:c903:00ef:6601
 
 In node I'm use gid provided below service null, hints contains af_ib and
 rai_family|rai_numerichost

Ah, yes, the rdma_cm supports GIDs.  It does not support GUIDs.  Though it's 
usually trivial to convert a GUID into a GID.


RE: rdma_getaddrinfo and GUID

2015-07-23 Thread Hefty, Sean
  Hello, does it possible to use rdma_getaddrinfo and specify in node
 port
  GUID?
 
  The current code does not support this.
 
 
 Why rdma_getaddrinfo does not return error? Why it returns success and
 result contains port guid?

I don't think I'm following your questions on how rdma_getaddrinfo is being 
called.  Are you referring to a GUID (64-bit value) or a GID (128-bit value)?  
How are you calling rdma_getaddrinfo with this value?
N�r��yb�X��ǧv�^�)޺{.n�+{��ٚ�{ay�ʇڙ�,j��f���h���z��w���
���j:+v���w�j�mzZ+�ݢj��!�i

RE: [PATCH V3 1/5] RDMA/core: Transport-independent access flags

2015-07-23 Thread Hefty, Sean
 There is confusion about lkeys and rkeys with regard to iWARP.  In the
 iWARP verbs, there is no distinction between an lkey and
 rkey: they are the same key, called a Steering Tag or STAG.  When you
 create a MR, the lkey == rkey == STAG for iwarp transports.
 Somewhat related, but really a different issue, is that SGEs that are the
 target of a read need REMOTE_WRITE access flags on their
 STAG for iWARP.
 
 Clear as mud? :)

This may be a nit, but IMO, the use of the term 'rkey' versus 'stag' matters.  
They convey different ways of finding a data buffer.  For example, do you 
locate a buffer using the stag, then verify that the offset + length fits into 
the target buffer?  Or do you locate the buffer by address, then verify that 
the key matches?

Consider if we allow an app to specify the rkey/stag, or reference the buffer 
using an offset, rather than a virtual address.

This seems to be part of the difference between an lkey and an rkey. 
--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: rdma_getaddrinfo and GUID

2015-07-22 Thread Hefty, Sean
 Hello, does it possible to use rdma_getaddrinfo and specify in node port
 GUID?

The current code does not support this.


RE: [PATCH WIP 01/43] IB: Modify ib_create_mr API

2015-07-22 Thread Hefty, Sean
 +enum ib_mr_type {
 + IB_MR_TYPE_FAST_REG,
 + IB_MR_TYPE_SIGNATURE,

If we're going to go through the trouble of changing everything, I vote for 
dropping the word 'fast'. It's a marketing term.  It's goofy.  And the IB spec 
is goofy for using it.
--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [PATCH V3 1/5] RDMA/core: Transport-independent access flags

2015-07-08 Thread Hefty, Sean
 I am still not clear if all of us agree that we need it.
 Sean and Steve had some disclaimers...

A single entry point doesn't help a whole lot if the app must deal with 
different behavior based on how the API is used.  We have a single entry point 
for post_send, for example, and that makes things worse.  IMO, we need fewer 
registration *methods* not fewer calls.
--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [PATCH v7 1/4] IB/netlink: Add defines for local service requests through netlink

2015-06-30 Thread Hefty, Sean
  include/uapi/rdma/rdma_netlink.h |   87
 ++
  1 files changed, 87 insertions(+), 0 deletions(-)
 
 diff --git a/include/uapi/rdma/rdma_netlink.h
 b/include/uapi/rdma/rdma_netlink.h
 index 6e4bb42..d2c50e9 100644
 --- a/include/uapi/rdma/rdma_netlink.h
 +++ b/include/uapi/rdma/rdma_netlink.h
 @@ -7,12 +7,14 @@ enum {
   RDMA_NL_RDMA_CM = 1,
   RDMA_NL_NES,
   RDMA_NL_C4IW,
 + RDMA_NL_SA,
   RDMA_NL_NUM_CLIENTS
  };
 
  enum {
   RDMA_NL_GROUP_CM = 1,
   RDMA_NL_GROUP_IWPM,
 + RDMA_NL_GROUP_LS,
   RDMA_NL_NUM_GROUPS
  };
 
 @@ -128,5 +130,90 @@ enum {
   IWPM_NLA_ERR_MAX
  };
 
 +/* Local service group opcodes */
 +enum {
 + RDMA_NL_LS_OP_RESOLVE = 0,
 + RDMA_NL_LS_OP_SET_TIMEOUT,
 + RDMA_NL_LS_NUM_OPS
 +};
 +
 +/* Local service netlink message flags */
 +#define RDMA_NL_LS_F_OK  0x0100  /* Success response */
 +#define RDMA_NL_LS_F_ERR 0x0200  /* Failed response */

Is there a reason for these odd values?

--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [PATCH V2 4/5] RDMA/iser: support iWARP devices

2015-06-30 Thread Hefty, Sean
  I prefer to decouple the iSER changes with this core work.
  Jason/Sean... thoughts?  I could do the iSER w/o patch 3, and the
  follow up with a series that includes our final solution on
  transport independent memory registration and change all the TI
  kernel users (iser and nfsrdma) along with it.
 
 There has a been a big push lately to drop the strange transport
 specific stuff - if you are comitted to seeing the access flag clean
 up series through then I don't see a problem with using whatever order
 you like.

I agree.
--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [PATCH V2 3/5] RDMA/core: transport-independent access flags

2015-06-30 Thread Hefty, Sean
 I suggest to start consolidating to ib_create_mr() that receives an
 extensible ib_mr_init_attr and additional attributes can be mr_roles
 and mr_attrs.

I think this makes sense, but does it really help?  If the end result is that 
the app and providers basically end up switching on mr_attr::type, we end up 
reducing the number of APIs, but the code complexity remains the same.


RE: [PATCH RFC] RDMA/core: add rdma_get_dma_mr()

2015-06-29 Thread Hefty, Sean
+enum rdma_mr_roles {
+   RDMA_MRR_RECV   = 1,
+   RDMA_MRR_SEND   = (11),
+   RDMA_MRR_READ_SOURCE= (12),
+   RDMA_MRR_READ_SINK  = (13),
  
   Maybe it's just me, but it took me a second to figure out which was
 the
   source and which was the sink in RDMA reads. Do you think calling them
   initiator and responder/target would be better?
 
  Not to me.  For an RDMA operation, the initiator is the app that
  issues the read request WR.  That app doesn't create what I call the
  READ_SOURCE MR.  Its peer application does.  So calling READ_SOURCE
  something like READ_INITIATOR doesn't make sense to me.  That's why
  I thought SOURCE and SINK were clearer.  Perhaps not...
 
 I would call SOURCE the RESPONDER..
 
 Initiator/Poster and Responder is closest to the language used in
 other places in the API for READ.

I agree with Steve.  I think of this in terms of copy and like source/dest 
myself, similar to source/sink.  A memory buffer is not a responder or 
initiator.

In hindsight, we probably never should have exposed initiator_depth and 
responder_resources.  Those values have been an endless source of confusion and 
bugs.
--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [PATCH RFC] RDMA/core: add rdma_get_dma_mr()

2015-06-28 Thread Hefty, Sean
 I wanted to point out that with this scheme the ULP may sometimes get an
 MR with a wider set of permissions that it asked for, and I'm not sure
 that's always safe. Perhaps the ULP wants to guarantee that the MR
 doesn't allow certain kinds of accesses and doesn't expect the verbs
 layer to change that.

The providers can return the actual permissions that were granted.  If the app 
cares, it can check.
N�r��yb�X��ǧv�^�)޺{.n�+{��ٚ�{ay�ʇڙ�,j��f���h���z��w���
���j:+v���w�j�mzZ+�ݢj��!�i

RE: [PATCH RFC 2/2] RDMA/isert: Support iWARP transport

2015-06-25 Thread Hefty, Sean
  +* IWARP transports need REMOTE_WRITE for MRs used as the target of
  +* an RDMA_READ.  Since the DMA MR is used for all ports, then if
  +* any port is running IWARP, add REMOTE_WRITE.
  +*/
  +   if (any_port_is_iwarp(device))
 
 It would be nice to have a new-style cap test for this instead of open
 coding iwarp. Similar to rdma_cap_read_multi_sge

This should be pushed down into the devices, or at least within verbs, rather 
than having ULPs needing to know this oddness.
--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [PATCH RFC 2/2] RDMA/isert: Support iWARP transport

2015-06-25 Thread Hefty, Sean
 How would you envision doing this?  At the time a MR is registered the
 device driver doesn't know if the application will be doing
 RDMA reads or not on that MR.

I was thinking of checking for REMOTE_READ, but that doesn't work on the 
initiator side.

I guess you could a READ_DEST(SOURCE? TARGET?) flag to indicate that the MR 
will be used on the initiating side of the RDMA read operation.
--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [PATCH RFC 2/2] RDMA/isert: Support iWARP transport

2015-06-25 Thread Hefty, Sean
 What about moving to something more specific? Encode the allowed verbs
 in the access flag?

This makes more sense to me.  Something like: SEND, RECV, INIT READ, INIT 
WRITE, READ TARGET, WRITE TARGET, etc.  We're close to this, but it's not 
clear, for example, what flags are needed for a receive buffer.  None?  
LOCAL_WRITE?
--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [PATCH RFC] RDMA/core: add rdma_get_dma_mr()

2015-06-25 Thread Hefty, Sean
 +enum rdma_mr_roles {

I would drop naming the enum - it shouldn't be used, as the values are bit 
flags.

 + RDMA_MRR_RECV   = 1,
 + RDMA_MRR_SEND   = (11),
 + RDMA_MRR_READ_SOURCE= (12),
 + RDMA_MRR_READ_SINK  = (13),
 + RDMA_MRR_WRITE_SOURCE   = (14),
 + RDMA_MRR_WRITE_SINK = (15),
 + RDMA_MRR_ATOMIC = (16),
 + RDMA_MRR_MW_BIND= (17),
 + RDMA_MRR_ZERO_BASED = (18),

There's 'something' different about this role that cause me hesitation.  Maybe 
that it's dependent on other roles being set to be useful?  I'm not sure.

Maybe we need both roles and registration flags, with this declared as a flag?

 + RDMA_MRR_ACCESS_ON_DEMAND   = (19),

This one is even more different, as it doesn't impact how the MR interacts with 
the interfaces, or change how the application uses the MR.  This is really a 
hint to the provider regarding the selection of different implementation flows.


RE: [PATCHv2 RFC] IB: Add rdma_cap_ib_switch helper and use where appropriate

2015-06-22 Thread Hefty, Sean
 Note that there still remain a couple of node type checks in the kernel
 that we may want to remove. There's an IB CA check in cma.c:rdma_notify

This should check for rdma_cap_ib_cm()

 as well as in rds/ib.c:rds_ib_add_one and rds_ib_laddr_check and

Not sure what the underlying reason is for these checks.

 an RNIC check in rds/iw.c:rds_iw_add_one and rds_iw_laddr_check

rdma_cap_iw_cm()

The intent is to clarify why the checks that exist are made and replace them 
with a check that clearly conveys what the restriction is.  So, yes, the use of 
node_type should be replaced.

- Sean
--
To unsubscribe from this list: send the line unsubscribe linux-rdma in


RE: [PATCHv2 RFC] IB: Add rdma_cap_ib_switch helper and use where appropriate

2015-06-22 Thread Hefty, Sean
 rds_ib_add_one says:
 /* Only handle IB (no iWARP) devices */
 but not sure comment is 100% accurate as it checks node type != IB CA.
 
 rds_ib_laddr_check says:
 /* rdma_bind_addr will only succeed for IB  iWARP devices */
 /* due to this, we will claim to support iWARP devices unless we check
 node_type. */
 It's similar to above in that it checks node type != IB CA and does not
 seem 100% accurate.

I noticed these, but I don't know why RDS has this requirement.
--
To unsubscribe from this list: send the line unsubscribe linux-rdma in


RE: [PATCH RFC] IB: Add rdma_cap_ib_switch helper and use where appropriate

2015-06-18 Thread Hefty, Sean
 In an effort to reform the RDMA core and ULPs to minimize use of
 node_type in struct ib_device, an additional bit is added to
 struct ib_device for is_switch (IB switch). This is needed
 to be initialized by any IB switch device driver. This is a
 NEW requirement on such device drivers which are all
 out of tree.

I agree with Jason, only I would add we need to decide if we want to continue 
supporting the out of tree devices.  We have for so long that it seems 
reasonable to do so.  At the same time, I think we'd reject these changes now 
without a lower level user of them.
--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [PATCHv2 RFC] IB: Add rdma_cap_ib_switch helper and use where appropriate

2015-06-18 Thread Hefty, Sean
 In an effort to reform the RDMA core and ULPs to minimize use of
 node_type in struct ib_device, an additional bit is added to
 struct ib_device for is_switch (IB switch). This is needed
 to be initialized by any IB switch device driver. This is a
 NEW requirement on such device drivers which are all
 out of tree.
 
 In addition, an ib_switch helper was added to ib_verbs.h
 based on the is_switch device bit rather than node_type
 (although those should be consistent).
 
 The RDMA core (MAD, SMI, agent, sa_query, multicast, sysfs)
 as well as (IPoIB and SRP) ULPs are updated where
 appropriate to use this new helper. In some cases,
 the helper is now used under the covers of using
 rdma_[start end]_port rather than the open coding
 previously used.
 
 Signed-off-by: Hal Rosenstock h...@mellanox.com

Assuming that the final version matches this one:

Reviewed-by: Sean Hefty sean.he...@intel.com
--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [PATCH v2 01/49] IB/core: Add OPA Port header definitions

2015-06-17 Thread Hefty, Sean
   include/rdma/opa_port_info.h  |  433
 
  This matches the current code structure, but is this the best location
 for this file?
 
  Do you have a suggestion?
 
 Couldn't it be a header in the hfi1 driver directory ? Isn't it OPA
 specific ?

I don't have a specific suggestion.  If the contents are specific to this 
driver, I would move them into the driver directory.  For values that are OPA, 
but not driver, specific, include/rdma is probably the best option for now.  If 
the content is mixed (I didn't look closely), I would split the file.


RE: [PATCH 06/41] IB/hfi1: add char device instantiation code

2015-06-17 Thread Hefty, Sean
 PSM2 (and PSM) uses this to dialog with the driver for hardware specific
 setup.
 
 I'm suspected at some point in the past, this was ioctl based and changed
 due to bias against ioctl.

Could this be distinguished based on a common command header?
--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [PATCH 00/41] Add OPA gen1 driver

2015-06-17 Thread Hefty, Sean
 Ummm.. Could we get some more descriptions as to what this code is for?
 
 Do we have a new OmniPath protocol here as well or is it IB? Which
 standards are followed?
 
 I think the APIs that the driver uses need to be documented somewhere in
 particular if new sysfs entries etc are created.
 
 Are there any user space tools to control/exercise the driver and the
 protocol stack?

Christoph's questions are fair.  There should at least be a short write-up that 
describes how this driver fits into the RDMA stack.  (Maybe there are links 
with additional details that could be provided?)  For example, we know from 
Ira's patches and comments that OPA uses IB like management and supports the 
same verbs as the qib driver.  It's hard to piece this together from separate 
email threads or by looking at 56,000 lines of code.  I'm less concerned about 
the PSM support myself, as that's self-contained to this driver and doesn't 
impact the rest of the stack.  But at least clarify how it relates to other 
kernel modules (ib_cm, rdma_cm, ipoib, etc.), if at all.

- Sean

--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [PATCH 04/11] IB/cm: Expose DGID in SIDR request events

2015-06-16 Thread Hefty, Sean
 The idea is to allow SIDR request to be sorted by the GID, when we will
 have alias GIDs for IPoIB.

Please limit this series, or at least the early patches in this series, to 
simply moving the de-mux out of the ib_cm and into the rdma_cm.
--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [PATCH 14/14] IB/mad: Add final OPA MAD processing

2015-06-16 Thread Hefty, Sean
 You're right that apps can be coded to other CA types, like RNICs and
 USNICs.  However, those are all very different from an IB_CA due to
 limited queue pair types or limited primitives.  If OPA had that same
 limitation then I would agree it needs a different node type.
 
 So this will be my litmus test.  Currently, an app that supports all of
 the RDMA types looks like this:
 
 if (node_type == RNIC)
   do iwarpy stuff
 else if (node_type == USNIC)
   do USNIC stuff
 else if (node_type == IB_CA)
   do IB verbs stuff
   if (link_layer == Ethernet)
   do RoCE addressing/management
   else
   do IB addressing/management

The node type values were originally defined to align with the IB management 
NodeInfo structure.  AFAICT, there was no intent to associate those values with 
specific functionality or addressing or verbs support or anything else, really, 
outside of what IB management needed.

iWarp added a new node type, so that the IB management code could ignore those 
devices.  RoCE basically broke this association by forcing additional checks in 
the local management code to also check against the link layer.  The recent mad 
capability bits are a superior solution, making the node type obsolete. 

At this point, the node type essentially indicates if we start counting ports 
at a numeric value 0 or 1.  The NodeType that an OPA channel adapter will 
report in a NodeInfo structure will be 1, the same value as if it were an IB 
channel adapter.

In the end, this argument matters one iota.  The kernel code barely relies on 
the node type, and a user space verbs provider can report whatever value it 
wants.

- Sean


RE: [PATCH acm] acme.c: Fix IPv6 address handling in inet_any_pton

2015-06-16 Thread Hefty, Sean
Thanks - applied manually

  acm/src/acme.c |2 +-

Path?

 diff --git a/acm/src/acme.c b/acm/src/acme.c
 index 9bf7557..d54c8b9 100644
 --- a/acm/src/acme.c
 +++ b/acm/src/acme.c
 @@ -787,7 +787,7 @@ static int inet_any_pton(char *addr, struct sockaddr

This was off by a couple hundred lines.  Can you check your branch?

--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [PATCH 06/41] IB/hfi1: add char device instantiation code

2015-06-15 Thread Hefty, Sean
  +int __init dev_init(void)
  +{
  +   int ret;
  +
  +   ret = alloc_chrdev_region(hfi1_dev, 0, HFI1_NMINORS,
 DRIVER_NAME);
  +   if (ret  0) {
  +   pr_err(Could not allocate chrdev region (err %d)\n, -
 ret);
  +   goto done;
  +   }
  +
  +   class = class_create(THIS_MODULE, class_name());
  +   if (IS_ERR(class)) {
  +   ret = PTR_ERR(class);
  +   pr_err(Could not create device class (err %d)\n, -
 ret);
  +   unregister_chrdev_region(hfi1_dev, HFI1_NMINORS);
  +   }
  +
  +done:
  +   return ret;
  +}
 
 so what's the role of the char-device? 

Someone from Mike's team can chime in, but I believe this supports non-verbs 
interfaces (i.e. PSM), similar to what's done for qib.


RE: [PATCH 06/41] IB/hfi1: add char device instantiation code

2015-06-15 Thread Hefty, Sean
 Was Al Viro's comment on qib addressed here? That would be a
 showstopper for me..

Do you have a link to this comment?  I'm missing a bunch of messages from this 
thread and can't find anything from Al in the logs.

- Sean
--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [PATCH v2 01/49] IB/core: Add OPA Port header definitions

2015-06-15 Thread Hefty, Sean
  include/rdma/opa_port_info.h  |  433

This matches the current code structure, but is this the best location for this 
file?


RE: [PATCH 03/11] IB/cm: Expose service ID in request events

2015-06-15 Thread Hefty, Sean
 Expose the service ID on an incoming CM or SIDR request to the event
 handler. This will allow the RDMA CM module to de-multiplex connection
 requests based on the information encoded in the service ID.
 
 Signed-off-by: Haggai Eran hagg...@mellanox.com

Acked-by: Sean Hefty sean.he...@intel.com
--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [PATCH 04/11] IB/cm: Expose DGID in SIDR request events

2015-06-15 Thread Hefty, Sean
  drivers/infiniband/core/cm.c | 7 +++
  include/rdma/ib_cm.h | 2 ++
  2 files changed, 9 insertions(+)
 
 diff --git a/drivers/infiniband/core/cm.c b/drivers/infiniband/core/cm.c
 index c5f5f89e274a..46f99ec4080a 100644
 --- a/drivers/infiniband/core/cm.c
 +++ b/drivers/infiniband/core/cm.c
 @@ -2983,6 +2983,13 @@ static void cm_format_sidr_req_event(struct cm_work
 *work,
   param-pkey = __be16_to_cpu(sidr_req_msg-pkey);
   param-listen_id = listen_id;
   param-service_id = sidr_req_msg-service_id;
 + if (work-mad_recv_wc-wc-wc_flags  IB_WC_GRH) {
 + param-grh = 1;
 + memcpy(param-dgid, work-mad_recv_wc-recv_buf.grh-dgid,
 +sizeof(param-dgid));
 + } else {
 + param-grh = 0;

What is the use case here?  Are you trying to sort by device?  How does the GID 
of the GMP relate to the listen?

--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [PATCH 05/11] IB/cm: Share listening CM IDs

2015-06-15 Thread Hefty, Sean
 @@ -722,6 +725,7 @@ struct ib_cm_id *ib_create_cm_id(struct ib_device
 *device,
   INIT_LIST_HEAD(cm_id_priv-work_list);
   atomic_set(cm_id_priv-work_count, -1);
   atomic_set(cm_id_priv-refcount, 1);
 + cm_id_priv-listen_sharecount = 1;

This is setting the listen count before we know whether the cm_id will actually 
be used to listen.


   return cm_id_priv-id;
 
  error:
 @@ -847,11 +851,21 @@ retest:
   spin_lock_irq(cm_id_priv-lock);
   switch (cm_id-state) {
   case IB_CM_LISTEN:
 - cm_id-state = IB_CM_IDLE;
   spin_unlock_irq(cm_id_priv-lock);
 +
   spin_lock_irq(cm.lock);
 + if (--cm_id_priv-listen_sharecount  0) {
 + /* The id is still shared. */
 + atomic_dec(cm_id_priv-refcount);
 + spin_unlock_irq(cm.lock);
 + return;
 + }
   rb_erase(cm_id_priv-service_node, cm.listen_service_table);
   spin_unlock_irq(cm.lock);
 +
 + spin_lock_irq(cm_id_priv-lock);
 + cm_id-state = IB_CM_IDLE;
 + spin_unlock_irq(cm_id_priv-lock);

Why is the state being changed?  The cm_id is about to be freed anyway.


   break;
   case IB_CM_SIDR_REQ_SENT:
   cm_id-state = IB_CM_IDLE;
 @@ -929,11 +943,32 @@ void ib_destroy_cm_id(struct ib_cm_id *cm_id)
  }
  EXPORT_SYMBOL(ib_destroy_cm_id);
 
 -int ib_cm_listen(struct ib_cm_id *cm_id, __be64 service_id, __be64
 service_mask,
 -  struct ib_cm_compare_data *compare_data)
 +/**
 + * __ib_cm_listen - Initiates listening on the specified service ID for
 + *   connection and service ID resolution requests.
 + * @cm_id: Connection identifier associated with the listen request.
 + * @service_id: Service identifier matched against incoming connection
 + *   and service ID resolution requests.  The service ID should be
 specified
 + *   network-byte order.  If set to IB_CM_ASSIGN_SERVICE_ID, the CM will
 + *   assign a service ID to the caller.
 + * @service_mask: Mask applied to service ID used to listen across a
 + *   range of service IDs.  If set to 0, the service ID is matched
 + *   exactly.  This parameter is ignored if %service_id is set to
 + *   IB_CM_ASSIGN_SERVICE_ID.
 + * @compare_data: This parameter is optional.  It specifies data that
 must
 + *   appear in the private data of a connection request for the specified
 + *   listen request.
 + * @lock: If set, lock the cm.lock spin-lock when adding the id to the
 + *   listener tree. When false, the caller must already hold the spin-
 lock,
 + *   and compare_data must be NULL.
 + */
 +static int __ib_cm_listen(struct ib_cm_id *cm_id, __be64 service_id,
 +   __be64 service_mask,
 +   struct ib_cm_compare_data *compare_data,
 +   bool lock)
  {
   struct cm_id_private *cm_id_priv, *cur_cm_id_priv;
 - unsigned long flags;
 + unsigned long flags = 0;
   int ret = 0;
 
   service_mask = service_mask ? service_mask : ~cpu_to_be64(0);
 @@ -959,7 +994,8 @@ int ib_cm_listen(struct ib_cm_id *cm_id, __be64
 service_id, __be64 service_mask,
 
   cm_id-state = IB_CM_LISTEN;
 
 - spin_lock_irqsave(cm.lock, flags);
 + if (lock)
 + spin_lock_irqsave(cm.lock, flags);

I'm not a fan of this sort of locking structure.  Why not just move the locking 
into the outside calls completely?  I.e. move to ib_cm_listen() instead of 
passing in true.


   if (service_id == IB_CM_ASSIGN_SERVICE_ID) {
   cm_id-service_id = cpu_to_be64(cm.listen_service_id++);
   cm_id-service_mask = ~cpu_to_be64(0);
 @@ -968,7 +1004,8 @@ int ib_cm_listen(struct ib_cm_id *cm_id, __be64
 service_id, __be64 service_mask,
   cm_id-service_mask = service_mask;
   }
   cur_cm_id_priv = cm_insert_listen(cm_id_priv);
 - spin_unlock_irqrestore(cm.lock, flags);
 + if (lock)
 + spin_unlock_irqrestore(cm.lock, flags);
 
   if (cur_cm_id_priv) {
   cm_id-state = IB_CM_IDLE;
 @@ -978,8 +1015,86 @@ int ib_cm_listen(struct ib_cm_id *cm_id, __be64
 service_id, __be64 service_mask,
   }
   return ret;
  }
 +
 +int ib_cm_listen(struct ib_cm_id *cm_id, __be64 service_id, __be64
 service_mask,
 +  struct ib_cm_compare_data *compare_data)
 +{
 + return __ib_cm_listen(cm_id, service_id, service_mask, compare_data,
 +   true);
 +}
  EXPORT_SYMBOL(ib_cm_listen);
 
 +/**
 + * Create a new listening ib_cm_id and listen on the given service ID.
 + *
 + * If there's an existing ID listening on that same device and service
 ID,
 + * return it.
 + *
 + * @device: Device associated with the cm_id.  All related communication
 will
 + * be associated with the specified device.
 + * @cm_handler: Callback invoked to notify the user of CM events.
 + * @service_id: Service identifier matched against 

RE: [PATCH v2 25/49] IB/hfi1: add local mad header

2015-06-15 Thread Hefty, Sean
  drivers/infiniband/hw/hfi1/mad.h |  477
 ++
  1 file changed, 477 insertions(+)
  create mode 100644 drivers/infiniband/hw/hfi1/mad.h
 
 diff --git a/drivers/infiniband/hw/hfi1/mad.h
 b/drivers/infiniband/hw/hfi1/mad.h
 new file mode 100644
 index 000..af61249

Snip

 +#include rdma/ib_pma.h
 +#define USE_PI_LED_ENABLE1 /* use led enabled bit in struct
 +* opa_port_states, if available */
 +#include rdma/opa_smi.h
 +#include rdma/opa_port_info.h
 +#ifndef PI_LED_ENABLE_SUP
 +#define PI_LED_ENABLE_SUP 0
 +#endif
 +#include opa_compat.h
 +
 +#define IB_SMP_UNSUP_VERSIONcpu_to_be16(0x0004)
 +#define IB_SMP_UNSUP_METHOD cpu_to_be16(0x0008)
 +#define IB_SMP_UNSUP_METH_ATTR  cpu_to_be16(0x000C)
 +#define IB_SMP_INVALID_FIELDcpu_to_be16(0x001C)
 +
 +struct ib_node_info {
 + u8 base_version;
 + u8 class_version;
 + u8 node_type;
 + u8 num_ports;
 + __be64 sys_guid;
 + __be64 node_guid;
 + __be64 port_guid;
 + __be16 partition_cap;
 + __be16 device_id;
 + __be32 revision;
 + u8 local_port_num;
 + u8 vendor_id[3];
 +} __packed;
 +
 +struct ib_mad_notice_attr {
 + u8 generic_type;
 + u8 prod_type_msb;
 + __be16 prod_type_lsb;
 + __be16 trap_num;
 + __be16 issuer_lid;
 + __be16 toggle_count;
 +
 + union {
 + struct {
 + u8  details[54];
 + } raw_data;
 +
 + struct {
 + __be16  reserved;
 + __be16  lid;/* where violation happened */
 + u8  port_num;   /* where violation happened */
 + } __packed ntc_129_131;
 +
 + struct {
 + __be16  reserved;
 + __be16  lid;/* LID where change occurred */
 + u8  reserved2;
 + u8  local_changes;  /* low bit - local changes */
 + __be32  new_cap_mask;   /* new capability mask */
 + u8  reserved3;
 + u8  change_flags;   /* low 3 bits only */
 + } __packed ntc_144;
 +
 + struct {
 + __be16  reserved;
 + __be16  lid;/* lid where sys guid changed */
 + __be16  reserved2;
 + __be64  new_sys_guid;
 + } __packed ntc_145;
 +
 + struct {
 + __be16  reserved;
 + __be16  lid;
 + __be16  dr_slid;
 + u8  method;
 + u8  reserved2;
 + __be16  attr_id;
 + __be32  attr_mod;
 + __be64  mkey;
 + u8  reserved3;
 + u8  dr_trunc_hop;
 + u8  dr_rtn_path[30];
 + } __packed ntc_256;
 +
 + struct {
 + __be16  reserved;
 + __be16  lid1;
 + __be16  lid2;
 + __be32  key;
 + __be32  sl_qp1; /* SL: high 4 bits */
 + __be32  qp2;/* high 8 bits reserved */
 + union ib_gidgid1;
 + union ib_gidgid2;
 + } __packed ntc_257_258;
 +
 + } details;
 +};
 +
 +/*
 + * Generic trap/notice types
 + */
 +#define IB_NOTICE_TYPE_FATAL 0x80
 +#define IB_NOTICE_TYPE_URGENT0x81
 +#define IB_NOTICE_TYPE_SECURITY  0x82
 +#define IB_NOTICE_TYPE_SM0x83
 +#define IB_NOTICE_TYPE_INFO  0x84

Snip the end...


These are generic values and data structures.  Why are they isolated into a 
driver specific folder, versus being in a common include location?


RE: [PATCH for-next V2 0/9] Add completion timestamping support

2015-06-11 Thread Hefty, Sean
 Intel is supporting multicast in hardware. Its just a bad implementation
 (broadcast and filtering MC groups in the HCA or what was that?) and there
 is no plan to fix the issues despite the problem being known for quite
 some time. Also does this mean that libfabric only to supports the
 features needed by Intel?

Libfabric supports whatever features apps require and the participating vendors 
want to provide.  However, I, personally, have a limited amount of time to my 
day and will focus my effort on either what my management requires of me or 
areas that are most interesting.  Libfabric is specifically designed to be 
vendor, transport, and implementation neutral.

 I would be interested to see some measurements. AFAICT the Intel solutions
 are based on historically inferior IB technology from Qlogic which has
 never been able in my lab tests to compete latency wise with other
 vendors. I have heard these latency claims repeatedly from Qlogic
 personnel over the years.

You are referring to hardware latency.  Libfabric is software.  No amount of 
software is going to overcome hardware limitations.  The entire reason 
multicast support was removed from libfabric 1.0 was that the proposed API 
would have introduced latency by adding a branch into the code path.

 This is a well designed solution and its easy to use.

I fundamentally disagree with the practice of ad-hoc API design.  I stated this 
on the mail list probably 3 years ago.  I see nothing wrong with allowing and 
encouraging vendor specific extensions.

 It would help libfabric if you would work with other vendors and
 industries to include support for their needs. MPI is not the only
 applications that are running on the fabrics. I understand that is
 historically the only area in which Qlogic hardware was able to compete
 but I think you need to move beyond that. APIs should be as general as
 possible abstracting hardware as much as possible. A viable libfabric
 needs to be easy to use, low overhead as well as covering the requirements
 of multiple vendors and use cases.

Libfabric included requirements from multiple users and applications - MPI, 
SHMEM, PGAS, DBMS, and sockets all provided input.  It chose to target MPI as 
an initial priority, but it is not limited to MPI at all.  It also works with 
other vendors, including vendors that do not support the verbs interfaces -- 
Cisco, Cray, Intel PSM, plus others.  I, personally, ensured that libfabric 
would layer well over verbs based hardware.  That doesn't mean that I'm 
obligated to provide optimized providers over everyone's hardware.

The goal was not to spend 3 years working on a new API, but to get something 
usable within a short timeframe that could be extended.  OFIWG could have taken 
a different approach, but this was what the community (not Intel) selected.

As a company, Intel has many products.  A competitor in one area of the company 
may be a partner in another.  Xeon is by far the most important to this 
discussion.  It's why Intel dedicated developers to enabling high performance 
networking in Linux for over 10 years -- even before Intel had any products in 
those spaces.  And it's why Intel continues to fund development.  Sure, Intel 
now has IB and Omni-Path Architecture products, but they also have iWarp and 
Ethernet.  Intel MPI runs over a bunch of different fabrics.  Libfabric doesn't 
just need to work well over Intel NICs, it needs to work well over Intel 
platforms.

Returning to this thread, if I had to add time stamps to libfabric, I would 
still add 2 time stamps into a new completion structure.  Those time stamps 
would be selected using a method similar to what Doug stated in an earlier 
email.  The app would use an enum to select what the time stamps would capture. 
 However, I would lean more to having those values specified as part of the 
endpoint attributes, rather than the CQ.

- Sean

--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [PATCH 14/14] IB/mad: Add final OPA MAD processing

2015-06-11 Thread Hefty, Sean
  cap_is_switch_smi would be a nice refinement to let us drop nodetype.
 
 Exactly, we need a bit added to the immutable data bits, and a new cap_
 helper, and then nodetype is ready to be retired.  Add a bit, drop a
 u8 ;-)

I agree that the node type enum isn't particularly useful and should be 
retired.  In fact, I don't see where RDMA_NODE_IB_SWITCH is used by any 
upstream device.  So I don't think there's any obligation to keep it.  But even 
if we do, I'm not sure this is the correct approach.  I don't know this for a 
fact, but it seems more likely that someone would embed Linux on an IB switch 
than they would plug an IB switch into a Linux based system.  The code is 
designed around the latter.  Making this a system wide setting might simplify 
the code and optimize the code paths.

- Sean
N�r��yb�X��ǧv�^�)޺{.n�+{��ٚ�{ay�ʇڙ�,j��f���h���z��w���
���j:+v���w�j�mzZ+�ݢj��!�i

RE: [PATCH 14/14] IB/mad: Add final OPA MAD processing

2015-06-11 Thread Hefty, Sean
  I agree that the node type enum isn't particularly useful and should be
 retired.
 
 Are you referring to kernel space or user space or both ?

Short term, kernel space.  User space needs to keep something around for 
backwards compatibility.

But the in tree code will never expose this value up.

  But even if we do, I'm not sure this is the correct approach.  I don't
 know this for a fact,
  but it seems more likely that someone would embed Linux on an IB switch
 than they would plug an IB switch
  into a Linux based system.  The code is designed around the latter.
 Making this a system wide setting might simplify the code and optimize the
 code paths.
 
 I think we need to discuss how user space would be addressed.

This is an issue with out of tree drivers.  We're having to guess what things 
might be doing.  Are all devices being exposed up as a 'switch', or is there 
ever a case where there's a 'switch' device and an HCA device being reported 
together, or (highly unlikely) a switch device and an RNIC?

If the real use case is to embed Linux on a switch, then we could look at 
making that a system wide setting, rather than per device.  This could clean up 
the kernel without impacting the uABI.


RE: [PATCH] IB/core: Don't warn on no SA support in event handler

2015-06-10 Thread Hefty, Sean
  Registering an event handler is done for a device. This device may have
  one RoCE port (no SA cap) and one InfiniBand port (has SA cap).
  Therefore, warning from the event handler about a specific port that
  doesn't have SA cap is correct but pollutes the kernel log without a
  need.
 
 Maybe we should think about register event handler per port :-P

I agree, though I think that can be added separately from this change.
--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [PATCH for-next V2 0/9] Add completion timestamping support

2015-06-10 Thread Hefty, Sean
 There are multiple problems with libfrabric related to the use cases in my
 area. Most of all the lack of multicast support. Then there is the build
 up of software bloat on top. The interest here is in low latency
 operations. Redenzvous and other new features are really not wanted if
 they increase the latency.

Multicast is only supported by one vendor that has taken a hostile position 
against libfabric.  Support for multicast will eventually be there, but it's 
definitely not a priority for me.  As an open source project, anyone is welcome 
to propose patches.

For native providers, libfabric will reduce latency.  That's a provider 
implementation issue, and native providers will be available soon.  The OFIWG 
selected to have a working set of interfaces that applications can begin using 
immediately, versus waiting until there were a large set of native providers.

This is an entirely unrelated topic for this thread.  Jason asked what I would 
do with libfabric.  I answered.  Your comments do nothing to change my mind on 
that answer.

 This is a way it was implemented

IMO, this is exactly the problem.  The entire design is being driving by the 
implementation.  That produces an unmaintainable API and fractures the software 
ecosystem, which is exactly where we are today.



--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: rdmacm issue

2015-06-10 Thread Hefty, Sean
 RDMA_CM_EVENT_UNREACHABLE is indicated when there are timeouts in
 underlying CM protocol exchange. I suspect that the server is really
 busy and doesn't respond to the low level CM MADs in a timely manner.
 RDMA CM (and other kernel ULPs like IPoIB and SRP use hard coded local
 and remote response timeouts of 20 which is ~4.3 sec. This was discussed
 back in 2006 in
 http://comments.gmane.org/gmane.linux.drivers.openib/27664. In this
 scenario, the response took more than 30 seconds.  More recently, there
 was proposal to base RDMA CM response timeout on subnet timeout
 (http://permalink.gmane.org/gmane.linux.drivers.rdma/19969).

Hal's assessment seems likely.  Error code -110 is ETIMEDOUT.  However, the IB 
CM timeout when used through the RDMA CM should be much larger, as it makes use 
of the CM MRA protocol.  Unless a lot of MADs are being lost, or I'm not 
remembering the RDMA CM code correctly, there's still an issue here that I'm 
not understanding.

- Sean


RE: [PATCH v4 1/4] IB/netlink: Add defines for local service requests through netlink

2015-06-10 Thread Hefty, Sean
  +/* Local Service Reversible attribute */ struct
  +rdma_nla_ls_reversible {
  + __u32   reversible;
  +};
 
  Isn't __u8 sufficient for reversible ?
  Certainly enough. However, reversible is __u32 in struct
 ib_user_path_rec and int in struct ib_sa_path_rec.
 
 OK; I hadn't double checked there. So it's inherited from those
 reversible definitions.

I don't think we need to adhere to the sizes defined in other structures.  I 
agree with Hal's original comment.  A __u8 here seems sufficient, unless we 
want to introduce some bit fields.
--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [PATCH v4 0/4] Sending kernel pathrecord query to user cache server

2015-06-10 Thread Hefty, Sean
 This series does not attempt to optimize the kernel needing to know that a
 PR
 has been updated.  There are existing mechanisms for that.

Does this exist in the kernel?

--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [PATCH v4 0/4] Sending kernel pathrecord query to user cache server

2015-06-10 Thread Hefty, Sean
 Not in the patches themselves but in the general issue when a PR changes.
 
 Do you think this needs addressing or are things fine as they are now ?

IMO, I think it needs addressing in terms of can the proposed netlink 
architecture and design accommodate this sort of request in the future?  We 
shouldn't design in a limitation up front.  I don't see anything in the current 
approach that would cause an issue.  There would likely be a need for new 
messages and attributes.

--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [PATCH v4 0/4] Sending kernel pathrecord query to user cache server

2015-06-10 Thread Hefty, Sean
 Not directly.  IPoIB treats it that way.  I guess to be safe.
 
 Officially one should register for UnPath/RePath traps.  But no one has
 ever implemented that.
 
 To be clear I am agreeing with Hal that having some sort of update
 signal would be nice.  But I don't think that must be done before this
 series goes in.
 
 I think a PR update signal should be an extension to this interface.

I agree.  I just wanted to make sure that there wasn't some feature regarding 
PRs, such as unpath, that a kernel client would lose (i.e. it is currently 
implemented) by changing how the PRs are retrieved.  Basically nothing breaks 
with this change.
--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [PATCH v4 0/4] Sending kernel pathrecord query to user cache server

2015-06-10 Thread Hefty, Sean
   This series does not attempt to optimize the kernel needing to know
   that a PR has been updated.  There are existing mechanisms for that.
 
  Does this exist in the kernel?
 
 At least some support, yes.  For example client reregister marks all IPoIB
 paths as invalid.

Reregister indicates that all PRs are now invalid?
--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [PATCH for-next V5 00/12] Move RoCE GID management to IB/Core

2015-06-10 Thread Hefty, Sean
 Sean, this change is needed b/c two drivers have (mlx4 and ocrda) and
 more two to come soon (mlx5 and soft-Roce) would have the very same
 logic of constructing the port GID table according to netdev events and
 such, no point in repeating this logic/code over and over.
 
 Matan explained why we don't have 2 x Y deletions and 1 x Y insertions.

It more than doubles the amount of code.  That's not a cleanup.  It introduces 
a bunch of new functionality.  Jason has asked repeatedly to remove the RoCEv2 
code, and that has been ignored repeatedly.  As far as I'm concerned, this 
patch is not worth my time, and I will no longer even bother following this 
series.
--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [PATCH v4 0/4] Sending kernel pathrecord query to user cache server

2015-06-10 Thread Hefty, Sean
 While this appears to address the current upstream use model for ACM
 with it's multicast overlay backend where PRs are static, it does not
 appear to address PR changes.

Although this ties into ibacm, from the viewpoint of the kernel, there's no 
requirement on the user space implementation.

 Should aging/revalidation of PRs be supported ? If so, would this make
 PRs similar at high level to IP neighbor cache in kernel ?

This is requesting a new feature not supported by anything in the kernel today, 
and would seem to fall well outside the scope of the suggested changes.  Is 
there a specific issue in the patches that you are seeing related to this?

--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [PATCH 10/14] IB/mad: Add support for additional MAD info to/from drivers

2015-06-08 Thread Hefty, Sean
 @@ -1693,8 +1693,11 @@ struct ib_device {
 u8 port_num,
 const struct ib_wc *in_wc,
 const struct ib_grh *in_grh,
 -   const struct ib_mad *in_mad,
 -   struct ib_mad *out_mad);
 +   const struct ib_mad_hdr 
 *in_mad,
 +   size_t in_mad_size,
 +   struct ib_mad_hdr *out_mad,
 +   size_t *out_mad_size,
 +   u16 *out_mad_pkey_index);

I don't have an alternate suggestion at the moment, but this call is getting to 
be a little much.
--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [PATCH] IB/core: Get rid of redundant verb ib_destroy_mr

2015-06-08 Thread Hefty, Sean
 --- a/drivers/infiniband/hw/mlx5/mr.c
 +++ b/drivers/infiniband/hw/mlx5/mr.c
 @@ -1174,6 +1174,18 @@ static int clean_mr(struct mlx5_ib_mr *mr)
   int umred = mr-umred;
   int err;
 
 + if (mr-sig) {
 + if (mlx5_core_destroy_psv(dev-mdev,
 +   mr-sig-psv_memory.psv_idx))
 + mlx5_ib_warn(dev, failed to destroy mem psv %d\n,
 +  mr-sig-psv_memory.psv_idx);
 + if (mlx5_core_destroy_psv(dev-mdev,
 +   mr-sig-psv_wire.psv_idx))
 + mlx5_ib_warn(dev, failed to destroy wire psv %d\n,
 +  mr-sig-psv_wire.psv_idx);
 + kfree(mr-sig);
 + }
 +
   if (!umred) {
   err = destroy_mkey(dev, mr);
   if (err) {

This patch doesn't introduce this problem, but the err check suggests that 
deregistration can fail for some reason.  If so, should mr-sig be cleared 
above, so that a second call to dereg doesn't attempt to free the memory twice?

--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [PATCH for-next V5 00/12] Move RoCE GID management to IB/Core

2015-06-08 Thread Hefty, Sean
 Previously, every vendor implemented its net device notifiers in its own
 driver. This introduces a huge code duplication as figuring


  28 files changed, 2253 insertions(+), 860 deletions(-)

How does adding 1400 lines of code help reduce code duplication?

Can you please explain and justify why this change is actually needed?
--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


  1   2   3   4   5   6   7   8   9   10   >