RE: [PATCHv3 TRIVIAL] IB/core: Documentation fix to ib_mad_snoop_handler in the MAD header file
> In ib_mad.h, ib_mad_snoop_handler uses send_buf rather than send_wr The MAD snooping should be removed from the mad stack. There are no in tree users and the only attempt at adding one was rejected.
RE: [PATCHv3 TRIVIAL] IB/core: Documentation fix to ib_mad_snoop_handler in the MAD header file
> > There are no in tree users and the only attempt at adding one was > rejected. > > There are no in tree users of this but there is your madeye tool (which > is out of tree). This is still a useful debug tool for MADs and there > are people who still use that. It's an out of tree tool. Maintain the snooping hooks into the mad layer out of tree as well.
RE: [PATCH 22/37] IB/rdmavt: Add queue pair data structure to rdmavt
> >> +struct rvt_rwqe { > >> + u64 wr_id; > >> + u8 num_sge; > >> + struct ib_sge sg_list[0]; > >> +}; > >> + > >> +/* > >> + * This structure is used to contain the head pointer, tail pointer, > >> + * and receive work queue entries as a single memory allocation so > >> + * it can be mmap'ed into user space. > >> + * Note that the wq array elements are variable size so you can't > >> + * just index into the array to get the N'th element; > >> + * use get_rwqe_ptr() instead. > > > >Can you add/use an entry_size field? > > I think we could work something like that, however what we have in > qib/hfi1 > also works. Any reason we really should change this? I did not check to see what the drivers do. Using entry_size is straightforward, may provide the best performance, and can be done in common code, versus adding callbacks to all users. -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [PATCH 03/37] IB/rdmavt: Add protection domain to rdmavt.
> > How about I reword the comment to say that while we could continue > > allocating PDs being only constrained by system resources, the spec says > > there is a limit so we enforce that? > > I'm ok with that, Sean? That's fine -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [PATCH V2 1/1] [PATCH 1/1] Ibacm: default pkey for partitioned fabrics
Thanks - applied -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [PATCH 1/1] Ibacm: default pkey for partitioned fabrics
> > I mean that the user needs to investigate why the fabric is not working > out of box. > > My point is that an educated admin should _know_ to configure in these > cases and that debug is only when things are broken not by default in > these more complex cases. This means the limitations of the out of box > approach needs to be explained in the docs. When IP addresses are used, the corresponding pkey is used. The issue this patch is addressing is the mapping of 'hostnames' to pkeys, as show in this ibacm_addr.cfg example: #Name device port pkey cst-lin0 mlx4_0 1default cst-lin0-1 mlx4_0 1default cst-lin0-2 mlx4_0 2default Currently, 'default' is hard-coded to a pkey of 0x. The intent is to define a better default value. Kaike has suggested this be the new default: 1. Find the first non-management full-member pkey; 2. If it fails, find pkey 0x; 3. If pkey 0x is not available, use the first pkey. Is there better alternative for what default should be? Jason was suggesting use pkey[0], which seems less robust in theory, but is simple and may cover the vast majority of real use cases. The fewer cases where manual configuration is necessary, the fewer emails I receive, and the better off I am. :) - Sean
RE: [PATCH 1/1] Ibacm: default pkey for partitioned fabrics
> pkey[0] at least has the logic that the admin will configure things so > that the default ipoib device reaches the broadest audiance makes the > most sense to me. That is what most sites I've seen want to do. Kaike, will pkey[0] work in the configurations that you're targeting with this change? This seems like a very simple solution that's better than what we have now. > I'm not quite sure what the acm algorithm is, but can't it just figure > out the pkey from the IP routing? Ie if you have an IP address to > resolve a few netlink queries will tell you what pkey to use, and that > is where the acm multicast should go? When an IP address is used, it uses the correct pkey (based on routing data). I mentioned this in a separate email, but this addresses the case when hostnames are used and ipoib may not be involved. - Sean -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [PATCH 1/1] Ibacm: default pkey for partitioned fabrics
> >> Example: Compute nodes are assigned pkeys 0x8000 and 0x7fff. A node > >> running the job scheduler has pkeys 0x and 0x8000 (maybe it's > >> also the backup SA). Ibacm would need to select pkey 0x8000 for > >> communication. > > > > I've also seen the reverse, eg 0x is used for default ipoib > > communication and 0x8001 is assigned to only some nodes as a child > > vlan. > > That's what I use internally in our test lab. > > > Choosing 0x8001 in that case won't work either. > > Nope. The suggestion here would break our setup. Well, *would* being > the operative word where what I mean is would if we used ibacm. But > that's a story for another email... In this case, Kaike, please change the default to just be pkey[0]. - Sean -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [PATCH 1/1] Ibacm: default pkey for partitioned fabrics
> > 1. Find the first non-management full-member pkey; I.e. a pkey with the high-order bit set that is not 0x > > 2. If it fails, find pkey 0x; > > Order of 1 and 2 depends on use models for full default partition and > other partitions. Reversing 1 and 2 (full default partition first) would > handle the most common use models and handles Jason's case. > > The only common case that I'm aware of where that might fall down is in > the virtualized case. I'm not sure what policy is best there and would > need to think about that scenario some more (and there is more > fundamental issue with ACM in those environments). Offline I had asked about reversing 1 and 2. The reasoning given was that ibacm could be running on a 'management' node, but needed to communicate with compute nodes. IOW, the occurrence of a 'non-management, full member' pkey in the pkey table strongly indicates that a node is being used in a secure environment, and ibacm should prefer using that pkey over 0x. Example: Compute nodes are assigned pkeys 0x8000 and 0x7fff. A node running the job scheduler has pkeys 0x and 0x8000 (maybe it's also the backup SA). Ibacm would need to select pkey 0x8000 for communication. This seems like a reasonable argument to me. - Sean
RE: [PATCH 03/37] IB/rdmavt: Add protection domain to rdmavt.
> > > +struct ib_pd *rvt_alloc_pd(struct ib_device *ibdev, > > > +struct ib_ucontext *context, > > > +struct ib_udata *udata) > > > +{ > > > + struct rvt_dev_info *dev = ib_to_rvt(ibdev); > > > + struct rvt_pd *pd; > > > + struct ib_pd *ret; > > > + > > > + pd = kmalloc(sizeof(*pd), GFP_KERNEL); > > > + if (!pd) { > > > + ret = ERR_PTR(-ENOMEM); > > > + goto bail; > > > + } > > > + /* > > > + * This is actually totally arbitrary. Some correctness tests > > > + * assume there's a maximum number of PDs that can be allocated. > > > + * We don't actually have this limit, but we fail the test if > > > + * we allow allocations of more than we report for this value. > > > + */ > > > > Why not trap this in user space, rather than forcing the kernel to > support some test program? > > > > What do you mean "trap this in user space"? This code is not supporting > any > particular test program. > > If users try and allocate more protection domains then are advertised then > an > error should be returned. > > Perhaps the comment should be removed if it is confusing? There is no theoretical limit on the number of PDs, or likely most other resources. So why define and enforce an arbitrary limit? The justification given was that some test program would fail. If there's a concern about that test program passing, then enforce the limit in user space. I didn't find the comment confusing. Just the choice to let a test app drive the design. -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [PATCH 1/1] Ibacm: default pkey for partitioned fabrics
> > +* Determine the default pkey for parsing address file as well. > > +* order of preference: first full-member non-management pkey, > > +* 0x, first pkey. > > +*/ > > This really should just be the 0 index pkey, which exactly matches how > IPoIB determines the default pkey, which is what matters when talking > rdmacm.. Ibacm currently hard-codes the 'default' pkey to 0x for ibacm <-> ibacm communication. If there's no disagreement to switching to pkey[0], I'm fine with that. I did not realize that ipoib uses this same default. -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [PATCH 01/37] IB/rdmavt: Create module framework and handle driver registration
> >> +#define RDMAVT_DRIVER_VERSION "0.1" > >Do we really need driver version? > > Based on these patches thus far. Probably not. We do still have a bit more > development to go here, I added it more as a just in case but am not > steadfast in keeping it. > > What's the general rule here? Should we not assign a version if we aren't > explicitly doing anything with it yet, but may at some point? A driver version doesn't make sense. Use the kernel version. -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [PATCH 03/37] IB/rdmavt: Add protection domain to rdmavt.
> +struct ib_pd *rvt_alloc_pd(struct ib_device *ibdev, > +struct ib_ucontext *context, > +struct ib_udata *udata) > +{ > + struct rvt_dev_info *dev = ib_to_rvt(ibdev); > + struct rvt_pd *pd; > + struct ib_pd *ret; > + > + pd = kmalloc(sizeof(*pd), GFP_KERNEL); > + if (!pd) { > + ret = ERR_PTR(-ENOMEM); > + goto bail; > + } > + /* > + * This is actually totally arbitrary. Some correctness tests > + * assume there's a maximum number of PDs that can be allocated. > + * We don't actually have this limit, but we fail the test if > + * we allow allocations of more than we report for this value. > + */ Why not trap this in user space, rather than forcing the kernel to support some test program? N�r��yb�X��ǧv�^�){.n�+{��ٚ�{ay�ʇڙ�,j��f���h���z��w��� ���j:+v���w�j�mzZ+�ݢj"��!�i
RE: [PATCH 00/37] Add rdma verbs transport library
> The following series implements rdmavt. This is the rdma verbs transport > software library This is going to be an annoying request, but I would rather see this named the IB transport library, unless this framework is usable over iWarp as well. N�r��yb�X��ǧv�^�){.n�+{��ٚ�{ay�ʇڙ�,j��f���h���z��w��� ���j:+v���w�j�mzZ+�ݢj"��!�i
RE: [PATCH 06/37] IB/rdmavt: Add query and modify device stubs
> +static int rvt_query_device(struct ib_device *ibdev, > + struct ib_device_attr *props, > + struct ib_udata *uhw) > +{ > + /* > + * Return rvt_dev_info.props contents > + */ > + return -EINVAL; ENOSYS on all of the function templates. This and subsequent patches. N�r��yb�X��ǧv�^�){.n�+{��ٚ�{ay�ʇڙ�,j��f���h���z��w��� ���j:+v���w�j�mzZ+�ݢj"��!�i
RE: [PATCH 01/37] IB/rdmavt: Create module framework and handle driver registration
> @@ -0,0 +1,89 @@ > +/* > + * > + * This file is provided under a dual BSD/GPLv2 license. When using or > + * redistributing this file, you may do so under either license. > + * > + * GPL LICENSE SUMMARY > + * > + * Copyright(c) 2015 Intel Corporation. I'm guessing that the GPL license text does not have an Intel copyright. > + * > + * This program is free software; you can redistribute it and/or modify > + * it under the terms of version 2 of the GNU General Public License as > + * published by the Free Software Foundation. > + * > + * This program is distributed in the hope that it will be useful, but > + * WITHOUT ANY WARRANTY; without even the implied warranty of > + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU > + * General Public License for more details. > + * > + * BSD LICENSE > + * > + * Copyright(c) 2015 Intel Corporation. Or the BSD license either. Please move the copyright notices to the top of the comment block in this and other files. > + * > + * Redistribution and use in source and binary forms, with or without > + * modification, are permitted provided that the following conditions > + * are met: > + * > + * - Redistributions of source code must retain the above copyright > + *notice, this list of conditions and the following disclaimer. > + * - Redistributions in binary form must reproduce the above copyright > + *notice, this list of conditions and the following disclaimer in > + *the documentation and/or other materials provided with the > + *distribution. > + * - Neither the name of Intel Corporation nor the names of its > + *contributors may be used to endorse or promote products derived > + *from this software without specific prior written permission. > + * > + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS > + * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT > + * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR > + * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT > + * OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, > + * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT > + * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, > + * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY > + * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT > + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE > + * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. > + * > + */ N�r��yb�X��ǧv�^�){.n�+{��ٚ�{ay�ʇڙ�,j��f���h���z��w��� ���j:+v���w�j�mzZ+�ݢj"��!�i
RE: [PATCH 05/37] IB/rdmavt: Macroize override checks during driver registration
> +/* > + * Check driver override. If driver passes a value use it, otherwise we > use our > + * own value. > + */ > +#define CDR(rdi, x) \ > + rdi->ibdev.x = rdi->ibdev.x ? : rvt_ ##x This is an extremely obscure name. No one will be able to look at this: > + CDR(rdi, alloc_pd); > + CDR(rdi, dealloc_pd); And have a clue what is happening without searching for the macro.
RE: [PATCH 21/37] IB/rdmavt: Move MR datastructures into rvt
> include/rdma/rdma_vt.h | 53 > > 1 files changed, 53 insertions(+), 0 deletions(-) > > diff --git a/include/rdma/rdma_vt.h b/include/rdma/rdma_vt.h > index 5112dd7..39a0737 100644 > --- a/include/rdma/rdma_vt.h > +++ b/include/rdma/rdma_vt.h > @@ -59,6 +59,56 @@ > #include "ib_verbs.h" > > /* > + * For Memory Regions. This stuff should probably be moved into > rdmavt/mr.h once > + * drivers no longer need access to the MR directly. > + */ > + > +/* > + * A segment is a linear region of low physical memory. > + * Used by the verbs layer. > + */ > +struct rvt_seg { > + void *vaddr; > + size_t length; > +}; > + > +/* The number of rvt_segs that fit in a page. */ > +#define RVT_SEGSZ (PAGE_SIZE / sizeof(struct rvt_seg)) > + > +struct rvt_segarray { > + struct rvt_seg segs[RVT_SEGSZ]; > +}; > + > +struct rvt_mregion { > + struct ib_pd *pd; /* shares refcnt of ibmr.pd */ > + u64 user_base; /* User's address for this region */ > + u64 iova; /* IB start address of this region */ > + size_t length; > + u32 lkey; > + u32 offset; /* offset (bytes) to start of region */ > + int access_flags; > + u32 max_segs; /* number of rvt_segs in all the arrays */ > + u32 mapsz; /* size of the map array */ > + u8 page_shift; /* 0 - non unform/non powerof2 sizes */ > + u8 lkey_published; /* in global table */ Without looking ahead in the patch series, won't the access_flags indicate this? > + struct completion comp; /* complete when refcount goes to zero */ > + atomic_t refcount; > + struct rvt_segarray *map[0];/* the segments */ > +}; > + > +#define RVT_MAX_LKEY_TABLE_BITS 23 > + > +struct rvt_lkey_table { > + spinlock_t lock; /* protect changes in this struct */ > + u32 next; /* next unused index (speeds search) */ > + u32 gen;/* generation count */ > + u32 max;/* size of the table */ > + struct rvt_mregion __rcu **table; > +}; > + > +/* End Memmory Region */ > + > +/* > * Things that are driver specific, module parameters in hfi1 and qib > */ > struct rvt_driver_params { > @@ -125,6 +175,9 @@ struct rvt_dev_info { > /* Driver specific properties */ > struct rvt_driver_params dparms; > > + struct rvt_mregion __rcu *dma_mr; > + struct rvt_lkey_table lk_table; Go crazy here and add the 'ey' into the field name.
RE: [PATCH 22/37] IB/rdmavt: Add queue pair data structure to rdmavt
> drivers/infiniband/sw/rdmavt/qp.h |5 - > include/rdma/rdma_vt.h| 233 > + > 2 files changed, 233 insertions(+), 5 deletions(-) > > diff --git a/drivers/infiniband/sw/rdmavt/qp.h > b/drivers/infiniband/sw/rdmavt/qp.h > index 4e4709f..c80d326 100644 > --- a/drivers/infiniband/sw/rdmavt/qp.h > +++ b/drivers/infiniband/sw/rdmavt/qp.h > @@ -53,11 +53,6 @@ > > #include > > -struct rvt_qp { > - struct ib_qp *ibqp; > - /* Other stuff */ > -}; > - > struct ib_qp *rvt_create_qp(struct ib_pd *ibpd, > struct ib_qp_init_attr *init_attr, > struct ib_udata *udata); > diff --git a/include/rdma/rdma_vt.h b/include/rdma/rdma_vt.h > index 39a0737..8d3a41a 100644 > --- a/include/rdma/rdma_vt.h > +++ b/include/rdma/rdma_vt.h > @@ -109,6 +109,239 @@ struct rvt_lkey_table { > /* End Memmory Region */ > > /* > + * Things needed for the Queue Pair definition. Like the MR stuff above > the > + * following should probably get moved to qp.h once drivers stop trying > to make > + * and manipulate thier own QPs. For the few instnaces where a driver may > need > + * to look into a queue pair there should be a pointer to a driver > priavte data > + * structure that they can look at. > + */ > + > +/* > + * These keep track of the copy progress within a memory region. > + * Used by the verbs layer. > + */ > +struct rvt_sge { > + struct rvt_mregion *mr; > + void *vaddr;/* kernel virtual address of segment */ > + u32 sge_length; /* length of the SGE */ > + u32 length; /* remaining length of the segment */ > + u16 m; /* current index: mr->map[m] */ Rename to cur_map? > + u16 n; /* current index: mr->map[m]->segs[n] */ cur_seg? > +}; > + > +/* > + * Send work request queue entry. > + * The size of the sg_list is determined when the QP is created and > stored > + * in qp->s_max_sge. > + */ > +struct rvt_swqe { > + union { > + struct ib_send_wr wr; /* don't use wr.sg_list */ > + struct ib_ud_wr ud_wr; > + struct ib_reg_wr reg_wr; > + struct ib_rdma_wr rdma_wr; > + struct ib_atomic_wr atomic_wr; > + }; > + u32 psn;/* first packet sequence number */ > + u32 lpsn; /* last packet sequence number */ > + u32 ssn;/* send sequence number */ > + u32 length; /* total length of data in sg_list */ > + struct rvt_sge sg_list[0]; > +}; > + > +/* > + * Receive work request queue entry. > + * The size of the sg_list is determined when the QP (or SRQ) is created > + * and stored in qp->r_rq.max_sge (or srq->rq.max_sge). > + */ > +struct rvt_rwqe { > + u64 wr_id; > + u8 num_sge; > + struct ib_sge sg_list[0]; > +}; > + > +/* > + * This structure is used to contain the head pointer, tail pointer, > + * and receive work queue entries as a single memory allocation so > + * it can be mmap'ed into user space. > + * Note that the wq array elements are variable size so you can't > + * just index into the array to get the N'th element; > + * use get_rwqe_ptr() instead. Can you add/use an entry_size field? > + */ > +struct rvt_rwq { > + u32 head; /* new work requests posted to the head */ > + u32 tail; /* receives pull requests from here. */ > + struct rvt_rwqe wq[0]; > +}; > + > +struct rvt_rq { > + struct rvt_rwq *wq; > + u32 size; /* size of RWQE array */ > + u8 max_sge; > + /* protect changes in this struct */ > + spinlock_t lock cacheline_aligned_in_smp; > +}; N�r��yb�X��ǧv�^�){.n�+{��ٚ�{ay�ʇڙ�,j��f���h���z��w��� ���j:+v���w�j�mzZ+�ݢj"��!�i
RE: [PATCH] IB/mad: In validate_mad, validate CM method and attribute
> > + /* CM attributes other than ClassPortInfo only use Send method > */ > > + if (mad_hdr->mgmt_class == IB_MGMT_CLASS_CM) { > > + if (mad_hdr->attr_id != IB_MGMT_CLASSPORTINFO_ATTR_ID) { > > + if (mad_hdr->method != IB_MGMT_METHOD_SEND) > > + goto out; > > + } else if (mad_hdr->method != IB_MGMT_METHOD_GET_RESP) > > + goto out; > > + } > > Doesn't this invalidate a CM Get(ClassPortInfo) mad? I believe this does. I think you could remove the else if clause and let the received MAD get passed to the CM. It would be dropped there as unsupported. The net result is likely the same. -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [PATCH] IB/mad: In validate_mad, validate CM method and attribute
> IIRC responding to Get(CPI) is mandatory? Maybe the drivers are responding? I don't believe that the CM does. -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [PATCH] IB/mad: In validate_mad, validate CM method and attribute
> Receipt of CM MAD with response method for other than ClassPortInfo > attribute is invalid. > > CM attributes other than ClassPortInfo use send method only > and GetResp is valid for ClassPortInfo attribute. > Note also that the CM ClassPortInfo is not currently supported. > > The SRP initiator does not maintain a timeout policy for CM connect > requests relies on the CM layer to do that. The result was that > the SRP initiator hung as the connect request never completed. > > A new SRP target has been observed to respond to Send CM REQ > with GetResp of CM REQ with bad status. This is non conformant > with IBA spec but exposes a vulnerability in the current MAD/CM > code which will respond to the incoming GetResp of CM REQ as if > it was a valid incoming Send of CM REQ rather than tossing > this on the floor. It also causes the MAD layer not to > retransmit the original REQ even though it has not received a REP. > > Reviewed-by: Sagi Grimberg> Signed-off-by: Hal Rosenstock Reviewed-by: Sean Hefty
RE: [PATCH libibcm] cmpost.c: Handle ibv_get_device_list returning no IB devices in init()
Merged - thanks. This is the first patch against the libibcm in over 4 years. Is there a reason why this is being used instead of the librdmacm? I ask because I assumed that the libibcm was basically deprecated. The last release was over 6 years ago. - Sean
RE: [PATCH rdma-RC] IB/cm: Fix rb-tree duplicate free and use-after-free
> > Sean, I need to close on this patch. What is your position after > > Matan's explanation? > > > > Absent an objection from Sean, I've pulled this in. A use after free > bug is a pretty serious issue, and you've listed an error flow that > triggers it. The only thing bugging me is that this code is 10+ years > old and this didn't show up until now, which makes me think that some > recent change is the cause of this. I've made note of that fact in my > tag commit and I think this warrants further examination in the next > kernel cycle. But since we are so close to out of time on 4.3, I deemed > it better to fix the use after free issue, even if it isn't necessarily > the perfect fix, than leave that hanging about. I was out last week. I think one of the reasons that this bug hasn't shown up is that very few apps use UD QPs, and those that do likely exchange QP information using some other out of band mechanism. - Sean -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [PATCH rdma-RC] IB/cm: Fix sleeping while atomic when creating AH from WC
> Today, cm assumes paths are reversible primary_path->reversible = 1. I can't quickly find a link, but I believe all CM MADs are reversible, per the spec.
RE: [PATCH rdma-RC] IB/cm: Fix sleeping while atomic when creating AH from WC
> >> ib_create_ah_from_wc needs to resolve the DMAC in order to create the > >> AH (this may result sending an ARP and waiting for response). > >> CM uses this function (which is now sleepable). > > > > This is a significant change to the CM. The CM calls are invoked > assuming that they return relatively quickly. They're invoked from > callbacks and internally. Having the calls now wait for an ARP response > requires that this be re-architected, so the calling thread doesn't go out > to lunch for several seconds. > > Agree - this is a significant change, but it was done a long time ago > (at v4.3 if I recall). When we need to send a message we need to We're at 4.3-rc5? > figure out the destination MAC. Even the passive side needs to do that > as some vendors don't report the source MAC of the packet in their wc. > Even if they did, since IP based addressing is rout-able by its > nature, it should follow the networking stack rules. Some crazy > configurations could force sending responses to packets that came from > router1 to router2 - so we have no choice than resolving the DMAC at > every side. Ib_create_ah_from_wc is broken. It is now an asynchronous operation, only the call itself was left as synchronous. We can't block kernel threads for a minute, or however long ARP takes to resolve. The call itself must change to be async, and all users of it updated to allocate some request, queue it, and handle all race conditions that result -- such as state changes or destruction of the work that caused the request to be initiated.
RE: [PATCH rdma-RC] IB/cm: Fix sleeping while atomic when creating AH from WC
> On Mon, Oct 12, 2015 at 7:42 PM, Hefty, Sean <sean.he...@intel.com> wrote: > >> When IP based addressing was introduced, ib_create_ah_from_wc was > >> changed in order to support a suitable AH. Since this AH should > >> now contains the DMAC (which isn't a simple derivative of the GID). > >> In order to find the DMAC, an ARP should sometime be sent. This ARP > >> is a sleeping context. > > > > Wait - are you saying that the CM may now be waiting for an ARP response > before it can send a message? > > > > ib_create_ah_from_wc needs to resolve the DMAC in order to create the > AH (this may result sending an ARP and waiting for response). > CM uses this function (which is now sleepable). This is a significant change to the CM. The CM calls are invoked assuming that they return relatively quickly. They're invoked from callbacks and internally. Having the calls now wait for an ARP response requires that this be re-architected, so the calling thread doesn't go out to lunch for several seconds. - Sean
RE: [PATCH rdma-RC] IB/cm: Fix sleeping while atomic when creating AH from WC
> When IP based addressing was introduced, ib_create_ah_from_wc was > changed in order to support a suitable AH. Since this AH should > now contains the DMAC (which isn't a simple derivative of the GID). > In order to find the DMAC, an ARP should sometime be sent. This ARP > is a sleeping context. Wait - are you saying that the CM may now be waiting for an ARP response before it can send a message? -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [PATCH rdma-RC] IB/cm: Fix rb-tree duplicate free and use-after-free
> ib_send_cm_sidr_rep could sometimes erase the node from the sidr > (depending on errors in the process). Since ib_send_cm_sidr_rep is > called both from cm_sidr_req_handler and cm_destroy_id, cm_id_priv This should clarify that it is the app calling from the callback, and not a direct call from the cm_sidr_req_handler. > could be either erased from the rb_tree twice or not erased at all. In an error case, I can see why it would be left in the rbtree, but I don't see how it can be removed twice. > Fixing that by making sure it's erased only once before freeing > cm_id_priv. > > Fixes: a977049dacde ('[PATCH] IB: Add the kernel CM implementation') > Signed-off-by: Doron Tsur> Signed-off-by: Matan Barak > --- > > Hi Doug, > This patch fixes a bug in the CM. In some flow, rb-tree could be > freed twice or used after it was freed. This bug was picked by > our regression tests and this fix was verified. > > Thanks, > Doron and Matan > > drivers/infiniband/core/cm.c | 10 +- > 1 file changed, 9 insertions(+), 1 deletion(-) > > diff --git a/drivers/infiniband/core/cm.c b/drivers/infiniband/core/cm.c > index f5cf1c4..56ff0f3 100644 > --- a/drivers/infiniband/core/cm.c > +++ b/drivers/infiniband/core/cm.c > @@ -844,6 +844,11 @@ retest: > case IB_CM_SIDR_REQ_RCVD: > spin_unlock_irq(_id_priv->lock); > cm_reject_sidr_req(cm_id_priv, IB_SIDR_REJECT); > + spin_lock_irq(); > + if (!RB_EMPTY_NODE(_id_priv->sidr_id_node)) > + rb_erase(_id_priv->sidr_id_node, > + _sidr_table); > + spin_unlock_irq(); We should be able to use a return value from cm_reject_sidr_req() -- passed through from ib_send_cm_sidr_rep() to determine if the id was removed from the tree. > break; > case IB_CM_REQ_SENT: > case IB_CM_MRA_REQ_RCVD: > @@ -3210,7 +3215,10 @@ int ib_send_cm_sidr_rep(struct ib_cm_id *cm_id, > spin_unlock_irqrestore(_id_priv->lock, flags); > > spin_lock_irqsave(, flags); > - rb_erase(_id_priv->sidr_id_node, _sidr_table); > + if (!RB_EMPTY_NODE(_id_priv->sidr_id_node)) { > + rb_erase(_id_priv->sidr_id_node, _sidr_table); > + RB_CLEAR_NODE(_id_priv->sidr_id_node); > + } > spin_unlock_irqrestore(, flags); Something is very wrong in this function if the id is not in the tree at this point. -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [PATCH librdmacm] examples: Use gai_strerror rather than perror for [rdma_]getaddrinfo failures
Thanks - merged all 3 -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: libmlx4 and libmlx5 git trees? Who is handling those?
> Anyway, I had intended, and I've started on, making a change in how > libibverbs is done anyway. The idea that a new release is just a script > that throws a version on and we go is naive. I *will* be doing > pre-release rc tarballs and there will be testing and there will be a > release process that helps to make sure we don't see things like what we > saw with librdmacm (sorry to pick on you Sean): 1.0.17/1.0.17.1, > 1.0.18/1.0.18.1, and 1.0.19/1.0.19.1. I don’t feel picked on. :) The issue is that basically no one tests the releases of the packages. They only test the OFED bundle, and OFED doesn't want to pick up anything that's not already a release. -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: libmlx4 and libmlx5 git trees? Who is handling those?
> >The issue is that basically no one tests the releases of the packages. > > We assume that the maintainer of the package tests it before they release > it :) No one has access to that many pieces of hardware and system configurations. The "community" can't dump the responsibility for testing onto a single person. OFA has always had the problem of testing long after the fact -- months, if not years, after changes are merged upstream (kernel or user space). -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [PATCH] infiniband:core:Fix error handling in the function cm_lap_handler
> diff --git a/drivers/infiniband/core/cm.c b/drivers/infiniband/core/cm.c > index ea4db9c..8fab3a7 100644 > --- a/drivers/infiniband/core/cm.c > +++ b/drivers/infiniband/core/cm.c > @@ -2812,7 +2812,12 @@ static int cm_lap_handler(struct cm_work *work) > cm_init_av_for_response(work->port, work->mad_recv_wc->wc, > work->mad_recv_wc->recv_buf.grh, > _id_priv->av); > - cm_init_av_by_path(param->alternate_path, _id_priv->alt_av); > + ret = cm_init_av_by_path(param->alternate_path, _id_priv- > >alt_av); > + if (ret) { > + spin_unlock_irq(_id_priv->lock); > + cm_deref_id(cm_id_priv); > + } How can this be right? On a failure, we release the spinlock and the reference on the cm_id, then continue on with the processing? > + > ret = atomic_inc_and_test(_id_priv->work_count); > if (!ret) > list_add_tail(>list, _id_priv->work_list); > -- > 2.1.4 -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [PATCH 0/7] devcg: device cgroup extension for rdma resource
> > Trying to limit the number of QPs that an app can allocate, > > therefore, just limits how much of the address space an app can use. > > There's no clear link between QP limits and HW resource limits, > > unless you assume a very specific underlying implementation. > > Isn't that the point though? We have several vendors with hardware > that does impose hard limits on specific resources. There is no way to > avoid that, and ultimately, those exact HW resources need to be > limited. My point is that limiting the number of QPs that an app can allocate doesn't necessarily mean anything. Is allocating 1000 QPs with 1 entry each better or worse than 1 QP with 10,000 entries? Who knows? > If we want to talk about abstraction, then I'd suggest something very > general and simple - two limits: > '% of the RDMA hardware resource pool' (per device or per ep?) > 'bytes of kernel memory for RDMA structures' (all devices) Yes - this makes more sense to me. -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [PATCH 0/7] devcg: device cgroup extension for rdma resource
> So, the existence of resource limitations is fine. That's what we > deal with all the time. The problem usually with this sort of > interfaces which expose implementation details to users directly is > that it severely limits engineering manuevering space. You usually > want your users to express their intentions and a mechanism to > arbitrate resources to satisfy those intentions (and in a way more > graceful than "we can't, maybe try later?"); otherwise, implementing > any sort of high level resource distribution scheme becomes painful > and usually the only thing possible is preventing runaway disasters - > you don't wanna pin unused resource permanently if there actually is > contention around it, so usually all you can do with hard limits is > overcommiting limits so that it at least prevents disasters. I agree with Tejun that this proposal is at the wrong level of abstraction. If you look at just trying to limit QPs, it's not clear what that attempts to accomplish. Conceptually, a QP is little more than an addressable endpoint. It may or may not map to HW resources (for Intel NICs it does not). Even when HW resources do back the QP, the hardware is limited by how many QPs can realistically be active at any one time, based on how much caching is available in the NIC. Trying to limit the number of QPs that an app can allocate, therefore, just limits how much of the address space an app can use. There's no clear link between QP limits and HW resource limits, unless you assume a very specific underlying implementation. - Sean -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: RDMA/CM and multiple QPs
> right now RDMA/CM works on a QP basis, but seems very awakward if you > want multiple QPs as part of a single logical device, which will be > useful for a lot of modern protocols. For example we will need to check > in the CM handler that we're not getting a different ib_device if we > want to apply the device limit in any sort of global scope, and it's > generally very hard to get a struct ib_device that can be used as > a driver model parent. > > Is there any interest in trying to add an API to the CM to do a single > address resolution and allocate multiple QPs with these checks in > place? IMO, you want a completely different level of abstraction. One not based on a specific hardware implementation. -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [PATCH 0/7] devcg: device cgroup extension for rdma resource
> > In past there has been similar comment to have dedicated cgroup > > controller for RDMA instead of merging with device cgroup. > > I am ok with both the approach, however I prefer to utilize device > > controller instead of spinning of new controller for new devices > > category. > > I anticipate more such need would arise and for new device category, > > it might not be worth to have new cgroup controller. > > RapidIO though very less popular and upcoming PCIe are on horizon to > > offer similar benefits as that of RDMA and in future having one > > controller for each of them again would not be right approach. > > > > I certainly seek your and others inputs in this email thread here > whether > > (a) to continue to extend device cgroup (which support character, > > block devices white list) and now RDMA devices > > or > > (b) to spin of new controller, if so what are the compelling reasons > > that it can provide compare to extension. > > I'm doubtful that these things are gonna be mainstream w/o building up > higher level abstractions on top and if we ever get there we won't be > talking about MR or CQ or whatever. Also, whatever next-gen is > unlikely to have enough commonalities when the proposed resource knobs > are this low level, so let's please keep it separate, so that if/when > this goes out of fashion for one reason or another, the controller can > silently wither away too. As an attempt to abstract the hardware resources only, what these devices are exposing to apps can be viewed as command queues (RDMA QPs and SRQs), notification queues (RDMA CQs and EQs), and space in the device cache and allocated memory (RDMA MRs and AHs, maybe PDs). If one wanted a higher level of abstraction, associations exist between these resources. For example, command queues feed into notification queues. Address handles are required resources to use an unconnected queue pair. - Sean -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [PATCH] infiniband:core:Fix assumation that the function ib_send_cm_drep never fails in rdma_disconnect
> This fixes the incorrect assumation that the function ib_send_cm_drep > always runs successfully and never fails in the function rdma_disconnect > by making this call be assigned the variable used to return to callers > of rdma_disconnect in order to make sure that a error code is returned > if a failure occurs with this call in order for the caller to handle. It's not assuming that it doesn't fail. It's saying that it doesn't care that it does. The call to send a DREP is optimistic and correct in most circumstances. The app is done at this point, and the connection has been disconnected. -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [PATCH] infiniband:core:Fix mutex locking to lock before unlocking in the function cma_ib_mc_handler
> This fixes mutex locking to lock before unlocking in the function > cma_ib_mc_handler for the mutex handler_mutex as part of the pointer > id_priv before unlocking it after the critical region for event handler > work and execution in order to have actual proper concurrent execution > protection around this critical region in the function cma_ib_mc_handler. The call cma_disable_callback() acquires the mutex. > --- > drivers/infiniband/core/cma.c | 1 + > 1 file changed, 1 insertion(+) > > diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c > index 143ded2..7b89195 100644 > --- a/drivers/infiniband/core/cma.c > +++ b/drivers/infiniband/core/cma.c > @@ -3158,6 +3158,7 @@ static int cma_ib_mc_handler(int status, struct > ib_sa_multicast *multicast) > } else > event.event = RDMA_CM_EVENT_MULTICAST_ERROR; > > + mutex_lock(_priv->handler_mutex); This mutex is already held at this point. > ret = id_priv->id.event_handler(_priv->id, ); > if (ret) { > cma_exch(id_priv, RDMA_CM_DESTROYING); > -- > 2.1.4 -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [PATCH] infiniband:Fix error handling in the function cm_init_av_by_path
> This fixes error handling in the function cm_init_av_by_path > to properly check if the internal call to the function > ib_init_ah_from_path has failed by returning a error code and > if so return immediately to the caller with this error code in > order to signal/allow the caller to handle the error in its own > intended error paths. > > Signed-off-by: Nicholas KrauseAcked-by: Sean Hefty > --- > drivers/infiniband/core/cm.c | 6 -- > 1 file changed, 4 insertions(+), 2 deletions(-) > > diff --git a/drivers/infiniband/core/cm.c b/drivers/infiniband/core/cm.c > index 3a972eb..cb021fd 100644 > --- a/drivers/infiniband/core/cm.c > +++ b/drivers/infiniband/core/cm.c > @@ -379,8 +379,10 @@ static int cm_init_av_by_path(struct ib_sa_path_rec > *path, struct cm_av *av) > return ret; > > av->port = port; > - ib_init_ah_from_path(cm_dev->ib_device, port->port_num, path, > - >ah_attr); > + ret = ib_init_ah_from_path(cm_dev->ib_device, port->port_num, path, > +>ah_attr); > + if (ret) > + return ret; > av->timeout = path->packet_life_time + 1; > memcpy(av->smac, path->smac, sizeof(av->smac)); > > -- > 2.1.4 -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [PATCH] infiniband:core:Fix error handling in the function join_handler
> diff --git a/drivers/infiniband/core/multicast.c > b/drivers/infiniband/core/multicast.c > index 2cb865c..9284337 100644 > --- a/drivers/infiniband/core/multicast.c > +++ b/drivers/infiniband/core/multicast.c > @@ -526,8 +526,9 @@ static void join_handler(int status, struct > ib_sa_mcmember_rec *rec, > process_join_error(group, status); > else { > int mgids_changed, is_mgid0; > - ib_find_pkey(group->port->dev->device, group->port->port_num, > - be16_to_cpu(rec->pkey), _index); > + if (ib_find_pkey(group->port->dev->device, group->port- > >port_num, > + be16_to_cpu(rec->pkey), _index)) > + return; We can't just abort here. Apps are waiting on a callback to their join operation, and the SA believes that this port is part of the multicast group. If we don't get a valid pkey, the group will be left set to 'MCAST_INVALID_PKEY_INDEX'. - Sean -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [PATCH v6 0/4] Add network namespace support in the RDMA-CM
This series looks reasonable to me -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [PATCH 1/3] IB/uverbs: reject invalid or unknown opcodes
AFAIK, this path is rarely (never?) actually used. I think all the drivers we have can post directly from userspace. I didn't think the ipath or qib drivers post from userspace. -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [PATCH] IB/ucma: Fix theoretical user triggered use-after-free
Something like this: CPU A CPU B ucma_destroy_id() wait_for_completion() .. anything ucma_put_ctx() complete() .. continues ... ucma_leave_multicast() mutex_lock(mut) atomic_inc(ctx-ref) mutex_unlock(mut) ucma_free_ctx() ucma_cleanup_multicast() mutex_lock(mut) kfree(mc) rdma_leave_multicast(mc-ctx-cm_id,.. Fix it by latching the ref at 0. Once it goes to 0 mc and ctx cannot leave the mutex(mut) protection. The other atomic_inc in ucma_get_ctx is OK because mutex(mut) protects it from racing with ucma_destroy_id. Signed-off-by: Jason Gunthorpe jguntho...@obsidianresearch.com Acked-by: Sean Hefty sean.he...@intel.com --- drivers/infiniband/core/ucma.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/drivers/infiniband/core/ucma.c b/drivers/infiniband/core/ucma.c index 29b21213ea75..acac9eafdbf6 100644 --- a/drivers/infiniband/core/ucma.c +++ b/drivers/infiniband/core/ucma.c @@ -1321,10 +1321,10 @@ static ssize_t ucma_leave_multicast(struct ucma_file *file, mc = ERR_PTR(-ENOENT); else if (mc-ctx-file != file) mc = ERR_PTR(-EINVAL); - else { + else if (!atomic_inc_not_zero(mc-ctx-ref)) + mc = ERR_PTR(-ENXIO); + else idr_remove(multicast_idr, mc-id); - atomic_inc(mc-ctx-ref); - } mutex_unlock(mut); if (IS_ERR(mc)) { -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [PATCH for-next 1/2] IB/core: Add support for RX/TX checksum offload capabilities report
+enum ib_csum_cap_flags { + IB_CSUM_RX_TCP_UDP = 1 0, + IB_CSUM_RX_IP_HDR= 1 1, + IB_CSUM_TX_TCP_UDP = 1 2, + IB_CSUM_TX_IP_HDR= 1 3 +}; TPC and UDP should be separate flags. -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [PATCH for-next 1/2] IB/core: Add support for RX/TX checksum offload capabilities report
+enum ib_csum_cap_flags { + IB_CSUM_RX_TCP_UDP = 1 0, + IB_CSUM_RX_IP_HDR= 1 1, + IB_CSUM_TX_TCP_UDP = 1 2, + IB_CSUM_TX_IP_HDR= 1 3 +}; TPC and UDP should be separate flags. Can you explain why? For the same reason that you didn't include *all* L4 headers. What we are advertising here is offloads for L3 and L4 checksums, why should it be per protocol? Because UDP and TCP have different headers, and it's entirely possible for a NIC (e.g. usnic) to support offloading one but not the other. -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [RFC] split struct ib_send_wr
please take a look at my RFC patch here: http://git.infradead.org/users/hch/scsi.git/commitdiff/751774250b71d a83a26ba8584cff70f5e7bb7b1e the commit contains my explanation, but apparently the patch is too large for the list limit and didn't make it through. This looks like a reasonable start. It may help with feedback if you could just post the changes to ib_verbs.h. -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [PATCH v2 05/13] IB/cm: Share listening CM IDs
+/** + * Create a new listening ib_cm_id and listen on the given service ID. + * + * If there's an existing ID listening on that same device and service ID, + * return it. + * + * @device: Device associated with the cm_id. All related communication will + * be associated with the specified device. + * @cm_handler: Callback invoked to notify the user of CM events. + * @service_id: Service identifier matched against incoming connection + * and service ID resolution requests. The service ID should be specified + * network-byte order. If set to IB_CM_ASSIGN_SERVICE_ID, the CM will + * assign a service ID to the caller. + * @service_mask: Mask applied to service ID used to listen across a + * range of service IDs. If set to 0, the service ID is matched + * exactly. This parameter is ignored if %service_id is set to + * IB_CM_ASSIGN_SERVICE_ID. + * + * Callers should call ib_destroy_cm_id when done with the listener ID. + */ +struct ib_cm_id *ib_cm_insert_listen(struct ib_device *device, + ib_cm_handler cm_handler, + __be64 service_id, __be64 service_mask) +{ + struct cm_id_private *cm_id_priv; + struct ib_cm_id *cm_id; + unsigned long flags; + int err = 0; + + /* Create an ID in advance, since the creation may sleep */ + cm_id = ib_create_cm_id(device, cm_handler, NULL); + if (IS_ERR(cm_id)) + return cm_id; + + spin_lock_irqsave(cm.lock, flags); + + if (service_id == IB_CM_ASSIGN_SERVICE_ID) + goto new_id; + + /* Find an existing ID */ + cm_id_priv = cm_find_listen(device, service_id, NULL); + if (cm_id_priv) { The service_mask is being ignored through this code path. + if (cm_id-cm_handler != cm_handler || cm_id-context) { + /* Sharing an ib_cm_id with different handlers is not + * supported */ + spin_unlock_irqrestore(cm.lock, flags); + return ERR_PTR(-EINVAL); + } + atomic_inc(cm_id_priv-refcount); + ++cm_id_priv-listen_sharecount; + spin_unlock_irqrestore(cm.lock, flags); + + ib_destroy_cm_id(cm_id); + cm_id = cm_id_priv-id; + return cm_id; + } + +new_id: + /* Use newly created ID */ + err = __ib_cm_listen(cm_id, service_id, service_mask, NULL, false); + + spin_unlock_irqrestore(cm.lock, flags); + + if (err) { + ib_destroy_cm_id(cm_id); + return ERR_PTR(err); + } + return cm_id; +} +EXPORT_SYMBOL(ib_cm_insert_listen); + static __be64 cm_form_tid(struct cm_id_private *cm_id_priv, enum cm_msg_sequence msg_seq) { -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [PATCH v2 07/13] IB/cma: Helper functions to access port space IDRs
Add helper functions to access the IDRs by port-space and port number. Pass around the port-space enum in cma.c instead of using pointers to port-space IDRs. Why? drivers/infiniband/core/cma.c | 81 -- - 1 file changed, 60 insertions(+), 21 deletions(-) This change adds code, and AFAICT makes it less efficient. What purpose does it serve? -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [PATCH v2 09/13] IB/cma: Add net_dev and private data checks to RDMA CM
@@ -1040,6 +1040,181 @@ static int cma_save_net_info(struct sockaddr *src_addr, return cma_save_ip_info(src_addr, dst_addr, ib_event, service_id); } +struct cma_req_info { + struct ib_device *device; + int port; + const union ib_gid *local_gid; The use of a pointer to some gid inside some other structure looks questionable. There's no reference count held on another structure. We would be better off copying the GID or setting an index into the device GID table. + __be64 service_id; + u16 pkey; +}; Please relocate to the top of the source file to group it with other structure definitions. -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [PATCH V3 1/5] RDMA/core: Transport-independent access flags
The lkey is possibly useful if someone wants to do single op transfers larger than the S/G limit of the SQE. I haven't noticed any ULPs doing that.. That changes how the buffer is identified, which gets back to my question of are we identifying local buffers by address or through some sort of iova/tag/descriptor/mr/whatever. Do we have this right in the API? -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [PATCH V3 1/5] RDMA/core: Transport-independent access flags
This may be a nit, but IMO, the use of the term 'rkey' versus 'stag' matters. They convey different ways of finding a data buffer. For example, do you locate a buffer using the stag, then verify that the offset + length fits into the target buffer? Yes. HW always uses the stag to locate a record that contains the stag state (valid or invalid), the access flags, the 8b key, the va_base, length, PBL describing the host pages, etc. HW validates all that before using the buffer. NOTE: An stag of 0 is the special local-dma-lkey which HW treats differently: If the stag is 0, then the address in the SGE is the bus/dma address itself and no lookup of a MR/PBL/etc is needed. Stag 0 can ONLY be used by kernel users and MUST never be accepted/used from an ingress packet and MUST never be emitted on the wire in a READ or WRITE. Or do you locate the buffer by address, then verify that the key matches? This is never done. These were more rhetorical questions to highlight that stag is a better choice for the name than rkey. Lkey seems better than stag, though I'd rather see lkey removed completely. I don't see where usnic, ipath, qib, or opa technically need an lkey at all. -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: rdma_getaddrinfo and GUID
I'm use cat /sys/class/infiniband/mlx4_0/ports/1/gids/0 fe80::::0002:c903:00ef:6601 In node I'm use gid provided below service null, hints contains af_ib and rai_family|rai_numerichost Ah, yes, the rdma_cm supports GIDs. It does not support GUIDs. Though it's usually trivial to convert a GUID into a GID.
RE: rdma_getaddrinfo and GUID
Hello, does it possible to use rdma_getaddrinfo and specify in node port GUID? The current code does not support this. Why rdma_getaddrinfo does not return error? Why it returns success and result contains port guid? I don't think I'm following your questions on how rdma_getaddrinfo is being called. Are you referring to a GUID (64-bit value) or a GID (128-bit value)? How are you calling rdma_getaddrinfo with this value? N�r��yb�X��ǧv�^�){.n�+{��ٚ�{ay�ʇڙ�,j��f���h���z��w��� ���j:+v���w�j�mzZ+�ݢj��!�i
RE: [PATCH V3 1/5] RDMA/core: Transport-independent access flags
There is confusion about lkeys and rkeys with regard to iWARP. In the iWARP verbs, there is no distinction between an lkey and rkey: they are the same key, called a Steering Tag or STAG. When you create a MR, the lkey == rkey == STAG for iwarp transports. Somewhat related, but really a different issue, is that SGEs that are the target of a read need REMOTE_WRITE access flags on their STAG for iWARP. Clear as mud? :) This may be a nit, but IMO, the use of the term 'rkey' versus 'stag' matters. They convey different ways of finding a data buffer. For example, do you locate a buffer using the stag, then verify that the offset + length fits into the target buffer? Or do you locate the buffer by address, then verify that the key matches? Consider if we allow an app to specify the rkey/stag, or reference the buffer using an offset, rather than a virtual address. This seems to be part of the difference between an lkey and an rkey. -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: rdma_getaddrinfo and GUID
Hello, does it possible to use rdma_getaddrinfo and specify in node port GUID? The current code does not support this.
RE: [PATCH WIP 01/43] IB: Modify ib_create_mr API
+enum ib_mr_type { + IB_MR_TYPE_FAST_REG, + IB_MR_TYPE_SIGNATURE, If we're going to go through the trouble of changing everything, I vote for dropping the word 'fast'. It's a marketing term. It's goofy. And the IB spec is goofy for using it. -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [PATCH V3 1/5] RDMA/core: Transport-independent access flags
I am still not clear if all of us agree that we need it. Sean and Steve had some disclaimers... A single entry point doesn't help a whole lot if the app must deal with different behavior based on how the API is used. We have a single entry point for post_send, for example, and that makes things worse. IMO, we need fewer registration *methods* not fewer calls. -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [PATCH v7 1/4] IB/netlink: Add defines for local service requests through netlink
include/uapi/rdma/rdma_netlink.h | 87 ++ 1 files changed, 87 insertions(+), 0 deletions(-) diff --git a/include/uapi/rdma/rdma_netlink.h b/include/uapi/rdma/rdma_netlink.h index 6e4bb42..d2c50e9 100644 --- a/include/uapi/rdma/rdma_netlink.h +++ b/include/uapi/rdma/rdma_netlink.h @@ -7,12 +7,14 @@ enum { RDMA_NL_RDMA_CM = 1, RDMA_NL_NES, RDMA_NL_C4IW, + RDMA_NL_SA, RDMA_NL_NUM_CLIENTS }; enum { RDMA_NL_GROUP_CM = 1, RDMA_NL_GROUP_IWPM, + RDMA_NL_GROUP_LS, RDMA_NL_NUM_GROUPS }; @@ -128,5 +130,90 @@ enum { IWPM_NLA_ERR_MAX }; +/* Local service group opcodes */ +enum { + RDMA_NL_LS_OP_RESOLVE = 0, + RDMA_NL_LS_OP_SET_TIMEOUT, + RDMA_NL_LS_NUM_OPS +}; + +/* Local service netlink message flags */ +#define RDMA_NL_LS_F_OK 0x0100 /* Success response */ +#define RDMA_NL_LS_F_ERR 0x0200 /* Failed response */ Is there a reason for these odd values? -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [PATCH V2 4/5] RDMA/iser: support iWARP devices
I prefer to decouple the iSER changes with this core work. Jason/Sean... thoughts? I could do the iSER w/o patch 3, and the follow up with a series that includes our final solution on transport independent memory registration and change all the TI kernel users (iser and nfsrdma) along with it. There has a been a big push lately to drop the strange transport specific stuff - if you are comitted to seeing the access flag clean up series through then I don't see a problem with using whatever order you like. I agree. -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [PATCH V2 3/5] RDMA/core: transport-independent access flags
I suggest to start consolidating to ib_create_mr() that receives an extensible ib_mr_init_attr and additional attributes can be mr_roles and mr_attrs. I think this makes sense, but does it really help? If the end result is that the app and providers basically end up switching on mr_attr::type, we end up reducing the number of APIs, but the code complexity remains the same.
RE: [PATCH RFC] RDMA/core: add rdma_get_dma_mr()
+enum rdma_mr_roles { + RDMA_MRR_RECV = 1, + RDMA_MRR_SEND = (11), + RDMA_MRR_READ_SOURCE= (12), + RDMA_MRR_READ_SINK = (13), Maybe it's just me, but it took me a second to figure out which was the source and which was the sink in RDMA reads. Do you think calling them initiator and responder/target would be better? Not to me. For an RDMA operation, the initiator is the app that issues the read request WR. That app doesn't create what I call the READ_SOURCE MR. Its peer application does. So calling READ_SOURCE something like READ_INITIATOR doesn't make sense to me. That's why I thought SOURCE and SINK were clearer. Perhaps not... I would call SOURCE the RESPONDER.. Initiator/Poster and Responder is closest to the language used in other places in the API for READ. I agree with Steve. I think of this in terms of copy and like source/dest myself, similar to source/sink. A memory buffer is not a responder or initiator. In hindsight, we probably never should have exposed initiator_depth and responder_resources. Those values have been an endless source of confusion and bugs. -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [PATCH RFC] RDMA/core: add rdma_get_dma_mr()
I wanted to point out that with this scheme the ULP may sometimes get an MR with a wider set of permissions that it asked for, and I'm not sure that's always safe. Perhaps the ULP wants to guarantee that the MR doesn't allow certain kinds of accesses and doesn't expect the verbs layer to change that. The providers can return the actual permissions that were granted. If the app cares, it can check. N�r��yb�X��ǧv�^�){.n�+{��ٚ�{ay�ʇڙ�,j��f���h���z��w��� ���j:+v���w�j�mzZ+�ݢj��!�i
RE: [PATCH RFC 2/2] RDMA/isert: Support iWARP transport
+* IWARP transports need REMOTE_WRITE for MRs used as the target of +* an RDMA_READ. Since the DMA MR is used for all ports, then if +* any port is running IWARP, add REMOTE_WRITE. +*/ + if (any_port_is_iwarp(device)) It would be nice to have a new-style cap test for this instead of open coding iwarp. Similar to rdma_cap_read_multi_sge This should be pushed down into the devices, or at least within verbs, rather than having ULPs needing to know this oddness. -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [PATCH RFC 2/2] RDMA/isert: Support iWARP transport
How would you envision doing this? At the time a MR is registered the device driver doesn't know if the application will be doing RDMA reads or not on that MR. I was thinking of checking for REMOTE_READ, but that doesn't work on the initiator side. I guess you could a READ_DEST(SOURCE? TARGET?) flag to indicate that the MR will be used on the initiating side of the RDMA read operation. -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [PATCH RFC 2/2] RDMA/isert: Support iWARP transport
What about moving to something more specific? Encode the allowed verbs in the access flag? This makes more sense to me. Something like: SEND, RECV, INIT READ, INIT WRITE, READ TARGET, WRITE TARGET, etc. We're close to this, but it's not clear, for example, what flags are needed for a receive buffer. None? LOCAL_WRITE? -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [PATCH RFC] RDMA/core: add rdma_get_dma_mr()
+enum rdma_mr_roles { I would drop naming the enum - it shouldn't be used, as the values are bit flags. + RDMA_MRR_RECV = 1, + RDMA_MRR_SEND = (11), + RDMA_MRR_READ_SOURCE= (12), + RDMA_MRR_READ_SINK = (13), + RDMA_MRR_WRITE_SOURCE = (14), + RDMA_MRR_WRITE_SINK = (15), + RDMA_MRR_ATOMIC = (16), + RDMA_MRR_MW_BIND= (17), + RDMA_MRR_ZERO_BASED = (18), There's 'something' different about this role that cause me hesitation. Maybe that it's dependent on other roles being set to be useful? I'm not sure. Maybe we need both roles and registration flags, with this declared as a flag? + RDMA_MRR_ACCESS_ON_DEMAND = (19), This one is even more different, as it doesn't impact how the MR interacts with the interfaces, or change how the application uses the MR. This is really a hint to the provider regarding the selection of different implementation flows.
RE: [PATCHv2 RFC] IB: Add rdma_cap_ib_switch helper and use where appropriate
Note that there still remain a couple of node type checks in the kernel that we may want to remove. There's an IB CA check in cma.c:rdma_notify This should check for rdma_cap_ib_cm() as well as in rds/ib.c:rds_ib_add_one and rds_ib_laddr_check and Not sure what the underlying reason is for these checks. an RNIC check in rds/iw.c:rds_iw_add_one and rds_iw_laddr_check rdma_cap_iw_cm() The intent is to clarify why the checks that exist are made and replace them with a check that clearly conveys what the restriction is. So, yes, the use of node_type should be replaced. - Sean -- To unsubscribe from this list: send the line unsubscribe linux-rdma in
RE: [PATCHv2 RFC] IB: Add rdma_cap_ib_switch helper and use where appropriate
rds_ib_add_one says: /* Only handle IB (no iWARP) devices */ but not sure comment is 100% accurate as it checks node type != IB CA. rds_ib_laddr_check says: /* rdma_bind_addr will only succeed for IB iWARP devices */ /* due to this, we will claim to support iWARP devices unless we check node_type. */ It's similar to above in that it checks node type != IB CA and does not seem 100% accurate. I noticed these, but I don't know why RDS has this requirement. -- To unsubscribe from this list: send the line unsubscribe linux-rdma in
RE: [PATCH RFC] IB: Add rdma_cap_ib_switch helper and use where appropriate
In an effort to reform the RDMA core and ULPs to minimize use of node_type in struct ib_device, an additional bit is added to struct ib_device for is_switch (IB switch). This is needed to be initialized by any IB switch device driver. This is a NEW requirement on such device drivers which are all out of tree. I agree with Jason, only I would add we need to decide if we want to continue supporting the out of tree devices. We have for so long that it seems reasonable to do so. At the same time, I think we'd reject these changes now without a lower level user of them. -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [PATCHv2 RFC] IB: Add rdma_cap_ib_switch helper and use where appropriate
In an effort to reform the RDMA core and ULPs to minimize use of node_type in struct ib_device, an additional bit is added to struct ib_device for is_switch (IB switch). This is needed to be initialized by any IB switch device driver. This is a NEW requirement on such device drivers which are all out of tree. In addition, an ib_switch helper was added to ib_verbs.h based on the is_switch device bit rather than node_type (although those should be consistent). The RDMA core (MAD, SMI, agent, sa_query, multicast, sysfs) as well as (IPoIB and SRP) ULPs are updated where appropriate to use this new helper. In some cases, the helper is now used under the covers of using rdma_[start end]_port rather than the open coding previously used. Signed-off-by: Hal Rosenstock h...@mellanox.com Assuming that the final version matches this one: Reviewed-by: Sean Hefty sean.he...@intel.com -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [PATCH v2 01/49] IB/core: Add OPA Port header definitions
include/rdma/opa_port_info.h | 433 This matches the current code structure, but is this the best location for this file? Do you have a suggestion? Couldn't it be a header in the hfi1 driver directory ? Isn't it OPA specific ? I don't have a specific suggestion. If the contents are specific to this driver, I would move them into the driver directory. For values that are OPA, but not driver, specific, include/rdma is probably the best option for now. If the content is mixed (I didn't look closely), I would split the file.
RE: [PATCH 06/41] IB/hfi1: add char device instantiation code
PSM2 (and PSM) uses this to dialog with the driver for hardware specific setup. I'm suspected at some point in the past, this was ioctl based and changed due to bias against ioctl. Could this be distinguished based on a common command header? -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [PATCH 00/41] Add OPA gen1 driver
Ummm.. Could we get some more descriptions as to what this code is for? Do we have a new OmniPath protocol here as well or is it IB? Which standards are followed? I think the APIs that the driver uses need to be documented somewhere in particular if new sysfs entries etc are created. Are there any user space tools to control/exercise the driver and the protocol stack? Christoph's questions are fair. There should at least be a short write-up that describes how this driver fits into the RDMA stack. (Maybe there are links with additional details that could be provided?) For example, we know from Ira's patches and comments that OPA uses IB like management and supports the same verbs as the qib driver. It's hard to piece this together from separate email threads or by looking at 56,000 lines of code. I'm less concerned about the PSM support myself, as that's self-contained to this driver and doesn't impact the rest of the stack. But at least clarify how it relates to other kernel modules (ib_cm, rdma_cm, ipoib, etc.), if at all. - Sean -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [PATCH 04/11] IB/cm: Expose DGID in SIDR request events
The idea is to allow SIDR request to be sorted by the GID, when we will have alias GIDs for IPoIB. Please limit this series, or at least the early patches in this series, to simply moving the de-mux out of the ib_cm and into the rdma_cm. -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [PATCH 14/14] IB/mad: Add final OPA MAD processing
You're right that apps can be coded to other CA types, like RNICs and USNICs. However, those are all very different from an IB_CA due to limited queue pair types or limited primitives. If OPA had that same limitation then I would agree it needs a different node type. So this will be my litmus test. Currently, an app that supports all of the RDMA types looks like this: if (node_type == RNIC) do iwarpy stuff else if (node_type == USNIC) do USNIC stuff else if (node_type == IB_CA) do IB verbs stuff if (link_layer == Ethernet) do RoCE addressing/management else do IB addressing/management The node type values were originally defined to align with the IB management NodeInfo structure. AFAICT, there was no intent to associate those values with specific functionality or addressing or verbs support or anything else, really, outside of what IB management needed. iWarp added a new node type, so that the IB management code could ignore those devices. RoCE basically broke this association by forcing additional checks in the local management code to also check against the link layer. The recent mad capability bits are a superior solution, making the node type obsolete. At this point, the node type essentially indicates if we start counting ports at a numeric value 0 or 1. The NodeType that an OPA channel adapter will report in a NodeInfo structure will be 1, the same value as if it were an IB channel adapter. In the end, this argument matters one iota. The kernel code barely relies on the node type, and a user space verbs provider can report whatever value it wants. - Sean
RE: [PATCH acm] acme.c: Fix IPv6 address handling in inet_any_pton
Thanks - applied manually acm/src/acme.c |2 +- Path? diff --git a/acm/src/acme.c b/acm/src/acme.c index 9bf7557..d54c8b9 100644 --- a/acm/src/acme.c +++ b/acm/src/acme.c @@ -787,7 +787,7 @@ static int inet_any_pton(char *addr, struct sockaddr This was off by a couple hundred lines. Can you check your branch? -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [PATCH 06/41] IB/hfi1: add char device instantiation code
+int __init dev_init(void) +{ + int ret; + + ret = alloc_chrdev_region(hfi1_dev, 0, HFI1_NMINORS, DRIVER_NAME); + if (ret 0) { + pr_err(Could not allocate chrdev region (err %d)\n, - ret); + goto done; + } + + class = class_create(THIS_MODULE, class_name()); + if (IS_ERR(class)) { + ret = PTR_ERR(class); + pr_err(Could not create device class (err %d)\n, - ret); + unregister_chrdev_region(hfi1_dev, HFI1_NMINORS); + } + +done: + return ret; +} so what's the role of the char-device? Someone from Mike's team can chime in, but I believe this supports non-verbs interfaces (i.e. PSM), similar to what's done for qib.
RE: [PATCH 06/41] IB/hfi1: add char device instantiation code
Was Al Viro's comment on qib addressed here? That would be a showstopper for me.. Do you have a link to this comment? I'm missing a bunch of messages from this thread and can't find anything from Al in the logs. - Sean -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [PATCH v2 01/49] IB/core: Add OPA Port header definitions
include/rdma/opa_port_info.h | 433 This matches the current code structure, but is this the best location for this file?
RE: [PATCH 03/11] IB/cm: Expose service ID in request events
Expose the service ID on an incoming CM or SIDR request to the event handler. This will allow the RDMA CM module to de-multiplex connection requests based on the information encoded in the service ID. Signed-off-by: Haggai Eran hagg...@mellanox.com Acked-by: Sean Hefty sean.he...@intel.com -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [PATCH 04/11] IB/cm: Expose DGID in SIDR request events
drivers/infiniband/core/cm.c | 7 +++ include/rdma/ib_cm.h | 2 ++ 2 files changed, 9 insertions(+) diff --git a/drivers/infiniband/core/cm.c b/drivers/infiniband/core/cm.c index c5f5f89e274a..46f99ec4080a 100644 --- a/drivers/infiniband/core/cm.c +++ b/drivers/infiniband/core/cm.c @@ -2983,6 +2983,13 @@ static void cm_format_sidr_req_event(struct cm_work *work, param-pkey = __be16_to_cpu(sidr_req_msg-pkey); param-listen_id = listen_id; param-service_id = sidr_req_msg-service_id; + if (work-mad_recv_wc-wc-wc_flags IB_WC_GRH) { + param-grh = 1; + memcpy(param-dgid, work-mad_recv_wc-recv_buf.grh-dgid, +sizeof(param-dgid)); + } else { + param-grh = 0; What is the use case here? Are you trying to sort by device? How does the GID of the GMP relate to the listen? -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [PATCH 05/11] IB/cm: Share listening CM IDs
@@ -722,6 +725,7 @@ struct ib_cm_id *ib_create_cm_id(struct ib_device *device, INIT_LIST_HEAD(cm_id_priv-work_list); atomic_set(cm_id_priv-work_count, -1); atomic_set(cm_id_priv-refcount, 1); + cm_id_priv-listen_sharecount = 1; This is setting the listen count before we know whether the cm_id will actually be used to listen. return cm_id_priv-id; error: @@ -847,11 +851,21 @@ retest: spin_lock_irq(cm_id_priv-lock); switch (cm_id-state) { case IB_CM_LISTEN: - cm_id-state = IB_CM_IDLE; spin_unlock_irq(cm_id_priv-lock); + spin_lock_irq(cm.lock); + if (--cm_id_priv-listen_sharecount 0) { + /* The id is still shared. */ + atomic_dec(cm_id_priv-refcount); + spin_unlock_irq(cm.lock); + return; + } rb_erase(cm_id_priv-service_node, cm.listen_service_table); spin_unlock_irq(cm.lock); + + spin_lock_irq(cm_id_priv-lock); + cm_id-state = IB_CM_IDLE; + spin_unlock_irq(cm_id_priv-lock); Why is the state being changed? The cm_id is about to be freed anyway. break; case IB_CM_SIDR_REQ_SENT: cm_id-state = IB_CM_IDLE; @@ -929,11 +943,32 @@ void ib_destroy_cm_id(struct ib_cm_id *cm_id) } EXPORT_SYMBOL(ib_destroy_cm_id); -int ib_cm_listen(struct ib_cm_id *cm_id, __be64 service_id, __be64 service_mask, - struct ib_cm_compare_data *compare_data) +/** + * __ib_cm_listen - Initiates listening on the specified service ID for + * connection and service ID resolution requests. + * @cm_id: Connection identifier associated with the listen request. + * @service_id: Service identifier matched against incoming connection + * and service ID resolution requests. The service ID should be specified + * network-byte order. If set to IB_CM_ASSIGN_SERVICE_ID, the CM will + * assign a service ID to the caller. + * @service_mask: Mask applied to service ID used to listen across a + * range of service IDs. If set to 0, the service ID is matched + * exactly. This parameter is ignored if %service_id is set to + * IB_CM_ASSIGN_SERVICE_ID. + * @compare_data: This parameter is optional. It specifies data that must + * appear in the private data of a connection request for the specified + * listen request. + * @lock: If set, lock the cm.lock spin-lock when adding the id to the + * listener tree. When false, the caller must already hold the spin- lock, + * and compare_data must be NULL. + */ +static int __ib_cm_listen(struct ib_cm_id *cm_id, __be64 service_id, + __be64 service_mask, + struct ib_cm_compare_data *compare_data, + bool lock) { struct cm_id_private *cm_id_priv, *cur_cm_id_priv; - unsigned long flags; + unsigned long flags = 0; int ret = 0; service_mask = service_mask ? service_mask : ~cpu_to_be64(0); @@ -959,7 +994,8 @@ int ib_cm_listen(struct ib_cm_id *cm_id, __be64 service_id, __be64 service_mask, cm_id-state = IB_CM_LISTEN; - spin_lock_irqsave(cm.lock, flags); + if (lock) + spin_lock_irqsave(cm.lock, flags); I'm not a fan of this sort of locking structure. Why not just move the locking into the outside calls completely? I.e. move to ib_cm_listen() instead of passing in true. if (service_id == IB_CM_ASSIGN_SERVICE_ID) { cm_id-service_id = cpu_to_be64(cm.listen_service_id++); cm_id-service_mask = ~cpu_to_be64(0); @@ -968,7 +1004,8 @@ int ib_cm_listen(struct ib_cm_id *cm_id, __be64 service_id, __be64 service_mask, cm_id-service_mask = service_mask; } cur_cm_id_priv = cm_insert_listen(cm_id_priv); - spin_unlock_irqrestore(cm.lock, flags); + if (lock) + spin_unlock_irqrestore(cm.lock, flags); if (cur_cm_id_priv) { cm_id-state = IB_CM_IDLE; @@ -978,8 +1015,86 @@ int ib_cm_listen(struct ib_cm_id *cm_id, __be64 service_id, __be64 service_mask, } return ret; } + +int ib_cm_listen(struct ib_cm_id *cm_id, __be64 service_id, __be64 service_mask, + struct ib_cm_compare_data *compare_data) +{ + return __ib_cm_listen(cm_id, service_id, service_mask, compare_data, + true); +} EXPORT_SYMBOL(ib_cm_listen); +/** + * Create a new listening ib_cm_id and listen on the given service ID. + * + * If there's an existing ID listening on that same device and service ID, + * return it. + * + * @device: Device associated with the cm_id. All related communication will + * be associated with the specified device. + * @cm_handler: Callback invoked to notify the user of CM events. + * @service_id: Service identifier matched against
RE: [PATCH v2 25/49] IB/hfi1: add local mad header
drivers/infiniband/hw/hfi1/mad.h | 477 ++ 1 file changed, 477 insertions(+) create mode 100644 drivers/infiniband/hw/hfi1/mad.h diff --git a/drivers/infiniband/hw/hfi1/mad.h b/drivers/infiniband/hw/hfi1/mad.h new file mode 100644 index 000..af61249 Snip +#include rdma/ib_pma.h +#define USE_PI_LED_ENABLE1 /* use led enabled bit in struct +* opa_port_states, if available */ +#include rdma/opa_smi.h +#include rdma/opa_port_info.h +#ifndef PI_LED_ENABLE_SUP +#define PI_LED_ENABLE_SUP 0 +#endif +#include opa_compat.h + +#define IB_SMP_UNSUP_VERSIONcpu_to_be16(0x0004) +#define IB_SMP_UNSUP_METHOD cpu_to_be16(0x0008) +#define IB_SMP_UNSUP_METH_ATTR cpu_to_be16(0x000C) +#define IB_SMP_INVALID_FIELDcpu_to_be16(0x001C) + +struct ib_node_info { + u8 base_version; + u8 class_version; + u8 node_type; + u8 num_ports; + __be64 sys_guid; + __be64 node_guid; + __be64 port_guid; + __be16 partition_cap; + __be16 device_id; + __be32 revision; + u8 local_port_num; + u8 vendor_id[3]; +} __packed; + +struct ib_mad_notice_attr { + u8 generic_type; + u8 prod_type_msb; + __be16 prod_type_lsb; + __be16 trap_num; + __be16 issuer_lid; + __be16 toggle_count; + + union { + struct { + u8 details[54]; + } raw_data; + + struct { + __be16 reserved; + __be16 lid;/* where violation happened */ + u8 port_num; /* where violation happened */ + } __packed ntc_129_131; + + struct { + __be16 reserved; + __be16 lid;/* LID where change occurred */ + u8 reserved2; + u8 local_changes; /* low bit - local changes */ + __be32 new_cap_mask; /* new capability mask */ + u8 reserved3; + u8 change_flags; /* low 3 bits only */ + } __packed ntc_144; + + struct { + __be16 reserved; + __be16 lid;/* lid where sys guid changed */ + __be16 reserved2; + __be64 new_sys_guid; + } __packed ntc_145; + + struct { + __be16 reserved; + __be16 lid; + __be16 dr_slid; + u8 method; + u8 reserved2; + __be16 attr_id; + __be32 attr_mod; + __be64 mkey; + u8 reserved3; + u8 dr_trunc_hop; + u8 dr_rtn_path[30]; + } __packed ntc_256; + + struct { + __be16 reserved; + __be16 lid1; + __be16 lid2; + __be32 key; + __be32 sl_qp1; /* SL: high 4 bits */ + __be32 qp2;/* high 8 bits reserved */ + union ib_gidgid1; + union ib_gidgid2; + } __packed ntc_257_258; + + } details; +}; + +/* + * Generic trap/notice types + */ +#define IB_NOTICE_TYPE_FATAL 0x80 +#define IB_NOTICE_TYPE_URGENT0x81 +#define IB_NOTICE_TYPE_SECURITY 0x82 +#define IB_NOTICE_TYPE_SM0x83 +#define IB_NOTICE_TYPE_INFO 0x84 Snip the end... These are generic values and data structures. Why are they isolated into a driver specific folder, versus being in a common include location?
RE: [PATCH for-next V2 0/9] Add completion timestamping support
Intel is supporting multicast in hardware. Its just a bad implementation (broadcast and filtering MC groups in the HCA or what was that?) and there is no plan to fix the issues despite the problem being known for quite some time. Also does this mean that libfabric only to supports the features needed by Intel? Libfabric supports whatever features apps require and the participating vendors want to provide. However, I, personally, have a limited amount of time to my day and will focus my effort on either what my management requires of me or areas that are most interesting. Libfabric is specifically designed to be vendor, transport, and implementation neutral. I would be interested to see some measurements. AFAICT the Intel solutions are based on historically inferior IB technology from Qlogic which has never been able in my lab tests to compete latency wise with other vendors. I have heard these latency claims repeatedly from Qlogic personnel over the years. You are referring to hardware latency. Libfabric is software. No amount of software is going to overcome hardware limitations. The entire reason multicast support was removed from libfabric 1.0 was that the proposed API would have introduced latency by adding a branch into the code path. This is a well designed solution and its easy to use. I fundamentally disagree with the practice of ad-hoc API design. I stated this on the mail list probably 3 years ago. I see nothing wrong with allowing and encouraging vendor specific extensions. It would help libfabric if you would work with other vendors and industries to include support for their needs. MPI is not the only applications that are running on the fabrics. I understand that is historically the only area in which Qlogic hardware was able to compete but I think you need to move beyond that. APIs should be as general as possible abstracting hardware as much as possible. A viable libfabric needs to be easy to use, low overhead as well as covering the requirements of multiple vendors and use cases. Libfabric included requirements from multiple users and applications - MPI, SHMEM, PGAS, DBMS, and sockets all provided input. It chose to target MPI as an initial priority, but it is not limited to MPI at all. It also works with other vendors, including vendors that do not support the verbs interfaces -- Cisco, Cray, Intel PSM, plus others. I, personally, ensured that libfabric would layer well over verbs based hardware. That doesn't mean that I'm obligated to provide optimized providers over everyone's hardware. The goal was not to spend 3 years working on a new API, but to get something usable within a short timeframe that could be extended. OFIWG could have taken a different approach, but this was what the community (not Intel) selected. As a company, Intel has many products. A competitor in one area of the company may be a partner in another. Xeon is by far the most important to this discussion. It's why Intel dedicated developers to enabling high performance networking in Linux for over 10 years -- even before Intel had any products in those spaces. And it's why Intel continues to fund development. Sure, Intel now has IB and Omni-Path Architecture products, but they also have iWarp and Ethernet. Intel MPI runs over a bunch of different fabrics. Libfabric doesn't just need to work well over Intel NICs, it needs to work well over Intel platforms. Returning to this thread, if I had to add time stamps to libfabric, I would still add 2 time stamps into a new completion structure. Those time stamps would be selected using a method similar to what Doug stated in an earlier email. The app would use an enum to select what the time stamps would capture. However, I would lean more to having those values specified as part of the endpoint attributes, rather than the CQ. - Sean -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [PATCH 14/14] IB/mad: Add final OPA MAD processing
cap_is_switch_smi would be a nice refinement to let us drop nodetype. Exactly, we need a bit added to the immutable data bits, and a new cap_ helper, and then nodetype is ready to be retired. Add a bit, drop a u8 ;-) I agree that the node type enum isn't particularly useful and should be retired. In fact, I don't see where RDMA_NODE_IB_SWITCH is used by any upstream device. So I don't think there's any obligation to keep it. But even if we do, I'm not sure this is the correct approach. I don't know this for a fact, but it seems more likely that someone would embed Linux on an IB switch than they would plug an IB switch into a Linux based system. The code is designed around the latter. Making this a system wide setting might simplify the code and optimize the code paths. - Sean N�r��yb�X��ǧv�^�){.n�+{��ٚ�{ay�ʇڙ�,j��f���h���z��w��� ���j:+v���w�j�mzZ+�ݢj��!�i
RE: [PATCH 14/14] IB/mad: Add final OPA MAD processing
I agree that the node type enum isn't particularly useful and should be retired. Are you referring to kernel space or user space or both ? Short term, kernel space. User space needs to keep something around for backwards compatibility. But the in tree code will never expose this value up. But even if we do, I'm not sure this is the correct approach. I don't know this for a fact, but it seems more likely that someone would embed Linux on an IB switch than they would plug an IB switch into a Linux based system. The code is designed around the latter. Making this a system wide setting might simplify the code and optimize the code paths. I think we need to discuss how user space would be addressed. This is an issue with out of tree drivers. We're having to guess what things might be doing. Are all devices being exposed up as a 'switch', or is there ever a case where there's a 'switch' device and an HCA device being reported together, or (highly unlikely) a switch device and an RNIC? If the real use case is to embed Linux on a switch, then we could look at making that a system wide setting, rather than per device. This could clean up the kernel without impacting the uABI.
RE: [PATCH] IB/core: Don't warn on no SA support in event handler
Registering an event handler is done for a device. This device may have one RoCE port (no SA cap) and one InfiniBand port (has SA cap). Therefore, warning from the event handler about a specific port that doesn't have SA cap is correct but pollutes the kernel log without a need. Maybe we should think about register event handler per port :-P I agree, though I think that can be added separately from this change. -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [PATCH for-next V2 0/9] Add completion timestamping support
There are multiple problems with libfrabric related to the use cases in my area. Most of all the lack of multicast support. Then there is the build up of software bloat on top. The interest here is in low latency operations. Redenzvous and other new features are really not wanted if they increase the latency. Multicast is only supported by one vendor that has taken a hostile position against libfabric. Support for multicast will eventually be there, but it's definitely not a priority for me. As an open source project, anyone is welcome to propose patches. For native providers, libfabric will reduce latency. That's a provider implementation issue, and native providers will be available soon. The OFIWG selected to have a working set of interfaces that applications can begin using immediately, versus waiting until there were a large set of native providers. This is an entirely unrelated topic for this thread. Jason asked what I would do with libfabric. I answered. Your comments do nothing to change my mind on that answer. This is a way it was implemented IMO, this is exactly the problem. The entire design is being driving by the implementation. That produces an unmaintainable API and fractures the software ecosystem, which is exactly where we are today. -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: rdmacm issue
RDMA_CM_EVENT_UNREACHABLE is indicated when there are timeouts in underlying CM protocol exchange. I suspect that the server is really busy and doesn't respond to the low level CM MADs in a timely manner. RDMA CM (and other kernel ULPs like IPoIB and SRP use hard coded local and remote response timeouts of 20 which is ~4.3 sec. This was discussed back in 2006 in http://comments.gmane.org/gmane.linux.drivers.openib/27664. In this scenario, the response took more than 30 seconds. More recently, there was proposal to base RDMA CM response timeout on subnet timeout (http://permalink.gmane.org/gmane.linux.drivers.rdma/19969). Hal's assessment seems likely. Error code -110 is ETIMEDOUT. However, the IB CM timeout when used through the RDMA CM should be much larger, as it makes use of the CM MRA protocol. Unless a lot of MADs are being lost, or I'm not remembering the RDMA CM code correctly, there's still an issue here that I'm not understanding. - Sean
RE: [PATCH v4 1/4] IB/netlink: Add defines for local service requests through netlink
+/* Local Service Reversible attribute */ struct +rdma_nla_ls_reversible { + __u32 reversible; +}; Isn't __u8 sufficient for reversible ? Certainly enough. However, reversible is __u32 in struct ib_user_path_rec and int in struct ib_sa_path_rec. OK; I hadn't double checked there. So it's inherited from those reversible definitions. I don't think we need to adhere to the sizes defined in other structures. I agree with Hal's original comment. A __u8 here seems sufficient, unless we want to introduce some bit fields. -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [PATCH v4 0/4] Sending kernel pathrecord query to user cache server
This series does not attempt to optimize the kernel needing to know that a PR has been updated. There are existing mechanisms for that. Does this exist in the kernel? -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [PATCH v4 0/4] Sending kernel pathrecord query to user cache server
Not in the patches themselves but in the general issue when a PR changes. Do you think this needs addressing or are things fine as they are now ? IMO, I think it needs addressing in terms of can the proposed netlink architecture and design accommodate this sort of request in the future? We shouldn't design in a limitation up front. I don't see anything in the current approach that would cause an issue. There would likely be a need for new messages and attributes. -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [PATCH v4 0/4] Sending kernel pathrecord query to user cache server
Not directly. IPoIB treats it that way. I guess to be safe. Officially one should register for UnPath/RePath traps. But no one has ever implemented that. To be clear I am agreeing with Hal that having some sort of update signal would be nice. But I don't think that must be done before this series goes in. I think a PR update signal should be an extension to this interface. I agree. I just wanted to make sure that there wasn't some feature regarding PRs, such as unpath, that a kernel client would lose (i.e. it is currently implemented) by changing how the PRs are retrieved. Basically nothing breaks with this change. -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [PATCH v4 0/4] Sending kernel pathrecord query to user cache server
This series does not attempt to optimize the kernel needing to know that a PR has been updated. There are existing mechanisms for that. Does this exist in the kernel? At least some support, yes. For example client reregister marks all IPoIB paths as invalid. Reregister indicates that all PRs are now invalid? -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [PATCH for-next V5 00/12] Move RoCE GID management to IB/Core
Sean, this change is needed b/c two drivers have (mlx4 and ocrda) and more two to come soon (mlx5 and soft-Roce) would have the very same logic of constructing the port GID table according to netdev events and such, no point in repeating this logic/code over and over. Matan explained why we don't have 2 x Y deletions and 1 x Y insertions. It more than doubles the amount of code. That's not a cleanup. It introduces a bunch of new functionality. Jason has asked repeatedly to remove the RoCEv2 code, and that has been ignored repeatedly. As far as I'm concerned, this patch is not worth my time, and I will no longer even bother following this series. -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [PATCH v4 0/4] Sending kernel pathrecord query to user cache server
While this appears to address the current upstream use model for ACM with it's multicast overlay backend where PRs are static, it does not appear to address PR changes. Although this ties into ibacm, from the viewpoint of the kernel, there's no requirement on the user space implementation. Should aging/revalidation of PRs be supported ? If so, would this make PRs similar at high level to IP neighbor cache in kernel ? This is requesting a new feature not supported by anything in the kernel today, and would seem to fall well outside the scope of the suggested changes. Is there a specific issue in the patches that you are seeing related to this? -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [PATCH 10/14] IB/mad: Add support for additional MAD info to/from drivers
@@ -1693,8 +1693,11 @@ struct ib_device { u8 port_num, const struct ib_wc *in_wc, const struct ib_grh *in_grh, - const struct ib_mad *in_mad, - struct ib_mad *out_mad); + const struct ib_mad_hdr *in_mad, + size_t in_mad_size, + struct ib_mad_hdr *out_mad, + size_t *out_mad_size, + u16 *out_mad_pkey_index); I don't have an alternate suggestion at the moment, but this call is getting to be a little much. -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [PATCH] IB/core: Get rid of redundant verb ib_destroy_mr
--- a/drivers/infiniband/hw/mlx5/mr.c +++ b/drivers/infiniband/hw/mlx5/mr.c @@ -1174,6 +1174,18 @@ static int clean_mr(struct mlx5_ib_mr *mr) int umred = mr-umred; int err; + if (mr-sig) { + if (mlx5_core_destroy_psv(dev-mdev, + mr-sig-psv_memory.psv_idx)) + mlx5_ib_warn(dev, failed to destroy mem psv %d\n, + mr-sig-psv_memory.psv_idx); + if (mlx5_core_destroy_psv(dev-mdev, + mr-sig-psv_wire.psv_idx)) + mlx5_ib_warn(dev, failed to destroy wire psv %d\n, + mr-sig-psv_wire.psv_idx); + kfree(mr-sig); + } + if (!umred) { err = destroy_mkey(dev, mr); if (err) { This patch doesn't introduce this problem, but the err check suggests that deregistration can fail for some reason. If so, should mr-sig be cleared above, so that a second call to dereg doesn't attempt to free the memory twice? -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [PATCH for-next V5 00/12] Move RoCE GID management to IB/Core
Previously, every vendor implemented its net device notifiers in its own driver. This introduces a huge code duplication as figuring 28 files changed, 2253 insertions(+), 860 deletions(-) How does adding 1400 lines of code help reduce code duplication? Can you please explain and justify why this change is actually needed? -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html