Re: [PATCH] IB/ipoib: Expose ioctl command to retrieve SGID of a given socket

2016-01-06 Thread Haggai Eran
On 06/01/2016 13:03, Yuval Shaia wrote:
> On Thu, Dec 31, 2015 at 05:34:03PM +0200, Haggai Eran wrote:
>>> +   sock = sockfd_lookup(fd, &rc);
>>> +   if (IS_ERR_OR_NULL(sock))
>>> +   return -EINVAL;
>>> +
>>> +   inetsock = inet_sk(sock->sk);
>>> +
>>> +   neigh = neigh_lookup(&arp_tbl, &inetsock->inet_daddr, dev);
>>
>> Also, isn't inet_daddr the destination address? But the function claims 
>> to return the SGID. I guess these can be ambiguous but still it seems 
>> confusing.
> Per description in include/net/inet_sock.h looks like that inet_daddr is
> the address of source peer of the socket.
> * @inet_daddr - Foreign IPv4 addr
>>

I meant it was confusing to have the foreign address in the socket designated 
by "daddr" while the function you proposed returned the foreign GID as SGID.

Haggai

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] IB/ipoib: Expose ioctl command to retrieve SGID of a given socket

2015-12-31 Thread Haggai Eran
On 31/12/2015 16:41, Yuval Shaia wrote:
> To support security applications, that need to filter out connections based
> on SGID, an ioctl command to retrieve SGID of a given socket is added.

Could you elaborate on the security applications? How do you see this ioctl 
being used?

> diff --git a/drivers/infiniband/ulp/ipoib/ipoib_ioctl.c 
> b/drivers/infiniband/ulp/ipoib/ipoib_ioctl.c
> new file mode 100644
> index 000..124d545
> --- /dev/null
> +++ b/drivers/infiniband/ulp/ipoib/ipoib_ioctl.c

> +static int ipoib_get_sguid(struct net_device *dev, int fd, u64 *sgid,
> +u64 *subnet_prefix)
A GID is composed of a GUID and a subnet prefix.

> +{
> + struct socket *sock;
> + struct inet_sock *inetsock;
> + struct neighbour *neigh;
> + int rc = 0;
> + union ib_gid *gid;
> + struct list_head *dev_list = 0;
> + struct ipoib_dev_priv *priv = netdev_priv(dev);
> + u16 pkey_index = priv->pkey_index;
> + struct ipoib_dev_priv *child_priv;
> +
> + sock = sockfd_lookup(fd, &rc);
> + if (IS_ERR_OR_NULL(sock))
> + return -EINVAL;
> +
> + inetsock = inet_sk(sock->sk);
> +
> + neigh = neigh_lookup(&arp_tbl, &inetsock->inet_daddr, dev);

Also, isn't inet_daddr the destination address? But the function claims 
to return the SGID. I guess these can be ambiguous but still it seems 
confusing.

> + if (!IS_ERR_OR_NULL(neigh))
> + goto found;
> +
> + /* If not found try in all other ipoib devices */
> + dev_list = ipoib_get_dev_list(priv->ca);
> + if (!dev_list)
> + return -EINVAL;
> +
> + list_for_each_entry(priv, dev_list, list) {
This list can contain devices that belong to a different namespace. It
doesn't make sense to do a neighbor lookup on these. Also, you can simply use 
neigh_lookup_nodev if you don't care about the device.

> + if (priv->pkey_index == pkey_index) {
> + neigh = neigh_lookup(&arp_tbl, &inetsock->inet_daddr,
> +  priv->dev);
> + if (!IS_ERR_OR_NULL(neigh))
> + goto found;
> + }
> + list_for_each_entry(child_priv, &priv->child_intfs, list) {
> + if (child_priv->pkey_index == pkey_index) {
> + neigh = neigh_lookup(&arp_tbl,
> +  &inetsock->inet_daddr,
> +  child_priv->dev);
> + if (!IS_ERR_OR_NULL(neigh))
> + goto found;
> + }
> + }
> + }
> +
> + return -ENODEV;
> +
> +found:
> + if (!(neigh->nud_state & NUD_VALID))
> + return -EINVAL;
> +
> + gid = (union ib_gid *)(neigh->ha + 4);
If you just take the GID from the hardware address why not do that from 
userspace?
Why do you need a new ioctl to do that for you?

> + *sgid = be64_to_cpu(gid->global.interface_id);
> + *subnet_prefix = be64_to_cpu(gid->global.subnet_prefix);
> +
> + neigh_release(neigh);
> +
> + return 0;
> +}

Regards,
Haggai




Re: [PATCH 6/6] IB/uapi: expose device capability flags

2015-12-31 Thread Haggai Eran
On 24/12/2015 16:39, Christoph Hellwig wrote:
> Expose the device capability flags which can be queried through uverbs in
> the uapi headers.
> 
> Signed-off-by: Christoph Hellwig 
> ---
>  include/rdma/ib_verbs.h  | 94 
> +++-
>  include/uapi/rdma/ib_verbs.h | 66 +++
>  2 files changed, 98 insertions(+), 62 deletions(-)
> 
> diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
> index 48bfcf5..b8d4113 100644
> --- a/include/rdma/ib_verbs.h
> +++ b/include/rdma/ib_verbs.h
> @@ -151,68 +151,38 @@ enum rdma_link_layer {
>  };
>  
>  enum ib_device_cap_flags {
> - IB_DEVICE_RESIZE_MAX_WR = (1 << 0),
> - IB_DEVICE_BAD_PKEY_CNTR = (1 << 1),
> - IB_DEVICE_BAD_QKEY_CNTR = (1 << 2),
> - IB_DEVICE_RAW_MULTI = (1 << 3),
> - IB_DEVICE_AUTO_PATH_MIG = (1 << 4),
> - IB_DEVICE_CHANGE_PHY_PORT   = (1 << 5),
> - IB_DEVICE_UD_AV_PORT_ENFORCE= (1 << 6),
> - IB_DEVICE_CURR_QP_STATE_MOD = (1 << 7),
> - IB_DEVICE_SHUTDOWN_PORT = (1 << 8),
> - IB_DEVICE_INIT_TYPE = (1 << 9),
> - IB_DEVICE_PORT_ACTIVE_EVENT = (1 << 10),
> - IB_DEVICE_SYS_IMAGE_GUID= (1 << 11),
> - IB_DEVICE_RC_RNR_NAK_GEN= (1 << 12),
> - IB_DEVICE_SRQ_RESIZE= (1 << 13),
> - IB_DEVICE_N_NOTIFY_CQ   = (1 << 14),
> -
> - /*
> -  * This device supports a per-device lkey or stag that can be
> -  * used without performing a memory registration for the local
> -  * memory.  Note that ULPs should never check this flag, but
> -  * instead of use the local_dma_lkey flag in the ib_pd structure,
> -  * which will always contain a usable lkey.
> -  */
> - IB_DEVICE_LOCAL_DMA_LKEY= (1 << 15),
> - IB_DEVICE_RESERVED /* old SEND_W_INV */ = (1 << 16),

Was this ever used by user-space? If so, I think there should be a 
comment saying that bit 16 is reserved, perhaps pointing to commit 
0f39cf3d54e6 ("IB/core: Add support for "send with invalidate" work 
requests").

> + /*
> +  * This device supports a per-device lkey or stag that can be
> +  * used without performing a memory registration for the local
> +  * memory.  Note that ULPs should never check this flag, but
> +  * instead of use the local_dma_lkey flag in the ib_pd structure,
> +  * which will always contain a usable lkey.
> +  */
> + IB_UVERBS_DEVICE_LOCAL_DMA_LKEY = (1 << 15),
I don't think user-space should be able to use local_dma_lkey.

Regards,
Haggai


Re: [PATCH 2/6] IB/uapi: expose uverbs send WR flags

2015-12-31 Thread Haggai Eran
On 24/12/2015 16:39, Christoph Hellwig wrote:
> This exposes the send WR flags supported by uverbs as part of the uapi
> headers.  It follows the same scheme as the WR opcodes.
> 
> Signed-off-by: Christoph Hellwig 
> ---
>  drivers/infiniband/hw/mlx5/mlx5_ib.h |  6 +++---
>  include/rdma/ib_verbs.h  | 14 ++
>  include/uapi/rdma/ib_verbs.h |  9 +
>  3 files changed, 18 insertions(+), 11 deletions(-)
> 

> diff --git a/include/uapi/rdma/ib_verbs.h b/include/uapi/rdma/ib_verbs.h
> index 330175e..3be3152 100644
> --- a/include/uapi/rdma/ib_verbs.h
> +++ b/include/uapi/rdma/ib_verbs.h
> @@ -20,4 +20,13 @@ enum ib_uverbs_wr_opcode {
>   IB_UVERBS_WR_END= 9,
>  };
>  
> +enum ib_uverbs_send_flags {
> + IB_UVERBS_SEND_FENCE= (1 << 0),
> + IB_UVERBS_SEND_SIGNALED = (1 << 1),
> + IB_UVERBS_SEND_SOLICITED= (1 << 2),
> + IB_UVERBS_SEND_INLINE   = (1 << 3),
> + IB_UVERBS_SEND_IP_CSUM  = (1 << 4),
> + IB_UVERBS_SEND_END  = (1 << 5),

Do you think we need a check in ib_uverbs_post_send to verify that only
user flags are passed? I think it would be safer.

Other than that,
Reviewed-by: Haggai Eran 




Re: SoftRoCE V1

2015-12-31 Thread Haggai Eran
On 29/12/2015 18:01, Wenda Ni wrote:
> Hi experts,
> 
> We have several Mellanox RoCE V1 NIC cards, and would like to try
> communicating with SoftRoCE V1.
> 
> We are using branch rxe-3.0 from https://github.com/SoftRoCE/rxe-dev
> according to the Soft-RoCE README Rev 1.0 issued from Mellanox last
> year.
> 
> 
> Testing using pingpong examples from libibverbs (hardware RoCE <->
> SoftRoCE) already shows a bug in Ethernet CRC generation. We expect
> further bugs along the way.
> 
> So we would like to know for SoftRoCE V1, are we using the latest
> branch? Also, is it still tied to SLES11 SP3 OS?
Kamal can correct me if I'm wrong, but I think the updated branch is 
rxe_submission-v2.
However, it doesn't support RoCE v1, only RoCE v2.

> There are very few documentations we can find, so we hope this is the
> right place to ask such questions.
I can't think of a better place.

Regards,
Haggai


Re: [PATCH 10/10] IB: remove the unused usecnt field from struct ib_mr

2015-12-20 Thread Haggai Eran
On 20/12/2015 11:31, Sagi Grimberg wrote:
> 
>>> diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
>>> index 284916d..e45776e 100644
>>> --- a/include/rdma/ib_verbs.h
>>> +++ b/include/rdma/ib_verbs.h
>>> @@ -1306,7 +1306,6 @@ struct ib_mr {
>>>   u64   iova;
>>>   u32   length;
>>>   unsigned int   page_size;
>>> -atomic_t   usecnt; /* count number of MWs */
>>>   };
>>
>> This comment is part of Roland's uverbs commit.
>>
>> I wonder if LL driver supporting the  IB_WR_BIND_MW op
>> ref the MR on port send and deref it on completion?
> 
> I don't see why any driver would ref the underlying MR at
> post_send(IB_WR_BIND_MW) time. It doesn't really make sense.
> 
> However, I think that the patch from Shani 6b52a12bc3fc ("IB/uverbs:
> Implement memory windows support in uverbs") is wrong. A memory
> window allocation should really reference the MR and not the PD (which
> is referenced by the MR). Otherwise the MR deregistration is allowed
> with windows bound to it. If this is the case then this field is not
> useless and we should fix ib_uverbs_alloc_mw()?
> 
> Haggai, can you shed some light here? It's Shani's and your code.
A memory window takes reference on its PD during creation. It can only
be used by that PD, and it isn't bound to any memory region yet. You 
are right that we want to prevent deregistration of an MR when there 
are memory windows bound to it. This is suggested (although not 
required) by the IBA specifications (section 10.6.7.2.6 Deregistering 
Regions with Bound Windows). The problem with the usecnt field in ib_mr 
as you wrote is that it needs to reflect what happens at bind / 
invalidate time, but bind can happen in user-space without the kernel 
knowing, and invalidation can similarly occur due to a 
send-with-invalidate message being received. This is why in the 
ConnectX-3 implementation of memory windows we didn't use the 
ib_mr.usecnt field, but relied on a hardware implementation of the 
reference count.

As I said, the spec also allows a device to proceed with deregistering 
an MR, leaving an orphan memory window. I don't think we can prevent 
this in the kernel, and I'm not sure it would be a good idea anyway, so 
I think it is okay to remove the usecnt field.

Regards,
Haggai


Re: [PATCH for-next V2 2/5] IB/core: Add ib_is_udata_cleared

2015-12-16 Thread Haggai Eran
On 15/12/2015 20:30, Matan Barak wrote:
> Extending core and vendor verb commands require us to check that the
> unknown part of the user's given command is all zeros.
> Adding ib_is_udata_cleared in order to do so.
> 
> Signed-off-by: Matan Barak 
Reviewed-by: Haggai Eran 



Re: [PATCH for-next V1 2/5] IB/core: Add ib_is_udata_cleared

2015-12-15 Thread Haggai Eran
On 14/12/2015 18:31, Matan Barak wrote:
> I'm not sure regarding the string.c location, as it deals with user
> buffers, but in order not to
> be dependent on this, I'll change this code to the following.
> 
> static inline bool ib_is_udata_cleared(struct ib_udata *udata,
>u8 cleared_char,
>size_t offset,
>size_t len)
> {
> const void __user *p = udata->inbuf + offset;
> bool ret = false;
> u8 *buf;
> 
> if (len > USHRT_MAX)
> return false;
> 
> buf = kmalloc(len, GFP_KERNEL);
> if (!buf)
> return false;
> 
> if (copy_from_user(buf, p, len))
> goto free;
> 
> ret = !memchr_inv(buf, cleared_char, len);
> 
> free:
> kfree(buf);
> return ret;
> }

Looks good to me.

Thanks,
Haggai



Re: [PATCH for-next V1 2/5] IB/core: Add ib_is_udata_cleared

2015-12-13 Thread Haggai Eran
On 10/12/2015 19:29, Matan Barak wrote:
> On Thu, Dec 10, 2015 at 5:20 PM, Haggai Eran  wrote:
>> On 10/12/2015 16:59, Matan Barak wrote:
>>> On Mon, Dec 7, 2015 at 3:18 PM, Haggai Eran  wrote:
>>>> On 12/03/2015 05:44 PM, Matan Barak wrote:
>>>>> Extending core and vendor verb commands require us to check that the
>>>>> unknown part of the user's given command is all zeros.
>>>>> Adding ib_is_udata_cleared in order to do so.
>>>>>
>>>>
>>>> Why not copy the data into kernel space and run memchr_inv() on it?
>>>>
>>>
>>> Probably less efficient, isn't it?
>> Why do you think it is less efficient?
>>
>> I'm not sure calling copy_from_user multiple times is very efficient.
>> For once, you are calling access_ok multiple times. I guess it depends
>> on the amount of data you are copying.
>>
> 
> Isn't access_ok pretty cheap?
> It calls __chk_range_not_ok which on x86 seems like a very cheap
> function and __chk_user_ptr which is a compiler check.
> I guess most kernel-user implementation will be pretty much in sync,
> so we'll possibly call it for a few/dozens of bytes. In that case, I
> think this implementation is a bit faster.
> 
>>> I know it isn't data path, but we'll execute this code in all extended
>>> functions (sometimes even more than once).
>> Do you think it is important enough to maintain our own copy of
>> memchr_inv()?
>>
> 
> True, I'm not sure it's important enough, but do you think it's that
> complicated?

It is complicated in my opinion. It is 67 lines of code, it's
architecture dependent and relies on preprocessor macros and conditional
code. I think this kind of stuff belongs in lib/string.c and not in the
RDMA stack.

Haggai


Re: [PATCH for-next V1 2/5] IB/core: Add ib_is_udata_cleared

2015-12-10 Thread Haggai Eran
On 10/12/2015 16:59, Matan Barak wrote:
> On Mon, Dec 7, 2015 at 3:18 PM, Haggai Eran  wrote:
>> On 12/03/2015 05:44 PM, Matan Barak wrote:
>>> Extending core and vendor verb commands require us to check that the
>>> unknown part of the user's given command is all zeros.
>>> Adding ib_is_udata_cleared in order to do so.
>>>
>>
>> Why not copy the data into kernel space and run memchr_inv() on it?
>>
> 
> Probably less efficient, isn't it?
Why do you think it is less efficient?

I'm not sure calling copy_from_user multiple times is very efficient.
For once, you are calling access_ok multiple times. I guess it depends
on the amount of data you are copying.

> I know it isn't data path, but we'll execute this code in all extended
> functions (sometimes even more than once).
Do you think it is important enough to maintain our own copy of
memchr_inv()?



Re: [PATCH 04/37] IB/rdmavt: Add ib core device attributes to rvt driver params list

2015-12-10 Thread Haggai Eran
On 10/12/2015 15:25, Dennis Dalessandro wrote:
> On Thu, Dec 10, 2015 at 02:26:11PM +0200, Haggai Eran wrote:
>> On 07/12/2015 22:43, Dennis Dalessandro wrote:
>>> +/*
>>> + * Drivers will need to support a number of notifications to rvt in
>>> + * accordance with certain events. This structure should contain
>>> a mask
>>> + * of the supported events. Such events that the rvt may need to
>>> know
>>> + * about include:
>>> + * port errors
>>> + * port active
>>> + * lid change
>>> + * sm change
>>> + * client reregister
>>> + * pkey change
>> Where are the event values defined?
> 
> They aren't really yet. A lot of the comments like this were put in and
> made available on GitHub for an early look into the design. This will
> all get cleaned up as the code continues to come together.
Okay, great. Since the events above are already defined as
ib_event/ib_event_type, maybe you can reuse them somehow.

>>> + *
>>> + * There may also be other events that the rvt layers needs to know
>>> + * about this is not an exhaustive list. Some events though rvt
>>> does not
>>> + * need to rely on the driver for such as completion queue error.
>>> + */
>>> + int rvt_signal_supported;
>>
>> What will rvt do if it learns that the device doesn't support some of
>> the events?
> 
> Good question. I don't really have a great answer right now. Mostly it
> sort of depends on what events we end up. If it's something critical
> rdmavt will have to check during the registration and fail the call.
> This is not yet implemented in the two patch sets posted.
I'm not sure that's needed. If the driver doesn't satisfy what rvt
expects from it they can break in many ways. I'm not sure this specific
instance requires a check during registration. But it really depends on
the specific drivers and the specific events we're going to have.



Re: [PATCH 04/37] IB/rdmavt: Add ib core device attributes to rvt driver params list

2015-12-10 Thread Haggai Eran
On 07/12/2015 22:43, Dennis Dalessandro wrote:
>  struct rvt_dev_info {
> + /*
> +  * Prior to calling for registration the driver will be responsible for
> +  * allocating space for this structure. The driver will also need to
> +  * allocate space for any private device or per port data structures.
> +  * Alternatively rvt could do this allocation and the registration API
> +  * would then change to accept an "extra" piece to allocate.
I don't think you need rvt to allocate the private data, but even if you
decide to do that, there's no need for a comment that describes all the
alternative designs here.

> +  *
> +  * The driver will also be
> +  * responsible for filling in certain members of dparms.props
> +  */



Re: [PATCH 04/37] IB/rdmavt: Add ib core device attributes to rvt driver params list

2015-12-10 Thread Haggai Eran
On 07/12/2015 22:43, Dennis Dalessandro wrote:
> + /*
> +  * Drivers will need to support a number of notifications to rvt in
> +  * accordance with certain events. This structure should contain a mask
> +  * of the supported events. Such events that the rvt may need to know
> +  * about include:
> +  * port errors
> +  * port active
> +  * lid change
> +  * sm change
> +  * client reregister
> +  * pkey change
Where are the event values defined?

> +  *
> +  * There may also be other events that the rvt layers needs to know
> +  * about this is not an exhaustive list. Some events though rvt does not
> +  * need to rely on the driver for such as completion queue error.
> +  */
> +  int rvt_signal_supported;

What will rvt do if it learns that the device doesn't support some of
the events?

Haggai


Re: [PATCH 02/37] IB/rdmavt: Consolidate dma ops in rdmavt.

2015-12-10 Thread Haggai Eran
A few nits:

On 07/12/2015 22:43, Dennis Dalessandro wrote:
> +static int rvt_map_sg(struct ib_device *dev, struct scatterlist *sgl,
> +   int nents, enum dma_data_direction direction)
> +{
> + struct scatterlist *sg;
> + u64 addr;
> + int i;
> + int ret = nents;
> +
> + if (WARN_ON(!valid_dma_direction(direction)))
> + return BAD_DMA_ADDRESS;
The function returns 0 on error, so it's technically correct, but it
doesn't return a DMA address, so I think it would make more sense to
return 0 here and not BAD_DMA_ADDRESS.

> +
> + for_each_sg(sgl, sg, nents, i) {
> + addr = (u64)page_address(sg_page(sg));
> + if (!addr) {
> + ret = 0;
> + break;
> + }
> + sg->dma_address = addr + sg->offset;
> +#ifdef CONFIG_NEED_SG_DMA_LENGTH
> + sg->dma_length = sg->length;
> +#endif
> + }
> + return ret;
> +}

> diff --git a/drivers/infiniband/sw/rdmavt/vt.h 
> b/drivers/infiniband/sw/rdmavt/vt.h
> index ec210f3..a19a3af 100644
> --- a/drivers/infiniband/sw/rdmavt/vt.h
> +++ b/drivers/infiniband/sw/rdmavt/vt.h
> @@ -52,5 +52,6 @@
>   */
>  
>  #include 
> +#include "dma.h"
Why do you need the dma.h file included here? Won't it be enough to
include it in vt.c?

Regards,
Haggai


Re: [PATCH 01/37] IB/rdmavt: Create module framework and handle driver registration

2015-12-10 Thread Haggai Eran
On 07/12/2015 22:43, Dennis Dalessandro wrote:
> +struct rvt_dev_info {
> + struct ib_device ibdev;
> + int (*port_callback)(struct ib_device *, u8, struct kobject *);
> +
> + /*
> +  * TODO:
> +  *  need to reflect module parameters that may vary by dev
> +  */
> +};

Could you explain what you mean by the TODO comment? Different device
drivers using rvt will have their own modules with their own parameters,
right? Why do you need them reflected here?

Thanks,
Haggai


Re: [PATCH for-next V2 07/11] IB/core: Validate route in ib_init_ah_from_wc and ib_init_ah_from_path

2015-12-07 Thread Haggai Eran
On 12/03/2015 03:47 PM, Matan Barak wrote:
> +static int addr_resolve_neigh(struct dst_entry *dst,
> +   const struct sockaddr *dst_in,
> +   struct rdma_dev_addr *addr)
> +{
> + if (dst->dev->flags & IFF_LOOPBACK) {
> + int ret;
> +
> + ret = rdma_translate_ip(dst_in, addr, NULL);
> + if (!ret)
> + memcpy(addr->dst_dev_addr, addr->src_dev_addr,
> +MAX_ADDR_LEN);
> +
> + return ret;
> + }
> +
> + /* If the device does ARP internally */
You mean "doesn't do ARP internally" right?

> + if (!(dst->dev->flags & IFF_NOARP)) {
> + const struct sockaddr_in *dst_in4 =
> + (const struct sockaddr_in *)dst_in;
> + const struct sockaddr_in6 *dst_in6 =
> + (const struct sockaddr_in6 *)dst_in;
> +
> + return dst_fetch_ha(dst, addr,
> + dst_in->sa_family == AF_INET ?
> + (const void *)&dst_in4->sin_addr.s_addr :
> + (const void *)&dst_in6->sin6_addr);
> + }
> +
> + return rdma_copy_addr(addr, dst->dev, NULL);
> +}

> +int rdma_resolve_ip_route(struct sockaddr *src_addr,
> +   const struct sockaddr *dst_addr,
> +   struct rdma_dev_addr *addr)
> +{
> + struct sockaddr_storage ssrc_addr;
> + struct sockaddr *src_in = (struct sockaddr *)&ssrc_addr;
> +
> + if (src_addr->sa_family != dst_addr->sa_family)
> + return -EINVAL;
> +
> + if (src_addr)
> + memcpy(src_in, src_addr, rdma_addr_size(src_addr));
> + else
> + src_in->sa_family = dst_addr->sa_family;
Don't you need to clear the rest of src_in? I believe you pass
uninitialized memory to it.

> +
> + return addr_resolve(src_in, dst_addr, addr, false);
> +}
> +EXPORT_SYMBOL(rdma_resolve_ip_route);

Haggai


Re: [PATCH libibverbs] init.c: increase sysfs read buffer size to 16

2015-12-07 Thread Haggai Eran
On 12/04/2015 01:09 AM, Jeff Squyres wrote:
> The default value of 8 is too small to read
> /sys/class/infiniband/usnic_x/node_type, which contains "6: usNIC
> UDP".  Per a7a73a8c1b39362f1701256bc772d82847832f9c, the too-small
> buffer causes a stderr warning to be emitted from ibv_devinfo when
> reading usNIC devices.
> 
> This commit therefore increases the buffer size to 16, which is long
> enough to read the usNIC node_type value.

Reviewed-by: Haggai Eran 


Re: [PATCH for-next V1 2/5] IB/core: Add ib_is_udata_cleared

2015-12-07 Thread Haggai Eran
On 12/03/2015 05:44 PM, Matan Barak wrote:
> Extending core and vendor verb commands require us to check that the
> unknown part of the user's given command is all zeros.
> Adding ib_is_udata_cleared in order to do so.
> 

Why not copy the data into kernel space and run memchr_inv() on it?



Re: [PATCH libmlx5 V1 6/6] Add always_inline check

2015-12-07 Thread Haggai Eran
On 12/03/2015 06:02 PM, Matan Barak wrote:
> Always inline isn't supported by every compiler. Adding it to
> configure.ac in order to support it only when possible.
> Inline other poll_one data path functions in order to eliminate
> "ifs".
> 
> Signed-off-by: Matan Barak 
> ---
>  configure.ac | 17 +
>  src/cq.c | 42 +-
>  src/mlx5.h   |  6 ++
>  3 files changed, 52 insertions(+), 13 deletions(-)
> 
> diff --git a/configure.ac b/configure.ac
> index fca0b46..50b4f9c 100644
> --- a/configure.ac
> +++ b/configure.ac
> @@ -65,6 +65,23 @@ AC_CHECK_FUNC(ibv_read_sysfs_file, [],
>  AC_MSG_ERROR([ibv_read_sysfs_file() not found.  libmlx5 requires 
> libibverbs >= 1.0.3.]))
>  AC_CHECK_FUNCS(ibv_dontfork_range ibv_dofork_range ibv_register_driver)
>  
> +AC_MSG_CHECKING("always inline")
Did you consider using an existing script like AX_GCC_FUNC_ATTRIBUTE [1]?

> +CFLAGS_BAK="$CFLAGS"
> +CFLAGS="$CFLAGS -Werror"
> +AC_COMPILE_IFELSE([AC_LANG_PROGRAM([[
> + static inline int f(void)
> + __attribute__((always_inline));
> + static inline int f(void)
> + {
> + return 1;
> + }
> +]],[[
> + int a = f();
> + a = a;
> +]])], [AC_MSG_RESULT([yes]) AC_DEFINE([HAVE_ALWAYS_INLINE], [1], [Define if 
> __attribute((always_inline)).])]
The description here doesn't look right. How about "Define if
__attribute__((always_inline)) is supported"?

Regards,
Haggai

[1] https://www.gnu.org/software/autoconf-archive/ax_gcc_func_attribute.html


Re: [PATCH] infiniband:core:Add needed error path in cm_init_av_by_path

2015-12-06 Thread Haggai Eran
On Friday, December 4, 2015 8:02 PM, Nicholas Krause  
wrote:
> To: dledf...@redhat.com
> Cc: sean.he...@intel.com; hal.rosenst...@gmail.com; Haggai Eran; 
> jguntho...@obsidianresearch.com; Matan Barak; yun.w...@profitbricks.com; 
> ted.h@oracle.com; Doron Tsur; Erez Shitrit; david.ah...@oracle.com; 
> linux-rdma@vger.kernel.org; linux-ker...@vger.kernel.org
> Subject: [PATCH] infiniband:core:Add needed error path in cm_init_av_by_path
> 
> This adds a needed error path in the function, cm_init_av_by_path
> after the call to ib_init_ah_from_path in order to avoid incorrectly
> accessing the path pointer of structure type ib_sa_path_rec if this
> function call fails to complete its intended work successfully by
> returning a error code.
> 
> Signed-off-by: Nicholas Krause 

The subject doesn't seem to match the convention but apart from that,

Reviewed-by: Haggai Eran 

I wonder if this should go to stable. If I understand correctly, this will fail 
only when the SGID isn't found in the GID table, but such connections would 
fail later on when creating a QP, right?

Haggai


Re: [PATCH v1 10/10] IB/iser: Support the remote invalidation exception

2015-11-30 Thread Haggai Eran
On 24/11/2015 18:23, Sagi Grimberg wrote:
> From: Jenny Derzhavetz 
> 
> Declare that we support remote invalidation in case we are:
> 1. using Fastreg method
> 2. always registering memory.
> 
> Detect the invalidated rkey from the work completion info so we
> won't invalidate it locally. The spec madates that we must not rely
> on the taget remote invalidate our rkey so we must check it upon
typo: s/taget/target/

> a receive (scsi response) completion.

Haggai


Re: [PATCH 3/3] IB/core: constify mmu_notifier_ops structures

2015-11-29 Thread Haggai Eran
On 30/11/2015 00:02, Julia Lawall wrote:
> This mmu_notifier_ops structure is never modified, so declare it as
> const, like the other mmu_notifier_ops structures.
> 
> Done with the help of Coccinelle.
> 
> Signed-off-by: Julia Lawall 

Reviewed-by: Haggai Eran 

Thanks,
Haggai



Re: [PATCH] IB/cma: Add a missing rcu_read_unlock()

2015-11-22 Thread Haggai Eran
On 20/11/2015 21:04, Bart Van Assche wrote:
> Ensure that validate_ipv4_net_dev() calls rcu_read_unlock() if
> fib_lookup() fails. Detected by sparse. Compile-tested only.
> 
> Fixes: "IB/cma: Validate routing of incoming requests" (commit f887f2ac87c2).
> Cc: Haggai Eran 
> Cc: stable 

Thanks.

Reviewed-by: Haggai Eran 


Re: [PATCH ib-next 2/3] IB/core: IB/core: Allow legacy verbs through extended interfaces

2015-11-08 Thread Haggai Eran
On 08/11/2015 17:04, Matan Barak wrote:
>> @@ -704,6 +719,10 @@ static ssize_t ib_uverbs_write(struct file *filp, const 
>> char __user *buf,
>> > }
>> >
>> > command = hdr.command & IB_USER_VERBS_CMD_COMMAND_MASK;
>> > +   if (verify_command_mask(ib_dev, command)) {
>> > +   ret = -EINVAL;
> Why did you replace ENOSYS with EINVAL?

I think ENOSYS is meant only to represent a system call number that
isn't supported. There was a change in checkpatch that now warns about
it [1]. I'm not sure the intention was to fix existing uses though.

Haggai

[1] http://www.gossamer-threads.com/lists/linux/kernel/2148343



Re: RFC rdma cgroup

2015-11-04 Thread Haggai Eran
On 03/11/2015 21:11, Parav Pandit wrote:
> So it looks like below,
> #cat rdma.resources.verbs.list
> Output:
> mlx4_0 uctx ah pd cq mr mw srq qp flow
> mlx4_1 uctx ah pd cq mr mw srq qp flow rss_wq
What happens if you set a limit of rss_wq to mlx4_0 in this example?
Would it fail? I think it would be simpler for administrators if they
can configure every resource supported by uverbs. If a resource is not
supported by a specific device, you can never go over the limit anyway.

> #cat rdma.resources.hw.list
> hfi1 hw_qp hw_mr sw_pd
> (This particular one is hypothical example, I haven't actually coded
> this, unlike uverbs which is real).
Sounds fine to me. We will need to be careful to make sure that driver
maintainers don't break backward compatibility with this interface.

>> I guess there aren't a lot of options when the resources can belong to
>> multiple cgroups. So after migrating, new resources will belong to the
>> new cgroup or the old one?
> Resource always belongs to the cgroup in which its created, regardless
> of process migration.
> Again, its owned at the css level instead of cgroup. Therefore
> original cgroup can also be deleted but internal reference to data
> structure and that is freed and last rdma resource is freed.
Okay.

>>> For applications that doesn't use RDMA-CM, query_device and query_port
>>> will filter out the GID entries based on the network namespace in
>>> which caller process is running.
>> This could work well for RoCE, as each entry in the GID table is
>> associated with a net device and a network namespace. However, in
>> InfiniBand, the GID table isn't directly related to the network
>> namespace. As for the P_Keys, you could deduce the set of P_Keys of a
>> namespace by the set of IPoIB netdevs in the network namespace, but
>> InfiniBand is designed to also work without IPoIB, so I don't think it's
>> a good idea.
> Got it. Yeah, this code can be under if(device_type RoCE).
IIRC there's a core capability for the new GID table code that contains
namespace, so you can use that.

>> I think it would be better to allow each cgroup to limit the pkeys and
>> gids its processes can use.
> 
> o.k. So the use case is P_Key? So I believe requirement would similar
> to device cgroup.
> Where set of GID table entries are configured as white list entries.
> and when they are queried or used during create_ah or modify_qp, its
> compared against the white list (or in other words as ACL).
> If they are found in ACL, they are reported in query_device or in
> create_ah, modify_qp. If not they those calls are failed with
> appropriate status?
> Does this look ok? 
Yes, that sounds good to me.

> Can we address requirement as additional feature just after first path?
> Tejun had some other idea on this kind of requirement, and I need to
> discuss with him.
Of course. I think there's use for the RDMA cgroup even without a pkey
or GID ACL, just to make sure one application doesn't hog hardware
resources.

>>> One of the idea I was considering is: to create virtual RDMA device
>>> mapped to physical device.
>>> And configure GID count limit via configfs for each such device.
>> You could probably achieve what you want by creating a virtual RDMA
>> device and use the device cgroup to limit access to it, but it sounds to
>> me like an overkill.
> 
> Actually not much. Basically this virtual RDMA device points to the
> struct device of the physical device itself.
> So only overhead is linking this structure to native device structure
> and  passing most of the calls to native ib_device with thin filter
> layer in control path.
> post_send/recv/poll_cq will directly go native device and same performance.
Still, I think we already have code that wraps ib_device calls for
userspace, which is the ib_uverbs module. There's no need for an extra
layer.

Regards,
Haggai


Re: RFC rdma cgroup

2015-11-02 Thread Haggai Eran
On 29/10/2015 20:46, Parav Pandit wrote:
> On Thu, Oct 29, 2015 at 8:27 PM, Haggai Eran  wrote:
>> On 28/10/2015 10:29, Parav Pandit wrote:
>>> 3. Resources are not defined by the RDMA cgroup. Resources are defined
>>> by RDMA/IB subsystem and optionally by HCA vendor device drivers.
>>> Rationale: This allows rdma cgroup to remain constant while RDMA/IB
>>> subsystem can evolve without the need of rdma cgroup update. A new
>>> resource can be easily added by the RDMA/IB subsystem without touching
>>> rdma cgroup.
>> Resources exposed by the cgroup are basically a UAPI, so we have to be
>> careful to make it stable when it evolves. I understand the need for
>> vendor specific resources, following the discussion on the previous
>> proposal, but could you write on how you plan to allow these set of
>> resources to evolve?
> 
> Its fairly simple.
> Here is the code snippet on how resources are defined in my tree.
> It doesn't have the RSS work queues yet, but can be added right after
> this patch.
> 
> Resource are defined as index and as match_table_t.
> 
> enum rdma_resource_type {
> RDMA_VERB_RESOURCE_UCTX,
> RDMA_VERB_RESOURCE_AH,
> RDMA_VERB_RESOURCE_PD,
> RDMA_VERB_RESOURCE_CQ,
> RDMA_VERB_RESOURCE_MR,
> RDMA_VERB_RESOURCE_MW,
> RDMA_VERB_RESOURCE_SRQ,
> RDMA_VERB_RESOURCE_QP,
> RDMA_VERB_RESOURCE_FLOW,
> RDMA_VERB_RESOURCE_MAX,
> };
> So UAPI RDMA resources can evolve by just adding more entries here.
Are the names that appear in userspace also controlled by uverbs? What
about the vendor specific resources?

>>> 8. Typically each RDMA cgroup will have 0 to 4 RDMA devices. Therefore
>>> each cgroup will have 0 to 4 verbs resource pool and optionally 0 to 4
>>> hw resource pool per such device.
>>> (Nothing stops to have more devices and pools, but design is around
>>> this use case).
>> In what way does the design depend on this assumption?
> 
> Current code when performs resource charging/uncharging, it needs to
> identify the resource pool which one to charge to.
> This resource pool is maintained as list_head and so its linear search
> per device.
> If we are thinking of 100 of RDMA devices per container, than liner
> search will not be good way and different data structure needs to be
> deployed.
Okay, sounds fine to me.

>>> (c) When process migrate from one to other cgroup, resource is
>>> continue to be owned by the creator cgroup (rather css).
>>> After process migration, whenever new resource is created in new
>>> cgroup, it will be owned by new cgroup.
>> It sounds a little different from how other cgroups behave. I agree that
>> mostly processes will create the resources in their cgroup and won't
>> migrate, but why not move the charge during migration?
>>
> With fork() process doesn't really own the resource (unlike other file
> and socket descriptors).
> Parent process might have died also.
> There is possibly no clear way to transfer resource to right child.
> Child that cgroup picks might not even want to own RDMA resources.
> RDMA resources might be allocated by one process and freed by other
> process (though this might not be the way they use it).
> Its pretty similar to other cgroups with exception in migration area,
> such exception comes from different behavior of how RDMA resources are
> owned, created and used.
> Recent unified hierarchy patch from Tejun equally highlights to not
> frequently migrate processes among cgroups.
> 
> So in current implementation, (like other),
> if process created a RDMA resource, forked a child.
> child and parent both can allocate and free more resources.
> child moved to different cgroup. But resource is shared among them.
> child can free also the resource. All crazy combinations are possible
> in theory (without much use cases).
> So at best they are charged to the first cgroup css in which
> parent/child are created and reference is hold to CSS.
> cgroup, process can die, cut css remains until RDMA resources are freed.
> This is similar to process behavior where task struct is release but
> id is hold up for a while.

I guess there aren't a lot of options when the resources can belong to
multiple cgroups. So after migrating, new resources will belong to the
new cgroup or the old one?

>> I finally wanted to ask about other limitations an RDMA cgroup could
>> handle. It would be great to be able to limit a container to be allowed
>> to use only a subset of the MAC/VLAN pairs programmed to a device,
> 
> Truly. I agree. That was one of the prime reason I originally ha

Re: RFC rdma cgroup

2015-10-29 Thread Haggai Eran
On 28/10/2015 10:29, Parav Pandit wrote:
> 3. Resources are not defined by the RDMA cgroup. Resources are defined
> by RDMA/IB subsystem and optionally by HCA vendor device drivers.
> Rationale: This allows rdma cgroup to remain constant while RDMA/IB
> subsystem can evolve without the need of rdma cgroup update. A new
> resource can be easily added by the RDMA/IB subsystem without touching
> rdma cgroup.
Resources exposed by the cgroup are basically a UAPI, so we have to be
careful to make it stable when it evolves. I understand the need for
vendor specific resources, following the discussion on the previous
proposal, but could you write on how you plan to allow these set of
resources to evolve?

> 8. Typically each RDMA cgroup will have 0 to 4 RDMA devices. Therefore
> each cgroup will have 0 to 4 verbs resource pool and optionally 0 to 4
> hw resource pool per such device.
> (Nothing stops to have more devices and pools, but design is around
> this use case).
In what way does the design depend on this assumption?

> 9. Resource pool object is created in following situations.
> (a) administrative operation is done to set the limit and no previous
> resource pool exist for the device of interest for the cgroup.
> (b) no resource limits were configured, but IB/RDMA subsystem tries to
> charge the resource. so that when applications are running without
> limits and later on when limits are enforced, during uncharging, it
> correctly uncharges them, otherwise usage count will drop to negative.
> This is done using default resource pool.
> Instead of implementing any sort of time markers, default pool
> simplifies the design.
Having a default resource pool kind of implies there is a non-default
one. Is the only difference between the default and non-default the fact
that the second was created with an administrative operation and has
specified limits or is there some other difference?

> (c) When process migrate from one to other cgroup, resource is
> continue to be owned by the creator cgroup (rather css).
> After process migration, whenever new resource is created in new
> cgroup, it will be owned by new cgroup.
It sounds a little different from how other cgroups behave. I agree that
mostly processes will create the resources in their cgroup and won't
migrate, but why not move the charge during migration?

I finally wanted to ask about other limitations an RDMA cgroup could
handle. It would be great to be able to limit a container to be allowed
to use only a subset of the MAC/VLAN pairs programmed to a device, or
only a subset of P_Keys and GIDs it has. Do you see such limitations
also as part of this cgroup?

Thanks,
Haggai


[PATCH v7 2/4] IB/cma: Separate port allocation to network namespaces

2015-10-22 Thread Haggai Eran
Keep a struct for each network namespace containing the IDRs for the RDMA
CM port spaces. The struct is created dynamically using the generic_net
mechanism.

This patch is internal infrastructure work for the following patches. In
this patch, init_net is statically used as the network namespace for
the new port-space API.

Signed-off-by: Haggai Eran 
Signed-off-by: Yotam Kenneth 
Signed-off-by: Shachar Raindel 
Signed-off-by: Guy Shapiro 
---
 drivers/infiniband/core/cma.c | 94 ---
 1 file changed, 70 insertions(+), 24 deletions(-)

diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c
index 80a8b469a9a8..ac03c32ca7f1 100644
--- a/drivers/infiniband/core/cma.c
+++ b/drivers/infiniband/core/cma.c
@@ -44,6 +44,8 @@
 #include 
 #include 
 
+#include 
+#include 
 #include 
 #include 
 #include 
@@ -110,22 +112,33 @@ static LIST_HEAD(dev_list);
 static LIST_HEAD(listen_any_list);
 static DEFINE_MUTEX(lock);
 static struct workqueue_struct *cma_wq;
-static DEFINE_IDR(tcp_ps);
-static DEFINE_IDR(udp_ps);
-static DEFINE_IDR(ipoib_ps);
-static DEFINE_IDR(ib_ps);
+static int cma_pernet_id;
 
-static struct idr *cma_idr(enum rdma_port_space ps)
+struct cma_pernet {
+   struct idr tcp_ps;
+   struct idr udp_ps;
+   struct idr ipoib_ps;
+   struct idr ib_ps;
+};
+
+static struct cma_pernet *cma_pernet(struct net *net)
+{
+   return net_generic(net, cma_pernet_id);
+}
+
+static struct idr *cma_pernet_idr(struct net *net, enum rdma_port_space ps)
 {
+   struct cma_pernet *pernet = cma_pernet(net);
+
switch (ps) {
case RDMA_PS_TCP:
-   return &tcp_ps;
+   return &pernet->tcp_ps;
case RDMA_PS_UDP:
-   return &udp_ps;
+   return &pernet->udp_ps;
case RDMA_PS_IPOIB:
-   return &ipoib_ps;
+   return &pernet->ipoib_ps;
case RDMA_PS_IB:
-   return &ib_ps;
+   return &pernet->ib_ps;
default:
return NULL;
}
@@ -145,24 +158,25 @@ struct rdma_bind_list {
unsigned short  port;
 };
 
-static int cma_ps_alloc(enum rdma_port_space ps,
+static int cma_ps_alloc(struct net *net, enum rdma_port_space ps,
struct rdma_bind_list *bind_list, int snum)
 {
-   struct idr *idr = cma_idr(ps);
+   struct idr *idr = cma_pernet_idr(net, ps);
 
return idr_alloc(idr, bind_list, snum, snum + 1, GFP_KERNEL);
 }
 
-static struct rdma_bind_list *cma_ps_find(enum rdma_port_space ps, int snum)
+static struct rdma_bind_list *cma_ps_find(struct net *net,
+ enum rdma_port_space ps, int snum)
 {
-   struct idr *idr = cma_idr(ps);
+   struct idr *idr = cma_pernet_idr(net, ps);
 
return idr_find(idr, snum);
 }
 
-static void cma_ps_remove(enum rdma_port_space ps, int snum)
+static void cma_ps_remove(struct net *net, enum rdma_port_space ps, int snum)
 {
-   struct idr *idr = cma_idr(ps);
+   struct idr *idr = cma_pernet_idr(net, ps);
 
idr_remove(idr, snum);
 }
@@ -1325,7 +1339,8 @@ static struct rdma_id_private *cma_id_from_event(struct ib_cm_id *cm_id,
}
}
 
-   bind_list = cma_ps_find(rdma_ps_from_service_id(req.service_id),
+   bind_list = cma_ps_find(&init_net,
+   rdma_ps_from_service_id(req.service_id),
cma_port_from_service_id(req.service_id));
id_priv = cma_find_listener(bind_list, cm_id, ib_event, &req, *net_dev);
if (IS_ERR(id_priv) && *net_dev) {
@@ -1403,7 +1418,7 @@ static void cma_release_port(struct rdma_id_private *id_priv)
mutex_lock(&lock);
hlist_del(&id_priv->node);
if (hlist_empty(&bind_list->owners)) {
-   cma_ps_remove(bind_list->ps, bind_list->port);
+   cma_ps_remove(&init_net, bind_list->ps, bind_list->port);
kfree(bind_list);
}
mutex_unlock(&lock);
@@ -2693,7 +2708,7 @@ static int cma_alloc_port(enum rdma_port_space ps,
if (!bind_list)
return -ENOMEM;
 
-   ret = cma_ps_alloc(ps, bind_list, snum);
+   ret = cma_ps_alloc(&init_net, ps, bind_list, snum);
if (ret < 0)
goto err;
 
@@ -2718,7 +2733,7 @@ static int cma_alloc_any_port(enum rdma_port_space ps,
rover = prandom_u32() % remaining + low;
 retry:
if (last_used_port != rover &&
-   !cma_ps_find(ps, (unsigned short)rover)) {
+   !cma_ps_find(&init_net, ps, (unsigned short)rover)) {
int ret = cma_alloc_port(ps, id_priv, rover);
/*
 * Remember previously used port number in order to avoid
@@ -2784,7 +2799,7 @@ static int cma_use_port(enum rdma_port_space ps,
if (snu

[PATCH v7 4/4] IB/ucma: Take the network namespace from the process

2015-10-22 Thread Haggai Eran
From: Guy Shapiro 

Add support for network namespaces from user space. This is done by passing
the network namespace of the process instead of init_net.

Signed-off-by: Haggai Eran 
Signed-off-by: Yotam Kenneth 
Signed-off-by: Shachar Raindel 
Signed-off-by: Guy Shapiro 
---
 drivers/infiniband/core/ucma.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/infiniband/core/ucma.c b/drivers/infiniband/core/ucma.c
index e80c107450ab..8b5a934e1133 100644
--- a/drivers/infiniband/core/ucma.c
+++ b/drivers/infiniband/core/ucma.c
@@ -42,6 +42,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -472,8 +473,8 @@ static ssize_t ucma_create_id(struct ucma_file *file, const char __user *inbuf,
return -ENOMEM;
 
ctx->uid = cmd.uid;
-   ctx->cm_id = rdma_create_id(&init_net, ucma_event_handler, ctx, cmd.ps,
-   qp_type);
+   ctx->cm_id = rdma_create_id(current->nsproxy->net_ns,
+   ucma_event_handler, ctx, cmd.ps, qp_type);
if (IS_ERR(ctx->cm_id)) {
ret = PTR_ERR(ctx->cm_id);
goto err1;
-- 
1.7.11.2



[PATCH v7 3/4] IB/cma: Add support for network namespaces

2015-10-22 Thread Haggai Eran
From: Guy Shapiro 

Add support for network namespaces in the ib_cma module. This is
accomplished by:

1. Adding network namespace parameter for rdma_create_id. This parameter is
   used to populate the network namespace field in rdma_id_private.
   rdma_create_id keeps a reference on the network namespace.
2. Using the network namespace from the rdma_id instead of init_net inside
   of ib_cma, when listening on an ID and when looking for an ID for an
   incoming request.
3. Decrementing the reference count for the appropriate network namespace
   when calling rdma_destroy_id.

In order to preserve the current behavior init_net is passed when calling
from other modules.

Signed-off-by: Guy Shapiro 
Signed-off-by: Haggai Eran 
Signed-off-by: Yotam Kenneth 
Signed-off-by: Shachar Raindel 
---
 drivers/infiniband/core/cma.c  | 46 +-
 drivers/infiniband/core/ucma.c |  3 +-
 drivers/infiniband/ulp/iser/iser_verbs.c   |  2 +-
 drivers/infiniband/ulp/isert/ib_isert.c|  2 +-
 .../staging/lustre/lnet/klnds/o2iblnd/o2iblnd.h|  4 +-
 include/rdma/rdma_cm.h |  6 ++-
 net/9p/trans_rdma.c|  4 +-
 net/rds/ib.c   |  2 +-
 net/rds/ib_cm.c|  2 +-
 net/rds/iw.c   |  2 +-
 net/rds/iw_cm.c|  2 +-
 net/rds/rdma_transport.c   |  4 +-
 net/sunrpc/xprtrdma/svc_rdma_transport.c   |  4 +-
 net/sunrpc/xprtrdma/verbs.c|  3 +-
 14 files changed, 52 insertions(+), 34 deletions(-)

diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c
index ac03c32ca7f1..7e93eb1f33eb 100644
--- a/drivers/infiniband/core/cma.c
+++ b/drivers/infiniband/core/cma.c
@@ -600,7 +600,8 @@ static int cma_disable_callback(struct rdma_id_private *id_priv,
return 0;
 }
 
-struct rdma_cm_id *rdma_create_id(rdma_cm_event_handler event_handler,
+struct rdma_cm_id *rdma_create_id(struct net *net,
+ rdma_cm_event_handler event_handler,
  void *context, enum rdma_port_space ps,
  enum ib_qp_type qp_type)
 {
@@ -624,7 +625,7 @@ struct rdma_cm_id *rdma_create_id(rdma_cm_event_handler event_handler,
INIT_LIST_HEAD(&id_priv->listen_list);
INIT_LIST_HEAD(&id_priv->mc_list);
get_random_bytes(&id_priv->seq_num, sizeof id_priv->seq_num);
-   id_priv->id.route.addr.dev_addr.net = &init_net;
+   id_priv->id.route.addr.dev_addr.net = get_net(net);
 
return &id_priv->id;
 }
@@ -1278,7 +1279,7 @@ static bool cma_match_net_dev(const struct rdma_id_private *id_priv,
   cma_protocol_roce(&id_priv->id);
 
return !addr->dev_addr.bound_dev_if ||
-  (net_eq(dev_net(net_dev), &init_net) &&
+  (net_eq(dev_net(net_dev), addr->dev_addr.net) &&
addr->dev_addr.bound_dev_if == net_dev->ifindex);
 }
 
@@ -1339,7 +1340,7 @@ static struct rdma_id_private *cma_id_from_event(struct ib_cm_id *cm_id,
}
}
 
-   bind_list = cma_ps_find(&init_net,
+   bind_list = cma_ps_find(*net_dev ? dev_net(*net_dev) : &init_net,
rdma_ps_from_service_id(req.service_id),
cma_port_from_service_id(req.service_id));
id_priv = cma_find_listener(bind_list, cm_id, ib_event, &req, *net_dev);
@@ -1411,6 +1412,7 @@ static void cma_cancel_operation(struct rdma_id_private *id_priv,
 static void cma_release_port(struct rdma_id_private *id_priv)
 {
struct rdma_bind_list *bind_list = id_priv->bind_list;
+   struct net *net = id_priv->id.route.addr.dev_addr.net;
 
if (!bind_list)
return;
@@ -1418,7 +1420,7 @@ static void cma_release_port(struct rdma_id_private *id_priv)
mutex_lock(&lock);
hlist_del(&id_priv->node);
if (hlist_empty(&bind_list->owners)) {
-   cma_ps_remove(&init_net, bind_list->ps, bind_list->port);
+   cma_ps_remove(net, bind_list->ps, bind_list->port);
kfree(bind_list);
}
mutex_unlock(&lock);
@@ -1477,6 +1479,7 @@ void rdma_destroy_id(struct rdma_cm_id *id)
cma_deref_id(id_priv->id.context);
 
kfree(id_priv->id.route.path_rec);
+   put_net(id_priv->id.route.addr.dev_addr.net);
kfree(id_priv);
 }
 EXPORT_SYMBOL(rdma_destroy_id);
@@ -1607,7 +1610,8 @@ static struct rdma_id_private *cma_new_conn_id(struct rdma_cm_id *listen_id,
  ib_event->param.req_rcvd.primary_path->service_id;
int ret;
 
-   id = rdma_create

[PATCH v7 1/4] IB/addr: Pass network namespace as a parameter

2015-10-22 Thread Haggai Eran
From: Guy Shapiro 

Add network namespace support to the ib_addr module. For that, all the
address resolution and matching should be done using the appropriate
namespace instead of init_net.

This is achieved by:

1. Adding an explicit network namespace argument to exported function that
   require a namespace.
2. Saving the namespace in the rdma_addr_client structure.
3. Using it when calling networking functions.

In order to preserve the behavior of calling modules, &init_net is
passed as the parameter in calls from other modules. This is modified as
namespace support is added on more levels.

Signed-off-by: Haggai Eran 
Signed-off-by: Yotam Kenneth 
Signed-off-by: Shachar Raindel 
Signed-off-by: Guy Shapiro 
---
 drivers/infiniband/core/addr.c | 17 +
 drivers/infiniband/core/cma.c  |  1 +
 include/rdma/ib_addr.h | 16 +++-
 3 files changed, 25 insertions(+), 9 deletions(-)

diff --git a/drivers/infiniband/core/addr.c b/drivers/infiniband/core/addr.c
index d3c42b3c1b51..34b1adad07aa 100644
--- a/drivers/infiniband/core/addr.c
+++ b/drivers/infiniband/core/addr.c
@@ -128,7 +128,7 @@ int rdma_translate_ip(struct sockaddr *addr, struct rdma_dev_addr *dev_addr,
int ret = -EADDRNOTAVAIL;
 
if (dev_addr->bound_dev_if) {
-   dev = dev_get_by_index(&init_net, dev_addr->bound_dev_if);
+   dev = dev_get_by_index(dev_addr->net, dev_addr->bound_dev_if);
if (!dev)
return -ENODEV;
ret = rdma_copy_addr(dev_addr, dev, NULL);
@@ -138,7 +138,7 @@ int rdma_translate_ip(struct sockaddr *addr, struct rdma_dev_addr *dev_addr,
 
switch (addr->sa_family) {
case AF_INET:
-   dev = ip_dev_find(&init_net,
+   dev = ip_dev_find(dev_addr->net,
((struct sockaddr_in *) addr)->sin_addr.s_addr);
 
if (!dev)
@@ -149,12 +149,11 @@ int rdma_translate_ip(struct sockaddr *addr, struct rdma_dev_addr *dev_addr,
*vlan_id = rdma_vlan_dev_vlan_id(dev);
dev_put(dev);
break;
-
 #if IS_ENABLED(CONFIG_IPV6)
case AF_INET6:
rcu_read_lock();
-   for_each_netdev_rcu(&init_net, dev) {
-   if (ipv6_chk_addr(&init_net,
+   for_each_netdev_rcu(dev_addr->net, dev) {
+   if (ipv6_chk_addr(dev_addr->net,
  &((struct sockaddr_in6 *) addr)->sin6_addr,
  dev, 1)) {
ret = rdma_copy_addr(dev_addr, dev, NULL);
@@ -236,7 +235,7 @@ static int addr4_resolve(struct sockaddr_in *src_in,
fl4.daddr = dst_ip;
fl4.saddr = src_ip;
fl4.flowi4_oif = addr->bound_dev_if;
-   rt = ip_route_output_key(&init_net, &fl4);
+   rt = ip_route_output_key(addr->net, &fl4);
if (IS_ERR(rt)) {
ret = PTR_ERR(rt);
goto out;
@@ -278,12 +277,12 @@ static int addr6_resolve(struct sockaddr_in6 *src_in,
fl6.saddr = src_in->sin6_addr;
fl6.flowi6_oif = addr->bound_dev_if;
 
-   dst = ip6_route_output(&init_net, NULL, &fl6);
+   dst = ip6_route_output(addr->net, NULL, &fl6);
if ((ret = dst->error))
goto put;
 
if (ipv6_addr_any(&fl6.saddr)) {
-   ret = ipv6_dev_get_saddr(&init_net, ip6_dst_idev(dst)->dev,
+   ret = ipv6_dev_get_saddr(addr->net, ip6_dst_idev(dst)->dev,
 &fl6.daddr, 0, &fl6.saddr);
if (ret)
goto put;
@@ -477,6 +476,7 @@ int rdma_addr_find_dmac_by_grh(const union ib_gid *sgid, const union ib_gid *dgi
 
memset(&dev_addr, 0, sizeof(dev_addr));
dev_addr.bound_dev_if = if_index;
+   dev_addr.net = &init_net;
 
ctx.addr = &dev_addr;
init_completion(&ctx.comp);
@@ -511,6 +511,7 @@ int rdma_addr_find_smac_by_sgid(union ib_gid *sgid, u8 *smac, u16 *vlan_id)
rdma_gid2ip(&gid_addr._sockaddr, sgid);
 
memset(&dev_addr, 0, sizeof(dev_addr));
+   dev_addr.net = &init_net;
ret = rdma_translate_ip(&gid_addr._sockaddr, &dev_addr, vlan_id);
if (ret)
return ret;
diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c
index be41784a62e4..80a8b469a9a8 100644
--- a/drivers/infiniband/core/cma.c
+++ b/drivers/infiniband/core/cma.c
@@ -610,6 +610,7 @@ struct rdma_cm_id *rdma_create_id(rdma_cm_event_handler event_handler,
INIT_LIST_HEAD(&id_priv->listen_list);
INIT_LIST_HEAD(&id_priv->mc_list);
get_random_bytes(&id_priv->seq_num, sizeof id_priv->seq_num);
+   id_priv->id.route.addr.dev_addr.net = &init

[PATCH v7 0/4] Add network namespace support in the RDMA-CM

2015-10-22 Thread Haggai Eran
Hi Doug,

I've rebased the network namespaces patches over your 4.4 tree.

Regards,
Haggai

Changes from v6:
- rebased over k.o/for-4.4
- use init_net when no netdev is found (RoCE and AF_IB)

Changes from v5:
- removed patches that got in as part of the cleanup series.

RDMA-CM uses IP based addressing and routing to setup RDMA connections between
hosts. Currently, all of the IP interfaces and addresses used by the RDMA-CM
must reside in the init_net namespace. This restricts the usage of containers
with RDMA to only work with host network namespace (aka the kernel init_net NS
instance).

This patchset allows using network namespaces with the RDMA-CM.

Each RDMA-CM id keeps a reference to a network namespace.

This reference is based on the process network namespace at the time of the
creation of the object or inherited from the listener.

This network namespace is used to perform all IP and network related
operations. Specifically, the local device lookup, as well as the remote GID
address resolution are done in the context of the RDMA-CM object's namespace.
This allows outgoing connections to reach the right target, even if the same
IP address exists in multiple network namespaces. This can happen if each
network namespace resides on a different P_Key.

Additionally, the network namespace is used to split the listener port space
tables. From the user point of view, each network namespace has a unique,
completely independent tables for its port spaces. This allows running multiple
instances of a single service on the same machine, using containers. 

The functionality introduced by this series would come into play when the
transport is InfiniBand and IPoIB interfaces are assigned to each namespace.
Multiple IPoIB interfaces can be created and assigned to different RDMA-CM
capable containers, for example using pipework [1].

The patches apply against Doug's for-4.4 tree.

The patchset is structured as follows:

Patch 1 is a relatively trivial API extension, requiring the callers
of certain ib_addr functions to provide a network namespace, as needed.

Patches 2-4 add proper namespace support to the RDMA-CM module. This
includes adding multiple port space tables, adding a network namespace
parameter, and finally retrieving the namespace from the creating process.

[1] https://github.com/jpetazzo/pipework/

Guy Shapiro (3):
  IB/addr: Pass network namespace as a parameter
  IB/cma: Add support for network namespaces
  IB/ucma: Take the network namespace from the process

Haggai Eran (1):
  IB/cma: Separate port allocation to network namespaces

 drivers/infiniband/core/addr.c |  17 +--
 drivers/infiniband/core/cma.c  | 129 +++--
 drivers/infiniband/core/ucma.c |   4 +-
 drivers/infiniband/ulp/iser/iser_verbs.c   |   2 +-
 drivers/infiniband/ulp/isert/ib_isert.c|   2 +-
 .../staging/lustre/lnet/klnds/o2iblnd/o2iblnd.h|   4 +-
 include/rdma/ib_addr.h |  16 ++-
 include/rdma/rdma_cm.h |   6 +-
 net/9p/trans_rdma.c|   4 +-
 net/rds/ib.c   |   2 +-
 net/rds/ib_cm.c|   2 +-
 net/rds/iw.c   |   2 +-
 net/rds/iw_cm.c|   2 +-
 net/rds/rdma_transport.c   |   4 +-
 net/sunrpc/xprtrdma/svc_rdma_transport.c   |   4 +-
 net/sunrpc/xprtrdma/verbs.c|   3 +-
 16 files changed, 142 insertions(+), 61 deletions(-)

-- 
1.7.11.2



Re: [PATCH v6 0/4] Add network namespace support in the RDMA-CM

2015-10-21 Thread Haggai Eran
On 22/10/2015 06:37, Doug Ledford wrote:
> On 08/27/2015 10:29 AM, Haggai Eran wrote:
>> Hi,
>>
>> Now that the code for demuxing requests is inside rdma_cm, here are the patches
>> to add InfiniBand network namespace again.
>>
>> Changes from v5:
>> - removed patches that got in as part of the cleanup series.
> 
> Hi Haggai,
> 
> This series no longer applies cleanly after all of the cm changes that
> made it into 4.3.  Can you please rebase and resend?

Sure, I'll do that.



Re: [PATCH v3 for-next 1/7] IB/core: Extend ib_uverbs_create_qp

2015-10-21 Thread Haggai Eran
On 21/10/2015 17:00, Eran Ben Elisha wrote:
> ib_uverbs_ex_create_qp follows the extension verbs
> mechanism. New features (for example, QP creation flags
> field which is added in a downstream patch) could used
> via user-space libraries without breaking the ABI.
> 
> Signed-off-by: Eran Ben Elisha 
> ---
> Hi Doug,
> This is a fix for the first patch of "Add support for multicast loopback
> prevention to mlx4"
> Changes from v2:
>   remove comp_mask check in create_qp based on Haggai's comment
> 
>  drivers/infiniband/core/uverbs.h  |   1 +
>  drivers/infiniband/core/uverbs_cmd.c  | 254 +-
>  drivers/infiniband/core/uverbs_main.c |   1 +
>  include/uapi/rdma/ib_user_verbs.h |  26 
>  4 files changed, 217 insertions(+), 65 deletions(-)
> 

Reviewed-by: Haggai Eran 


Re: [PATCH v2 for-next 1/7] IB/core: Extend ib_uverbs_create_qp

2015-10-21 Thread Haggai Eran
On 21/10/2015 16:46, eran ben elisha wrote:
>>> +ssize_t ib_uverbs_create_qp(struct ib_uverbs_file *file,
>>> >> + struct ib_device *ib_dev,
>>> >> + const char __user *buf, int in_len,
>>> >> + int out_len)
>>> >> +{
>>> >> + struct ib_uverbs_create_qp  cmd;
>>> >> + struct ib_uverbs_ex_create_qp   cmd_ex;
>>> >> + struct ib_udata ucore;
>>> >> + struct ib_udata uhw;
>>> >> + ssize_t resp_size = sizeof(struct ib_uverbs_create_qp_resp);
>>> >> + int err;
>> >
>> > I would expect a check here that in_len >= sizeof(cmd). But I see the
>> > previous code didn't have it either. Any reason adding the check would
>> > break user-space?
> This patch just refactor in ib_uverbs_create_qp and doesn't change any
> of it's logic or fix any bug. we can consider such a fix for the
> future.

Fair enough.

Haggai


Re: [PATCH] IB/cma: Use inner P_Key to determine netdev

2015-10-21 Thread Haggai Eran
On 20/10/2015 19:44, Jason Gunthorpe wrote:
> On Tue, Oct 20, 2015 at 09:45:27AM +0300, Haggai Eran wrote:
>> On 19/10/2015 21:19, Jason Gunthorpe wrote:
>>> On Mon, Oct 19, 2015 at 09:09:25PM +0300, Haggai Eran wrote:
>>>> When discussing the patches to demux ids in rdma_cm instead of ib_cm, it
>>>> was decided that it is best to use the P_Key value in the packet headers
>>>> [1]. However, some drivers are currently unable to send correct P_Key in
>>>> GMP headers.
>>>
>>> You should explicitly describe the broken drivers in the commit text.
>> These are mlx5 and ipath. I'll add them to the commit message.
> 
> And ipath is actually ipath, the obsolete driver being removed, not
> qib? (ie we don't care about it?)
Right. qib is fine.

>> The remaining issue is that it doesn't respect the
>> ib_send_wr.ud.pkey_index field when sending.
> 
> But this is a very serious bug, to the point the mis-labeled packets
> may not even be delivered in some cases - you really care about the
> sub case where mis-labeled packets happen to be deliverable but don't
> parse right?
I understand the issue in mlx5 is serious but we've lived with it a long
time. In this patch I only wanted to work around it for the purpose of
fixing the regression introduced in 4.3.

> Well, don't forget to undo this patch when the mlx5 driver is fixed..
Of course.

Thanks,
Haggai


Re: [PATCH v2 for-next 1/7] IB/core: Extend ib_uverbs_create_qp

2015-10-21 Thread Haggai Eran
On 21/10/2015 12:53, Sagi Grimberg wrote:
> On 10/15/2015 2:44 PM, Eran Ben Elisha wrote:
>> ib_uverbs_ex_create_qp follows the extension verbs
>> mechanism. New features (for example, QP creation flags
>> field which is added in a downstream patch) could be used
>> via user-space libraries without breaking the ABI.
> 
> This is an important addition, I'll need it soon for stuff
> I'm working on...
> 
>> +struct ib_uverbs_ex_create_qp {
>> +__u64 user_handle;
>> +__u32 pd_handle;
>> +__u32 send_cq_handle;
>> +__u32 recv_cq_handle;
>> +__u32 srq_handle;
>> +__u32 max_send_wr;
>> +__u32 max_recv_wr;
>> +__u32 max_send_sge;
>> +__u32 max_recv_sge;
>> +__u32 max_inline_data;
>> +__u8  sq_sig_all;
>> +__u8  qp_type;
>> +__u8  is_srq;
>> +__u8 reserved;
>> +__u32 comp_mask;
>> +__u32 create_flags;
> 
> Can we please make create_flags u64 to begin with?
Note that the size of the struct must be a round number of 64-bit words
to avoid different sizeof values on 32-bit systems. If you change the
size of create_flags you'll need another 32-bit padding field in the struct.



Re: [PATCH v2 for-next 2/7] IB/core: Allow setting create flags in QP init attribute

2015-10-21 Thread Haggai Eran
On 15/10/2015 14:44, Eran Ben Elisha wrote:
> Allow setting IB_QP_CREATE_BLOCK_MULTICAST_LOOPBACK at create_flags in
> ib_uverbs_create_qp_ex.
> 
> Signed-off-by: Eran Ben Elisha 
> ---
>  drivers/infiniband/core/uverbs_cmd.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/infiniband/core/uverbs_cmd.c b/drivers/infiniband/core/uverbs_cmd.c
> index e795d59..e9bafa3 100644
> --- a/drivers/infiniband/core/uverbs_cmd.c
> +++ b/drivers/infiniband/core/uverbs_cmd.c
> @@ -1843,7 +1843,7 @@ static int create_qp(struct ib_uverbs_file *file,
> sizeof(cmd->create_flags))
>   attr.create_flags = cmd->create_flags;
>  
> - if (attr.create_flags) {
> + if (attr.create_flags & ~IB_QP_CREATE_BLOCK_MULTICAST_LOOPBACK) {
>   ret = -EINVAL;
>   goto err_put;
>   }
> 

FWIW

Reviewed-by: Haggai Eran 


Re: [PATCH v2 for-next 1/7] IB/core: Extend ib_uverbs_create_qp

2015-10-21 Thread Haggai Eran
On 15/10/2015 14:44, Eran Ben Elisha wrote:
> ib_uverbs_ex_create_qp follows the extension verbs
> mechanism. New features (for example, QP creation flags
> field which is added in a downstream patch) could be used
> via user-space libraries without breaking the ABI.
> 
> Signed-off-by: Eran Ben Elisha 
> ---

> diff --git a/drivers/infiniband/core/uverbs_cmd.c b/drivers/infiniband/core/uverbs_cmd.c
> index be4cb9f..e795d59 100644
> --- a/drivers/infiniband/core/uverbs_cmd.c
> +++ b/drivers/infiniband/core/uverbs_cmd.c
> @@ -1741,66 +1741,65 @@ ssize_t ib_uverbs_destroy_cq(struct ib_uverbs_file *file,
>   return in_len;
>  }
>  
> -ssize_t ib_uverbs_create_qp(struct ib_uverbs_file *file,
> - struct ib_device *ib_dev,
> - const char __user *buf, int in_len,
> - int out_len)
> -{
> - struct ib_uverbs_create_qp  cmd;
> - struct ib_uverbs_create_qp_resp resp;
> - struct ib_udata udata;
> - struct ib_uqp_object   *obj;
> - struct ib_device   *device;
> - struct ib_pd   *pd = NULL;
> - struct ib_xrcd *xrcd = NULL;
> - struct ib_uobject  *uninitialized_var(xrcd_uobj);
> - struct ib_cq   *scq = NULL, *rcq = NULL;
> - struct ib_srq  *srq = NULL;
> - struct ib_qp   *qp;
> - struct ib_qp_init_attr  attr;
> - int ret;
> -
> - if (out_len < sizeof resp)
> - return -ENOSPC;
> -
> - if (copy_from_user(&cmd, buf, sizeof cmd))
> - return -EFAULT;
> +static int create_qp(struct ib_uverbs_file *file,
> +  struct ib_udata *ucore,
> +  struct ib_udata *uhw,
> +  struct ib_uverbs_ex_create_qp *cmd,
> +  size_t cmd_sz,
> +  int (*cb)(struct ib_uverbs_file *file,
> +struct ib_uverbs_ex_create_qp_resp *resp,
> +struct ib_udata *udata),
> +  void *context)
> +{
> + struct ib_uqp_object*obj;
> + struct ib_device*device;
> + struct ib_pd*pd = NULL;
> + struct ib_xrcd  *xrcd = NULL;
> + struct ib_uobject   *uninitialized_var(xrcd_uobj);
> + struct ib_cq*scq = NULL, *rcq = NULL;
> + struct ib_srq   *srq = NULL;
> + struct ib_qp*qp;
> + char*buf;
> + struct ib_qp_init_attr  attr;
> + struct ib_uverbs_ex_create_qp_resp resp;
> + int ret;
>  
> - if (cmd.qp_type == IB_QPT_RAW_PACKET && !capable(CAP_NET_RAW))
> + if (cmd->qp_type == IB_QPT_RAW_PACKET && !capable(CAP_NET_RAW))
>   return -EPERM;
>  
> - INIT_UDATA(&udata, buf + sizeof cmd,
> -(unsigned long) cmd.response + sizeof resp,
> -in_len - sizeof cmd, out_len - sizeof resp);
> -
>   obj = kzalloc(sizeof *obj, GFP_KERNEL);
>   if (!obj)
>   return -ENOMEM;
>  
> - init_uobj(&obj->uevent.uobject, cmd.user_handle, file->ucontext, &qp_lock_class);
> + init_uobj(&obj->uevent.uobject, cmd->user_handle, file->ucontext,
> +   &qp_lock_class);
>   down_write(&obj->uevent.uobject.mutex);
>  
> - if (cmd.qp_type == IB_QPT_XRC_TGT) {
> - xrcd = idr_read_xrcd(cmd.pd_handle, file->ucontext, &xrcd_uobj);
> + if (cmd->qp_type == IB_QPT_XRC_TGT) {
> + xrcd = idr_read_xrcd(cmd->pd_handle, file->ucontext,
> +  &xrcd_uobj);
>   if (!xrcd) {
>   ret = -EINVAL;
>   goto err_put;
>   }
>   device = xrcd->device;
>   } else {
> - if (cmd.qp_type == IB_QPT_XRC_INI) {
> - cmd.max_recv_wr = cmd.max_recv_sge = 0;
> + if (cmd->qp_type == IB_QPT_XRC_INI) {
> + cmd->max_recv_wr = 0;
> + cmd->max_recv_sge = 0;
>   } else {
> - if (cmd.is_srq) {
> - srq = idr_read_srq(cmd.srq_handle, file->ucontext);
> + if (cmd->is_srq) {
> + srq = idr_read_srq(cmd->srq_handle,
> +file->ucontext);
>   if (!srq || srq->srq_type != IB_SRQT_BASIC) {
>   ret = -EINVAL;
>   goto err_put;
>   }
>   }
>  
> - if (cmd.recv_cq_handle != cmd.send_cq_handle) {
> - rcq = idr_read_cq(cmd.recv_cq_handle, file->ucontext, 0);
> + if (cmd->recv_cq_handle != cmd->send

Re: [PATCH v1] IB/cma: Use inner P_Key to determine netdev

2015-10-20 Thread Haggai Eran
On 20/10/2015 10:20, Doug Ledford wrote:
> On 10/20/2015 02:53 AM, Haggai Eran wrote:
>> When discussing the patches to demux ids in rdma_cm instead of ib_cm, it
>> was decided that it is best to use the P_Key value in the packet headers.
>> However, the mlx5 and ipath drivers are currently unable to send correct
>> P_Key values in GMP headers. They always send using a single P_Key that is
>> set during the GSI QP initialization.
>>
>> Change the rdma_cm code to look at the P_Key value that is part of the
>> packet payload as a workaround. Once the drivers are fixed this patch can
>> be reverted.
>>
>> Fixes: 4c21b5bcef73 ("IB/cma: Add net_dev and private data checks to RDMA CM")
>> Signed-off-by: Haggai Eran 
>> ---
>> Changes from v0:
>> - improve commit message
>>
>>  drivers/infiniband/core/cma.c | 4 ++--
>>  1 file changed, 2 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c
>> index 59a2dafc8c57..e8324543e085 100644
>> --- a/drivers/infiniband/core/cma.c
>> +++ b/drivers/infiniband/core/cma.c
>> @@ -1067,14 +1067,14 @@ static int cma_save_req_info(const struct ib_cm_event *ib_event,
>> sizeof(req->local_gid));
>>  req->has_gid= true;
>>  req->service_id = req_param->primary_path->service_id;
>> -req->pkey   = req_param->bth_pkey;
>> +req->pkey   = be16_to_cpu(req_param->primary_path->pkey);
>>  break;
>>  case IB_CM_SIDR_REQ_RECEIVED:
>>  req->device = sidr_param->listen_id->device;
>>  req->port   = sidr_param->port;
>>  req->has_gid= false;
>>  req->service_id = sidr_param->service_id;
>> -req->pkey   = sidr_param->bth_pkey;
>> +req->pkey   = sidr_param->pkey;
>>  break;
>>  default:
>>  return -EINVAL;
>>
> 
> And, to be clear, you are looking for this to be part of 4.3-rc updates,
> yes?

Yes, the issue was introduced in 4.3 in my cma demux patch.

Haggai



[PATCH v1] IB/cma: Use inner P_Key to determine netdev

2015-10-19 Thread Haggai Eran
When discussing the patches to demux ids in rdma_cm instead of ib_cm, it
was decided that it is best to use the P_Key value in the packet headers.
However, the mlx5 and ipath drivers are currently unable to send correct
P_Key values in GMP headers. They always send using a single P_Key that is
set during the GSI QP initialization.

Change the rdma_cm code to look at the P_Key value that is part of the
packet payload as a workaround. Once the drivers are fixed this patch can
be reverted.

Fixes: 4c21b5bcef73 ("IB/cma: Add net_dev and private data checks to RDMA CM")
Signed-off-by: Haggai Eran 
---
Changes from v0:
- improve commit message

 drivers/infiniband/core/cma.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c
index 59a2dafc8c57..e8324543e085 100644
--- a/drivers/infiniband/core/cma.c
+++ b/drivers/infiniband/core/cma.c
@@ -1067,14 +1067,14 @@ static int cma_save_req_info(const struct ib_cm_event *ib_event,
   sizeof(req->local_gid));
req->has_gid= true;
req->service_id = req_param->primary_path->service_id;
-   req->pkey   = req_param->bth_pkey;
+   req->pkey   = be16_to_cpu(req_param->primary_path->pkey);
break;
case IB_CM_SIDR_REQ_RECEIVED:
req->device = sidr_param->listen_id->device;
req->port   = sidr_param->port;
req->has_gid= false;
req->service_id = sidr_param->service_id;
-   req->pkey   = sidr_param->bth_pkey;
+   req->pkey   = sidr_param->pkey;
break;
default:
return -EINVAL;
-- 
1.7.11.2



Re: [PATCH] IB/cma: Use inner P_Key to determine netdev

2015-10-19 Thread Haggai Eran
On 19/10/2015 21:19, Jason Gunthorpe wrote:
> On Mon, Oct 19, 2015 at 09:09:25PM +0300, Haggai Eran wrote:
>> When discussing the patches to demux ids in rdma_cm instead of ib_cm, it
>> was decided that it is best to use the P_Key value in the packet headers
>> [1]. However, some drivers are currently unable to send correct P_Key in
>> GMP headers.
> 
> You should explicitly describe the broken drivers in the commit text.
These are mlx5 and ipath. I'll add them to the commit message.

> I thought mlx5 was fixed for receive already? I'm confused why we need
> this.
mlx5 had two issues related to GSI pkeys. The issue that was fixed was
that it treated the pkey value returned by the hardware in receive
completions as a pkey_index. The remaining issue is that it doesn't
respect the ib_send_wr.ud.pkey_index field when sending.

With the current state of things, cma will try to look for an ipoib net
dev matching the BTH pkey of the request, but if the sender is mlx5 or
ipath, the BTH pkey would be the default pkey. If the request was
intended for a different pkey, cma won't find a matching netdev and will
throw away the request.

Haggai


[PATCH] IB/cma: Use inner P_Key to determine netdev

2015-10-19 Thread Haggai Eran
When discussing the patches to demux ids in rdma_cm instead of ib_cm, it
was decided that it is best to use the P_Key value in the packet headers
[1]. However, some drivers are currently unable to send correct P_Key in
GMP headers.

Change the rdma_cm code to look at the P_Key value that is part of the
packet payload as a workaround. Once the drivers are fixed this patch can
be reverted.

Fixes: 4c21b5bcef73 ("IB/cma: Add net_dev and private data checks to RDMA CM")
Signed-off-by: Haggai Eran 
---
 drivers/infiniband/core/cma.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c
index 59a2dafc8c57..e8324543e085 100644
--- a/drivers/infiniband/core/cma.c
+++ b/drivers/infiniband/core/cma.c
@@ -1067,14 +1067,14 @@ static int cma_save_req_info(const struct ib_cm_event *ib_event,
   sizeof(req->local_gid));
req->has_gid= true;
req->service_id = req_param->primary_path->service_id;
-   req->pkey   = req_param->bth_pkey;
+   req->pkey   = be16_to_cpu(req_param->primary_path->pkey);
break;
case IB_CM_SIDR_REQ_RECEIVED:
req->device = sidr_param->listen_id->device;
req->port   = sidr_param->port;
req->has_gid= false;
req->service_id = sidr_param->service_id;
-   req->pkey   = sidr_param->bth_pkey;
+   req->pkey   = sidr_param->pkey;
break;
default:
return -EINVAL;
-- 
1.7.11.2



Re: [PATCH 0/6] IB/mad: Support devices taking pkey_index from the GSI QP

2015-10-19 Thread Haggai Eran
On 10/14/2015 08:54 PM, Jason Gunthorpe wrote:
> What this series is doing (src QPN != 1) is absolutely not compliant
> with the spec.

Hal convinced me that the spec implicitly requires that. Originally I
thought that it was allowed, based on this paragraph from Section 13.5.1
(MAD Interfaces):
> Note that it is not required by IBA that GS managers use QP1 as the
> source QP used to send management packets to GS agents. GS managers
> may send packets from any QP other than QP0. QP1's primary purpose
> is to be a known QP target to which GS managers can send packets
> to initially contact a GS agent on a node.

On 10/14/2015 08:54 PM, Jason Gunthorpe wrote:
> If hardware doesn't have the ability to set the pkey on outbound, then
> it can only support 1 pkey. This may be why reading other parts of the
> spec is confusing. pkey_index in the verbs section is optional, but
> without it an implementation cannot support multiple pkeys. Thus when
> multiple pkeys are supported it is not optional at all.
I wouldn't say that pkey_index is optional. If you look at section
11.4.1.1 (Post Send Request), there is simply no mention of a pkey index
in the input modifier list.

> Put the hack in the driver, and obsolete it when the hardware is fixed
> to follow the spec. Even better would be to fix this in firmware and
> leave the kernel alone.
We will move the code to create multiple QPs to the driver, and make
sure it uses SQPN == 1 on all created QPs. Note that a side effect will
be that MADs could be sent out of order (if they are put on different
QPs); however, we would make sure that completions are in the right order.

> FWIW, IMHO, no device that works like this should be a candidate for
> the IBTA Interop Logo.
I would expect interop tests to test the BTH pkey, but my guess is that
it isn't tested today. That would explain how both the mlx5 and the
ipath drivers have been able to hide this issue so far.

I think most of the kernel code only looks at the BTH pkey to decide on
what pkey to send a response. It is only with our cma demux patches that
we started to look a the BTH pkey field for validating a connection, and
so the issue surfaced.

Since the change to the driver will take some time, I suggest that in
order to fix the issue in 4.3 we change cma to look at the CM request
payload pkey and not at the BTH. After mlx5 and ipath are fixed we can
change cma back again.

Regards,
Haggai


[PATCH 5/6] IB/mad: Create multiple QPs for supporting different P_Keys

2015-10-14 Thread Haggai Eran
On devices that do not support sending different P_Keys on a single GSI QP,
create a different UD QP for each non-zero P_Key in the P_Key table. When
transmitting a MAD with a specified P_Key, send it using the right QP.
Receive packets from all the QPs through the common SRQ.

Signed-off-by: Haggai Eran 
---
 drivers/infiniband/core/mad.c  | 404 +++--
 drivers/infiniband/core/mad_priv.h |   4 +-
 2 files changed, 299 insertions(+), 109 deletions(-)

diff --git a/drivers/infiniband/core/mad.c b/drivers/infiniband/core/mad.c
index 2d4457239908..02977942574c 100644
--- a/drivers/infiniband/core/mad.c
+++ b/drivers/infiniband/core/mad.c
@@ -84,6 +84,8 @@ static int add_nonoui_reg_req(struct ib_mad_reg_req *mad_reg_req,
  u8 mgmt_class);
 static int add_oui_reg_req(struct ib_mad_reg_req *mad_reg_req,
   struct ib_mad_agent_private *agent_priv);
+static void report_mad_completion(struct ib_mad_qp_info *qp_info,
+ struct ib_mad_send_wr_private *mad_send_wr);
 
 /*
  * Returns a ib_mad_port_private structure or NULL for a device/port
@@ -353,7 +355,7 @@ struct ib_mad_agent *ib_register_mad_agent(struct ib_device *device,
mad_agent_priv->agent.recv_handler = recv_handler;
mad_agent_priv->agent.send_handler = send_handler;
mad_agent_priv->agent.context = context;
-   mad_agent_priv->agent.pd = port_priv->qp_info[qpn].qp->pd;
+   mad_agent_priv->agent.pd = port_priv->pd;
mad_agent_priv->agent.port_num = port_num;
mad_agent_priv->agent.flags = registration_flags;
spin_lock_init(&mad_agent_priv->lock);
@@ -518,7 +520,7 @@ struct ib_mad_agent *ib_register_mad_snoop(struct ib_device *device,
mad_snoop_priv->agent.recv_handler = recv_handler;
mad_snoop_priv->agent.snoop_handler = snoop_handler;
mad_snoop_priv->agent.context = context;
-   mad_snoop_priv->agent.pd = port_priv->qp_info[qpn].qp->pd;
+   mad_snoop_priv->agent.pd = port_priv->pd;
mad_snoop_priv->agent.port_num = port_num;
mad_snoop_priv->mad_snoop_flags = mad_snoop_flags;
init_completion(&mad_snoop_priv->comp);
@@ -829,10 +831,12 @@ static int handle_outgoing_dr_smp(struct ib_mad_agent_private *mad_agent_priv,
goto out;
}
 
-   build_smp_wc(mad_agent_priv->qp_info->qp,
+   read_lock_irqsave(&mad_agent_priv->qp_info->qp_lock, flags);
+   build_smp_wc(mad_agent_priv->qp_info->qp[0],
 send_wr->wr_id, drslid,
 send_wr->wr.ud.pkey_index,
 send_wr->wr.ud.port_num, &mad_wc);
+   read_unlock_irqrestore(&mad_agent_priv->qp_info->qp_lock, flags);
 
if (opa && smp->base_version == OPA_MGMT_BASE_VERSION) {
mad_wc.byte_len = mad_send_wr->send_buf.hdr_len
@@ -1141,6 +1145,46 @@ void ib_free_send_mad(struct ib_mad_send_buf *send_buf)
 }
 EXPORT_SYMBOL(ib_free_send_mad);
 
+static int ib_mad_post_send(struct ib_mad_qp_info *qp_info,
+   struct ib_mad_send_wr_private *mad_send_wr,
+   struct ib_send_wr **bad_send_wr)
+{
+   struct ib_qp *qp;
+   u16 pkey_index;
+   unsigned long flags;
+   struct ib_wc;
+   int ret;
+
+   read_lock_irqsave(&qp_info->qp_lock, flags);
+
+   if (!qp_info->srq)
+   pkey_index = 0;
+   else
+   pkey_index = mad_send_wr->send_wr.wr.ud.pkey_index;
+
+   if (pkey_index >= qp_info->num_qps)
+   goto error;
+
+   qp = qp_info->qp[pkey_index];
+   if (!qp)
+   goto error;
+
+   ret = ib_post_send(qp, &mad_send_wr->send_wr, bad_send_wr);
+   read_unlock_irqrestore(&qp_info->qp_lock, flags);
+
+   return ret;
+
+error:
+   read_unlock_irqrestore(&qp_info->qp_lock, flags);
+   dev_dbg(&qp_info->port_priv->device->dev,
+   "ib_mad: failed to send MAD with pkey_index=%d. dropping.\n",
+   pkey_index);
+
+   report_mad_completion(qp_info, mad_send_wr);
+
+   return 0;
+}
+
 int ib_send_mad(struct ib_mad_send_wr_private *mad_send_wr)
 {
struct ib_mad_qp_info *qp_info;
@@ -1183,8 +1227,7 @@ int ib_send_mad(struct ib_mad_send_wr_private *mad_send_wr)
 
spin_lock_irqsave(&qp_info->send_queue.lock, flags);
if (qp_info->send_queue.count < qp_info->send_queue.max_active) {
-   ret = ib_post_send(qp_info->qp, &mad_send_wr->send_wr,
-  &bad_send_wr);
+   ret = ib_mad_post_send(qp_info, mad_send_wr, &bad_send_wr);
list = &qp_info->send_queue.list;
} else {
ret = 0;
@

[PATCH 1/6] IB/mad: Use helpers to get ib_device and ib_pd from ib_mad_agent

2015-10-14 Thread Haggai Eran
ib_mad_agent currently exposes an ib_qp and an ib_device unnecessarily.
Replace these fields with a single ib_pd, and use helper functions to get
the device and pd instead of accessing the fields directly.

Signed-off-by: Haggai Eran 
---
 drivers/infiniband/core/agent.c |  4 +-
 drivers/infiniband/core/cm.c|  5 ++-
 drivers/infiniband/core/mad.c   | 69 ++---
 drivers/infiniband/core/mad_rmpp.c  |  4 +-
 drivers/infiniband/core/sa_query.c  | 10 +++--
 drivers/infiniband/core/user_mad.c  |  4 +-
 drivers/infiniband/hw/mlx4/mad.c|  2 +-
 drivers/infiniband/hw/mthca/mthca_mad.c |  2 +-
 drivers/infiniband/ulp/srpt/ib_srpt.c   |  2 +-
 include/rdma/ib_mad.h   | 16 ++--
 10 files changed, 67 insertions(+), 51 deletions(-)

diff --git a/drivers/infiniband/core/agent.c b/drivers/infiniband/core/agent.c
index 0429040304fd..444279ae3827 100644
--- a/drivers/infiniband/core/agent.c
+++ b/drivers/infiniband/core/agent.c
@@ -59,7 +59,7 @@ __ib_get_agent_port(const struct ib_device *device, int port_num)
struct ib_agent_port_private *entry;
 
list_for_each_entry(entry, &ib_agent_port_list, port_list) {
-   if (entry->agent[1]->device == device &&
+   if (ib_mad_agent_device(entry->agent[1]) == device &&
entry->agent[1]->port_num == port_num)
return entry;
}
@@ -99,7 +99,7 @@ void agent_send_response(const struct ib_mad_hdr *mad_hdr, const struct ib_grh *
}
 
agent = port_priv->agent[qpn];
-   ah = ib_create_ah_from_wc(agent->qp->pd, wc, grh, port_num);
+   ah = ib_create_ah_from_wc(ib_mad_agent_pd(agent), wc, grh, port_num);
if (IS_ERR(ah)) {
dev_err(&device->dev, "ib_create_ah_from_wc error %ld\n",
PTR_ERR(ah));
diff --git a/drivers/infiniband/core/cm.c b/drivers/infiniband/core/cm.c
index ea4db9c1d44f..c6150c5b6ada 100644
--- a/drivers/infiniband/core/cm.c
+++ b/drivers/infiniband/core/cm.c
@@ -263,7 +263,7 @@ static int cm_alloc_msg(struct cm_id_private *cm_id_priv,
struct ib_ah *ah;
 
mad_agent = cm_id_priv->av.port->mad_agent;
-   ah = ib_create_ah(mad_agent->qp->pd, &cm_id_priv->av.ah_attr);
+   ah = ib_create_ah(ib_mad_agent_pd(mad_agent), &cm_id_priv->av.ah_attr);
if (IS_ERR(ah))
return PTR_ERR(ah);
 
@@ -294,7 +294,8 @@ static int cm_alloc_response_msg(struct cm_port *port,
struct ib_mad_send_buf *m;
struct ib_ah *ah;
 
-   ah = ib_create_ah_from_wc(port->mad_agent->qp->pd, mad_recv_wc->wc,
+   ah = ib_create_ah_from_wc(ib_mad_agent_pd(port->mad_agent),
+ mad_recv_wc->wc,
  mad_recv_wc->recv_buf.grh, port->port_num);
if (IS_ERR(ah))
return PTR_ERR(ah);
diff --git a/drivers/infiniband/core/mad.c b/drivers/infiniband/core/mad.c
index 4b5c72311deb..62ce3a4c20b7 100644
--- a/drivers/infiniband/core/mad.c
+++ b/drivers/infiniband/core/mad.c
@@ -350,11 +350,10 @@ struct ib_mad_agent *ib_register_mad_agent(struct ib_device *device,
mad_agent_priv->qp_info = &port_priv->qp_info[qpn];
mad_agent_priv->reg_req = reg_req;
mad_agent_priv->agent.rmpp_version = rmpp_version;
-   mad_agent_priv->agent.device = device;
mad_agent_priv->agent.recv_handler = recv_handler;
mad_agent_priv->agent.send_handler = send_handler;
mad_agent_priv->agent.context = context;
-   mad_agent_priv->agent.qp = port_priv->qp_info[qpn].qp;
+   mad_agent_priv->agent.pd = port_priv->qp_info[qpn].qp->pd;
mad_agent_priv->agent.port_num = port_num;
mad_agent_priv->agent.flags = registration_flags;
spin_lock_init(&mad_agent_priv->lock);
@@ -516,11 +515,10 @@ struct ib_mad_agent *ib_register_mad_snoop(struct ib_device *device,
 
/* Now, fill in the various structures */
mad_snoop_priv->qp_info = &port_priv->qp_info[qpn];
-   mad_snoop_priv->agent.device = device;
mad_snoop_priv->agent.recv_handler = recv_handler;
mad_snoop_priv->agent.snoop_handler = snoop_handler;
mad_snoop_priv->agent.context = context;
-   mad_snoop_priv->agent.qp = port_priv->qp_info[qpn].qp;
+   mad_snoop_priv->agent.pd = port_priv->qp_info[qpn].qp->pd;
mad_snoop_priv->agent.port_num = port_num;
mad_snoop_priv->mad_snoop_flags = mad_snoop_flags;
init_completion(&mad_snoop_priv->comp);
@@ -749,7 +747,7 @@ static int handle_outgoing_dr_smp(struct ib_mad_agent_private *mad_agent_priv,
struct ib_mad_private *mad_priv;
struct ib_mad_port_private *port_priv;
struct ib_mad_agent_p

[PATCH 6/6] IB/mad: P_Key change event handler

2015-10-14 Thread Haggai Eran
Add a device event handler to capture P_Key table change events. For
devices that don't support setting the P_Key index per work request, update
the per-P_Key QP table in the MAD layer, creating QPs as
needed.

The code currently doesn't destroy created QPs when their pkeys are
cleared. This can be added later on.

Signed-off-by: Haggai Eran 
---
 drivers/infiniband/core/mad.c  | 51 --
 drivers/infiniband/core/mad_priv.h |  2 ++
 2 files changed, 51 insertions(+), 2 deletions(-)

diff --git a/drivers/infiniband/core/mad.c b/drivers/infiniband/core/mad.c
index 02977942574c..a350b4117cb3 100644
--- a/drivers/infiniband/core/mad.c
+++ b/drivers/infiniband/core/mad.c
@@ -3153,6 +3153,28 @@ static void srq_event_handler(struct ib_event *event, void *srq_context)
event->event, qp_num);
 }
 
+static void device_event_handler(struct ib_event_handler *handler,
+struct ib_event *event)
+{
+   struct ib_mad_port_private *port_priv =
+   container_of(handler, struct ib_mad_port_private,
+event_handler);
+
+   if (event->element.port_num != port_priv->port_num)
+   return;
+
+   dev_dbg(&port_priv->device->dev, "ib_mad: event %s on port %d\n",
+   ib_event_msg(event->event), event->element.port_num);
+
+   switch (event->event) {
+   case IB_EVENT_PKEY_CHANGE:
+   queue_work(port_priv->wq, &port_priv->pkey_change_work);
+   break;
+   default:
+   break;
+   }
+}
+
 static void init_mad_queue(struct ib_mad_qp_info *qp_info,
   struct ib_mad_queue *mad_queue)
 {
@@ -3306,6 +3328,15 @@ static int update_pkey_table(struct ib_mad_qp_info *qp_info)
return 0;
 }
 
+static void pkey_change_handler(struct work_struct *work)
+{
+   struct ib_mad_port_private *port_priv =
+   container_of(work, struct ib_mad_port_private,
+pkey_change_work);
+
+   update_pkey_table(&port_priv->qp_info[1]);
+}
+
 static void destroy_mad_qp(struct ib_mad_qp_info *qp_info)
 {
u16 qp_index;
@@ -3453,6 +3484,17 @@ static int ib_mad_port_open(struct ib_device *device,
}
INIT_WORK(&port_priv->work, ib_mad_completion_handler);
 
+   if (device->gsi_pkey_index_in_qp) {
+   INIT_WORK(&port_priv->pkey_change_work, pkey_change_handler);
+   INIT_IB_EVENT_HANDLER(&port_priv->event_handler, device,
+ device_event_handler);
+   ret = ib_register_event_handler(&port_priv->event_handler);
+   if (ret) {
+   dev_err(&device->dev, "Unable to register event handler for ib_mad\n");
+   goto error9;
+   }
+   }
+
spin_lock_irqsave(&ib_mad_port_list_lock, flags);
list_add_tail(&port_priv->port_list, &ib_mad_port_list);
spin_unlock_irqrestore(&ib_mad_port_list_lock, flags);
@@ -3460,16 +3502,19 @@ static int ib_mad_port_open(struct ib_device *device,
ret = ib_mad_port_start(port_priv);
if (ret) {
dev_err(&device->dev, "Couldn't start port\n");
-   goto error9;
+   goto error10;
}
 
return 0;
 
-error9:
+error10:
spin_lock_irqsave(&ib_mad_port_list_lock, flags);
list_del_init(&port_priv->port_list);
spin_unlock_irqrestore(&ib_mad_port_list_lock, flags);
 
+   if (device->gsi_pkey_index_in_qp)
+   ib_unregister_event_handler(&port_priv->event_handler);
+error9:
destroy_workqueue(port_priv->wq);
 error8:
destroy_mad_qp(&port_priv->qp_info[1]);
@@ -3507,6 +3552,8 @@ static int ib_mad_port_close(struct ib_device *device, int port_num)
list_del_init(&port_priv->port_list);
spin_unlock_irqrestore(&ib_mad_port_list_lock, flags);
 
+   if (device->gsi_pkey_index_in_qp)
+   ib_unregister_event_handler(&port_priv->event_handler);
destroy_workqueue(port_priv->wq);
destroy_mad_qp(&port_priv->qp_info[1]);
destroy_mad_qp(&port_priv->qp_info[0]);
diff --git a/drivers/infiniband/core/mad_priv.h b/drivers/infiniband/core/mad_priv.h
index 32b9532c7868..ee8003648d8a 100644
--- a/drivers/infiniband/core/mad_priv.h
+++ b/drivers/infiniband/core/mad_priv.h
@@ -211,6 +211,8 @@ struct ib_mad_port_private {
struct workqueue_struct *wq;
struct work_struct work;
struct ib_mad_qp_info qp_info[IB_MAD_QPS_CORE];
+   struct ib_event_handler event_handler;
+   struct work_struct pkey_change_work;
 };
 
 int ib_send_mad(struct ib_mad_send_wr_private *mad_send_wr);
-- 
1.7.11.2



[PATCH 4/6] IB/mad: Use a SRQ for receiving GMPs

2015-10-14 Thread Haggai Eran
As a preparation for supporting multiple transmission QPs for each GSI QP,
add a SRQ that will be used for all the receive buffers of these QPs.

Signed-off-by: Haggai Eran 
---
 drivers/infiniband/core/mad.c  | 58 ++
 drivers/infiniband/core/mad_priv.h |  1 +
 2 files changed, 54 insertions(+), 5 deletions(-)

diff --git a/drivers/infiniband/core/mad.c b/drivers/infiniband/core/mad.c
index 7a1186173179..2d4457239908 100644
--- a/drivers/infiniband/core/mad.c
+++ b/drivers/infiniband/core/mad.c
@@ -2921,7 +2921,12 @@ static int ib_mad_post_receive_mads(struct ib_mad_qp_info *qp_info,
post = (++recv_queue->count < recv_queue->max_active);
list_add_tail(&mad_priv->header.mad_list.list, &recv_queue->list);
spin_unlock_irqrestore(&recv_queue->lock, flags);
-   ret = ib_post_recv(qp_info->qp, &recv_wr, &bad_recv_wr);
+   if (qp_info->srq)
+   ret = ib_post_srq_recv(qp_info->srq, &recv_wr,
+  &bad_recv_wr);
+   else
+   ret = ib_post_recv(qp_info->qp, &recv_wr, &bad_recv_wr);
+
if (ret) {
spin_lock_irqsave(&recv_queue->lock, flags);
list_del(&mad_priv->header.mad_list.list);
@@ -3074,6 +3079,16 @@ static void qp_event_handler(struct ib_event *event, void *qp_context)
event->event, qp_info->qp->qp_num);
 }
 
+static void srq_event_handler(struct ib_event *event, void *srq_context)
+{
+   struct ib_mad_qp_info   *qp_info = srq_context;
+
+   /* We aren't expecting limit reached events, so this must be an error */
+   dev_err(&qp_info->port_priv->device->dev,
+   "Fatal error (%d) on MAD SRQ (QP%d)\n",
+   event->event, qp_info->qp->qp_num);
+}
+
 static void init_mad_queue(struct ib_mad_qp_info *qp_info,
   struct ib_mad_queue *mad_queue)
 {
@@ -3099,19 +3114,45 @@ static void init_mad_qp(struct ib_mad_port_private 
*port_priv,
 static int create_mad_qp(struct ib_mad_qp_info *qp_info,
 enum ib_qp_type qp_type)
 {
+   struct ib_device *device = qp_info->port_priv->device;
+   struct ib_srq_init_attr srq_init_attr;
struct ib_qp_init_attr  qp_init_attr;
+   struct ib_srq *srq = NULL;
+   const bool multiple_qps = qp_type == IB_QPT_GSI &&
+ device->gsi_pkey_index_in_qp;
+
int ret;
 
qp_info->qp_type = qp_type;
 
+   if (multiple_qps) {
+   memset(&srq_init_attr, 0, sizeof(srq_init_attr));
+   srq_init_attr.event_handler = srq_event_handler;
+   srq_init_attr.srq_context = qp_info;
+   srq_init_attr.attr.max_wr = mad_recvq_size;
+   srq_init_attr.attr.max_sge = IB_MAD_RECV_REQ_MAX_SG;
+   srq = ib_create_srq(qp_info->port_priv->pd, &srq_init_attr);
+   if (IS_ERR(srq)) {
+   dev_err(&qp_info->port_priv->device->dev,
+   "Couldn't create ib_mad SRQ for QP%d\n",
+   get_spl_qp_index(qp_type));
+   ret = PTR_ERR(srq);
+   goto error_srq;
+   }
+   }
+   qp_info->srq = srq;
+
memset(&qp_init_attr, 0, sizeof qp_init_attr);
qp_init_attr.send_cq = qp_info->port_priv->cq;
qp_init_attr.recv_cq = qp_info->port_priv->cq;
+   qp_init_attr.srq = srq;
qp_init_attr.sq_sig_type = IB_SIGNAL_ALL_WR;
qp_init_attr.cap.max_send_wr = mad_sendq_size;
-   qp_init_attr.cap.max_recv_wr = mad_recvq_size;
qp_init_attr.cap.max_send_sge = IB_MAD_SEND_REQ_MAX_SG;
-   qp_init_attr.cap.max_recv_sge = IB_MAD_RECV_REQ_MAX_SG;
+   if (!srq) {
+   qp_init_attr.cap.max_recv_wr = mad_recvq_size;
+   qp_init_attr.cap.max_recv_sge = IB_MAD_RECV_REQ_MAX_SG;
+   }
qp_init_attr.qp_type = qp_type;
qp_init_attr.port_num = qp_info->port_priv->port_num;
qp_init_attr.qp_context = qp_info;
@@ -3122,7 +3163,7 @@ static int create_mad_qp(struct ib_mad_qp_info *qp_info,
"Couldn't create ib_mad QP%d\n",
get_spl_qp_index(qp_type));
ret = PTR_ERR(qp_info->qp);
-   goto error;
+   goto error_qp;
}
/* Use minimum queue sizes unless the CQ is resized */
qp_info->send_queue.max_active = mad_sendq_size;
@@ -3130,7 +3171,12 @@ static int create_mad_qp(struct ib_mad_qp_info *qp_info,
qp_info->qp_num = qp_info->qp->qp_num;
return 0;
 
-err

[PATCH 3/6] IB/core: Add capability bit to tell whether per-WR P_Key change in GSI is supported

2015-10-14 Thread Haggai Eran
The IB specs do not require per work-request control over the P_Key used in
GMPs. Most drivers have implemented this feature, but some drivers such as
ipath and mlx5 do not support it.

Add a capability bit, and turn it on in drivers that lack support for
per-WR P_Key.

Cc: Mike Marciniszyn 
Cc: Eli Cohen 
Signed-off-by: Haggai Eran 
---
 drivers/infiniband/hw/mlx5/main.c| 1 +
 drivers/staging/rdma/ipath/ipath_verbs.c | 1 +
 include/rdma/ib_verbs.h  | 2 ++
 3 files changed, 4 insertions(+)

diff --git a/drivers/infiniband/hw/mlx5/main.c 
b/drivers/infiniband/hw/mlx5/main.c
index f1ccd40beae9..2d35e0472215 100644
--- a/drivers/infiniband/hw/mlx5/main.c
+++ b/drivers/infiniband/hw/mlx5/main.c
@@ -1429,6 +1429,7 @@ static void *mlx5_ib_add(struct mlx5_core_dev *mdev)
dev->ib_dev.free_fast_reg_page_list  = mlx5_ib_free_fast_reg_page_list;
dev->ib_dev.check_mr_status = mlx5_ib_check_mr_status;
dev->ib_dev.get_port_immutable  = mlx5_port_immutable;
+   dev->ib_dev.gsi_pkey_index_in_qp = 1;
 
mlx5_ib_internal_fill_odp_caps(dev);
 
diff --git a/drivers/staging/rdma/ipath/ipath_verbs.c 
b/drivers/staging/rdma/ipath/ipath_verbs.c
index ed2bbc2f7eae..2e13a82c1871 100644
--- a/drivers/staging/rdma/ipath/ipath_verbs.c
+++ b/drivers/staging/rdma/ipath/ipath_verbs.c
@@ -2202,6 +2202,7 @@ int ipath_register_ib_device(struct ipath_devdata *dd)
dev->mmap = ipath_mmap;
dev->dma_ops = &ipath_dma_mapping_ops;
dev->get_port_immutable = ipath_port_immutable;
+   dev->gsi_pkey_index_in_qp = 1;
 
snprintf(dev->node_desc, sizeof(dev->node_desc),
 IPATH_IDSTR " %s", init_utsname()->nodename);
diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
index 7845fae6f2df..dc5cd93b3386 100644
--- a/include/rdma/ib_verbs.h
+++ b/include/rdma/ib_verbs.h
@@ -1792,6 +1792,8 @@ struct ib_device {
__be64   node_guid;
u32  local_dma_lkey;
u16  is_switch:1;
+   /* The device reads the pkey_index from the QP in GSI (QP1) QPs. */
+   u16  gsi_pkey_index_in_qp:1;
u8   node_type;
u8   phys_port_cnt;
 
-- 
1.7.11.2



[PATCH 0/6] IB/mad: Support devices taking pkey_index from the GSI QP

2015-10-14 Thread Haggai Eran
Hi,

Patch 4c21b5bcef73 ("IB/cma: Add net_dev and private data checks to RDMA
CM") started checking the BTH P_Key for RDMA CM requests, and this
uncovered some issues with P_Key handling. In mlx5 the pkey_index in ib_wc
wasn't set correctly [1]. In addition, we found out that mlx5 doesn't
respect the pkey_index in ib_send_wr.ud for GSI packets. Apparently having
the pkey_index in a work request isn't required by the IBA specifications,
and so the Connect-IB and ConnectX-4 cards that mlx5 drives don't support
it. The result is that in kernel 4.3, RDMA CM can reject new connections
created using an mlx5 IPoIB interface that uses a non-default P_Key.

The proposed solution here is to change the ib_mad module to create
multiple UD QPs for sending GMPs, when the underlying driver doesn't
support the pkey_index in the work request. A side effect of
this change is that MADs going out with pkey_index != 0 will have a source
QP number != 1. This is allowed by the IBA specifications.

To save resources, the code creates QPs only for pkey table entries that are
non-zero. However, the code doesn't currently destroy QPs when pkey entries are
cleared. This would require reference counting on each QP to work correctly,
and we plan to do that later on.

The patchset is divided as follows. The first two patches add helper
functions and store additional info in ib_mad_qp_info to help eliminate the
dependency on a single ib_qp pointer in ib_mad_qp_info. Patch 3 adds a
capability bit to ib_device to tell whether a device uses the pkey index
from a work request or from the QP. Patch 4 adds SRQ support to the mad
layer to allow receiving MADs designated to any QP belonging to a GSI agent
to the same queue. Patch 5 adds the bulk of the code required to create and
use multiple QPs on the same ib_mad_qp_info struct, and patch 6 adds a
P_Key change event handler to update the table when the SM changes the
available P_Keys.

[1] mlx5: Fix incorrect wc pkey_index assignment for GSI messages
http://www.spinics.net/lists/linux-rdma/msg28374.html

Regards,
Haggai

Haggai Eran (6):
  IB/mad: Use helpers to get ib_device and ib_pd from ib_mad_agent
  IB/mad: Add QP parameters to ib_mad_qp_info
  IB/core: Add capability bit to tell whether per-WR P_Key change in
GSI is supported
  IB/mad: Use a SRQ for receiving GMPs
  IB/mad: Create multiple QPs for supporting different P_Keys
  IB/mad: P_Key change event handler

 drivers/infiniband/core/agent.c  |   4 +-
 drivers/infiniband/core/cm.c |   5 +-
 drivers/infiniband/core/mad.c| 528 ---
 drivers/infiniband/core/mad_priv.h   |   9 +-
 drivers/infiniband/core/mad_rmpp.c   |   4 +-
 drivers/infiniband/core/sa_query.c   |  10 +-
 drivers/infiniband/core/user_mad.c   |   4 +-
 drivers/infiniband/hw/mlx4/mad.c |   2 +-
 drivers/infiniband/hw/mlx5/main.c|   1 +
 drivers/infiniband/hw/mthca/mthca_mad.c  |   2 +-
 drivers/infiniband/ulp/srpt/ib_srpt.c|   2 +-
 drivers/staging/rdma/ipath/ipath_verbs.c |   1 +
 include/rdma/ib_mad.h|  16 +-
 include/rdma/ib_verbs.h  |   2 +
 14 files changed, 452 insertions(+), 138 deletions(-)

-- 
1.7.11.2



[PATCH 2/6] IB/mad: Add QP parameters to ib_mad_qp_info

2015-10-14 Thread Haggai Eran
In preparation for having an array of QPs in each ib_mad_qp_info, add the
qp_type and qp_num parameters to the ib_mad_qp_info struct so that clients
won't need to access the QPs themselves for this information.

Signed-off-by: Haggai Eran 
---
 drivers/infiniband/core/mad.c  | 16 ++--
 drivers/infiniband/core/mad_priv.h |  2 ++
 2 files changed, 12 insertions(+), 6 deletions(-)

diff --git a/drivers/infiniband/core/mad.c b/drivers/infiniband/core/mad.c
index 62ce3a4c20b7..7a1186173179 100644
--- a/drivers/infiniband/core/mad.c
+++ b/drivers/infiniband/core/mad.c
@@ -1799,7 +1799,7 @@ static int validate_mad(const struct ib_mad_hdr *mad_hdr,
bool opa)
 {
int valid = 0;
-   u32 qp_num = qp_info->qp->qp_num;
+   u32 qp_num = qp_info->qp_num;
 
/* Make sure MAD base version is understood */
if (mad_hdr->base_version != IB_MGMT_BASE_VERSION &&
@@ -2054,7 +2054,7 @@ static enum smi_action handle_ib_smi(const struct 
ib_mad_port_private *port_priv
&response->grh, wc,
port_priv->device,
smi_get_fwd_port(smp),
-   qp_info->qp->qp_num,
+   qp_info->qp_num,
response->mad_size,
false);
 
@@ -2142,7 +2142,7 @@ handle_opa_smi(struct ib_mad_port_private *port_priv,
&response->grh, wc,
port_priv->device,
opa_smi_get_fwd_port(smp),
-   qp_info->qp->qp_num,
+   qp_info->qp_num,
recv->header.wc.byte_len,
true);
 
@@ -2264,7 +2264,7 @@ static void ib_mad_recv_done_handler(struct 
ib_mad_port_private *port_priv,
&recv->grh, wc,
port_priv->device,
port_num,
-   qp_info->qp->qp_num,
+   qp_info->qp_num,
mad_size, opa);
goto out;
}
@@ -2283,7 +2283,7 @@ static void ib_mad_recv_done_handler(struct 
ib_mad_port_private *port_priv,
   generate_unmatched_resp(recv, response, &mad_size, opa)) {
agent_send_response((const struct ib_mad_hdr *)response->mad, 
&recv->grh, wc,
port_priv->device, port_num,
-   qp_info->qp->qp_num, mad_size, opa);
+   qp_info->qp_num, mad_size, opa);
}
 
 out:
@@ -3009,7 +3009,8 @@ static int ib_mad_port_start(struct ib_mad_port_private 
*port_priv)
 */
attr->qp_state = IB_QPS_INIT;
attr->pkey_index = pkey_index;
-   attr->qkey = (qp->qp_num == 0) ? 0 : IB_QP1_QKEY;
+   attr->qkey = (port_priv->qp_info[i].qp_num == 0) ? 0 :
+IB_QP1_QKEY;
ret = ib_modify_qp(qp, attr, IB_QP_STATE |
 IB_QP_PKEY_INDEX | IB_QP_QKEY);
if (ret) {
@@ -3101,6 +3102,8 @@ static int create_mad_qp(struct ib_mad_qp_info *qp_info,
struct ib_qp_init_attr  qp_init_attr;
int ret;
 
+   qp_info->qp_type = qp_type;
+
memset(&qp_init_attr, 0, sizeof qp_init_attr);
qp_init_attr.send_cq = qp_info->port_priv->cq;
qp_init_attr.recv_cq = qp_info->port_priv->cq;
@@ -3124,6 +3127,7 @@ static int create_mad_qp(struct ib_mad_qp_info *qp_info,
/* Use minimum queue sizes unless the CQ is resized */
qp_info->send_queue.max_active = mad_sendq_size;
qp_info->recv_queue.max_active = mad_recvq_size;
+   qp_info->qp_num = qp_info->qp->qp_num;
return 0;
 
 error:
diff --git a/drivers/infiniband/core/mad_priv.h 
b/drivers/infiniband/core/mad_priv.h
index 4a4f7aad0978..ae099f0f9701 100644
--- a/drivers/infiniband/core/mad_priv.h
+++ b/drivers/infiniband/core/mad_priv.h
@@ -191,6 +191,8 @@ struct ib_mad_qp_info {
struct ib_mad_snoop_private **snoop_table;
int snoop_table_size;
atomic_t snoop_count;
+   enum ib_qp_type qp_type;
+   u32 qp_num;
 };
 
 struct ib_mad_port_private {
-- 
1.7.11.2



[PATCH] [RESEND] IB/cma: Accept connection without a valid netdev on RoCE

2015-10-06 Thread Haggai Eran
The netdev checks recently added to RDMA CM expect a valid netdev to be
found for both InfiniBand and RoCE, but the code that finds a netdev is
only implemented for InfiniBand.

Currently RoCE doesn't provide an API to find the netdev matching a
given set of parameters, so this patch just disables the netdev enforcement
for incoming connections when the link layer is RoCE.

Fixes: 4c21b5bcef73 ("IB/cma: Add net_dev and private data checks to RDMA CM")
Reported-by: Kamal Heib 
Signed-off-by: Haggai Eran 
---
I accidentally forgot to send this to the list, so I'm sending again.

 drivers/infiniband/core/cma.c | 54 ---
 1 file changed, 40 insertions(+), 14 deletions(-)

diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c
index b92a3c2c060b..f163ac680841 100644
--- a/drivers/infiniband/core/cma.c
+++ b/drivers/infiniband/core/cma.c
@@ -1232,14 +1232,32 @@ static bool cma_match_private_data(struct 
rdma_id_private *id_priv,
return true;
 }
 
+static bool cma_protocol_roce_dev_port(struct ib_device *device, int port_num)
+{
+   enum rdma_link_layer ll = rdma_port_get_link_layer(device, port_num);
+   enum rdma_transport_type transport =
+   rdma_node_get_transport(device->node_type);
+
+   return ll == IB_LINK_LAYER_ETHERNET && transport == RDMA_TRANSPORT_IB;
+}
+
+static bool cma_protocol_roce(const struct rdma_cm_id *id)
+{
+   struct ib_device *device = id->device;
+   const int port_num = id->port_num ?: rdma_start_port(device);
+
+   return cma_protocol_roce_dev_port(device, port_num);
+}
+
 static bool cma_match_net_dev(const struct rdma_id_private *id_priv,
  const struct net_device *net_dev)
 {
const struct rdma_addr *addr = &id_priv->id.route.addr;
 
if (!net_dev)
-   /* This request is an AF_IB request */
-   return addr->src_addr.ss_family == AF_IB;
+   /* This request is an AF_IB request or a RoCE request */
+   return addr->src_addr.ss_family == AF_IB ||
+  cma_protocol_roce(&id_priv->id);
 
return !addr->dev_addr.bound_dev_if ||
   (net_eq(dev_net(net_dev), &init_net) &&
@@ -1294,6 +1312,10 @@ static struct rdma_id_private *cma_id_from_event(struct 
ib_cm_id *cm_id,
if (PTR_ERR(*net_dev) == -EAFNOSUPPORT) {
/* Assuming the protocol is AF_IB */
*net_dev = NULL;
+   } else if (cma_protocol_roce_dev_port(req.device, req.port)) {
+   /* TODO find the net dev matching the request parameters
+* through the RoCE GID table */
+   *net_dev = NULL;
} else {
return ERR_CAST(*net_dev);
}
@@ -1593,11 +1615,16 @@ static struct rdma_id_private *cma_new_conn_id(struct 
rdma_cm_id *listen_id,
if (ret)
goto err;
} else {
-   /* An AF_IB connection */
-   WARN_ON_ONCE(ss_family != AF_IB);
-
-   cma_translate_ib((struct sockaddr_ib *)cma_src_addr(id_priv),
-&rt->addr.dev_addr);
+   if (!cma_protocol_roce(listen_id) &&
+   cma_any_addr(cma_src_addr(id_priv))) {
+   rt->addr.dev_addr.dev_type = ARPHRD_INFINIBAND;
+   rdma_addr_set_sgid(&rt->addr.dev_addr, 
&rt->path_rec[0].sgid);
+   ib_addr_set_pkey(&rt->addr.dev_addr, 
be16_to_cpu(rt->path_rec[0].pkey));
+   } else if (!cma_any_addr(cma_src_addr(id_priv))) {
+   ret = cma_translate_addr(cma_src_addr(id_priv), 
&rt->addr.dev_addr);
+   if (ret)
+   goto err;
+   }
}
rdma_addr_set_dgid(&rt->addr.dev_addr, &rt->path_rec[0].dgid);
 
@@ -1635,13 +1662,12 @@ static struct rdma_id_private *cma_new_udp_id(struct 
rdma_cm_id *listen_id,
if (ret)
goto err;
} else {
-   /* An AF_IB connection */
-   WARN_ON_ONCE(ss_family != AF_IB);
-
-   if (!cma_any_addr(cma_src_addr(id_priv)))
-   cma_translate_ib((struct sockaddr_ib *)
-   cma_src_addr(id_priv),
-&id->route.addr.dev_addr);
+   if (!cma_any_addr(cma_src_addr(id_priv))) {
+   ret = cma_translate_addr(cma_src_addr(id_priv),
+&id->route.addr.dev_addr);
+   if (ret)
+   goto err;
+   }
}
 
id_pri

Re: [PATCH 3.2] IB/qib: Change lkey table allocation to support more MRs

2015-10-05 Thread Haggai Eran
On 02/10/2015 16:10, Marciniszyn, Mike wrote:
>>> The lkey table is allocated with with a get_user_pages() with an
>> Don't you mean __get_free_pages?
>>
> 
> It was a nit in the original upstream commit.
> 
> I don’t think it is a big deal since the patch context clarifies.
I agree it's not a big deal. I guess it only caught my attention since
reading it made me wonder why you need pinned user memory for the lkey
table.

Haggai


Re: [PATCH 3.2] IB/qib: Change lkey table allocation to support more MRs

2015-09-30 Thread Haggai Eran
On 29/09/2015 17:51, Mike Marciniszyn wrote:
> The lkey table is allocated with with a get_user_pages() with an
Don't you mean __get_free_pages?

Regards,
Haggai

> order based on a number of index bits from a module parameter.
> 
> The underlying kernel code cannot allocate that many contiguous pages.
> 
> There is no reason the underlying memory needs to be physically
> contiguous.
> 
> This patch:
> - switches the allocation/deallocation to vmalloc/vfree
> - caps the number of bits to 23 to insure at least 1 generation bit
>   o this matches the module parameter description
> 
> Cc: sta...@vger.kernel.org # 3.2
> Reviewed-by: Vinit Agnihotri 
> Signed-off-by: Mike Marciniszyn 
> Signed-off-by: Doug Ledford 
> ---
>  drivers/infiniband/hw/qib/qib_keys.c  |4 
>  drivers/infiniband/hw/qib/qib_verbs.c |   16 +++-
>  drivers/infiniband/hw/qib/qib_verbs.h |2 ++
>  3 files changed, 17 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/infiniband/hw/qib/qib_keys.c 
> b/drivers/infiniband/hw/qib/qib_keys.c
> index 8fd19a4..ca6e6cf 100644
> --- a/drivers/infiniband/hw/qib/qib_keys.c
> +++ b/drivers/infiniband/hw/qib/qib_keys.c
> @@ -69,6 +69,10 @@ int qib_alloc_lkey(struct qib_lkey_table *rkt, struct 
> qib_mregion *mr)
>* unrestricted LKEY.
>*/
>   rkt->gen++;
> + /*
> +  * bits are capped in qib_verbs.c to insure enough bits
> +  * for generation number
> +  */
>   mr->lkey = (r << (32 - ib_qib_lkey_table_size)) |
> ((((1 << (24 - ib_qib_lkey_table_size)) - 1) & rkt->gen)
>  << 8);
> diff --git a/drivers/infiniband/hw/qib/qib_verbs.c 
> b/drivers/infiniband/hw/qib/qib_verbs.c
> index a894762..95df4cd 100644
> --- a/drivers/infiniband/hw/qib/qib_verbs.c
> +++ b/drivers/infiniband/hw/qib/qib_verbs.c
> @@ -40,6 +40,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  #include "qib.h"
>  #include "qib_common.h"
> @@ -2035,10 +2036,16 @@ int qib_register_ib_device(struct qib_devdata *dd)
>* the LKEY).  The remaining bits act as a generation number or tag.
>*/
>   spin_lock_init(&dev->lk_table.lock);
> + /* insure generation is at least 4 bits see keys.c */
> + if (ib_qib_lkey_table_size > MAX_LKEY_TABLE_BITS) {
> + qib_dev_warn(dd, "lkey bits %u too large, reduced to %u\n",
> + ib_qib_lkey_table_size, MAX_LKEY_TABLE_BITS);
> + ib_qib_lkey_table_size = MAX_LKEY_TABLE_BITS;
> + }
>   dev->lk_table.max = 1 << ib_qib_lkey_table_size;
>   lk_tab_size = dev->lk_table.max * sizeof(*dev->lk_table.table);
> - dev->lk_table.table = (struct qib_mregion **)
> - __get_free_pages(GFP_KERNEL, get_order(lk_tab_size));
> + dev->lk_table.table = (struct qib_mregion __rcu **)
> + vmalloc(lk_tab_size);
>   if (dev->lk_table.table == NULL) {
>   ret = -ENOMEM;
>   goto err_lk;
> @@ -2208,7 +2215,7 @@ err_tx:
>   sizeof(struct qib_pio_header),
> dev->pio_hdrs, dev->pio_hdrs_phys);
>  err_hdrs:
> - free_pages((unsigned long) dev->lk_table.table, get_order(lk_tab_size));
> + vfree(dev->lk_table.table);
>  err_lk:
>   kfree(dev->qp_table);
>  err_qpt:
> @@ -2262,7 +2269,6 @@ void qib_unregister_ib_device(struct qib_devdata *dd)
>   sizeof(struct qib_pio_header),
> dev->pio_hdrs, dev->pio_hdrs_phys);
>   lk_tab_size = dev->lk_table.max * sizeof(*dev->lk_table.table);
> - free_pages((unsigned long) dev->lk_table.table,
> -get_order(lk_tab_size));
> + vfree(dev->lk_table.table);
>   kfree(dev->qp_table);
>  }
> diff --git a/drivers/infiniband/hw/qib/qib_verbs.h 
> b/drivers/infiniband/hw/qib/qib_verbs.h
> index 0c19ef0..66f7f62 100644
> --- a/drivers/infiniband/hw/qib/qib_verbs.h
> +++ b/drivers/infiniband/hw/qib/qib_verbs.h
> @@ -622,6 +622,8 @@ struct qib_qpn_table {
>   struct qpn_map map[QPNMAP_ENTRIES];
>  };
>  
> +#define MAX_LKEY_TABLE_BITS 23
> +
>  struct qib_lkey_table {
>   spinlock_t lock; /* protect changes in this struct */
>   u32 next;   /* next unused index (speeds search) */
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 



Re: shrink struct ib_send_wr V4

2015-09-29 Thread Haggai Eran
On 13/09/2015 18:13, Christoph Hellwig wrote:
> This series shrinks the WR size by splitting out the different WR
> types.
> 
> Patch number one is too large for the mailinglist, so if you didn't
> get it grab it here:
> 
>   
> http://git.infradead.org/users/hch/rdma.git/commitdiff_plain/c752d80937ff2d71f25ae7fcdd1a151054fb2ceb
> 
> or the full git tree at:
> 
>   git://git.infradead.org/users/hch/rdma.git wr-cleanup
> 

Hi,

I've tested the current wr-cleanup branch with a Connect-IB device to
make sure the changes to the mlx5-internal UMR work requests are fine
(at least when invoked from user space). To test I also had to add
Sagi's patches to remove the reserved lkey, but with those patches it works.

Regards,
Haggai



RoCE RDMA CM handling after the demux patches

2015-09-27 Thread Haggai Eran
Hi,

I noticed that the "Demux IB CM requests in the rdma_cm module" patches,
while intended to only apply to InfiniBand connections, accidentally
also affect RoCE connections. The patches use a query to IPoIB for the
netdev associated with a request; this query fails for RoCE devices, so
all their incoming requests are rejected.

The long term solution would probably be to use the new RoCE GID table
cache to find the relevant netdev, but since the patches to expose the
netdev through the GID table API haven't been accepted yet, I'm working
on a patch that will just waive the requirement for a valid netdev on
RoCE devices.

Regards,
Haggai


[PATCH] IB/cma: Potential NULL dereference in cma_id_from_event

2015-09-21 Thread Haggai Eran
If the lookup of a listening ID failed for an AF_IB request, the code
would try to call dev_put() on a NULL net_dev.

Fixes: be688195bd08 ("IB/cma: Fix net_dev reference leak with failed
requests")
Reported-by: Dan Carpenter 
Signed-off-by: Haggai Eran 
---
 drivers/infiniband/core/cma.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c
index b1ab13f3e182..b92a3c2c060b 100644
--- a/drivers/infiniband/core/cma.c
+++ b/drivers/infiniband/core/cma.c
@@ -1302,7 +1302,7 @@ static struct rdma_id_private *cma_id_from_event(struct 
ib_cm_id *cm_id,
bind_list = cma_ps_find(rdma_ps_from_service_id(req.service_id),
cma_port_from_service_id(req.service_id));
id_priv = cma_find_listener(bind_list, cm_id, ib_event, &req, *net_dev);
-   if (IS_ERR(id_priv)) {
+   if (IS_ERR(id_priv) && *net_dev) {
dev_put(*net_dev);
*net_dev = NULL;
}
-- 
1.7.11.2



Re: [PATCH 0/7] devcg: device cgroup extension for rdma resource

2015-09-20 Thread Haggai Eran
On 15/09/2015 06:45, Jason Gunthorpe wrote:
> No, I'm saying the resource pool is *well defined* and *fixed* by each
> hardware.
> 
> The only question is how do we expose the N resource limits, the list
> of which is totally vendor specific.

I don't see why you say the limits are vendor specific. It is true that
different RDMA devices have different implementations and capabilities,
but they all use the expose the same set of RDMA objects with their
limitations. Whether those limitations come from hardware limitations,
from the driver, or just because the address space is limited, they can
still be exhausted.

> Yes, using a % scheme fixes the ratios, 1% is going to be a certain
> number of PD's, QP's, MRs, CQ's, etc at a ratio fixed by the driver
> configuration. That is the trade off for API simplicity.
>
> 
> Yes, this results in some resources being over provisioned.

I agree that such a scheme will be easy to configure, but I don't think
it can work well in all situations. Imagine you want to let one
container use almost all RC QPs as you want it to connect to the entire
cluster through RC. Other containers can still use a single datagram QP
to connect to the entire cluster, but they would require many address
handles. If you force a fixed ratio of resources given to each container
it would be hard to describe such a partitioning.

I think it would be better to expose different controls for the
different RDMA resources.

Regards,
Haggai


Re: [PATCH 5/7] devcg: device cgroup's extension for RDMA resource.

2015-09-08 Thread Haggai Eran
On 08/09/2015 13:50, Parav Pandit wrote:
> On Tue, Sep 8, 2015 at 2:06 PM, Haggai Eran  wrote:
>> On 07/09/2015 23:38, Parav Pandit wrote:
>>> +void devcgroup_rdma_uncharge_resource(struct ib_ucontext *ucontext,
>>> +   enum devcgroup_rdma_rt type, int num)
>>> +{
>>> + struct dev_cgroup *dev_cg, *p;
>>> + struct task_struct *ctx_task;
>>> +
>>> + if (!num)
>>> + return;
>>> +
>>> + /* get cgroup of ib_ucontext it belong to, to uncharge
>>> +  * so that when its called from any worker tasks or any
>>> +  * other tasks to which this resource doesn't belong to,
>>> +  * it can be uncharged correctly.
>>> +  */
>>> + if (ucontext)
>>> + ctx_task = get_pid_task(ucontext->tgid, PIDTYPE_PID);
>>> + else
>>> + ctx_task = current;
>> So what happens if a process creates a ucontext, forks, and then the
>> child creates and destroys a CQ? If I understand correctly, created
>> resources are always charged to the current process (the child), but
>> when it is destroyed the owner of the ucontext (the parent) will be
>> uncharged.
>>
>> Since ucontexts are not meant to be used by multiple processes, I think
>> it would be okay to always charge the owner process (the one that
>> created the ucontext).
> 
> I need to think about it. I would like to avoid keeping per-task resource
> counters for two reasons.
> For a while I thought that native fork() doesn't take care to share
> the RDMA resources and all CQ, QP dmaable memory from PID namespace
> perspective.
> 
> 1. Because, it could well happen that process and its child process is
> created in PID namespace_A, after which child is migrated to new PID
> namespace_B.
> after which parent from the namespace_A is terminated. I am not sure
> how the ucontext ownership changes from parent to child process at
> that point today.
> I prefer to keep this complexity out if at all it exists as process
> migration across namespaces is not a frequent event for which to
> optimize the code for.
> 
> 2. Having a per-task counter (at the cost of some memory) allows us
> to avoid using atomics during charge() and uncharge().
> 
> The intent is for each task (process and thread) to have its own
> resource counter instance, but I can see that it's broken where it's
> charging the parent process as of now without atomics.
> As you said it's OK to always charge the owner process, I have to relax
> the 2nd requirement and fall back to atomics for charge() and uncharge(),
> or I have to get rid of ucontext from the uncharge() API, which is
> difficult due to fput() being in worker thread context.
> 

I think the cost of atomic operations here would normally be negligible
compared to the cost of accessing the hardware to allocate or deallocate
these resources.


Re: [PATCH 5/7] devcg: device cgroup's extension for RDMA resource.

2015-09-08 Thread Haggai Eran
On 08/09/2015 13:18, Parav Pandit wrote:
>> >
>>> >> + * RDMA resource limits are hierarchical, so the highest configured 
>>> >> limit of
>>> >> + * the hierarchy is enforced. Allowing resource limit configuration to 
>>> >> default
>>> >> + * cgroup allows fair share to kernel space ULPs as well.
>> > In what way is the highest configured limit of the hierarchy enforced? I
>> > would expect all the limits along the hierarchy to be enforced.
>> >
> In a hierarchy of, say, 3 cgroups, the smallest limit in the hierarchy is applied.
> 
> Lets take example to clarify.
> Say cg_A, cg_B, cg_C
> Role  name   limit
> Parent   cg_A   100
> Child_level1  cg_B (child of cg_A)20
> Child_level2: cg_C (child of cg_B)50
> 
> If the process allocating the rdma resource belongs to cg_C, the lowest
> limit in the hierarchy is applied during the charge() stage.
> If cg_A's limit happens to be 10, since 10 is the lowest, its limit would
> apply, as you expected.

Looking at the code, the usage in every level is charged. This is what I
would expect. I just think the comment is a bit misleading.

>>> +int devcgroup_rdma_get_max_resource(struct seq_file *sf, void *v)
>>> +{
>>> + struct dev_cgroup *dev_cg = css_to_devcgroup(seq_css(sf));
>>> + int type = seq_cft(sf)->private;
>>> + u32 usage;
>>> +
>>> + if (dev_cg->rdma.tracker[type].limit == DEVCG_RDMA_MAX_RESOURCES) {
>>> + seq_printf(sf, "%s\n", DEVCG_RDMA_MAX_RESOURCE_STR);
>> I'm not sure hiding the actual number is good, especially in the
>> show_usage case.
> 
> This is similar to other controllers, such as the newly added PID
> subsystem, in showing the max limit.

Okay.

>>> +void devcgroup_rdma_uncharge_resource(struct ib_ucontext *ucontext,
>>> +   enum devcgroup_rdma_rt type, int num)
>>> +{
>>> + struct dev_cgroup *dev_cg, *p;
>>> + struct task_struct *ctx_task;
>>> +
>>> + if (!num)
>>> + return;
>>> +
>>> + /* get cgroup of ib_ucontext it belong to, to uncharge
>>> +  * so that when its called from any worker tasks or any
>>> +  * other tasks to which this resource doesn't belong to,
>>> +  * it can be uncharged correctly.
>>> +  */
>>> + if (ucontext)
>>> + ctx_task = get_pid_task(ucontext->tgid, PIDTYPE_PID);
>>> + else
>>> + ctx_task = current;
>>> + dev_cg = task_devcgroup(ctx_task);
>>> +
>>> + spin_lock(&ctx_task->rdma_res_counter->lock);
>> Don't you need an rcu read lock and rcu_dereference to access
>> rdma_res_counter?
> 
> I believe it's not required, because uncharge() can
> happen only from 3 contexts:
> (a) from the caller task context, which made the allocation call, so no
> synchronization is needed.
> (b) from the resource dealloc context; again this is the same
> task context that allocated it, so this is single-threaded, no need
> to synchronize.
I don't think it is true. You can access uverbs from multiple threads.
What may help your case here, I think, is the fact that the
rdma_res_counter field can only change when the last ucontext is released,
and ucontext release takes the ib_uverbs_file->mutex.

Still, I think it would be best to use rcu_dereference(), if only for
documentation and sparse.

> (c) from the fput() context when the process is terminated abruptly or as
> part of deferred cleanup; when this is happening there cannot be an
> allocator task anyway.



Re: [PATCH 6/7] devcg: Added support to use RDMA device cgroup.

2015-09-08 Thread Haggai Eran
On 08/09/2015 13:22, Parav Pandit wrote:
> On Tue, Sep 8, 2015 at 2:10 PM, Haggai Eran  wrote:
>> On 07/09/2015 23:38, Parav Pandit wrote:
>>> +static void init_ucontext_lists(struct ib_ucontext *ucontext)
>>> +{
>>> + INIT_LIST_HEAD(&ucontext->pd_list);
>>> + INIT_LIST_HEAD(&ucontext->mr_list);
>>> + INIT_LIST_HEAD(&ucontext->mw_list);
>>> + INIT_LIST_HEAD(&ucontext->cq_list);
>>> + INIT_LIST_HEAD(&ucontext->qp_list);
>>> + INIT_LIST_HEAD(&ucontext->srq_list);
>>> + INIT_LIST_HEAD(&ucontext->ah_list);
>>> + INIT_LIST_HEAD(&ucontext->xrcd_list);
>>> + INIT_LIST_HEAD(&ucontext->rule_list);
>>> +}
>>
>> I don't see how this change is related to the patch.
> 
> It's not, but the code I added makes this function grow longer, so to
> keep the same readability level I did the cleanup.
> Maybe I should send a separate patch for the cleanup?

Sounds good to me.



Re: [PATCH 0/7] devcg: device cgroup extension for rdma resource

2015-09-08 Thread Haggai Eran
On 07/09/2015 23:38, Parav Pandit wrote:
> Currently user space applications can easily take away all the rdma
> device specific resources such as AH, CQ, QP, MR etc., so that other
> applications in other cgroups, or kernel space ULPs, may not even get a
> chance to allocate any rdma resources.
> 
> This patch-set allows limiting rdma resources to a set of processes.
> It extends the device cgroup controller to enforce rdma device limits.
I don't think extending the device cgroup is the right place for these
limits. It is currently a very generic controller and adding various
RDMA resources to it looks out of place. Why not create a new controller
for rdma?

Another thing I noticed is that all limits in this cgroup are global,
while the resources they control are hardware device specific.
I think it would be better if the cgroup controlled the limits of each
device separately.

> With this patch, the user verbs module queries the rdma device cgroup
> controller for the process's limit before consuming such a resource. It
> uncharges the resource counter after the resource is freed.
This is another reason why per-device limits would be better. Since
limits are reflected to user-space when querying a specific device, it
will show the same maximum limit on every device opened. If the user
opens 3 devices they might expect to be able to open 3 times the number
of the resources they actually can.

Regards,
Haggai


Re: [PATCH 6/7] devcg: Added support to use RDMA device cgroup.

2015-09-08 Thread Haggai Eran
On 07/09/2015 23:38, Parav Pandit wrote:
> +static void init_ucontext_lists(struct ib_ucontext *ucontext)
> +{
> + INIT_LIST_HEAD(&ucontext->pd_list);
> + INIT_LIST_HEAD(&ucontext->mr_list);
> + INIT_LIST_HEAD(&ucontext->mw_list);
> + INIT_LIST_HEAD(&ucontext->cq_list);
> + INIT_LIST_HEAD(&ucontext->qp_list);
> + INIT_LIST_HEAD(&ucontext->srq_list);
> + INIT_LIST_HEAD(&ucontext->ah_list);
> + INIT_LIST_HEAD(&ucontext->xrcd_list);
> + INIT_LIST_HEAD(&ucontext->rule_list);
> +}

I don't see how this change is related to the patch.


Re: [PATCH 5/7] devcg: device cgroup's extension for RDMA resource.

2015-09-08 Thread Haggai Eran
On 07/09/2015 23:38, Parav Pandit wrote:
> +void devcgroup_rdma_uncharge_resource(struct ib_ucontext *ucontext,
> +   enum devcgroup_rdma_rt type, int num)
> +{
> + struct dev_cgroup *dev_cg, *p;
> + struct task_struct *ctx_task;
> +
> + if (!num)
> + return;
> +
> + /* get cgroup of ib_ucontext it belong to, to uncharge
> +  * so that when its called from any worker tasks or any
> +  * other tasks to which this resource doesn't belong to,
> +  * it can be uncharged correctly.
> +  */
> + if (ucontext)
> + ctx_task = get_pid_task(ucontext->tgid, PIDTYPE_PID);
> + else
> + ctx_task = current;
So what happens if a process creates a ucontext, forks, and then the
child creates and destroys a CQ? If I understand correctly, created
resources are always charged to the current process (the child), but
when it is destroyed the owner of the ucontext (the parent) will be
uncharged.

Since ucontexts are not meant to be used by multiple processes, I think
it would be okay to always charge the owner process (the one that
created the ucontext).

> + dev_cg = task_devcgroup(ctx_task);
> +
> + spin_lock(&ctx_task->rdma_res_counter->lock);
> + ctx_task->rdma_res_counter->usage[type] -= num;
> +
> + for (p = dev_cg; p; p = parent_devcgroup(p))
> + uncharge_resource(p, type, num);
> +
> + spin_unlock(&ctx_task->rdma_res_counter->lock);
> +
> + if (type == DEVCG_RDMA_RES_TYPE_UCTX)
> + rdma_free_res_counter(ctx_task);
> +}
> +EXPORT_SYMBOL(devcgroup_rdma_uncharge_resource);



Re: [PATCH 4/7] devcg: Added rdma resource tracker object per task

2015-09-08 Thread Haggai Eran
On 08/09/2015 10:04, Parav Pandit wrote:
> On Tue, Sep 8, 2015 at 11:18 AM, Haggai Eran  wrote:
>> On 07/09/2015 23:38, Parav Pandit wrote:
>>> @@ -2676,7 +2686,7 @@ static inline int thread_group_empty(struct task_struct *p)
>>>   * Protects ->fs, ->files, ->mm, ->group_info, ->comm, keyring
>>>   * subscriptions and synchronises with wait4().  Also used in procfs.  Also
>>>   * pins the final release of task.io_context.  Also protects ->cpuset and
>>> - * ->cgroup.subsys[]. And ->vfork_done.
>>> + * ->cgroup.subsys[]. Also projtects ->vfork_done and ->rdma_res_counter.
>> s/projtects/protects/
>>>   *
>>>   * Nests both inside and outside of read_lock(&tasklist_lock).
>>>   * It must not be nested with write_lock_irq(&tasklist_lock),
>>
> 
> Hi Haggai Eran,
> Did you miss to put comments or I missed something?

Yes, I wrote "s/projtects/protects/" to tell you that you have a typo in
your comment. You should change the word "projtects" to "protects".

Haggai



Re: [PATCH 5/7] devcg: device cgroup's extension for RDMA resource.

2015-09-08 Thread Haggai Eran
On 07/09/2015 23:38, Parav Pandit wrote:
> +/* RDMA resources from device cgroup perspective */
> +enum devcgroup_rdma_rt {
> + DEVCG_RDMA_RES_TYPE_UCTX,
> + DEVCG_RDMA_RES_TYPE_CQ,
> + DEVCG_RDMA_RES_TYPE_PD,
> + DEVCG_RDMA_RES_TYPE_AH,
> + DEVCG_RDMA_RES_TYPE_MR,
> + DEVCG_RDMA_RES_TYPE_MW,
I didn't see memory windows in dev_cgroup_files in patch 3. Is it used?
> + DEVCG_RDMA_RES_TYPE_SRQ,
> + DEVCG_RDMA_RES_TYPE_QP,
> + DEVCG_RDMA_RES_TYPE_FLOW,
> + DEVCG_RDMA_RES_TYPE_MAX,
> +};

> +struct devcgroup_rdma_tracker {
> + int limit;
> + atomic_t usage;
> + int failcnt;
> +};
Have you considered using struct res_counter?

> + * RDMA resource limits are hierarchical, so the highest configured limit of
> + * the hierarchy is enforced. Allowing resource limit configuration to default
> + * cgroup allows fair share to kernel space ULPs as well.
In what way is the highest configured limit of the hierarchy enforced? I
would expect all the limits along the hierarchy to be enforced.

> +int devcgroup_rdma_get_max_resource(struct seq_file *sf, void *v)
> +{
> + struct dev_cgroup *dev_cg = css_to_devcgroup(seq_css(sf));
> + int type = seq_cft(sf)->private;
> + u32 usage;
> +
> + if (dev_cg->rdma.tracker[type].limit == DEVCG_RDMA_MAX_RESOURCES) {
> + seq_printf(sf, "%s\n", DEVCG_RDMA_MAX_RESOURCE_STR);
> + } else {
> + usage = dev_cg->rdma.tracker[type].limit;
If this is the resource limit, don't name it 'usage'.

> + seq_printf(sf, "%u\n", usage);
> + }
> + return 0;
> +}

> +int devcgroup_rdma_get_max_resource(struct seq_file *sf, void *v)
> +{
> + struct dev_cgroup *dev_cg = css_to_devcgroup(seq_css(sf));
> + int type = seq_cft(sf)->private;
> + u32 usage;
> +
> + if (dev_cg->rdma.tracker[type].limit == DEVCG_RDMA_MAX_RESOURCES) {
> + seq_printf(sf, "%s\n", DEVCG_RDMA_MAX_RESOURCE_STR);
I'm not sure hiding the actual number is good, especially in the
show_usage case.

> + } else {
> + usage = dev_cg->rdma.tracker[type].limit;
> + seq_printf(sf, "%u\n", usage);
> + }
> + return 0;
> +}

> +void devcgroup_rdma_uncharge_resource(struct ib_ucontext *ucontext,
> +   enum devcgroup_rdma_rt type, int num)
> +{
> + struct dev_cgroup *dev_cg, *p;
> + struct task_struct *ctx_task;
> +
> + if (!num)
> + return;
> +
> + /* get cgroup of ib_ucontext it belong to, to uncharge
> +  * so that when its called from any worker tasks or any
> +  * other tasks to which this resource doesn't belong to,
> +  * it can be uncharged correctly.
> +  */
> + if (ucontext)
> + ctx_task = get_pid_task(ucontext->tgid, PIDTYPE_PID);
> + else
> + ctx_task = current;
> + dev_cg = task_devcgroup(ctx_task);
> +
> + spin_lock(&ctx_task->rdma_res_counter->lock);
Don't you need an rcu read lock and rcu_dereference to access
rdma_res_counter?

> + ctx_task->rdma_res_counter->usage[type] -= num;
> +
> + for (p = dev_cg; p; p = parent_devcgroup(p))
> + uncharge_resource(p, type, num);
> +
> + spin_unlock(&ctx_task->rdma_res_counter->lock);
> +
> + if (type == DEVCG_RDMA_RES_TYPE_UCTX)
> + rdma_free_res_counter(ctx_task);
> +}
> +EXPORT_SYMBOL(devcgroup_rdma_uncharge_resource);

> +int devcgroup_rdma_try_charge_resource(enum devcgroup_rdma_rt type, int num)
> +{
> + struct dev_cgroup *dev_cg = task_devcgroup(current);
> + struct task_rdma_res_counter *res_cnt = current->rdma_res_counter;
> + int status;
> +
> + if (!res_cnt) {
> + res_cnt = kzalloc(sizeof(*res_cnt), GFP_KERNEL);
> + if (!res_cnt)
> + return -ENOMEM;
> +
> + spin_lock_init(&res_cnt->lock);
> + rcu_assign_pointer(current->rdma_res_counter, res_cnt);
Don't you need the task lock to update rdma_res_counter here?

> + }
> +
> + /* synchronize with migration task by taking lock, to avoid
> +  * race condition of performing cgroup resource migration
> +  * in non atomic way with this task, which can leads to leaked
> +  * resources in older cgroup.
> +  */
> + spin_lock(&res_cnt->lock);
> + status = try_charge_resource(dev_cg, type, num);
> + if (status)
> + goto busy;
> +
> + /* single task updating its rdma resource usage, so atomic is
> +  * not required.
> +  */
> + current->rdma_res_counter->usage[type] += num;
> +
> +busy:
> + spin_unlock(&res_cnt->lock);
> + return status;
> +}
> +EXPORT_SYMBOL(devcgroup_rdma_try_charge_resource);

Regards,
Haggai


Re: [PATCH 4/7] devcg: Added rdma resource tracker object per task

2015-09-07 Thread Haggai Eran
On 07/09/2015 23:38, Parav Pandit wrote:
> @@ -2676,7 +2686,7 @@ static inline int thread_group_empty(struct task_struct *p)
>   * Protects ->fs, ->files, ->mm, ->group_info, ->comm, keyring
>   * subscriptions and synchronises with wait4().  Also used in procfs.  Also
>   * pins the final release of task.io_context.  Also protects ->cpuset and
> - * ->cgroup.subsys[]. And ->vfork_done.
> + * ->cgroup.subsys[]. Also projtects ->vfork_done and ->rdma_res_counter.
s/projtects/protects/
>   *
>   * Nests both inside and outside of read_lock(&tasklist_lock).
>   * It must not be nested with write_lock_irq(&tasklist_lock),



Re: [PATCH 3/7] devcg: Added infrastructure for rdma device cgroup.

2015-09-07 Thread Haggai Eran
On 07/09/2015 23:38, Parav Pandit wrote:
> diff --git a/include/linux/device_cgroup.h b/include/linux/device_cgroup.h
> index 8b64221..cdbdd60 100644
> --- a/include/linux/device_cgroup.h
> +++ b/include/linux/device_cgroup.h
> @@ -1,6 +1,57 @@
> +#ifndef _DEVICE_CGROUP
> +#define _DEVICE_CGROUP
> +
>  #include 
> +#include 
> +#include 

You cannot add this include line before adding the device_rdma_cgroup.h
(added in patch 5). You should reorder the patches so that after each
patch the kernel builds correctly.

I also noticed in patch 2 you add device_rdma_cgroup.o to the Makefile
before it was added to the kernel.

Regards,
Haggai


[PATCH] libmlx5: Implement query device extended verb

2015-09-05 Thread Haggai Eran
Simply pass the extended query device verb back to libibverbs, in
order to support it.

Also share some code with the legacy query device verb.

Signed-off-by: Haggai Eran 
---
 src/mlx5.c  |  1 +
 src/mlx5.h  |  3 +++
 src/verbs.c | 41 +++--
 3 files changed, 39 insertions(+), 6 deletions(-)

diff --git a/src/mlx5.c b/src/mlx5.c
index d02328881992..f058ac498819 100644
--- a/src/mlx5.c
+++ b/src/mlx5.c
@@ -583,6 +583,7 @@ static int mlx5_init_context(struct verbs_device *vdev,
verbs_set_ctx_op(v_ctx, close_xrcd, mlx5_close_xrcd);
verbs_set_ctx_op(v_ctx, create_srq_ex, mlx5_create_srq_ex);
verbs_set_ctx_op(v_ctx, get_srq_num, mlx5_get_srq_num);
+   verbs_set_ctx_op(v_ctx, query_device_ex, mlx5_query_device_ex);
 
return 0;
 
diff --git a/src/mlx5.h b/src/mlx5.h
index 6ad79fe324d3..364afe12de68 100644
--- a/src/mlx5.h
+++ b/src/mlx5.h
 void mlx5_free_db(struct mlx5_context *context, uint32_t *db);
 
 int mlx5_query_device(struct ibv_context *context,
   struct ibv_device_attr *attr);
+int mlx5_query_device_ex(struct ibv_context *context,
+const struct ibv_query_device_ex_input *input,
+struct ibv_device_attr_ex *attr, size_t attr_size);
 struct ibv_qp *mlx5_create_qp_ex(struct ibv_context *context,
 struct ibv_qp_init_attr_ex *attr);
 int mlx5_query_port(struct ibv_context *context, uint8_t port,
diff --git a/src/verbs.c b/src/verbs.c
index 8ddf4e631c9f..f269c37ea266 100644
--- a/src/verbs.c
+++ b/src/verbs.c
@@ -53,23 +53,52 @@
 
 int mlx5_single_threaded = 0;
 
+static int raw_fw_ver_to_string(uint64_t raw_fw_ver, char *buf, size_t len)
+{
+   unsigned major, minor, sub_minor;
+
+   major = (raw_fw_ver >> 32) & 0xffff;
+   minor = (raw_fw_ver >> 16) & 0xffff;
+   sub_minor = raw_fw_ver & 0xffff;
+
+   return snprintf(buf, len, "%d.%d.%04d", major, minor, sub_minor);
+}
+
 int mlx5_query_device(struct ibv_context *context, struct ibv_device_attr *attr)
 {
	struct ibv_query_device cmd;
	uint64_t raw_fw_ver;
-   unsigned major, minor, sub_minor;
	int ret;
 
	ret = ibv_cmd_query_device(context, attr, &raw_fw_ver, &cmd, sizeof cmd);
	if (ret)
		return ret;
 
-   major = (raw_fw_ver >> 32) & 0xffff;
-   minor = (raw_fw_ver >> 16) & 0xffff;
-   sub_minor = raw_fw_ver & 0xffff;
+   raw_fw_ver_to_string(raw_fw_ver, attr->fw_ver, sizeof attr->fw_ver);
+
+   return 0;
+}
+
+int mlx5_query_device_ex(struct ibv_context *context,
+const struct ibv_query_device_ex_input *input,
+struct ibv_device_attr_ex *attr, size_t attr_size)
+{
+   struct ibv_query_device_ex cmd;
+   struct ibv_query_device_resp_ex resp;
+   uint64_t raw_fw_ver;
+   int ret;
+
+   cmd.comp_mask = 0;
+   ret = ibv_cmd_query_device_ex(context, input, attr, attr_size,
+ &raw_fw_ver, &cmd,
+ sizeof(struct ibv_query_device_ex),
+ sizeof(cmd), &resp,
+ sizeof(struct ibv_query_device_resp_ex),
+ sizeof(resp));
+   if (ret)
+   return ret;
 
-   snprintf(attr->fw_ver, sizeof attr->fw_ver,
-"%d.%d.%04d", major, minor, sub_minor);
+   raw_fw_ver_to_string(raw_fw_ver, attr->orig_attr.fw_ver, sizeof attr->orig_attr.fw_ver);
 
return 0;
 }
-- 
1.7.11.2



[PATCH v1 2/3] Add on-demand paging support

2015-09-03 Thread Haggai Eran
On-demand paging feature allows registering memory regions without pinning
their pages. Unfortunately the feature doesn't work together with all
transports and all operations. This patch adds the ability to report on-demand
paging capabilities through the ibv_query_device_ex.

The patch also adds the IBV_ACCESS_ON_DEMAND access flag to allow registration
of on-demand paging enabled memory regions.

Signed-off-by: Shachar Raindel 
Signed-off-by: Majd Dibbiny 
Signed-off-by: Haggai Eran 
---
 examples/devinfo.c| 51 +++
 include/infiniband/kern-abi.h | 11 ++
 include/infiniband/verbs.h| 25 -
 man/ibv_query_device_ex.3 | 23 +++
 man/ibv_reg_mr.3  |  2 ++
 src/cmd.c | 16 ++
 6 files changed, 127 insertions(+), 1 deletion(-)

diff --git a/examples/devinfo.c b/examples/devinfo.c
index f8aa9b45838a..a8de9826558e 100644
--- a/examples/devinfo.c
+++ b/examples/devinfo.c
@@ -43,6 +43,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -204,6 +205,54 @@ static const char *link_layer_str(uint8_t link_layer)
}
 }
 
+void print_odp_trans_caps(uint32_t trans)
+{
+   uint32_t unknown_transport_caps = ~(IBV_ODP_SUPPORT_SEND |
+   IBV_ODP_SUPPORT_RECV |
+   IBV_ODP_SUPPORT_WRITE |
+   IBV_ODP_SUPPORT_READ |
+   IBV_ODP_SUPPORT_ATOMIC);
+
+   if (!trans) {
+   printf("\t\t\t\t\tNO SUPPORT\n");
+   } else {
+   if (trans & IBV_ODP_SUPPORT_SEND)
+   printf("\t\t\t\t\tSUPPORT_SEND\n");
+   if (trans & IBV_ODP_SUPPORT_RECV)
+   printf("\t\t\t\t\tSUPPORT_RECV\n");
+   if (trans & IBV_ODP_SUPPORT_WRITE)
+   printf("\t\t\t\t\tSUPPORT_WRITE\n");
+   if (trans & IBV_ODP_SUPPORT_READ)
+   printf("\t\t\t\t\tSUPPORT_READ\n");
+   if (trans & IBV_ODP_SUPPORT_ATOMIC)
+   printf("\t\t\t\t\tSUPPORT_ATOMIC\n");
+   if (trans & unknown_transport_caps)
+   printf("\t\t\t\t\tUnknown flags: 0x%" PRIX32 "\n",
+  trans & unknown_transport_caps);
+   }
+}
+
+void print_odp_caps(const struct ibv_odp_caps *caps)
+{
+   uint64_t unknown_general_caps = ~(IBV_ODP_SUPPORT);
+
+   /* general odp caps */
+   printf("\tgeneral_odp_caps:\n");
+   if (caps->general_caps & IBV_ODP_SUPPORT)
+   printf("\t\t\t\t\tODP_SUPPORT\n");
+   if (caps->general_caps & unknown_general_caps)
+   printf("\t\t\t\t\tUnknown flags: 0x%" PRIX64 "\n",
+  caps->general_caps & unknown_general_caps);
+
+   /* RC transport */
+   printf("\trc_odp_caps:\n");
+   print_odp_trans_caps(caps->per_transport_caps.rc_odp_caps);
+   printf("\tuc_odp_caps:\n");
+   print_odp_trans_caps(caps->per_transport_caps.uc_odp_caps);
+   printf("\tud_odp_caps:\n");
+   print_odp_trans_caps(caps->per_transport_caps.ud_odp_caps);
+}
+
 static int print_hca_cap(struct ibv_device *ib_dev, uint8_t ib_port)
 {
struct ibv_context *ctx;
@@ -288,6 +337,8 @@ static int print_hca_cap(struct ibv_device *ib_dev, uint8_t ib_port)
	}
	printf("\tmax_pkeys:\t\t\t%d\n", device_attr.orig_attr.max_pkeys);
	printf("\tlocal_ca_ack_delay:\t\t%d\n", device_attr.orig_attr.local_ca_ack_delay);
+
+   print_odp_caps(&device_attr.odp_caps);
}
 
for (port = 1; port <= device_attr.orig_attr.phys_port_cnt; ++port) {
diff --git a/include/infiniband/kern-abi.h b/include/infiniband/kern-abi.h
index baa897c0d1bf..800c5abab7f8 100644
--- a/include/infiniband/kern-abi.h
+++ b/include/infiniband/kern-abi.h
@@ -252,10 +252,21 @@ struct ibv_query_device_ex {
__u32   reserved;
 };
 
+struct ibv_odp_caps_resp {
+   __u64 general_caps;
+   struct {
+   __u32 rc_odp_caps;
+   __u32 uc_odp_caps;
+   __u32 ud_odp_caps;
+   } per_transport_caps;
+   __u32 reserved;
+};
+
 struct ibv_query_device_resp_ex {
struct ibv_query_device_resp base;
__u32 comp_mask;
__u32 response_length;
+   struct ibv_odp_caps_resp odp_caps;
 };
 
 struct ibv_query_port {
diff --git a/include/infiniband/verbs.h b/include/infiniband/verbs.h
index a3b999eebe47..a32f29095eab 100644
--- a/include/infiniband/verbs.h
+++ b/include/infiniband/verbs.h
@@ -175,9 +175,31 @@ struct

[PATCH v1 3/3] libibverbs/examples: Support odp in rc_pingpong

2015-09-03 Thread Haggai Eran
From: Majd Dibbiny 

Signed-off-by: Majd Dibbiny 
Signed-off-by: Haggai Eran 
---
 examples/rc_pingpong.c | 31 +--
 1 file changed, 29 insertions(+), 2 deletions(-)

diff --git a/examples/rc_pingpong.c b/examples/rc_pingpong.c
index ddfe8d007e1a..90a8320121b9 100644
--- a/examples/rc_pingpong.c
+++ b/examples/rc_pingpong.c
@@ -55,6 +55,7 @@ enum {
 };
 
 static int page_size;
+static int use_odp;
 
 struct pingpong_context {
struct ibv_context  *context;
@@ -315,6 +316,7 @@ static struct pingpong_context *pp_init_ctx(struct ibv_device *ib_dev, int size,
int use_event)
 {
struct pingpong_context *ctx;
+   int access_flags = IBV_ACCESS_LOCAL_WRITE;
 
ctx = calloc(1, sizeof *ctx);
if (!ctx)
@@ -355,7 +357,25 @@ static struct pingpong_context *pp_init_ctx(struct ibv_device *ib_dev, int size,
goto clean_comp_channel;
}
 
-   ctx->mr = ibv_reg_mr(ctx->pd, ctx->buf, size, IBV_ACCESS_LOCAL_WRITE);
+   if (use_odp) {
+   const uint32_t rc_caps_mask = IBV_ODP_SUPPORT_SEND |
+ IBV_ODP_SUPPORT_RECV;
+   struct ibv_device_attr_ex attrx;
+
+   if (ibv_query_device_ex(ctx->context, NULL, &attrx)) {
+   fprintf(stderr, "Couldn't query device for its features\n");
+   goto clean_comp_channel;
+   }
+
+   if (!(attrx.odp_caps.general_caps & IBV_ODP_SUPPORT) ||
+   (attrx.odp_caps.per_transport_caps.rc_odp_caps & rc_caps_mask) != rc_caps_mask) {
+   fprintf(stderr, "The device isn't ODP capable or does not support RC send and receive with ODP\n");
+   goto clean_comp_channel;
+   }
+   access_flags |= IBV_ACCESS_ON_DEMAND;
+   }
+   ctx->mr = ibv_reg_mr(ctx->pd, ctx->buf, size, access_flags);
+
if (!ctx->mr) {
fprintf(stderr, "Couldn't register MR\n");
goto clean_pd;
@@ -540,6 +560,7 @@ static void usage(const char *argv0)
printf("  -l, --sl=  service level value\n");
printf("  -e, --events   sleep on CQ events (default poll)\n");
printf("  -g, --gid-idx= local port gid index\n");
+   printf("  -o, --odp use on demand paging\n");
 }
 
 int main(int argc, char *argv[])
@@ -582,11 +603,13 @@ int main(int argc, char *argv[])
{ .name = "sl",   .has_arg = 1, .val = 'l' },
{ .name = "events",   .has_arg = 0, .val = 'e' },
{ .name = "gid-idx",  .has_arg = 1, .val = 'g' },
+   { .name = "odp",  .has_arg = 0, .val = 'o' },
{ 0 }
};
 
-   c = getopt_long(argc, argv, "p:d:i:s:m:r:n:l:eg:",
+   c = getopt_long(argc, argv, "p:d:i:s:m:r:n:l:eg:o",
long_options, NULL);
+
if (c == -1)
break;
 
@@ -643,6 +666,10 @@ int main(int argc, char *argv[])
gidx = strtol(optarg, NULL, 0);
break;
 
+   case 'o':
+   use_odp = 1;
+   break;
+
default:
usage(argv[0]);
return 1;
-- 
1.7.11.2



[PATCH v1 0/3] libibverbs: On-demand paging support

2015-09-03 Thread Haggai Eran
This series adds userspace support for on-demand paging. The first patch adds
support for the new extended query device verb. Patch 2 adds the capability and
interface bits related to on-demand paging, and patch 3 adds example code to
the rc_pingpong program to use on-demand paging.

Changes from v1:
- Patch 1:
  * move the code that falls back to legacy ibv_query_device, when
ibv_query_device_ex is missing, into the inline function.
  * add an input parameter to the ibv_query_device_ex verb for future
extension.
  * add the size of the ibv_device_attr_ex struct as a parameter to the
ibv_query_device_ex verb, to allow the verb to handle older
applications.
  * check the validity of the input parameter and output struct size.
  * remove reserved words from ibv_query_device_resp_ex, and remove unused
ibv_device_attr_ex_resp struct.
- Patch 2:
  * let print_odp_caps() receive a const pointer instead of a by-value
struct.
  * check that the application has enough space for ODP capabilities in the
provided ibv_device_attr_ex struct.

Eli Cohen (1):
  Add support for extended query device capabilities

Haggai Eran (1):
  Add on-demand paging support

Majd Dibbiny (1):
  libibverbs/examples: Support odp in rc_pingpong

 Makefile.am   |   3 +-
 examples/devinfo.c| 145 +++--
 examples/rc_pingpong.c|  31 -
 include/infiniband/driver.h   |  10 +++
 include/infiniband/kern-abi.h |  36 ++-
 include/infiniband/verbs.h|  68 ++-
 man/ibv_query_device_ex.3 |  70 
 man/ibv_reg_mr.3  |   2 +
 src/cmd.c | 147 ++
 src/libibverbs.map|   2 +
 10 files changed, 420 insertions(+), 94 deletions(-)
 create mode 100644 man/ibv_query_device_ex.3

-- 
1.7.11.2



[PATCH v1 1/3] Add support for extended query device capabilities

2015-09-03 Thread Haggai Eran
From: Eli Cohen 

Add the verb ibv_query_device_ex, which is extensible and allows following
commits to define additional device properties.

Cc: Moshe Lazer 
Signed-off-by: Eli Cohen 
Signed-off-by: Haggai Eran 
---
 Makefile.am   |   3 +-
 examples/devinfo.c|  94 +++---
 include/infiniband/driver.h   |  10 
 include/infiniband/kern-abi.h |  25 +++-
 include/infiniband/verbs.h|  43 ++
 man/ibv_query_device_ex.3 |  47 +++
 src/cmd.c | 131 +-
 src/libibverbs.map|   2 +
 8 files changed, 264 insertions(+), 91 deletions(-)
 create mode 100644 man/ibv_query_device_ex.3

diff --git a/Makefile.am b/Makefile.am
index ef4df033581d..c85e98ae0662 100644
--- a/Makefile.am
+++ b/Makefile.am
@@ -62,7 +62,8 @@ man_MANS = man/ibv_asyncwatch.1 man/ibv_devices.1 man/ibv_devinfo.1   \
 man/ibv_query_srq.3 man/ibv_rate_to_mult.3 man/ibv_reg_mr.3 \
 man/ibv_req_notify_cq.3 man/ibv_resize_cq.3 man/ibv_rate_to_mbps.3  \
 man/ibv_create_qp_ex.3 man/ibv_create_srq_ex.3 man/ibv_open_xrcd.3  \
-man/ibv_get_srq_num.3 man/ibv_open_qp.3
+man/ibv_get_srq_num.3 man/ibv_open_qp.3 \
+man/ibv_query_device_ex.3
 
 DEBIAN = debian/changelog debian/compat debian/control debian/copyright \
 debian/ibverbs-utils.install debian/libibverbs1.install \
diff --git a/examples/devinfo.c b/examples/devinfo.c
index afa8c853868f..f8aa9b45838a 100644
--- a/examples/devinfo.c
+++ b/examples/devinfo.c
@@ -207,7 +207,7 @@ static const char *link_layer_str(uint8_t link_layer)
 static int print_hca_cap(struct ibv_device *ib_dev, uint8_t ib_port)
 {
struct ibv_context *ctx;
-   struct ibv_device_attr device_attr;
+   struct ibv_device_attr_ex device_attr;
struct ibv_port_attr port_attr;
int rc = 0;
uint8_t port;
@@ -219,12 +219,12 @@ static int print_hca_cap(struct ibv_device *ib_dev, uint8_t ib_port)
rc = 1;
goto cleanup;
}
-   if (ibv_query_device(ctx, &device_attr)) {
+   if (ibv_query_device_ex(ctx, NULL, &device_attr)) {
fprintf(stderr, "Failed to query device props\n");
rc = 2;
goto cleanup;
}
-   if (ib_port && ib_port > device_attr.phys_port_cnt) {
+   if (ib_port && ib_port > device_attr.orig_attr.phys_port_cnt) {
fprintf(stderr, "Invalid port requested for device\n");
/* rc = 3 is taken by failure to clean up */
rc = 4;
@@ -234,63 +234,63 @@ static int print_hca_cap(struct ibv_device *ib_dev, uint8_t ib_port)
printf("hca_id:\t%s\n", ibv_get_device_name(ib_dev));
printf("\ttransport:\t\t\t%s (%d)\n",
   transport_str(ib_dev->transport_type), ib_dev->transport_type);
-   if (strlen(device_attr.fw_ver))
-   printf("\tfw_ver:\t\t\t\t%s\n", device_attr.fw_ver);
-   printf("\tnode_guid:\t\t\t%s\n", guid_str(device_attr.node_guid, buf));
-   printf("\tsys_image_guid:\t\t\t%s\n", guid_str(device_attr.sys_image_guid, buf));
-   printf("\tvendor_id:\t\t\t0x%04x\n", device_attr.vendor_id);
-   printf("\tvendor_part_id:\t\t\t%d\n", device_attr.vendor_part_id);
-   printf("\thw_ver:\t\t\t\t0x%X\n", device_attr.hw_ver);
+   if (strlen(device_attr.orig_attr.fw_ver))
+   printf("\tfw_ver:\t\t\t\t%s\n", device_attr.orig_attr.fw_ver);
+   printf("\tnode_guid:\t\t\t%s\n", guid_str(device_attr.orig_attr.node_guid, buf));
+   printf("\tsys_image_guid:\t\t\t%s\n", guid_str(device_attr.orig_attr.sys_image_guid, buf));
+   printf("\tvendor_id:\t\t\t0x%04x\n", device_attr.orig_attr.vendor_id);
+   printf("\tvendor_part_id:\t\t\t%d\n", device_attr.orig_attr.vendor_part_id);
+   printf("\thw_ver:\t\t\t\t0x%X\n", device_attr.orig_attr.hw_ver);
 
	if (ibv_read_sysfs_file(ib_dev->ibdev_path, "board_id", buf, sizeof buf) > 0)
printf("\tboard_id:\t\t\t%s\n", buf);
 
-   printf("\tphys_port_cnt:\t\t\t%d\n", device_attr.phys_port_cnt);
+   printf("\tphys_port_cnt:\t\t\t%d\n", device_attr.orig_attr.phys_port_cnt);
 
if (verbose) {
printf("\tmax_mr_size:\t\t\t0x%llx\n",
-  (unsigned long long) device_attr.max_mr_size);
+  (unsigned long long) device_attr.orig_attr.max_mr_size);
printf("\tpage_size_cap:\t\t\t0x%llx\n",
-  (unsigned long long) device_attr.page_size_cap);
-   printf("\tmax_qp:\t\t\t\t%d\n", device_attr.max_qp);
-   printf("\tm

Re: [PATCH 2/3] Add on-demand paging support

2015-09-02 Thread Haggai Eran
On 02/09/2015 22:17, Sagi Grimberg wrote:
> On 8/27/2015 6:22 PM, Haggai Eran wrote:
>> On-demand paging feature allows registering memory regions without
>> pinning
>> their pages. Unfortunately the feature doesn't work together with all
>> transports and all operations. This patch adds the ability to report
>> on-demand
>> paging capabilities through the ibv_query_device_ex.
>>
>> The patch also adds the IBV_ACCESS_ON_DEMAND access flag to allow
>> registration
>> of on-demand paging enabled memory regions.
>>
>> Signed-off-by: Shachar Raindel 
>> Signed-off-by: Majd Dibbiny 
>> Signed-off-by: Haggai Eran 
>> ---
> 
> Looks good,
> 
> Reviewed-by: Sagi Grimberg 
> 
> I have a patch to add ODP support to TGT user-space target.
> The performance gain is clear-cut.
> 
> Doug, can we get this in?

I received some comments from Moshe Lazer about the first patch in this
series that I should fix. Mainly, the ibv_query_device_ex() verb as it
was defined by the patch doesn't receive an extensible input struct
(only an output struct), and doesn't receive the length of the output
struct.

I'll fix these comments and send a v1.

Haggai



[PATCH] IB/mlx5: avoid destroying a NULL mr in reg_user_mr error flow

2015-08-31 Thread Haggai Eran
The mlx5_ib_reg_user_mr() function will attempt to call clean_mr() in
its error flow even though there is never a case where the error flow
occurs with a valid MR pointer to destroy.

Remove the clean_mr() call and the incorrect comment above it.

Fixes: b4cfe447d47b ("IB/mlx5: Implement on demand paging by adding
support for MMU notifiers")
Cc: Eli Cohen 
Signed-off-by: Haggai Eran 
---
A quick resend without spaces between the Fixes: line and the rest of the
signatures, per Or Gerlitz's request.

 drivers/infiniband/hw/mlx5/mr.c | 12 
 1 file changed, 12 deletions(-)

diff --git a/drivers/infiniband/hw/mlx5/mr.c b/drivers/infiniband/hw/mlx5/mr.c
index 0dfd379b9bad..54a15b5d336d 100644
--- a/drivers/infiniband/hw/mlx5/mr.c
+++ b/drivers/infiniband/hw/mlx5/mr.c
@@ -1114,19 +1114,7 @@ struct ib_mr *mlx5_ib_reg_user_mr(struct ib_pd *pd, u64 
start, u64 length,
return &mr->ibmr;
 
 error:
-   /*
-* Destroy the umem *before* destroying the MR, to ensure we
-* will not have any in-flight notifiers when destroying the
-* MR.
-*
-* As the MR is completely invalid to begin with, and this
-* error path is only taken if we can't push the mr entry into
-* the pagefault tree, this is safe.
-*/
-
ib_umem_release(umem);
-   /* Kill the MR, and return an error code. */
-   clean_mr(mr);
return ERR_PTR(err);
 }
 
-- 
1.7.11.2





Re: [PATCH v4 09/14] IB/cm: Expose BTH P_Key in CM and SIDR request events

2015-08-30 Thread Haggai Eran
On 30/08/2015 21:23, Sagi Grimberg wrote:
> 
> Looks like for some reason cm_get_bth_pkey got pkey_index of 0x
> instead of 0 (working on the default pkey 0x at entry 0).

It looks like the mlx5 driver doesn't interpret the completion format
correctly. It takes a field defined in the programmer reference manual
as pkey, and interprets it as pkey_index [1].

> log:
> infiniband mlx5_0: ib_cm: Couldn't retrieve pkey for incoming request (port 
> 1, pkey index 65535). -22
> ib_srpt Received SRP_LOGIN_REQ with i_port_id 0x0:0x2c90300ed0960, t_port_id 
> 0x2c90300ed0950:0x2c90300ed0950 and it_iu_len 260 on port 1 
> (guid=0xfe80:0x2c90300ed0950)
> ib_srpt Session : kernel thread ib_srpt_compl (PID 8584) started
> infiniband mlx5_0: ib_cm: Couldn't retrieve pkey for incoming request (port 
> 1, pkey index 65535). -22
> ib_srpt Received SRP_LOGIN_REQ with i_port_id 0x0:0x2c90300ed0960, t_port_id 
> 0x2c90300ed0950:0x2c90300ed0950 and it_iu_len 260 on port 1 
> (guid=0xfe80:0x2c90300ed0950)
> ib_srpt Session : kernel thread ib_srpt_compl (PID 8585) started
> mlx5_0:dump_cqe:238:(pid 8584): dump error cqe
>    
>    
> 002b   
>  94003004 002c b8e0
> ib_srpt receiving failed for idx 0 with status 4
> :04:00.0:poll_health:151:(pid 0): device's health compromised
> assert_var[0] 0x0094
> assert_var[1] 0x
> assert_var[2] 0x
> assert_var[3] 0x
> assert_var[4] 0x
> assert_exit_ptr 0x0061d35c
> assert_callra 0x0067a5f4
> fw_ver 0xa0641900
> hw_id 0x01ff
> irisc_index 2
> synd 0x1: firmware internal error
> ext_sync 0x
> :04:00.0:health_care:76:(pid 7943): handling bad device here
> ib_srpt Received DREQ and sent DREP for session 
> 0x0002c90300ed0960.
> ib_srpt Received DREQ and sent DREP for session 
> 0x0002c90300ed0960.
> ib_srpt Received IB TimeWait exit for cm_id 88046d1fb200.
> ib_srpt Received IB TimeWait exit for cm_id 880454ffa000.
> ib_srpt Session 0x0002c90300ed0960: kernel thread 
> ib_srpt_compl (PID 8585) stopped

I don't know how that can cause all the other errors though.

Haggai

[1]
http://lxr.free-electrons.com/source/drivers/infiniband/hw/mlx5/cq.c?v=4.1#L230


[PATCH 1/3] Add support for extended query device capabilities

2015-08-27 Thread Haggai Eran
From: Eli Cohen 

Add the verb ibv_query_device_ex, which is extensible and allows subsequent
commits to add new features and define additional properties.

Signed-off-by: Eli Cohen 
Signed-off-by: Haggai Eran 
---
 Makefile.am   |   3 +-
 examples/devinfo.c|  16 --
 include/infiniband/driver.h   |   9 
 include/infiniband/kern-abi.h |  26 +-
 include/infiniband/verbs.h|  28 ++
 man/ibv_query_device_ex.3 |  47 +
 src/cmd.c | 118 --
 src/libibverbs.map|   2 +
 8 files changed, 202 insertions(+), 47 deletions(-)
 create mode 100644 man/ibv_query_device_ex.3

diff --git a/Makefile.am b/Makefile.am
index ef4df033581d..c85e98ae0662 100644
--- a/Makefile.am
+++ b/Makefile.am
@@ -62,7 +62,8 @@ man_MANS = man/ibv_asyncwatch.1 man/ibv_devices.1 
man/ibv_devinfo.1   \
 man/ibv_query_srq.3 man/ibv_rate_to_mult.3 man/ibv_reg_mr.3
\
 man/ibv_req_notify_cq.3 man/ibv_resize_cq.3 man/ibv_rate_to_mbps.3  \
 man/ibv_create_qp_ex.3 man/ibv_create_srq_ex.3 man/ibv_open_xrcd.3  \
-man/ibv_get_srq_num.3 man/ibv_open_qp.3
+man/ibv_get_srq_num.3 man/ibv_open_qp.3 \
+man/ibv_query_device_ex.3
 
 DEBIAN = debian/changelog debian/compat debian/control debian/copyright \
 debian/ibverbs-utils.install debian/libibverbs1.install \
diff --git a/examples/devinfo.c b/examples/devinfo.c
index afa8c853868f..95e8f83753ca 100644
--- a/examples/devinfo.c
+++ b/examples/devinfo.c
@@ -208,6 +208,7 @@ static int print_hca_cap(struct ibv_device *ib_dev, uint8_t 
ib_port)
 {
struct ibv_context *ctx;
struct ibv_device_attr device_attr;
+   struct ibv_device_attr_ex attrx;
struct ibv_port_attr port_attr;
int rc = 0;
uint8_t port;
@@ -219,11 +220,18 @@ static int print_hca_cap(struct ibv_device *ib_dev, 
uint8_t ib_port)
rc = 1;
goto cleanup;
}
-   if (ibv_query_device(ctx, &device_attr)) {
-   fprintf(stderr, "Failed to query device props\n");
-   rc = 2;
-   goto cleanup;
+
+   if (ibv_query_device_ex(ctx, &attrx)) {
+   attrx.comp_mask = 0;
+   if (ibv_query_device(ctx, &device_attr)) {
+   fprintf(stderr, "Failed to query device props\n");
+   rc = 2;
+   goto cleanup;
+   }
+   } else {
+   device_attr = attrx.orig_attr;
}
+
if (ib_port && ib_port > device_attr.phys_port_cnt) {
fprintf(stderr, "Invalid port requested for device\n");
/* rc = 3 is taken by failure to clean up */
diff --git a/include/infiniband/driver.h b/include/infiniband/driver.h
index 5cc092bf9bd5..b78093ae6a8e 100644
--- a/include/infiniband/driver.h
+++ b/include/infiniband/driver.h
@@ -105,6 +105,15 @@ int ibv_cmd_query_device(struct ibv_context *context,
 struct ibv_device_attr *device_attr,
 uint64_t *raw_fw_ver,
 struct ibv_query_device *cmd, size_t cmd_size);
+int ibv_cmd_query_device_ex(struct ibv_context *context,
+   struct ibv_device_attr_ex *attr,
+   uint64_t *raw_fw_ver,
+   struct ibv_query_device_ex *cmd,
+   size_t cmd_core_size,
+   size_t cmd_size,
+   struct ibv_query_device_resp_ex *resp,
+   size_t resp_core_size,
+   size_t resp_size);
 int ibv_cmd_query_port(struct ibv_context *context, uint8_t port_num,
   struct ibv_port_attr *port_attr,
   struct ibv_query_port *cmd, size_t cmd_size);
diff --git a/include/infiniband/kern-abi.h b/include/infiniband/kern-abi.h
index 91b45d837239..af2a1bebf683 100644
--- a/include/infiniband/kern-abi.h
+++ b/include/infiniband/kern-abi.h
@@ -101,12 +101,20 @@ enum {
 
 #define IB_USER_VERBS_CMD_FLAG_EXTENDED0x80ul
 
+/* use this mask for creating extended commands that
+   correspond to old commands */
+#define IB_USER_VERBS_CMD_EXTENDED_MASK \
+   (IB_USER_VERBS_CMD_FLAG_EXTENDED << \
+IB_USER_VERBS_CMD_FLAGS_SHIFT)
+
 
 enum {
IB_USER_VERBS_CMD_CREATE_FLOW = (IB_USER_VERBS_CMD_FLAG_EXTENDED <<
 IB_USER_VERBS_CMD_FLAGS_SHIFT) +
IB_USER_VERBS_CMD_THRESHOLD,
-   IB_USER_VERBS_CMD_DESTROY_FLOW
+   IB_USER_VERBS_CMD_DESTROY_FLOW,
+   IB_USER_VERBS_CMD_QUERY_DEVICE_EX = IB_USER_VERBS_CMD_EXTENDED_MASK |
+   IB_USER_VERBS_CMD_QUERY_DEVICE,
 };
 
 /*
@@ -240,6 +248,19 @@ struct ibv_query_device_resp {
__u8  reserved[4]

[PATCH 0/3] libibverbs: On-demand paging support

2015-08-27 Thread Haggai Eran
This series adds userspace support for on-demand paging. The first patch adds
support for the new extended query device verb. Patch 2 adds the capability and
interface bits related to on-demand paging, and patch 3 adds example code to
the rc_pingpong program to use on-demand paging.

Eli Cohen (1):
  Add support for extended query device capabilities

Haggai Eran (1):
  Add on-demand paging support

Majd Dibbiny (1):
  libibverbs/examples: Support odp in rc_pingpong

 Makefile.am   |   3 +-
 examples/devinfo.c|  67 --
 examples/rc_pingpong.c|  31 +-
 include/infiniband/driver.h   |   9 +++
 include/infiniband/kern-abi.h |  36 +++-
 include/infiniband/verbs.h|  53 -
 man/ibv_query_device_ex.3 |  70 +++
 man/ibv_reg_mr.3  |   2 +
 src/cmd.c | 129 +-
 src/libibverbs.map|   2 +
 10 files changed, 352 insertions(+), 50 deletions(-)
 create mode 100644 man/ibv_query_device_ex.3

-- 
1.7.11.2



[PATCH 3/3] libibverbs/examples: Support odp in rc_pingpong

2015-08-27 Thread Haggai Eran
From: Majd Dibbiny 

Signed-off-by: Majd Dibbiny 
Signed-off-by: Haggai Eran 
---
 examples/rc_pingpong.c | 31 +--
 1 file changed, 29 insertions(+), 2 deletions(-)

diff --git a/examples/rc_pingpong.c b/examples/rc_pingpong.c
index ddfe8d007e1a..904ec83a633f 100644
--- a/examples/rc_pingpong.c
+++ b/examples/rc_pingpong.c
@@ -55,6 +55,7 @@ enum {
 };
 
 static int page_size;
+static int use_odp;
 
 struct pingpong_context {
struct ibv_context  *context;
@@ -315,6 +316,7 @@ static struct pingpong_context *pp_init_ctx(struct 
ibv_device *ib_dev, int size,
int use_event)
 {
struct pingpong_context *ctx;
+   int access_flags = IBV_ACCESS_LOCAL_WRITE;
 
ctx = calloc(1, sizeof *ctx);
if (!ctx)
@@ -355,7 +357,25 @@ static struct pingpong_context *pp_init_ctx(struct 
ibv_device *ib_dev, int size,
goto clean_comp_channel;
}
 
-   ctx->mr = ibv_reg_mr(ctx->pd, ctx->buf, size, IBV_ACCESS_LOCAL_WRITE);
+   if (use_odp) {
+   const uint32_t rc_caps_mask = IBV_ODP_SUPPORT_SEND |
+ IBV_ODP_SUPPORT_RECV;
+   struct ibv_device_attr_ex attrx = {};
+
+   if (ibv_query_device_ex(ctx->context, &attrx)) {
+   fprintf(stderr, "Couldn't query device for its 
features\n");
+   goto clean_comp_channel;
+   }
+
+   if (!(attrx.odp_caps.general_caps & IBV_ODP_SUPPORT) ||
+   (attrx.odp_caps.per_transport_caps.rc_odp_caps & 
rc_caps_mask) != rc_caps_mask) {
+   fprintf(stderr, "The device isn't ODP capable or does 
not support RC send and receive with ODP\n");
+   goto clean_comp_channel;
+   }
+   access_flags |= IBV_ACCESS_ON_DEMAND;
+   }
+   ctx->mr = ibv_reg_mr(ctx->pd, ctx->buf, size, access_flags);
+
if (!ctx->mr) {
fprintf(stderr, "Couldn't register MR\n");
goto clean_pd;
@@ -540,6 +560,7 @@ static void usage(const char *argv0)
printf("  -l, --sl=  service level value\n");
printf("  -e, --events   sleep on CQ events (default poll)\n");
printf("  -g, --gid-idx= local port gid index\n");
+   printf("  -o, --odp use on demand paging\n");
 }
 
 int main(int argc, char *argv[])
@@ -582,11 +603,13 @@ int main(int argc, char *argv[])
{ .name = "sl",   .has_arg = 1, .val = 'l' },
{ .name = "events",   .has_arg = 0, .val = 'e' },
{ .name = "gid-idx",  .has_arg = 1, .val = 'g' },
+   { .name = "odp",  .has_arg = 0, .val = 'o' },
{ 0 }
};
 
-   c = getopt_long(argc, argv, "p:d:i:s:m:r:n:l:eg:",
+   c = getopt_long(argc, argv, "p:d:i:s:m:r:n:l:eg:o",
long_options, NULL);
+
if (c == -1)
break;
 
@@ -643,6 +666,10 @@ int main(int argc, char *argv[])
gidx = strtol(optarg, NULL, 0);
break;
 
+   case 'o':
+   use_odp = 1;
+   break;
+
default:
usage(argv[0]);
return 1;
-- 
1.7.11.2



[PATCH 2/3] Add on-demand paging support

2015-08-27 Thread Haggai Eran
The on-demand paging feature allows registering memory regions without pinning
their pages. Unfortunately, the feature doesn't work with all transports and
all operations. This patch adds the ability to report on-demand paging
capabilities through ibv_query_device_ex.

The patch also adds the IBV_ACCESS_ON_DEMAND access flag to allow registration
of on-demand paging enabled memory regions.

Signed-off-by: Shachar Raindel 
Signed-off-by: Majd Dibbiny 
Signed-off-by: Haggai Eran 
---
 examples/devinfo.c| 51 +++
 include/infiniband/kern-abi.h | 12 +-
 include/infiniband/verbs.h| 25 -
 man/ibv_query_device_ex.3 | 23 +++
 man/ibv_reg_mr.3  |  2 ++
 src/cmd.c | 11 ++
 6 files changed, 122 insertions(+), 2 deletions(-)

diff --git a/examples/devinfo.c b/examples/devinfo.c
index 95e8f83753ca..61cfdf520be6 100644
--- a/examples/devinfo.c
+++ b/examples/devinfo.c
@@ -43,6 +43,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -204,6 +205,54 @@ static const char *link_layer_str(uint8_t link_layer)
}
 }
 
+void print_odp_trans_caps(uint32_t trans)
+{
+   uint32_t unknown_transport_caps = ~(IBV_ODP_SUPPORT_SEND |
+   IBV_ODP_SUPPORT_RECV |
+   IBV_ODP_SUPPORT_WRITE |
+   IBV_ODP_SUPPORT_READ |
+   IBV_ODP_SUPPORT_ATOMIC);
+
+   if (!trans) {
+   printf("\t\t\t\t\tNO SUPPORT\n");
+   } else {
+   if (trans & IBV_ODP_SUPPORT_SEND)
+   printf("\t\t\t\t\tSUPPORT_SEND\n");
+   if (trans & IBV_ODP_SUPPORT_RECV)
+   printf("\t\t\t\t\tSUPPORT_RECV\n");
+   if (trans & IBV_ODP_SUPPORT_WRITE)
+   printf("\t\t\t\t\tSUPPORT_WRITE\n");
+   if (trans & IBV_ODP_SUPPORT_READ)
+   printf("\t\t\t\t\tSUPPORT_READ\n");
+   if (trans & IBV_ODP_SUPPORT_ATOMIC)
+   printf("\t\t\t\t\tSUPPORT_ATOMIC\n");
+   if (trans & unknown_transport_caps)
+   printf("\t\t\t\t\tUnknown flags: 0x%" PRIX32 "\n",
+  trans & unknown_transport_caps);
+   }
+}
+
+void print_odp_caps(struct ibv_odp_caps caps)
+{
+   uint64_t unknown_general_caps = ~(IBV_ODP_SUPPORT);
+
+   /* general odp caps */
+   printf("\tgeneral_odp_caps:\n");
+   if (caps.general_caps & IBV_ODP_SUPPORT)
+   printf("\t\t\t\t\tODP_SUPPORT\n");
+   if (caps.general_caps & unknown_general_caps)
+   printf("\t\t\t\t\tUnknown flags: 0x%" PRIX64 "\n",
+  caps.general_caps & unknown_general_caps);
+
+   /* RC transport */
+   printf("\trc_odp_caps:\n");
+   print_odp_trans_caps(caps.per_transport_caps.rc_odp_caps);
+   printf("\tuc_odp_caps:\n");
+   print_odp_trans_caps(caps.per_transport_caps.uc_odp_caps);
+   printf("\tud_odp_caps:\n");
+   print_odp_trans_caps(caps.per_transport_caps.ud_odp_caps);
+}
+
 static int print_hca_cap(struct ibv_device *ib_dev, uint8_t ib_port)
 {
struct ibv_context *ctx;
@@ -296,6 +345,8 @@ static int print_hca_cap(struct ibv_device *ib_dev, uint8_t 
ib_port)
}
printf("\tmax_pkeys:\t\t\t%d\n", device_attr.max_pkeys);
printf("\tlocal_ca_ack_delay:\t\t%d\n", 
device_attr.local_ca_ack_delay);
+
+   print_odp_caps(attrx.odp_caps);
}
 
for (port = 1; port <= device_attr.phys_port_cnt; ++port) {
diff --git a/include/infiniband/kern-abi.h b/include/infiniband/kern-abi.h
index af2a1bebf683..1c0d0d30c612 100644
--- a/include/infiniband/kern-abi.h
+++ b/include/infiniband/kern-abi.h
@@ -254,11 +254,21 @@ struct ibv_query_device_ex {
__u32   reserved;
 };
 
+struct ibv_odp_caps_resp {
+   __u64 general_caps;
+   struct {
+   __u32 rc_odp_caps;
+   __u32 uc_odp_caps;
+   __u32 ud_odp_caps;
+   } per_transport_caps;
+   __u32 reserved;
+};
+
 struct ibv_query_device_resp_ex {
struct ibv_query_device_resp base;
__u32 comp_mask;
__u32 response_length;
-   __u64 reserved[3];
+   struct ibv_odp_caps_resp odp_caps;
 };
 
 struct ibv_query_port {
diff --git a/include/infiniband/verbs.h b/include/infiniband/verbs.h
index ff806bf8555d..ce56315b236e 100644
--- a/include/infiniband/verbs.h
+++ b/include/infiniband/verbs.h
@@ -168,9 +168,31 @@ struct ibv_device_attr {
uint8_t phys_por

[PATCH v6 3/4] IB/cma: Add support for network namespaces

2015-08-27 Thread Haggai Eran
From: Guy Shapiro 

Add support for network namespaces in the ib_cma module. This is
accomplished by:

1. Adding network namespace parameter for rdma_create_id. This parameter is
   used to populate the network namespace field in rdma_id_private.
   rdma_create_id keeps a reference on the network namespace.
2. Using the network namespace from the rdma_id instead of init_net inside
   of ib_cma, when listening on an ID and when looking for an ID for an
   incoming request.
3. Decrementing the reference count for the appropriate network namespace
   when calling rdma_destroy_id.

In order to preserve the current behavior init_net is passed when calling
from other modules.
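The lifetime rule in points 1 and 3 can be sketched with a small userspace C model: the ID takes a reference on the namespace at creation and drops it at destruction. The struct and function names below are illustrative stand-ins for the kernel's get_net()/put_net(), not the real API.

```c
#include <assert.h>
#include <stdlib.h>

/* Toy model of the reference lifetime: creating an ID pins the caller's
 * network namespace; destroying the ID releases that reference. */
struct fake_net { int refcount; };

static struct fake_net *fake_get_net(struct fake_net *net)
{
	net->refcount++;		/* analogous to get_net() */
	return net;
}

static void fake_put_net(struct fake_net *net)
{
	net->refcount--;		/* analogous to put_net() */
}

struct fake_id { struct fake_net *net; };

static struct fake_id *fake_create_id(struct fake_net *net)
{
	struct fake_id *id = malloc(sizeof(*id));

	if (!id)
		return NULL;
	id->net = fake_get_net(net);	/* the ID now holds the namespace */
	return id;
}

static void fake_destroy_id(struct fake_id *id)
{
	fake_put_net(id->net);		/* drop the reference on destroy */
	free(id);
}
```

Holding the reference for the ID's whole lifetime is what keeps the namespace valid for the lookups in point 2, even if the creating process exits first.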

Signed-off-by: Guy Shapiro 
Signed-off-by: Haggai Eran 
Signed-off-by: Yotam Kenneth 
Signed-off-by: Shachar Raindel 
---
 drivers/infiniband/core/cma.c  | 46 +-
 drivers/infiniband/core/ucma.c |  3 +-
 drivers/infiniband/ulp/iser/iser_verbs.c   |  2 +-
 drivers/infiniband/ulp/isert/ib_isert.c|  2 +-
 .../staging/lustre/lnet/klnds/o2iblnd/o2iblnd.h|  4 +-
 include/rdma/rdma_cm.h |  6 ++-
 net/9p/trans_rdma.c|  4 +-
 net/rds/ib.c   |  2 +-
 net/rds/ib_cm.c|  2 +-
 net/rds/iw.c   |  2 +-
 net/rds/iw_cm.c|  2 +-
 net/rds/rdma_transport.c   |  4 +-
 net/sunrpc/xprtrdma/svc_rdma_transport.c   |  4 +-
 net/sunrpc/xprtrdma/verbs.c|  3 +-
 14 files changed, 52 insertions(+), 34 deletions(-)

diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c
index f40ca053fa3e..debf25ccf930 100644
--- a/drivers/infiniband/core/cma.c
+++ b/drivers/infiniband/core/cma.c
@@ -591,7 +591,8 @@ static int cma_disable_callback(struct rdma_id_private 
*id_priv,
return 0;
 }
 
-struct rdma_cm_id *rdma_create_id(rdma_cm_event_handler event_handler,
+struct rdma_cm_id *rdma_create_id(struct net *net,
+ rdma_cm_event_handler event_handler,
  void *context, enum rdma_port_space ps,
  enum ib_qp_type qp_type)
 {
@@ -615,7 +616,7 @@ struct rdma_cm_id *rdma_create_id(rdma_cm_event_handler 
event_handler,
INIT_LIST_HEAD(&id_priv->listen_list);
INIT_LIST_HEAD(&id_priv->mc_list);
get_random_bytes(&id_priv->seq_num, sizeof id_priv->seq_num);
-   id_priv->id.route.addr.dev_addr.net = &init_net;
+   id_priv->id.route.addr.dev_addr.net = get_net(net);
 
return &id_priv->id;
 }
@@ -1257,7 +1258,7 @@ static bool cma_match_net_dev(const struct 
rdma_id_private *id_priv,
return addr->src_addr.ss_family == AF_IB;
 
return !addr->dev_addr.bound_dev_if ||
-  (net_eq(dev_net(net_dev), &init_net) &&
+  (net_eq(dev_net(net_dev), addr->dev_addr.net) &&
addr->dev_addr.bound_dev_if == net_dev->ifindex);
 }
 
@@ -1314,7 +1315,7 @@ static struct rdma_id_private *cma_id_from_event(struct 
ib_cm_id *cm_id,
}
}
 
-   bind_list = cma_ps_find(&init_net,
+   bind_list = cma_ps_find(dev_net(*net_dev),
rdma_ps_from_service_id(req.service_id),
cma_port_from_service_id(req.service_id));
id_priv = cma_find_listener(bind_list, cm_id, ib_event, &req, *net_dev);
@@ -1386,6 +1387,7 @@ static void cma_cancel_operation(struct rdma_id_private 
*id_priv,
 static void cma_release_port(struct rdma_id_private *id_priv)
 {
struct rdma_bind_list *bind_list = id_priv->bind_list;
+   struct net *net = id_priv->id.route.addr.dev_addr.net;
 
if (!bind_list)
return;
@@ -1393,7 +1395,7 @@ static void cma_release_port(struct rdma_id_private 
*id_priv)
mutex_lock(&lock);
hlist_del(&id_priv->node);
if (hlist_empty(&bind_list->owners)) {
-   cma_ps_remove(&init_net, bind_list->ps, bind_list->port);
+   cma_ps_remove(net, bind_list->ps, bind_list->port);
kfree(bind_list);
}
mutex_unlock(&lock);
@@ -1452,6 +1454,7 @@ void rdma_destroy_id(struct rdma_cm_id *id)
cma_deref_id(id_priv->id.context);
 
kfree(id_priv->id.route.path_rec);
+   put_net(id_priv->id.route.addr.dev_addr.net);
kfree(id_priv);
 }
 EXPORT_SYMBOL(rdma_destroy_id);
@@ -1582,7 +1585,8 @@ static struct rdma_id_private *cma_new_conn_id(struct 
rdma_cm_id *listen_id,
  ib_event->param.req_rcvd.primary_path->service_id;
int ret;
 
-   id = rdma_create_id(listen_id->event_handler, l

[PATCH v6 0/4] Add network namespace support in the RDMA-CM

2015-08-27 Thread Haggai Eran
Hi,

Now that the code for demuxing requests is inside rdma_cm, here are the patches
to add InfiniBand network namespace support again.

Changes from v5:
- removed patches that got in as part of the cleanup series.

RDMA-CM uses IP based addressing and routing to set up RDMA connections between
hosts. Currently, all of the IP interfaces and addresses used by the RDMA-CM
must reside in the init_net namespace. This restricts the usage of containers
with RDMA to the host network namespace (i.e. the kernel's init_net NS
instance).

This patchset allows using network namespaces with the RDMA-CM.

Each RDMA-CM id keeps a reference to a network namespace.

This reference is taken from the network namespace of the process at the time
the object is created, or inherited from the listener.

This network namespace is used to perform all IP and network related
operations. Specifically, the local device lookup, as well as the remote GID
address resolution are done in the context of the RDMA-CM object's namespace.
This allows outgoing connections to reach the right target, even if the same
IP address exists in multiple network namespaces. This can happen if each
network namespace resides on a different P_Key.

Additionally, the network namespace is used to split the listener port space
tables. From the user point of view, each network namespace has a unique,
completely independent tables for its port spaces. This allows running multiple
instances of a single service on the same machine, using containers. 
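The split port-space behavior can be sketched in a few lines of userspace C: each namespace owns its own port table, so the same port binds independently in two namespaces. This is purely illustrative and not the kernel's IDR-based implementation; all names are invented.

```c
#include <assert.h>

#define NPORTS 64

/* Toy model of per-namespace port spaces: one table per namespace means a
 * bind only conflicts with other binds in the *same* namespace. */
struct toy_ns { void *ports[NPORTS]; };

static int toy_bind(struct toy_ns *ns, int port, void *owner)
{
	if (port < 0 || port >= NPORTS)
		return -1;
	if (ns->ports[port])
		return -1;		/* taken in this namespace only */
	ns->ports[port] = owner;
	return 0;
}
```

With one such table per container's namespace, two containers can each run a listener on the same service port without clashing.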

The functionality introduced by this series would come into play when the
transport is InfiniBand and IPoIB interfaces are assigned to each namespace.
Multiple IPoIB interfaces can be created and assigned to different RDMA-CM
capable containers, for example using pipework [1].

The patches apply against Doug's to-be-rebased tree for v4.3.

The patchset is structured as follows:

Patch 1 is a relatively trivial API extension, requiring the callers
of certain ib_addr functions to provide a network namespace, as needed.

Patches 2-4 add proper namespace support to the RDMA-CM module. This
includes adding multiple port space tables, adding a network namespace
parameter, and finally retrieving the namespace from the creating process.

[1] https://github.com/jpetazzo/pipework/pull/108

Guy Shapiro (3):
  IB/addr: Pass network namespace as a parameter
  IB/cma: Add support for network namespaces
  IB/ucma: Take the network namespace from the process

Haggai Eran (1):
  IB/cma: Separate port allocation to network namespaces

 drivers/infiniband/core/addr.c |  17 +--
 drivers/infiniband/core/cma.c  | 129 +++--
 drivers/infiniband/core/ucma.c |   4 +-
 drivers/infiniband/ulp/iser/iser_verbs.c   |   2 +-
 drivers/infiniband/ulp/isert/ib_isert.c|   2 +-
 .../staging/lustre/lnet/klnds/o2iblnd/o2iblnd.h|   4 +-
 include/rdma/ib_addr.h |  16 ++-
 include/rdma/rdma_cm.h |   6 +-
 net/9p/trans_rdma.c|   4 +-
 net/rds/ib.c   |   2 +-
 net/rds/ib_cm.c|   2 +-
 net/rds/iw.c   |   2 +-
 net/rds/iw_cm.c|   2 +-
 net/rds/rdma_transport.c   |   4 +-
 net/sunrpc/xprtrdma/svc_rdma_transport.c   |   4 +-
 net/sunrpc/xprtrdma/verbs.c|   3 +-
 16 files changed, 142 insertions(+), 61 deletions(-)

-- 
1.7.11.2



[PATCH v6 4/4] IB/ucma: Take the network namespace from the process

2015-08-27 Thread Haggai Eran
From: Guy Shapiro 

Add support for network namespaces from user space. This is done by passing
the network namespace of the process instead of init_net.

Signed-off-by: Haggai Eran 
Signed-off-by: Yotam Kenneth 
Signed-off-by: Shachar Raindel 
Signed-off-by: Guy Shapiro 
---
 drivers/infiniband/core/ucma.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/infiniband/core/ucma.c b/drivers/infiniband/core/ucma.c
index 82a17a7b7e6d..00402e6d505a 100644
--- a/drivers/infiniband/core/ucma.c
+++ b/drivers/infiniband/core/ucma.c
@@ -42,6 +42,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -472,8 +473,8 @@ static ssize_t ucma_create_id(struct ucma_file *file, const 
char __user *inbuf,
return -ENOMEM;
 
ctx->uid = cmd.uid;
-   ctx->cm_id = rdma_create_id(&init_net, ucma_event_handler, ctx, cmd.ps,
-   qp_type);
+   ctx->cm_id = rdma_create_id(current->nsproxy->net_ns,
+   ucma_event_handler, ctx, cmd.ps, qp_type);
if (IS_ERR(ctx->cm_id)) {
ret = PTR_ERR(ctx->cm_id);
goto err1;
-- 
1.7.11.2



[PATCH v6 1/4] IB/addr: Pass network namespace as a parameter

2015-08-27 Thread Haggai Eran
From: Guy Shapiro 

Add network namespace support to the ib_addr module. To that end, all
address resolution and matching is done using the appropriate namespace
instead of init_net.

This is achieved by:

1. Adding an explicit network namespace argument to exported functions that
   require a namespace.
2. Saving the namespace in the rdma_addr_client structure.
3. Using it when calling networking functions.

In order to preserve the behavior of calling modules, &init_net is
passed as the parameter in calls from other modules. This is modified as
namespace support is added on more levels.

Signed-off-by: Haggai Eran 
Signed-off-by: Yotam Kenneth 
Signed-off-by: Shachar Raindel 
Signed-off-by: Guy Shapiro 
---
 drivers/infiniband/core/addr.c | 17 +
 drivers/infiniband/core/cma.c  |  1 +
 include/rdma/ib_addr.h | 16 +++-
 3 files changed, 25 insertions(+), 9 deletions(-)

diff --git a/drivers/infiniband/core/addr.c b/drivers/infiniband/core/addr.c
index 746cdf56bc76..6ed9685efebd 100644
--- a/drivers/infiniband/core/addr.c
+++ b/drivers/infiniband/core/addr.c
@@ -128,7 +128,7 @@ int rdma_translate_ip(struct sockaddr *addr, struct 
rdma_dev_addr *dev_addr,
int ret = -EADDRNOTAVAIL;
 
if (dev_addr->bound_dev_if) {
-   dev = dev_get_by_index(&init_net, dev_addr->bound_dev_if);
+   dev = dev_get_by_index(dev_addr->net, dev_addr->bound_dev_if);
if (!dev)
return -ENODEV;
ret = rdma_copy_addr(dev_addr, dev, NULL);
@@ -138,7 +138,7 @@ int rdma_translate_ip(struct sockaddr *addr, struct 
rdma_dev_addr *dev_addr,
 
switch (addr->sa_family) {
case AF_INET:
-   dev = ip_dev_find(&init_net,
+   dev = ip_dev_find(dev_addr->net,
((struct sockaddr_in *) addr)->sin_addr.s_addr);
 
if (!dev)
@@ -149,12 +149,11 @@ int rdma_translate_ip(struct sockaddr *addr, struct 
rdma_dev_addr *dev_addr,
*vlan_id = rdma_vlan_dev_vlan_id(dev);
dev_put(dev);
break;
-
 #if IS_ENABLED(CONFIG_IPV6)
case AF_INET6:
rcu_read_lock();
-   for_each_netdev_rcu(&init_net, dev) {
-   if (ipv6_chk_addr(&init_net,
+   for_each_netdev_rcu(dev_addr->net, dev) {
+   if (ipv6_chk_addr(dev_addr->net,
  &((struct sockaddr_in6 *) 
addr)->sin6_addr,
  dev, 1)) {
ret = rdma_copy_addr(dev_addr, dev, NULL);
@@ -236,7 +235,7 @@ static int addr4_resolve(struct sockaddr_in *src_in,
fl4.daddr = dst_ip;
fl4.saddr = src_ip;
fl4.flowi4_oif = addr->bound_dev_if;
-   rt = ip_route_output_key(&init_net, &fl4);
+   rt = ip_route_output_key(addr->net, &fl4);
if (IS_ERR(rt)) {
ret = PTR_ERR(rt);
goto out;
@@ -278,12 +277,12 @@ static int addr6_resolve(struct sockaddr_in6 *src_in,
fl6.saddr = src_in->sin6_addr;
fl6.flowi6_oif = addr->bound_dev_if;
 
-   dst = ip6_route_output(&init_net, NULL, &fl6);
+   dst = ip6_route_output(addr->net, NULL, &fl6);
if ((ret = dst->error))
goto put;
 
if (ipv6_addr_any(&fl6.saddr)) {
-   ret = ipv6_dev_get_saddr(&init_net, ip6_dst_idev(dst)->dev,
+   ret = ipv6_dev_get_saddr(addr->net, ip6_dst_idev(dst)->dev,
 &fl6.daddr, 0, &fl6.saddr);
if (ret)
goto put;
@@ -476,6 +475,7 @@ int rdma_addr_find_dmac_by_grh(const union ib_gid *sgid, 
const union ib_gid *dgi
rdma_gid2ip(&dgid_addr._sockaddr, dgid);
 
memset(&dev_addr, 0, sizeof(dev_addr));
+   dev_addr.net = &init_net;
 
ctx.addr = &dev_addr;
init_completion(&ctx.comp);
@@ -510,6 +510,7 @@ int rdma_addr_find_smac_by_sgid(union ib_gid *sgid, u8 
*smac, u16 *vlan_id)
rdma_gid2ip(&gid_addr._sockaddr, sgid);
 
memset(&dev_addr, 0, sizeof(dev_addr));
+   dev_addr.net = &init_net;
ret = rdma_translate_ip(&gid_addr._sockaddr, &dev_addr, vlan_id);
if (ret)
return ret;
diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c
index b1ab13f3e182..0530c6188e75 100644
--- a/drivers/infiniband/core/cma.c
+++ b/drivers/infiniband/core/cma.c
@@ -601,6 +601,7 @@ struct rdma_cm_id *rdma_create_id(rdma_cm_event_handler 
event_handler,
INIT_LIST_HEAD(&id_priv->listen_list);
INIT_LIST_HEAD(&id_priv->mc_list);
get_random_bytes(&id_priv->seq_num, sizeof id_priv->seq_num);
+   id_priv->id.route.addr.dev_add

[PATCH v6 2/4] IB/cma: Separate port allocation to network namespaces

2015-08-27 Thread Haggai Eran
Keep a struct for each network namespace containing the IDRs for the RDMA
CM port spaces. The struct is created dynamically using the generic_net
mechanism.

This patch is internal infrastructure work for the following patches. In
this patch, init_net is statically used as the network namespace for
the new port-space API.

Signed-off-by: Haggai Eran 
Signed-off-by: Yotam Kenneth 
Signed-off-by: Shachar Raindel 
Signed-off-by: Guy Shapiro 
---
 drivers/infiniband/core/cma.c | 94 ---
 1 file changed, 70 insertions(+), 24 deletions(-)

diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c
index 0530c6188e75..f40ca053fa3e 100644
--- a/drivers/infiniband/core/cma.c
+++ b/drivers/infiniband/core/cma.c
@@ -44,6 +44,8 @@
 #include 
 #include 
 
+#include 
+#include 
 #include 
 #include 
 #include 
@@ -110,22 +112,33 @@ static LIST_HEAD(dev_list);
 static LIST_HEAD(listen_any_list);
 static DEFINE_MUTEX(lock);
 static struct workqueue_struct *cma_wq;
-static DEFINE_IDR(tcp_ps);
-static DEFINE_IDR(udp_ps);
-static DEFINE_IDR(ipoib_ps);
-static DEFINE_IDR(ib_ps);
+static int cma_pernet_id;
 
-static struct idr *cma_idr(enum rdma_port_space ps)
+struct cma_pernet {
+   struct idr tcp_ps;
+   struct idr udp_ps;
+   struct idr ipoib_ps;
+   struct idr ib_ps;
+};
+
+static struct cma_pernet *cma_pernet(struct net *net)
+{
+   return net_generic(net, cma_pernet_id);
+}
+
+static struct idr *cma_pernet_idr(struct net *net, enum rdma_port_space ps)
 {
+   struct cma_pernet *pernet = cma_pernet(net);
+
switch (ps) {
case RDMA_PS_TCP:
-   return &tcp_ps;
+   return &pernet->tcp_ps;
case RDMA_PS_UDP:
-   return &udp_ps;
+   return &pernet->udp_ps;
case RDMA_PS_IPOIB:
-   return &ipoib_ps;
+   return &pernet->ipoib_ps;
case RDMA_PS_IB:
-   return &ib_ps;
+   return &pernet->ib_ps;
default:
return NULL;
}
@@ -145,24 +158,25 @@ struct rdma_bind_list {
unsigned short  port;
 };
 
-static int cma_ps_alloc(enum rdma_port_space ps,
+static int cma_ps_alloc(struct net *net, enum rdma_port_space ps,
struct rdma_bind_list *bind_list, int snum)
 {
-   struct idr *idr = cma_idr(ps);
+   struct idr *idr = cma_pernet_idr(net, ps);
 
return idr_alloc(idr, bind_list, snum, snum + 1, GFP_KERNEL);
 }
 
-static struct rdma_bind_list *cma_ps_find(enum rdma_port_space ps, int snum)
+static struct rdma_bind_list *cma_ps_find(struct net *net,
+ enum rdma_port_space ps, int snum)
 {
-   struct idr *idr = cma_idr(ps);
+   struct idr *idr = cma_pernet_idr(net, ps);
 
return idr_find(idr, snum);
 }
 
-static void cma_ps_remove(enum rdma_port_space ps, int snum)
+static void cma_ps_remove(struct net *net, enum rdma_port_space ps, int snum)
 {
-   struct idr *idr = cma_idr(ps);
+   struct idr *idr = cma_pernet_idr(net, ps);
 
idr_remove(idr, snum);
 }
@@ -1300,7 +1314,8 @@ static struct rdma_id_private *cma_id_from_event(struct ib_cm_id *cm_id,
}
}
 
-   bind_list = cma_ps_find(rdma_ps_from_service_id(req.service_id),
+   bind_list = cma_ps_find(&init_net,
+   rdma_ps_from_service_id(req.service_id),
cma_port_from_service_id(req.service_id));
id_priv = cma_find_listener(bind_list, cm_id, ib_event, &req, *net_dev);
if (IS_ERR(id_priv)) {
@@ -1378,7 +1393,7 @@ static void cma_release_port(struct rdma_id_private *id_priv)
mutex_lock(&lock);
hlist_del(&id_priv->node);
if (hlist_empty(&bind_list->owners)) {
-   cma_ps_remove(bind_list->ps, bind_list->port);
+   cma_ps_remove(&init_net, bind_list->ps, bind_list->port);
kfree(bind_list);
}
mutex_unlock(&lock);
@@ -2663,7 +2678,7 @@ static int cma_alloc_port(enum rdma_port_space ps,
if (!bind_list)
return -ENOMEM;
 
-   ret = cma_ps_alloc(ps, bind_list, snum);
+   ret = cma_ps_alloc(&init_net, ps, bind_list, snum);
if (ret < 0)
goto err;
 
@@ -2688,7 +2703,7 @@ static int cma_alloc_any_port(enum rdma_port_space ps,
rover = prandom_u32() % remaining + low;
 retry:
if (last_used_port != rover &&
-   !cma_ps_find(ps, (unsigned short)rover)) {
+   !cma_ps_find(&init_net, ps, (unsigned short)rover)) {
int ret = cma_alloc_port(ps, id_priv, rover);
/*
 * Remember previously used port number in order to avoid
@@ -2754,7 +2769,7 @@ static int cma_use_port(enum rdma_port_space ps,
if (snum < PROT_SO

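For reference, per-net state like the `struct cma_pernet` introduced above is normally wired up through the kernel's pernet registration API. A sketch of what that registration looks like, assuming the standard `net_generic` pattern (the registration hunk itself is truncated above, so the names here are illustrative):

```c
/* Sketch only: with .id and .size set, register_pernet_subsys()
 * allocates one struct cma_pernet per network namespace and records
 * cma_pernet_id so net_generic() can find it later. */
static int cma_init_net(struct net *net)
{
	struct cma_pernet *pernet = cma_pernet(net);

	idr_init(&pernet->tcp_ps);
	idr_init(&pernet->udp_ps);
	idr_init(&pernet->ipoib_ps);
	idr_init(&pernet->ib_ps);

	return 0;
}

static struct pernet_operations cma_pernet_operations = {
	.init = cma_init_net,
	.id   = &cma_pernet_id,
	.size = sizeof(struct cma_pernet),
};

/* in cma_init(): ret = register_pernet_subsys(&cma_pernet_operations); */
```

The `cma_pernet()` helper in the patch is then just a thin wrapper around `net_generic(net, cma_pernet_id)`.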
[PATCH] IB/cma: Fix net_dev reference leak with failed requests

2015-08-27 Thread Haggai Eran
When no matching listening ID is found for a given request, the net_dev
that was used to find the request isn't released.

Fixes: 20c36836ecad ("IB/cma: Use found net_dev for passive connections")
Signed-off-by: Haggai Eran 
---
 drivers/infiniband/core/cma.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c
index 9b306d7b5c27..b1ab13f3e182 100644
--- a/drivers/infiniband/core/cma.c
+++ b/drivers/infiniband/core/cma.c
@@ -1302,6 +1302,10 @@ static struct rdma_id_private *cma_id_from_event(struct ib_cm_id *cm_id,
bind_list = cma_ps_find(rdma_ps_from_service_id(req.service_id),
cma_port_from_service_id(req.service_id));
id_priv = cma_find_listener(bind_list, cm_id, ib_event, &req, *net_dev);
+   if (IS_ERR(id_priv)) {
+   dev_put(*net_dev);
+   *net_dev = NULL;
+   }
 
return id_priv;
 }
-- 
1.7.11.2

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH for-next 00/10] Add RoCE support to the mlx5 driver

2015-08-25 Thread Haggai Eran
On 25/08/2015 16:06, Achiad Shochat wrote:
> On 8/25/2015 3:32 PM, Tom Talpey wrote:
>> But I don't understand how it will work per-GID. What if the target
>> node is RoCEv2 and on another subnet? How will it discover the remote's
>> capability and establish the right protocol? How does the initiator
>> select the protocol, if there is a choice?
>>
>> Another way of asking this question is, why is all this stuff in the
>> driver, at the bottom of the stack? I think it should be in the
>> rdmacm layer.
>>
>> Tom.
>>
> 
> Selection of the RoCE version to be used per connection and discovering
> the RoCE versions supported by the target is up to the application.
> 
> It is not done by the driver of course, neither by the kernel stack.
> 
> The driver only exposes add/del_gid callbacks, used by the ib_core to
> configure the device GID table entries (including RoCE version per
> entry) based on netdev IPs configured in the system.

Let me add that when using RDMA CM, it is in charge of choosing the gid
index for a connection, and needs to be aware of the different
protocols. The patchset that includes the core changes for RoCE v2 [1]
allows configuring the preferred protocol for each port. There is also a
heuristic that selects RoCE v2 (which is routable) when the address
resolution detects more than one hop to the destination.

Regards,
Haggai

[1] http://www.spinics.net/lists/linux-rdma/msg28120.html



Re: [PATCH v9 0/4] Sending kernel pathrecord query to user cache server

2015-08-24 Thread Haggai Eran
On 24/08/2015 17:32, Wan, Kaike wrote:
>> On Fri, Aug 14, 2015 at 08:52:05AM -0400, kaike@intel.com wrote:
>> > 
>>> > > Some tests with namespace have been performed:
>>> > > 1. An unprivileged user cannot bind to the RDMA_NL_GROUP_LS multicast
>>> > >group;
>>> > > 2. An unprivileged user cannot create a new network namespace. However,
>>> > >it can create a new user namespace together with a new network
>>> > >namespace by using clone() with CLONE_NEWUSER | CLONE_NEWNET
>> > flags;
>>> > > 3. In the user and network namespaces created by an unprivileged user,
>>> > >the user can be mapped into root and thus be able to bind to the
>>> > >RDMA_NL_GROUP_LS multicast group. However, it can neither send
>>> > >requests to the kernel RDMA netlink code nor receive requests from
>>> > >it. This is because kernel RDMA netlink code associates itself with
>>> > >the init_net network namespace, which in turn associates itself with
>>> > >init_user_ns namespace.
>> > 
>> > Haggie, how does this coverage match your expectations with your
>> > namespace series?
>> > 
>> > Kaike, how does #3 work? 
> I created a test app that used clone() with CLONE_NEWUSER | CLONE_NEWNET to 
> create child process (modeled after the user_namespace man page example: 
> http://man7.org/linux/man-pages/man7/user_namespaces.7.html). Once the child 
> process was mapped to root (uid 0),   it created the netlink socket and bound 
> to the RDMA_NL_GROUP_LS and waited to receive requests from the kernel.
> 
> If I create a user namespace and try to bind it
>> > succeeds to userspace but ibnl_chk_listeners still returns false in the 
>> > kernel?
> ibnl_chk_listeners() actually returned 0 (success), indicating that there 
> were listeners. However, ibnl_multicast() failed. From the code of 
> netlink_has_listeners(), it is apparently that the check has nothing to do 
> with namespace (that's why it succeeded).

It looks like the ibnl socket (nls) is created with the &init_net 
network namespace, and netlink won't send multicasts to sockets on 
other namespaces (see [1]).

Haggai

[1] http://lxr.free-electrons.com/source/net/netlink/af_netlink.c?v=4.1#L1935


Re: [PATCH 1/3] IB/uverbs: reject invalid or unknown opcodes

2015-08-24 Thread Haggai Eran
On 24/08/2015 09:55, Christoph Hellwig wrote:
> On Mon, Aug 24, 2015 at 09:52:14AM +0300, Haggai Eran wrote:
>> Okay. Maybe you can just add a case for IB_WR_SEND in this patch to
>> avoid hurting bisectability.
> 
> I've done this already, just waiting for more feedback before resending:
> 
> http://git.infradead.org/users/hch/rdma.git/commitdiff/20f34ca8ecac302984f3a92b9ad29f5f4b41780d

Great.

>> Looking at the uverbs part in patch 2, I think the changes are okay. I
>> noticed there's a (__be32 __force) cast of the immediate data from
>> userspace (it was already in the existing code). I wonder, why not
>> define the field in the uapi struct as __be32 in the first place?
> 
> It looks odd to me as well, but it's not really something I want to
> change in this series.  Note that sparse annoted types like __be32
> aren't really common in userspace, but with a bit of effort they can
> be supported.  We have them and regularly run sparse for xfsprogs for
> example.

I have to try it with libibverbs sometime. It doesn't use uapi yet
though IIRC - it has its own version of the header files.



Re: [PATCH 1/3] IB/uverbs: reject invalid or unknown opcodes

2015-08-23 Thread Haggai Eran
On 22/08/2015 11:25, Christoph Hellwig wrote:
> On Sat, Aug 22, 2015 at 06:38:47AM +0000, Haggai Eran wrote:
>> It looks like the default case in the non-UD branch is currently used to 
>> handle plain IB_WR_SEND operations, so the patch would cause these to return 
>> an error.
> 
> Indeed.  It's handled fine in patch 2 which splits up the case, but
> will be incorrectly rejected with just this patch applied.

Okay. Maybe you can just add a case for IB_WR_SEND in this patch to
avoid hurting bisectability.

Looking at the uverbs part in patch 2, I think the changes are okay. I
noticed there's a (__be32 __force) cast of the immediate data from
userspace (it was already in the existing code). I wonder, why not
define the field in the uapi struct as __be32 in the first place?

Haggai


Re: [PATCH 1/3] IB/uverbs: reject invalid or unknown opcodes

2015-08-21 Thread Haggai Eran
On Thursday, August 20, 2015 11:52 AM, linux-rdma-ow...@vger.kernel.org on behalf of Sagi Grimberg wrote:
> On 8/19/2015 7:37 PM, Christoph Hellwig wrote:
>> We have many WR opcodes that are only supported in kernel space
>> and/or require optional information to be copied into the WR
>> structure.  Reject all those not explicitly handled so that we
>> can't pass invalid information to drivers.
>>
>> Cc: sta...@vger.kernel.org
>> Signed-off-by: Christoph Hellwig 
>> ---
>>   drivers/infiniband/core/uverbs_cmd.c | 9 -
>>   1 file changed, 8 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/infiniband/core/uverbs_cmd.c b/drivers/infiniband/core/uverbs_cmd.c
>> index a15318a..f9f3921 100644
>> --- a/drivers/infiniband/core/uverbs_cmd.c
>> +++ b/drivers/infiniband/core/uverbs_cmd.c
>> @@ -2372,6 +2372,12 @@ ssize_t ib_uverbs_post_send(struct ib_uverbs_file *file,
>>   next->send_flags = user_wr->send_flags;
>>
>>   if (is_ud) {
>> + if (next->opcode != IB_WR_SEND &&
>> + next->opcode != IB_WR_SEND_WITH_IMM) {
>> + ret = -EINVAL;
>> + goto out_put;
>> + }
>> +
>>   next->wr.ud.ah = idr_read_ah(user_wr->wr.ud.ah,
>>file->ucontext);
>>   if (!next->wr.ud.ah) {
>> @@ -2413,7 +2419,8 @@ ssize_t ib_uverbs_post_send(struct ib_uverbs_file *file,
>>   next->wr.atomic.rkey = user_wr->wr.atomic.rkey;
>>   break;
>>   default:
>> - break;
>> + ret = -EINVAL;
>> + goto out_put;
>>   }
>>   }
>>
>>
> 
> Reviewed-by: Sagi Grimberg 
> 
> Haggai, can you also have a look?

It looks like the default case in the non-UD branch is currently used to handle 
plain IB_WR_SEND operations, so the patch would cause these to return an error.

Haggai


Re: [PATCH v9 0/4] Sending kernel pathrecord query to user cache server

2015-08-21 Thread Haggai Eran
On Saturday, August 22, 2015 2:07 AM, Jason Gunthorpe wrote:
> On Fri, Aug 14, 2015 at 08:52:05AM -0400, kaike@intel.com wrote:
> 
>> Some tests with namespace have been performed:
>> 1. An unprivileged user cannot bind to the RDMA_NL_GROUP_LS multicast
>>group;
>> 2. An unprivileged user cannot create a new network namespace. However,
>>it can create a new user namespace together with a new network
>>namespace by using clone() with CLONE_NEWUSER | CLONE_NEWNET flags;
>> 3. In the user and network namespaces created by an unprivileged user,
>>the user can be mapped into root and thus be able to bind to the
>>RDMA_NL_GROUP_LS multicast group. However, it can neither send
>>requests to the kernel RDMA netlink code nor receive requests from
>>it. This is because kernel RDMA netlink code associates itself with
>>the init_net network namespace, which in turn associates itself with
>>init_user_ns namespace.
> 
> Haggie, how does this coverage match your expectations with your
> namespace series?

It sounds good. Our thinking was that the network namespace assigns netdevs to 
namespaces, and we don't want to assign an entire IB device to a namespace, so 
it isn't suitable for restricting applications that deal with IB directly. RDMA 
CM already used IP addressing, so it was suitable to be changed to work only
with the devices in a process's network namespace.

We also discussed creating a cgroup for RDMA later on, that could make sure a 
container can only use certain P_Keys, for applications that don't use RDMA CM. 
Such a cgroup could also be used to limit the SA queries a process can issue.

Haggai


Re: [PATCH] IB/sa: Restrict SA Netlink to admin users

2015-08-09 Thread Haggai Eran
On 07/08/2015 00:08, ira.we...@intel.com wrote:
> @@ -754,6 +764,12 @@ static int ib_nl_handle_resolve_resp(struct sk_buff *skb,
>   int found = 0;
>   int ret;
>  
> + if (!ns_capable(sock_net(skb->sk)->user_ns, CAP_NET_ADMIN)) {
> + pr_warn_ratelimited("SA netlink: invalid perm for response: 
> `%s'.\n",
> + current->comm);
> + return -EPERM;
> + }
> +
>   spin_lock_irqsave(&ib_nl_request_lock, flags);
>   list_for_each_entry(query, &ib_nl_request_list, list) {
>   /*

Maybe I'm missing something, but I thought you would want to check the
capability with init_user_ns as the target, since the SA queries will
affect all namespaces, not just the one that sent the response.

Haggai
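Concretely, since `capable(cap)` is a check against `init_user_ns`, the suggestion amounts to replacing the namespace-relative test in the hunk above with something like this (sketch, not a tested patch):

```c
	/* Check against the initial user namespace: a root-mapped user in
	 * an unprivileged user namespace must not pass this test, because
	 * SA responses affect all namespaces. */
	if (!capable(CAP_NET_ADMIN)) {
		pr_warn_ratelimited("SA netlink: invalid perm for response: `%s'.\n",
				    current->comm);
		return -EPERM;
	}
```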


Re: [PATCH for-next 1/2] IB/core: Add support for RX/TX checksum offload capabilities report

2015-08-06 Thread Haggai Eran
On 08/06/2015 02:18 PM, Parav Pandit wrote:
> On Thu, Aug 6, 2015 at 4:30 PM, Haggai Eran <hagg...@mellanox.com> wrote:
> 
> On Wednesday, August 5, 2015 8:16 PM, Jason Gunthorpe
> <jguntho...@obsidianresearch.com> wrote:
> > On Wed, Aug 05, 2015 at 06:34:26PM +0300, Amir Vadai wrote:
> >>  struct ib_uverbs_ex_query_device {
> >>   __u32 comp_mask;
> >> + __u32 csum_caps;
> >>   __u32 reserved;
> >>  };
> >
> > Uh no.
> This is the struct of the command, not the response. There is no
> need to extend it. The command is designed to always return as much
> information as possible, so the user space code doesn't need to pass
> anything for it to work.
> 
> Even if you did want to extend it, you would need to replace the
> reserved word. The structs in this header file must be made in such
> way that they have the same size on 32-bit systems and on 64-bit
> systems (see the comment at the beginning of the header file). This
> is why the reserved word is there.
> 
> >
> >> @@ -221,6 +222,7 @@ struct ib_uverbs_odp_caps {
> >>  struct ib_uverbs_ex_query_device_resp {
> >>   struct ib_uverbs_query_device_resp base;
> >>   __u32 comp_mask;
> >> + __u32 csum_caps;
> >>   __u32 response_length;
> >>   struct ib_uverbs_odp_caps odp_caps;
> >>   __u64 timestamp_mask;
> >
> > Also totally wrong.
> 
> The response struct must maintain backward compatibility. You cannot
> change the order of the existing fields. The only valid way of
> extending it is at the end. Here too, you must make sure that the
> struct has the same size on 32-bit systems, so you would need to add
> a 32-bit reserved word at the end.
> 
> Haggai
> 
> As struct ib_uverbs_ex_query_device_resp captures extended capabilities,
> does it make sense to have few more reserved words defined as part of
> this patch?
> So that later on those reserved can be defined in future for additional
> features.
> This way for every new feature we dont need to bump structure size of
> ABI, not we need to define new set of ABI calls.
> Its hard to say how much more is sufficient, but was thinking of 8
> 32-bit words.
> 

I don't see how increasing the size now would get you anything that
changing the returned response_length field wouldn't. I'm not sure what
you consider an ABI change. Doesn't adding new meaning to reserved
fields count as a change? In any case, increasing the response length
doesn't require adding new calls. The kernel code will agree to fill
only the fields that fit in the buffer provided by the user-space caller.

Haggai


Re: [PATCH for-next 1/2] IB/core: Add support for RX/TX checksum offload capabilities report

2015-08-06 Thread Haggai Eran
On Wednesday, August 5, 2015 8:16 PM, Jason Gunthorpe <jguntho...@obsidianresearch.com> wrote:
> On Wed, Aug 05, 2015 at 06:34:26PM +0300, Amir Vadai wrote:
>>  struct ib_uverbs_ex_query_device {
>>   __u32 comp_mask;
>> + __u32 csum_caps;
>>   __u32 reserved;
>>  };
> 
> Uh no.
This is the struct of the command, not the response. There is no need to extend 
it. The command is designed to always return as much information as possible, 
so the user space code doesn't need to pass anything for it to work.

Even if you did want to extend it, you would need to replace the reserved word. 
The structs in this header file must be made in such a way that they have the
same size on 32-bit systems and on 64-bit systems (see the comment at the 
beginning of the header file). This is why the reserved word is there. 

> 
>> @@ -221,6 +222,7 @@ struct ib_uverbs_odp_caps {
>>  struct ib_uverbs_ex_query_device_resp {
>>   struct ib_uverbs_query_device_resp base;
>>   __u32 comp_mask;
>> + __u32 csum_caps;
>>   __u32 response_length;
>>   struct ib_uverbs_odp_caps odp_caps;
>>   __u64 timestamp_mask;
> 
> Also totally wrong.

The response struct must maintain backward compatibility. You cannot change the 
order of the existing fields. The only valid way of extending it is at the end. 
Here too, you must make sure that the struct has the same size on 32-bit 
systems, so you would need to add a 32-bit reserved word at the end.

Haggai


Re: [PATCH v2 01/12] IB/core: Guarantee that a local_dma_lkey is available

2015-08-02 Thread Haggai Eran
On 31/07/2015 02:22, Jason Gunthorpe wrote:
>  int ib_dealloc_pd(struct ib_pd *pd)
>  {
> + if (pd->local_mr) {
> + if (ib_dereg_mr(pd->local_mr))
> + return -EBUSY;
> + pd->local_mr = NULL;
> + }
> +

It looks like ib_uverbs_alloc_pd calls ib_device.alloc_pd() directly,
and some drivers don't use kzalloc for allocating the pd, so the
ib_dereg_mr call above results in a general protection fault.

Haggai
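One way to close the hole, sketched from the description above (hypothetical, untested): have the uverbs path initialize the new field itself rather than relying on the driver having zeroed its allocation:

```c
	/* In ib_uverbs_alloc_pd(): the driver may have used kmalloc(),
	 * so pd->local_mr starts out as garbage unless the core clears it. */
	pd = file->device->ib_dev->alloc_pd(file->device->ib_dev,
					    file->ucontext, &udata);
	if (IS_ERR(pd)) {
		ret = PTR_ERR(pd);
		goto err;
	}
	pd->local_mr = NULL;	/* keep ib_dealloc_pd()'s dereg path safe */
```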


Re: [PATCH v3 00/13] Demux IB CM requests in the rdma_cm module

2015-07-30 Thread Haggai Eran
On Thursday, July 30, 2015 6:33 PM, Doug Ledford  wrote:
> To: Haggai Eran
> Cc: Liran Liss; linux-rdma@vger.kernel.org; Jason Gunthorpe
> Subject: Re: [PATCH v3 00/13] Demux IB CM requests in the rdma_cm module
> 
> On 07/30/2015 05:03 AM, Haggai Eran wrote:
>> On 29/07/2015 17:49, Doug Ledford wrote:
>>> This doesn't apply on to a clean 4.2-rc4 kernel tree.  Can you please
>>> rebase against either that or my to-be-rebase/for-4.3 branch of my
>>> github repo?
>>
>> Sure. Keep in mind that this patchset also depends on the patch from
>> Matan's series to protected the device and client list with a rwsem [2].
>> Do you want me to add it to this series?
> 
> You added it to the series, which is OK and I've picked that series up.
>  In general though, if there are two different patch series that both
> need the same patch, I would prefer it if that one patch is sent by
> itself so I can grab it and not included in either of the other two
> series.  I think some of the patches that didn't apply were when I
> forgot to remove a duplicate patch found in two different series.

Okay, I'll do that in the future.

In any case, there was a minor conflict between the client context patch and the 
hot removal patchset, so a rebase was necessary anyway.

Thanks,
Haggai

