Re: [PATCH] IB/ipoib: CSUM support in connected mode
Hi, Le mercredi 18 novembre 2015 à 12:46 +0200, Yuval Shaia a écrit : > On Thu, Jul 30, 2015 at 03:58:13PM +0200, Yann Droneaud wrote: > > Le jeudi 30 juillet 2015 à 04:46 -0700, Yuval Shaia a écrit : > > > This enhancement suggest the usage of IB CRC instead of CSUM in > > > IPoIB CM. IPoIB CM uses RC (Reliable Connection) which guarantees > > > the corruption free delivery of the packet. > > > > > > InfiniBand uses 32b CRC which provides stronger data integrity > > > protection compare to 16b IP Checksum. > > > > InfiniBand 32b CRC <=> Ethernet 32b CRC, it's link layer, layer 2. > > > > IPv4 checksum is at another level, it's internet layer, layer 3. > > > > > So, there is no added value that IP/TCP Checksum provides in the > > > IB world. > > > > > > > Sure, IPv4 checksum is a thing of the past: checksum was dropped > > from IP header in IPv6: it assumes the lower layer, such as > > Ethernet, provides the required integrety check. > Right, will change description to something like this: > InfiniBand 32b CRC offers strong data integrity protection compare > CSUM. This patch must clearly state that, without IPv4 header checksum, it's no more IP protocol as described by RFC 791 > > > > I think not checking the IPv4 checksum should be a choice, > > carefully thought, for inside a fabric, as I understand your > > proposal, packet with invalid checksum will be allowed to go in/out > > of the fabric. > If peer supports this feature do you see any reason why not? You're free to run you own protocol inside your private network, but this protocol is not compatible with IPv4. > > > > It sound like it's a departure from the behavior one can expect > > from an IPv4 network stack. > > > > > The proposal is to tell network stack that IPoIB-CM supports IP > > > Checksum offload. This enables the kernel to save the time of > > > checksum calculation of IPoIB CM packets. Network sends the IP > > > packet without adding the IP Checksum to the header. On the > > > receive side, IPoIB driver again tells the network stack that IP > > > Checksum is good for the incoming packets and network stack > > > avoids the IP Checksum calculations. > > > > > > During connection establishment the driver determine if peer > > > supports IB CRC as checksum. This is done so driver will be able > > > to calculate checksum before transmiting the packet in case the > > > peer does not support this feature. > > > > > > > Two questions: > > > > - What will see tool such as wireshark/tcpdump when sniffing > > checksum-less IPv4 packets sent/received on IPoIB interface ? > Never checked it but i assume 0 or what ever the kernel put there > when driver reports NETIF_F_IP_CSUM. I can check and zero if needed > but any reason to believe it is needed? I don't think it's needed if you agree that this modification creates a new protocol which is no more IPv4. > > > > - What might happen if such checksum-less IPv4 packet is later > > routed to a different IPv4 network ? > Same as my question above, if peer supports this feature do you see > anything wrong? If peer is going to forward this packet to a different network, which is not IPoIB based, the checksum will be checked and the packet will be rejected. Regards. -- Yann Droneaud OPTEYA -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH RFC 1/3] IB/core: Expose a device attribute for rdma_read access flags
Hi, Le mardi 10 novembre 2015 à 12:44 +0200, Sagi Grimberg a écrit : > Different devices may require different access permissions > for rdma reads e.g. for Infiniband devices, local write access > suffices while iWARP devices require remote write permissions as > well. > > This situation generates extra logic for ULPs that need to be aware > of the underlying device to determine rdma read access. Instead, > add a device attribute for ULPs to use without being aware of the > underlying device. > > Also, set rdma_read_access_flags in the relevant device drivers: > mlx4, mlx5, qib, ocrdma, nes: IB_ACCESS_LOCAL_WRITE > cxgb3, cxgb4, hfi: IB_ACCESS_LOCAL_WRITE | IB_ACCESS_REMOTE_WRITE > Why were those hw providers not modified to enforce IB_ACCESS_REMOTE_WRITE when needed, instead of asking users to set it for them ? Regards. -- Yann Droneaud OPTEYA -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] IB/core: avoid 32-bit warning
Hi, Le mercredi 07 octobre 2015 à 16:19 +0300, Sagi Grimberg a écrit : > On 10/7/2015 3:29 PM, Arnd Bergmann wrote: > > The INIT_UDATA() macro requires a pointer or unsigned long argument > > for > > both input and output buffer, and all callers had a cast from when > > the code was merged until a recent restructuring, so now we get > > > > core/uverbs_cmd.c: In function 'ib_uverbs_create_cq': > > core/uverbs_cmd.c:1481:66: warning: cast to pointer from integer of > > different size [-Wint-to-pointer-cast] > > > > This makes the code behave as before by adding back the cast to > > unsigned long. > > > > Signed-off-by: Arnd Bergmann <a...@arndb.de> > > Fixes: 565197dd8fb1 ("IB/core: Extend ib_uverbs_create_cq") > > > > diff --git a/drivers/infiniband/core/uverbs_cmd.c > > b/drivers/infiniband/core/uverbs_cmd.c > > index be4cb9f04be3..88b3b78340f2 100644 > > --- a/drivers/infiniband/core/uverbs_cmd.c > > +++ b/drivers/infiniband/core/uverbs_cmd.c > > @@ -1478,7 +1478,7 @@ ssize_t ib_uverbs_create_cq(struct > > ib_uverbs_file *file, > > if (copy_from_user(, buf, sizeof(cmd))) > > return -EFAULT; > > > > - INIT_UDATA(, buf, cmd.response, sizeof(cmd), > > sizeof(resp)); > > + INIT_UDATA(, buf, (unsigned long)cmd.response, > > sizeof(cmd), sizeof(resp)); > > Would it make sense to cast inside INIT_UDATA() and not have callers > worry about it? It's ... complicated. See INIT_UDATA_BUF_OR_NULL(). Awayway, I have patch to do the opposite, eg. explicitly cast u64 value to (void __user *)(unsigned long) in the caller function instead, as it allows some safer / new operations on the response buffer. Regards. -- Yann Droneaud OPTEYA -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] IB/core: avoid 32-bit warning
Hi, Le mercredi 07 octobre 2015 à 14:29 +0200, Arnd Bergmann a écrit : > The INIT_UDATA() macro requires a pointer or unsigned long argument > for both input and output buffer, and all callers had a cast from > when the code was merged until a recent restructuring, so now we get > > core/uverbs_cmd.c: In function 'ib_uverbs_create_cq': > core/uverbs_cmd.c:1481:66: warning: cast to pointer from integer of > different size [-Wint-to-pointer-cast] > > This makes the code behave as before by adding back the cast to > unsigned long. > > Signed-off-by: Arnd Bergmann <a...@arndb.de> > Fixes: 565197dd8fb1 ("IB/core: Extend ib_uverbs_create_cq") > Reviewed-by: Yann Droneaud <ydrone...@opteya.com> > diff --git a/drivers/infiniband/core/uverbs_cmd.c > b/drivers/infiniband/core/uverbs_cmd.c > index be4cb9f04be3..88b3b78340f2 100644 > --- a/drivers/infiniband/core/uverbs_cmd.c > +++ b/drivers/infiniband/core/uverbs_cmd.c > @@ -1478,7 +1478,7 @@ ssize_t ib_uverbs_create_cq(struct > ib_uverbs_file *file, > if (copy_from_user(, buf, sizeof(cmd))) > return -EFAULT; > > - INIT_UDATA(, buf, cmd.response, sizeof(cmd), > sizeof(resp)); > + INIT_UDATA(, buf, (unsigned long)cmd.response, > sizeof(cmd), sizeof(resp)); > > INIT_UDATA(, buf + sizeof(cmd), > (unsigned long)cmd.response + sizeof(resp), > Regards. -- Yann Droneaud OPTEYA -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: message size, was Re: merge struct ib_device_attr into struct ib_device
Hi, Le mardi 22 septembre 2015 à 23:55 +0200, 'Christoph Hellwig' a écrit : > On Tue, Sep 22, 2015 at 04:06:17PM -0500, Steve Wise wrote: > > Can you create a series of smaller patches that will fit on the > > list? > > That would make it easier for everyone to review/comment. > > I don't see how that is possible, as it's a flag day change. > Perhaps you could change ib_query_device() to use the ib_device_attr from struct ib_device so that ulp doesn't have to be modified at the same time the drivers. Then further patches can update the users of ib_query_device(), one at a time. And the last patch would remove ib_query_device(). Regards. -- Yann Droneaud -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: merge struct ib_device_attr into struct ib_device
Le lundi 21 septembre 2015 à 13:59 -0700, Christoph Hellwig a écrit : > This patch gets rid of struct ib_device_attr and cleans up drivers > nicely. > > It goes on top of my send_wr cleanups and the memory registration > udpates > from Sagi. > Is the patch missing ? Regards. -- Yann Droneaud OPTEYA -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH for-next 02/10] IB/core: Add flags parameter for ib_create_cq
Hi, Le dimanche 17 mai 2015 à 16:36 +0300, Or Gerlitz a écrit : diff --git a/drivers/infiniband/ulp/iser/iser_verbs.c b/drivers/infiniband/ulp/iser/iser_verbs.c index cc2dd35..922d322 100644 --- a/drivers/infiniband/ulp/iser/iser_verbs.c +++ b/drivers/infiniband/ulp/iser/iser_verbs.c @@ -126,11 +126,9 @@ static int iser_create_device_ib_res(struct iser_device *device) struct iser_comp *comp = device-comps[i]; comp-device = device; - comp-cq = ib_create_cq(device-ib_device, - iser_cq_callback, - iser_cq_event_callback, - (void *)comp, - max_cqe, i); + comp-cq = ib_create_cq(device-ib_device, iser_cq_callback, + iser_cq_event_callback, (void *)comp, + max_cqe, i, 0); Don't change indentation needlessly. if (IS_ERR(comp-cq)) { comp-cq = NULL; goto cq_err; diff --git a/drivers/infiniband/ulp/isert/ib_isert.c b/drivers/infiniband/ulp/isert/ib_isert.c index 327529e..f7d9ae0 100644 --- a/drivers/infiniband/ulp/isert/ib_isert.c +++ b/drivers/infiniband/ulp/isert/ib_isert.c @@ -320,11 +320,12 @@ isert_alloc_comps(struct isert_device *device, comp-device = device; INIT_WORK(comp-work, isert_cq_work); - comp-cq = ib_create_cq(device-ib_device, - isert_cq_callback, + comp-cq = ib_create_cq(device-ib_device, isert_cq_callback, Don't change indentation. isert_cq_event_callback, (void *)comp, - max_cqe, i); + max_cqe, + i, + 0); if (IS_ERR(comp-gt;cq)) { isert_err(Unable to allocate cq\n); ret = PTR_ERR(comp-gt;cq); diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h index ad0e2ea..1bbe4a4 100644 --- a/include/rdma/ib_verbs.h +++ b/include/rdma/ib_verbs.h @@ -173,6 +173,10 @@ struct ib_odp_caps { } per_transport_caps; }; +enum ib_cq_creation_flags { + IB_CQ_FLAGS_TIMESTAMP = 1 0, +}; + That's must be part of a different patch, otherwise the commit message is not true. struct ib_cq_init_attr { int cqe; int comp_vector; diff --git a/net/sunrpc/xprtrdma/svc_rdma_transport.c b/net/sunrpc/xprtrdma/svc_rdma_transport.c index 3df8320..cfb5915 100644 --- a/net/sunrpc/xprtrdma/svc_rdma_transport.c +++ b/net/sunrpc/xprtrdma/svc_rdma_transport.c @@ -901,21 +901,16 @@ static struct svc_xprt *svc_rdma_accept(struct svc_xprt *xprt) goto errout; } newxprt-sc_sq_cq = ib_create_cq(newxprt-sc_cm_id-device, -sq_comp_handler, -cq_event_handler, -newxprt, -newxprt-sc_sq_depth, -0); +sq_comp_handler, cq_event_handler, +newxprt, newxprt-sc_sq_depth, 0, 0); Don't change indentation. if (IS_ERR(newxprt-sc_sq_cq)) { dprintk(svcrdma: error creating SQ CQ for connect request\n); goto errout; } newxprt-sc_rq_cq = ib_create_cq(newxprt-sc_cm_id-device, -rq_comp_handler, -cq_event_handler, -newxprt, -newxprt-sc_max_requests, -0); +rq_comp_handler, cq_event_handler, +newxprt, newxprt-sc_max_requests, +0, 0); Don't change indentation. if (IS_ERR(newxprt-sc_rq_cq)) { dprintk(svcrdma: error creating RQ CQ for connect request\n); goto errout; Regards -- Yann Droneaud OPTEYA -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH for-next 09/10] IB/mlx4: Add timestamp_mask and hca_core_clock to query_device
Hi, Le mercredi 20 mai 2015 à 17:41 +0300, Or Gerlitz a écrit : On 5/20/2015 3:29 AM, Jason Gunthorpe wrote: On Tue, May 19, 2015 at 10:30:00PM +0300, Or Gerlitz wrote: Are you objecting adding the clock frequency and mask to the qeury device verb? why? Lets see the verbs side and I'll let you know. You mean the user series of libibverbs/libmlx4? I don't see why this should be a must for the review of the kernel bits. The user-space code is coming up soon, sure, but we should be able to review kernel patches without requiring to actually see the user-space code. In some other subsystems: no userspace code, no merge. http://blog.ffwll.ch/2015/05/gfx-kernel-upstreaming-requirements.html As the change-logs here explained, the clock frequency is needed for applications to convert the HCA clock delta (current time - timestamp on WC) into nano-secs and such.The mask is needed to realize how many bits from the 64b time-stamp are supported by the HW. If not, are objecting using the vendor specific track of the verb to pass from the vendor driver to the vendor library this or that detail which is needed for proper operation? why? I'm uncomfortable seeing otherwise vendor-neutral calls gain vendor extensions. But this is whole purpose of the udata framework in uverbs, right? for each uverb command the vendor user-space library has a well defined channel to communicate directly with the low level vendor driver throughout the uverbs channels. Uverbs convey information between kernel and userspace drivers to implement verbs for userspace application. I don't think it's designed to allow vendor to add random extensions in the best way with regard to backward/forward compability. Anyway, please, we have to make drivers which are going to behave as good citizens to the kernel *and* userspace. Adding a dedicated extensions which is going to be replaced later by a generic, vendor neutral, extension will be painful to maintain to ensure backward compatibility. So let's think how this timestamp extension can be made generic enough to be future-proof (and at least present proof to address current use cases). Regards. -- Yann Droneaud OPTEYA -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH for-next 01/10] IB/core: Change provider's API of create_cq to be extendible
Le mardi 19 mai 2015 à 12:45 -0600, Jason Gunthorpe a écrit : + struct ib_cq * (*create_cq)(struct ib_device *device, + struct ib_cq_init_attr *attr, const struct ib_cq_init_attr *attr, And related changes that will cause. I was going to ask for the same change. Regards. -- Yann Droneaud OPTEYA -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: rdma kernel tree
Hi, Le mardi 19 mai 2015 à 15:49 -0400, Doug Ledford a écrit : On Tue, 2015-05-19 at 21:47 +0300, Or Gerlitz wrote: On Tue, May 19, 2015 at 3:22 PM, Doug Ledford dledf...@redhat.com wrote: On Tue, 2015-05-19 at 07:35 +0300, Or Gerlitz wrote: I pulled that via a bundle from patchworks. I'll double check it. Did you check it out? fixed it out? I took a look now, you've rebased to-be-rebased/for-4.2 to 4.1-rc4 and Correct. it seems this is what you are going to push into the kernel.org treem Correct. but this series is still there with the zillion tested/reviewed/etc signature per one 2-3 patch, I think we've agreed this needs to be addressed prior to the upstream push, right? Incorrect. What you objected to before was the large Cc: list in the patches. That is gone. What is there now is just the reviewed-by: list of three people, and the tested-by list of two people. As the entire patch set as a whole was reviewed and tested by those people, it seems accurate to me. Doug, I have never ever seen a patch set (specifically the 15~23 part of it) with that level of simplicity That portion of the patchset didn't start out with that level of simplicity per patch, it evolved to that because it made review *significantly* easier. It's very simple to review a patch that does: Add 1 helper Replace tests in code with just that 1 helper because you can scroll through that patch and know that every line being replaced is related to that one helper. If you want to know every line that was replaced with rdma_cap_iw_cm, you go to that one patch and it's all listed very easy to read. On the other hand, when you squash all those patches together, review becomes much harder because if you want to see what a single helper does, you have to sift through all of the other helper changes and hope you find the right ones, and that you found all of the right ones. and signature/reviewers/tested-by/etc inflation. I added those myself as part of an automated addition. It applied to the entire series, so it was put on each patch. The people that tested/reviewed the series did not do so to individual patches, they hit all of them. And as was pointed out a couple of weeks ago on an earlier patchset I picked up, it is generally good behavior to give attribution where it is due to encourage people to participate. I agree with all the above replies from Doug: - small patches are easier to read and review: I prefer reviewing 10 littles patches than one big patch. - Signed-off-by:, Reviewed-by:, Tested-by: and Cc: are required too so that when it come to modifying existing code, we can contact people previously involved for reviews, tests, advice, etc. (I usually add Link: with a reference to the e-mail so that one commit can be traced back to the related patch submission and discussions (cover-letter !), and, to be able to trace back all patch iterations, I add a link to the previous patchset submission, and so on). Regards. -- Yann Droneaud OPTEYA -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] libibverbs init.c: remove stderr warnings if no userspace driver found
Hi, Le vendredi 08 mai 2015 à 11:21 -0700, Jeff Squyres a écrit : Signed-off-by: Jeff Squyres jsquy...@cisco.com This is a little short for an explanation: what was the issue with the error messages ? --- src/init.c | 14 -- 1 file changed, 14 deletions(-) diff --git a/src/init.c b/src/init.c index d0e4b1c..9c21768 100644 --- a/src/init.c +++ b/src/init.c @@ -557,19 +557,5 @@ HIDDEN int ibverbs_init(struct ibv_device ***list) } out: - for (sysfs_dev = sysfs_dev_list, - next_dev = sysfs_dev ? sysfs_dev-next : NULL; - sysfs_dev; - sysfs_dev = next_dev, next_dev = sysfs_dev ? sysfs_dev-next : NULL) { - if (!sysfs_dev-have_driver) { - fprintf(stderr, PFX Warning: no userspace device-specific - driver found for %s\n, sysfs_dev-sysfs_path); - if (statically_linked) - fprintf(stderr,When linking libibverbs statically, - driver must be statically linked too.\n); - } - free(sysfs_dev); I believe this free() was necessary to not leak some memory. - } - return num_devices; } Regards. -- Yann Droneaud OPTEYA -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH V1 net-next] IB/ipoib: Fix ndo_get_iflink
Hi, Le jeudi 16 avril 2015 à 16:34 +0300, Erez Shitrit a écrit : Currently, iflink of the parent interface was always accessed, even when interface didn't have a parent and hence we crashed there. + since commit 5aa7add8f14b ('infiniband/ipoib: implement ndo_get_iflink'). as there was no crash here before AFAIK. Handle the interface types properly: for a child interface, return the ifindex of the parent, for parent interface, return its ifindex. For child devices, make sure to set the parent pointer prior to invoking register_netdevice(), this allows the new ndo to be called by the stack immediately after the child device is registered. Fixes: 5aa7add8f14b ('infiniband/ipoib: implement ndo_get_iflink') Cc: Nicolas Dichtel nicolas.dich...@6wind.com Reported-by: Honggang Li ho...@redhat.com Signed-off-by: Erez Shitrit ere...@mellanox.com Signed-off-by: Honggang Li ho...@redhat.com --- changes from V0: - fixed two typos in the change-log drivers/infiniband/ulp/ipoib/ipoib_main.c | 5 + drivers/infiniband/ulp/ipoib/ipoib_vlan.c | 3 +-- 2 files changed, 6 insertions(+), 2 deletions(-) diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c b/drivers/infiniband/ulp/ipoib/ipoib_main.c index 657b89b..915ad04 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_main.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c @@ -846,6 +846,11 @@ static int ipoib_get_iflink(const struct net_device *dev) { struct ipoib_dev_priv *priv = netdev_priv(dev); + /* parent interface */ + if (!test_bit(IPOIB_FLAG_SUBINTERFACE, priv-flags)) + return dev-ifindex; + + /* child/vlan interface */ return priv-parent-ifindex; } diff --git a/drivers/infiniband/ulp/ipoib/ipoib_vlan.c b/drivers/infiniband/ulp/ipoib/ipoib_vlan.c index 4dd1313..fca1a88 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_vlan.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_vlan.c @@ -58,6 +58,7 @@ int __ipoib_vlan_add(struct ipoib_dev_priv *ppriv, struct ipoib_dev_priv *priv, /* MTU will be reset when mcast join happens */ priv-dev-mtu = IPOIB_UD_MTU(priv-max_ib_mtu); priv-mcast_mtu = priv-admin_mtu = priv-dev-mtu; + priv-parent = ppriv-dev; set_bit(IPOIB_FLAG_SUBINTERFACE, priv-flags); result = ipoib_set_dev_features(priv, ppriv-ca); @@ -84,8 +85,6 @@ int __ipoib_vlan_add(struct ipoib_dev_priv *ppriv, struct ipoib_dev_priv *priv, goto register_failed; } - priv-parent = ppriv-dev; - ipoib_create_debug_files(priv-dev); /* RTNL childs don't need proprietary sysfs entries */ Regards. -- Yann Droneaud OPTEYA -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v1 2/2] IB/core: don't disallow registering region starting at 0x0
Hi, Le mardi 14 avril 2015 à 12:20 +0300, Sagi Grimberg a écrit : On 4/13/2015 3:56 PM, Yann Droneaud wrote: In a call to ib_umem_get(), if address is 0x0 and size is already page aligned, check added in commit 8494057ab5e4 (IB/uverbs: Prevent integer overflow in ib_umem_get address arithmetic) will refuse to register a memory region that could otherwise be valid (provided vm.mmap_min_addr sysctl and mmap_low_allowed SELinux knobs allow userspace to map something at address 0x0). This patch allows back such registration: ib_umem_get() should probably don't care of the base address provided it can be pinned with get_user_pages(). There's two possible overflows, in (addr + size) and in PAGE_ALIGN(addr + size), this patch keep ensuring none of them happen while allowing to pin memory at address 0x0. Anyway, the case of size equal 0 is no more (partially) handled as 0-length memory region are disallowed by an earlier check. Link: http://mid.gmane.org/cover.1428929103.git.ydrone...@opteya.com Cc: sta...@vger.kernel.org # 8494057ab5e4 (IB/uverbs: Prevent integer overflow in ib_umem_get address arithmetic) Cc: Shachar Raindel rain...@mellanox.com Cc: Jack Morgenstein ja...@mellanox.com Cc: Or Gerlitz ogerl...@mellanox.com Signed-off-by: Yann Droneaud ydrone...@opteya.com --- drivers/infiniband/core/umem.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c index 9ac4068d2088..38acb3cfc545 100644 --- a/drivers/infiniband/core/umem.c +++ b/drivers/infiniband/core/umem.c @@ -106,8 +106,8 @@ struct ib_umem *ib_umem_get(struct ib_ucontext *context, unsigned long addr, * If the combination of the addr and size requested for this memory * region causes an integer overflow, return error. */ - if ((PAGE_ALIGN(addr + size) = size) || - (PAGE_ALIGN(addr + size) = addr)) + if (((addr + size) addr) || + PAGE_ALIGN(addr + size) (addr + size)) If you do change the first statement to be: (addr + size) = addr wouldn't it cover patch #1? Yes, but it doesn't sound a great place to do it: here it's about overflow, so I'd prefer not doing the null memory region check there. Regards. -- Yann Droneaud OPTEYA -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH linux-next 1/4] infiniband/ipoib: fix possible NULL pointer dereference in ipoib_get_iflink
Hi Nicolas, Le mardi 14 avril 2015 à 17:49 +0200, Nicolas Dichtel a écrit : Le 14/04/2015 17:44, Honggang LI a écrit : On Tue, Apr 14, 2015 at 08:34:33AM -0700, Eric Dumazet wrote: On Tue, 2015-04-14 at 23:20 +0800, Honggang Li wrote: Starting monitoring for VG vg_rdma01: 3 logical volume(s) in volume group vg_rdma01 monitored [ OK ] CR2: 0120 ---[ end trace a8610f6e9640eb85 ]--- Signed-off-by: Honggang Li ho...@redhat.com When was this bug added ? Sorry, I do not know. I ran into this bug today when I'm tracing a race condition issue related qib. In order to check upstream linux-next fix the race condition or not, I build linux-next-20150414 on two machines. Both machines panic at modprobe ib_ipoib. Do you means I need to report a bug? But I do not know report to who or where. Here is the tag: Fixes: 5aa7add8f14b (infiniband/ipoib: implement ndo_get_iflink) Pardon me, but this patch was never submitted to linux-rdma@vger.kernel.org for review !? Regards. -- Yann Droneaud OPTEYA -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v1 0/2] Fixes on top of CVE-2014-8159 kernel: infiniband: uverbs: unprotected physical memory access
Hi, Please find one patch to prevent a possible issue partially addressed by commit 8494057ab5e4 (IB/uverbs: Prevent integer overflow in ib_umem_get address arithmetic) (see discussions in [1]) and another one to add back the possibility of registering memory mapped at 0 (which is probably not something to be allowed, but it's probably not up to ib_umem_get() to prevent it). Changes from v0 [2]: - don't touch to overflow logic in first patch: not modifying the logic here so that the patch can be applied even on kernel without the overflow preventing checks, and second patch is going to rewrite the check. - don't break overflow detection in second patch: changing less or equal to less comparison broke the overflow detection logic regarding to rounding done by PAGE_ALIGN, so fixes this by checking for overflow in addr + size, then by checking for overflow in PAGE_ALIGN(addr + size). [1] Re: CVE-2014-8159 kernel: infiniband: uverbs: unprotected physical memory access http://mid.gmane.org/1428497043.22575.176.ca...@opteya.com http://marc.info/?i=1428497043.22575.176.ca...@opteya.com [2] [PATCH RESEND 0/2] Fixes on top of CVE-2014-8159 kernel: infiniband: uverbs: unprotected physical memory access http://mid.gmane.org/cover.1428523125.git.ydrone...@opteya.com http://marc.info/?i=cover.1428523125.git.ydrone...@opteya.com Yann Droneaud (2): IB/core: disallow registering 0-sized memory region IB/core: don't disallow registering region starting at 0x0 drivers/infiniband/core/umem.c | 7 +-- 1 file changed, 5 insertions(+), 2 deletions(-) -- 2.1.0 -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v1 1/2] IB/core: disallow registering 0-sized memory region
If ib_umem_get() is called with a size equal to 0 and an non-page aligned address, one page will be pinned and a 0-sized umem will be returned to the caller. This should not be allowed: it's not expected for a memory region to have a size equal to 0. This patch adds a check to explicitly refuse to register a 0-sized region. Link: http://mid.gmane.org/cover.1428929103.git.ydrone...@opteya.com Cc: sta...@vger.kernel.org Cc: Shachar Raindel rain...@mellanox.com Cc: Jack Morgenstein ja...@mellanox.com Cc: Or Gerlitz ogerl...@mellanox.com Signed-off-by: Yann Droneaud ydrone...@opteya.com --- drivers/infiniband/core/umem.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c index 8c014b5dab4c..9ac4068d2088 100644 --- a/drivers/infiniband/core/umem.c +++ b/drivers/infiniband/core/umem.c @@ -99,6 +99,9 @@ struct ib_umem *ib_umem_get(struct ib_ucontext *context, unsigned long addr, if (dmasync) dma_set_attr(DMA_ATTR_WRITE_BARRIER, attrs); + if (!size) + return ERR_PTR(-EINVAL); + /* * If the combination of the addr and size requested for this memory * region causes an integer overflow, return error. -- 2.1.0 -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: CVE-2014-8159 kernel: infiniband: uverbs: unprotected physical memory access
Hi, Le jeudi 02 avril 2015 à 18:12 +, Haggai Eran a écrit : On Thursday, April 2, 2015 7:44 PM, Shachar Raindel wrote: -Original Message- From: Yann Droneaud [mailto:ydrone...@opteya.com] Le jeudi 02 avril 2015 à 18:18 +0300, Haggai Eran a écrit : On 02/04/2015 16:30, Yann Droneaud wrote: Le jeudi 02 avril 2015 à 10:52 +, Shachar Raindel a écrit : -Original Message- From: Yann Droneaud [mailto:ydrone...@opteya.com] Sent: Thursday, April 02, 2015 1:05 PM Le mercredi 18 mars 2015 à 17:39 +, Shachar Raindel a écrit : + /* + * If the combination of the addr and size requested for this memory + * region causes an integer overflow, return error. + */ + if ((PAGE_ALIGN(addr + size) = size) || + (PAGE_ALIGN(addr + size) = addr)) + return ERR_PTR(-EINVAL); + Can access_ok() be used here ? if (!access_ok(writable ? VERIFY_WRITE : VERIFY_READ, addr, size)) return ERR_PTR(-EINVAL); No, this will break the current ODP semantics. ODP allows the user to register memory that is not accessible yet. This is a critical design feature, as it allows avoiding holding a registration cache. Adding this check will break the behavior, forcing memory to be all accessible when registering an ODP MR. Where's the check for the range being in userspace memory space, especially for the ODP case ? For non ODP case (eg. plain old behavior), does get_user_pages() ensure the requested pages fit in userspace region on all architectures ? I think so. Yes, get_user_pages will return a smaller amount of pages than requested if it encounters an unmapped region (or a region without write permissions for write requests). If this happens, the loop in ib_umem_get calls get_user_pages again with the next set of pages, and this time if it the first page still cannot be mapped an error is returned. In ODP case, I'm not sure such check is ever done ? In ODP, we also call get_user_pages, but only when a page fault occurs (see ib_umem_odp_map_dma_pages()). This allows the user to pre- register a memory region that contains unmapped virtual space, and then mmap different files into that area without needing to re-register. OK, thanks for the description. ... Another related question: as the large memory range could be registered by user space with ibv_reg_mr(pd, base, size, IB_ACCESS_ON_DEMAND), what's prevent the kernel to map a file as the result of mmap(0, ...) in this region, making it available remotely through IBV_WR_RDMA_READ / IBV_WR_RDMA_WRITE ? This is not a bug. This is a feature. Exposing a file through RDMA, using ODP, can be done exactly like this. Given that the application explicitly requested this behavior, I don't see why it is a problem. Actually, some of our tests use such flows. The mmu notifiers mechanism allow us to do this safely. When the page is written back to disk, it is removed from the ODP mapping. When it is accessed by the HCA, it is brought back to RAM. I want to add that we would like to see users registering a very large memory region (perhaps the entire process address space) for local access, and then enabling remote access only to specific regions using memory windows. However, this isn't supported yet by our driver. In such scheme, the registration must still be handled manually: one has to create a memory window to get a rkey to be exchanged with a peer, so why one would want to register such a large memory region (the whole process space) ? I guess creating the memory window is faster than registering memory region. I'd rather say this is not an excuse to register a larger memory region (up to the whole process space, current and future) as it sounds like a surprising optimisation: let the HCA known too many pages just to be sure it already knows some when the process want to use them. It seems it would become difficult to handle if there's too many processes. I would prefer creating memory region becoming costless (through ODP :). Still, there are valid cases where you would still want the results of an mmap(0,...) call to be remotely accessible, in cases where there is enough trust between the local process and the remote process. mmap(0, , fd) let the kernel choose where to put the file in process virtual memory space, so it may, may not, partially, end up in an ODP pre registered memory region for a range unallocated/unused yet. I don't think one want such to happen. It may help a middleware communication library register a large portion of the address space in advance, and still work with random pointers given to it by another application module. But as said in the beginnig of your message, the middleware would have bind a memory window
[PATCH v1 2/2] IB/core: don't disallow registering region starting at 0x0
In a call to ib_umem_get(), if address is 0x0 and size is already page aligned, check added in commit 8494057ab5e4 (IB/uverbs: Prevent integer overflow in ib_umem_get address arithmetic) will refuse to register a memory region that could otherwise be valid (provided vm.mmap_min_addr sysctl and mmap_low_allowed SELinux knobs allow userspace to map something at address 0x0). This patch allows back such registration: ib_umem_get() should probably don't care of the base address provided it can be pinned with get_user_pages(). There's two possible overflows, in (addr + size) and in PAGE_ALIGN(addr + size), this patch keep ensuring none of them happen while allowing to pin memory at address 0x0. Anyway, the case of size equal 0 is no more (partially) handled as 0-length memory region are disallowed by an earlier check. Link: http://mid.gmane.org/cover.1428929103.git.ydrone...@opteya.com Cc: sta...@vger.kernel.org # 8494057ab5e4 (IB/uverbs: Prevent integer overflow in ib_umem_get address arithmetic) Cc: Shachar Raindel rain...@mellanox.com Cc: Jack Morgenstein ja...@mellanox.com Cc: Or Gerlitz ogerl...@mellanox.com Signed-off-by: Yann Droneaud ydrone...@opteya.com --- drivers/infiniband/core/umem.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c index 9ac4068d2088..38acb3cfc545 100644 --- a/drivers/infiniband/core/umem.c +++ b/drivers/infiniband/core/umem.c @@ -106,8 +106,8 @@ struct ib_umem *ib_umem_get(struct ib_ucontext *context, unsigned long addr, * If the combination of the addr and size requested for this memory * region causes an integer overflow, return error. */ - if ((PAGE_ALIGN(addr + size) = size) || - (PAGE_ALIGN(addr + size) = addr)) + if (((addr + size) addr) || + PAGE_ALIGN(addr + size) (addr + size)) return ERR_PTR(-EINVAL); if (!can_do_mlock()) -- 2.1.0 -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Stepping down as maintainer (was Re: [PATCH for-next 0/9] mlx4 changes in virtual GID management)
Hi, Le vendredi 10 avril 2015 à 20:39 +, Hefty, Sean a écrit : I'm on a perhaps-unfortunately-timed vacation this week, but to make the transition smooth, I'll handle the easy patches that are in flight for 4.1, and then turn things over to the community for 4.2. (Unless the feeling is that it's better for me to step down immediately — I have no problem doing that, I just don't want to leave people in the lurch) This transition plan makes sense to me. I can dedicate more of my time helping to review kernel patches, at least for a couple of months. Hal is listed as the other sub-maintainer. He usually limits his input to the IB mgmt components, but maybe he can contribute more until we're over this hump? I'm not trying to volunteer anyone, but longer term, Doug appears to be in an ideal situation for maintaining the RDMA subtree. Because of his position at RedHat, he has a strong interest in ensuring user space compatibility, can test across a wide range of devices, and needs to deal with long term maintenance issues. That would make sense to have someone like Doug or Jason being the potential successors to Roland in order to keep InfiniBand subsystem vendor neutrality. Regards. -- Yann Droneaud OPTEYA -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: CVE-2014-8159 kernel: infiniband: uverbs: unprotected physical memory access
Hi, Le jeudi 02 avril 2015 à 16:34 +, Shachar Raindel a écrit : -Original Message- From: Yann Droneaud [mailto:ydrone...@opteya.com] Sent: Thursday, April 02, 2015 6:16 PM Le jeudi 02 avril 2015 à 10:52 +, Shachar Raindel a écrit : -Original Message- From: Yann Droneaud [mailto:ydrone...@opteya.com] Sent: Thursday, April 02, 2015 1:05 PM Le mercredi 18 mars 2015 à 17:39 +, Shachar Raindel a écrit : ... + /* + * If the combination of the addr and size requested for this memory + * region causes an integer overflow, return error. + */ + if ((PAGE_ALIGN(addr + size) = size) || + (PAGE_ALIGN(addr + size) = addr)) + return ERR_PTR(-EINVAL); + Can access_ok() be used here ? if (!access_ok(writable ? VERIFY_WRITE : VERIFY_READ, addr, size)) return ERR_PTR(-EINVAL); No, this will break the current ODP semantics. ODP allows the user to register memory that is not accessible yet. This is a critical design feature, as it allows avoiding holding a registration cache. Adding this check will break the behavior, forcing memory to be all accessible when registering an ODP MR. Failed to notice previously, but since this would break ODP, and ODP is only available starting v3.19-rc1, my proposed fix might be applicable for older kernel (if not better). Can you explain how this proposed fix is better than the existing patch? Why do we want to push to the stable tree a patch that is not in the upstream? There is an existing, tested, patch that is going to the tip of the development. It even applies cleanly on every kernel version around. access_ok() check for overflow *and* that the region is the memory range for the current process. The later check is not done in your proposed fix (but it should not be needed as get_user_pages() will be called to validate the whole region for non-ODP memory registration). Anyway, AFAIK access_ok() won't check for address being not NULL and size not being 0, and I've noticed your proposed fix also ensure address is not equal to NULL and, more important, that size is not equal to 0: before v3.15-rc1 and commit eeb8461e36c9 (IB: Refactor umem to use linear SG table), calling ib_umem_get() with size equal to 0 would succeed with any arbitrary address ... who knows what might happen in the lowlevel drivers (aka. providers) if they got an umem for a 0-sized memory region. This part of the changes was not detailled in your commit message: it's an issue not related to overflow which is addressed by your patch. So I agree my proposed patch is no better than yours: I've missed the 0-sized memory region issue and didn't take care of NULL address. Regards. -- Yann Droneaud OPTEYA -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: CVE-2014-8159 kernel: infiniband: uverbs: unprotected physical memory access
Hi, Le mercredi 08 avril 2015 à 14:19 +0200, Yann Droneaud a écrit : Le jeudi 02 avril 2015 à 16:34 +, Shachar Raindel a écrit : -Original Message- From: Yann Droneaud [mailto:ydrone...@opteya.com] Sent: Thursday, April 02, 2015 6:16 PM Le jeudi 02 avril 2015 à 10:52 +, Shachar Raindel a écrit : -Original Message- From: Yann Droneaud [mailto:ydrone...@opteya.com] Sent: Thursday, April 02, 2015 1:05 PM Le mercredi 18 mars 2015 à 17:39 +, Shachar Raindel a écrit : ... + /* +* If the combination of the addr and size requested for this memory +* region causes an integer overflow, return error. +*/ + if ((PAGE_ALIGN(addr + size) = size) || + (PAGE_ALIGN(addr + size) = addr)) + return ERR_PTR(-EINVAL); + Can access_ok() be used here ? if (!access_ok(writable ? VERIFY_WRITE : VERIFY_READ, addr, size)) return ERR_PTR(-EINVAL); No, this will break the current ODP semantics. ODP allows the user to register memory that is not accessible yet. This is a critical design feature, as it allows avoiding holding a registration cache. Adding this check will break the behavior, forcing memory to be all accessible when registering an ODP MR. Failed to notice previously, but since this would break ODP, and ODP is only available starting v3.19-rc1, my proposed fix might be applicable for older kernel (if not better). Can you explain how this proposed fix is better than the existing patch? Why do we want to push to the stable tree a patch that is not in the upstream? There is an existing, tested, patch that is going to the tip of the development. It even applies cleanly on every kernel version around. access_ok() check for overflow *and* that the region is the memory range for the current process. The later check is not done in your proposed fix (but it should not be needed as get_user_pages() will be called to validate the whole region for non-ODP memory registration). Anyway, AFAIK access_ok() won't check for address being not NULL and size not being 0, and I've noticed your proposed fix also ensure address is not equal to NULL and, more important, that size is not equal to 0 It only check address not being 0 if size is already PAGE_SIZE aligned, and it only check size not being 0 if address is already PAGE_SIZE aligned. before v3.15-rc1 and commit eeb8461e36c9 (IB: Refactor umem to use linear SG table), calling ib_umem_get() with size equal to 0 would succeed with any arbitrary address ... who knows what might happen in the lowlevel drivers (aka. providers) if they got an umem for a 0-sized memory region. This part of the changes was not detailled in your commit message: it's an issue not related to overflow which is addressed by your patch. So I agree my proposed patch is no better than yours: I've missed the 0-sized memory region issue and didn't take care of NULL address. Regards. -- Yann Droneaud OPTEYA -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 1/2] IB/core: disallow registering 0-sized memory region
If ib_umem_get() is called with a size equal to 0 and an non-page aligned address, one page will be pinned and a 0-sized umem will be returned to the caller. This should not be allowed: it's not expected for a memory region to have a size equal to 0. This patch adds a check to explicitly refuse to register a 0-sized region. Additionally, it updates check added in commit 8494057ab5e4 (IB/uverbs: Prevent integer overflow in ib_umem_get address arithmetic) to not care about 0-sized region: it would had catched 0-sized region only if address was already page aligned. Link: http://mid.gmane.org/cover.1428502843.git.ydrone...@opteya.com Cc: sta...@vger.kernel.org # 8494057ab5e4 (IB/uverbs: Prevent integer overflow in ib_umem_get address arithmetic) Cc: Shachar Raindel rain...@mellanox.com Cc: Jack Morgenstein ja...@mellanox.com Cc: Or Gerlitz ogerl...@mellanox.com Signed-off-by: Yann Droneaud ydrone...@opteya.com --- drivers/infiniband/core/umem.c | 5 - 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c index 8c014b5dab4c..cbe361645c1b 100644 --- a/drivers/infiniband/core/umem.c +++ b/drivers/infiniband/core/umem.c @@ -99,12 +99,15 @@ struct ib_umem *ib_umem_get(struct ib_ucontext *context, unsigned long addr, if (dmasync) dma_set_attr(DMA_ATTR_WRITE_BARRIER, attrs); + if (!size) + return ERR_PTR(-EINVAL); + /* * If the combination of the addr and size requested for this memory * region causes an integer overflow, return error. */ if ((PAGE_ALIGN(addr + size) = size) || - (PAGE_ALIGN(addr + size) = addr)) + (PAGE_ALIGN(addr + size) addr)) return ERR_PTR(-EINVAL); if (!can_do_mlock()) -- 2.1.0 -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 2/2] IB/core: don't disallow registering region starting at 0x0
In a call to ib_umem_get(), if address is 0x0 and size is already page aligned, check added in commit 8494057ab5e4 (IB/uverbs: Prevent integer overflow in ib_umem_get address arithmetic) will refuse to register a memory region that could otherwise be valid (provided vm.mmap_min_addr sysctl and mmap_low_allowed SELinux knobs allow userspace to map something at address 0x0). This patch allows back such registration: ib_umem_get() should probably don't care of the base address provided it can be pinned with get_user_pages(). Link: http://mid.gmane.org/cover.1428502843.git.ydrone...@opteya.com Cc: sta...@vger.kernel.org # 8494057ab5e4 (IB/uverbs: Prevent integer overflow in ib_umem_get address arithmetic) Cc: Shachar Raindel rain...@mellanox.com Cc: Jack Morgenstein ja...@mellanox.com Cc: Or Gerlitz ogerl...@mellanox.com Signed-off-by: Yann Droneaud ydrone...@opteya.com --- drivers/infiniband/core/umem.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c index cbe361645c1b..a0aadd05ab6d 100644 --- a/drivers/infiniband/core/umem.c +++ b/drivers/infiniband/core/umem.c @@ -106,7 +106,7 @@ struct ib_umem *ib_umem_get(struct ib_ucontext *context, unsigned long addr, * If the combination of the addr and size requested for this memory * region causes an integer overflow, return error. */ - if ((PAGE_ALIGN(addr + size) = size) || + if ((PAGE_ALIGN(addr + size) size) || (PAGE_ALIGN(addr + size) addr)) return ERR_PTR(-EINVAL); -- 2.1.0 -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 0/2] Fixes on top of CVE-2014-8159 kernel: infiniband: uverbs: unprotected physical memory access
Hi, Please find one patch to prevent a possible issue partially addressed by commit 8494057ab5e4 (IB/uverbs: Prevent integer overflow in ib_umem_get address arithmetic) (see discussions in [1]) and another one to add back the possibility of registering memory mapped at 0 (which is probably not something to be allowed, but it's not up to ib_umem_get() to prevent it). [1] Re: CVE-2014-8159 kernel: infiniband: uverbs: unprotected physical memory access http://mid.gmane.org/1428497043.22575.176.ca...@opteya.com http://marc.info/?i=1428497043.22575.176.ca...@opteya.com Regards. Yann Droneaud (2): IB/core: disallow registering 0-sized memory region IB/core: don't disallow registering region starting at 0x0 drivers/infiniband/core/umem.c | 7 +-- 1 file changed, 5 insertions(+), 2 deletions(-) -- 2.1.0 -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH RESEND 1/2] IB/core: disallow registering 0-sized memory region
If ib_umem_get() is called with a size equal to 0 and an non-page aligned address, one page will be pinned and a 0-sized umem will be returned to the caller. This should not be allowed: it's not expected for a memory region to have a size equal to 0. This patch adds a check to explicitly refuse to register a 0-sized region. Additionally, it updates check added in commit 8494057ab5e4 (IB/uverbs: Prevent integer overflow in ib_umem_get address arithmetic) to not care about 0-sized region: it would had catched 0-sized region only if address was already page aligned. Link: http://mid.gmane.org/cover.1428523125.git.ydrone...@opteya.com Cc: sta...@vger.kernel.org # 8494057ab5e4 (IB/uverbs: Prevent integer overflow in ib_umem_get address arithmetic) Cc: Shachar Raindel rain...@mellanox.com Cc: Jack Morgenstein ja...@mellanox.com Cc: Or Gerlitz ogerl...@mellanox.com Signed-off-by: Yann Droneaud ydrone...@opteya.com --- drivers/infiniband/core/umem.c | 5 - 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c index 8c014b5dab4c..cbe361645c1b 100644 --- a/drivers/infiniband/core/umem.c +++ b/drivers/infiniband/core/umem.c @@ -99,12 +99,15 @@ struct ib_umem *ib_umem_get(struct ib_ucontext *context, unsigned long addr, if (dmasync) dma_set_attr(DMA_ATTR_WRITE_BARRIER, attrs); + if (!size) + return ERR_PTR(-EINVAL); + /* * If the combination of the addr and size requested for this memory * region causes an integer overflow, return error. */ if ((PAGE_ALIGN(addr + size) = size) || - (PAGE_ALIGN(addr + size) = addr)) + (PAGE_ALIGN(addr + size) addr)) return ERR_PTR(-EINVAL); if (!can_do_mlock()) -- 2.1.0 -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH RESEND 2/2] IB/core: don't disallow registering region starting at 0x0
In a call to ib_umem_get(), if address is 0x0 and size is already page aligned, check added in commit 8494057ab5e4 (IB/uverbs: Prevent integer overflow in ib_umem_get address arithmetic) will refuse to register a memory region that could otherwise be valid (provided vm.mmap_min_addr sysctl and mmap_low_allowed SELinux knobs allow userspace to map something at address 0x0). This patch allows back such registration: ib_umem_get() should probably don't care of the base address provided it can be pinned with get_user_pages(). Link: http://mid.gmane.org/cover.1428523125.git.ydrone...@opteya.com Cc: sta...@vger.kernel.org # 8494057ab5e4 (IB/uverbs: Prevent integer overflow in ib_umem_get address arithmetic) Cc: Shachar Raindel rain...@mellanox.com Cc: Jack Morgenstein ja...@mellanox.com Cc: Or Gerlitz ogerl...@mellanox.com Signed-off-by: Yann Droneaud ydrone...@opteya.com --- drivers/infiniband/core/umem.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c index cbe361645c1b..a0aadd05ab6d 100644 --- a/drivers/infiniband/core/umem.c +++ b/drivers/infiniband/core/umem.c @@ -106,7 +106,7 @@ struct ib_umem *ib_umem_get(struct ib_ucontext *context, unsigned long addr, * If the combination of the addr and size requested for this memory * region causes an integer overflow, return error. */ - if ((PAGE_ALIGN(addr + size) = size) || + if ((PAGE_ALIGN(addr + size) size) || (PAGE_ALIGN(addr + size) addr)) return ERR_PTR(-EINVAL); -- 2.1.0 -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: CVE-2014-8159 kernel: infiniband: uverbs: unprotected physical memory access
Hi, Le vendredi 03 avril 2015 à 08:39 +, Haggai Eran a écrit : On Thursday, April 2, 2015 11:40 PM, Yann Droneaud ydrone...@opteya.com wrote: Le jeudi 02 avril 2015 à 16:44 +, Shachar Raindel a écrit : -Original Message- From: Yann Droneaud [mailto:ydrone...@opteya.com] Sent: Thursday, April 02, 2015 7:35 PM Another related question: as the large memory range could be registered by user space with ibv_reg_mr(pd, base, size, IB_ACCESS_ON_DEMAND), what's prevent the kernel to map a file as the result of mmap(0, ...) in this region, making it available remotely through IBV_WR_RDMA_READ / IBV_WR_RDMA_WRITE ? This is not a bug. This is a feature. Exposing a file through RDMA, using ODP, can be done exactly like this. Given that the application explicitly requested this behavior, I don't see why it is a problem. If the application cannot choose what will end up in the region it has registered, it's an issue ! What might happen if one library in a program call mmap(0, size, ...) to load a file storing a secret (a private key), and that file ends up being mapped in an registered but otherwise free region (afaict, the kernel is allowed to do it) ? What might happen if one library in a program call call mmap(0, size, ..., MAP_ANONYMOUS,...) to allocate memory, call mlock(), then write in this location a secret (a passphrase), and that area ends up in the memory region registered for on demand paging ? The application haven't choose to disclose these confidential piece of information, but they are available for reading/writing by remote through RDMA given it knows the rkey of the memory region (which is a 32bits value). I hope I'm missing something, because I'm not feeling confident such behavor is a feature. What we are aiming for is the possibility to register the entire process' address space for RDMA operations (if the process chooses to use this feature). This is similar to multiple threads accessing the same address space. I'm sure you wouldn't be complaining about the ability of one thread to access the secret passphrase mmapped by another thread in your example. I'm trying to understand how the application can choose what is exposed through RDMA if it registers a very large memory region for later use (but do not actually explicitly map something there yet): what's the consequences ? void *start = sbrk(0); size_t size = ULONG_MAX - (unsigned long)start; ibv_reg_mr(pd, start, size, IB_ACCESS_ON_DEMAND) The consequences are exactly as you wrote. Just as giving a non-ODP rkey to a remote node allows the node to access the registered memory behind that rkey, giving an ODP rkey to a remote node allows that node to access the virtual address space behind that rkey. There's a difference: it's impossible to give a valid non-ODP rkey that point to a memory region not already mapped (backed by a file for example), so the application *choose* the content of the memory to be made accessible remotely before making it accessible. As I understand the last explanation regarding ODP, at creation time, an ODP rkey can point to a free, unused, unallocated memory portion. At this point the kernel can happily map anything the application (and its libraries) want to map at a (almost) *random* address that could be in (or partially in) the ODP memory region. And I have a problem with such random behavior. Allowing this is seems dangerous and should be done with care. I believe the application must kept the control of what's end up in its ODP registered memory region. Especially for multi thread program: imagine one thread creating a large memory region for its future purposes, then send the rkey to a remote peer and wait for some work to be done. In the mean time another call mmap(0, ...) to map a file at a kernel chosen address, and that address happen to be in the memory region registered by the other thread: 1) the first thread is amputated from a portion of memory it was willing to use; 2) the data used by the second thread is accessible to the remote peer(s) while not expected. Speculatively registering memory seems dangerous for any use case I could think of. Regards. -- Yann Droneaud OPTEYA -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: CVE-2014-8159 kernel: infiniband: uverbs: unprotected physical memory access
Hi, Le jeudi 02 avril 2015 à 16:44 +, Shachar Raindel a écrit : -Original Message- From: Yann Droneaud [mailto:ydrone...@opteya.com] Sent: Thursday, April 02, 2015 7:35 PM Another related question: as the large memory range could be registered by user space with ibv_reg_mr(pd, base, size, IB_ACCESS_ON_DEMAND), what's prevent the kernel to map a file as the result of mmap(0, ...) in this region, making it available remotely through IBV_WR_RDMA_READ / IBV_WR_RDMA_WRITE ? This is not a bug. This is a feature. Exposing a file through RDMA, using ODP, can be done exactly like this. Given that the application explicitly requested this behavior, I don't see why it is a problem. If the application cannot choose what will end up in the region it has registered, it's an issue ! What might happen if one library in a program call mmap(0, size, ...) to load a file storing a secret (a private key), and that file ends up being mapped in an registered but otherwise free region (afaict, the kernel is allowed to do it) ? What might happen if one library in a program call call mmap(0, size, ..., MAP_ANONYMOUS,...) to allocate memory, call mlock(), then write in this location a secret (a passphrase), and that area ends up in the memory region registered for on demand paging ? The application haven't choose to disclose these confidential piece of information, but they are available for reading/writing by remote through RDMA given it knows the rkey of the memory region (which is a 32bits value). I hope I'm missing something, because I'm not feeling confident such behavor is a feature. Actually, some of our tests use such flows. The mmu notifiers mechanism allow us to do this safely. When the page is written back to disk, it is removed from the ODP mapping. When it is accessed by the HCA, it is brought back to RAM. I'm not discussing about the benefit of On Demand Paging and why it's a very good feature to expose files through RDMA. I'm trying to understand how the application can choose what is exposed through RDMA if it registers a very large memory region for later use (but do not actually explicitly map something there yet): what's the consequences ? void *start = sbrk(0); size_t size = ULONG_MAX - (unsigned long)start; ibv_reg_mr(pd, start, size, IB_ACCESS_ON_DEMAND) Regards. -- Yann Droneaud OPTEYA -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: CVE-2014-8159 kernel: infiniband: uverbs: unprotected physical memory access
Hi, Le mercredi 18 mars 2015 à 17:39 +, Shachar Raindel a écrit : Hi, It was found that the Linux kernel's InfiniBand/RDMA subsystem did not properly sanitize input parameters while registering memory regions from user space via the (u)verbs API. A local user with access to a /dev/infiniband/uverbsX device could use this flaw to crash the system or, potentially, escalate their privileges on the system. The issue has been assigned CVE-2014-8159. The issue exists in the InfiniBand/RDMA/iWARP drivers since Linux Kernel version 2.6.13. Mellanox OFED 2.4-1.0.4 fixes the issue. Available from: http://www.mellanox.com/page/products_dyn?product_family=26mtag=linux_sw_drivers RedHat errata: https://access.redhat.com/security/cve/CVE-2014-8159 Canonical errata: http://people.canonical.com/~ubuntu-security/cve/2014/CVE-2014-8159.html Novell (Suse) bug tracking: https://bugzilla.novell.com/show_bug.cgi?id=914742 The following patch fixes the issue: --- 8 -- From d4d68430d4a12c569e28b4f4468284ea2286 Mon Sep 17 00:00:00 2001 From: Shachar Raindel rain...@mellanox.com Date: Sun, 04 Jan 2015 18:30:32 +0200 Subject: [PATCH] IB/core: Prevent integer overflow in ib_umem_get address arithmetic Properly verify that the resulting page aligned end address is larger than both the start address and the length of the memory area requested. Both the start and length arguments for ib_umem_get are controlled by the user. A misbehaving user can provide values which will cause an integer overflow when calculating the page aligned end address. This overflow can cause also miscalculation of the number of pages mapped, and additional logic issues. Signed-off-by: Shachar Raindel rain...@mellanox.com Signed-off-by: Jack Morgenstein ja...@mellanox.com Signed-off-by: Or Gerlitz ogerl...@mellanox.com --- diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c index aec7a6a..8c014b5 100644 --- a/drivers/infiniband/core/umem.c +++ b/drivers/infiniband/core/umem.c @@ -99,6 +99,14 @@ if (dmasync) dma_set_attr(DMA_ATTR_WRITE_BARRIER, attrs); + /* + * If the combination of the addr and size requested for this memory + * region causes an integer overflow, return error. + */ + if ((PAGE_ALIGN(addr + size) = size) || + (PAGE_ALIGN(addr + size) = addr)) + return ERR_PTR(-EINVAL); + Can access_ok() be used here ? if (!access_ok(writable ? VERIFY_WRITE : VERIFY_READ, addr, size)) return ERR_PTR(-EINVAL); if (!can_do_mlock()) return ERR_PTR(-EPERM); Regards. -- Yann Droneaud OPTEYA -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: CVE-2014-8159 kernel: infiniband: uverbs: unprotected physical memory access
Hi, Le jeudi 02 avril 2015 à 10:52 +, Shachar Raindel a écrit : -Original Message- From: Yann Droneaud [mailto:ydrone...@opteya.com] Sent: Thursday, April 02, 2015 1:05 PM Le mercredi 18 mars 2015 à 17:39 +, Shachar Raindel a écrit : ... + /* + * If the combination of the addr and size requested for this memory + * region causes an integer overflow, return error. + */ + if ((PAGE_ALIGN(addr + size) = size) || + (PAGE_ALIGN(addr + size) = addr)) + return ERR_PTR(-EINVAL); + Can access_ok() be used here ? if (!access_ok(writable ? VERIFY_WRITE : VERIFY_READ, addr, size)) return ERR_PTR(-EINVAL); No, this will break the current ODP semantics. ODP allows the user to register memory that is not accessible yet. This is a critical design feature, as it allows avoiding holding a registration cache. Adding this check will break the behavior, forcing memory to be all accessible when registering an ODP MR. Failed to notice previously, but since this would break ODP, and ODP is only available starting v3.19-rc1, my proposed fix might be applicable for older kernel (if not better). From 2a3beaeb4b35d968f127cb59cfda2f12dbd87e4b Mon Sep 17 00:00:00 2001 From: Yann Droneaud ydrone...@opteya.com Date: Thu, 2 Apr 2015 17:01:05 +0200 Subject: [RFC PATCH] IB/core: reject invalid userspace memory range registration Signed-off-by: Yann Droneaud ydrone...@opteya.com --- drivers/infiniband/core/umem.c | 23 +++ 1 file changed, 15 insertions(+), 8 deletions(-) diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c index df0c4f605a21..6758b4ac56eb 100644 --- a/drivers/infiniband/core/umem.c +++ b/drivers/infiniband/core/umem.c @@ -90,6 +90,7 @@ struct ib_umem *ib_umem_get(struct ib_ucontext *context, unsigned long addr, DEFINE_DMA_ATTRS(attrs); struct scatterlist *sg, *sg_list_start; int need_release = 0; + bool writable; if (dmasync) dma_set_attr(DMA_ATTR_WRITE_BARRIER, attrs); @@ -97,6 +98,19 @@ struct ib_umem *ib_umem_get(struct ib_ucontext *context, unsigned long addr, if (!can_do_mlock()) return ERR_PTR(-EPERM); + /* +* We ask for writable memory if any access flags other than +* remote read are set. Local write and remote write +* obviously require write access. Remote atomic can do +* things like fetch and add, which will modify memory, and +* MW bind can change permissions by binding a window. +*/ + writable = !!(access ~IB_ACCESS_REMOTE_READ); + + if (!access_ok(writable ? VERIFY_WRITE : VERIFY_READ, + (void __user *)addr, size)) + return ERR_PTR(-EFAULT); + umem = kzalloc(sizeof *umem, GFP_KERNEL); if (!umem) return ERR_PTR(-ENOMEM); @@ -106,14 +120,7 @@ struct ib_umem *ib_umem_get(struct ib_ucontext *context, unsigned long addr, umem-offset= addr ~PAGE_MASK; umem-page_size = PAGE_SIZE; umem-pid = get_task_pid(current, PIDTYPE_PID); - /* -* We ask for writable memory if any access flags other than -* remote read are set. Local write and remote write -* obviously require write access. Remote atomic can do -* things like fetch and add, which will modify memory, and -* MW bind can change permissions by binding a window. -*/ - umem-writable = !!(access ~IB_ACCESS_REMOTE_READ); + umem-writable = writable; /* We assume the memory is from hugetlb until proved otherwise */ umem-hugetlb = 1; -- 2.1.0 Regards. -- Yann Droneaud OPTEYA -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: CVE-2014-8159 kernel: infiniband: uverbs: unprotected physical memory access
Hi, Le jeudi 02 avril 2015 à 10:52 +, Shachar Raindel a écrit : -Original Message- From: Yann Droneaud [mailto:ydrone...@opteya.com] Sent: Thursday, April 02, 2015 1:05 PM Le mercredi 18 mars 2015 à 17:39 +, Shachar Raindel a écrit : + /* + * If the combination of the addr and size requested for this memory + * region causes an integer overflow, return error. + */ + if ((PAGE_ALIGN(addr + size) = size) || + (PAGE_ALIGN(addr + size) = addr)) + return ERR_PTR(-EINVAL); + Can access_ok() be used here ? if (!access_ok(writable ? VERIFY_WRITE : VERIFY_READ, addr, size)) return ERR_PTR(-EINVAL); No, this will break the current ODP semantics. ODP allows the user to register memory that is not accessible yet. This is a critical design feature, as it allows avoiding holding a registration cache. Adding this check will break the behavior, forcing memory to be all accessible when registering an ODP MR. Where's the check for the range being in userspace memory space, especially for the ODP case ? For non ODP case (eg. plain old behavior), does get_user_pages() ensure the requested pages fit in userspace region on all architectures ? I think so. In ODP case, I'm not sure such check is ever done ? (Aside, does it take special mesure to protect shared mapping from being read and/or *written* ?) Regards. -- Yann Droneaud OPTEYA -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[-stable] commit 377b513485fd (IB/core: Avoid leakage from kernel to user space)
Hi, Please add commit 377b513485fd (IB/core: Avoid leakage from kernel to user space) to -stable. It can be applied to v2.6.32 and later. Regards. -- Yann Droneaud OPTEYA -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] rdma: replace deprecated ifconfig in doc
Hi, Le lundi 09 mars 2015 à 11:29 -0700, Stephen Hemminger a écrit : On Mon, 09 Mar 2015 13:06:30 +0100 Yann Droneaud ydrone...@opteya.com wrote: Le dimanche 08 mars 2015 à 16:36 -0700, Stephen Hemminger a écrit : The ifconfig command has been deprecated for many years. To encourage new users not to continue using it and learning iproute2; the ifconfig should not be used in examples. Signed-off-by: Stephen Hemminger step...@networkplumber.org --- a/Documentation/filesystems/nfs/nfs-rdma.txt 2014-09-28 16:27:20.393505440 -0700 +++ b/Documentation/filesystems/nfs/nfs-rdma.txt 2015-03-08 16:33:21.483985810 -0700 @@ -187,8 +187,10 @@ Check RDMA and NFS Setup To further test the InfiniBand software stack, use IPoIB (this assumes you have two IB hosts named host1 and host2): -host1$ ifconfig ib0 a.b.c.x -host2$ ifconfig ib0 a.b.c.y +host1$ ip link set dev ib0 up +host1$ ip address add dev ib0 a.b.c.x +host2$ ip link set dev ib0 up I don't think we need to up the device twice. You need to do it on each host. Oops, missed that. Sorry. +host2$ ip address add dev ib0 a.b.c.y host1$ ping a.b.c.y host2$ ping a.b.c.x @@ -229,7 +231,8 @@ NFS/RDMA Setup $ modprobe ib_mthca $ modprobe ib_ipoib -$ ifconfig ib0 a.b.c.d +$ ip li set dev ib0 up s/li/link ? ip commands support shorthand. Yes, sure, but here, it's looking more like a typo. It should be consistent throughout the document. Regards. -- Yann Droneaud OPTEYA -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] rdma: replace deprecated ifconfig in doc
Hi, Le dimanche 08 mars 2015 à 16:36 -0700, Stephen Hemminger a écrit : The ifconfig command has been deprecated for many years. To encourage new users not to continue using it and learning iproute2; the ifconfig should not be used in examples. Signed-off-by: Stephen Hemminger step...@networkplumber.org --- a/Documentation/filesystems/nfs/nfs-rdma.txt 2014-09-28 16:27:20.393505440 -0700 +++ b/Documentation/filesystems/nfs/nfs-rdma.txt 2015-03-08 16:33:21.483985810 -0700 @@ -187,8 +187,10 @@ Check RDMA and NFS Setup To further test the InfiniBand software stack, use IPoIB (this assumes you have two IB hosts named host1 and host2): -host1$ ifconfig ib0 a.b.c.x -host2$ ifconfig ib0 a.b.c.y +host1$ ip link set dev ib0 up +host1$ ip address add dev ib0 a.b.c.x +host2$ ip link set dev ib0 up I don't think we need to up the device twice. +host2$ ip address add dev ib0 a.b.c.y host1$ ping a.b.c.y host2$ ping a.b.c.x @@ -229,7 +231,8 @@ NFS/RDMA Setup $ modprobe ib_mthca $ modprobe ib_ipoib -$ ifconfig ib0 a.b.c.d +$ ip li set dev ib0 up s/li/link ? +$ ip addr add dev ib0 a.b.c.d NOTE: use unique addresses for the client and server Regards. -- Yann Droneaud OPTEYA -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2] IB/srp: Add 64-bit LUN support
-opcode = SRP_TSK_MGMT; - tsk_mgmt-lun = cpu_to_be64(((u64) lun) 48); + int_to_scsilun(lun, tsk_mgmt-lun); tsk_mgmt-tsk_mgmt_func = SRP_TSK_ABORT_TASK; tsk_mgmt-task_tag = (u64) found_evt; @@ -1652,7 +1652,7 @@ static int ibmvscsi_eh_device_reset_handler(struct scsi_cmnd *cmd) /* Set up a lun reset SRP command */ memset(tsk_mgmt, 0x00, sizeof(*tsk_mgmt)); tsk_mgmt-opcode = SRP_TSK_MGMT; - tsk_mgmt-lun = cpu_to_be64(((u64) lun) 48); + int_to_scsilun(lun, tsk_mgmt-lun); tsk_mgmt-tsk_mgmt_func = SRP_TSK_LUN_RESET; evt-sync_srp = srp_rsp; diff --git a/include/scsi/srp.h b/include/scsi/srp.h index 1ae84db..5be834d 100644 --- a/include/scsi/srp.h +++ b/include/scsi/srp.h @@ -42,6 +42,7 @@ */ #include linux/types.h +#include scsi/scsi.h enum { SRP_LOGIN_REQ = 0x00, @@ -179,7 +180,7 @@ struct srp_tsk_mgmt { u8 reserved1[6]; u64 tag; u8 reserved2[4]; - __be64 lun __attribute__((packed)); + struct scsi_lun lun; u8 reserved3[2]; u8 tsk_mgmt_func; u8 reserved4; @@ -200,7 +201,7 @@ struct srp_cmd { u8 data_in_desc_cnt; u64 tag; u8 reserved2[4]; - __be64 lun __attribute__((packed)); + struct scsi_lun lun; u8 reserved3; u8 task_attr; u8 reserved4; @@ -265,7 +266,7 @@ struct srp_aer_req { __be32 req_lim_delta; u64 tag; u32 reserved2; - __be64 lun; + struct scsi_lun lun; __be32 sense_data_len; u32 reserved3; u8 sense_data[0]; Reviewed-by: Yann Droneaud ydrone...@opteya.com Regards. -- Yann Droneaud OPTEYA -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH, resend] IB/srp: Add 64-bit LUN support
Hi, Le mercredi 04 mars 2015 à 11:01 +0100, Bart Van Assche a écrit : The SCSI standard defines 64-bit values for LUNs. Large arrays employing large or hierarchical LUN numbers become more and more common. So update the SRP initiator to use 64-bit LUN numbers. See also Hannes Reinecke, commit 9cb78c16f5da (scsi: use 64-bit LUNs), June 2014. Thanks for the added description. The largest LUN number that has been tested is 0xd2003fff. Checked the following structure sizes with gdb: * sizeof(struct srp_cmd) = 48 * sizeof(struct srp_tsk_mgmt) = 48 * sizeof(struct srp_aer_req) = 36 The ibmvscsi changes have been compile tested only on a PPC system. Signed-off-by: Bart Van Assche bart.vanass...@sandisk.com Reviewed-by: Hannes Reinecke h...@suse.de Cc: Sagi Grimberg sa...@mellanox.com Cc: Sebastian Parschauer sebastian.rie...@profitbricks.com Cc: Brian King brk...@linux.vnet.ibm.com Cc: Nathan Fontenot nf...@linux.vnet.ibm.com Cc: Tyrel Datwyler tyr...@linux.vnet.ibm.com --- drivers/infiniband/ulp/srp/ib_srp.c | 12 ++-- drivers/scsi/ibmvscsi/ibmvscsi.c| 6 +++--- include/scsi/srp.h | 7 --- 3 files changed, 13 insertions(+), 12 deletions(-) diff --git a/drivers/infiniband/ulp/srp/ib_srp.c b/drivers/infiniband/ulp/srp/ib_srp.c index a0e24a8..e427454 100644 --- a/drivers/infiniband/ulp/srp/ib_srp.c +++ b/drivers/infiniband/ulp/srp/ib_srp.c @@ -3146,7 +3146,7 @@ static ssize_t srp_create_target(struct device *dev, target_host-transportt = ib_srp_transport_template; target_host-max_channel = 0; target_host-max_id = 1; - target_host-max_lun = SRP_MAX_LUN; + target_host-max_lun = -1LL; I guess you can remove SRP_MAX_LUN from drivers/infiniband/ulp/srp/ib_srp.h too. target_host-max_cmd_len = sizeof ((struct srp_cmd *) (void *) 0L)-cdb; target = host_to_target(target_host); diff --git a/include/scsi/srp.h b/include/scsi/srp.h index 1ae84db..5be834d 100644 --- a/include/scsi/srp.h +++ b/include/scsi/srp.h @@ -42,6 +42,7 @@ */ #include linux/types.h +#include scsi/scsi.h enum { SRP_LOGIN_REQ = 0x00, @@ -54,7 +54,6 @@ enum { SRP_DLID_REDIRECT = 2, SRP_STALE_CONN = 3, - SRP_MAX_LUN = 512, SRP_DEF_SG_TABLESIZE= 12, SRP_DEFAULT_QUEUE_SIZE = 1 6, Regards. -- Yann Droneaud OPTEYA -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v1 1/5] IB/uverbs: ex_query_device: answer must not depend on request's comp_mask
Hi Or, Le samedi 31 janvier 2015 à 22:09 +0200, Or Gerlitz a écrit : On Fri, Jan 30, 2015, Yann Droneaud ydrone...@opteya.com wrote: [..] I have not the chance of owning HCA with the support for this feature, nor the patches libibverbs / libmlx5 ... (anyway I would not have the time to test). [...] Yann, have you ever run Linux over RDMA capable HW (IB, RoCE, iWARP or alike?) That was a silly question. So I've took a little time to consider answering it, perhaps you were gently kidding me. Anyway, please take the following links as an answer to your question: rdma_create_qp() and max_send_wr http://mid.gmane.org/1303404264.2243.19.ca...@deela.quest-ce.net ibv_create_qp() and max_inline_data behavior http://mid.gmane.org/d7685e13978f93890b2939c5ac2d5...@meuh.org [PATCH librdmacm 0/3] no IBV_SEND_INLINE thus memory registration required on QLogic/Intel HCA http://mid.gmane.org/cover.1376746185.git.ydrone...@opteya.com [PATCH 00/22] infiniband: improve userspace input check http://mid.gmane.org/cover.1376847403.git.ydrone...@opteya.com I hope this would be enough to join the gang^Wclub. Where're the by-laws ? Regards. -- Yann Droneaud OPTEYA -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] IB/ipatch: Use setup_timer and mod_timer
Le lundi 02 mars 2015 à 01:16 +0530, Vaishali Thakkar a écrit : Use timer API functions setup_timer and mod_timer instead of structure assignments as they are standard way to set the timer and to update the expire field of an active timer respectively. This is done using Coccinelle and semantic patch used for this is as follows: // smpl @@ expression x,y,z,a,b; @@ -init_timer (x); +setup_timer (x, y, z); +mod_timer (a, b); -x.function = y; -x.data = z; -x.expires = b; -add_timer(a); // /smpl Signed-off-by: Vaishali Thakkar vthakkar1...@gmail.com --- drivers/infiniband/hw/ipath/ipath_driver.c| 9 +++-- drivers/infiniband/hw/ipath/ipath_init_chip.c | 10 +++--- drivers/infiniband/hw/ipath/ipath_verbs.c | 7 ++- 3 files changed, 8 insertions(+), 18 deletions(-) [...] diff --git a/drivers/infiniband/hw/ipath/ipath_init_chip.c b/drivers/infiniband/hw/ipath/ipath_init_chip.c index be2a60e..34ffb43 100644 --- a/drivers/infiniband/hw/ipath/ipath_init_chip.c +++ b/drivers/infiniband/hw/ipath/ipath_init_chip.c @@ -950,13 +950,9 @@ int ipath_init_chip(struct ipath_devdata *dd, int reinit) * set up stats retrieval timer, even if we had errors * in last portion of setup */ - init_timer(dd-ipath_stats_timer); - dd-ipath_stats_timer.function = ipath_get_faststats; - dd-ipath_stats_timer.data = (unsigned long) dd; - /* every 5 seconds; */ - dd-ipath_stats_timer.expires = jiffies + 5 * HZ; - /* takes ~16 seconds to overflow at full IB 4x bandwdith */ - add_timer(dd-ipath_stats_timer); + setup_timer(dd-ipath_stats_timer, ipath_get_faststats, + (unsigned long)dd); + mod_timer(dd-ipath_stats_timer, jiffies + 5 * HZ); The code seems correct, but you remove the comments, loosing some useful information. Regards. -- Yann Droneaud OPTEYA -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH RE-RESEND V2 for-next 3/5] IB/core: Call ib_unregister_mad_agent() only for valid agents
Hi, Le lundi 16 février 2015 à 22:28 +0100, Yann Droneaud a écrit : Le jeudi 05 février 2015 à 13:53 +0200, Or Gerlitz a écrit : From: Majd Dibbiny m...@mellanox.com In some error flows, ib_mad_unregister_agent is being invoked also in cases where ^^^ ib_unregister_mad_agent() the ib_mad_register_agent call failed (resulting in an illegal pointer in the ^ ib_register_mad_agent() agent field). This causes a kernel crash in the error flow. Fix this by calling ib_unregister_mad_agent only for cases where ib_register_mad_agent succeeded. The code was checking for struct ib_mad_agent *agent not being NULL, while ib_register_mad_agent() returns a ERR_PTR() in case of error ... bad :( After reading Jason's comments [1], I don't understand the purpose of this patch. How can an ERR_PTR() value be present in the arrays ? A quick review makes me think that mthca and mlx4 error paths lacks NULL assignment after calling ib_unregister_mad_agent(). In other words, qib might be correct while the others should be fixed. [1] http://mid.gmane.org/20150205174337.gb31...@obsidianresearch.com Signed-off-by: Majd Dibbiny m...@mellanox.com Signed-off-by: Or Gerlitz ogerl...@mellanox.com --- drivers/infiniband/hw/mlx4/mad.c|2 +- drivers/infiniband/hw/mthca/mthca_mad.c |2 +- drivers/infiniband/hw/qib/qib_mad.c |2 +- 3 files changed, 3 insertions(+), 3 deletions(-) diff --git a/drivers/infiniband/hw/mlx4/mad.c b/drivers/infiniband/hw/mlx4/mad.c index 82a7dd8..6be0d2c 100644 --- a/drivers/infiniband/hw/mlx4/mad.c +++ b/drivers/infiniband/hw/mlx4/mad.c @@ -907,7 +907,7 @@ int mlx4_ib_mad_init(struct mlx4_ib_dev *dev) err: for (p = 0; p dev-num_ports; ++p) for (q = 0; q = 1; ++q) - if (dev-send_agent[p][q]) + if (!IS_ERR(dev-send_agent[p][q])) ib_unregister_mad_agent(dev-send_agent[p][q]); return ret; diff --git a/drivers/infiniband/hw/mthca/mthca_mad.c b/drivers/infiniband/hw/mthca/mthca_mad.c index 8881fa3..5f1a7ce 100644 --- a/drivers/infiniband/hw/mthca/mthca_mad.c +++ b/drivers/infiniband/hw/mthca/mthca_mad.c @@ -317,7 +317,7 @@ int mthca_create_agents(struct mthca_dev *dev) err: for (p = 0; p dev-limits.num_ports; ++p) for (q = 0; q = 1; ++q) - if (dev-send_agent[p][q]) + if (!IS_ERR(dev-send_agent[p][q])) ib_unregister_mad_agent(dev-send_agent[p][q]); return ret; diff --git a/drivers/infiniband/hw/qib/qib_mad.c b/drivers/infiniband/hw/qib/qib_mad.c index 636be11..6c7cc80 100644 --- a/drivers/infiniband/hw/qib/qib_mad.c +++ b/drivers/infiniband/hw/qib/qib_mad.c @@ -2499,7 +2499,7 @@ int qib_create_agents(struct qib_ibdev *dev) err: for (p = 0; p dd-num_pports; p++) { ibp = dd-pport[p].ibport_data; - if (ibp-send_agent) { + if (!IS_ERR(ibp-send_agent)) { agent = ibp-send_agent; ibp-send_agent = NULL; ib_unregister_mad_agent(agent); Regards. -- Yann Droneaud OPTEYA -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH RE-RESEND V2 for-next 5/5] IB/core: Fix deadlock on uverbs modify_qp error flow
Le jeudi 05 février 2015 à 13:53 +0200, Or Gerlitz a écrit : From: Moshe Lazer mos...@mellanox.com The deadlock occurs on __uverbs_modify_qp, we take a lock (idr_read_qp) and in case of failure in ib_resolve_eth_l2_attrs we don't release it (put_qp_read). Fix that. Issue: 355606 ^^ I'm not sure this is relevant Fixes: ed4c54e5b4ba (IB/core: Resolve Ethernet L2 addresses when modifying QP) Signed-off-by: Moshe Lazer mos...@mellanox.com Signed-off-by: Or Gerlitz ogerl...@mellanox.com --- drivers/infiniband/core/uverbs_cmd.c |9 + 1 files changed, 5 insertions(+), 4 deletions(-) diff --git a/drivers/infiniband/core/uverbs_cmd.c b/drivers/infiniband/core/uverbs_cmd.c index ecb6430..a602ce9 100644 --- a/drivers/infiniband/core/uverbs_cmd.c +++ b/drivers/infiniband/core/uverbs_cmd.c @@ -2097,20 +2097,21 @@ ssize_t ib_uverbs_modify_qp(struct ib_uverbs_file *file, if (qp-real_qp == qp) { ret = ib_resolve_eth_l2_attrs(qp, attr, cmd.attr_mask); if (ret) - goto out; + goto release_qp; ret = qp-device-modify_qp(qp, attr, modify_qp_mask(qp-qp_type, cmd.attr_mask), udata); } else { ret = ib_modify_qp(qp, attr, modify_qp_mask(qp-qp_type, cmd.attr_mask)); } - put_qp_read(qp); - if (ret) - goto out; + goto release_qp; ret = in_len; +release_qp: + put_qp_read(qp); + out: kfree(attr); Regards. -- Yann Droneaud OPTEYA -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH RE-RESEND V2 for-next 4/5] IB/core: Make sure that the PSN does not overflow
Hi, Le jeudi 05 février 2015 à 13:53 +0200, Or Gerlitz a écrit : From: Majd Dibbiny m...@mellanox.com The rq/sq-psn is 24 bits as defined in the IB spec, therefore ULPs and User space applications shouldn't use the 8 most significant bits in the 32 bits variables to avoid overflow in modify_qp. Fixed the PSN generation by the RDMA-CM to mask out the 8 most significant bits, also mask out these bits in uverbs for attributes provided by user-space. Signed-off-by: Majd Dibbiny m...@mellanox.com Signed-off-by: Or Gerlitz ogerl...@mellanox.com --- drivers/infiniband/core/cma.c|1 + drivers/infiniband/core/uverbs_cmd.c |4 ++-- 2 files changed, 3 insertions(+), 2 deletions(-) diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c index d570030..fab0ee5 100644 --- a/drivers/infiniband/core/cma.c +++ b/drivers/infiniband/core/cma.c @@ -512,6 +512,7 @@ struct rdma_cm_id *rdma_create_id(rdma_cm_event_handler event_handler, INIT_LIST_HEAD(id_priv-listen_list); INIT_LIST_HEAD(id_priv-mc_list); get_random_bytes(id_priv-seq_num, sizeof id_priv-seq_num); + id_priv-seq_num = 0xff; return id_priv-id; } diff --git a/drivers/infiniband/core/uverbs_cmd.c b/drivers/infiniband/core/uverbs_cmd.c index 532d8eb..ecb6430 100644 --- a/drivers/infiniband/core/uverbs_cmd.c +++ b/drivers/infiniband/core/uverbs_cmd.c @@ -2053,8 +2053,8 @@ ssize_t ib_uverbs_modify_qp(struct ib_uverbs_file *file, attr-path_mtu= cmd.path_mtu; attr-path_mig_state = cmd.path_mig_state; attr-qkey= cmd.qkey; - attr-rq_psn = cmd.rq_psn; - attr-sq_psn = cmd.sq_psn; + attr-rq_psn = cmd.rq_psn 0xff; + attr-sq_psn = cmd.sq_psn 0xff; attr-dest_qp_num = cmd.dest_qp_num; attr-qp_access_flags = cmd.qp_access_flags; attr-pkey_index = cmd.pkey_index; 0xff could be made a #define Question: userspace is allowed to ask for a PSN on 32bits, but it will be silently truncated, is it going to puzzle applications ? Anyway, it would have been better to return an error in the first place ... not sure if we can do it now ... Regards. -- Yann Droneaud OPTEYA -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/3] IB/core: Add support for extended query device caps
Le dimanche 08 février 2015 à 13:28 +0200, Haggai Eran a écrit : From: Eli Cohen e...@mellanox.com Add extensible query device capabilities verb to allow adding new features. ib_uverbs_ex_query_device is added and copy_query_dev_fields is used to copy capability fields to be used by both ib_uverbs_query_device and ib_uverbs_ex_query_device. Following the discussion about this patch [1], the code now validates the command's comp_mask is zero, returning -EINVAL for unknown values, in order to allow extending the verb in the future. The verb also checks the user-space provided response buffer size and only fills in capabilities that will fit in the buffer. In attempt to follow the spirit of presentation [2] by Tzahi Oved that was presented during OpenFabrics Alliance International Developer Workshop 2013, the comp_mask bits will only describe which fields are valid. Furthermore, fields that can simply be cleared when they are not supported, do not require a comp_mask bit at all. The verb returns a response_length field containing the actual number of bytes written by the kernel, so that a newer version running on an older kernel can tell which fields were actually returned. [1] [PATCH v1 0/5] IB/core: extended query device caps cleanup for v3.19 http://thread.gmane.org/gmane.linux.kernel.api/7889/ [2] https://www.openfabrics.org/images/docs/2013_Dev_Workshop/Tues_0423/2013_Workshop_Tues_0830_Tzahi_Oved-verbs_extensions_ofa_2013-tzahio.pdf Cc: Yann Droneaud ydrone...@opteya.com Cc: Ira Weiny ira.we...@intel.com Cc: Jason Gunthorpe jguntho...@obsidianresearch.com Signed-off-by: Eli Cohen e...@mellanox.com Signed-off-by: Haggai Eran hagg...@mellanox.com --- drivers/infiniband/core/uverbs.h | 1 + drivers/infiniband/core/uverbs_cmd.c | 131 +++--- drivers/infiniband/core/uverbs_main.c | 1 + include/uapi/rdma/ib_user_verbs.h | 12 4 files changed, 104 insertions(+), 41 deletions(-) diff --git a/drivers/infiniband/core/uverbs.h b/drivers/infiniband/core/uverbs.h index 643c08a025a5..b716b0815644 100644 --- a/drivers/infiniband/core/uverbs.h +++ b/drivers/infiniband/core/uverbs.h @@ -258,5 +258,6 @@ IB_UVERBS_DECLARE_CMD(close_xrcd); IB_UVERBS_DECLARE_EX_CMD(create_flow); IB_UVERBS_DECLARE_EX_CMD(destroy_flow); +IB_UVERBS_DECLARE_EX_CMD(query_device); #endif /* UVERBS_H */ diff --git a/drivers/infiniband/core/uverbs_cmd.c b/drivers/infiniband/core/uverbs_cmd.c index b7943ff16ed3..8f507538c42b 100644 --- a/drivers/infiniband/core/uverbs_cmd.c +++ b/drivers/infiniband/core/uverbs_cmd.c @@ -400,6 +400,52 @@ err: return ret; } +static void copy_query_dev_fields(struct ib_uverbs_file *file, + struct ib_uverbs_query_device_resp *resp, + struct ib_device_attr *attr) +{ + resp-fw_ver= attr-fw_ver; + resp-node_guid = file-device-ib_dev-node_guid; + resp-sys_image_guid= attr-sys_image_guid; + resp-max_mr_size = attr-max_mr_size; + resp-page_size_cap = attr-page_size_cap; + resp-vendor_id = attr-vendor_id; + resp-vendor_part_id= attr-vendor_part_id; + resp-hw_ver= attr-hw_ver; + resp-max_qp= attr-max_qp; + resp-max_qp_wr = attr-max_qp_wr; + resp-device_cap_flags = attr-device_cap_flags; + resp-max_sge = attr-max_sge; + resp-max_sge_rd= attr-max_sge_rd; + resp-max_cq= attr-max_cq; + resp-max_cqe = attr-max_cqe; + resp-max_mr= attr-max_mr; + resp-max_pd= attr-max_pd; + resp-max_qp_rd_atom= attr-max_qp_rd_atom; + resp-max_ee_rd_atom= attr-max_ee_rd_atom; + resp-max_res_rd_atom = attr-max_res_rd_atom; + resp-max_qp_init_rd_atom = attr-max_qp_init_rd_atom; + resp-max_ee_init_rd_atom = attr-max_ee_init_rd_atom; + resp-atomic_cap= attr-atomic_cap; + resp-max_ee= attr-max_ee; + resp-max_rdd = attr-max_rdd; + resp-max_mw= attr-max_mw; + resp-max_raw_ipv6_qp = attr-max_raw_ipv6_qp; + resp-max_raw_ethy_qp = attr-max_raw_ethy_qp; + resp-max_mcast_grp = attr-max_mcast_grp; + resp-max_mcast_qp_attach = attr-max_mcast_qp_attach; + resp-max_total_mcast_qp_attach = attr-max_total_mcast_qp_attach; + resp-max_ah= attr-max_ah; + resp-max_fmr = attr-max_fmr; + resp-max_map_per_fmr = attr-max_map_per_fmr; + resp-max_srq = attr-max_srq; + resp-max_srq_wr= attr-max_srq_wr; + resp-max_srq_sge = attr-max_srq_sge; + resp-max_pkeys = attr-max_pkeys; + resp-local_ca_ack_delay
Re: [PATCH 2/3] IB/core: Add on demand paging caps to ib_uverbs_ex_query_device
Hi, Le dimanche 08 février 2015 à 13:28 +0200, Haggai Eran a écrit : Add a on-demand capabilities reporting to the extended query device verb. Add on-demand paging (ODP) capabilities [...] Perhaps we could also add a mention such the following to get an idea of the purpose of those flags: see also commit 860f10a799c8 (IB/core: Add flags for on demand paging support) Yann Droneaud writes: Note: as offsetof() is used to retrieve the size of the lower chunk of the response, beware that it only works if the upper chunk is right after, without any implicit padding. And, as the size of the latter chunk is added to the base size, implicit padding at the end of the structure is not taken in account. Both point must be taken in account when extending the uverbs functionalities. Cc: Yann Droneaud ydrone...@opteya.com Cc: Ira Weiny ira.we...@intel.com Cc: Jason Gunthorpe jguntho...@obsidianresearch.com Cc: Eli Cohen e...@mellanox.com Signed-off-by: Haggai Eran hagg...@mellanox.com --- drivers/infiniband/core/uverbs_cmd.c | 20 +++- include/uapi/rdma/ib_user_verbs.h| 11 +++ 2 files changed, 30 insertions(+), 1 deletion(-) diff --git a/drivers/infiniband/core/uverbs_cmd.c b/drivers/infiniband/core/uverbs_cmd.c index 8f507538c42b..04ca04559ce5 100644 --- a/drivers/infiniband/core/uverbs_cmd.c +++ b/drivers/infiniband/core/uverbs_cmd.c @@ -3318,7 +3318,7 @@ int ib_uverbs_ex_query_device(struct ib_uverbs_file *file, if (cmd.reserved) return -EINVAL; - resp.response_length = sizeof(resp); + resp.response_length = offsetof(typeof(resp), odp_caps); if (ucore-outlen resp.response_length) return -ENOSPC; @@ -3330,6 +3330,24 @@ int ib_uverbs_ex_query_device(struct ib_uverbs_file *file, copy_query_dev_fields(file, resp.base, attr); resp.comp_mask = 0; + if (ucore-outlen resp.response_length + sizeof(resp.odp_caps)) + goto end; + +#ifdef CONFIG_INFINIBAND_ON_DEMAND_PAGING + resp.odp_caps.general_caps = attr.odp_caps.general_caps; + resp.odp_caps.per_transport_caps.rc_odp_caps = + attr.odp_caps.per_transport_caps.rc_odp_caps; + resp.odp_caps.per_transport_caps.uc_odp_caps = + attr.odp_caps.per_transport_caps.uc_odp_caps; + resp.odp_caps.per_transport_caps.ud_odp_caps = + attr.odp_caps.per_transport_caps.ud_odp_caps; + resp.odp_caps.reserved = 0; +#else + memset(resp.odp_caps, 0, sizeof(resp.odp_caps)); +#endif + resp.response_length += sizeof(resp.odp_caps); + +end: err = ib_copy_to_udata(ucore, resp, resp.response_length); if (err) return err; diff --git a/include/uapi/rdma/ib_user_verbs.h b/include/uapi/rdma/ib_user_verbs.h index f0f799afd856..b513e662d8e4 100644 --- a/include/uapi/rdma/ib_user_verbs.h +++ b/include/uapi/rdma/ib_user_verbs.h @@ -207,10 +207,21 @@ struct ib_uverbs_ex_query_device { __u32 reserved; }; +struct ib_uverbs_odp_caps { + __u64 general_caps; + struct { + __u32 rc_odp_caps; + __u32 uc_odp_caps; + __u32 ud_odp_caps; + } per_transport_caps; + __u32 reserved; +}; + struct ib_uverbs_ex_query_device_resp { struct ib_uverbs_query_device_resp base; __u32 comp_mask; __u32 response_length; + struct ib_uverbs_odp_caps odp_caps; }; struct ib_uverbs_query_port { Reviewed-by: Yann Droneaud ydrone...@opteya.com Thanks a lot for this updated patch. Regards. -- Yann Droneaud OPTEYA -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH RE-RESEND V2 for-next 3/5] IB/core: Call ib_unregister_mad_agent() only for valid agents
Le jeudi 05 février 2015 à 13:53 +0200, Or Gerlitz a écrit : From: Majd Dibbiny m...@mellanox.com In some error flows, ib_mad_unregister_agent is being invoked also in cases where ^^^ ib_unregister_mad_agent() the ib_mad_register_agent call failed (resulting in an illegal pointer in the ^ ib_register_mad_agent() agent field). This causes a kernel crash in the error flow. Fix this by calling ib_unregister_mad_agent only for cases where ib_register_mad_agent succeeded. The code was checking for struct ib_mad_agent *agent not being NULL, while ib_register_mad_agent() returns a ERR_PTR() in case of error ... bad :( Signed-off-by: Majd Dibbiny m...@mellanox.com Signed-off-by: Or Gerlitz ogerl...@mellanox.com --- drivers/infiniband/hw/mlx4/mad.c|2 +- drivers/infiniband/hw/mthca/mthca_mad.c |2 +- drivers/infiniband/hw/qib/qib_mad.c |2 +- 3 files changed, 3 insertions(+), 3 deletions(-) diff --git a/drivers/infiniband/hw/mlx4/mad.c b/drivers/infiniband/hw/mlx4/mad.c index 82a7dd8..6be0d2c 100644 --- a/drivers/infiniband/hw/mlx4/mad.c +++ b/drivers/infiniband/hw/mlx4/mad.c @@ -907,7 +907,7 @@ int mlx4_ib_mad_init(struct mlx4_ib_dev *dev) err: for (p = 0; p dev-num_ports; ++p) for (q = 0; q = 1; ++q) - if (dev-send_agent[p][q]) + if (!IS_ERR(dev-send_agent[p][q])) ib_unregister_mad_agent(dev-send_agent[p][q]); return ret; diff --git a/drivers/infiniband/hw/mthca/mthca_mad.c b/drivers/infiniband/hw/mthca/mthca_mad.c index 8881fa3..5f1a7ce 100644 --- a/drivers/infiniband/hw/mthca/mthca_mad.c +++ b/drivers/infiniband/hw/mthca/mthca_mad.c @@ -317,7 +317,7 @@ int mthca_create_agents(struct mthca_dev *dev) err: for (p = 0; p dev-limits.num_ports; ++p) for (q = 0; q = 1; ++q) - if (dev-send_agent[p][q]) + if (!IS_ERR(dev-send_agent[p][q])) ib_unregister_mad_agent(dev-send_agent[p][q]); return ret; diff --git a/drivers/infiniband/hw/qib/qib_mad.c b/drivers/infiniband/hw/qib/qib_mad.c index 636be11..6c7cc80 100644 --- a/drivers/infiniband/hw/qib/qib_mad.c +++ b/drivers/infiniband/hw/qib/qib_mad.c @@ -2499,7 +2499,7 @@ int qib_create_agents(struct qib_ibdev *dev) err: for (p = 0; p dd-num_pports; p++) { ibp = dd-pport[p].ibport_data; - if (ibp-send_agent) { + if (!IS_ERR(ibp-send_agent)) { agent = ibp-send_agent; ibp-send_agent = NULL; ib_unregister_mad_agent(agent); Regards. -- Yann Droneaud OPTEYA -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 3/3] IB/mlx5: Enable the ODP capability query verb
Hi, Le dimanche 08 février 2015 à 13:28 +0200, Haggai Eran a écrit : Re-enable the on-demand paging capability query through the extended query device verb. I would find useful to have a link to the initial commit in this one, so that the thing makes more sense: commit 8cdd312cfed7 (IB/mlx5: Implement the ODP capability query verb). Signed-off-by: Haggai Eran hagg...@mellanox.com --- drivers/infiniband/hw/mlx5/main.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/drivers/infiniband/hw/mlx5/main.c b/drivers/infiniband/hw/mlx5/main.c index 03bf81211a54..8a87404e9c76 100644 --- a/drivers/infiniband/hw/mlx5/main.c +++ b/drivers/infiniband/hw/mlx5/main.c @@ -1331,6 +1331,8 @@ static void *mlx5_ib_add(struct mlx5_core_dev *mdev) (1ull IB_USER_VERBS_CMD_DESTROY_SRQ) | (1ull IB_USER_VERBS_CMD_CREATE_XSRQ) | (1ull IB_USER_VERBS_CMD_OPEN_QP); + dev-ib_dev.uverbs_ex_cmd_mask = + (1ull IB_USER_VERBS_EX_CMD_QUERY_DEVICE); dev-ib_dev.query_device= mlx5_ib_query_device; dev-ib_dev.query_port = mlx5_ib_query_port; Reviewed-by: Yann Droneaud ydrone...@opteya.com Regards. -- Yann Droneaud OPTEYA -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] IB/srp: Add 64-bit LUN support
[2]; u8 tsk_mgmt_func; u8 reserved4; @@ -200,7 +201,7 @@ struct srp_cmd { u8 data_in_desc_cnt; u64 tag; u8 reserved2[4]; - __be64 lun __attribute__((packed)); + struct scsi_lun lun; u8 reserved3; u8 task_attr; u8 reserved4; @@ -265,7 +266,7 @@ struct srp_aer_req { __be32 req_lim_delta; u64 tag; u32 reserved2; - __be64 lun; + struct scsi_lun lun; __be32 sense_data_len; u32 reserved3; u8 sense_data[0]; Regards. -- Yann Droneaud OPTEYA -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] IB/srp: Add 64-bit LUN support
Hi Bart, Le mardi 10 février 2015 à 12:41 +0100, Bart Van Assche a écrit : On 02/10/15 11:13, Yann Droneaud wrote: Le lundi 02 février 2015 à 18:21 +0100, Bart Van Assche a écrit : The largest LUN number that has been tested is 0xd2003fff. Checked the following structure sizes with gdb: * sizeof(struct srp_cmd) = 48 * sizeof(struct srp_tsk_mgmt) = 48 * sizeof(struct srp_aer_req) = 36 You can use 'pahole' tool to retrieve the size (and layout) of any data structure from a .o object file. Thanks, I will have a look at that tool. The ibmvscsi changes have been compile tested only (on a PPC system). The changelog lack of any information about the purpose of 64bit LUNs. Please detail a bit why and how. You might have missed Hannes' patch series through which 64-bit LUN support was added in the SCSI core and in most SCSI LLD drivers. Yes I've missed this thread, and, as this commit doesn't have a clear reference to it, I feel I might not be alone wondering about it. The description of one of the patches in that series reads as follows: The SCSI standard defines 64-bit values for LUNs, and large arrays employing large or hierarchical LUN numbers become more and more common. So update the linux SCSI stack to use 64-bit LUN numbers. Is it OK for you if I resend this patch with that text added in the patch description ? That's would be perfect: it makes a lot more sense with these information added. Additionally, you could add a link to the patch series. Thanks a lot. Regards. -- Yann Droneaud OPTEYA -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [GIT PULL] please pull infiniband.git
Hi, Le vendredi 06 février 2015 à 13:19 -0800, Roland Dreier a écrit : Please pull from git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband.git tags/rdma-for-linus One more last-second RDMA change for 3.19: - Yann realized that the previous revert of new userspace ABI did not go far enough, and we're still exposing a change that we don't want. Revert even closer to 3.18 interface to make sure we get things right in the long run. Sorry for sending this at the very end of the release cycle, but we didn't realize the scope of the required fix until just now. I hope this could go in v3.19 as, at this stage, we don't want to expose any bits of this ABI in a released kernel. Regards. -- Yann Droneaud OPTEYA -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] Revert IB/core: Add support for extended query device caps
Hi, Le vendredi 06 février 2015 à 00:55 -0800, Roland Dreier a écrit : Thanks for noticing this at the last moment. I should have noticed this earlier, but, as small is beautiful, I was impressed by the patch from Haggai and thought it would better than an ugly revert ... Foolish I was. I will send this to Linus on Friday. Thanks a lot. And sorry about this epic adventure in ABI land :/ Regards. -- Yann Droneaud OPTEYA -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] Revert IB/core: Add support for extended query device caps
While commit 7e36ef8205ff (IB/core: Temporarily disable ex_query_device uverb) is correct as it make the extended QUERY_DEVICE uverb as integrated as part of commit 5a77abf9a97a (IB/core: Add support for extended query device caps) and commit 860f10a799c8 (IB/core: Add flags for on demand paging support) not available to userspace, it doesn't address the initial issue regarding ib_copy_to_udata() [1][2]. Additionally, further discussions around this new uverb seems to conclude it would require a different data structure than the one currently described in rdma/ib_user_verbs.h [3]. Both of these issues require a revert of the changes, so this patch partially revert commit 8cdd312cfed7 (IB/mlx5: Implement the ODP capability query verb), commit 860f10a799c8 (IB/core: Add flags for on demand paging support) and fully revert commit 5a77abf9a97a (IB/core: Add support for extended query device caps). [1] Re: [PATCH v3 06/17] IB/core: Add support for extended query device caps http://mid.gmane.org/1418733236.2779.26.ca...@opteya.com [2] Re: [PATCH] IB/core: Temporarily disable ex_query_device uverb http://mid.gmane.org/1423067503.3030.83.ca...@opteya.com [3] RE: [PATCH v1 1/5] IB/uverbs: ex_query_device: answer must not depend on request's comp_mask http://mid.gmane.org/2807e5fd2f6fda4886f6618eac48510e0cc12...@crsmsx101.amr.corp.intel.com Cc: Eli Cohen e...@mellanox.com Cc: Haggai Eran hagg...@mellanox.com Cc: Ira Weiny ira.we...@intel.com Cc: Jason Gunthorpe jguntho...@obsidianresearch.com Cc: Sagi Grimberg sa...@mellanox.com Cc: Shachar Raindel rain...@mellanox.com Signed-off-by: Yann Droneaud ydrone...@opteya.com --- drivers/infiniband/core/uverbs.h | 1 - drivers/infiniband/core/uverbs_cmd.c | 137 +++ drivers/infiniband/hw/mlx5/main.c| 2 - include/rdma/ib_verbs.h | 5 +- include/uapi/rdma/ib_user_verbs.h| 27 --- 5 files changed, 42 insertions(+), 130 deletions(-) diff --git a/drivers/infiniband/core/uverbs.h b/drivers/infiniband/core/uverbs.h index b716b0815644..643c08a025a5 100644 --- a/drivers/infiniband/core/uverbs.h +++ b/drivers/infiniband/core/uverbs.h @@ -258,6 +258,5 @@ IB_UVERBS_DECLARE_CMD(close_xrcd); IB_UVERBS_DECLARE_EX_CMD(create_flow); IB_UVERBS_DECLARE_EX_CMD(destroy_flow); -IB_UVERBS_DECLARE_EX_CMD(query_device); #endif /* UVERBS_H */ diff --git a/drivers/infiniband/core/uverbs_cmd.c b/drivers/infiniband/core/uverbs_cmd.c index 532d8eba8b02..b7943ff16ed3 100644 --- a/drivers/infiniband/core/uverbs_cmd.c +++ b/drivers/infiniband/core/uverbs_cmd.c @@ -400,52 +400,6 @@ err: return ret; } -static void copy_query_dev_fields(struct ib_uverbs_file *file, - struct ib_uverbs_query_device_resp *resp, - struct ib_device_attr *attr) -{ - resp-fw_ver= attr-fw_ver; - resp-node_guid = file-device-ib_dev-node_guid; - resp-sys_image_guid= attr-sys_image_guid; - resp-max_mr_size = attr-max_mr_size; - resp-page_size_cap = attr-page_size_cap; - resp-vendor_id = attr-vendor_id; - resp-vendor_part_id= attr-vendor_part_id; - resp-hw_ver= attr-hw_ver; - resp-max_qp= attr-max_qp; - resp-max_qp_wr = attr-max_qp_wr; - resp-device_cap_flags = attr-device_cap_flags; - resp-max_sge = attr-max_sge; - resp-max_sge_rd= attr-max_sge_rd; - resp-max_cq= attr-max_cq; - resp-max_cqe = attr-max_cqe; - resp-max_mr= attr-max_mr; - resp-max_pd= attr-max_pd; - resp-max_qp_rd_atom= attr-max_qp_rd_atom; - resp-max_ee_rd_atom= attr-max_ee_rd_atom; - resp-max_res_rd_atom = attr-max_res_rd_atom; - resp-max_qp_init_rd_atom = attr-max_qp_init_rd_atom; - resp-max_ee_init_rd_atom = attr-max_ee_init_rd_atom; - resp-atomic_cap= attr-atomic_cap; - resp-max_ee= attr-max_ee; - resp-max_rdd = attr-max_rdd; - resp-max_mw= attr-max_mw; - resp-max_raw_ipv6_qp = attr-max_raw_ipv6_qp; - resp-max_raw_ethy_qp = attr-max_raw_ethy_qp; - resp-max_mcast_grp = attr-max_mcast_grp; - resp-max_mcast_qp_attach = attr-max_mcast_qp_attach; - resp-max_total_mcast_qp_attach = attr-max_total_mcast_qp_attach; - resp-max_ah= attr-max_ah; - resp-max_fmr = attr-max_fmr; - resp-max_map_per_fmr = attr-max_map_per_fmr; - resp-max_srq = attr-max_srq; - resp-max_srq_wr= attr-max_srq_wr; - resp-max_srq_sge = attr-max_srq_sge; - resp-max_pkeys = attr-max_pkeys; - resp-local_ca_ack_delay
Re: [PATCH] IB/core: Temporarily disable ex_query_device uverb
Hi, Le lundi 02 février 2015 à 14:10 +0100, Yann Droneaud a écrit : Le dimanche 01 février 2015 à 15:35 +0200, Haggai Eran a écrit : Commit 5a77abf9a97a (IB/core: Add support for extended query device caps) added a new extended verb to query the capabilities of RDMA devices, but the semantics of this verb are still under debate [1]. Block access to this verb from user-space until the new semantics are in place. Cc: Yann Droneaud ydrone...@opteya.com Cc: Jason Gunthorpe jguntho...@obsidianresearch.com Cc: Eli Cohen e...@mellanox.com [1] [PATCH v1 0/5] IB/core: extended query device caps cleanup for v3.19 http://www.spinics.net/lists/linux-rdma/msg22904.html Signed-off-by: Haggai Eran hagg...@mellanox.com Reviewed-by: Yann Droneaud ydrone...@opteya.com drivers/infiniband/core/uverbs_main.c | 1 - 1 file changed, 1 deletion(-) diff --git a/drivers/infiniband/core/uverbs_main.c b/drivers/infiniband/core/uverbs_main.c index e6c23b9eab33..5db1a8cc388d 100644 --- a/drivers/infiniband/core/uverbs_main.c +++ b/drivers/infiniband/core/uverbs_main.c @@ -123,7 +123,6 @@ static int (*uverbs_ex_cmd_table[])(struct ib_uverbs_file *file, struct ib_udata *uhw) = { [IB_USER_VERBS_EX_CMD_CREATE_FLOW] = ib_uverbs_ex_create_flow, [IB_USER_VERBS_EX_CMD_DESTROY_FLOW] = ib_uverbs_ex_destroy_flow, - [IB_USER_VERBS_EX_CMD_QUERY_DEVICE] = ib_uverbs_ex_query_device }; static void ib_uverbs_add_one(struct ib_device *device); That's the smallest (and smartest) patch to be applied instead of reverting. Unfortunately, I've missed the issue I was complaining about in the first place [1]. And I feel a bit guilty having missed the issue. The present patch is fine as it fully disable the new extended QUERY_DEVICE uverb, but it doesn't disable the broken logic added in ib_copy_to_udata() by commit 5a77abf9a97a ('IB/core: Add support for extended query device caps'): diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h index 470a011d6fa4..97a999f9e4d8 100644 --- a/include/rdma/ib_verbs.h +++ b/include/rdma/ib_verbs.h @@ -1662,7 +1662,10 @@ static inline int ib_copy_from_udata(void *dest, struct ib_udata *udata, size_t static inline int ib_copy_to_udata(struct ib_udata *udata, void *src, size_t len) { - return copy_to_user(udata-outbuf, src, len) ? -EFAULT : 0; + size_t copy_sz; + + copy_sz = min_t(size_t, len, udata-outlen); + return copy_to_user(udata-outbuf, src, copy_sz) ? -EFAULT : 0; } That part of commit 5a77abf9a97a should be reverted as I'm not sure it doesn't introduce regressions, especially difficult to notice ones. Regards. [1] Re: [PATCH v3 06/17] IB/core: Add support for extended query device caps http://mid.gmane.org/1418733236.2779.26.ca...@opteya.com Regards. -- Yann Droneaud OPTEYA -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] IB/core: Temporarily disable ex_query_device uverb
Le dimanche 01 février 2015 à 15:35 +0200, Haggai Eran a écrit : Commit 5a77abf9a97a (IB/core: Add support for extended query device caps) added a new extended verb to query the capabilities of RDMA devices, but the semantics of this verb are still under debate [1]. Block access to this verb from user-space until the new semantics are in place. Cc: Yann Droneaud ydrone...@opteya.com Cc: Jason Gunthorpe jguntho...@obsidianresearch.com Cc: Eli Cohen e...@mellanox.com [1] [PATCH v1 0/5] IB/core: extended query device caps cleanup for v3.19 http://www.spinics.net/lists/linux-rdma/msg22904.html Signed-off-by: Haggai Eran hagg...@mellanox.com Reviewed-by: Yann Droneaud ydrone...@opteya.com --- Hi Roland, As I wrote in a separate mail, I hope you can accept Yann's patches to modify the new query device verb [1] in time for kernel 3.19. If you decide to revert the extended query device verb, here is a minimal patch to do block this verb temporarily, leaving the rest of the ODP infrastructure in-place. Regards, Haggai drivers/infiniband/core/uverbs_main.c | 1 - 1 file changed, 1 deletion(-) diff --git a/drivers/infiniband/core/uverbs_main.c b/drivers/infiniband/core/uverbs_main.c index e6c23b9eab33..5db1a8cc388d 100644 --- a/drivers/infiniband/core/uverbs_main.c +++ b/drivers/infiniband/core/uverbs_main.c @@ -123,7 +123,6 @@ static int (*uverbs_ex_cmd_table[])(struct ib_uverbs_file *file, struct ib_udata *uhw) = { [IB_USER_VERBS_EX_CMD_CREATE_FLOW] = ib_uverbs_ex_create_flow, [IB_USER_VERBS_EX_CMD_DESTROY_FLOW] = ib_uverbs_ex_destroy_flow, - [IB_USER_VERBS_EX_CMD_QUERY_DEVICE] = ib_uverbs_ex_query_device }; static void ib_uverbs_add_one(struct ib_device *device); That's the smallest (and smartest) patch to be applied instead of reverting. Regards. -- Yann Droneaud OPTEYA -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH for-next 05/10] IB/cm,cma: Move RDMA IP CM private-data parsing code from ib_cma to ib_cm
(cma_mask, 0xF); if (!cma_any_addr(addr)) { cma_data-dst_addr.ip6 = ip6_addr; memset(cma_mask-dst_addr.ip6, 0xFF, @@ -2615,35 +2510,6 @@ err1: } EXPORT_SYMBOL(rdma_bind_addr); -static int cma_format_hdr(void *hdr, struct rdma_id_private *id_priv) -{ - struct cma_hdr *cma_hdr; - - cma_hdr = hdr; - cma_hdr-cma_version = CMA_VERSION; - if (cma_family(id_priv) == AF_INET) { - struct sockaddr_in *src4, *dst4; - - src4 = (struct sockaddr_in *) cma_src_addr(id_priv); - dst4 = (struct sockaddr_in *) cma_dst_addr(id_priv); - - cma_set_ip_ver(cma_hdr, 4); - cma_hdr-src_addr.ip4.addr = src4-sin_addr.s_addr; - cma_hdr-dst_addr.ip4.addr = dst4-sin_addr.s_addr; - cma_hdr-port = src4-sin_port; - } else if (cma_family(id_priv) == AF_INET6) { - struct sockaddr_in6 *src6, *dst6; - - src6 = (struct sockaddr_in6 *) cma_src_addr(id_priv); - dst6 = (struct sockaddr_in6 *) cma_dst_addr(id_priv); - - cma_set_ip_ver(cma_hdr, 6); - cma_hdr-src_addr.ip6 = src6-sin6_addr; - cma_hdr-dst_addr.ip6 = dst6-sin6_addr; - cma_hdr-port = src6-sin6_port; - } - return 0; -} static int cma_sidr_rep_handler(struct ib_cm_id *cm_id, struct ib_cm_event *ib_event) @@ -2731,7 +2597,9 @@ static int cma_resolve_ib_udp(struct rdma_id_private *id_priv, conn_param-private_data_len); if (private_data) { - ret = cma_format_hdr(private_data, id_priv); + ret = cm_format_hdr(private_data, cma_family(id_priv), + cma_src_addr(id_priv), + cma_dst_addr(id_priv)); if (ret) goto out; req.private_data = private_data; @@ -2796,7 +2664,9 @@ static int cma_connect_ib(struct rdma_id_private *id_priv, route = id_priv-id.route; if (private_data) { - ret = cma_format_hdr(private_data, id_priv); + ret = cm_format_hdr(private_data, cma_family(id_priv), + cma_src_addr(id_priv), + cma_dst_addr(id_priv)); if (ret) goto out; req.private_data = private_data; diff --git a/include/rdma/ib_cm.h b/include/rdma/ib_cm.h index 0e3ff30647d5..e418a11afcfe 100644 --- a/include/rdma/ib_cm.h +++ b/include/rdma/ib_cm.h @@ -274,6 +274,52 @@ struct ib_cm_event { #define CM_LAP_ATTR_ID cpu_to_be16(0x0019) #define CM_APR_ATTR_ID cpu_to_be16(0x001A) +union cm_ip_addr { + struct in6_addr ip6; + struct { + __be32 pad[3]; + __be32 addr; + } ip4; +}; + +struct cm_hdr { + u8 cm_version; + u8 ip_version; /* IP version: 7:4 */ + __be16 port; + union cm_ip_addr src_addr; + union cm_ip_addr dst_addr; +}; + +#define RDMA_IP_CM_VERSION 0x00 + +/** + * cm_format_hdr - Fill in a cm_hdr struct according to connection details + * @hdr: cm_hdr struct to fill + * @family: ip family of the addresses - AF_INET or AF_INTET6 + * @src_addr: source address of the connection + * @dst_addr: destination address of the connection + **/ +int cm_format_hdr(void *hdr, int family, + struct sockaddr *src_addr, + struct sockaddr *dst_addr); + +/** + * cm_save_net_info - saves ib connection event details + * @src_addr: source address of the connection + * @dst_addr: destination address of the connection + * @ib_event: ib event to take connection details from + **/ +int cm_save_net_info(struct sockaddr *src_addr, + struct sockaddr *dst_addr, + struct ib_cm_event *ib_event); + +/** + * cm_set_ip_ver - sets the ip version of a cm_hdr struct + * @hdr:cm_hdr struct to change + * @ip_ver: ip version to set - a 4 bit value + **/ +void cm_set_ip_ver(struct cm_hdr *hdr, u8 ip_ver); + /** * ib_cm_handler - User-defined callback to process communication events. * @cm_id: Communication identifier associated with the reported event. Every other symbols in ib_cm.h are prefixed by ib_cm_, so I would prefer having symbols moved from ib_cma.c to ib_cm.c be renamed, except it would create a lot of code change ... Regards. -- Yann Droneaud OPTEYA -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH for-next 08/10] IB/cma: Add support for network namespaces
error; diff --git a/net/rds/ib.c b/net/rds/ib.c index ba2dffeff608..cc137f523248 100644 --- a/net/rds/ib.c +++ b/net/rds/ib.c @@ -326,7 +326,7 @@ static int rds_ib_laddr_check(__be32 addr) /* Create a CMA ID and try to bind it. This catches both * IB and iWARP capable NICs. */ - cm_id = rdma_create_id(NULL, NULL, RDMA_PS_TCP, IB_QPT_RC); + cm_id = rdma_create_id(NULL, NULL, RDMA_PS_TCP, IB_QPT_RC, init_net); if (IS_ERR(cm_id)) return PTR_ERR(cm_id); diff --git a/net/rds/ib_cm.c b/net/rds/ib_cm.c index 31b74f5e61ad..d19b91296ddc 100644 --- a/net/rds/ib_cm.c +++ b/net/rds/ib_cm.c @@ -584,7 +584,7 @@ int rds_ib_conn_connect(struct rds_connection *conn) /* XXX I wonder what affect the port space has */ /* delegate cm event handler to rdma_transport */ ic-i_cm_id = rdma_create_id(rds_rdma_cm_event_handler, conn, - RDMA_PS_TCP, IB_QPT_RC); + RDMA_PS_TCP, IB_QPT_RC, init_net); if (IS_ERR(ic-i_cm_id)) { ret = PTR_ERR(ic-i_cm_id); ic-i_cm_id = NULL; diff --git a/net/rds/iw.c b/net/rds/iw.c index 589935661d66..8501b73ed12f 100644 --- a/net/rds/iw.c +++ b/net/rds/iw.c @@ -227,7 +227,7 @@ static int rds_iw_laddr_check(__be32 addr) /* Create a CMA ID and try to bind it. This catches both * IB and iWARP capable NICs. */ - cm_id = rdma_create_id(NULL, NULL, RDMA_PS_TCP, IB_QPT_RC); + cm_id = rdma_create_id(NULL, NULL, RDMA_PS_TCP, IB_QPT_RC, init_net); if (IS_ERR(cm_id)) return PTR_ERR(cm_id); diff --git a/net/rds/iw_cm.c b/net/rds/iw_cm.c index a91e1db62ee6..e5ee2d562a60 100644 --- a/net/rds/iw_cm.c +++ b/net/rds/iw_cm.c @@ -521,7 +521,7 @@ int rds_iw_conn_connect(struct rds_connection *conn) /* XXX I wonder what affect the port space has */ /* delegate cm event handler to rdma_transport */ ic-i_cm_id = rdma_create_id(rds_rdma_cm_event_handler, conn, - RDMA_PS_TCP, IB_QPT_RC); + RDMA_PS_TCP, IB_QPT_RC, init_net); if (IS_ERR(ic-i_cm_id)) { ret = PTR_ERR(ic-i_cm_id); ic-i_cm_id = NULL; diff --git a/net/rds/rdma_transport.c b/net/rds/rdma_transport.c index 6cd9d1deafc3..066b60b27b12 100644 --- a/net/rds/rdma_transport.c +++ b/net/rds/rdma_transport.c @@ -160,7 +160,7 @@ static int rds_rdma_listen_init(void) int ret; cm_id = rdma_create_id(rds_rdma_cm_event_handler, NULL, RDMA_PS_TCP, -IB_QPT_RC); +IB_QPT_RC, init_net); if (IS_ERR(cm_id)) { ret = PTR_ERR(cm_id); printk(KERN_ERR RDS/RDMA: failed to setup listener, diff --git a/net/sunrpc/xprtrdma/svc_rdma_transport.c b/net/sunrpc/xprtrdma/svc_rdma_transport.c index 4e618808bc98..e3b246e305f9 100644 --- a/net/sunrpc/xprtrdma/svc_rdma_transport.c +++ b/net/sunrpc/xprtrdma/svc_rdma_transport.c @@ -701,7 +701,7 @@ static struct svc_xprt *svc_rdma_create(struct svc_serv *serv, xprt = cma_xprt-sc_xprt; listen_id = rdma_create_id(rdma_listen_handler, cma_xprt, RDMA_PS_TCP, -IB_QPT_RC); +IB_QPT_RC, init_net); if (IS_ERR(listen_id)) { ret = PTR_ERR(listen_id); dprintk(svcrdma: rdma_create_id failed = %d\n, ret); diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c index c98e40643910..f574e77165f4 100644 --- a/net/sunrpc/xprtrdma/verbs.c +++ b/net/sunrpc/xprtrdma/verbs.c @@ -483,7 +483,8 @@ rpcrdma_create_id(struct rpcrdma_xprt *xprt, init_completion(ia-ri_done); - id = rdma_create_id(rpcrdma_conn_upcall, xprt, RDMA_PS_TCP, IB_QPT_RC); + id = rdma_create_id(rpcrdma_conn_upcall, xprt, RDMA_PS_TCP, IB_QPT_RC, + init_net); if (IS_ERR(id)) { rc = PTR_ERR(id); dprintk(RPC: %s: rdma_create_id() failed %i\n, Regards. -- Yann Droneaud OPTEYA -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v1 2/5] IB/uverbs: ex_query_device: check request's comp_mask
Hi, Le dimanche 01 février 2015 à 10:12 +0200, Haggai Eran a écrit : On 29/01/2015 19:59, Yann Droneaud wrote: This patch ensures the extended QUERY_DEVICE uverbs request's comp_mask has only known and supported bits (currently none). If userspace set unknown features bits, -EINVAL will be returned, ensuring current programs are not allowed to set random feature bits: such bits could enable new extended features in future kernel versions and those features can trigger a behavior not unsupported by the older programs or make the newer kernels return an error for a request which was valid on older kernels. Additionally, returning an error for unsupported feature would allow userspace to probe/discover which extended features are currently supported by a kernel. As I wrote before, I hope in the future we don't force userspace to probe features this way, because it may be unnecessarily complex. I believe that most use cases won't need probing as applications are often built according to the current kernel features in mind. If applications need to use new features, it seems to be a small price to pay to be prepared to get -EINVAL. In another word: backward compatibility from application point of view: a newer application wanting to run on older kernel must be prepared to. I agree though that we should have a way to extend this verb in the future. Reviewed-by: Haggai Eran hagg...@mellanox.com Thanks. Link: http://mid.gmane.org/cover.1422553023.git.ydrone...@opteya.com Cc: Sagi Grimberg sa...@mellanox.com Cc: Shachar Raindel rain...@mellanox.com Cc: Eli Cohen e...@mellanox.com Cc: Haggai Eran hagg...@mellanox.com Signed-off-by: Yann Droneaud ydrone...@opteya.com --- drivers/infiniband/core/uverbs_cmd.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/drivers/infiniband/core/uverbs_cmd.c b/drivers/infiniband/core/uverbs_cmd.c index 6ef06a9b4362..fbcc54b86795 100644 --- a/drivers/infiniband/core/uverbs_cmd.c +++ b/drivers/infiniband/core/uverbs_cmd.c @@ -3312,6 +3312,9 @@ int ib_uverbs_ex_query_device(struct ib_uverbs_file *file, if (err) return err; + if (cmd.comp_mask) + return -EINVAL; + if (cmd.reserved) return -EINVAL; -- To unsubscribe from this list: send the line unsubscribe linux-api in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/2] infiniband: Help gcc generate better code for ocrdma_srq_toggle_bit
Hi, Le samedi 31 janvier 2015 à 00:00 +0100, Rasmus Villemoes a écrit : ping As you're fixing ocrdma driver, I think you might want to find people @emulex.com to review your patches. BTW, there's no MAINTAINERS entry for ocrdma driver ... which is a pity. On Fri, Jan 16 2015, Rasmus Villemoes li...@rasmusvillemoes.dk wrote: gcc emits a surprising amount of code in order to flip a bit. One would think that a single instruction is enough. $ scripts/bloat-o-meter /tmp/ocrdma_verbs.o drivers/infiniband/hw/ocrdma/ocrdma_verbs.o add/remove: 0/0 grow/shrink: 0/3 up/down: 0/-142 (-142) function old new delta ocrdma_post_srq_recv 498 460 -38 ocrdma_poll_cq 20101962 -48 ocrdma_discard_cqes 495 439 -56 All three calls of ocrdma_srq_toggle_bit happen within spinlocks, so saving a few useless instructions might be worthwhile. Signed-off-by: Rasmus Villemoes li...@rasmusvillemoes.dk --- drivers/infiniband/hw/ocrdma/ocrdma_verbs.c | 5 + 1 file changed, 1 insertion(+), 4 deletions(-) diff --git a/drivers/infiniband/hw/ocrdma/ocrdma_verbs.c b/drivers/infiniband/hw/ocrdma/ocrdma_verbs.c index fb8d8c4dfbb9..eff11e6c6183 100644 --- a/drivers/infiniband/hw/ocrdma/ocrdma_verbs.c +++ b/drivers/infiniband/hw/ocrdma/ocrdma_verbs.c @@ -1484,10 +1484,7 @@ static void ocrdma_srq_toggle_bit(struct ocrdma_srq *srq, int idx) int i = idx / 32; unsigned int mask = (1 (idx % 32)); - if (srq-idx_bit_fields[i] mask) - srq-idx_bit_fields[i] = ~mask; - else - srq-idx_bit_fields[i] |= mask; + srq-idx_bit_fields[i] ^= mask; } static int ocrdma_hwq_free_cnt(struct ocrdma_qp_hwq_info *q) -- Regards. -- Yann Droneaud OPTEYA -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v1 1/5] IB/uverbs: ex_query_device: answer must not depend on request's comp_mask
Hi, Le jeudi 29 janvier 2015 à 15:17 -0800, Roland Dreier a écrit : On Thu, Jan 29, 2015 at 1:59 PM, Yann Droneaud ydrone...@opteya.com wrote: Roland: I agree with Yann, these patches need to go in, or the ODP patches reverted. Reverting all On Demand Paging patches seems overkill: if something as to be reverted it should be commit 5a77abf9a97a (IB/core: Add support for extended query device caps) and the part of commit 860f10a799c8 (IB/core: Add flags for on demand paging support) which modify ib_uverbs_ex_query_device(). Thank you and Jason for taking on this interface. At this point I feel like I do about the IPoIB changes -- we should revert the broken stuff and get it right for 3.20. If we revert the two things you describe above, is everything else OK to leave in 3.19 with respect to ABI? I've tried to review every changes since v3.18 on drivers/infiniband include/rdma and include/uapi/rdma with respect to ABI issues. I've noticed no other issue, but I have to admit I've not well reviewed the drivers (hw/) internal changes. If the IB_USER_VERBS_EX_CMD_QUERY_DEVICE and ib_uverbs_ex_query_device changes are going to be reverted for v3.19, the on-demand-paging feature will be available (IB_DEVICE_ON_DEMAND_PAGING will be set device_cap_flags in response to non extended QUERY_DEVICE for mlx5 HCA and IB_ACCESS_ON_DEMAND access flag will be effective for REG_MR uverbs), but its parameters won't be. I don't know if it's a no-go for the usage of on-demand paging by userspace: I have not the chance of owning HCA with the support for this feature, nor the patches libibverbs / libmlx5 ... (anyway I would not have the time to test). I've hoped people from Mellanox would have commented on the revert option too. Regards. -- Yann Droneaud OPTEYA -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 3/4] IB/uverbs: ex_query_device: check request's comp_mask
Hi, Le mercredi 28 janvier 2015 à 17:40 +0200, Haggai Eran a écrit : On 28/01/2015 15:19, Yann Droneaud wrote: Le mardi 27 janvier 2015 à 08:50 +0200, Haggai Eran a écrit : On 26/01/2015 13:17, Yann Droneaud wrote: ... Le dimanche 25 janvier 2015 à 17:23 +0200, Haggai Eran a écrit : On 22/01/2015 15:28, Yann Droneaud wrote: This patch ensures the extended QUERY_DEVICE uverbs request's comp_mask has only known values. If userspace returns unknown features, -EINVAL will be returned, allowing to probe/discover which features are currently supported by the kernel. This probing process will be much more cumbersome than it needs to be because userspace will have to call QUERY_DEVICE repetitively with different comp_mask fields until it finds out the exact set of supported bits. O(log2(N)) I don't think user space developers will be happy having to do trial and error to determine what features the kernel driver supports. It might be even more then O(log2(N)). If my understanding of comp_mask bits usage is correct it would O(N). But it's not the time complexity I'm worried about, it's the fact that it requires user-space developers to go through hoops in order to get information that can be much more easily exported. But there's currently *NO* such mean that could give a hint to userspace developer whether one bit of request's comp_mask is currently effective in extended QUERY_DEVICE uverbs. While we have such mean in CREATE_FLOW and DESTROY_FLOW: unsupported bits trigger -EINVAL. In QUERY_DEVICE, userspace developer is allowed to set request's comp_mask to -1: it won't hurt it on current kernel, so why not using this value or any other random value ? The program will run: it's Works For Me. But the same program (either binary or source code) might fail on newer kernel where some bits in comp_mask gain a meaning not supported by the program. Why don't we make the command comp_mask in QUERY_DEVICE zero in both versions. The behavior of the command with comp_mask = 0 will be to return the maximum amount of data that fits in the response buffer. The kernel will return -EINVAL if the input command comp_mask is not zero in the current version. Yes, that's exactly what I wanted to do. If in the future we want to alter the behavior or add more input fields to QUERY_DEVICE, we would use new bits. Yes. Or you had to add a different interface, dedicated to retrieve the exact supported feature mask. Moreover, it also ensure the requested features set in comp_mask are sequentially set, not skipping intermediate features, eg. the highest requested feature also request all the lower ones. This way each new feature will have to be stacked on top of the existing ones: this is especially important for the request and response data structures where fields are added after the current ones when expanded to support a new feature. I think it is perfectly acceptable that not all drivers will implement all available features, and so you shouldn't enforce this requirement. With regard to QUERY_DEVICE: the data structure layout depends on the comp_mask value. So either you propose a way to express multipart data structure (see CMSG or NETLINK), or we have to ensure the data structure is ever-growing, with each new chunck stacked over the existing ones: that's the purpose of : if (cmd.comp_mask (cmd.comp_mask + 1)) return -EINVAL; Also, it makes the comp_mask nothing more than a wasteful version number between 0 and 63. That's what I've already explained earlier in Re: [PATCH v3 06/17] IB/core: Add support for extended query device caps: http://mid.gmane.org/1421844612.13543.40.ca...@opteya.com Yes, you wrote there: Regarding comp_mask (not for this particular verb): It's not clear whether request's comp_mask describe the request or the response, as such I'm puzzled. How would the kernel and the userspace be able to parse the request and the response if they ignore unknown bits ? How would they be able to skip the unrecognized chunk of the request and response buffer ? How one bit in a comp_mask is related to a chunk in the request or response ? It's likely the kernel or userspace would have to skip the remaining comp_mask's bits after encountering an unknown bit as the size of the corresponding chunk in request or response would be unknown, making impossible to locate the corresponding chunk for the next bit set in comp_mask. Having said that, comp_mask is just a complicated way of expressing a version, which is turn describe a size (ever growing). It is my understanding that each comp_mask bit marks a set of fields in the command or in the response struct as valid, so the struct format remains the same and the kernel and userspace don't need to make difficult calculation as to where each field is, but you
[PATCH v1 4/5] IB/uverbs: ex_query_device: no need to clear the whole structure
As only the requested fields are set and copied to userspace, there's no need to clear the content of the response structure beforehand. Link: http://mid.gmane.org/cover.1422553023.git.ydrone...@opteya.com Cc: Sagi Grimberg sa...@mellanox.com Cc: Shachar Raindel rain...@mellanox.com Cc: Eli Cohen e...@mellanox.com Cc: Haggai Eran hagg...@mellanox.com Signed-off-by: Yann Droneaud ydrone...@opteya.com --- drivers/infiniband/core/uverbs_cmd.c | 5 - 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/drivers/infiniband/core/uverbs_cmd.c b/drivers/infiniband/core/uverbs_cmd.c index 81d8b5aa2eb6..9520880a163f 100644 --- a/drivers/infiniband/core/uverbs_cmd.c +++ b/drivers/infiniband/core/uverbs_cmd.c @@ -3328,9 +3328,9 @@ int ib_uverbs_ex_query_device(struct ib_uverbs_file *file, if (err) return err; - memset(resp, 0, sizeof(resp)); copy_query_dev_fields(file, resp.base, attr); resp.comp_mask = 0; + resp.reserved = 0; if (ucore-outlen resp_len + sizeof(resp.odp_caps)) goto end; @@ -3343,6 +3343,9 @@ int ib_uverbs_ex_query_device(struct ib_uverbs_file *file, attr.odp_caps.per_transport_caps.uc_odp_caps; resp.odp_caps.per_transport_caps.ud_odp_caps = attr.odp_caps.per_transport_caps.ud_odp_caps; + resp.odp_caps.reserved = 0; +#else + memset(resp.odp_caps, 0, sizeof(resp.odp_caps)); #endif resp.comp_mask |= IB_USER_VERBS_EX_QUERY_DEVICE_ODP; resp_len += sizeof(resp.odp_caps); -- 2.1.0 -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v1 1/5] IB/uverbs: ex_query_device: answer must not depend on request's comp_mask
As specified in Extending Verbs API presentation [1] by Tzahi Oved during OFA International Developer Workshop 2013, the request's comp_mask should describe the request data: it's describe the availability of extended fields in the request. Conversely, the response's comp_mask should describe the presence of extended fields in the response. So this patch changes ib_uverbs_ex_query_device() function to always return the maximum supported features, currently only IB_USER_VERBS_EX_QUERY_DEVICE_ODP per commit 860f10a799c8 (IB/core: Add flags for on demand paging support), regardless of the value of request's comp_mask. [1] https://www.openfabrics.org/images/docs/2013_Dev_Workshop/Tues_0423/2013_Workshop_Tues_0830_Tzahi_Oved-verbs_extensions_ofa_2013-tzahio.pdf Link: http://mid.gmane.org/cover.1422553023.git.ydrone...@opteya.com Cc: Sagi Grimberg sa...@mellanox.com Cc: Shachar Raindel rain...@mellanox.com Cc: Eli Cohen e...@mellanox.com Cc: Haggai Eran hagg...@mellanox.com Signed-off-by: Yann Droneaud ydrone...@opteya.com --- drivers/infiniband/core/uverbs_cmd.c | 18 -- 1 file changed, 8 insertions(+), 10 deletions(-) diff --git a/drivers/infiniband/core/uverbs_cmd.c b/drivers/infiniband/core/uverbs_cmd.c index 532d8eba8b02..6ef06a9b4362 100644 --- a/drivers/infiniband/core/uverbs_cmd.c +++ b/drivers/infiniband/core/uverbs_cmd.c @@ -3324,17 +3324,15 @@ int ib_uverbs_ex_query_device(struct ib_uverbs_file *file, resp.comp_mask = 0; #ifdef CONFIG_INFINIBAND_ON_DEMAND_PAGING - if (cmd.comp_mask IB_USER_VERBS_EX_QUERY_DEVICE_ODP) { - resp.odp_caps.general_caps = attr.odp_caps.general_caps; - resp.odp_caps.per_transport_caps.rc_odp_caps = - attr.odp_caps.per_transport_caps.rc_odp_caps; - resp.odp_caps.per_transport_caps.uc_odp_caps = - attr.odp_caps.per_transport_caps.uc_odp_caps; - resp.odp_caps.per_transport_caps.ud_odp_caps = - attr.odp_caps.per_transport_caps.ud_odp_caps; - resp.comp_mask |= IB_USER_VERBS_EX_QUERY_DEVICE_ODP; - } + resp.odp_caps.general_caps = attr.odp_caps.general_caps; + resp.odp_caps.per_transport_caps.rc_odp_caps = + attr.odp_caps.per_transport_caps.rc_odp_caps; + resp.odp_caps.per_transport_caps.uc_odp_caps = + attr.odp_caps.per_transport_caps.uc_odp_caps; + resp.odp_caps.per_transport_caps.ud_odp_caps = + attr.odp_caps.per_transport_caps.ud_odp_caps; #endif + resp.comp_mask |= IB_USER_VERBS_EX_QUERY_DEVICE_ODP; err = ib_copy_to_udata(ucore, resp, sizeof(resp)); if (err) -- 2.1.0 -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v1 5/5] IB/core: ib_copy_to_udata(): don't silently truncate response
While ib_copy_to_udata() should check for the available output space as already proposed in some other patches [1][2][3], the changes brought by commit 5a77abf9a97a (IB/core: Add support for extended query device caps) are silently truncating the data to be written to userspace if the output buffer is not large enough to hold the response data. Silently truncating the response is not a reliable behavior as userspace is not given any hint about this truncation: userspace is leaved with garbage to play with. Not checking the response buffer size and writing past the userspace buffer is no good either, but it's the current behavior. So this patch revert the particular change on ib_copy_to_udata() as a better behavior is implemented in the upper level function ib_uverbs_ex_query_device(). [1] [PATCH 00/22] infiniband: improve userspace input check http://mid.gmane.org/cover.1376847403.git.ydrone...@opteya.com [2] [PATCH 03/22] infiniband: ib_copy_from_udata(): check input length http://mid.gmane.org/2bf102a41c51f61965ee09df827abe8fefb523a9.1376847403.git.ydrone...@opteya.com [3] [PATCH 04/22] infiniband: ib_copy_to_udata(): check output length http://mid.gmane.org/d27716a3a1c180f832d153a7402f65ea8a75b734.1376847403.git.ydrone...@opteya.com Link: http://mid.gmane.org/cover.1422553023.git.ydrone...@opteya.com Cc: Sagi Grimberg sa...@mellanox.com Cc: Shachar Raindel rain...@mellanox.com Cc: Eli Cohen e...@mellanox.com Cc: Haggai Eran hagg...@mellanox.com Signed-off-by: Yann Droneaud ydrone...@opteya.com --- include/rdma/ib_verbs.h | 5 + 1 file changed, 1 insertion(+), 4 deletions(-) diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h index 0d74f1de99aa..65994a19e840 100644 --- a/include/rdma/ib_verbs.h +++ b/include/rdma/ib_verbs.h @@ -1707,10 +1707,7 @@ static inline int ib_copy_from_udata(void *dest, struct ib_udata *udata, size_t static inline int ib_copy_to_udata(struct ib_udata *udata, void *src, size_t len) { - size_t copy_sz; - - copy_sz = min_t(size_t, len, udata-outlen); - return copy_to_user(udata-outbuf, src, copy_sz) ? -EFAULT : 0; + return copy_to_user(udata-outbuf, src, len) ? -EFAULT : 0; } /** -- 2.1.0 -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v1 3/5] IB/uverbs: ex_query_device: answer must depend on response buffer's size
As specified in Extending Verbs API presentation [1] by Tzahi Oved during OFA International Developer Workshop 2013, the request's comp_mask should describe the request data: it's describe the availability of extended fields in the request. Conversely, the response's comp_mask should describe the presence of extended fields in the response. So instead of silently truncating extended QUERY_DEVICE uverb's response, see commit 5a77abf9a97a (IB/core: Add support for extended query device caps)), this patch makes function ib_uverbs_ex_query_device() check the available space in the response buffer against the minimal response structure and fail with -ENOSPC if this base structure cannot be returned to userspace: it's required to be able to return the comp_mask's value, at least. For extended features, currently only IB_USER_VERBS_EX_QUERY_DEVICE_ODP per commit 860f10a799c8 (IB/core: Add flags for on demand paging support), the extension's data is returned if the needed space is available in the response buffer: it is expected that newer userspace program pass a bigger response buffer so that it can retrieve extended features. The comp_mask value will match the fields that were effectively returned to userspace. In the end: - userspace won't get truncated responses; - newer kernel would be able to support older binaries and older binaries will work flawlessly with newer kernel; - additionally, newer binaries would even be able to support older kernel, as far as they don't set new feature bit in request's comp_mask. Note: as offsetof() is used to retrieve the size of the lower chunk of the response, beware that it only works if the upper chunk is right after, without any implicit padding. And, as the size of the latter chunk is added to the base size, implicit padding at the end of the structure is not taken in account. Both point must be taken in account when extending the uverbs functionalities. [1] https://www.openfabrics.org/images/docs/2013_Dev_Workshop/Tues_0423/2013_Workshop_Tues_0830_Tzahi_Oved-verbs_extensions_ofa_2013-tzahio.pdf Link: http://mid.gmane.org/cover.1422553023.git.ydrone...@opteya.com Cc: Sagi Grimberg sa...@mellanox.com Cc: Shachar Raindel rain...@mellanox.com Cc: Eli Cohen e...@mellanox.com Cc: Haggai Eran hagg...@mellanox.com Signed-off-by: Yann Droneaud ydrone...@opteya.com --- drivers/infiniband/core/uverbs_cmd.c | 13 - 1 file changed, 12 insertions(+), 1 deletion(-) diff --git a/drivers/infiniband/core/uverbs_cmd.c b/drivers/infiniband/core/uverbs_cmd.c index fbcc54b86795..81d8b5aa2eb6 100644 --- a/drivers/infiniband/core/uverbs_cmd.c +++ b/drivers/infiniband/core/uverbs_cmd.c @@ -3302,6 +3302,7 @@ int ib_uverbs_ex_query_device(struct ib_uverbs_file *file, struct ib_uverbs_ex_query_device cmd; struct ib_device_attr attr; struct ib_device *device; + size_t resp_len; int err; device = file-device-ib_dev; @@ -3318,6 +3319,11 @@ int ib_uverbs_ex_query_device(struct ib_uverbs_file *file, if (cmd.reserved) return -EINVAL; + resp_len = offsetof(typeof(resp), odp_caps); + + if (ucore-outlen resp_len) + return -ENOSPC; + err = device-query_device(device, attr); if (err) return err; @@ -3326,6 +3332,9 @@ int ib_uverbs_ex_query_device(struct ib_uverbs_file *file, copy_query_dev_fields(file, resp.base, attr); resp.comp_mask = 0; + if (ucore-outlen resp_len + sizeof(resp.odp_caps)) + goto end; + #ifdef CONFIG_INFINIBAND_ON_DEMAND_PAGING resp.odp_caps.general_caps = attr.odp_caps.general_caps; resp.odp_caps.per_transport_caps.rc_odp_caps = @@ -3336,8 +3345,10 @@ int ib_uverbs_ex_query_device(struct ib_uverbs_file *file, attr.odp_caps.per_transport_caps.ud_odp_caps; #endif resp.comp_mask |= IB_USER_VERBS_EX_QUERY_DEVICE_ODP; + resp_len += sizeof(resp.odp_caps); - err = ib_copy_to_udata(ucore, resp, sizeof(resp)); +end: + err = ib_copy_to_udata(ucore, resp, resp_len); if (err) return err; -- 2.1.0 -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v1 2/5] IB/uverbs: ex_query_device: check request's comp_mask
This patch ensures the extended QUERY_DEVICE uverbs request's comp_mask has only known and supported bits (currently none). If userspace set unknown features bits, -EINVAL will be returned, ensuring current programs are not allowed to set random feature bits: such bits could enable new extended features in future kernel versions and those features can trigger a behavior not unsupported by the older programs or make the newer kernels return an error for a request which was valid on older kernels. Additionally, returning an error for unsupported feature would allow userspace to probe/discover which extended features are currently supported by a kernel. Link: http://mid.gmane.org/cover.1422553023.git.ydrone...@opteya.com Cc: Sagi Grimberg sa...@mellanox.com Cc: Shachar Raindel rain...@mellanox.com Cc: Eli Cohen e...@mellanox.com Cc: Haggai Eran hagg...@mellanox.com Signed-off-by: Yann Droneaud ydrone...@opteya.com --- drivers/infiniband/core/uverbs_cmd.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/drivers/infiniband/core/uverbs_cmd.c b/drivers/infiniband/core/uverbs_cmd.c index 6ef06a9b4362..fbcc54b86795 100644 --- a/drivers/infiniband/core/uverbs_cmd.c +++ b/drivers/infiniband/core/uverbs_cmd.c @@ -3312,6 +3312,9 @@ int ib_uverbs_ex_query_device(struct ib_uverbs_file *file, if (err) return err; + if (cmd.comp_mask) + return -EINVAL; + if (cmd.reserved) return -EINVAL; -- 2.1.0 -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v1 0/5] IB/core: extended query device caps cleanup for v3.19
Hi, Following discussions in thread [PATCH v3 06/17] IB/core: Add support for extended query device caps [1] and further comments on previous patch in [PATCH 3/4] IB/uverbs: ex_query_device: check request's comp_mask thread [2], I'm proposing an updated patchset to implement a slighly different behavior to handle the request parameters in ib_uverbs_ex_query_device() in order to restore the current behavior of ib_copy_to_udata(). I think we found an agreement on the scheme to be implemented in Haggai Eran's response [3]: - input's comp_mask is currently not used, so it must be 0; - the response buffer size is checked so that the base answer can be returned without being truncated; - the extended feature (odp) information are returned if, and only if, there's enough space in the response buffer, allowing older programs to get only the features they would expect. Newer programs are expected to provide more space, but as the response's comp_mask will describe only the available fields, newer program cannot be surprized to not get information when run on older kernel (if we want to support this use case). I feel even more confident this behavior would allow a better maintainability. Additionally, it's still looking more like the behavior already implemented by other InfiniBand/RDMA kernel - userspace interfaces. I hope this would go in v3.19 before its release so that the extended QUERY_DEVICE uverbs won't hurt us being part of an official release: in it's current shape it won't be easy to support it. Regards. Changes from v0 [4] - don't use input's comp_mask to conditionnaly build the response - ensure input's comp_mask is set 0 [1] http://mid.gmane.org/1418733236.2779.26.ca...@opteya.com [2] http://mid.gmane.org/063956366559d6919693aabb324bab83d676bc28.1421931555.git.ydrone...@opteya.com [3] http://mid.gmane.org/54c902e4.5010...@mellanox.com [4] http://mid.gmane.org/cover.1421931555.git.ydrone...@opteya.com Yann Droneaud (5): IB/uverbs: ex_query_device: answer must not depend on request's comp_mask IB/uverbs: ex_query_device: check request's comp_mask IB/uverbs: ex_query_device: answer must depend on response buffer's size IB/uverbs: ex_query_device: no need to clear the whole structure IB/core: ib_copy_to_udata(): don't silently truncate response drivers/infiniband/core/uverbs_cmd.c | 39 +--- include/rdma/ib_verbs.h | 5 + 2 files changed, 28 insertions(+), 16 deletions(-) -- 2.1.0 -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 3/4] IB/uverbs: ex_query_device: check request's comp_mask
Hi, Le jeudi 29 janvier 2015 à 11:09 -0700, Jason Gunthorpe a écrit : On Wed, Jan 28, 2015 at 02:19:11PM +0100, Yann Droneaud wrote: But the same program (either binary or source code) might fail on newer kernel where some bits in comp_mask gain a meaning not supported by the program. To clarify some of this: The intention of the comp_mask scheme was to define a growing structure. The invariant is: A bigger structure allways has all smaller structures (past versions) as a prefix. So the size alone is enough for user space to know what is going on. userspace only processes structure members up to the size returned by the kernel, and there is only one structure layout. Unfortunately, the userspace don't get the size of the returned data: it's only a single write() syscall after all. So the kernel is left with the comp_mask in the response to express the returned size. The purpose of the output comp_mask is to allow drivers to declare they do not support the new structure members, and comp_mask bits should only be used with new structure members do not have a natural 'null' value. It's not (yet) about drivers as the output's comp_mask (in the response), is set by uverbs layer. Do you think we have to enforce a 1 to 1 relation between the request comp_mask's bits and the response comp_mask's bits ? That would not fit with my understanding of Extending Verbs API presentation [1]: request's comp_mask describe the fields present in the request and response's comp_mask describe the fields present in the response. Further, we need to distinguish cases where additional data is mandatory or optional. query_device is clearly optional, the only purpose the input comp mask serves is to reduce expensive work in the driver by indicating that some result bits are not needed. That's not how I've understand the purpose of the request's comp_mask after reading the presentation: request's comp_mask describe the content of the request. Then that additional content can trigger the presence of additional fields in the response if needed. It is perfectly OK for the kernel to ignore the input comp mask, and OK for userspace to typically request all bits. (indeed, I think this is a pretty silly optimization myself, and the original patch that motivated this was restructured so it is not needed) It's not at all OK to ignore request's comp_mask: it has to be checked to find if there's a feature requested the kernel cannot fullfil, and any unknown bit is a possible feature request. So the kernel has to reject the request as a whole as it don't know how to process it. As we don't know yet how we would extend the extended QUERY_DEVICE uverbs, the kernel have to refuse any random value. http://blog.ffwll.ch/2013/11/botching-up-ioctls.html Other verbs must be more strict, if one side does not understand the comp mask then the verb must fail in some way. Presumably the valid comp mask values are inferred in some way (eg query_device says the extended function is supported) Everything should fit in one of those two general cases.. I believe every interface should return an error for any unknown value (every unused bits of a data structure), that is, there's only one case. Thanks for the link to the presentation. If I recall the presentation is old and had some flaws that were addressed before things made it into libibverbs.. I have to have a look to this part of libibverbs: I'm not sure the extended QUERY_DEVICE is already implemented. Thank you for looking at this, it is very important that this scheme is used properly, and it is very easy to make mistakes. I haven't had time to review any of these new patches myself. I hope you would find some time to review my latest patchset so that we don't miss a corner case. It's starting to become urgent. Regards. -- Yann Droneaud OPTEYA -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v1 1/5] IB/uverbs: ex_query_device: answer must not depend on request's comp_mask
Hi, Le jeudi 29 janvier 2015 à 11:28 -0700, Jason Gunthorpe a écrit : On Thu, Jan 29, 2015 at 06:59:58PM +0100, Yann Droneaud wrote: As specified in Extending Verbs API presentation [1] by Tzahi Oved during OFA International Developer Workshop 2013, the request's comp_mask should describe the request data: it's describe the availability of extended fields in the request. Conversely, the response's comp_mask should describe the presence of extended fields in the response. Roland: I agree with Yann, these patches need to go in, or the ODP patches reverted. #ifdef CONFIG_INFINIBAND_ON_DEMAND_PAGING - if (cmd.comp_mask IB_USER_VERBS_EX_QUERY_DEVICE_ODP) { Absolutely, a query output should never depend on the input comp_mask. We should keep this in mind then. - resp.odp_caps.general_caps = attr.odp_caps.general_caps; - resp.odp_caps.per_transport_caps.rc_odp_caps = - attr.odp_caps.per_transport_caps.rc_odp_caps; - resp.odp_caps.per_transport_caps.uc_odp_caps = - attr.odp_caps.per_transport_caps.uc_odp_caps; - resp.odp_caps.per_transport_caps.ud_odp_caps = - attr.odp_caps.per_transport_caps.ud_odp_caps; - resp.comp_mask |= IB_USER_VERBS_EX_QUERY_DEVICE_ODP; - } + resp.odp_caps.general_caps = attr.odp_caps.general_caps; + resp.odp_caps.per_transport_caps.rc_odp_caps = + attr.odp_caps.per_transport_caps.rc_odp_caps; + resp.odp_caps.per_transport_caps.uc_odp_caps = + attr.odp_caps.per_transport_caps.uc_odp_caps; + resp.odp_caps.per_transport_caps.ud_odp_caps = + attr.odp_caps.per_transport_caps.ud_odp_caps; #endif + resp.comp_mask |= IB_USER_VERBS_EX_QUERY_DEVICE_ODP; Not sure about this - if 0 is a valid null answer for all the _caps then it is fine, and the comp_mask bit should just be removed as the size alone should be enough. That's difficult to say. But I hope 0 are valids answers in this case. Anyway, the response's comp_mask is needed, as it's the sole way for the userspace to know the size of the response. See below. This looks wrong however: err = ib_copy_to_udata(ucore, resp, sizeof(resp)); if (err) return err; return 0; Why isn't this returning the filled structure size to userspace? This would seem to be very urgently wrong to me? Am I missing something? Patch 3 probably should gain: - return 0; + return resp_len; The write() syscall must return the size buffer passed to it, or less, but in such case it would ask for trouble as userspace would be allowed to write() the remaining bytes. Returning a size bigger than the one passed to write() is not acceptable and would break any expectation. This patch looks like an improvement to me: Reviewed-By: Jason Gunthorpe jguntho...@obsidianresearch.com Thanks a lot. Regards. -- Yann Droneaud OPTEYA -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v1 1/5] IB/uverbs: ex_query_device: answer must not depend on request's comp_mask
Hi, Le jeudi 29 janvier 2015 à 11:28 -0700, Jason Gunthorpe a écrit : On Thu, Jan 29, 2015 at 06:59:58PM +0100, Yann Droneaud wrote: As specified in Extending Verbs API presentation [1] by Tzahi Oved during OFA International Developer Workshop 2013, the request's comp_mask should describe the request data: it's describe the availability of extended fields in the request. Conversely, the response's comp_mask should describe the presence of extended fields in the response. Roland: I agree with Yann, these patches need to go in, or the ODP patches reverted. Reverting all On Demand Paging patches seems overkill: if something as to be reverted it should be commit 5a77abf9a97a (IB/core: Add support for extended query device caps) and the part of commit 860f10a799c8 (IB/core: Add flags for on demand paging support) which modify ib_uverbs_ex_query_device(). But I wonder about this part of commit 860f10a799c8: diff --git a/drivers/infiniband/core/uverbs_cmd.c b/drivers/infiniband/core/uverbs_cmd.c index c7a43624c96b..f9326ccda4b5 100644 --- a/drivers/infiniband/core/uverbs_cmd.c +++ b/drivers/infiniband/core/uverbs_cmd.c @@ -953,6 +953,18 @@ ssize_t ib_uverbs_reg_mr(struct ib_uverbs_file *file, goto err_free; } + if (cmd.access_flags IB_ACCESS_ON_DEMAND) { + struct ib_device_attr attr; + + ret = ib_query_device(pd-device, attr); + if (ret || !(attr.device_cap_flags + IB_DEVICE_ON_DEMAND_PAGING)) { + pr_debug(ODP support not available\n); + ret = -EINVAL; + goto err_put; + } + } + AFAICT (1 6) bit from struct ib_uverbs_reg_mr access_flags field was not enforced to be 0 previously, as ib_check_mr_access() only check for some coherency between a subset of the bits (it's not a function dedicated to check flags provided by userspace): include/rdma/ib_verbs.h: 1094 enum ib_access_flags { 1095 IB_ACCESS_LOCAL_WRITE = 1, 1096 IB_ACCESS_REMOTE_WRITE = (11), 1097 IB_ACCESS_REMOTE_READ = (12), 1098 IB_ACCESS_REMOTE_ATOMIC = (13), 1099 IB_ACCESS_MW_BIND = (14), 1100 IB_ZERO_BASED = (15), 1101 IB_ACCESS_ON_DEMAND = (16), 1102 }; drivers/infiniband/core/uverbs_cmd.c: ib_uverbs_reg_mr() 961 ret = ib_check_mr_access(cmd.access_flags); 962 if (ret) 963 return ret; include/rdma/ib_verbs.h: 2643 static inline int ib_check_mr_access(int flags) 2644 { 2645 /* 2646 * Local write permission is required if remote write or 2647 * remote atomic permission is also requested. 2648 */ 2649 if (flags (IB_ACCESS_REMOTE_ATOMIC | IB_ACCESS_REMOTE_WRITE) 2650 !(flags IB_ACCESS_LOCAL_WRITE)) 2651 return -EINVAL; 2652 2653 return 0; 2654 } drivers/infiniband/core/uverbs_cmd.c: ib_uverbs_reg_mr() 990 mr = pd-device-reg_user_mr(pd, cmd.start, cmd.length, cmd.hca_va, 991 cmd.access_flags, udata); reg_user_mr() functions may call ib_umem_get() and pass access_flags to it: drivers/infiniband/core/umem.c: ib_umem_get() 114 /* 115 * We ask for writable memory if any of the following 116 * access flags are set. Local write and remote write 117 * obviously require write access. Remote atomic can do 118 * things like fetch and add, which will modify memory, and 119 * MW bind can change permissions by binding a window. 120 */ 121 umem-writable = !!(access 122 (IB_ACCESS_LOCAL_WRITE | IB_ACCESS_REMOTE_WRITE | 123 IB_ACCESS_REMOTE_ATOMIC | IB_ACCESS_MW_BIND)); 124 125 if (access IB_ACCESS_ON_DEMAND) { 126 ret = ib_umem_odp_get(context, umem); 127 if (ret) { 128 kfree(umem); 129 return ERR_PTR(ret); 130 } 131 return umem; 132 } As you can see only a few bits in access_flags are checked in the end, so it may exist a very unlikely possibility that an existing userspace program is setting this IB_ACCESS_ON_DEMAND bit without the intention of enabling on demand paging as this would be unnoticed by older kernel. In the other hand, a newer program built with on-demand-paging in mind will set the bit, but when run on older kernel, it will be a no-op, allowing the program to continue, perhaps thinking on-demand-paging is available. That should be avoided as much as possible. Unfortunately, I think this cannot be fixed as it's was long since IB_ZERO_BASED was added by commit 7083e42ee2 (IB/core: Add type 2
Re: [PATCH v1 1/5] IB/uverbs: ex_query_device: answer must not depend on request's comp_mask
Le jeudi 29 janvier 2015 à 19:43 +0100, Yann Droneaud a écrit : Le jeudi 29 janvier 2015 à 11:28 -0700, Jason Gunthorpe a écrit : On Thu, Jan 29, 2015 at 06:59:58PM +0100, Yann Droneaud wrote: - resp.odp_caps.general_caps = attr.odp_caps.general_caps; - resp.odp_caps.per_transport_caps.rc_odp_caps = - attr.odp_caps.per_transport_caps.rc_odp_caps; - resp.odp_caps.per_transport_caps.uc_odp_caps = - attr.odp_caps.per_transport_caps.uc_odp_caps; - resp.odp_caps.per_transport_caps.ud_odp_caps = - attr.odp_caps.per_transport_caps.ud_odp_caps; - resp.comp_mask |= IB_USER_VERBS_EX_QUERY_DEVICE_ODP; - } + resp.odp_caps.general_caps = attr.odp_caps.general_caps; + resp.odp_caps.per_transport_caps.rc_odp_caps = + attr.odp_caps.per_transport_caps.rc_odp_caps; + resp.odp_caps.per_transport_caps.uc_odp_caps = + attr.odp_caps.per_transport_caps.uc_odp_caps; + resp.odp_caps.per_transport_caps.ud_odp_caps = + attr.odp_caps.per_transport_caps.ud_odp_caps; #endif + resp.comp_mask |= IB_USER_VERBS_EX_QUERY_DEVICE_ODP; Not sure about this - if 0 is a valid null answer for all the _caps then it is fine, and the comp_mask bit should just be removed as the size alone should be enough. That's difficult to say. But I hope 0 are valids answers in this case. Hum, according to include/rdma/ib_verbs.h, there's a bit set for ODP support: 148 enum ib_odp_general_cap_bits { 149 IB_ODP_SUPPORT = 1 0, 150 }; So it should be safe to set everything to 0 in struct ib_uverbs_odp_caps. Regards. -- Yann Droneaud OPTEYA -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v1 2/5] IB/uverbs: ex_query_device: check request's comp_mask
Hi, Le jeudi 29 janvier 2015 à 11:36 -0700, Jason Gunthorpe a écrit : On Thu, Jan 29, 2015 at 06:59:59PM +0100, Yann Droneaud wrote: This patch ensures the extended QUERY_DEVICE uverbs request's comp_mask has only known and supported bits (currently none). I think I would be happy to see the input comp_mask removed entirely. I can't see a possible use for input data to a QUERY command that wouldn't be better served by creating a new command But forcing the value to 0 seems reasonable as well. I cannot forsee the future, but having at least one unused bit available allow for any kind of yet unknown extension: having this bit/these bits checked now, permit fixing mistake later. So let it be. Reviewed-By: Jason Gunthorpe jguntho...@obsidianresearch.com Thanks. Also, the _ex varients were supposed to be supersets of the base call, so it is wrong that query_device_ex doesn't return all the same data as query_device, layed out so that the original response structure is a prefix of the extended response structure. The other _ex calls in verbs were designed that way so it will be surprising to the user that this one is different. It seems to me it has the layout you're expecting: 224 struct ib_uverbs_ex_query_device_resp { 225 struct ib_uverbs_query_device_resp base; 226 __u32 comp_mask; 227 __u32 reserved; 228 struct ib_uverbs_odp_caps odp_caps; 229 }; Regards. -- Yann Droneaud OPTEYA -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v1 3/5] IB/uverbs: ex_query_device: answer must depend on response buffer's size
Hi, Le jeudi 29 janvier 2015 à 11:38 -0700, Jason Gunthorpe a écrit : On Thu, Jan 29, 2015 at 07:00:00PM +0100, Yann Droneaud wrote: As specified in Extending Verbs API presentation [1] by Tzahi Oved during OFA International Developer Workshop 2013, the request's comp_mask should describe the request data: it's describe the availability of extended fields in the request. Conversely, the response's comp_mask should describe the presence of extended fields in the response. This makes sense to me as well. - err = ib_copy_to_udata(ucore, resp, sizeof(resp)); +end: + err = ib_copy_to_udata(ucore, resp, resp_len); if (err) return err; I think resp_len should be returned, not 0? As said previously it's not possible to use the effective size of the response as the return value of the write() syscall: the syscall must return the size of the input buffer, not the size of the output buffer, otherwise it would break the semantics of the write() syscall. Regards. -- Yann Droneaud OPTEYA -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v1 1/5] IB/uverbs: ex_query_device: answer must not depend on request's comp_mask
Hi, Le jeudi 29 janvier 2015 à 12:18 -0700, Jason Gunthorpe a écrit : On Thu, Jan 29, 2015 at 07:43:29PM +0100, Yann Droneaud wrote: The write() syscall must return the size buffer passed to it, or less, but in such case it would ask for trouble as userspace would be allowed to write() the remaining bytes. Returning a size bigger than the one passed to write() is not acceptable and would break any expectation. By that logic the 0 return is still wrong, and it should be ucore-in_len This is handled by ib_uverbs_write() in drivers/infiniband/core/uverbs_main.c: 709 if (err) 710 return err; 711 712 return written_count; But I think we can return less without risking anything breaking, it already violates the invariant associated with write() - it mutates the buffer passed in! I don't think so, as only the response buffer is written to and the response buffer pointer is provided in the buffer given to write(). AFAIK, no uverbs ever change the content of the input buffer (eg. the request): I've managed to declare the various input buffers const so it would surprising to find it use for writing to userspace. Anyway, I recognize that uverb way of abusing write() syscall is borderline (at best) regarding other Linux subsystems and Unix paradigm in general. But it's not enough to screw it more. Regards. -- Yann Droneaud OPTEYA -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 3/4] IB/uverbs: ex_query_device: check request's comp_mask
Le jeudi 29 janvier 2015 à 12:14 -0700, Jason Gunthorpe a écrit : On Thu, Jan 29, 2015 at 07:35:14PM +0100, Yann Droneaud wrote: Unfortunately, the userspace don't get the size of the returned data: it's only a single write() syscall after all. A write syscall that behaves nothing like write() actually should, so I don't see why we can't have resp_len = write(fd,inout_buf,sizeof(input_len)); Returning 0 from write makes no sense at all. 0 is not the result of the write() syscall, as for extended uverbs I've ensure the return value of the write() syscall was the size it was given. See http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/drivers/infiniband/core/uverbs_main.c?id=v3.19-rc6#n599 705 err = uverbs_ex_cmd_table[command](file, 706ucore, 707uhw); 708 709 if (err) 710 return err; 711 712 return written_count; See commit f21519b23c1 (IB/core: extended command: an improved infrastructure for uverbs commands) http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=f21519b23c1b6fa25366be4114ccf7fcf1c190f9 In the fullness of your patchset it will maintain the invariant that resp_len = sizeof(input_len) I don't catch your point: the response can be bigger than the request. Which seems OK to me considering what we have to work with, and a significantly better choice than 0. So the kernel is left with the comp_mask in the response to express the returned size. It was never the intent that the size should be computed from comp_mask. If the size is necessary it must be explicit. In this instance if the size is not returned then libibverbs will have to zero the entire user buffer before passing it to the kernel, because it has to ensure any tail for the user app is 0'd. The proposed patch ensure the integrity of the response regarding comp_mask: if a bit is set in response's comp_mask that means the related fields are presents (and valid). So before parsing the response fields, userspace have to check response's comp_mask: fields access must be protected by correct check on comp_mask ... but it might be useful for the userspace developer to clear the response buffer just in case he/she decided to be lazy with the check. Remember for santity we want comp mask bits for things that can't be 0 For me, it's better if a bit is set in response's comp_mask by the kernel when the kernel have written something in the related fields even if the those fields are all 0. and we want 0 for things that are not set. struct ib_query_device_ex res; ibv_query_device_ex(..,res,sizeof(res)); printf(%u,res.foo_cap); // 0 if unsupported is OK if (res.comp_mask COMP_BAR) printf(%u,res.bar_thingy); // 0 has meaning, must use COMP_BAR Ideally we want to minimize the number of COMP bits because it is a huge burden on the end user to work with them. Sure. So I think comp_mask is just an overly complicated way of expressing the version and the size of the response. The purpose of the output comp_mask is to allow drivers to declare they do not support the new structure members, and comp_mask bits should only be used with new structure members do not have a natural 'null' value. It's not (yet) about drivers as the output's comp_mask (in the response), is set by uverbs layer. Do you think we have to enforce a 1 to 1 relation between the request comp_mask's bits and the response comp_mask's bits ? In the query context I would be happy enough to just treat the in bound buffer as uninitialized buffer space, but certainly generally speaking the comp_mask+size should refer to the structure - so input/output are not directly related. OK. Further, we need to distinguish cases where additional data is mandatory or optional. query_device is clearly optional, the only purpose the input comp mask serves is to reduce expensive work in the driver by indicating that some result bits are not needed. That's not how I've understand the purpose of the request's comp_mask after reading the presentation: request's comp_mask describe the content of the request. Then that additional content can trigger the presence of additional fields in the response if needed. Agreed - what I described was an abuse that some early Mellanox patches for a query_ex included and it was just a shortcut to avoid defining a dedicated input structure. It seems that same scheme was copied into these new patches. OK It is perfectly OK for the kernel to ignore the input comp mask, and OK for userspace to typically request all bits. (indeed, I think this is a pretty silly optimization myself, and the original patch that motivated this was restructured so it is not needed
Re: [PATCH] ib_ipoib: Scatter-Gather support in connected mode
= netdev_priv(dev); if (test_bit(IPOIB_FLAG_ADMIN_CM, priv-flags)) - features = ~(NETIF_F_SG | NETIF_F_IP_CSUM | NETIF_F_TSO); + features = ~(NETIF_F_IP_CSUM | NETIF_F_TSO); return features; } @@ -233,7 +233,6 @@ int ipoib_set_mode(struct net_device *dev, const char *buf) will cause multicast packet drops\n); netdev_update_features(dev); rtnl_unlock(); - priv-tx_wr.send_flags = ~IB_SEND_IP_CSUM; Sure about this removal ? I don't see it related to scatter-gather. ipoib_flush_paths(dev); rtnl_lock(); Regards. -- Yann Droneaud OPTEYA -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 3/4] IB/uverbs: ex_query_device: check request's comp_mask
Hi, Le mardi 27 janvier 2015 à 08:50 +0200, Haggai Eran a écrit : On 26/01/2015 13:17, Yann Droneaud wrote: ... Le dimanche 25 janvier 2015 à 17:23 +0200, Haggai Eran a écrit : On 22/01/2015 15:28, Yann Droneaud wrote: This patch ensures the extended QUERY_DEVICE uverbs request's comp_mask has only known values. If userspace returns unknown features, -EINVAL will be returned, allowing to probe/discover which features are currently supported by the kernel. This probing process will be much more cumbersome than it needs to be because userspace will have to call QUERY_DEVICE repetitively with different comp_mask fields until it finds out the exact set of supported bits. O(log2(N)) I don't think user space developers will be happy having to do trial and error to determine what features the kernel driver supports. It might be even more then O(log2(N)). If my understanding of comp_mask bits usage is correct it would O(N). But it's not the time complexity I'm worried about, it's the fact that it requires user-space developers to go through hoops in order to get information that can be much more easily exported. But there's currently *NO* such mean that could give a hint to userspace developer whether one bit of request's comp_mask is currently effective in extended QUERY_DEVICE uverbs. While we have such mean in CREATE_FLOW and DESTROY_FLOW: unsupported bits trigger -EINVAL. In QUERY_DEVICE, userspace developer is allowed to set request's comp_mask to -1: it won't hurt it on current kernel, so why not using this value or any other random value ? The program will run: it's Works For Me. But the same program (either binary or source code) might fail on newer kernel where some bits in comp_mask gain a meaning not supported by the program. Or you had to add a different interface, dedicated to retrieve the exact supported feature mask. Moreover, it also ensure the requested features set in comp_mask are sequentially set, not skipping intermediate features, eg. the highest requested feature also request all the lower ones. This way each new feature will have to be stacked on top of the existing ones: this is especially important for the request and response data structures where fields are added after the current ones when expanded to support a new feature. I think it is perfectly acceptable that not all drivers will implement all available features, and so you shouldn't enforce this requirement. With regard to QUERY_DEVICE: the data structure layout depends on the comp_mask value. So either you propose a way to express multipart data structure (see CMSG or NETLINK), or we have to ensure the data structure is ever-growing, with each new chunck stacked over the existing ones: that's the purpose of : if (cmd.comp_mask (cmd.comp_mask + 1)) return -EINVAL; Also, it makes the comp_mask nothing more than a wasteful version number between 0 and 63. That's what I've already explained earlier in Re: [PATCH v3 06/17] IB/core: Add support for extended query device caps: http://mid.gmane.org/1421844612.13543.40.ca...@opteya.com Yes, you wrote there: Regarding comp_mask (not for this particular verb): It's not clear whether request's comp_mask describe the request or the response, as such I'm puzzled. How would the kernel and the userspace be able to parse the request and the response if they ignore unknown bits ? How would they be able to skip the unrecognized chunk of the request and response buffer ? How one bit in a comp_mask is related to a chunk in the request or response ? It's likely the kernel or userspace would have to skip the remaining comp_mask's bits after encountering an unknown bit as the size of the corresponding chunk in request or response would be unknown, making impossible to locate the corresponding chunk for the next bit set in comp_mask. Having said that, comp_mask is just a complicated way of expressing a version, which is turn describe a size (ever growing). It is my understanding that each comp_mask bit marks a set of fields in the command or in the response struct as valid, so the struct format remains the same and the kernel and userspace don't need to make difficult calculation as to where each field is, but you can still pass a high bit set in comp_mask with one of the lower bits cleared. How can the struct format remain the same, if some fields are added to implement newer feature ? I couldn't find this explicit detail in the mailing list, but I did found a presentation that was presented in OFA International Developer Workshop 2013 [1], that gave an example of of an verb where each comp_mask bit marked a different field as valid. Thanks for the link to the presentation. As I read it the presentation: - in request, comp_mask give hint to the kernel of additional fields in the request
Re: [PATCH 3/4] IB/uverbs: ex_query_device: check request's comp_mask
Hi, (adding linux-api@ for comments: We're introducing a new uverb in the InfiniBand subsystem: extended QUERY_DEVICE in v3.19: http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/drivers/infiniband/core/uverbs_cmd.c?id=v3.19-rc6#n3297 http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/include/uapi/rdma/ib_user_verbs.h?id=v3.19-rc6#n209 http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/include/uapi/rdma/ib_user_verbs.h?id=v3.19-rc6#n224 http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=5a77abf9a97a http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=860f10a799c8 As you may not already know, InfiniBand subsystem use a write() syscall[1] to issue ioctl()-like operations. Many operations (aka verbs) are available [2], for each there's a query data structures and for some there's a response data structure [3]. As a result to a write() operation, kernel could allocate resources on the task behalf and/or write data back to userspace provided buffers whose pointers were part of buffer passed to write(). I've expressed concern on the way the new extended QUERY_DEVICE was trying to be itself expandable: by silently ignoring shorter buffer, returning truncated data, the interface seems awkward. Re: [PATCH v2 06/17] IB/core: Add support for extended query device caps http://mid.gmane.org/1418216676.1.45.ca...@opteya.com Re: [PATCH v3 06/17] IB/core: Add support for extended query device caps http://mid.gmane.org/1418733236.2779.26.ca...@opteya.com http://mid.gmane.org/1418824811.3334.3.camel@dworkin http://mid.gmane.org/1421844612.13543.40.ca...@opteya.com I recognize the author's intention to provide a way for userspace to retrieve easily the highest supported information as something desirable. But I believe we must be more strict on the request content and fail for any unrecognized, unsupported, incorrect bits to make safer to extend the interface latter. I've submitted a patchset[4] to address these issues. But, while I'm convinced it the way to go, I'm not able to find how it could be made to satisfy the original author expectations. I hope linux-api@ readers could provide some insight regarding the way we should implement such interface [1] http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/drivers/infiniband/core/uverbs_main.c?id=v3.19-rc6#n599 [2] http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/drivers/infiniband/core/uverbs_main.c?id=v3.19-rc6#n81 [3] http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/include/uapi/rdma/ib_user_verbs.h?id=v3.19-rc6 [4] http://mid.gmane.org/cover.1421931555.git.ydrone...@opteya.com ) Le dimanche 25 janvier 2015 à 17:23 +0200, Haggai Eran a écrit : On 22/01/2015 15:28, Yann Droneaud wrote: This patch ensures the extended QUERY_DEVICE uverbs request's comp_mask has only known values. If userspace returns unknown features, -EINVAL will be returned, allowing to probe/discover which features are currently supported by the kernel. This probing process will be much more cumbersome than it needs to be because userspace will have to call QUERY_DEVICE repetitively with different comp_mask fields until it finds out the exact set of supported bits. O(log2(N)) Or you had to add a different interface, dedicated to retrieve the exact supported feature mask. Moreover, it also ensure the requested features set in comp_mask are sequentially set, not skipping intermediate features, eg. the highest requested feature also request all the lower ones. This way each new feature will have to be stacked on top of the existing ones: this is especially important for the request and response data structures where fields are added after the current ones when expanded to support a new feature. I think it is perfectly acceptable that not all drivers will implement all available features, and so you shouldn't enforce this requirement. With regard to QUERY_DEVICE: the data structure layout depends on the comp_mask value. So either you propose a way to express multipart data structure (see CMSG or NETLINK), or we have to ensure the data structure is ever-growing, with each new chunck stacked over the existing ones: that's the purpose of : if (cmd.comp_mask (cmd.comp_mask + 1)) return -EINVAL; Also, it makes the comp_mask nothing more than a wasteful version number between 0 and 63. That's what I've already explained earlier in Re: [PATCH v3 06/17] IB/core: Add support for extended query device caps: http://mid.gmane.org/1421844612.13543.40.ca...@opteya.com In the specific case of QUERY_DEVICE you might argue that there isn't any need for input comp_mask, only for output, and then you may enforce the input comp_mask will always be zero. The extended QUERY_DEVICE uverbs as currently merged is using comp_mask from input to choose
Re: [PATCH 1/4] IB/uverbs: ex_query_device: fail if output buffer size does not match
Hi, Le dimanche 25 janvier 2015 à 17:29 +0200, Haggai Eran a écrit : On 22/01/2015 15:28, Yann Droneaud wrote: Instead of silently truncating extended QUERY_DEVICE uverb's response, see commit 5a77abf9a97a (IB/core: Add support for extended query device caps)), this patch makes function ib_uverbs_ex_query_device() check the available space in the response buffer against the requested features set in comp_mask (currently only IB_USER_VERBS_EX_QUERY_DEVICE_ODP per commit 860f10a799c8 (IB/core: Add flags for on demand paging support). Currently you only need to check two possible sizes. With each additional field you will need to check another size. Yes, that's mandatory for any interface with a dynamic, ever growing, data structure. Sooner or later someone will forget to add a check and will create a non-backward compatible kernel. That's why we have to review userspace interface patch with lot of care. Anyway, such a change should be fixed easily as it's would be definitively a bug. If the behavior I'm proposing is in place, userspace won't be allowed to supply unsupported bits in a request, so older userspace requests on newer kernels won't trigger unsupported new behavior. This ensure we can extend the interfance with no risk of breaking existing program. If the response buffer is not large enough to store the expected response, -ENOSPC is returned to userspace so that it can adjust the size of its buffer. Note: as offsetof() is used to retrieve the size of the lower chunk of the response, beware that it only works if the upper chunk is right after, without any implicit padding. And, as the size of the latter chunk is added to the base size, implicit padding at the end of the structure is not taken in account. Both point must be taken in account when extending the uverbs functionalities. Another point future contributors will easily miss. One should use pahole tool when modifying data structure part of the ABI to ensure the data structure are correctly aligned on various architectures. Link: http://mid.gmane.org/cover.1421931555.git.ydrone...@opteya.com Cc: Sagi Grimberg sa...@mellanox.com Cc: Shachar Raindel rain...@mellanox.com Cc: Eli Cohen e...@mellanox.com Cc: Haggai Eran hagg...@mellanox.com Signed-off-by: Yann Droneaud ydrone...@opteya.com --- drivers/infiniband/core/uverbs_cmd.c | 13 - 1 file changed, 12 insertions(+), 1 deletion(-) diff --git a/drivers/infiniband/core/uverbs_cmd.c b/drivers/infiniband/core/uverbs_cmd.c index 532d8eba8b02..8668b328b7e6 100644 --- a/drivers/infiniband/core/uverbs_cmd.c +++ b/drivers/infiniband/core/uverbs_cmd.c @@ -3302,6 +3302,7 @@ int ib_uverbs_ex_query_device(struct ib_uverbs_file *file, struct ib_uverbs_ex_query_device cmd; struct ib_device_attr attr; struct ib_device *device; + size_t resp_len; int err; device = file-device-ib_dev; @@ -3315,6 +3316,11 @@ int ib_uverbs_ex_query_device(struct ib_uverbs_file *file, if (cmd.reserved) return -EINVAL; + resp_len = offsetof(typeof(resp), odp_caps); + + if (ucore-outlen resp_len) + return -ENOSPC; + err = device-query_device(device, attr); if (err) return err; @@ -3325,6 +3331,11 @@ int ib_uverbs_ex_query_device(struct ib_uverbs_file *file, #ifdef CONFIG_INFINIBAND_ON_DEMAND_PAGING if (cmd.comp_mask IB_USER_VERBS_EX_QUERY_DEVICE_ODP) { + resp_len += sizeof(resp.odp_caps); + + if (ucore-outlen resp_len) + return -ENOSPC; + resp.odp_caps.general_caps = attr.odp_caps.general_caps; resp.odp_caps.per_transport_caps.rc_odp_caps = attr.odp_caps.per_transport_caps.rc_odp_caps; @@ -3336,7 +3347,7 @@ int ib_uverbs_ex_query_device(struct ib_uverbs_file *file, } #endif - err = ib_copy_to_udata(ucore, resp, sizeof(resp)); + err = ib_copy_to_udata(ucore, resp, resp_len); if (err) return err; Regards. -- Yann Droneaud OPTEYA -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 2/4] IB/core: ib_copy_to_udata(): don't silently truncate response
While ib_copy_to_udata() should check for the available output space as already proposed in some other patches [1][2][3], the changes brought by commit 5a77abf9a97a (IB/core: Add support for extended query device caps) are silently truncating the data to be written to userspace if the output buffer is not large enough to hold the response data. Silently truncating the response is not a reliable behavior as userspace is not given any hint about this truncation: userspace is leaved with garbage to play with. Not checking the response buffer size and writing past the userspace buffer is no good either, but it's the current behavior. So this patch revert the particular change on ib_copy_to_udata() as a better behavior is implemented in the upper level function ib_uverbs_ex_query_device(). [1] [PATCH 00/22] infiniband: improve userspace input check http://mid.gmane.org/cover.1376847403.git.ydrone...@opteya.com [2] [PATCH 03/22] infiniband: ib_copy_from_udata(): check input length http://mid.gmane.org/2bf102a41c51f61965ee09df827abe8fefb523a9.1376847403.git.ydrone...@opteya.com [3] [PATCH 04/22] infiniband: ib_copy_to_udata(): check output length http://mid.gmane.org/d27716a3a1c180f832d153a7402f65ea8a75b734.1376847403.git.ydrone...@opteya.com Link: http://mid.gmane.org/cover.1421931555.git.ydrone...@opteya.com Cc: Sagi Grimberg sa...@mellanox.com Cc: Shachar Raindel rain...@mellanox.com Cc: Eli Cohen e...@mellanox.com Cc: Haggai Eran hagg...@mellanox.com Signed-off-by: Yann Droneaud ydrone...@opteya.com --- include/rdma/ib_verbs.h | 5 + 1 file changed, 1 insertion(+), 4 deletions(-) diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h index 0d74f1de99aa..65994a19e840 100644 --- a/include/rdma/ib_verbs.h +++ b/include/rdma/ib_verbs.h @@ -1707,10 +1707,7 @@ static inline int ib_copy_from_udata(void *dest, struct ib_udata *udata, size_t static inline int ib_copy_to_udata(struct ib_udata *udata, void *src, size_t len) { - size_t copy_sz; - - copy_sz = min_t(size_t, len, udata-outlen); - return copy_to_user(udata-outbuf, src, copy_sz) ? -EFAULT : 0; + return copy_to_user(udata-outbuf, src, len) ? -EFAULT : 0; } /** -- 2.1.0 -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 3/4] IB/uverbs: ex_query_device: check request's comp_mask
This patch ensures the extended QUERY_DEVICE uverbs request's comp_mask has only known values. If userspace returns unknown features, -EINVAL will be returned, allowing to probe/discover which features are currently supported by the kernel. Moreover, it also ensure the requested features set in comp_mask are sequentially set, not skipping intermediate features, eg. the highest requested feature also request all the lower ones. This way each new feature will have to be stacked on top of the existing ones: this is especially important for the request and response data structures where fields are added after the current ones when expanded to support a new feature. Link: http://mid.gmane.org/cover.1421931555.git.ydrone...@opteya.com Cc: Sagi Grimberg sa...@mellanox.com Cc: Shachar Raindel rain...@mellanox.com Cc: Eli Cohen e...@mellanox.com Cc: Haggai Eran hagg...@mellanox.com Signed-off-by: Yann Droneaud ydrone...@opteya.com --- drivers/infiniband/core/uverbs_cmd.c | 6 ++ 1 file changed, 6 insertions(+) diff --git a/drivers/infiniband/core/uverbs_cmd.c b/drivers/infiniband/core/uverbs_cmd.c index 8668b328b7e6..80a1c90f1dbf 100644 --- a/drivers/infiniband/core/uverbs_cmd.c +++ b/drivers/infiniband/core/uverbs_cmd.c @@ -3313,6 +3313,12 @@ int ib_uverbs_ex_query_device(struct ib_uverbs_file *file, if (err) return err; + if (cmd.comp_mask (cmd.comp_mask + 1)) + return -EINVAL; + + if (cmd.comp_mask ~(__u32)IB_USER_VERBS_EX_QUERY_DEVICE_ODP) + return -EINVAL; + if (cmd.reserved) return -EINVAL; -- 2.1.0 -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 4/4] IB/uverbs: ex_query_device: no need to clear the whole structure
As only the requested fields are set and copied to userspace, there's no need to clear the content of the response structure beforehand. Link: http://mid.gmane.org/cover.1421931555.git.ydrone...@opteya.com Cc: Sagi Grimberg sa...@mellanox.com Cc: Shachar Raindel rain...@mellanox.com Cc: Eli Cohen e...@mellanox.com Cc: Haggai Eran hagg...@mellanox.com Signed-off-by: Yann Droneaud ydrone...@opteya.com --- drivers/infiniband/core/uverbs_cmd.c | 5 - 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/drivers/infiniband/core/uverbs_cmd.c b/drivers/infiniband/core/uverbs_cmd.c index 80a1c90f1dbf..bedcd8499c9c 100644 --- a/drivers/infiniband/core/uverbs_cmd.c +++ b/drivers/infiniband/core/uverbs_cmd.c @@ -3331,9 +3331,9 @@ int ib_uverbs_ex_query_device(struct ib_uverbs_file *file, if (err) return err; - memset(resp, 0, sizeof(resp)); copy_query_dev_fields(file, resp.base, attr); resp.comp_mask = 0; + resp.reserved = 0; #ifdef CONFIG_INFINIBAND_ON_DEMAND_PAGING if (cmd.comp_mask IB_USER_VERBS_EX_QUERY_DEVICE_ODP) { @@ -3349,6 +3349,9 @@ int ib_uverbs_ex_query_device(struct ib_uverbs_file *file, attr.odp_caps.per_transport_caps.uc_odp_caps; resp.odp_caps.per_transport_caps.ud_odp_caps = attr.odp_caps.per_transport_caps.ud_odp_caps; + + resp.odp_caps.reserved = 0; + resp.comp_mask |= IB_USER_VERBS_EX_QUERY_DEVICE_ODP; } #endif -- 2.1.0 -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 1/4] IB/uverbs: ex_query_device: fail if output buffer size does not match
Instead of silently truncating extended QUERY_DEVICE uverb's response, see commit 5a77abf9a97a (IB/core: Add support for extended query device caps)), this patch makes function ib_uverbs_ex_query_device() check the available space in the response buffer against the requested features set in comp_mask (currently only IB_USER_VERBS_EX_QUERY_DEVICE_ODP per commit 860f10a799c8 (IB/core: Add flags for on demand paging support). If the response buffer is not large enough to store the expected response, -ENOSPC is returned to userspace so that it can adjust the size of its buffer. Note: as offsetof() is used to retrieve the size of the lower chunk of the response, beware that it only works if the upper chunk is right after, without any implicit padding. And, as the size of the latter chunk is added to the base size, implicit padding at the end of the structure is not taken in account. Both point must be taken in account when extending the uverbs functionalities. Link: http://mid.gmane.org/cover.1421931555.git.ydrone...@opteya.com Cc: Sagi Grimberg sa...@mellanox.com Cc: Shachar Raindel rain...@mellanox.com Cc: Eli Cohen e...@mellanox.com Cc: Haggai Eran hagg...@mellanox.com Signed-off-by: Yann Droneaud ydrone...@opteya.com --- drivers/infiniband/core/uverbs_cmd.c | 13 - 1 file changed, 12 insertions(+), 1 deletion(-) diff --git a/drivers/infiniband/core/uverbs_cmd.c b/drivers/infiniband/core/uverbs_cmd.c index 532d8eba8b02..8668b328b7e6 100644 --- a/drivers/infiniband/core/uverbs_cmd.c +++ b/drivers/infiniband/core/uverbs_cmd.c @@ -3302,6 +3302,7 @@ int ib_uverbs_ex_query_device(struct ib_uverbs_file *file, struct ib_uverbs_ex_query_device cmd; struct ib_device_attr attr; struct ib_device *device; + size_t resp_len; int err; device = file-device-ib_dev; @@ -3315,6 +3316,11 @@ int ib_uverbs_ex_query_device(struct ib_uverbs_file *file, if (cmd.reserved) return -EINVAL; + resp_len = offsetof(typeof(resp), odp_caps); + + if (ucore-outlen resp_len) + return -ENOSPC; + err = device-query_device(device, attr); if (err) return err; @@ -3325,6 +3331,11 @@ int ib_uverbs_ex_query_device(struct ib_uverbs_file *file, #ifdef CONFIG_INFINIBAND_ON_DEMAND_PAGING if (cmd.comp_mask IB_USER_VERBS_EX_QUERY_DEVICE_ODP) { + resp_len += sizeof(resp.odp_caps); + + if (ucore-outlen resp_len) + return -ENOSPC; + resp.odp_caps.general_caps = attr.odp_caps.general_caps; resp.odp_caps.per_transport_caps.rc_odp_caps = attr.odp_caps.per_transport_caps.rc_odp_caps; @@ -3336,7 +3347,7 @@ int ib_uverbs_ex_query_device(struct ib_uverbs_file *file, } #endif - err = ib_copy_to_udata(ucore, resp, sizeof(resp)); + err = ib_copy_to_udata(ucore, resp, resp_len); if (err) return err; -- 2.1.0 -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 0/4] IB/core: extended query device caps cleanup for v3.19
Hi, Following discussions in thread [PATCH v3 06/17] IB/core: Add support for extended query device caps [1], I'm proposing the following patchset to implement a different behavior to handle the request parameters in ib_uverbs_ex_query_device() in order to restore the current behavior of ib_copy_to_udata(). The behavior is different from the initial one but I feel more confident it would allow a better maintainability. Additionally, I believe it's looking more like the behavior already implemented by other InfiniBand/RDMA kernel - userspace interfaces. [1] http://mid.gmane.org/1418733236.2779.26.ca...@opteya.com Yann Droneaud (4): IB/uverbs: ex_query_device: fail if output buffer size does not match IB/core: ib_copy_to_udata(): don't silently truncate response IB/uverbs: ex_query_device: check request's comp_mask IB/uverbs: ex_query_device: no need to clear the whole structure drivers/infiniband/core/uverbs_cmd.c | 24 ++-- include/rdma/ib_verbs.h | 5 + 2 files changed, 23 insertions(+), 6 deletions(-) -- 2.1.0 -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v3 06/17] IB/core: Add support for extended query device caps
Hi, Le mardi 16 décembre 2014 à 21:14 +0100, Yann Droneaud a écrit : Le mardi 16 décembre 2014 à 22:07 +0200, Or Gerlitz a écrit : On Tue, Dec 16, 2014 at 7:41 PM, Roland Dreier rol...@kernel.org wrote: On Tue, Dec 16, 2014 at 4:33 AM, Yann Droneaud ydrone...@opteya.com wrote: With the suggested change here, buffer overflow won't happen, but the error is silently ignored, allowing uverb to return a partial result, which is likely not expected by userspace as it's a bit difficult to handle it gracefully. So this has to be removed, and a check on userspace response buffer must be added to ib_uverbs_ex_query_device() instead. I'm not sure of the specifics of the change you're suggesting here. Would it be OK to go forward with the patch set we have, and then fix this issue before 3.19-rc2? Roland, Haggai will address the change in an incremental patch against your for-next (3.19-rc1) so the fix will be ready on time for 3.19-rc2 AFAICT, no fix has been posted on the list so far. Do you have a patch to address the issue ready ? Regards. -- Yann Droneaud OPTEYA -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v3 06/17] IB/core: Add support for extended query device caps
Hi, Le mercredi 17 décembre 2014 à 16:12 +0200, Haggai Eran a écrit : On 17/12/2014 16:00, Yann Droneaud wrote: Le mercredi 17 décembre 2014 à 08:54 +0200, Haggai Eran a écrit : On 16/12/2014 14:33, Yann Droneaud wrote: Le jeudi 11 décembre 2014 à 17:04 +0200, Haggai Eran a écrit : static inline int ib_copy_to_udata(struct ib_udata *udata, void *src, size_t len) { -return copy_to_user(udata-outbuf, src, len) ? -EFAULT : 0; +size_t copy_sz; + +copy_sz = min_t(size_t, len, udata-outlen); +return copy_to_user(udata-outbuf, src, copy_sz) ? -EFAULT : 0; } This is not the place to do this: as I'm guessing the purpose of this change from the patch in '[PATCH v3 07/17] IB/core: Add flags for on demand paging support', you're trying to handle uverbs call from a userspace program using a previous, shorter ABI. Yes, that was my intention. But that's hidding bug where userspace will get it wrong at passing the correct buffer / size for all others uverb calls. That cannot work that way. In a previous patchset [1], I've suggested to add a check in ib_copy_{from,to}_udata()[2][3] in order to check the input/output buffer size to not read/write past userspace provided buffer boundaries: in case of mismatch an error would be returned to userspace. With the suggested change here, buffer overflow won't happen, but the error is silently ignored, allowing uverb to return a partial result, which is likely not expected by userspace as it's a bit difficult to handle it gracefully. So this has to be removed, and a check on userspace response buffer must be added to ib_uverbs_ex_query_device() instead. I agree that we shouldn't silently ignore bugs in userspace, but I'm not sure the alternative is maintainable. If we have in the future N new extensions to this verb, will we need to validate the user space given output buffer is one of the N possible sizes? Yes. It would very easy for someone to forget one of the possible sizes, and thus blocking support for an older version of libibverbs. Such a bug would be hard to detect because it requires testing all previous versions of libibverbs. I think the problem you are trying to solve - userspace accidentally setting a smaller response size then required - will be detected immediately when one attempts to use that code. Additionnaly the size should be checked related to the flags set in the comp_mask: eg. requiring IB_USER_VERBS_EX_QUERY_DEVICE_ODP but not providing the expected response buffer should be an error. In a query verb like EX_QUERY_DEVICE, I would expect the user-space code to request all bits in the comp_mask, since there's very little benefit from requesting a specific set (only a slightly shorter response for the system call). The kernel would ignore bits it doesn't know, and the user-space would ignore bits it doesn't know in the response. Regarding comp_mask (not for this particular verb): It's not clear whether request's comp_mask describe the request or the response, as such I'm puzzled. How would the kernel and the userspace be able to parse the request and the response if they ignore unknown bits ? How would they be able to skip the unrecognized chunk of the request and response buffer ? How one bit in a comp_mask is related to a chunk in the request or response ? It's likely the kernel or userspace would have to skip the remaining comp_mask's bits after encountering an unknown bit as the size of the corresponding chunk in request or response would be unknown, making impossible to locate the corresponding chunk for the next bit set in comp_mask. Having said that, comp_mask is just a complicated way of expressing a version, which is turn describe a size (ever growing). Anyway, I wanted to point to the perf subsystem and tool: perf subsystem use a version in it's request (ok, it's a size): http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/include/uapi/linux/perf_event.h?id=v3.19-rc5#n255 It's checked in perf_copy_attr(): http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/kernel/events/core.c?id=v3.19-rc5#n7072 As you can see, any unknown value passed to the kernel is rejected with -EINVAL. Which is the way to follow, with respect to http://blog.ffwll.ch/2013/11/botching-up-ioctls.html rules. Then perf userspace tool is able to probe the maximum supported features: See __perf_evsel__open() http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/tools/perf/util/evsel.c?id=v3.19-rc5#n1038 Such complicated construct could be used in userspace to check what's the highest supported comp_mask is. I don't think it would solve our issue here, but this is a piece to be reviewed. By the way, we have to find how to handle the issue as soon as possible, because we can't let this thing working this way for v3.19. Regards. -- Yann Droneaud OPTEYA
Re: [PATCH v3 07/17] IB/core: Add flags for on demand paging support
; + resp.comp_mask |= IB_USER_VERBS_EX_QUERY_DEVICE_ODP; + } You need to clear the tail of the response otherwise, kernel will leak stack content to userspace: + #else /* !CONFIG_INFINIBAND_ON_DEMAND_PAGING */ + resp.odp_caps.general_caps = 0; + resp.odp_caps.per_transport_caps.rc_odp_caps = 0; + resp.odp_caps.per_transport_caps.uc_odp_caps = 0; + resp.odp_caps.per_transport_caps.ud_odp_caps = 0; +#endif + + resp.odp_caps.reserved = 0 err = ib_copy_to_udata(ucore, resp, sizeof(resp)); if (err) return err; diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h index 97a999f9e4d8..a41bc5a39ebf 100644 --- a/include/rdma/ib_verbs.h +++ b/include/rdma/ib_verbs.h @@ -123,7 +123,8 @@ enum ib_device_cap_flags { IB_DEVICE_MEM_WINDOW_TYPE_2A= (123), IB_DEVICE_MEM_WINDOW_TYPE_2B= (124), IB_DEVICE_MANAGED_FLOW_STEERING = (129), - IB_DEVICE_SIGNATURE_HANDOVER= (130) + IB_DEVICE_SIGNATURE_HANDOVER= (130), + IB_DEVICE_ON_DEMAND_PAGING = (131), }; enum ib_signature_prot_cap { @@ -143,6 +144,27 @@ enum ib_atomic_cap { IB_ATOMIC_GLOB }; +enum ib_odp_general_cap_bits { + IB_ODP_SUPPORT = 1 0, +}; + +enum ib_odp_transport_cap_bits { + IB_ODP_SUPPORT_SEND = 1 0, + IB_ODP_SUPPORT_RECV = 1 1, + IB_ODP_SUPPORT_WRITE= 1 2, + IB_ODP_SUPPORT_READ = 1 3, + IB_ODP_SUPPORT_ATOMIC = 1 4, +}; + +struct ib_odp_caps { + uint64_t general_caps; + struct { + uint32_t rc_odp_caps; + uint32_t uc_odp_caps; + uint32_t ud_odp_caps; + } per_transport_caps; +}; + struct ib_device_attr { u64 fw_ver; __be64 sys_image_guid; @@ -186,6 +208,7 @@ struct ib_device_attr { u8 local_ca_ack_delay; int sig_prot_cap; int sig_guard_cap; + struct ib_odp_caps odp_caps; }; enum ib_mtu { @@ -1073,7 +1096,8 @@ enum ib_access_flags { IB_ACCESS_REMOTE_READ = (12), IB_ACCESS_REMOTE_ATOMIC = (13), IB_ACCESS_MW_BIND = (14), - IB_ZERO_BASED = (15) + IB_ZERO_BASED = (15), + IB_ACCESS_ON_DEMAND = (16), }; struct ib_phys_buf { diff --git a/include/uapi/rdma/ib_user_verbs.h b/include/uapi/rdma/ib_user_verbs.h index e8a96071e352..4275b961bf60 100644 --- a/include/uapi/rdma/ib_user_verbs.h +++ b/include/uapi/rdma/ib_user_verbs.h @@ -202,15 +202,30 @@ struct ib_uverbs_query_device_resp { __u8 reserved[4]; }; +enum { + IB_USER_VERBS_EX_QUERY_DEVICE_ODP = 1ULL 0, +}; + struct ib_uverbs_ex_query_device { __u32 comp_mask; __u32 reserved; }; +struct ib_uverbs_odp_caps { + __u64 general_caps; + struct { + __u32 rc_odp_caps; + __u32 uc_odp_caps; + __u32 ud_odp_caps; + } per_transport_caps; + __u32 reserved; +}; + struct ib_uverbs_ex_query_device_resp { struct ib_uverbs_query_device_resp base; __u32 comp_mask; __u32 reserved; + struct ib_uverbs_odp_caps odp_caps; }; Hopefully, no kernel was released with ib_uverbs_ex_query_device_resp without odp_caps (eg. in between '[PATCH v3 06/17] IB/core: Add support for extended query device caps' and this one, or ib_uverbs_ex_query_device() should have been modified to handle shorter ib_uverbs_ex_query_device_resp to accomodate the ABI variations. struct ib_uverbs_query_port { Regards. -- Yann Droneaud -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v3 07/17] IB/core: Add flags for on demand paging support
Le mardi 16 décembre 2014 à 13:02 +0100, Yann Droneaud a écrit : Le jeudi 11 décembre 2014 à 17:04 +0200, Haggai Eran a écrit : From: Sagi Grimberg sa...@mellanox.com * Add a configuration option for enable on-demand paging support in the infiniband subsystem (CONFIG_INFINIBAND_ON_DEMAND_PAGING). In a later patch, this configuration option will select the MMU_NOTIFIER configuration option to enable mmu notifiers. * Add a flag for on demand paging (ODP) support in the IB device capabilities. * Add a flag to request ODP MR in the access flags to reg_mr. * Fail registrations done with the ODP flag when the low-level driver doesn't support this. * Change the conditions in which an MR will be writable to explicitly specify the access flags. This is to avoid making an MR writable just because it is an ODP MR. * Add a ODP capabilities to the extended query device verb. Signed-off-by: Sagi Grimberg sa...@mellanox.com Signed-off-by: Shachar Raindel rain...@mellanox.com Signed-off-by: Haggai Eran hagg...@mellanox.com --- drivers/infiniband/Kconfig | 10 ++ drivers/infiniband/core/umem.c | 8 +--- drivers/infiniband/core/uverbs_cmd.c | 25 + include/rdma/ib_verbs.h | 28 ++-- include/uapi/rdma/ib_user_verbs.h| 15 +++ 5 files changed, 81 insertions(+), 5 deletions(-) diff --git a/drivers/infiniband/Kconfig b/drivers/infiniband/Kconfig index 77089399359b..089a2c2af329 100644 --- a/drivers/infiniband/Kconfig +++ b/drivers/infiniband/Kconfig @@ -38,6 +38,16 @@ config INFINIBAND_USER_MEM depends on INFINIBAND_USER_ACCESS != n default y +config INFINIBAND_ON_DEMAND_PAGING + bool InfiniBand on-demand paging support + depends on INFINIBAND_USER_MEM + default y + ---help--- + On demand paging support for the InfiniBand subsystem. + Together with driver support this allows registration of + memory regions without pinning their pages, fetching the + pages on demand instead. + config INFINIBAND_ADDR_TRANS bool depends on INFINIBAND diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c index 6f152628e0d2..c328e4693d14 100644 --- a/drivers/infiniband/core/umem.c +++ b/drivers/infiniband/core/umem.c @@ -107,13 +107,15 @@ struct ib_umem *ib_umem_get(struct ib_ucontext *context, unsigned long addr, umem-page_size = PAGE_SIZE; umem-pid = get_task_pid(current, PIDTYPE_PID); /* -* We ask for writable memory if any access flags other than -* remote read are set. Local write and remote write +* We ask for writable memory if any of the following +* access flags are set. Local write and remote write * obviously require write access. Remote atomic can do * things like fetch and add, which will modify memory, and * MW bind can change permissions by binding a window. */ - umem-writable = !!(access ~IB_ACCESS_REMOTE_READ); + umem-writable = !!(access + (IB_ACCESS_LOCAL_WRITE | IB_ACCESS_REMOTE_WRITE | +IB_ACCESS_REMOTE_ATOMIC | IB_ACCESS_MW_BIND)); /* We assume the memory is from hugetlb until proved otherwise */ umem-hugetlb = 1; diff --git a/drivers/infiniband/core/uverbs_cmd.c b/drivers/infiniband/core/uverbs_cmd.c index c7a43624c96b..f9326ccda4b5 100644 --- a/drivers/infiniband/core/uverbs_cmd.c +++ b/drivers/infiniband/core/uverbs_cmd.c @@ -953,6 +953,18 @@ ssize_t ib_uverbs_reg_mr(struct ib_uverbs_file *file, goto err_free; } + if (cmd.access_flags IB_ACCESS_ON_DEMAND) { + struct ib_device_attr attr; + + ret = ib_query_device(pd-device, attr); + if (ret || !(attr.device_cap_flags + IB_DEVICE_ON_DEMAND_PAGING)) { + pr_debug(ODP support not available\n); + ret = -EINVAL; + goto err_put; + } + } + mr = pd-device-reg_user_mr(pd, cmd.start, cmd.length, cmd.hca_va, cmd.access_flags, udata); if (IS_ERR(mr)) { @@ -3289,6 +3301,19 @@ int ib_uverbs_ex_query_device(struct ib_uverbs_file *file, copy_query_dev_fields(file, resp.base, attr); resp.comp_mask = 0; +#ifdef CONFIG_INFINIBAND_ON_DEMAND_PAGING + if (cmd.comp_mask IB_USER_VERBS_EX_QUERY_DEVICE_ODP) { + resp.odp_caps.general_caps = attr.odp_caps.general_caps; + resp.odp_caps.per_transport_caps.rc_odp_caps = + attr.odp_caps.per_transport_caps.rc_odp_caps; + resp.odp_caps.per_transport_caps.uc_odp_caps = + attr.odp_caps.per_transport_caps.uc_odp_caps; + resp.odp_caps.per_transport_caps.ud_odp_caps
Re: [PATCH v3 06/17] IB/core: Add support for extended query device caps
uverb to return a partial result, which is likely not expected by userspace as it's a bit difficult to handle it gracefully. So this has to be removed, and a check on userspace response buffer must be added to ib_uverbs_ex_query_device() instead. [1] [PATCH 00/22] infiniband: improve userspace input check http://marc.info/?i=cover.1376847403.git.ydrone...@opteya.com http://mid.gmane.org/cover.1376847403.git.ydrone...@opteya.com [2] [PATCH 03/22] infiniband: ib_copy_from_udata(): check input length http://mid.gmane.org/2bf102a41c51f61965ee09df827abe8fefb523a9.1376847403.git.ydrone...@opteya.com [3] [PATCH 04/22] infiniband: ib_copy_to_udata(): check output length http://mid.gmane.org/d27716a3a1c180f832d153a7402f65ea8a75b734.1376847403.git.ydrone...@opteya.com Regards. -- Yann Droneaud OPTEYA -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v3 06/17] IB/core: Add support for extended query device caps
Hi, Le mardi 16 décembre 2014 à 22:07 +0200, Or Gerlitz a écrit : On Tue, Dec 16, 2014 at 7:41 PM, Roland Dreier rol...@kernel.org wrote: On Tue, Dec 16, 2014 at 4:33 AM, Yann Droneaud ydrone...@opteya.com wrote: With the suggested change here, buffer overflow won't happen, but the error is silently ignored, allowing uverb to return a partial result, which is likely not expected by userspace as it's a bit difficult to handle it gracefully. So this has to be removed, and a check on userspace response buffer must be added to ib_uverbs_ex_query_device() instead. I'm not sure of the specifics of the change you're suggesting here. Would it be OK to go forward with the patch set we have, and then fix this issue before 3.19-rc2? Roland, Haggai will address the change in an incremental patch against your for-next (3.19-rc1) so the fix will be ready on time for 3.19-rc2 That's would be great. Thanks -- Yann Droneaud OPTEYA -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2 04/17] IB/core: Add umem function to read data from user-space
Le jeudi 11 décembre 2014 à 13:09 +0200, Haggai Eran a écrit : On 10/12/2014 18:22, Yann Droneaud wrote: Hi, Le mardi 11 novembre 2014 à 18:36 +0200, Haggai Eran a écrit : In some drivers there's a need to read data from a user space area that was pinned using ib_umem, when running from a different process context. The ib_umem_copy_from function allows reading data from the physical pages pinned in the ib_umem struct. Signed-off-by: Haggai Eran hagg...@mellanox.com --- drivers/infiniband/core/umem.c | 26 ++ include/rdma/ib_umem.h | 2 ++ 2 files changed, 28 insertions(+) diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c index e0f883292374..77bec75963e7 100644 --- a/drivers/infiniband/core/umem.c +++ b/drivers/infiniband/core/umem.c @@ -292,3 +292,29 @@ int ib_umem_page_count(struct ib_umem *umem) return n; } EXPORT_SYMBOL(ib_umem_page_count); + +/* + * Copy from the given ib_umem's pages to the given buffer. + * + * umem - the umem to copy from + * offset - offset to start copying from + * dst - destination buffer + * length - buffer length + * + * Returns the number of copied bytes, or an error code. + */ +int ib_umem_copy_from(struct ib_umem *umem, size_t offset, void *dst, +size_t length) I would prefer the arguments in the same order as ib_copy_from_udata() int ib_umem_copy_from(void *dst, struct ib_umem *umem, size_t umem_offset, size_t length); No problem. +{ + size_t end = offset + length; + + if (offset umem-length || end umem-length || end offset) { + pr_err(ib_umem_copy_from not in range. offset: %zd umem length: %zd end: %zd\n, + offset, umem-length, end); + return -EINVAL; + } + I think the test could be rewritten as: if (offset umem-length || length umem-length - offset) That's one operation less. + return sg_pcopy_to_buffer(umem-sg_head.sgl, umem-nmap, dst, length, + offset + ib_umem_offset(umem)); +} +EXPORT_SYMBOL(ib_umem_copy_from); As the function return a int, no more than INT_MAX bytes (likely 2^31 - 1) can be copied. Perhaps changing the return type to to ssize_t would be better (and a check to enfore ssize_t maximum value). Or change the function could return 0 in case of success or a error code, just like ib_copy_from_udata(). Okay. I'll change it to match ib_copy_from_udata. We're checking the umem size in the call site of this function anyway, and the only reason I see sg_pcopy_to_buffer would return less than *length* bytes is when reaching the end of the scatterlist. As the length is compared against umem-length (+ offset), this would mean umem-length is not synchronized with the length of the data described by the scatter/gather list ? BTW, ib_copy_from_udata() is defined as an inline function. Would it be better to have ib_umem_copy_from() being an inline function too ? (In such case, I would remove the error message to not duplicate it across all modules using the function) Regards. -- Yann Droneaud OPTEYA -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2 04/17] IB/core: Add umem function to read data from user-space
Hi Or, Le jeudi 11 décembre 2014 à 15:09 +0200, Or Gerlitz a écrit : On Thu, Dec 11, 2014 at 2:39 PM, Yann Droneaud ydrone...@opteya.com wrote: BTW, ib_copy_from_udata() is defined as an inline function. Would it be better to have ib_umem_copy_from() being an inline function too ? (In such case, I would remove the error message to not duplicate it across all modules using the function) Yann, Lets leave your 2nd comment to be addressed as future cleanup/improvement. Really, these patches are on the list for 7-8 moths and we were asking for feedback over all this time, jumping now with inlining comments and such when the merge window is just across the door isn't appropriate. Sure, I've no right and no power to delay or NACK anything: I'm just lazily following the list, and my involvement in InfiniBand/RDMA community is slowly dropping to 0, being involved in other (paid) projects, with some spike from time to time, waiting to find time to drain my patch queue. Unfortunately I'm not on the same schedule. Anyway, everyone is free to provide comments at anytime. And you're free to take them in account. It's a fact of kernel development: sometime a patchset take many months and multiple revision before being applied. Regards. -- Yann Droneaud OPTEYA -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2 06/17] IB/core: Add support for extended query device caps
Hi Roland, Le mardi 09 décembre 2014 à 13:43 -0800, Roland Dreier a écrit : I was getting ready to apply the ODP series, but then I noticed: On Tue, Nov 11, 2014 at 8:36 AM, Haggai Eran hagg...@mellanox.com wrote: diff --git a/drivers/infiniband/core/uverbs_main.c b/drivers/infiniband/core/uverbs_main.c index 71ab83fde472..974025028790 100644 --- a/drivers/infiniband/core/uverbs_main.c +++ b/drivers/infiniband/core/uverbs_main.c @@ -122,7 +122,8 @@ static int (*uverbs_ex_cmd_table[])(struct ib_uverbs_file *file, struct ib_udata *ucore, struct ib_udata *uhw) = { [IB_USER_VERBS_EX_CMD_CREATE_FLOW] = ib_uverbs_ex_create_flow, - [IB_USER_VERBS_EX_CMD_DESTROY_FLOW] = ib_uverbs_ex_destroy_flow + [IB_USER_VERBS_EX_CMD_DESTROY_FLOW] = ib_uverbs_ex_destroy_flow, + [IB_USER_VERBS_EX_CMD_QUERY_DEVICE] = ib_uverbs_ex_query_device }; static void ib_uverbs_add_one(struct ib_device *device); diff --git a/include/uapi/rdma/ib_user_verbs.h b/include/uapi/rdma/ib_user_verbs.h index 26daf55ff76e..ed8c3d9da42c 100644 --- a/include/uapi/rdma/ib_user_verbs.h +++ b/include/uapi/rdma/ib_user_verbs.h @@ -90,8 +90,9 @@ enum { }; enum { + IB_USER_VERBS_EX_CMD_QUERY_DEVICE = IB_USER_VERBS_CMD_QUERY_DEVICE, IB_USER_VERBS_EX_CMD_CREATE_FLOW = IB_USER_VERBS_CMD_THRESHOLD, - IB_USER_VERBS_EX_CMD_DESTROY_FLOW + IB_USER_VERBS_EX_CMD_DESTROY_FLOW, }; And this makes no sense to me. I thought the whole point of EX commands was to add then after IB_USER_VERBS_CMD_THRESHOLD. In this case, if you make IB_USER_VERBS_EX_CMD_QUERY_DEVICE = IB_USER_VERBS_CMD_QUERY_DEVICE then doesn't the entry in uverbs_cmd_table[] for normal query device get overwritten with ib_uverbs_ex_query_device()?? IB_USER_VERBS_CMD_THRESHOLD was introduced as part of commit 400dbc96583f ('IB/core: Infrastructure for extensible uverbs commands'). When I've 'upgraded' it to more extensible scheme, in commit f21519b23c1b ('IB/core: extended command: an improved infrastructure for uverbs commands'), I've put the extended commands in a different namespace, with the idea we could later convert non-extended commands to extended ones and phase out the older ABI in the future. I seems that what I've envisioned is starting to happen. Regards. -- Yann Droneaud OPTEYA -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2 06/17] IB/core: Add support for extended query device caps
a check for it being 0. +}; + +struct ib_uverbs_ex_query_device_resp { + struct ib_uverbs_query_device_resp base; + __u32 comp_mask; +}; + _ex response buffer are supposed to be aligned on 64bit boundary: you should probably add padding at the end of the structure and ensure it's cleared before send it to userspace. See commit f21519b23c1b ('IB/core: extended command: an improved infrastructure for uverbs commands'). Regards. -- Yann Droneaud -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2 04/17] IB/core: Add umem function to read data from user-space
Hi, Le mardi 11 novembre 2014 à 18:36 +0200, Haggai Eran a écrit : In some drivers there's a need to read data from a user space area that was pinned using ib_umem, when running from a different process context. The ib_umem_copy_from function allows reading data from the physical pages pinned in the ib_umem struct. Signed-off-by: Haggai Eran hagg...@mellanox.com --- drivers/infiniband/core/umem.c | 26 ++ include/rdma/ib_umem.h | 2 ++ 2 files changed, 28 insertions(+) diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c index e0f883292374..77bec75963e7 100644 --- a/drivers/infiniband/core/umem.c +++ b/drivers/infiniband/core/umem.c @@ -292,3 +292,29 @@ int ib_umem_page_count(struct ib_umem *umem) return n; } EXPORT_SYMBOL(ib_umem_page_count); + +/* + * Copy from the given ib_umem's pages to the given buffer. + * + * umem - the umem to copy from + * offset - offset to start copying from + * dst - destination buffer + * length - buffer length + * + * Returns the number of copied bytes, or an error code. + */ +int ib_umem_copy_from(struct ib_umem *umem, size_t offset, void *dst, + size_t length) I would prefer the arguments in the same order as ib_copy_from_udata() int ib_umem_copy_from(void *dst, struct ib_umem *umem, size_t umem_offset, size_t length); +{ + size_t end = offset + length; + + if (offset umem-length || end umem-length || end offset) { + pr_err(ib_umem_copy_from not in range. offset: %zd umem length: %zd end: %zd\n, +offset, umem-length, end); + return -EINVAL; + } + + return sg_pcopy_to_buffer(umem-sg_head.sgl, umem-nmap, dst, length, + offset + ib_umem_offset(umem)); +} +EXPORT_SYMBOL(ib_umem_copy_from); As the function return a int, no more than INT_MAX bytes (likely 2^31 - 1) can be copied. Perhaps changing the return type to to ssize_t would be better (and a check to enfore ssize_t maximum value). Or change the function could return 0 in case of success or a error code, just like ib_copy_from_udata(). diff --git a/include/rdma/ib_umem.h b/include/rdma/ib_umem.h index 7ed6d4ff58dc..ee897724cbf8 100644 --- a/include/rdma/ib_umem.h +++ b/include/rdma/ib_umem.h @@ -84,6 +84,8 @@ struct ib_umem *ib_umem_get(struct ib_ucontext *context, unsigned long addr, size_t size, int access, int dmasync); void ib_umem_release(struct ib_umem *umem); int ib_umem_page_count(struct ib_umem *umem); +int ib_umem_copy_from(struct ib_umem *umem, size_t start, void *dst, + size_t length); #else /* CONFIG_INFINIBAND_USER_MEM */ -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH libibverbs 6/9] ibv_cmd_open_qp(): initialize reserved field in struct ibv_open_qp
Properly initialize data structure exchanged with kernel uverbs layer: - may remove valgrind spurious warnings; - may allow to use the field later for future expansion. Link: http://marc.info/?i=cover.1412163687.git.ydrone...@opteya.com Fixes: 5f5292bbc51fb ('Add ibv_open_qp() for XRC receive QPs') Cc: Sean Hefty sean.he...@intel.com Cc: Yishai Hadas yish...@mellanox.com Signed-off-by: Yann Droneaud ydrone...@opteya.com --- src/cmd.c | 1 + 1 file changed, 1 insertion(+) diff --git a/src/cmd.c b/src/cmd.c index f80ce4d30832..f9e4a03fd84d 100644 --- a/src/cmd.c +++ b/src/cmd.c @@ -785,6 +785,7 @@ int ibv_cmd_open_qp(struct ibv_context *context, struct verbs_qp *qp, cmd-pd_handle = xrcd-handle; cmd-qpn = attr-qp_num; cmd-qp_type = attr-qp_type; + memset(cmd-reserved, 0, sizeof(cmd-reserved)); if (write(context-cmd_fd, cmd, cmd_size) != cmd_size) return errno; -- 1.9.3 -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH libibverbs 3/9] ibv_cmd_post_send(): initialize reserved field in struct ibv_kern_send_wr
Properly initialize data structure exchanged with kernel uverbs layer: - may remove valgrind spurious warnings; - may allow to use the field later for future expansion. Link: http://marc.info/?i=cover.1412163687.git.ydrone...@opteya.com Fixes: eb0663777c24 ('Add support for new datapath kernel commands') Signed-off-by: Yann Droneaud ydrone...@opteya.com --- src/cmd.c | 10 ++ 1 file changed, 10 insertions(+) diff --git a/src/cmd.c b/src/cmd.c index d057494260ee..8d8f7e8294e5 100644 --- a/src/cmd.c +++ b/src/cmd.c @@ -999,6 +999,9 @@ int ibv_cmd_post_send(struct ibv_qp *ibqp, struct ibv_send_wr *wr, tmp-wr.ud.ah = i-wr.ud.ah-handle; tmp-wr.ud.remote_qpn = i-wr.ud.remote_qpn; tmp-wr.ud.remote_qkey = i-wr.ud.remote_qkey; + memset(tmp-wr.ud.reserved, 0, + sizeof(tmp-wr.ud.reserved) + + (sizeof(tmp-wr) - sizeof(tmp-wr.ud))); } else { switch (i-opcode) { case IBV_WR_RDMA_WRITE: @@ -1007,6 +1010,9 @@ int ibv_cmd_post_send(struct ibv_qp *ibqp, struct ibv_send_wr *wr, tmp-wr.rdma.remote_addr = i-wr.rdma.remote_addr; tmp-wr.rdma.rkey = i-wr.rdma.rkey; + memset(tmp-wr.rdma.reserved, 0, + sizeof(tmp-wr.rdma.reserved) + + (sizeof(tmp-wr) - sizeof(tmp-wr.rdma))); break; case IBV_WR_ATOMIC_CMP_AND_SWP: case IBV_WR_ATOMIC_FETCH_AND_ADD: @@ -1016,8 +1022,12 @@ int ibv_cmd_post_send(struct ibv_qp *ibqp, struct ibv_send_wr *wr, i-wr.atomic.compare_add; tmp-wr.atomic.swap = i-wr.atomic.swap; tmp-wr.atomic.rkey = i-wr.atomic.rkey; + memset(tmp-wr.atomic.reserved, 0, + sizeof(tmp-wr.atomic.reserved) + + (sizeof(tmp-wr) - sizeof(tmp-wr.atomic))); break; default: + memset(tmp-wr, 0, sizeof(tmp-wr)); break; } } -- 1.9.3 -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH libibverbs 1/9] ibv_cmd_create_ah(): initialize reserved fields in various structures
Reserved fields in struct ibv_kern_global_route, struct ibv_kern_ah_attr and struct ibv_create_ah must be properly initialized when exchanged with kernel uverbs layer, as: - it may remove valgrind spurious warnings; - it may allow to use the field later for future expansion. Link: http://marc.info/?i=cover.1412163687.git.ydrone...@opteya.com Fixes: eb0663777c24 ('Add support for new datapath kernel commands') Signed-off-by: Yann Droneaud ydrone...@opteya.com --- src/cmd.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/src/cmd.c b/src/cmd.c index 45ea06ff4705..b281082f2f4d 100644 --- a/src/cmd.c +++ b/src/cmd.c @@ -1179,16 +1179,19 @@ int ibv_cmd_create_ah(struct ibv_pd *pd, struct ibv_ah *ah, IBV_INIT_CMD_RESP(cmd, sizeof cmd, CREATE_AH, resp, sizeof resp); cmd.user_handle= (uintptr_t) ah; cmd.pd_handle = pd-handle; + cmd.reserved = 0; cmd.attr.dlid = attr-dlid; cmd.attr.sl= attr-sl; cmd.attr.src_path_bits = attr-src_path_bits; cmd.attr.static_rate = attr-static_rate; cmd.attr.is_global = attr-is_global; cmd.attr.port_num = attr-port_num; + cmd.attr.reserved = 0; cmd.attr.grh.flow_label= attr-grh.flow_label; cmd.attr.grh.sgid_index= attr-grh.sgid_index; cmd.attr.grh.hop_limit = attr-grh.hop_limit; cmd.attr.grh.traffic_class = attr-grh.traffic_class; + cmd.attr.grh.reserved = 0; memcpy(cmd.attr.grh.dgid, attr-grh.dgid.raw, 16); if (write(pd-context-cmd_fd, cmd, sizeof cmd) != sizeof cmd) -- 1.9.3 -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH libibverbs 4/9] ibv_cmd_create_srq_ex(): set reserved field in struct ibv_create_xsrq
Properly initialize data structure exchanged with kernel uverbs layer: - may remove valgrind spurious warnings; - may allow to use the field later for future expansion. Link: http://marc.info/?i=cover.1412163687.git.ydrone...@opteya.com Fixes: 40c1365b2198 ('Add support for XRC SRQs') Cc: Sean Hefty sean.he...@intel.com Cc: Yishai Hadas yish...@mellanox.com Signed-off-by: Yann Droneaud ydrone...@opteya.com --- src/cmd.c | 1 + 1 file changed, 1 insertion(+) diff --git a/src/cmd.c b/src/cmd.c index 8d8f7e8294e5..f80ce4d30832 100644 --- a/src/cmd.c +++ b/src/cmd.c @@ -463,6 +463,7 @@ int ibv_cmd_create_srq_ex(struct ibv_context *context, cmd-max_wr = attr_ex-attr.max_wr; cmd-max_sge = attr_ex-attr.max_sge; cmd-srq_limit = attr_ex-attr.srq_limit; + cmd-reserved= 0; cmd-srq_type = (attr_ex-comp_mask IBV_SRQ_INIT_ATTR_TYPE) ? attr_ex-srq_type : IBV_SRQT_BASIC; -- 1.9.3 -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH libibverbs 9/9] ibv_cmd_destroy_flow(): initialize comp_mask explicitly
There's no need to clear the whole struct ibv_destroy_flow with memset(): only comp_mask field is not explicitly initialized in the function. This patch removes call to memset() and initialize comp_mask to 0. Link: http://marc.info/?i=cover.1412163687.git.ydrone...@opteya.com Fixes: 389de6a6ef4e ('Add receive flow steering support') Cc: Hadar Hen Zion had...@mellanox.com Cc: Or Gerlitz ogerl...@mellanox.com Cc: Matan Barak mat...@mellanox.com Signed-off-by: Yann Droneaud ydrone...@opteya.com --- src/cmd.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/src/cmd.c b/src/cmd.c index 2f039f6852bb..d849e6dac214 100644 --- a/src/cmd.c +++ b/src/cmd.c @@ -1384,8 +1384,8 @@ int ibv_cmd_destroy_flow(struct ibv_flow *flow_id) struct ibv_destroy_flow cmd; int ret = 0; - memset(cmd, 0, sizeof(cmd)); IBV_INIT_CMD_EX(cmd, sizeof(cmd), DESTROY_FLOW); + cmd.comp_mask = 0; cmd.flow_handle = flow_id-handle; if (write(flow_id-context-cmd_fd, cmd, sizeof(cmd)) != sizeof(cmd)) -- 1.9.3 -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH libibverbs 8/9] ibv_cmd_create_flow(): initialize flow_id-comp_mask
ibv_cmd_create_flow() allocates and returns a struct ibv_flow, but doesn't initialize its comp_mask field. Link: http://marc.info/?i=cover.1412163687.git.ydrone...@opteya.com Fixes: 389de6a6ef4 ('Add receive flow steering support') Cc: Hadar Hen Zion had...@mellanox.com Cc: Or Gerlitz ogerl...@mellanox.com Cc: Matan Barak mat...@mellanox.com Signed-off-by: Yann Droneaud ydrone...@opteya.com --- src/cmd.c | 1 + 1 file changed, 1 insertion(+) diff --git a/src/cmd.c b/src/cmd.c index dc155d6bf4c7..2f039f6852bb 100644 --- a/src/cmd.c +++ b/src/cmd.c @@ -1372,6 +1372,7 @@ struct ibv_flow *ibv_cmd_create_flow(struct ibv_qp *qp, flow_id-context = qp-context; flow_id-handle = resp.flow_handle; + flow_id-comp_mask = 0; return flow_id; err: free(flow_id); -- 1.9.3 -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH libibverbs 5/9] kern-abi: remove unused qp_type union in struct ibv_kern_send_wr
Added by commit c7e3e61052dd ('Add support for XRC QPs') in struct ibv_kern_send_wr, qp_type union is not initialized in ibv_cmd_post_send(). Additionnaly, as reported by pahole, this field introduces padding in the end of struct ibv_kern_send_wr: struct ibv_kern_send_wr { /* ... */ union { struct { __u32 remote_srqn; /*56 4 */ } xrc; /* 4 */ } qp_type; /*56 4 */ /* size: 64, cachelines: 1, members: 7 */ /* padding: 4 */ }; This padding is not initialized too. Since this field is not used at all kernel side and not exposed to drivers, qp_type union in struct ibv_kern_send_wr could be removed from the structure. Link: http://marc.info/?i=cover.1412163687.git.ydrone...@opteya.com Fixes: c7e3e61052dd ('Add support for XRC QPs') Cc: Sean Hefty sean.he...@intel.com Cc: Yishai Hadas yish...@mellanox.com Signed-off-by: Yann Droneaud ydrone...@opteya.com --- include/infiniband/kern-abi.h | 5 - 1 file changed, 5 deletions(-) diff --git a/include/infiniband/kern-abi.h b/include/infiniband/kern-abi.h index 91b45d837239..7023a0294df2 100644 --- a/include/infiniband/kern-abi.h +++ b/include/infiniband/kern-abi.h @@ -676,11 +676,6 @@ struct ibv_kern_send_wr { __u32 reserved; } ud; } wr; - union { - struct { - __u32 remote_srqn; - } xrc; - } qp_type; }; struct ibv_kern_eth_filter { -- 1.9.3 -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH libibverbs 7/9] ibv_cmd_create_flow(): explicitly clear fields
There's no need to clear the whole struct ibv_create_flow + struct ibv_kern_spec with memset(). This patch removes call to memset(), adds explicit initialization for reserved fields, set comp_mask explicitly Link: http://marc.info/?i=cover.1412163687.git.ydrone...@opteya.com Fixes: 389de6a6ef4e ('Add receive flow steering support') Cc: Hadar Hen Zion had...@mellanox.com Cc: Or Gerlitz ogerl...@mellanox.com Cc: Matan Barak mat...@mellanox.com Signed-off-by: Yann Droneaud ydrone...@opteya.com --- src/cmd.c | 5 - 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/src/cmd.c b/src/cmd.c index f9e4a03fd84d..dc155d6bf4c7 100644 --- a/src/cmd.c +++ b/src/cmd.c @@ -1289,6 +1289,7 @@ static int ib_spec_to_kern_spec(struct ibv_flow_spec *ib_spec, struct ibv_kern_spec *kern_spec) { kern_spec-hdr.type = ib_spec-hdr.type; + kern_spec-hdr.reserved = 0; switch (ib_spec-hdr.type) { case IBV_FLOW_SPEC_ETH: @@ -1337,15 +1338,17 @@ struct ibv_flow *ibv_cmd_create_flow(struct ibv_qp *qp, flow_id = malloc(sizeof(*flow_id)); if (!flow_id) return NULL; - memset(cmd, 0, cmd_size); + cmd-comp_mask = 0; cmd-qp_handle = qp-handle; + cmd-flow_attr.size = 0; cmd-flow_attr.type = flow_attr-type; cmd-flow_attr.priority = flow_attr-priority; cmd-flow_attr.num_of_specs = flow_attr-num_of_specs; cmd-flow_attr.port = flow_attr-port; cmd-flow_attr.flags = flow_attr-flags; + cmd-flow_attr.reserved[0] = cmd-flow_attr.reserved[1] = 1; kern_spec = cmd + 1; ib_spec = flow_attr + 1; -- 1.9.3 -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html